Open Chess Game Database Standard

dangi12012 · Post by **dangi12012** » Fri Dec 03, 2021 2:22 pm

phhnguyen wrote: ↑Fri Dec 03, 2021 6:31 am I will use more threads to speed up. To manage them I use thread-pool a library from https://github.com/bshoshany/thread-pool. More threads can help game parsing faster. However, both reading from the PGN file and writing (to the database) should be in sequence, cannot be multithreading, thus they are bottlenecks of the performance.

I can add here that you can have a serial parser and a threadsafe blockingcollection with 64 entries or so where you just put your strings onto. When the buffer is full the singlethread parser will block until a slot is free again.

But on the other side of that collection you can have a Threadpool consuming the items with concurrent calls to pop().
Each thread parses the pgn and inserts it into the db. SQLite is threadsafe - but your bound insert parameters would need to be marked threadstatic.

This is the producer consumer pattern. 1 Producer -> 32 Consumers
This is also the way to parse a file in parallel.

Disclaimer:
Premature optimisation is evil

Dann Corbit · Post by **Dann Corbit** » Mon Dec 06, 2021 11:33 pm

Consider this position from a collection of games in SCID:
[fen]rnbq1rk1/ppp2pbp/3p2p1/4p3/2PPP1n1/2N1BN2/PP2BPPP/R2QK2R w KQ -[/fen]
Using a sequence of identical moves won't find most of the matches.
There are 8,992 games in my database that reach that position.
Here is a demonstration of how likely permutations are:
There were 286 move orders reaching this position. The top 8 are:

Code: Select all

  1:  1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.Nf3 O-O 6.Be2 e5 7.Be3 Ng4 (3039)
  2:  1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.Be2 O-O 6.Nf3 e5 7.Be3 Ng4 (901)
  3:  1.Nf3 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.d4 O-O 6.Be2 e5 7.Be3 Ng4 (785)
  4:  1.d4 Nf6 2.Nf3 g6 3.c4 Bg7 4.Nc3 O-O 5.e4 d6 6.Be2 e5 7.Be3 Ng4 (683)
  5:  1.d4 d6 2.Nf3 Nf6 3.c4 g6 4.Nc3 Bg7 5.e4 O-O 6.Be2 e5 7.Be3 Ng4 (345)
  6:  1.d4 Nf6 2.c4 g6 3.Nf3 Bg7 4.Nc3 O-O 5.e4 d6 6.Be2 e5 7.Be3 Ng4 (237)
  7:  1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 O-O 5.Nf3 d6 6.Be2 e5 7.Be3 Ng4 (229)
  8:  1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.Nf3 O-O 5.e4 d6 6.Be2 e5 7.Be3 Ng4 (224)

If you do not store an EPD or FEN position list, you will only find a tiny fraction of the matching games.
In this case, the most popular move order would have located 286/8992=3% of the games and any other move order would have done much worse.

dangi12012 · Post by **dangi12012** » Tue Dec 07, 2021 12:58 am

Dann Corbit wrote: ↑Mon Dec 06, 2021 11:33 pm Consider this position from a collection of games in SCID:
[fen]rnbq1rk1/ppp2pbp/3p2p1/4p3/2PPP1n1/2N1BN2/PP2BPPP/R2QK2R w KQ -[/fen]
Using a sequence of identical moves won't find most of the matches.
There are 8,992 games in my database that reach that position.
Here is a demonstration of how likely permutations are:
There were 286 move orders reaching this position. The top 8 are:
Code: Select all
  1:  1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.Nf3 O-O 6.Be2 e5 7.Be3 Ng4 (3039)
  2:  1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.Be2 O-O 6.Nf3 e5 7.Be3 Ng4 (901)
  3:  1.Nf3 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 d6 5.d4 O-O 6.Be2 e5 7.Be3 Ng4 (785)
  4:  1.d4 Nf6 2.Nf3 g6 3.c4 Bg7 4.Nc3 O-O 5.e4 d6 6.Be2 e5 7.Be3 Ng4 (683)
  5:  1.d4 d6 2.Nf3 Nf6 3.c4 g6 4.Nc3 Bg7 5.e4 O-O 6.Be2 e5 7.Be3 Ng4 (345)
  6:  1.d4 Nf6 2.c4 g6 3.Nf3 Bg7 4.Nc3 O-O 5.e4 d6 6.Be2 e5 7.Be3 Ng4 (237)
  7:  1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.e4 O-O 5.Nf3 d6 6.Be2 e5 7.Be3 Ng4 (229)
  8:  1.d4 Nf6 2.c4 g6 3.Nc3 Bg7 4.Nf3 O-O 5.e4 d6 6.Be2 e5 7.Be3 Ng4 (224)
If you do not store an EPD or FEN position list, you will only find a tiny fraction of the matching games.
In this case, the most popular move order would have located 286/8992=3% of the games and any other move order would have done much worse.

Yes a 192 bit compressed bitboard - or two 64 bit hashes per position would be great. Again Positions would be unique so for the 8x example above its only 1x. The actual moveid will take 48bit per actual move.

dangi12012 · Post by **dangi12012** » Wed Dec 08, 2021 7:05 pm

New idea for a query:

Average Centipawn Loss per second thinktime. Above 2000 Elo only.

This would make a good 3d graph about how thinking time correlates with the accuracy and that over elo.
My guess is that the standard deviation becomes narrower and the cp loss per move becomes lower.

https://i.pinimg.com/originals/46/5e/84 ... d66705.png

So it would be great to have TIME tags in each move as well as EVAL tags in each position so this becomes answerable in the literally 2 minutes it takes to translate that sentence into sql.

phhnguyen · Post by **phhnguyen** » Thu Dec 09, 2021 2:18 pm

Position searching: 3rd attempt

Some people have mentioned using hash keys as a solution for position searching. This time that idea should be given a try.

Creating database

Hash keys of all chess positions will be stored in a new table HashPositions with their game IDs and plies.

Code: Select all

CREATE TABLE HashPositions (ID INTEGER PRIMARY KEY, GameID INTEGER, AtPly INTEGER, HashKey INTEGER)

The stats when creating database as the below:

Code: Select all

Thread count: 4
#games: 228742, #errors: 0, #hashKeys: 17353710, elapsed: 56401ms 00:56, speed: 4055 games/s
#games: 444564, #errors: 0, #hashKeys: 33670584, elapsed: 107830ms 01:47, speed: 4122 games/s
#games: 660467, #errors: 0, #hashKeys: 50006190, elapsed: 160091ms 02:40, speed: 4125 games/s
#games: 876236, #errors: 0, #hashKeys: 66314397, elapsed: 212072ms 03:32, speed: 4131 games/s
#games: 1084961, #errors: 0, #hashKeys: 82413379, elapsed: 263623ms 04:23, speed: 4115 games/s
#games: 1282580, #errors: 0, #hashKeys: 98234700, elapsed: 312420ms 05:12, speed: 4105 games/s
#games: 1481295, #errors: 0, #hashKeys: 114086982, elapsed: 362654ms 06:02, speed: 4084 games/s
#games: 1682015, #errors: 0, #hashKeys: 129877265, elapsed: 409788ms 06:49, speed: 4104 games/s
#games: 1878876, #errors: 0, #hashKeys: 145665804, elapsed: 457185ms 07:37, speed: 4109 games/s
#games: 2070568, #errors: 0, #hashKeys: 161468086, elapsed: 504432ms 08:24, speed: 4104 games/s
#games: 2263597, #errors: 0, #hashKeys: 177174148, elapsed: 552428ms 09:12, speed: 4097 games/s
#games: 2454031, #errors: 0, #hashKeys: 192886927, elapsed: 602850ms 10:02, speed: 4070 games/s
#games: 2642990, #errors: 0, #hashKeys: 208655313, elapsed: 652779ms 10:52, speed: 4048 games/s
#games: 2831375, #errors: 0, #hashKeys: 224407558, elapsed: 704457ms 11:44, speed: 4019 games/s
#games: 2994165, #errors: 231, #hashKeys: 239157678, elapsed: 755334ms 12:35, speed: 3964 games/s
#games: 3141649, #errors: 308, #hashKeys: 252700404, elapsed: 799287ms 13:19, speed: 3930 games/s
#games: 3297553, #errors: 370, #hashKeys: 265944625, elapsed: 842684ms 14:02, speed: 3913 games/s
#games: 3456399, #errors: 372, #hashKeys: 277802645, elapsed: 883179ms 14:43, speed: 3913 games/s
Completed!

There are a total of over 277 million hash keys (equal to the total of all moves). Perhaps, it is too large for a table! They make the database file size increases from 2.08 to 8.87 GB, 4.3 times larger.

There is an interesting issue. My computer has 4 main (physical) and 4 virtual (hyper-threading) cores. In the previous attempt, the creating speed was increased clearly when I change from using 4 to 8 threads. However, this time, the speed is decreased more than 2 times. I am not sure the reason yet since the code is almost the same (with the previous attempt), no more computing. Perhaps it was caused by overflowing SQLite.

Searching

Below is the statement to search games with a hash key of the given position:

Code: Select all

        SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Timer, ECO, PlyCount, FEN, Moves, Hash
        FROM Games g
        INNER JOIN Players w ON WhiteID = w.ID
        INNER JOIN Players b ON BlackID = b.ID
        INNER JOIN HashPositions h ON g.ID = h.GameID
        WHERE h.HashKey = ?

I have run the app with the benchmark function, waited for a very long time, and see… nothing! After verifying, I realized the query took too much time. Below are the stats with the number of tests reduced a lot:

Code: Select all

ply: 1, #total queries: 20, elapsed: 4489259 ms, 1:14:49, time per query: 224462 ms, total results: 69014820, results/query: 3450741
ply: 2, #total queries: 20, elapsed: 3874209 ms, 1:04:34, time per query: 193710 ms, total results: 22869993, results/query: 1143499
ply: 3, #total queries: 20, elapsed: 3775149 ms, 1:02:55, time per query: 188757 ms, total results: 8906460, results/query: 445323
ply: 4, #total queries: 20, elapsed: 3766566 ms, 1:02:46, time per query: 188328 ms, total results: 5169930, results/query: 258496
ply: 5, #total queries: 20, elapsed: 3448294 ms, 57:28, time per query: 172414 ms, total results: 788888, results/query: 39444
ply: 6, #total queries: 20, elapsed: 3320534 ms, 55:20, time per query: 166026 ms, total results: 96279, results/query: 4813
ply: 7, #total queries: 20, elapsed: 3395092 ms, 56:35, time per query: 169754 ms, total results: 150744, results/query: 7537

I stopped the test here since I run out of the patient. A query may take 224 seconds (3.7 minutes). That speed is painfully slow! I think the reasons it took too much time are:
- it has to scan all hash keys (277 million)
- SQLite performs the search on the hard disk instead of in memory

I have tried to create the index for the field HashKey. The process took over 10 minutes and the file size increased to 13.74 GB.

The search stats are the below:

Code: Select all

ply: 1, #total queries: 1000, elapsed: 156168774 ms, 43:22:48, time per query: 156168 ms, total results: 1137429724, results/query: 1137429
ply: 2, #total queries: 1000, elapsed: 88930814 ms, 24:42:10, time per query: 88930 ms, total results: 359591200, results/query: 359591
ply: 3, #total queries: 1000, elapsed: 77368156 ms, 21:29:28, time per query: 77368 ms, total results: 199681330, results/query: 199681
ply: 4, #total queries: 1000, elapsed: 7232008 ms, 2:00:32, time per query: 7232 ms, total results: 37934732, results/query: 37934
ply: 5, #total queries: 1000, elapsed: 46667 ms, 00:46, time per query: 46 ms, total results: 5547110, results/query: 5547
ply: 6, #total queries: 1000, elapsed: 87649 ms, 01:27, time per query: 87 ms, total results: 6512286, results/query: 6512
ply: 7, #total queries: 1000, elapsed: 28355 ms, 00:28, time per query: 28 ms, total results: 2446515, results/query: 2446
ply: 8, #total queries: 1000, elapsed: 262181 ms, 04:22, time per query: 262 ms, total results: 5303523, results/query: 5303
ply: 9, #total queries: 1000, elapsed: 18622 ms, 00:18, time per query: 18 ms, total results: 2411908, results/query: 2411
ply: 10, #total queries: 1000, elapsed: 13884 ms, 00:13, time per query: 13 ms, total results: 2077677, results/query: 2077
ply: 11, #total queries: 1000, elapsed: 3335 ms, 00:03, time per query: 3 ms, total results: 420394, results/query: 420
ply: 12, #total queries: 1000, elapsed: 2306 ms, 00:02, time per query: 2 ms, total results: 283109, results/query: 283
ply: 13, #total queries: 1000, elapsed: 2114 ms, 00:02, time per query: 2 ms, total results: 238831, results/query: 238
ply: 14, #total queries: 1000, elapsed: 1498 ms, 00:01, time per query: 1 ms, total results: 145742, results/query: 145
ply: 15, #total queries: 1000, elapsed: 236 ms, 00:00, time per query: 0 ms, total results: 10677, results/query: 10
ply: 16, #total queries: 1000, elapsed: 838 ms, 00:00, time per query: 0 ms, total results: 36655, results/query: 36
ply: 17, #total queries: 1000, elapsed: 143 ms, 00:00, time per query: 0 ms, total results: 11374, results/query: 11
ply: 18, #total queries: 1000, elapsed: 111 ms, 00:00, time per query: 0 ms, total results: 8662, results/query: 8
ply: 19, #total queries: 1000, elapsed: 174 ms, 00:00, time per query: 0 ms, total results: 9301, results/query: 9
ply: 20, #total queries: 1000, elapsed: 142 ms, 00:00, time per query: 0 ms, total results: 4397, results/query: 4
ply: 21, #total queries: 1000, elapsed: 246 ms, 00:00, time per query: 0 ms, total results: 7290, results/query: 7
ply: 22, #total queries: 1000, elapsed: 94 ms, 00:00, time per query: 0 ms, total results: 4971, results/query: 4
ply: 23, #total queries: 1000, elapsed: 66 ms, 00:00, time per query: 0 ms, total results: 977, results/query: 0
ply: 24, #total queries: 1000, elapsed: 86 ms, 00:00, time per query: 0 ms, total results: 728, results/query: 0
ply: 25, #total queries: 1000, elapsed: 30 ms, 00:00, time per query: 0 ms, total results: 762, results/query: 0
ply: 26, #total queries: 1000, elapsed: 36 ms, 00:00, time per query: 0 ms, total results: 748, results/query: 0
ply: 27, #total queries: 1000, elapsed: 85 ms, 00:00, time per query: 0 ms, total results: 772, results/query: 0
ply: 28, #total queries: 1000, elapsed: 55 ms, 00:00, time per query: 0 ms, total results: 765, results/query: 0
ply: 29, #total queries: 1000, elapsed: 53 ms, 00:00, time per query: 0 ms, total results: 751, results/query: 0

I have run my computer near 4 days for the above stats. We may have some conclusions:
- The average search time per query is shorter than ones without index and reduce quicker by plies
- From 1-4 plies, the search time is slow, but from ply 5, they are so quick

Compare with the previous attempt:
- For the first two plies, the numbers of results are quite closed, the previous ones are a little bit larger (it is a myth because theory they should be smaller)
- From 3rd plies, the new attempt has more results than the previous ones and the gaps become significantly larger by plies

Code has been pushed with a new branch called "searchwithhashkeys".

dangi12012 · Post by **dangi12012** » Thu Dec 09, 2021 5:22 pm

phhnguyen wrote: ↑Thu Dec 09, 2021 2:18 pm Code has been pushed with a new branch called "searchwithhashkeys".

looks good for me after ply5! Thank you for implementing it looks promising! Yes its mandatory to have a index on something you want to lookup fast. When the resultset has 1137429724 hits - there is nothing sql can do to speed that up. You get around 1137429/150 = 7500 results/s for ply0 which is around what can be expected for something thats indexed but so often duplicated.
You see that at ply5 you get to the speed of 5547/0.05 = 110940 results/s which is much better.

I think still that you didnt implement the idea fully? You would lookup by gameid which is a running number. Only for insert you need the hash to know what is and isnt already a position. Or is the Hash 64 bit? then scratch the sentence. Then its the same.

why the inner joins? This should suffice?

Code: Select all

        SELECT *
        FROM Games g
        WHERE h.HashKey = xxx

You know I have a modern gui in the works - i might just be the first to use the new standard

I would really like to see time and eval as well. That would be all data that lichess gives us.

phhnguyen · Post by **phhnguyen** » Thu Dec 09, 2021 10:38 pm

dangi12012 wrote: ↑Thu Dec 09, 2021 5:22 pm I think still that you didnt implement the idea fully? You would lookup by gameid which is a running number. Only for insert you need the hash to know what is and isnt already a position. Or is the Hash 64 bit? then scratch the sentence. Then its the same.

Which idea?

As I have mentioned, this attempt is to try the idea of using hash keys for matching exactly positions. No more no less. Hash keys are "standard" one, 64 bit. They are actually Polyglot ones (similar to Polyglot opening books).

dangi12012 wrote: ↑Thu Dec 09, 2021 5:22 pm why the inner joins? This should suffice?
Code: Select all
        SELECT *
        FROM Games g
        WHERE h.HashKey = xxx

We have several tables. A search should return enough information, combining from some tables, thus it should use INNER JOIN.

dangi12012 wrote: ↑Thu Dec 09, 2021 5:22 pm You know I have a modern gui in the works - i might just be the first to use the new standard
I would really like to see time and eval as well. That would be all data that lichess gives us.

It is a big surprise! Writing GUI is a huge challenge task (from my view) and there are not many developers in that field. Congrats! When will you release it? Love to see someone starts supporting this standard!

Sopel · Post by **Sopel** » Fri Dec 10, 2021 1:26 am

phhnguyen wrote: ↑Thu Dec 09, 2021 2:18 pm I stopped the test here since I run out of the patient. A query may take 224 seconds (3.7 minutes). That speed is painfully slow! I think the reasons it took too much time are:
- it has to scan all hash keys (277 million)
- SQLite performs the search on the hard disk instead of in memory

I have tried to create the index for the field HashKey. The process took over 10 minutes and the file size increased to 13.74 GB.

So about 50GB/1B positions. That's about 5x more than my database that allows WDL queries for each position (and uses 72 bit hashes and resolves transpositions). Overall I'm surprised it's only 5x, it might be a workable tradeoff for some use-cases. I suggest trying to drop the primary key on that table as it doesn't seem important (I mean, it might seem interesting to be able to address a position, but the positions itself are not really interesting in itself, and most queries will join them with other tables anyway). I'd also suggest looking at BRIN indexes, and whether they are supported by the implementation you're using. They should do very well for index by hash in this case. You can also try creating a clustered index on the hashes instead of the primary keys (would speed up the queries with many results) (and as I said, completely give up on the primary key here)

amanjpro · Post by **amanjpro** » Fri Dec 10, 2021 4:21 am

Probably a sort of table sharding will also speedup querying. Postgres IIRC has partition by on relations, one can use that to make your queries faster

https://www.postgresql.org/docs/10/ddl- ... oning.html

phhnguyen · Post by **phhnguyen** » Fri Dec 10, 2021 4:36 am

Sopel wrote: ↑Fri Dec 10, 2021 1:26 am
phhnguyen wrote: ↑Thu Dec 09, 2021 2:18 pm I stopped the test here since I run out of the patient. A query may take 224 seconds (3.7 minutes). That speed is painfully slow! I think the reasons it took too much time are:
- it has to scan all hash keys (277 million)
- SQLite performs the search on the hard disk instead of in memory

I have tried to create the index for the field HashKey. The process took over 10 minutes and the file size increased to 13.74 GB.
So about 50GB/1B positions. That's about 5x more than my database that allows WDL queries for each position (and uses 72 bit hashes and resolves transpositions). Overall I'm surprised it's only 5x, it might be a workable tradeoff for some use-cases. I suggest trying to drop the primary key on that table as it doesn't seem important (I mean, it might seem interesting to be able to address a position, but the positions itself are not really interesting in itself, and most queries will join them with other tables anyway).

Size is not my concern at the moment since we have been finding solutions (not found nor fixed yet). On the other hand, omitting the primary key is not easy for SQLite (it requires RowID or another primary key column).

Sopel wrote: ↑Fri Dec 10, 2021 1:26 am I'd also suggest looking at BRIN indexes, and whether they are supported by the implementation you're using. They should do very well for index by hash in this case. You can also try creating a clustered index on the hashes instead of the primary keys (would speed up the queries with many results) (and as I said, completely give up on the primary key here)

The problem is that SQLite doesn't support BRIN indexes (AFAIK).

Open Chess Game Database Standard

Re: Position searching - 2nd attempt

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard

Re: Open Chess Game Database Standard