Open Chess Game Database Standard

Dann Corbit · Post by **Dann Corbit** » Wed Dec 01, 2021 5:32 pm

There is a benefit to leaving the move text.
Then if you create a list of EPD ID values, you already know how many elements there are, and you know the move that got you from position N to position N+1 without having to analyze anything. (Otherwise, you would need to do some diff based upon the EPD records to find out what the move was.)

dangi12012 · Post by **dangi12012** » Wed Dec 01, 2021 6:14 pm

Dann Corbit wrote: ↑Wed Dec 01, 2021 5:32 pm There is a benefit to leaving the move text.
Then if you create a list of EPD ID values, you already know how many elements there are, and you know the move that got you from position N to position N+1 without having to analyze anything. (Otherwise, you would need to do some diff based upon the EPD records to find out what the move was.)

yes you both convinced me. Having moves as text is good to have too!
I wanna query the moves!

Fulvio · Post by **Fulvio** » Wed Dec 01, 2021 7:43 pm

Dann Corbit wrote: ↑Tue Nov 30, 2021 7:37 pm BTW, if you are the Fulvio who works on SCID, that is very impressive work.

Yes, I am.

Dann Corbit wrote: ↑Tue Nov 30, 2021 7:37 pm Hats off.

Thanks

Sopel · Post by **Sopel** » Wed Dec 01, 2021 8:07 pm

Dann Corbit wrote: ↑Wed Dec 01, 2021 5:32 pm There is a benefit to leaving the move text.
Then if you create a list of EPD ID values, you already know how many elements there are, and you know the move that got you from position N to position N+1 without having to analyze anything. (Otherwise, you would need to do some diff based upon the EPD records to find out what the move was.)

You can achieve this and much more by storing a reverse move (<=27 bits in compressed form) alongside the position.

phhnguyen · Post by **phhnguyen** » Thu Dec 02, 2021 12:05 am

Position searching - first attempt

IMO, chess position search is mostly approximate search, and exact position matching is only a small part of it. However, some database developers pointed out that exact matching is still important: some of their database apps can find and show all games in a database that reach the position the user is currently viewing.

I have tried to find solutions for that feature. This time it is a simple one: matching the move list of the given game against all move lists stored in the database. First we create a string from the move list of the given game, then we match it against the strings in the Moves field (a text field in our databases). In SQL, we use the LIKE operator to match strings starting with the given one.

Code: Select all

 
SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Timer, ECO, PlyCount, FEN, Moves
FROM Games g
INNER JOIN Players w ON WhiteID = w.ID
INNER JOIN Players b ON BlackID = b.ID
WHERE g.Moves LIKE ?
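As a minimal sketch of the prefix match described above, the following Python snippet runs the same kind of LIKE query against a tiny in-memory SQLite table. The sample rows and the helper name `find_games_by_prefix` are assumptions made for illustration; only the Moves column and the LIKE operator come from the post.

```python
import sqlite3

def find_games_by_prefix(conn, move_prefix):
    """Return IDs of games whose Moves text starts with move_prefix."""
    # Appending '%' makes LIKE match any string that begins with the prefix.
    rows = conn.execute(
        "SELECT ID FROM Games WHERE Moves LIKE ?",
        (move_prefix + "%",),
    )
    return sorted(r[0] for r in rows)

# Hypothetical sample data: a cut-down Games table with only two columns.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Games (ID INTEGER PRIMARY KEY, Moves TEXT)")
conn.executemany(
    "INSERT INTO Games (ID, Moves) VALUES (?, ?)",
    [
        (1, "1.e4 e5 2.Nf3 Nc6"),
        (2, "1.e4 c5 2.Nf3 d6"),
        (3, "1.d4 d5 2.c4 e6"),
        (4, "1. e4 e5 2. Nf3 Nc6"),  # same line as game 1, different spacing
    ],
)

hits = find_games_by_prefix(conn, "1.e4 e5")
```

Here game 4 is not found even though it follows the same opening line as game 1, because its move counters are formatted differently. Note also that SQLite's LIKE is case-insensitive for ASCII letters by default, so differences in letter case alone may still match.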

There are some disadvantages to this method:
1. It is actually not position matching but opening-line matching. If the position is created from a FEN, this method can’t work since there are no moves
2. If a move list inside the database has comments, the matching may not work correctly
3. Any missing or additional characters in the move string may lead to a mismatch. E.g., rh4, rhh4, Rh4, Rh4?!, Rh4+ are all different strings
4. Similar to 2 and 3: the match may fail if the move list has extra spaces or different move counters or formats (say, 1.e4 or 1. e4 or 1)e4)

Advantages:
- simple; needs no extra space or computation when creating the database.

BTW, before going deeper in this direction, we need to implement it to see how good or bad it is, and benchmark it to see if it is fast enough.

I have created a new branch “movematching” (already pushed to GitHub). The benchmark function picks a game randomly from the test set and creates strings from its move list with ply from 1 to 19. For each one it extracts all matching records. The stats are below:

Code: Select all

ply: 1, #total queries: 1000, elapsed: 2106746 ms, 35:06, time per query: 2106 ms, total results: 1138015658, results/query: 1138015
ply: 2, #total queries: 1000, elapsed: 1392696 ms, 23:12, time per query: 1392 ms, total results: 359609282, results/query: 359609
ply: 3, #total queries: 1000, elapsed: 1429413 ms, 23:49, time per query: 1429 ms, total results: 196074472, results/query: 196074
ply: 4, #total queries: 1000, elapsed: 1028502 ms, 17:08, time per query: 1028 ms, total results: 20333422, results/query: 20333
ply: 5, #total queries: 1000, elapsed: 931156 ms, 15:31, time per query: 931 ms, total results: 2736649, results/query: 2736
ply: 6, #total queries: 1000, elapsed: 865683 ms, 14:25, time per query: 865 ms, total results: 1509023, results/query: 1509
ply: 7, #total queries: 1000, elapsed: 860749 ms, 14:20, time per query: 860 ms, total results: 611549, results/query: 611
ply: 8, #total queries: 1000, elapsed: 859847 ms, 14:19, time per query: 859 ms, total results: 415272, results/query: 415
ply: 9, #total queries: 1000, elapsed: 858435 ms, 14:18, time per query: 858 ms, total results: 113256, results/query: 113
ply: 10, #total queries: 1000, elapsed: 856611 ms, 14:16, time per query: 856 ms, total results: 53289, results/query: 53
ply: 11, #total queries: 1000, elapsed: 857798 ms, 14:17, time per query: 857 ms, total results: 31985, results/query: 31
ply: 12, #total queries: 1000, elapsed: 857788 ms, 14:17, time per query: 857 ms, total results: 21570, results/query: 21
ply: 13, #total queries: 1000, elapsed: 858671 ms, 14:18, time per query: 858 ms, total results: 11201, results/query: 11
ply: 14, #total queries: 1000, elapsed: 857259 ms, 14:17, time per query: 857 ms, total results: 2437, results/query: 2
ply: 15, #total queries: 1000, elapsed: 896078 ms, 14:56, time per query: 896 ms, total results: 981, results/query: 0
ply: 16, #total queries: 1000, elapsed: 1003631 ms, 16:43, time per query: 1003 ms, total results: 760, results/query: 0
ply: 17, #total queries: 1000, elapsed: 1010973 ms, 16:50, time per query: 1010 ms, total results: 0, results/query: 0
ply: 18, #total queries: 1000, elapsed: 1054751 ms, 17:34, time per query: 1054 ms, total results: 0, results/query: 0
ply: 19, #total queries: 1000, elapsed: 857467 ms, 14:17, time per query: 857 ms, total results: 0, results/query: 0

From the stats we can see:
- Ply 1: the average query time is about 2 s, mostly spent scanning and extracting the results, of which there are very many
- From ply 2: the average query time drops quickly
- From ply 17: there are no matching games

The method works! It can find many results. The query time is good too. For some “clever” SQL libraries such as Qt’s, it is more than enough, since records may start displaying when only a few results have been found, making the display almost instant.

Just some thoughts and plans about the above disadvantages:
(1) I don’t worry much about it since it is just a trade-off
(3)(4) Could be fixed easily by some cleaning and standardization code
(2)(3)(4) Could be solved by adding a new column of pure move text in coordinate notation without move counters and comments
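The kind of cleaning code mentioned in (3)(4) could look roughly like the sketch below. It is only an assumption about what such standardization might do: the function name `normalize_san` is hypothetical, it keeps the moves in SAN rather than converting them to coordinate notation, and it does not handle variations in parentheses or result tokens.

```python
import re

def normalize_san(movetext):
    """Reduce PGN-style move text to bare SAN moves separated by single spaces."""
    text = re.sub(r"\{[^}]*\}", " ", movetext)  # drop {brace comments}
    text = re.sub(r"\$\d+", " ", text)           # drop NAGs such as $1
    # Drop move counters like "1.", "1..." or "1)". The lookbehind keeps the
    # rank digit of a move (the 4 in "e4") from being treated as a counter.
    text = re.sub(r"(?<![A-Za-z])\d+\s*(?:\.+|\))", " ", text)
    text = re.sub(r"[!?+#]+", "", text)          # drop !, ?, check and mate marks
    return " ".join(text.split())                # collapse all whitespace

cleaned = normalize_san("1. e4 {best by test} e5 2.Nf3!? Nc6 3)Bb5+ a6 $1")
```

After this step, strings that differ only in spacing, counter style, comments, or annotation symbols compare equal, which is what the prefix match needs.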

dangi12012 · Post by **dangi12012** » Thu Dec 02, 2021 12:16 am

phhnguyen wrote: ↑Thu Dec 02, 2021 12:05 am Position searching - first attempt


Well, on performance I can say that SQLite supports indexes on expressions. I think string "LIKE" is not supported, but maybe you can take a look at it: if you fix the expression inside, it should be possible to query it fast.
So you can create an index on a+b and then

Code: Select all

select * from table where a+b == 3

will be lightning fast!
https://www.sqlite.org/expridx.html
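To make the expression-index idea concrete, here is a small sketch using Python's built-in sqlite3 module. The table name `t` and its contents are assumptions for illustration; the a+b expression is the one from the example above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (a INTEGER, b INTEGER)")
conn.executemany("INSERT INTO t VALUES (?, ?)", [(1, 2), (2, 2), (0, 3), (5, 5)])

# Index on the expression itself, not on a single column.
conn.execute("CREATE INDEX t_sum ON t (a + b)")

# Ask SQLite how it would run the query; the last field of each row
# is the human-readable plan step.
plan = conn.execute("EXPLAIN QUERY PLAN SELECT * FROM t WHERE a + b = 3").fetchall()
plan_text = " ".join(row[3] for row in plan)

rows = sorted(conn.execute("SELECT a, b FROM t WHERE a + b = 3").fetchall())
```

The plan text should name the index `t_sum`, meaning the lookup is an index search and not a full table scan. The query has to use the same expression as the index for this to apply.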

Yes, normally the GUI can already show the first results, since you get them instantly!

I didn't even think that PGN matching with "LIKE" would also work for looking up all of a player's games with that opening, but it works.
But I think a position table is still warranted, because there you can insert the evaluation too, and that's literally millions of CPU hours that Lichess gives out for free.

"Find player X's most played opening where he usually blunders"
That's the perfect query. And as everyone here can see, SQL is so generic that not much work is needed to support that.

Cool stuff - great work!

phhnguyen · Post by **phhnguyen** » Thu Dec 02, 2021 2:24 am

dangi12012 wrote: ↑Thu Dec 02, 2021 12:16 am Well on performance i can say that SQLITE supports index on expressions. So I think string "LIKE" is not supported but maybe you can take a look at it when you fix the expression inside it should be supportable to be queried fast.

You are right, the index system doesn’t work with LIKE. Actually, I have already tried creating an index on the Moves column. That almost doubled the database size (I think the strings in the Moves field are close to unique and take most of the space). However, the search speed was almost the same.
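One possible alternative, not tested in this thread, is to express the prefix match as a range comparison, which an ordinary index on a text column can serve. The sketch below assumes SQLite's default BINARY collation, a hypothetical cut-down Games table with a PureMoves-style column of clean move text, and a non-empty prefix; the helper `prefix_range` and the index name are made up for illustration.

```python
import sqlite3

def prefix_range(prefix):
    """Return (low, high) so that low <= s < high holds exactly for the
    strings s starting with prefix (assumes a non-empty prefix)."""
    # Bumping the last character by one code point gives the smallest
    # string that sorts after every string beginning with the prefix.
    return prefix, prefix[:-1] + chr(ord(prefix[-1]) + 1)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Games (ID INTEGER PRIMARY KEY, PureMoves TEXT)")
conn.execute("CREATE INDEX idx_puremoves ON Games (PureMoves)")
conn.executemany(
    "INSERT INTO Games (ID, PureMoves) VALUES (?, ?)",
    [
        (1, "e2e4 c7c6 d2d4 d7d5"),
        (2, "e2e4 c7c6 b1c3 d7d5"),
        (3, "e2e4 e7e5 g1f3"),
        (4, "d2d4 d7d5"),
    ],
)

query = "SELECT ID FROM Games WHERE PureMoves >= ? AND PureMoves < ?"
low, high = prefix_range("e2e4 c7c6")
ids = sorted(r[0] for r in conn.execute(query, (low, high)))

# The last field of each plan row describes the access path SQLite chose.
plan_rows = conn.execute("EXPLAIN QUERY PLAN " + query, (low, high)).fetchall()
plan_text = " ".join(r[3] for r in plan_rows)
```

In this toy case the plan names `idx_puremoves`, so the lookup is an index search. Whether this helps on the real database would need the same benchmark as above; the index still costs the extra disk space already observed.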

dangi12012 · Post by **dangi12012** » Thu Dec 02, 2021 3:40 am

phhnguyen wrote: ↑Thu Dec 02, 2021 2:24 am
dangi12012 wrote: ↑Thu Dec 02, 2021 12:16 am Well on performance i can say that SQLITE supports index on expressions. So I think string "LIKE" is not supported but maybe you can take a look at it when you fix the expression inside it should be supportable to be queried fast.
You are right, the index system doesn’t work with LIKE. Actually, I have tried already, creating an index for the column Moves. That almost doubled the size (I think strings in Moves field are close to unique and they take most of the space). However, the search speed was almost the same.

Yes. An index is literally like a std::map or dictionary, so it won't speed up string search.
For fast string search there is the whole topic of tries (not trees), very special data structures that are used in genetics. It has nothing to do with SQL, but look up "Ukkonen" (his algorithm builds suffix trees, a compressed form of suffix trie).

They find a pattern of length M in a text of length N in O(M), which is more than magic. So finding a 5-letter word in 80 GB of data depends on the 5 and not on the 80 GB. It could be 10 TB of random data and the speed would be identical.
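To illustrate the idea that lookup cost depends on the query length and not on the data size, here is a minimal sketch of a plain prefix trie keyed by moves. It is not Ukkonen's suffix tree, which also handles matches starting anywhere in the text; the class name `MoveTrie` and the sample games are assumptions for illustration.

```python
class MoveTrie:
    """Prefix trie over move lists; each node remembers the games passing through it."""

    def __init__(self):
        self.children = {}   # move string -> child node
        self.game_ids = []   # IDs of games whose move list reaches this node

    def insert(self, moves, game_id):
        node = self
        for move in moves:
            node = node.children.setdefault(move, MoveTrie())
            node.game_ids.append(game_id)

    def find(self, moves):
        # Walks one node per query move, regardless of how many games are stored.
        node = self
        for move in moves:
            node = node.children.get(move)
            if node is None:
                return []
        return node.game_ids

trie = MoveTrie()
trie.insert("e2e4 c7c6 d2d4 d7d5".split(), 1)
trie.insert("e2e4 c7c6 b1c3 d7d5".split(), 2)
trie.insert("e2e4 e7e5 g1f3".split(), 3)

caro_kann_games = trie.find(["e2e4", "c7c6"])
```

The trade-off is memory: storing a game ID at every node along every game is costly for millions of games, so a real implementation would likely store counts or ID ranges instead.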

I always thought I understood programming, but there are algorithms so advanced that experts need months or years to understand them fully.

phhnguyen · Post by **phhnguyen** » Fri Dec 03, 2021 6:31 am

Position searching - 2nd attempt

In the previous attempt, I tried to match a given game with games in a database by comparing their move strings in SAN. The strings stored in the database were extracted directly from PGN files. There is a high chance of mismatches when those strings are not “clean”, i.e. when they contain extra spaces, control characters, comments, evaluation symbols (such as !?)…

In this attempt, I am going to create “clean” move strings instead of using the original ones from PGN files. We will see how many more results we get and whether it is worth using later.

A clean move string for comparison purposes should:

- be in coordinate notation
- use only spaces to separate moves
- contain no comments, evaluation symbols, move counters, or control characters

Below is an example of a clean string of a move list:

Code: Select all

e2e4 c7c6 d2d4 d7d5 b1c3 d5e4 c3e4 b8d7 e4g5 d7f6 f1c4 g8h6 c2c3 g7g6 g1f3 f8g7 h2h4 f6d5 h4h5 f7f6 g5e4
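A quick way to check that a string meets these rules is a regular expression, as in the sketch below. The helper name `is_clean_move_string` is hypothetical, and the optional lowercase promotion letter (as in e7e8q) is an assumption about the exact format that the post does not specify.

```python
import re

# One move in coordinate notation: from-square, to-square, and an optional
# promotion piece letter (assumed lowercase here).
_MOVE = r"[a-h][1-8][a-h][1-8][qrbn]?"
_CLEAN_RE = re.compile(rf"{_MOVE}(?: {_MOVE})*")

def is_clean_move_string(text):
    """True if text is one or more coordinate moves separated by single spaces."""
    return _CLEAN_RE.fullmatch(text) is not None

example = (
    "e2e4 c7c6 d2d4 d7d5 b1c3 d5e4 c3e4 b8d7 e4g5 d7f6 f1c4 g8h6 "
    "c2c3 g7g6 g1f3 f8g7 h2h4 f6d5 h4h5 f7f6 g5e4"
)
example_is_clean = is_clean_move_string(example)
```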

To create those strings we need to parse the move lists of all games and store the results in a new column named PureMoves.

In the first release, the app already parsed all games and moves. It was quite slow: it completed 3.45 million games in 4 hours. In later releases, I removed that function and, with other improvements, the speed improved significantly. The fastest release can complete 3.45 million games in only half a minute.

Now we bring the game parser back. It has been rewritten and optimized for the task and should be a bit faster. I was curious, and a bit nervous, to see how far we have improved everything.

The stats below are from the app parsing all games/moves using only one thread:

Code: Select all

#games: 228759, #errors: 0, elapsed: 54814ms 00:54, speed: 4173 games/s
#games: 444597, #errors: 0, elapsed: 106716ms 01:46, speed: 4166 games/s
#games: 660515, #errors: 0, elapsed: 160834ms 02:40, speed: 4106 games/s
#games: 876300, #errors: 0, elapsed: 215195ms 03:35, speed: 4072 games/s
#games: 1085040, #errors: 0, elapsed: 270245ms 04:30, speed: 4015 games/s
#games: 1282674, #errors: 0, elapsed: 322165ms 05:22, speed: 3981 games/s
#games: 1481405, #errors: 0, elapsed: 374305ms 06:14, speed: 3957 games/s
#games: 1682141, #errors: 0, elapsed: 425995ms 07:05, speed: 3948 games/s
#games: 1879017, #errors: 0, elapsed: 476195ms 07:56, speed: 3945 games/s
#games: 2070724, #errors: 0, elapsed: 525976ms 08:45, speed: 3936 games/s
#games: 2263769, #errors: 0, elapsed: 575856ms 09:35, speed: 3931 games/s
#games: 2454219, #errors: 0, elapsed: 626737ms 10:26, speed: 3915 games/s
#games: 2643194, #errors: 0, elapsed: 680266ms 11:20, speed: 3885 games/s
#games: 2831594, #errors: 0, elapsed: 733589ms 12:13, speed: 3859 games/s
#games: 2994631, #errors: 231, elapsed: 783895ms 13:03, speed: 3820 games/s
#games: 3142207, #errors: 308, elapsed: 829420ms 13:49, speed: 3788 games/s
#games: 3298187, #errors: 370, elapsed: 874262ms 14:34, speed: 3772 games/s
#games: 3457050, #errors: 372, elapsed: 915443ms 15:15, speed: 3776 games/s

15 minutes! Compared with the first release, it is much faster. However, compared with the fastest previous release, it is about 30 times slower (15 minutes vs. half a minute).

I will use more threads to speed it up. To manage them I use thread-pool, a library from https://github.com/bshoshany/thread-pool. More threads make game parsing faster. However, both reading from the PGN file and writing to the database must be done sequentially and cannot be multithreaded, so they are the performance bottlenecks.
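The structure described here (sequential reading, parallel parsing, sequential writing) can be sketched as follows. This is an illustration in Python with the standard library's thread pool, not the C++ thread-pool library used in the app; `parse_game` is a hypothetical stand-in for the real parser, and a plain list stands in for the database. In CPython, threads will not actually speed up CPU-bound parsing, so the sketch shows only the shape of the pipeline.

```python
from concurrent.futures import ThreadPoolExecutor

def parse_game(raw_moves):
    # Hypothetical stand-in for the real game parser: here it only
    # collapses whitespace in the move text.
    return " ".join(raw_moves.split())

def build_records(raw_games, workers=4):
    """Parse games in a thread pool, then 'write' the results one by one."""
    written = []  # stands in for sequential INSERTs into the database
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # pool.map hands games to worker threads but yields the results in
        # input order, so the writer stays a simple sequential loop.
        for game_id, pure_moves in enumerate(pool.map(parse_game, raw_games), start=1):
            written.append((game_id, pure_moves))
    return written

records = build_records(["e2e4   c7c6", " d2d4 d7d5 ", "g1f3\td7d5"])
```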

Generate:

Code: Select all

#games: 217108, #errors: 0, elapsed: 17496ms 00:17, speed: 12409 games/s
#games: 432560, #errors: 0, elapsed: 35438ms 00:35, speed: 12206 games/s
#games: 648542, #errors: 0, elapsed: 53405ms 00:53, speed: 12143 games/s
#games: 864464, #errors: 0, elapsed: 71809ms 01:11, speed: 12038 games/s
#games: 1074215, #errors: 0, elapsed: 89535ms 01:29, speed: 11997 games/s
#games: 1272050, #errors: 0, elapsed: 107175ms 01:47, speed: 11868 games/s
#games: 1470104, #errors: 0, elapsed: 124909ms 02:04, speed: 11769 games/s
#games: 1670976, #errors: 0, elapsed: 142621ms 02:22, speed: 11716 games/s
#games: 1868101, #errors: 0, elapsed: 160618ms 02:40, speed: 11630 games/s
#games: 2060194, #errors: 0, elapsed: 178299ms 02:58, speed: 11554 games/s
#games: 2253086, #errors: 0, elapsed: 197557ms 03:17, speed: 11404 games/s
#games: 2443930, #errors: 0, elapsed: 220736ms 03:40, speed: 11071 games/s
#games: 2632778, #errors: 0, elapsed: 238061ms 03:58, speed: 11059 games/s
#games: 2820954, #errors: 0, elapsed: 260790ms 04:20, speed: 10816 games/s
#games: 2986337, #errors: 231, elapsed: 284644ms 04:44, speed: 10491 games/s
#games: 3133800, #errors: 271, elapsed: 300659ms 05:00, speed: 10423 games/s
#games: 3290579, #errors: 370, elapsed: 319906ms 05:19, speed: 10286 games/s
#games: 3456399, #errors: 372, elapsed: 335739ms 05:35, speed: 10294 games/s

The app used all available cores on my old computer (half of them are hyper-threaded ones) and was about 3 times faster.

Fig. Table Games With PureMoves field

The size increased significantly, from 2.08 GB to 3.79 GB (182% of the previous size).

Now it is time to test searching using those clean strings (the PureMoves column):

Code: Select all

ply: 1, #total queries: 1000, elapsed: 2860996 ms, 47:40, time per query: 2860 ms, total results: 1138015181, results/query: 1138015
ply: 2, #total queries: 1000, elapsed: 1921188 ms, 32:01, time per query: 1921 ms, total results: 359609282, results/query: 359609
ply: 3, #total queries: 1000, elapsed: 1944492 ms, 32:24, time per query: 1944 ms, total results: 196074472, results/query: 196074
ply: 4, #total queries: 1000, elapsed: 1387647 ms, 23:07, time per query: 1387 ms, total results: 20333422, results/query: 20333
ply: 5, #total queries: 1000, elapsed: 1367347 ms, 22:47, time per query: 1367 ms, total results: 2736649, results/query: 2736
ply: 6, #total queries: 1000, elapsed: 1343154 ms, 22:23, time per query: 1343 ms, total results: 1509023, results/query: 1509
ply: 7, #total queries: 1000, elapsed: 1354247 ms, 22:34, time per query: 1354 ms, total results: 611549, results/query: 611
ply: 8, #total queries: 1000, elapsed: 1338783 ms, 22:18, time per query: 1338 ms, total results: 415272, results/query: 415
ply: 9, #total queries: 1000, elapsed: 1338512 ms, 22:18, time per query: 1338 ms, total results: 113256, results/query: 113
ply: 10, #total queries: 1000, elapsed: 1352603 ms, 22:32, time per query: 1352 ms, total results: 53289, results/query: 53
ply: 11, #total queries: 1000, elapsed: 1345601 ms, 22:25, time per query: 1345 ms, total results: 31985, results/query: 31
ply: 12, #total queries: 1000, elapsed: 1380740 ms, 23:00, time per query: 1380 ms, total results: 21570, results/query: 21
ply: 13, #total queries: 1000, elapsed: 1336493 ms, 22:16, time per query: 1336 ms, total results: 11201, results/query: 11
ply: 14, #total queries: 1000, elapsed: 1336809 ms, 22:16, time per query: 1336 ms, total results: 2437, results/query: 2
ply: 15, #total queries: 1000, elapsed: 1338856 ms, 22:18, time per query: 1338 ms, total results: 1243, results/query: 1
ply: 16, #total queries: 1000, elapsed: 1336510 ms, 22:16, time per query: 1336 ms, total results: 1253, results/query: 1
ply: 17, #total queries: 1000, elapsed: 1336637 ms, 22:16, time per query: 1336 ms, total results: 1260, results/query: 1
ply: 18, #total queries: 1000, elapsed: 1350461 ms, 22:30, time per query: 1350 ms, total results: 1000, results/query: 1
ply: 19, #total queries: 1000, elapsed: 1336898 ms, 22:16, time per query: 1336 ms, total results: 1000, results/query: 1
ply: 20, #total queries: 1000, elapsed: 1704646 ms, 28:24, time per query: 1704 ms, total results: 1000, results/query: 1
ply: 21, #total queries: 1000, elapsed: 1731275 ms, 28:51, time per query: 1731 ms, total results: 1000, results/query: 1
ply: 22, #total queries: 1000, elapsed: 1401201 ms, 23:21, time per query: 1401 ms, total results: 1000, results/query: 1
ply: 23, #total queries: 1000, elapsed: 1520770 ms, 25:20, time per query: 1520 ms, total results: 1000, results/query: 1
ply: 24, #total queries: 1000, elapsed: 1411787 ms, 23:31, time per query: 1411 ms, total results: 1000, results/query: 1

Compare:

- It works slower: I don’t know the reason yet. Actually, I expected it to work faster. However, the difference is not significant, so I accept it as it is
- There are more results, but not significantly more. Interestingly, it still gets hits when the ply is over 17, where the previous attempt got nothing. Clearly, this attempt can always find our test samples at any ply, which the previous attempt could not

All new code has already been pushed to a new branch “cleanmovematching”.

mvanthoor · Post by **mvanthoor** » Fri Dec 03, 2021 11:07 am

You'll probably already know this, but if there is _any_ way to avoid searching in the database for strings, and especially with LIKE, then do so. I've noticed at work (in some old legacy software) that replacing WHERE ... LIKE '%...%' with WHERE ... = 'actual_string' (where possible) made a HUGE impact on search speed. When the string search could be factored out completely, the impact became larger still, but dropping the LIKE takes the cake for speed.
