Open Chess Game Database Standard

Discussion of chess software programming and technical issues.

Moderator: Ras

Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Open Chess Game Database Standard

Post by Sopel »

Fulvio wrote: Thu Dec 30, 2021 6:18 pm
Sopel wrote: Thu Dec 30, 2021 5:34 pm edit. okay ,correction, now I saw there's a "lichess" tab, still there's about 200M games "only"
There is a "gear" icon which let's you select the games by time control, average rating and date.
If you select everything there are 383,251,358 games
I see, thanks.
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
ndbd
Posts: 9
Joined: Mon Jan 11, 2021 11:15 am
Full name: Roger S.

Re: Open Chess Game Database Standard

Post by ndbd »

In chess one stores games as sequences of moves. However most of the thread deals with searches for particular positions, for example specific opening positions or partial positions with certain patterns - like a fiancetto bishop, a certain kind of rook ending etc.

There are basially only two ways to deal with that. Either
a) store each position (maybe in a way that you can search for partial positions), or
b) simply go through all stored move sequences, re-create the position and check if it matches your search pattern

The disadvantage of a) is that it takes a lot of space. That's no issue if you run this in a client-server model, where you just equip the server with enough space and memory. For a client-only approach it's not very desireable due to the sheer size.

The disadvantage of b) is that it is potentially very slow. However as ChessBase, SCID and Chess Assistant have shown, it works out pretty well if you store some particular information about each game which can be used to rule out most of the games for typical search queries, and therefore have to inspect only a fraction of all games - which significantly improves speed.

The problem I infer from this thread is that this approach cannot be easily (or can it?) re-constructed using SQL, namely

matching games-list = []
while nextgame:
_if not some condition to rule out the game matches:
___for each position of the game
______if the position matches, add to matching games-list


That's why SQL is _not_ the industry standard for chess databases. I am looking forward to what HCE has done once it is released. Depending on what they offer, they maybe have solved all these issues (I wonder how).

But I would encourage you to take a users perspective.

If I am an amateur I probably don't need a chess database anyway.

If I am a serious tournament player, please convince me why I shouold not shell out the $300 for a commercial database program. In any event I would need a decent database like Big or Mega which is $200. Which I cannot use with a free software program, because none can import the proprietary format.

So next best thing is SCID or one of the derivatives for which one of the free databases is < 1 GB, or this approach which is vastly larger. So instead I should download a 18 GB (maybe zip-compressed) database file with which I can search slower than any of the above.

If it's really about just exchanging games between programs, then there is (zip-compressed) PGN, which I can import into any database program.
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Position searching: 8th attempt - all calculated on the fly

Post by phhnguyen »

Position searching: 8th attempt - all calculated on the fly

This attempt is inspired by the information from discussions in this topic about SCID has calculated almost all things on the fly for position searching, without pre-calculating and storing those data as we did in previous attempts (Attempts 5, 6, 7). This attempt tries to answer consequent/curious questions if we could do the same and how good/bad it is.

In previous works, we measured a lot of parameters, including all kinds of speed we could do. However, some information was still not clear since they weren’t independently measured but with other things (such as creating/storing with SQLite databases). Thus it is better we do and re-measure everything from zero levels. SQL is a new and slow factor in this project. In this attempt, we put it aside to measure some basic features. For all tests, we don’t store any move or game in any form. All information is discarded immediately after passing each game.

1. Tests

Test 1
In this test, we measured how fast to read a large PGN file. We use a PGN file of 2.42 GB (3.45 million games) as the main data to test. We will read that data in multi-blocks without processing anything.

At the first glance, it should be an easy test: just read all data and measure the elapsed periods. However, we quickly realized there’s a “funny” factor involved. The system (mine is a macOS) is quite clever and it may catch some huge parts or even whole data after our first read. That made the elapsed time be quickly reduced to almost zero. When measuring it with some time-consuming tasks, the differences are just small proportions and could be ignored but could be seen clearly in this test. Btw, below are some stats:

Code: Select all

Elapsed: 20789 ms, 00:20, total size: 2415129988; Block sz: 8388608, block count: 288, time per block: 72 ms

Code: Select all

Elapsed: 12676 ms, 00:12, total size: 2415129988; Block sz: 8388608, block count: 288, time per block: 44 ms

Code: Select all

Elapsed: 292 ms, 00:00, total size: 2415129988; Block sz: 8388608, block count: 288, time per block: 1 ms
In the worst case (totally missed the cache), we need about 20 seconds to read all 2.4 GB data by 288 blocks. In cases of being cache-hit, we need 1/3 seconds only.

To measure loading time when missing the cache hits, we have to try many things such as moving/removing files, resetting the computer. That turned a test from being supposed easy-quick to hard-consume time. In the end, we gave up. Instead of insisting on avoiding cache, we have just run the test several times, to make sure they are maximized cached.


Test 2
We parse the loaded data into chess games, convert them into moves, then make and verify all moves. We use 4 threads (our computer has 4 physical cores) to parse concurrently.

Code: Select all

Thread count: 4
Elapsed: 225957 ms, 03:45, total size: 2415129988; Block sz: 8388608, block count: 288, time per block: 784 ms
We need 3:45 (under 4) minutes to scan all possible positions/moves directly from a PGN database of 3.45 million games.

Test 3 - approximate position searching
In this test, we try to answer 3 questions of approximate-position-searching as the below.

Find all games have:
  • Three White Queens
  • Two Black Rooks in the middle squares e4, e5, d4, d5
  • Three White Pawns in d4, e5, f4, g4, Black King in b7
To answer each question, we have to load whole data from the PGN file, parse game by game, move by move, and for every position of each move we verify if it satisfies the searching conditions. We use bitboards tech and bitwise operators as previous attempts. All run with 4 threads.

Stats for question 1 (Three White Queens):

Code: Select all

Querying 3 White Queens...
1. gameId: 13612, fen: r1b3Q1/pp6/1kn5/2bp4/5Q2/4QK2/1pq2P2/5BNR b -  0 23
2. gameId: 185317, fen: r1QQ1r2/p5bk/1p4p1/4p3/4q1p1/6Q1/P6P/R1R3K1 b -  0 35
3. gameId: 203113, fen: 1QQ1rbk1/5ppp/p2q1n2/3p4/4n3/1P1BPr1P/1B3P2/Q1R2RK1 b -  0 34
4. gameId: 286973, fen: 1QQ5/5pk1/5qp1/7p/4B3/1Q3P2/6PP/3R3K b -  0 42
5. gameId: 419966, fen: 1QQ5/4r1pk/Q6p/8/8/7P/6PK/8 b -  0 45
6. gameId: 793094, fen: Q1Q5/8/1Q6/6k1/3K2p1/1P4P1/8/8 b -  0 60
7. gameId: 824992, fen: BQQ5/5pk1/3Q1b1p/4p1p1/6P1/8/5PP1/4R1K1 b -  0 46
8. gameId: 864359, fen: r1b1kbQQ/2qp1p2/2n1p3/8/3N1P2/8/2P1B1P1/qqBQ1R1K b q  0 20
9. gameId: 904452, fen: 2QQ1rk1/5pp1/8/4QP2/8/q6P/pp4P1/4R2K b -  0 42
10. gameId: 1113376, fen: QQ3rk1/6bp/3p4/3Pnq1r/8/N3p3/P5PP/1R1Q3K b -  0 31
11. gameId: 1160557, fen: 3QQ3/2k5/1p3qp1/5n2/8/4QN2/P4KP1/8 b -  0 51
12. gameId: 1290445, fen: 2QQ1rk1/5p1p/4n1p1/4q3/2R5/r6P/6Q1/3R2NK b -  0 40
13. gameId: 1448906, fen: 1kb3QR/p1p5/1p2q3/8/3Qp3/2P1Q1R1/r4PK1/8 b -  0 40
14. gameId: 1484005, fen: 3Q2k1/4Q3/1q3ppp/8/8/8/6PP/q2r1Q1K b -  0 53
15. gameId: 1486171, fen: 5Q1K/4q1QP/1kpq4/p7/8/5Q2/8/8 b -  0 74
16. gameId: 1512007, fen: 3QQ1nk/6p1/1p3p2/pPp5/q5p1/5P2/5QKP/8 b -  0 44
17. gameId: 1552941, fen: 1QQ2r1k/6p1/7p/8/8/4BqP1/5P1P/5QK1 b -  0 67
18. gameId: 1570381, fen: Q6Q/1kn2Q2/2bq2B1/3p4/8/6N1/6PK/2q5 b -  0 58
19. gameId: 1601475, fen: 2QQ1r1k/6p1/4K2p/3PQ3/1p6/5qP1/8/8 b -  0 55
20. gameId: 1632795, fen: 1QQ2bk1/6r1/3p4/3Pp3/7p/5Pq1/4Q1Np/7K b -  0 39
21. gameId: 1658648, fen: 5Q2/kp6/p3n2q/2p5/8/6Q1/5QK1/8 b -  0 79
22. gameId: 1747921, fen: 4QQ2/8/kn6/p7/1ppP4/PqP5/1P6/1K2Q3 b -  0 50
23. gameId: 1759847, fen: QQ3rk1/7p/4p1p1/3pPr2/3B3q/2P2P2/6Q1/5RK1 b -  0 55
24. gameId: 1804829, fen: Q5n1/1KQ2pqk/2Q5/3N4/4P3/8/5q2/6q1 b -  0 58
25. gameId: 1927150, fen: Q7/6pk/6qp/5r2/5p2/3Q1P2/1R2p3/1K2Q3 b -  0 44
26. gameId: 1978517, fen: 1QQ5/8/8/2RQ4/1k6/6p1/6K1/8 b -  0 78
27. gameId: 2013189, fen: r4QQQ/1b6/pkp3r1/1p1p4/2n2R2/2P5/PP4P1/5RK1 b -  0 33
28. gameId: 2074368, fen: Q1Q5/5pk1/2Q3p1/5q2/7p/6P1/5P1P/2R3K1 b -  0 49
29. gameId: 2078314, fen: QQ2r3/5q1k/8/3p2p1/3B1pQ1/7P/5P1K/4q3 b -  0 54
30. gameId: 2139072, fen: 6Q1/8/2Q5/4p3/3qP3/5Q2/1q3PK1/qk6 b -  0 73
31. gameId: 2150224, fen: QQ1r3k/6pp/2p5/8/1PQ2p1q/4p2P/1R4P1/5KR1 b -  0 40
32. gameId: 2348023, fen: 1Q5Q/K7/6p1/8/6k1/5rq1/7p/3Q4 b -  0 61
33. gameId: 2689356, fen: 2Q1Q2N/R2Q2B1/8/5K2/8/8/P1k2P2/8 b -  0 73
34. gameId: 2951841, fen: 2QQ2R1/1Q6/4K3/4P1P1/8/P7/8/k7 b -  0 84
35. gameId: 2952331, fen: 6Q1/8/1Q6/8/P4N2/1P6/4RP1Q/2k4K b -  0 67
36. gameId: 2968213, fen: 4krQQ/1br5/p2qp3/1p4Q1/3p4/P2B2P1/1PP5/K5n1 b -  0 41
37. gameId: 2982337, fen: 5QQQ/8/8/4B3/5K2/3k4/8/8 b -  0 90
38. gameId: 3011690, fen: 1Q2Q3/P7/8/6K1/8/4Q3/7k/8 b -  0 67
39. gameId: 3016976, fen: 3Q3Q/8/2Q1N3/8/8/3K2k1/8/8 b -  0 75
40. gameId: 3020618, fen: 1QQQ4/2K1P3/8/8/k7/8/8/8 b -  0 93
41. gameId: 3066505, fen: Q7/1Q6/3ppkp1/4p2p/8/4QKPP/2q2P2/6r1 b -  0 46
42. gameId: 3072455, fen: Q2k4/1Q6/1K5Q/8/8/8/8/8 b -  0 70
43. gameId: 3114874, fen: 4QQQ1/8/8/8/8/2K5/2B5/4k3 b -  0 71
44. gameId: 3125226, fen: 1Q2Q3/1Q6/8/6N1/5K2/8/k7/8 b -  0 74
45. gameId: 3168700, fen: 1QQ5/6pk/6nr/8/8/4P1Q1/6PP/5RK1 b -  0 51
46. gameId: 3170156, fen: 2Q5/PP6/8/8/8/8/3Q4/1k2K1Q1 b -  0 95
47. gameId: 3171152, fen: 1QQ5/5ppk/7p/q1pp4/5B2/3b2PP/1Q2r1PK/8 b -  0 41
48. gameId: 3211109, fen: 1QQ1r1k1/5pp1/p3r2q/8/1P2N1Pp/P2Q4/4K2P/R6R b -  0 32
49. gameId: 3220759, fen: 1K4QQ/5Q2/8/8/8/3B4/7k/8 b -  0 77
50. gameId: 3264752, fen: 2Q3Q1/8/4b3/8/1k6/8/p4K2/Q7 b -  0 65
51. gameId: 3265287, fen: Q2QQR2/6k1/8/8/5KP1/8/8/8 b -  0 71
52. gameId: 3389290, fen: QQ6/8/8/3K2p1/6kp/8/7Q/8 b -  0 63
53. gameId: 3437475, fen: QQ6/8/K7/8/1Q6/8/k7/8 b -  0 94
54. gameId: 3438422, fen: 1QQ2rk1/4q1bp/Q2p1np1/4p3/2P1Pp1P/5P2/5BP1/4KB1R b K  0 28
55. gameId: 3444433, fen: 1Q1QQ3/8/8/5k2/8/8/P7/2K5 b -  0 59
56. gameId: 3445945, fen: 4Q2k/5Q2/p5p1/1p2P3/3P4/P2p2QP/3q4/2r3RK b -  0 48
57. gameId: 3450569, fen: 1Q3Q1Q/6B1/8/1p2p3/4k3/2P1p3/PP2K3/7R b -  0 47

Elapsed: 239582 ms, 03:59, total size: 2415129988; gameCnt: 3456399, succCount: 57, time per result: 4203 ms
Completed querying 3 White Queens.
SuccCount is the number of games satisfying the searching conditions. There is a total of 57 games from that PGN file as the result.


Stats for question 2 (Two Black Rooks in the middle squares e4, e5, d4, d5):

Code: Select all

…
6912. gameId: 2937024, fen: 8/5pk1/5pp1/pQ2r3/4r2p/1Pq4P/P4PP1/3R1RK1 w -  0 29
7168. gameId: 3009061, fen: 2k5/p1p4p/1np5/3r4/1B2r3/1P6/P1K3RP/R7 w -  0 23
7424. gameId: 3105028, fen: 8/6p1/5p1p/2k1r2P/2P1r1P1/8/3RBK2/8 w -  7 49
7680. gameId: 3206615, fen: 8/4k3/1p6/p1bPr3/P2Nr3/2PQ2P1/1P5P/6K1 w -  1 46
7936. gameId: 3309662, fen: 8/pp4k1/2p4p/3p3P/3rr1PK/8/PP3R2/5N2 w -  0 40
8192. gameId: 3447693, fen: 8/pp3k1p/5p2/4r1p1/1P2r3/P4R2/6PP/6K1 w -  1 33

Elapsed: 245043 ms, 04:05, total size: 2415129988; gameCnt: 3456399, succCount: 8212, time per result: 29 ms
Completed querying 2 Black Rooks in the middle squares.
There is a total of 8212 games from that PGN file, satisfying the searching conditions.


Stats for question 3 (Three White Pawns in d4, e5, f4, g4, Black King in b7):

Code: Select all

...
476. gameId: 3386119, fen: 7r/pk1b1ppr/1pn1p1p1/3pP3/q1pP1PP1/P1P2R1P/1QPB1R2/5BK1 b - g3 0 23
477. gameId: 3390873, fen: 3r1b1r/pk1qn1p1/1pn1p2p/3pPp1P/3P1PP1/5N1Q/PP1BN1K1/2R4R w -  1 18
478. gameId: 3402430, fen: 3rn2r/pk1q1pbp/1pp1p1p1/3pP3/3P1PP1/3RQNNP/PPP5/2K4R w -  2 20
479. gameId: 3408345, fen: 1rnq3r/pk1b1pp1/1p2p2p/n2pP2P/PBpP1PPN/2P5/2P1B3/R1Q1K2R w KQ  1 17
480. gameId: 3411214, fen: r4q2/bk2nr2/1pp3pp/3pP3/PP1P1PP1/2NQ1RKP/8/2R5 w -  3 29
481. gameId: 3432728, fen: 6n1/1k3pbp/6p1/p1BpP3/P2P1PP1/7P/8/N5K1 w -  1 33
482. gameId: 3444805, fen: 8/pk3ppp/1p2p3/3pP3/P1rP1PPP/R2K4/8/8 w -  0 32
483. gameId: 3452951, fen: 3r1b1r/pkqbnpp1/1p2p2p/n2pP2P/2pP1PP1/P1P1B2B/1P1NN3/R2QK2R w KQ  1 15

Elapsed: 264016 ms, 04:24, total size: 2415129988; gameCnt: 3456399, succCount: 483, time per result: 546 ms
Completed querying 2 White Pawns in the middle squares & Black King.
There is a total of 483 games from that PGN file, satisfying the searching conditions.


2. Quick conclusions for this attempt
  • In this attempt, we test the abilities and speed of implementing the approximate-possible-searching by using on-fly-calculating only. The idea (using on-fly-calculating only) is learned from SCID and seems to work well
  • Even those answers use all information calculated on the fly, their elapsed times are very close or just a bit larger than the time for reading and parsing PGN games
IMHO, those speeds are comparable to SCID. Our search function is still using bitboards, guaranteeing all necessary information is always available to answer any kind of position questions.


3. Source code
All code of this Attempt has been pushed already into a new branch “testspeedonfly”.

There are some notes:
  • The chessboard in the code is taken from the Banksia project, using a mailbox as the board representation. It was invested a lot of labor focusing on correctness and easiness to expand to other chess variants. However, it is not yet optimized for speed. That kind of board doesn’t have bitboards either but we create on the fly too whenever needed
  • Out chessboard is also using our own coordination, quite different from popular chess engines such as Stockfish (using little-endian coordination). For being convenient to other programmers, in this attempt, we use Stockfish chess coordinating for bitboards instead. That means our bitboards can match exactly or compare directly with ones from Stockfish
All above mentions mean: if needed our chessboard could be replaced easily by any Stockfish-compatible-chessboard, for getting more speed and convenience.

We have trimmed out all redundant as well as SQL functions. Thus the code became clean and convenient if someone wants to study the search and/or use it for other purposes.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
ndbd
Posts: 9
Joined: Mon Jan 11, 2021 11:15 am
Full name: Roger S.

Re: Open Chess Game Database Standard

Post by ndbd »

So what's your conclusion, i.e. your personal opinion on these data?

a) you could pursue the approach of creating SQL table structures in order to find game information
b) just keep PGNs, load PGNs into memory and do everything in-memory (same as done with tarraschdb, or, when reading through the forums at hce, also done there)
c) use approach b) but simply dump your in-memory data into SQLite, i.e. essentially use SQLite as a storage format (you could also try hd5, or simply binary blobs).
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

ndbd wrote: Sun Jan 02, 2022 10:17 pm So what's your conclusion, i.e. your personal opinion on these data?

a) you could pursue the approach of creating SQL table structures in order to find game information
b) just keep PGNs, load PGNs into memory and do everything in-memory (same as done with tarraschdb, or, when reading through the forums at hce, also done there)
c) use approach b) but simply dump your in-memory data into SQLite, i.e. essentially use SQLite as a storage format (you could also try hd5, or simply binary blobs).
We do conclude when we have been completing or have redundant information. At the moment, we are lacking some information and we still want to try many ideas.

We have planned to use SQLite as much as possible since it is "open" and has many advantages. However, we are driven by experiments and evidence, thus I am not sure at the end how much/deep the integration is or if we should use it at all.

I knew many people are eager to see the end. But you all could see we have been working hard, intensively for that.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Position searching: 9th attempt

Post by phhnguyen »

Position searching: 9th attempt

In this attempt, we try to integrate back the previous work (Attempt 8th) with SQLite.

1. Build the database
We parse the PGN file to get games and insert them into the database. Because the method of approximate-position-searching of Attempt 8th is quite different from older ones, the database will be back to the simple design of early days, quite similar to very early attempts (1 & 2). We store all moves of a game as text extracted from the PGN file, without parsing it into moves. No index has been added at the moment.

Code: Select all

#games: 3456399, elapsed: 57489ms 00:57, speed: 60122 games/s
When running on my computer, the program took about a minute to complete from a PGN file of 3.45 million games. The data file is 2.13 GB, smaller but still comparable to the size of the original PGN file (2.4 GB).

For a while with some attempts, now we have seen something such small and fast back :D


2. Scan all records without parsing
We scan all records of the new SQLite database, get FEN, all moves of each game as two text strings, and measure the consuming time. To make sure the database loaded all necessary text, we calculate their lengths to find their average ones. We don’t parse any FEN, move strings yet.

Below are some stats, loading, and scanning at different times:

Code: Select all

Elapsed: 45680 ms, 00:45, total games: 3456399, ave length: 453

Code: Select all

Elapsed: 13652 ms, 00:13, total games: 3456399

Code: Select all

Elapsed: 4755 ms, 00:04, total games: 3456399, fenCnt: 67, ave length: 49
Again, caches caused various periods. The database may take from 45 seconds to only 4 seconds to read/scan all records.
Loading all games via SQLite is a bit slower than loading via PGN file. However, the extra time is not much.


3. Scan all records and parse their moves, using 1 thread only

This test scans all records, one by one, and parses all moves, using only one thread.

Code: Select all

Elapsed: 733943 ms, 12:13, total games: 3456399, hdpLen: 274346246, ave length: 79
Parsing all moves can take over 12 minutes, several times longer than the above tests.

4. Scan all records, parse their moves, using multi threads
When reading data from either the PGN file or from the database, multi-threading cannot help since in/out are serial accesses. However, parsing games and moves could be done concurrently.

Code: Select all

Elapsed: 229020 ms, 03:49
The elapsed time to read and parse all records is about 4 minutes, very similar to the previous attempt when we have read and parse all moves from the PGN file.


5. Searching
We find answers for some approximate-position-searching queries. As with previous attempts, we use bitboards as the main tech. All data is calculated when needed only (create on the fly). In this attempt, the code returns a redundant set of information, including all necessary bitboards (black, white, kings, queens, rooks, bishops, knights, pawns), whiteKingSquare, blackKingSquare and the hash of the position. That makes sure we can have all information we need.

a. Three White Queens

Stats for question 1 (Three White Queens):

Code: Select all

…
54. gameId: 3438421, fen: 1QQ2rk1/4q1bp/Q2p1np1/4p3/2P1Pp1P/5P2/5BP1/4KB1R b K  0 28
55. gameId: 3444432, fen: 1Q1QQ3/8/8/5k2/8/8/P7/2K5 b -  0 59
56. gameId: 3445943, fen: 4Q2k/5Q2/p5p1/1p2P3/3P4/P2p2QP/3q4/2r3RK b -  0 48
57. gameId: 3450570, fen: 1Q3Q1Q/6B1/8/1p2p3/4k3/2P1p3/PP2K3/7R b -  0 47

Elapsed: 242692 ms, 04:02, total results: 57, time per results: 4257 ms

Stats for question 2 (Two Black Rooks in the middle squares e4, e5, d4, d5):

Code: Select all

…
6400. gameId: 2757600, fen: 3b4/1p4k1/p4p1p/4rR2/P1Br2P1/1PR4P/5PK1/8 w -  2 39
6656. gameId: 2849353, fen: 8/2p5/p1k2b2/3r1p1p/Np2r3/1P1R2PP/P4PB1/5K2 w -  2 32
6912. gameId: 2937024, fen: 8/5pk1/5pp1/pQ2r3/4r2p/1Pq4P/P4PP1/3R1RK1 w -  0 29
7168. gameId: 3009060, fen: 2k5/p1p4p/1np5/3r4/1B2r3/1P6/P1K3RP/R7 w -  0 23
7424. gameId: 3105027, fen: 8/6p1/5p1p/2k1r2P/2P1r1P1/8/3RBK2/8 w -  7 49
7680. gameId: 3206614, fen: 8/4k3/1p6/p1bPr3/P2Nr3/2PQ2P1/1P5P/6K1 w -  1 46
7936. gameId: 3309660, fen: 8/pp4k1/2p4p/3p3P/3rr1PK/8/PP3R2/5N2 w -  0 40
8192. gameId: 3447693, fen: 8/pp3k1p/5p2/4r1p1/1P2r3/P4R2/6PP/6K1 w -  1 33

Elapsed: 241844 ms, 04:01, total results: 8212, time per results: 29 ms
There is a total of 8212 games from that database, satisfying the searching conditions.


Stats for question 3 (Three White Pawns in d4, e5, f4, g4, Black King in b7):

Code: Select all

…
470. gameId: 3316122, fen: 4q1rr/pkn1npbp/1pp1pBp1/3pP3/3P1PP1/1NPB1Q1P/PP6/2KR3R w -  1 19
471. gameId: 3319879, fen: 8/1k2np2/1rp1p1pp/3pP3/P2P1PP1/KN6/2R4P/8 w -  2 36
472. gameId: 3333216, fen: 3r3r/1knnbp2/qp2p1p1/2ppP3/p2P1PP1/2P1BNN1/PP2Q1K1/R4R2 w -  1 19
473. gameId: 3359101, fen: 5r2/1k3p1p/p1p1p1p1/1nP1P3/1NpP1PP1/7P/8/KR6 w -  1 36
474. gameId: 3374675, fen: 8/1k3p2/4p3/p2pP3/p2P1PP1/4K3/Nb6/8 b - g3 0 44
475. gameId: 3381812, fen: 4r1b1/1kp3Kp/2p3p1/B1RpPp2/3P1PP1/P6P/8/8 b -  0 51
476. gameId: 3386121, fen: 7r/pk1b1ppr/1pn1p1p1/3pP3/q1pP1PP1/P1P2R1P/1QPB1R2/5BK1 b - g3 0 23
477. gameId: 3390871, fen: 3r1b1r/pk1qn1p1/1pn1p2p/3pPp1P/3P1PP1/5N1Q/PP1BN1K1/2R4R w -  1 18
478. gameId: 3402430, fen: 3rn2r/pk1q1pbp/1pp1p1p1/3pP3/3P1PP1/3RQNNP/PPP5/2K4R w -  2 20
479. gameId: 3408345, fen: 1rnq3r/pk1b1pp1/1p2p2p/n2pP2P/PBpP1PPN/2P5/2P1B3/R1Q1K2R w KQ  1 17
480. gameId: 3411213, fen: r4q2/bk2nr2/1pp3pp/3pP3/PP1P1PP1/2NQ1RKP/8/2R5 w -  3 29
481. gameId: 3432727, fen: 6n1/1k3pbp/6p1/p1BpP3/P2P1PP1/7P/8/N5K1 w -  1 33
482. gameId: 3444804, fen: 8/pk3ppp/1p2p3/3pP3/P1rP1PPP/R2K4/8/8 w -  0 32
483. gameId: 3452949, fen: 3r1b1r/pkqbnpp1/1p2p2p/n2pP2P/2pP1PP1/P1P1B2B/1P1NN3/R2QK2R w KQ  1 15

Elapsed: 243748 ms, 04:03, total results: 483, time per results: 504 ms
There is a total of 483 games from that database, satisfying the searching conditions.


6. Quick conclusions for this attempt
- All results are identical to ones of the previous attempt event there is a bit different on game IDs, affected by multi-threading (they make read/write be a bit different between time)
- As fast as the previous Attempt (which worked directly with the PGN file).


7. Source code
All code of this Attempt has been pushed already into a new branch “searchwithonflycalculating”.


8. Discussion
We compare the stats from this attempt, using SQLite database with ones from the previous attempt, using directly PGN file:
- are a bit smaller
- similar speed for approximate-position-searching
- have huge advantages when querying information in game-headers

We are happy with the current results and may use the design/method of this attempt as the last one even we still need some more experiments.

Perhaps, in the coming time, we will:
- use simple, straightforward design for SQL/SQLite databases. With that design, we won’t have any problem with ID sizes, the limited number of records or memory problems when creating/using
- use all calculations on the fly for exact-position-matching as well as approximate-position-searching
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Solutions for approximate-position-searching?

Post by phhnguyen »

Discussion needed for the next steps: Solutions for approximate-position-searching?


Since someone got hard to be members of this forum, I create the main discussion on GitHub of this project. Please use that link if you could. Otherwise posting here is fine too.

In this post, we will make a brief about what we have/done and then what issues we are facing. We would like to hear your opinions, suggestions and get help from some experts.

A. Current status

1. Building databases
With the latest Attempt 9th, we store game contents into SQLite databases in a simple, straightforward way. Each game contains two important fields, one is a string of FEN (empty if the game started from the start position), the other is a string of all moves. The last string is not yet split nor parsed into individual moves.

This way makes the building process be very fast. A PGN file of 3.45 million games required under a minute to be converted into a SQL database stored on the hard disk (the program used only one thread, running on my old computer Quad-Core i7, 3.6 GHz, 16 GB RAM).

2. Extract moves/positions
Because we don’t store any information about moves nor their positions we much create all those information on the fly, whenever we need. Basically to do position matching or searching, we need to read all games from the database, parse the text of moves into moves, make all those moves in a chessboard to get all information about their positions.

In our tests, the time to read all games from the SQLite database is just similar to the time to read them from the PGN file. Total time to read and parse all games of a database 3.45 million games is under 4 minutes on my computer (IMHO, that is fast, somewhat is comparable to SCID - the fastest chess database program so far).

The code to read all games, get their pair strings of FEN-moves without splitting and parsing as the following:

Code: Select all

for (auto cnt = 0; statement.executeStep(); cnt++) {
    auto gameID = statement.getColumn("ID").getInt64(); assert(gameID > 0);
    auto fenText = statement.getColumn("FEN").getText();
    auto moveText = statement.getColumn("Moves").getText();
}
Below is the code to parse the pair strings into a chessboard:

Code: Select all

for auto cnt = 0(; statement.executeStep(); cnt++) {
    auto gameID = statement.getColumn("ID").getInt64();
    auto fenText = statement.getColumn("FEN").getText();
    auto moveText = statement.getColumn("Moves").getText();
    threadParsePGNGame(gameID, fenText, moveText);
}

3. Exact-position-matching and approximate-position-searching
Both matching and searching need the information of all positions which are created on the fly as mentioned in the above part. After parsing/making each move the program calls a callback function, provides it with redundant parameters, including a full set of bitboards, king squares, hash key of the position. We can add code for that callback function to do any task we need.

For example, to find all positions with 3-White-Queens we create the callback as the following. The code takes the bitboard of all Queens, do operator AND with the bitboard of side White and then count all bit 1, using the function popCount:

Code: Select all

checkToStop = [=](int64_t gameId, const std::vector<uint64_t>& bitboardVec, const bslib::BoardCore* board) -> bool {
                    auto White = bitboardVec[static_cast<int>(bslib::BBIdx::white)];
                    auto Queens = bitboardVec[static_cast<int>(bslib::BBIdx::queens)];

                    if (popCount(White & Queens) == 3) {
                        succCount++;
                        std::cout << succCount << ". gameId: " << gameId << ", fen: " << board->getFen() << std::endl;
                        return true;
                    }

                    return false;
                };
 
All works well. The callback function runs very fast. It takes some extra time but just a little bit, thus the total time is quite close to the time needed to read and parse games (about 4 minutes).

B. The issue and considering solutions
Even the way to match/search as above part works, it requires hard-coding. We need to write some code, compile and run it. That is impossible for average users.

We need solutions for this issue. It should help users to query what they want without coding.

One of the simplest solutions is to use dialog boxes. The program creates some dialog boxes, users tick some boxes, select something from dropdowns. After all the program will run the callback function with parameters from that dialog box (using some If-Switch commands). However, using dialog boxes has a huge limit since they can’t cover all cases and any effort to expand covering may pay by heavy complexity. Dialog boxes are suitable for the programs with GUI but not for ones in the form of console/command lines.

We think solutions should relate to the text-only: users can query what they want just by entering some simple text without compiling and rerunning the program.

At the moment we have been considering two solutions as the below.

1. Using CQL (Chess Query Language)
Frankly speaking, even though I have known CQL for a while but I didn’t high evaluate it nor plan to use it since it looked so complicated for me. I didn’t learn to use it and was confused about its license. Clearly, I didn’t have enough information and avoided it as much as possible. However, some experts have given me some simple examples of usage, enough for me to understand how strong it is and how simple to use it is (at least there are some simple ways to use it).

If we could integrate CQL with SQL databases all will be done in a quick way. CQL has been developing for a long time by a good team thus it should be good for its tasks.

We have started finding more information, got the latest code as well as some confirmations from CQL authors (thanks a lot and high appreciation to them). The authors don’t have any objection about using/integrating their code and may support us too.

The good news is that the code written by those authors is an MIT license, compatible with the license of this project, and could be used in any other project.

However, they warned us that CQL is using some code from SCID with a GPL license which conflicts with our license. They can’t contact Shane Hudson (the first author of SCID) to change the license. That code is mostly for PGN parsing and may take a lot of effort to replace. Yes, we have code to parse PGN but looks like not an easy task (to replace).

2. Developing a simple expression parser

Instead of coding, users can enter an expression as a text. The program will compile, auto changes it to parameters, and run with the callback function. For example, users can enter such as below expression strings:

Code: Select all

// find all positions having 3 White Queens
“popCount(White & Queens) == 3”

// Find all positions having two Black Rooks in the middle squares
“popCount(Black & Rooks & bb(e4, e5, f4, f5)) == 2”

// White Pawns in d4, e5, f4, g4, Black King in b7
“(White & Pawns & bb(d4, e5, f4, g4)) == bb(d4, e5, f4, g4) && blackKingSquare == b7
Since that is just simple expressions, short strings, focusing on working with bitboards, and some specific chess constants (such as d4), it should be much easier to implement than CQL. Of course, that is our code we won’t have any problem with license conflicts.


C. Discussion

We love to hear your opinions, especially someone who has experience working and/or implementing CQL. Any suggestion, help on implementation is highly appreciated.
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager
mar
Posts: 2654
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Solutions for approximate-position-searching?

Post by mar »

I love #2 (a simple expression parser)

but I'd probably go full bitboard even with the square constants, seems more natural to me, you could then simply or the squares instead of the bb glue
you could also have bitboard constants like Rank1 or DFile and so on, sounds pretty cool to me
for moves, from/to could be unpacked to bitboards

so a simple recursive descent parser with constant folding might already work very well, the only type would be 64-bit integers.
in order to perform better, however, I'd suggest to process in batches, like 64 filters at once (because C++ => VM calls can be expensive), packing results into a 64-bit result and returning
that to the C/C++, later you could even translate IR/bytecode to machine code and while far from the performance of C/C++, you'd still beat
any interpreter out there comfortably.
if it's powerful enough, you might even be able to compile higher lever "query languages", basically unlimited possibilities

EDIT: also, I think simple shortcuts like WhitePawns = (White & Pawns) would be neat
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: Solutions for approximate-position-searching?

Post by Fulvio »

phhnguyen wrote: Fri Jan 07, 2022 8:10 am

Code: Select all

// find all positions having 3 White Queens
“popCount(White & Queens) == 3”

// Find all positions having two Black Rooks in the middle squares
“popCount(Black & Rooks & bb(e4, e5, f4, f5)) == 2”

// White Pawns in d4, e5, f4, g4, Black King in b7
“(White & Pawns & bb(d4, e5, f4, g4)) == bb(d4, e5, f4, g4) && blackKingSquare == b7
Maybe an SQL-like syntax:
COUNT(WQ%) = 3
COUNT(BRe4, BRe5, BRf4, BRf5) = 2
COUNT(WPd4, WPe5, WPf4, WPg4, BKb6) = 5
COUNT(_P_4) -> count pawns (both colors) on the 4th rank
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

Look like implementing CQL for our SQL database is not a quick/easy task and may delay much. When hopping and waiting for someone to jump in to help, we started the backup solution: develop a simple condition/expression parser to compile strings from users.

IMO, CQL has some good ideas to make conditions/expressions for position searching. Thus we have accepted and tried to make our work to be compatible as much as possible with CQL on conditions/expressions. However, we ignore other parts of CQL since SQL itself can search all other things.

Because the authors of CQL didn’t publish its language defines, we have started creating a BNF (Backus Naur Form) for the syntax of the expression/condition parser, used for position searching.

Code: Select all

clause = condition { ("and" | "or" | "&&" | "||") condition }
condition = expression { ( "=" | "<" | "<="| " >" | ">=" | "!=" | "<>" ) expression }
expression = [ "+" | "-" ] term  {( "+" | "-" ) term }
term = factor {( "*" | "/" ]) factor} 

factor = number | piece | "(" expression ")"
piece = piecename (<empty> | square | squareset)

piecename = "K" | "Q" | "R" | "B" | "N" | "P" | "k" | "q" | "r" | "b" | "n" | "p" | "white" | "black"

squareset = column | row | "[" (square | squarerange | columnrange | rowrange) { "," (square | squarerange | columnrange | rowrange) } "]"

squarerange = square "-" square
columnrange = column "-" column
rowrange = row "-" row
square = column row 
column = "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h"
row = "1" | "2" | "3" | "4" | "5" | "6" | "7" | "8"
A condition/expression may have some chess piece names. For evaluating, they are cardinalities/total numbers of those chess pieces on the chessboard. For examples:

Code: Select all

R                the total number of White Rooks
qb3              the total number of Black Queens on square b3
B3               the total number of White Bishops on row 3
bb               the total number of Black Bishops on column b
n[b-e]           the total number of Black Knights from column b to e
P[a4, c5, d5]    the total number of White Pawns on squares a4, c5, and d5
The condition may be implicit or explicit:

Code: Select all

R 		the implicit form of the comparison R != 0
R == 3		the total of White Rooks must be 3
q[5-7] >= 2     the total of Black Queens from row 5 to row 7 must be equal or larger than 2
Some other examples:

Code: Select all

// find all positions having 3 White Queens
Q = 3

// Find all positions having two Black Rooks in the middle squares
r[e4, e5, f4, f5] = 2

// White Pawns in d4, e5, f4, g4, Black King in b7
P[d4, e5, f4, g4] = 4 and kb7

// Black Pawns in column c more than 1
pc > 1

// White Pawns in row 3 from 2
P3 >= 2

// Two Bishops in column c, d, e, f
B[c-f] + b[c-f] = 2
https://banksiagui.com
The most features chess GUI, based on opensource Banksia - the chess tournament manager