Open Chess Game Database Standard

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Position searching: 5th attempt

Post by phhnguyen »

Position searching: 5th attempt

In the 3rd and 4th attempts, we tried to use hash keys to match positions. Each hash key is a 64-bit integer representing the chess position reached after a move. Each hash key is stored in the table HashPositions as a record/row with some extra info such as the ply and the game ID. There are over 277 million records for the database of 3.45 million games. That is huge and increases the size of the database from 2.2 GB to 8.02 GB (without indexes) or 13.74 GB (with indexes).

One of the problems is the time needed to collect all results for a given hash key. Some hash keys have over a million records, and the program may need several minutes to scan and pick them all up!

In this attempt, we merge all records of a hash key into a single record. That saves some space, since information such as the hash key is no longer repeated, and searching can be faster in some cases because we avoid scanning the whole database.

To merge all records of a hash key, we keep only their game IDs, packed into an array of integers so they can be stored in a single column. The most suitable data type for such an array is SQL's BLOB.
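For illustration, here is a rough sketch (not the exact code in the repository) of how one hash key and its packed array of game IDs could be written into a single BLOB column with the SQLite C API, using the HashPositionsBlob table that appears later in this post:

Code: Select all

#include <sqlite3.h>

#include <cstdint>
#include <vector>

// Sketch: store all game IDs belonging to one hash key as a single blob.
// Error handling is omitted for brevity.
void writeHashKey(sqlite3 *db, uint64_t hashKey, const std::vector<int32_t> &gameIDs)
{
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT INTO HashPositionsBlob (HashKey, MultiGameIDs) VALUES (?, ?)",
        -1, &stmt, nullptr);

    sqlite3_bind_int64(stmt, 1, static_cast<sqlite3_int64>(hashKey));

    // The whole array of 32-bit game IDs becomes one blob value.
    sqlite3_bind_blob(stmt, 2, gameIDs.data(),
                      static_cast<int>(gameIDs.size() * sizeof(int32_t)),
                      SQLITE_TRANSIENT);

    sqlite3_step(stmt);
    sqlite3_finalize(stmt);
}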


1. Building the database

As mentioned, the database has over 277 million positions/hash keys. It is huge and may push everything to the edge of our resources and time. We had to make many tries:

1st try
For the first implementation, we tried to store everything in memory. That is typically the simplest and fastest method, using standard data structures such as std::vector, std::map… We designed it so that each working thread has its own data set (to avoid conflicts and thread locks), and the sets are merged when writing to the database.
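A rough sketch of that layout (my own illustration, not the actual code): each thread fills its own map from hash key to game IDs, and the maps are merged before writing.

Code: Select all

#include <cstdint>
#include <unordered_map>
#include <vector>

// One map per worker thread: hash key of a position -> game IDs reaching it.
using PositionMap = std::unordered_map<uint64_t, std::vector<int32_t>>;

std::vector<PositionMap> threadMaps;   // one entry per worker thread

// Merge all per-thread maps into one before writing to the database.
PositionMap mergeAll()
{
    PositionMap merged;
    for (const auto &m : threadMaps)
        for (const auto &[hashKey, gameIDs] : m) {
            auto &dst = merged[hashKey];
            dst.insert(dst.end(), gameIDs.begin(), gameIDs.end());
        }
    return merged;
}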

Building result: the program always crashed after processing about 2/3 of the games. Discarded since it cannot complete. FAILED

2nd try
This method is the opposite of the 1st one: all data is stored directly in the database. When writing a pair (hash key, game ID), the app loads the old data of that hash key (with a SELECT statement), appends the new game ID to the end, then updates the record (with an UPDATE statement).

Result: the update speed slowed down dramatically after a short period, to worse than crawling, under 100 records per second (the speed when I stopped it). At that speed we would need to wait months or even years to complete. It takes so long because the SELECT and UPDATE statements have to scan all records of the database (a huge number) for every load and update. FAILED


3rd try
A mix of the 1st and 2nd methods: data is built up in memory and written down in bulk after a while. Old data is cleared after each write to release memory. For the database of 3.45 million games, only 3 such writes are needed.

Building result: it did not crash. However, each write still has to load/update old records, which takes too much time, as in the previous try. FAILED

4th try
In this try, we use a single (shared) data set for all working threads, to reduce the total data size in memory. The threads need thread locks to access the data. That avoids the crash of the 1st try: in practice, the program can hold all hash keys in memory after parsing all 3.45 million games.
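Roughly, that shared data set can look like this (a sketch under my own assumptions, not the exact implementation):

Code: Select all

#include <cstdint>
#include <mutex>
#include <unordered_map>
#include <vector>

// One map shared by all worker threads, guarded by a mutex:
// hash key of a position -> all game IDs reaching that position.
std::unordered_map<uint64_t, std::vector<int32_t>> hashToGameIDs;
std::mutex mapMutex;

// Called by every worker thread for each position of each parsed game.
void addPosition(uint64_t hashKey, int32_t gameID)
{
    std::lock_guard<std::mutex> lock(mapMutex);
    hashToGameIDs[hashKey].push_back(gameID);
}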

To avoid the slowdown, we hold the data of all hash keys until the end, then write them down with simple INSERT statements only, not SELECT/UPDATE ones (which have to scan the whole database and caused the terrible speed). For each hash key, all of its game IDs are written down as one blob.

Building result: the insert speed still slowed down dramatically after a while. FAILED

5th try
After studying for a while, we realized that inserting too many blobs caused that terrible speed. It looks like there is no solution except reducing the number of blobs.

From chess experience, we knew that many games may share the first few moves/positions, but at deeper plies positions become almost unique, one game per position. The real stats (below) show that such unique positions (only one game each) make up a very large share, over 96%.

We create two tables: the first stores the pair (hash key, game ID) for unique positions; the second stores the pair (hash key, blob of game IDs) for all other positions.

The SQLite statements to create those tables are as below:

Code: Select all

CREATE TABLE HashPositionsBlob (HashKey INTEGER, MultiGameIDs BLOB)

CREATE TABLE HashPositionsOne (HashKey INTEGER, GameID INTEGER)
We found and applied some tricks to avoid crashing: update the blob table first, then the other one, releasing memory as soon as we can; and commit the transactions in a few large batches rather than once per row.
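For example, the commits can be grouped roughly like this (a sketch with an assumed batch size, not the exact code; keys with a single game would go to the HashPositionsOne table instead, which is omitted here for brevity):

Code: Select all

#include <sqlite3.h>

#include <cstdint>
#include <unordered_map>
#include <vector>

// The INSERT helper from the earlier sketch.
void writeHashKey(sqlite3 *db, uint64_t hashKey, const std::vector<int32_t> &gameIDs);

// Sketch: write everything inside a few large transactions instead of
// committing after every INSERT, which would be far too slow in SQLite.
void flushAll(sqlite3 *db, const std::unordered_map<uint64_t, std::vector<int32_t>> &data)
{
    const size_t batchSize = 1000000;  // assumed: commit once per million hash keys
    size_t count = 0;

    sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr);
    for (const auto &[hashKey, gameIDs] : data) {
        writeHashKey(db, hashKey, gameIDs);
        if (++count % batchSize == 0) {
            sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
            sqlite3_exec(db, "BEGIN TRANSACTION", nullptr, nullptr, nullptr);
        }
    }
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
}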

The stats of building the database are below:

Code: Select all

Completed writing hash keys, total: 202320464, one: 194519041, blob: 7801423, elapsed: 401915ms 06:41, total elapsed: 401915ms 06:41, speed: 503391 hash keys/s
#games: 3456399, #errors: 0, #hashKeys: 0, #posCnt: 552148891, elapsed: 1393777ms 23:13, speed: 2479 games/s
Instead of over 277 million hash-key records, we now have only 202 million. The number of hash keys with more than one game ID (which need to be stored as blobs) is 7.8 million, only 4%. The database size is now 6.75 GB.

Building result: SUCCEED

6th try
Having two tables could create some trouble when querying, since we may have to query twice and handle two different result sets. To avoid that, we create a single table with two data columns, one for the GameID and one for the blob (MultiGameIDs). For any record, only one of the two columns stores data; the other contains NULL.

Code: Select all

CREATE TABLE HashPositions (HashKey INTEGER, GameID INTEGER DEFAULT NULL, MultiGameIDs BLOB DEFAULT NULL)

The stats of building the database are below:

Code: Select all

Completed writing hash keys, total: 202320464, one: 194519041, blob: 7801423, elapsed: 562816ms 09:22, speed: 359478 hash keys/s
#games: 3456399, #errors: 0, #hashKeys: 202320464, #posCnt: 552148891, elapsed: 1510472ms 25:10, speed: 2288 games/s

Because an extra NULL is stored for every record, the database is a little bigger: 6.95 GB. Building took about 25 minutes.

Building result: SUCCEED



2. Search
For a hash key created from the given position, we need only one read to get all game IDs:

Code: Select all

SELECT * FROM HashPositions WHERE HashKey=?
That is very fast, simple, and may be enough for many purposes.
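In code, reading those game IDs back (either from the GameID column or from the MultiGameIDs blob) can look roughly like this (a sketch, not the exact implementation):

Code: Select all

#include <sqlite3.h>

#include <cstdint>
#include <vector>

// Sketch: collect all game IDs of one hash key from the merged HashPositions table.
std::vector<int32_t> gameIDsForHashKey(sqlite3 *db, uint64_t hashKey)
{
    std::vector<int32_t> ids;
    sqlite3_stmt *stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT GameID, MultiGameIDs FROM HashPositions WHERE HashKey=?",
        -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, static_cast<sqlite3_int64>(hashKey));

    if (sqlite3_step(stmt) == SQLITE_ROW) {
        if (sqlite3_column_type(stmt, 0) != SQLITE_NULL) {
            // Unique position: one game ID, stored directly.
            ids.push_back(sqlite3_column_int(stmt, 0));
        } else {
            // Shared position: the blob is the raw array of 32-bit game IDs.
            auto *p = static_cast<const int32_t *>(sqlite3_column_blob(stmt, 1));
            int n = sqlite3_column_bytes(stmt, 1) / static_cast<int>(sizeof(int32_t));
            if (p && n > 0)
                ids.assign(p, p + n);
        }
    }
    sqlite3_finalize(stmt);
    return ids;
}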

Sometimes we also need game data such as player names, results, moves… We can query it game by game, which is simple too:

Code: Select all

SELECT * FROM Games WHERE GameID=?
However, there is a challenge if we use libraries such as Qt to display/process the data. Those libraries typically require a single query to display all records and do not accept querying one by one. SQL cannot work directly with the array we extract from the BLOB, and that array may contain millions of numbers, so we cannot convert it into an SQL text command (e.g. a huge IN (...) list): the string would be too long and exceed the limit.

Luckily, we found a solution: an application-defined (callback) function, which SQLite supports. We create a function that checks whether a game ID is in the result set we obtained from querying the HashPositions table as above.

Code: Select all

SELECT * FROM Games WHERE inResultSet(GameID)
The real query, which gathers all the necessary information by joining several tables, is below:

Code: Select all

SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Timer, ECO, PlyCount, FEN, Moves
FROM Games g
INNER JOIN Players w ON WhiteID = w.ID
INNER JOIN Players b ON BlackID = b.ID 
WHERE inResultSet(g.ID)
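Note that inResultSet() is not a built-in SQL function: the application has to register it on the connection before running the query. A rough sketch of that registration with sqlite3_create_function (my own illustration of the technique, not the exact project code):

Code: Select all

#include <sqlite3.h>

#include <cstdint>
#include <unordered_set>

// The game IDs obtained from the HashPositions query, kept by the application.
static std::unordered_set<int32_t> resultSet;

// Called by SQLite for every row of the Games query; returns 1 if the game ID
// belongs to the result set, 0 otherwise.
static void inResultSetFunc(sqlite3_context *ctx, int argc, sqlite3_value **argv)
{
    (void)argc;
    const int32_t gameID = sqlite3_value_int(argv[0]);
    sqlite3_result_int(ctx, resultSet.count(gameID) ? 1 : 0);
}

// Register the function so SQL statements can call inResultSet(GameID).
void registerInResultSet(sqlite3 *db)
{
    sqlite3_create_function(db, "inResultSet", 1, SQLITE_UTF8,
                            nullptr, inResultSetFunc, nullptr, nullptr);
}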
The stats of querying the database (two queries per hash key: one on HashPositions, one on Games), without indexes, with the database stored on a hard disk (not loaded into memory), and extracting all records, are below:

Code: Select all

ply: 1, #total queries: 1000, elapsed: 2762002 ms, 46:02, time per query: 2762 ms, total results: 1138013701, results/query: 1138013
ply: 2, #total queries: 1000, elapsed: 1919293 ms, 31:59, time per query: 1919 ms, total results: 359609242, results/query: 359609
ply: 3, #total queries: 1000, elapsed: 1584611 ms, 26:24, time per query: 1584 ms, total results: 199681101, results/query: 199681
ply: 4, #total queries: 1000, elapsed: 1256597 ms, 20:56, time per query: 1256 ms, total results: 37937493, results/query: 37937
ply: 5, #total queries: 1000, elapsed: 1116575 ms, 18:36, time per query: 1116 ms, total results: 5548104, results/query: 5548
ply: 6, #total queries: 1000, elapsed: 1109502 ms, 18:29, time per query: 1109 ms, total results: 6513526, results/query: 6513
ply: 7, #total queries: 1000, elapsed: 1084842 ms, 18:04, time per query: 1084 ms, total results: 2447268, results/query: 2447
ply: 8, #total queries: 1000, elapsed: 1062462 ms, 17:42, time per query: 1062 ms, total results: 5304993, results/query: 5304
ply: 9, #total queries: 1000, elapsed: 2849016 ms, 47:29, time per query: 2849 ms, total results: 2521642, results/query: 2521
ply: 10, #total queries: 1000, elapsed: 1749207 ms, 29:09, time per query: 1749 ms, total results: 1994074, results/query: 1994
ply: 11, #total queries: 1000, elapsed: 1590678 ms, 26:30, time per query: 1590 ms, total results: 429535, results/query: 429
ply: 12, #total queries: 1000, elapsed: 1615125 ms, 26:55, time per query: 1615 ms, total results: 298440, results/query: 298
ply: 13, #total queries: 1000, elapsed: 2203547 ms, 36:43, time per query: 2203 ms, total results: 235978, results/query: 235
ply: 14, #total queries: 1000, elapsed: 1710170 ms, 28:30, time per query: 1710 ms, total results: 153136, results/query: 153
ply: 15, #total queries: 1000, elapsed: 2267163 ms, 37:47, time per query: 2267 ms, total results: 10639, results/query: 10
ply: 16, #total queries: 1000, elapsed: 2313609 ms, 38:33, time per query: 2313 ms, total results: 34300, results/query: 34
ply: 17, #total queries: 1000, elapsed: 3275912 ms, 54:35, time per query: 3275 ms, total results: 11582, results/query: 11
ply: 18, #total queries: 1000, elapsed: 2913447 ms, 48:33, time per query: 2913 ms, total results: 9781, results/query: 9
A query takes about 1-3 seconds on average. That is quite fast, so we don't bother to create an index for it.

Compared with the results of the 3rd attempt:
- The database size of 6.95 GB is significantly smaller than 8.87 GB (without indexes) and about half of 13.74 GB (with indexes)
- Building is slower (25 minutes vs 15 minutes), but about the same once we count the 10 minutes of index creation for the previous attempt
- Searching is much faster: in the worst cases, about 3 seconds vs 3.7 minutes (without indexes) or 156 seconds (with indexes) - roughly 50 times faster


3. Quick conclusions for this attempt
- It was one of the hardest attempts, with much trial and error, and it took a lot of time to find reasonable solutions
- In the end it works, with a significantly smaller database and much faster searching
- There are some remaining issues:
+ memory pressure when building. The data looks close to the limit of our resources, so we may have problems with larger databases, say 7 million games: the app may crash or slow down significantly. We may need to write some special classes ourselves, instead of using the standard library collections (std::vector, std::map…), to manage everything, reduce the data size, and swap to disk if needed
+ if a hash key has multiple games, their game IDs are stored in the blob as an array of integers. In that case it is not easy to query everything at once with normal SQL browsers/tools; our own code/app has to register a callback function


4. Source code
It has been pushed to a new branch, “searchwithhashkeysblob”.
https://banksiagui.com
The most-featured chess GUI, based on the open-source Banksia - the chess tournament manager
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Open Chess Game Database Standard

Post by Sopel »

How do you now query, say, first 1000 game ids for some early position? This is going backwards, using blobs here is a TERRIBLE idea, these game ids become worse than useless. Getting ALL the games doesn't SCALE. What is needed is support for pagination and cursors. By restricting this to ALL the games you force the users to read potentially HUNDREDS OF GBs for some queries.

> We may have problems when working with larger databases, say, 7 million games.
7 million is nothing. I would expect a generic SQL database to handle lichess with ease.
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

Sopel wrote: Fri Dec 17, 2021 1:27 am How do you now query, say, first 1000 game ids for some early position? This is going backwards, using blobs here is a TERRIBLE idea, these game ids become worse than useless. Getting ALL the games doesn't SCALE. What is needed is support for pagination and cursors. By restricting this to ALL the games you force the users to read potentially HUNDREDS OF GBs for some queries.
Not sure why you thought it is terrible. Half-size, 50-times-faster querying is a terrible result???
Sopel wrote: Fri Dec 17, 2021 1:27 am > We may have problems when working with larger databases, say, 7 million games.
7 million is nothing. I would expect a generic SQL database to handle lichess with ease.
IMO, that is just a matter of workload.

This attempt may not work with 7 million games. We will test to confirm and look for solutions in further attempts.

Please note that this is a work in progress. So far these are all just attempts/tries, not the final design or decision. If we don't try, we don't gain information and don't learn their advantages and drawbacks. If we didn't want to try anything new, we could have stopped at one of the previous attempts, or never started this project at all.
https://banksiagui.com
The most-featured chess GUI, based on the open-source Banksia - the chess tournament manager
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Open Chess Game Database Standard

Post by Sopel »

phhnguyen wrote: Fri Dec 17, 2021 2:00 am
Sopel wrote: Fri Dec 17, 2021 1:27 am How do you now query, say, first 1000 game ids for some early position? This is going backwards, using blobs here is a TERRIBLE idea, these game ids become worse than useless. Getting ALL the games doesn't SCALE. What is needed is support for pagination and cursors. By restricting this to ALL the games you force the users to read potentially HUNDREDS OF GBs for some queries.
Not sure why you thought it is terrible. Half-size, 50-times-faster querying is a terrible result???
Does it matter if it made it useless? Who in their right mind even needs to query ALL the games for a position AT ONCE? This is trying to optimize something that no one will ever use, while completely removing the possibility of partial queries.
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

Sopel wrote: Fri Dec 17, 2021 2:18 am
phhnguyen wrote: Fri Dec 17, 2021 2:00 am
Sopel wrote: Fri Dec 17, 2021 1:27 am How do you now query, say, first 1000 game ids for some early position? This is going backwards, using blobs here is a TERRIBLE idea, these game ids become worse than useless. Getting ALL the games doesn't SCALE. What is needed is support for pagination and cursors. By restricting this to ALL the games you force the users to read potentially HUNDREDS OF GBs for some queries.
Not sure why you thought it is terrible. Half-size, 50-times-faster querying is a terrible result???
Does it matter if it made it useless? Who in their right mind even needs to query ALL the games for a position AT ONCE? This is trying to optimize something that no one will ever use, while completely removing the possibility of partial queries.

I have mentioned all of that in the 5th attempt. Please read the explanations.

If you don’t need to get all results of a given hash key (hm, I doubt about that, sometimes the number of results is just small), you may get all game IDs by one read then query yourself other information one by one. So easy and so quick.

The more complicated solution (getting everything with one query) is vital when using libraries such as Qt. Many developers, including me, use Qt for their chess GUIs and tools. With Qt it is easy and fast to display records from a database: users can sort by column, scroll up and down, edit data, click a game to view it and update the board… The developer has to do only a few things, one of which is to give Qt a single query, since that Qt feature cannot work with multiple queries. My solution is for that purpose.
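For illustration, displaying everything with one query in Qt can look roughly like this (a sketch; it assumes the custom inResultSet() function has already been registered on the underlying SQLite connection, e.g. via the handle obtained from QSqlDriver::handle()):

Code: Select all

#include <QApplication>
#include <QSqlDatabase>
#include <QSqlQueryModel>
#include <QTableView>

int main(int argc, char *argv[])
{
    QApplication app(argc, argv);

    QSqlDatabase db = QSqlDatabase::addDatabase("QSQLITE");
    db.setDatabaseName("games.db");  // assumed file name
    db.open();

    // One query is enough: Qt takes care of displaying and scrolling.
    QSqlQueryModel model;
    model.setQuery("SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, "
                   "BlackElo, Result, ECO, PlyCount, Moves "
                   "FROM Games g "
                   "INNER JOIN Players w ON WhiteID = w.ID "
                   "INNER JOIN Players b ON BlackID = b.ID "
                   "WHERE inResultSet(g.ID)", db);

    QTableView view;
    view.setModel(&model);
    view.show();
    return app.exec();
}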

Fig. Displaying database records using Qt’s library. It works with ONE query only

Of course, if your program doesn’t use such libraries, if you want to display everything yourself, or if you prefer querying game by game… you don’t need that extra complicated solution (querying everything with one query). You still benefit from a half-size database (no index needed) and very fast queries.
https://banksiagui.com
The most-featured chess GUI, based on the open-source Banksia - the chess tournament manager
Fulvio
Posts: 396
Joined: Fri Aug 12, 2016 8:43 pm

Re: Open Chess Game Database Standard

Post by Fulvio »

To have something more generic, maybe you could try to keep the game data separate from the tables used for position searches.
If I read the code correctly, at the moment you don't even store all the game tags?
The basic tables in my opinion are:

Code: Select all

PLAYERS
------
PlayerID
Surname
Name
Birth_date
FIDE_ID
The FIDE ID and date of birth are important because player names are often incorrect or incomplete ("Magnus, C").
You need them to be able to find all the games of a player (in SCID this data is stored in the "spelling" file).

Code: Select all

GAMES
-------
GameID
Date
White_playerID
Black_playerID
MoveText
What can be debated here is how to store the moves.
Maybe just use PGN notation, perhaps with some enhancements, such as UCI notation instead of SAN.

Code: Select all

GAME_TAGS
----------
Tag_name
Tag_value
GameID
I believe that is all for a standard database, although I may have forgotten something.
Tables for position search can then be added separately.
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Open Chess Game Database Standard

Post by Sopel »

phhnguyen wrote: Fri Dec 17, 2021 6:14 am If you don’t need to get all results of a given hash key (hm, I doubt about that, sometimes the number of results is just small), you may get all game IDs by one read then query yourself other information one by one. So easy and so quick.
You really don't see a problem? This blob can get to hundreds of gigabytes for large databases.

This is pagination 101, come on... No one is talking about "one by one".
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Open Chess Game Database Standard

Post by dangi12012 »

Sopel wrote: Fri Dec 17, 2021 12:18 pm
phhnguyen wrote: Fri Dec 17, 2021 6:14 am If you don’t need to get all results of a given hash key (hm, I doubt about that, sometimes the number of results is just small), you may get all game IDs by one read then query yourself other information one by one. So easy and so quick.
You really don't see a problem? This blob can get to hundreds of gigabytes for large databases.

This is pagination 101, come on... No one is talking about "one by one".
phhnguyen: Sopel likes to come here to tell everyone either:
1) It's impossible
2) It's useless
3) Off-topic trolling

His comments can be ignored 100% of the time. They are worse than useless - they don't add anything, but they try to derail the discussion. So his messages are worse than if he had written nothing at all.

Your method is 50 times faster at querying. That speaks for itself. Now the troll Sopel can't say SQL cannot work (like he did before) or SQL is too slow (like he did before).
Now he says that the whole topic is useless. This guy needs a permaban ASAP.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
phhnguyen
Posts: 1524
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Open Chess Game Database Standard

Post by phhnguyen »

Sopel wrote: Fri Dec 17, 2021 12:18 pm
phhnguyen wrote: Fri Dec 17, 2021 6:14 am If you don’t need to get all results of a given hash key (hm, I doubt about that, sometimes the number of results is just small), you may get all game IDs by one read then query yourself other information one by one. So easy and so quick.
You really don't see a problem? This blob can get to hundreds of gigabytes for large databases.

This is pagination 101, come on... No one is talking about "one by one".
Where does that number (hundreds of gigabytes) come from? I know it is just an estimate, but it should have something to support it!

First, we are not building the database for an unlimited number of games. We don't pretend to create something that will fit the next century or extreme purposes. All numbers should be in reasonable ranges; our target is just normal/typical usage. The number of games should be around 5-7 million - a very common size for many users recently. The database should still work with larger numbers, say 10-20 million games, but we don't guarantee anything beyond that; at least, it is not our target at the moment.

My test set is a database of 3.45 million games. Its size is about 10 GB, varying a bit between attempts. With the above limit of 20 million games, the size may reach 40-50 GB. Even counting the whole database, that is still far from your number!!!

Back to blob sizes. We don't store game IDs for the starting position (we know all games contain it), so the largest number of game IDs for any position must be much smaller than the total number of games. For example, with 5 million games the most-shared position might have 3 million games. We store those 3 million game IDs as an array of integers in one blob. An integer is 4 bytes, so that blob is only 12 MB - about the size of an image, the kind of data blobs typically hold in normal applications. Again, that is far from gigabytes and almost nothing to process. Even if we increase the number 10 times, the situation is almost the same. Only a few positions/blobs are that large; the rest are significantly smaller. Over 96% of positions have only one game.

Here is the actual number: in my test database of 3.45 million games, the largest blob has 1.6 million IDs and takes only 6.4 MB.

If you are still confused about what we described, just look at the end result: the database is only half the size and searching is 50 times faster. Each test runs two queries, loads blobs and scans all results, yet takes under 3 seconds on average - very fast for a database of over 3 million games.

Why worry? What more do you expect? ;)
https://banksiagui.com
The most-featured chess GUI, based on the open-source Banksia - the chess tournament manager
Sopel
Posts: 391
Joined: Tue Oct 08, 2019 11:39 pm
Full name: Tomasz Sobczyk

Re: Open Chess Game Database Standard

Post by Sopel »

phhnguyen wrote: Sat Dec 18, 2021 12:29 am
Sopel wrote: Fri Dec 17, 2021 12:18 pm
phhnguyen wrote: Fri Dec 17, 2021 6:14 am If you don’t need to get all results of a given hash key (hm, I doubt about that, sometimes the number of results is just small), you may get all game IDs by one read then query yourself other information one by one. So easy and so quick.
You really don't see a problem? This blob can get to hundreds of gigabytes for large databases.

This is pagination 101, come on... No one is talking about "one by one".
How come you have that number (hundreds of gigabytes)? I knew you have just made an estimate but it should have something to support it!
A single position adds 8 bytes to the blob. Lichess has ~2B games. That's 8 GB for the startpos blob, ~1-3 GB for the next positions. Though this is less than my initial (wrong) estimate, it is BAD. This is not how you do databases. It hurts me physically. I'll repeat again that one important word.

PAGINATION

and you're making simple queries actually HARD, by requiring querying more than needed, and requiring additional processing from the application and further queries


btw. this is also an important read https://en.wikipedia.org/wiki/Database_normalization
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.

Maybe you copied your stockfish commits from someone else too?
I will look into that.