In the 3rd and 4th attempts, we tried to use hash keys to match positions. Each hash key is a 64-bit integer representing a chess position created by making a move. Each hash key is stored in the table HashPositions as a record/row with some extra info such as the ply and game ID. There are over 277 million records for the database of 3.45 million games. It is huge and increases the size of the database from 2.2 GB to 8.02 GB (without indexes) or 13.74 GB (with indexes).
One of the problems is the time to pick up all results for a given hash key. Some hash keys may have over a million records, and the program may need to spend several minutes scanning to collect them all!
In this attempt, we tried to merge all records of a hash key into one record only. By doing that we save some space, since we avoid repeating info such as the hash key itself. The search could also be faster in some cases because it avoids scanning the whole database.
To merge all records of a hash key, we keep only their game IDs and merge those IDs into an array of integers so they can be stored in a single column. The most suitable SQL data type to store such an array is a blob.
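For illustration, such a blob can be just the raw bytes of the integer array. Below is a minimal C++ sketch (assuming 32-bit game IDs; the actual encoding in our code may differ):
Code: Select all
#include <cstdint>
#include <cstring>
#include <vector>

// Pack an array of game IDs into a contiguous byte buffer that can be
// bound as an SQLite blob (raw little-endian int32 values back to back).
std::vector<char> gameIDsToBlob(const std::vector<int32_t>& gameIDs) {
    std::vector<char> blob(gameIDs.size() * sizeof(int32_t));
    if (!blob.empty()) {
        std::memcpy(blob.data(), gameIDs.data(), blob.size());
    }
    return blob;
}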
1. Building the database
As we have mentioned, the database has over 277 million positions/hash keys. It is huge and may push everything to the edge of our resources and time. We had to make many tries:
1st try
For the first implementation, we tried to store everything in memory. Typically that is the simplest and fastest method, using standard data structures such as std::vector, std::map… We designed it so that each working thread has an individual set of data (to avoid conflicts and thread-locks), and their data is merged when writing down (illustrated in the sketch below).
Building result: the program always crashed when processing about 2/3 of all games. Discarded since it can’t complete. FAILED
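For illustration, the per-thread data could look like the following sketch (hypothetical names, assuming 64-bit hash keys and 32-bit game IDs):
Code: Select all
#include <cstdint>
#include <map>
#include <vector>

// One instance per working thread, so no thread-locks are needed while
// parsing games; the per-thread maps are merged when writing down.
struct ThreadData {
    std::map<uint64_t, std::vector<int32_t>> hashToGameIDs;

    void add(uint64_t hashKey, int32_t gameID) {
        hashToGameIDs[hashKey].push_back(gameID);
    }
};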
2nd try
The method is the opposite of the 1st one: all data is stored directly in the database. When writing down a pair (hash key, game ID), the app loads the old data of that hash key (using a SELECT statement), appends the new game ID to the end, then updates that record (using an UPDATE statement).
Result: the update speed dramatically slowed down after a short period to worse than crawling, under 100 records per second (the speed when we stopped it). At that speed we would need to wait for months or even years to complete. We believe it takes so much time because the SELECT and UPDATE statements have to scan all records (a huge number) of the database for each load and update. FAILED
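For illustration, the read-modify-write loop of this try could look like the sketch below (column and function names are ours; the INSERT needed for a brand-new hash key is omitted for brevity):
Code: Select all
#include <cstdint>
#include <vector>
#include <sqlite3.h>

// Each new (hashKey, gameID) pair costs one SELECT and one UPDATE.
// Without an index on HashKey, both statements scan the whole table.
void appendGameID(sqlite3* db, uint64_t hashKey, int32_t gameID) {
    sqlite3_stmt* sel = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT MultiGameIDs FROM HashPositions WHERE HashKey=?",
        -1, &sel, nullptr);
    sqlite3_bind_int64(sel, 1, (sqlite3_int64)hashKey);

    std::vector<int32_t> ids;
    if (sqlite3_step(sel) == SQLITE_ROW) {
        auto p = (const int32_t*)sqlite3_column_blob(sel, 0);
        int n = sqlite3_column_bytes(sel, 0) / (int)sizeof(int32_t);
        ids.assign(p, p + n);
    }
    sqlite3_finalize(sel);

    ids.push_back(gameID); // append the new game ID at the end

    sqlite3_stmt* upd = nullptr;
    sqlite3_prepare_v2(db,
        "UPDATE HashPositions SET MultiGameIDs=? WHERE HashKey=?",
        -1, &upd, nullptr);
    sqlite3_bind_blob(upd, 1, ids.data(),
                      (int)(ids.size() * sizeof(int32_t)), SQLITE_TRANSIENT);
    sqlite3_bind_int64(upd, 2, (sqlite3_int64)hashKey);
    sqlite3_step(upd);
    sqlite3_finalize(upd);
}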
3rd try
It is a mixed method between the 1st and 2nd ones: create data in memory and write it all down after a while. Old data is cleared after writing down to release memory. For the database of 3.45 million games, the data is written down only 3 times.
Building result: it didn't crash. However, each write-down still needed to load/update old records, which took too much time, as in the previous try. FAILED
4th try
In this try, we use only one (shared) data set for all working threads, just to reduce the total data size in memory. Those threads need some thread-locks to access the data. That prevents the crash of the 1st try: in practice, the program could hold all hash keys in memory after parsing all 3.45 million games.
To avoid being slowed down, we hold the data of all hash keys till the end, then write them down using simple INSERT statements, not SELECT and UPDATE ones (which need to scan the whole database, the cause of the terrible speed). For each hash key, all its game IDs are written down as a blob, as in the sketch below.
Building result: the write speed still slowed down dramatically after a period. FAILED
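The write path could look like the following sketch (hypothetical schema HashPositions (HashKey INTEGER, MultiGameIDs BLOB); wrapping all INSERTs in one transaction keeps them fast):
Code: Select all
#include <cstdint>
#include <map>
#include <vector>
#include <sqlite3.h>

// Write every hash key with a plain INSERT inside one transaction:
// no SELECT/UPDATE, hence no table scan per written pair.
void writeAll(sqlite3* db,
              const std::map<uint64_t, std::vector<int32_t>>& data) {
    sqlite3_exec(db, "BEGIN", nullptr, nullptr, nullptr);
    sqlite3_stmt* ins = nullptr;
    sqlite3_prepare_v2(db,
        "INSERT INTO HashPositions (HashKey, MultiGameIDs) VALUES (?,?)",
        -1, &ins, nullptr);
    for (const auto& [hashKey, ids] : data) {
        sqlite3_bind_int64(ins, 1, (sqlite3_int64)hashKey);
        sqlite3_bind_blob(ins, 2, ids.data(),
                          (int)(ids.size() * sizeof(int32_t)), SQLITE_STATIC);
        sqlite3_step(ins);
        sqlite3_reset(ins);
    }
    sqlite3_finalize(ins);
    sqlite3_exec(db, "COMMIT", nullptr, nullptr, nullptr);
}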
5th try
After studying for a while, we realized that inserting too many blobs caused that terrible speed. It looks like there is no solution except reducing the number of blobs.
From our chess experience, we knew that there may be many games sharing the first few moves/positions, but for deeper moves the positions become almost unique, one game per position. The real stats (shown below) confirmed that unique positions (only one game each) account for a very large share, over 96%.
We create two tables: the first one stores the pair hash key - game ID for a unique position; the second one stores the pair hash key - blob (the array of game IDs) for the rest of the positions.
The SQLite statements to create those tables are as below:
Code: Select all
CREATE TABLE HashPositionsBlob (HashKey INTEGER, MultiGameIDs BLOB)
CREATE TABLE HashPositionsOne (HashKey INTEGER, GameID INTEGER)
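A sketch of the write decision (hypothetical names, assuming two prepared INSERT statements for the two tables); the single-game branch covers over 96% of all positions and avoids most blobs:
Code: Select all
#include <cstdint>
#include <vector>
#include <sqlite3.h>

// insOne:  INSERT INTO HashPositionsOne VALUES (?,?)
// insBlob: INSERT INTO HashPositionsBlob VALUES (?,?)
void writePosition(sqlite3_stmt* insOne, sqlite3_stmt* insBlob,
                   uint64_t hashKey, const std::vector<int32_t>& ids) {
    if (ids.size() == 1) {
        // unique position: store a plain integer pair, no blob needed
        sqlite3_bind_int64(insOne, 1, (sqlite3_int64)hashKey);
        sqlite3_bind_int(insOne, 2, ids[0]);
        sqlite3_step(insOne);
        sqlite3_reset(insOne);
    } else {
        // shared position: store all game IDs as one blob
        sqlite3_bind_int64(insBlob, 1, (sqlite3_int64)hashKey);
        sqlite3_bind_blob(insBlob, 2, ids.data(),
                          (int)(ids.size() * sizeof(int32_t)), SQLITE_TRANSIENT);
        sqlite3_step(insBlob);
        sqlite3_reset(insBlob);
    }
}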
The stats of creating the database are as below:
Code: Select all
Completed writing hash keys, total: 202320464, one: 194519041, blob: 7801423, elapsed: 401915ms 06:41, total elapsed: 401915ms 06:41, speed: 503391 hash keys/s
#games: 3456399, #errors: 0, #hashKeys: 0, #posCnt: 552148891, elapsed: 1393777ms 23:13, speed: 2479 games/s
Building result: SUCCEEDED
6th try
Having two tables could create some trouble when querying since we may have to query twice and handle two different data sets. To avoid that, we create only one table with two data columns, one for the GameID and one for the blob (MultiGameIDs). For any record, only one of those columns is used to store data; the other contains a NULL value.
Code: Select all
CREATE TABLE HashPositions (HashKey INTEGER, GameID INTEGER DEFAULT NULL, MultiGameIDs BLOB DEFAULT NULL)
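The writer then binds exactly one of the last two columns per row, as in this sketch (hypothetical; ins would be prepared from "INSERT INTO HashPositions VALUES (?,?,?)"):
Code: Select all
#include <cstdint>
#include <vector>
#include <sqlite3.h>

// Exactly one of the last two columns gets a value; the other is NULL.
void writeRow(sqlite3_stmt* ins, uint64_t hashKey,
              const std::vector<int32_t>& ids) {
    sqlite3_bind_int64(ins, 1, (sqlite3_int64)hashKey);
    if (ids.size() == 1) {
        sqlite3_bind_int(ins, 2, ids[0]);
        sqlite3_bind_null(ins, 3);
    } else {
        sqlite3_bind_null(ins, 2);
        sqlite3_bind_blob(ins, 3, ids.data(),
                          (int)(ids.size() * sizeof(int32_t)), SQLITE_TRANSIENT);
    }
    sqlite3_step(ins);
    sqlite3_reset(ins);
}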
The stats of creating the database are as below:
Code: Select all
Completed writing hash keys, total: 202320464, one: 194519041, blob: 7801423, elapsed: 562816ms 09:22, speed: 359478 hash keys/s
#games: 3456399, #errors: 0, #hashKeys: 202320464, #posCnt: 552148891, elapsed: 1510472ms 25:10, speed: 2288 games/s
Because of storing an extra NULL for every record, the database size is a little bigger: 6.95 GB. Building took about 25 minutes.
Building result: SUCCEEDED
2. Search
For a hash key created from the given position, we need only one read to get all game IDs:
Code: Select all
SELECT * FROM HashPositions WHERE HashKey=?
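On our side, we just check which column is non-NULL for each returned row and collect the game IDs, roughly as in this sketch (assuming the 32-bit blob encoding above):
Code: Select all
#include <cstdint>
#include <vector>
#include <sqlite3.h>

// Collect all game IDs for one hash key from the single-table schema.
std::vector<int32_t> gameIDsForHashKey(sqlite3* db, uint64_t hashKey) {
    sqlite3_stmt* stmt = nullptr;
    sqlite3_prepare_v2(db,
        "SELECT GameID, MultiGameIDs FROM HashPositions WHERE HashKey=?",
        -1, &stmt, nullptr);
    sqlite3_bind_int64(stmt, 1, (sqlite3_int64)hashKey);

    std::vector<int32_t> ids;
    while (sqlite3_step(stmt) == SQLITE_ROW) {
        if (sqlite3_column_type(stmt, 0) != SQLITE_NULL) {
            // unique position: the game ID is a plain integer column
            ids.push_back((int32_t)sqlite3_column_int(stmt, 0));
        } else {
            // shared position: decode the blob into an array of IDs
            auto p = (const int32_t*)sqlite3_column_blob(stmt, 1);
            int n = sqlite3_column_bytes(stmt, 1) / (int)sizeof(int32_t);
            ids.insert(ids.end(), p, p + n);
        }
    }
    sqlite3_finalize(stmt);
    return ids;
}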
Sometimes we may need some game data such as player names, results, moves… We can query game by game; that is simple too:
Code: Select all
SELECT * FROM Games WHERE GameID=?
However, querying one by one is inconvenient and slow when a position matches thousands of games. Luckily, we found a solution: application-defined (callback) functions, which are supported by SQLite. We create a callback function to check whether a game ID is in the result set we got from querying the table HashPositions as above:
Code: Select all
SELECT * FROM Games WHERE inResultSet(GameID)
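Such a function can be registered with sqlite3_create_function, roughly as below (names and the set type are ours; the code in the repo may differ):
Code: Select all
#include <cstdint>
#include <unordered_set>
#include <sqlite3.h>

// The game IDs collected from HashPositions for the current query.
static std::unordered_set<int64_t> resultSet;

// SQL function inResultSet(GameID): returns 1 if the ID is in the set.
static void inResultSetFunc(sqlite3_context* ctx, int, sqlite3_value** argv) {
    int64_t gameID = sqlite3_value_int64(argv[0]);
    sqlite3_result_int(ctx, resultSet.count(gameID) ? 1 : 0);
}

void registerInResultSet(sqlite3* db) {
    sqlite3_create_function(db, "inResultSet", 1, SQLITE_UTF8,
                            nullptr, inResultSetFunc, nullptr, nullptr);
}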
The full query, which joins the Players table to get player names, is as below:
Code: Select all
SELECT g.ID, g.Round, Date, w.Name White, WhiteElo, b.Name Black, BlackElo, Result, Timer, ECO, PlyCount, FEN, Moves
FROM Games g
INNER JOIN Players w ON WhiteID = w.ID
INNER JOIN Players b ON BlackID = b.ID
WHERE inResultSet(g.ID)
The stats of searching are as below:
Code: Select all
ply: 1, #total queries: 1000, elapsed: 2762002 ms, 46:02, time per query: 2762 ms, total results: 1138013701, results/query: 1138013
ply: 2, #total queries: 1000, elapsed: 1919293 ms, 31:59, time per query: 1919 ms, total results: 359609242, results/query: 359609
ply: 3, #total queries: 1000, elapsed: 1584611 ms, 26:24, time per query: 1584 ms, total results: 199681101, results/query: 199681
ply: 4, #total queries: 1000, elapsed: 1256597 ms, 20:56, time per query: 1256 ms, total results: 37937493, results/query: 37937
ply: 5, #total queries: 1000, elapsed: 1116575 ms, 18:36, time per query: 1116 ms, total results: 5548104, results/query: 5548
ply: 6, #total queries: 1000, elapsed: 1109502 ms, 18:29, time per query: 1109 ms, total results: 6513526, results/query: 6513
ply: 7, #total queries: 1000, elapsed: 1084842 ms, 18:04, time per query: 1084 ms, total results: 2447268, results/query: 2447
ply: 8, #total queries: 1000, elapsed: 1062462 ms, 17:42, time per query: 1062 ms, total results: 5304993, results/query: 5304
ply: 9, #total queries: 1000, elapsed: 2849016 ms, 47:29, time per query: 2849 ms, total results: 2521642, results/query: 2521
ply: 10, #total queries: 1000, elapsed: 1749207 ms, 29:09, time per query: 1749 ms, total results: 1994074, results/query: 1994
ply: 11, #total queries: 1000, elapsed: 1590678 ms, 26:30, time per query: 1590 ms, total results: 429535, results/query: 429
ply: 12, #total queries: 1000, elapsed: 1615125 ms, 26:55, time per query: 1615 ms, total results: 298440, results/query: 298
ply: 13, #total queries: 1000, elapsed: 2203547 ms, 36:43, time per query: 2203 ms, total results: 235978, results/query: 235
ply: 14, #total queries: 1000, elapsed: 1710170 ms, 28:30, time per query: 1710 ms, total results: 153136, results/query: 153
ply: 15, #total queries: 1000, elapsed: 2267163 ms, 37:47, time per query: 2267 ms, total results: 10639, results/query: 10
ply: 16, #total queries: 1000, elapsed: 2313609 ms, 38:33, time per query: 2313 ms, total results: 34300, results/query: 34
ply: 17, #total queries: 1000, elapsed: 3275912 ms, 54:35, time per query: 3275 ms, total results: 11582, results/query: 11
ply: 18, #total queries: 1000, elapsed: 2913447 ms, 48:33, time per query: 2913 ms, total results: 9781, results/query: 9
Compared with the results from the 3rd Attempt:
- The database size of 6.95 GB is significantly smaller than 8.87 GB (without indexes) and about half of 13.74 GB (with indexes)
- The building speed is slower (25 minutes vs 15 minutes) but about the same if we count the 10 minutes of creating indexes for the previous one
- The search speed is much faster: in the worst cases, about 3 seconds vs 3.7 minutes (no indexes) or 156 seconds (with indexes), roughly 50 times faster
3. Quick conclusions for this attempt
- It was one of the hardest attempts, with much trial and error, and it took a lot of time to find reasonable solutions
- It works in the end, with a significantly smaller database and much faster searching
- There are some issues:
+ Memory pressure when building: it looks like the data has reached the boundary of our resources. We may have problems when working with larger databases, say, 7 million games; that may cause the app to crash or slow down significantly. Thus we may need to create some special classes, doing almost everything ourselves instead of using the standard library collections (std::vector, std::map…), to manage everything, reduce the data size, and swap to hard disks if needed
+ If a hash key has multiple games, those game IDs are stored in the blob as an array of integers. In that case, it is not easy to use normal SQL browsers/tools to query everything at once; we may need our own code/app to create a callback function.
4. Source code
It has been pushed already to a new branch, “searchwithhashkeysblob”.