Speed up arbitrary position search in database

dkl · Post by **dkl** » Sun Mar 18, 2018 12:25 am

Suppose we have a chess database, like SCID or Chessbase, i.e. where games are stored as a sequence of moves in a binary file on disk.

Suppose further, a user wants to search for a specific chess position (maybe parametrized, but let's stick to the simple case for now).

The naive search strategy would then be to linearly search the binary file, i.e. load game #1, execute the game by applying move by move, and comparing each position to the search pattern, load game #2, and continue until all games have been visited.

Obviously only very few games will match. So to speed up search, it makes sense to store additional information, to rule out non-matching games as quickly as possible. I am aware that Chessbase uses a "search booster" file, but no information about the stored information is available.

SCID apprently stores two kind of information: 1.) The sequence on how pawns left their home fields (apparently to speed up opening position search, i.e. if 1e4 e5 is indexed information stored for a game and we search for a position where black's e-pawn is still at e7, then we can immeditately rule out this game. 2.) The material of the final position. Suppose we search for a position where there is no black bishop, but in the final position of a game there is (still) a black bishop. Skipping the case for previous pawn promotions for a moment, we can again rule out this game immediately.

So to come to my question: Is this optimal? What other information should be stored to speed up position search? Are there any investigations/material/statistics/practical experience on what works, and what doesn't work? Also considering from a perspective of tradeoff of required storage/implementational complexity vs gained speed.

Any input is greatly appreciated!

jdart · Post by **jdart** » Sun Mar 18, 2018 1:49 am

The solution that immediately comes to mind is: hash code the position (Zobrist hash) and index the hash column in the DB.

--Jon

CheckersGuy · Post by **CheckersGuy** » Sun Mar 18, 2018 3:09 am

he solution that immediately comes to mind is: hash code the position (Zobrist hash) and index the hash column in the DB.

--Jon

Would be my first choice too. I might be wrong but the reason this may not be used is that hashtables arent well suited for storing something on a harddrive/disk.

Suppose we have a chess database, like SCID or Chessbase, i.e. where games are stored as a sequence of moves in a binary file on disk.

Suppose further, a user wants to search for a specific chess position (maybe parametrized, but let's stick to the simple case for now).

The naive search strategy would then be to linearly search the binary file, i.e. load game #1, execute the game by applying move by move, and comparing each position to the search pattern, load game #2, and continue until all games have been visited.

Obviously only very few games will match. So to speed up search, it makes sense to store additional information, to rule out non-matching games as quickly as possible. I am aware that Chessbase uses a "search booster" file, but no information about the stored information is available.

SCID apprently stores two kind of information: 1.) The sequence on how pawns left their home fields (apparently to speed up opening position search, i.e. if 1e4 e5 is indexed information stored for a game and we search for a position where black's e-pawn is still at e7, then we can immeditately rule out this game. 2.) The material of the final position. Suppose we search for a position where there is no black bishop, but in the final position of a game there is (still) a black bishop. Skipping the case for previous pawn promotions for a moment, we can again rule out this game immediately.

So to come to my question: Is this optimal? What other information should be stored to speed up position search? Are there any investigations/material/statistics/practical experience on what works, and what doesn't work? Also considering from a perspective of tradeoff of required storage/implementational complexity vs gained speed.

Any input is greatly appreciated!

I had a similiar problem for a different dataset and used a B-Tree to store the information https://en.wikipedia.org/wiki/B-tree. The set needs to be ordered (position A < position B). Some information, like the things u mentioned, could be used for that I guess[/quote]

syzygy · Post by **syzygy** » Sun Mar 18, 2018 3:33 am

dkl wrote:Suppose further, a user wants to search for a specific chess position (maybe parametrized, but let's stick to the simple case for now).

The simple case might be too simple. Being able to look up a specific position quickly is of course useful, but that ability is not going to be of any help once you want to support slightly more complicated queries.

phhnguyen · Post by **phhnguyen** » Sun Mar 18, 2018 3:48 am

jdart wrote:The solution that immediately comes to mind is: hash code the position (Zobrist hash) and index the hash column in the DB.

--Jon

Hash code is perfect for searching "exact" positions. However, from my experience, finding exact positions is not useful and there are so few games have exact positions to other games, specially from middle games. Users prefer to search similar positions which few pieces may be omitted and/or in diffident cells. Other problem is that hash data may take too much space to store themselves and some pointers to point back games.

For me, I have used material counters / signatures. They don't take too much space and can help to limit number of involving games and positions.

Dann Corbit · Post by **Dann Corbit** » Sun Mar 18, 2018 7:48 am

Use an actual SQL database.
Store the games as a sequence of EPD positions that are uniquely indexed.
FastDB and Gigabase are good at caching.
Or MonetDB is another high speed option.

Even standard SQL stores like SQL Server, PostgreSQL would work well.

Eg:
Game has game header and then a child table like this:
unsigned short ply;
unsigned long long PositionID;
<whatever else you want to store for the position>

AlvaroBegue · Post by **AlvaroBegue** » Sun Mar 18, 2018 8:09 am

The first thing that comes to mind is to use a Bloom filter to encode the positions a game has gone through, and quickly discard games where the position didn't appear.

The idea is to compute m bits (say m=1024) per game. Start with all bits set to 0, compute k different hash functions (say k=7) of each position and mark the bits that correspond to the values of the hash functions modulo m for all the positions the game goes through. When you are looking for a specific position, you just need to compute the k hash functions modulo m and go through the games, checking if the corresponding k bits are set.

This procedure will not miss any games where the position did appear, and it will give a rate of false positives which is

p = (1 - (1 - 1/m)^(k * n))^k

for a game that goes through n positions. If n=100, m=1024 and k=7, then p=.00732. So of the games that don't contain the position, you only have to check move by move one in 137 or so (I don't know if n=100 is typical or not; it probably depends on the source of the games in your database).

Speed up arbitrary position search in database

Speed up arbitrary position search in database

Re: Speed up arbitrary position search in database

Re: Speed up arbitrary position search in database

Re: Speed up arbitrary position search in database

Re: Speed up arbitrary position search in database

Re: Speed up arbitrary position search in database

Re: Speed up arbitrary position search in database