Speed up arbitrary position search in database

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dkl
Posts: 28
Joined: Wed Jan 14, 2015 5:55 pm

Speed up arbitrary position search in database

Post by dkl »

Suppose we have a chess database, like SCID or Chessbase, i.e. where games are stored as a sequence of moves in a binary file on disk.

Suppose further, a user wants to search for a specific chess position (maybe parametrized, but let's stick to the simple case for now).

The naive search strategy would then be to linearly search the binary file, i.e. load game #1, execute the game by applying move by move, and comparing each position to the search pattern, load game #2, and continue until all games have been visited.

Obviously only very few games will match. So to speed up search, it makes sense to store additional information, to rule out non-matching games as quickly as possible. I am aware that Chessbase uses a "search booster" file, but no information about the stored information is available.

SCID apprently stores two kind of information: 1.) The sequence on how pawns left their home fields (apparently to speed up opening position search, i.e. if 1e4 e5 is indexed information stored for a game and we search for a position where black's e-pawn is still at e7, then we can immeditately rule out this game. 2.) The material of the final position. Suppose we search for a position where there is no black bishop, but in the final position of a game there is (still) a black bishop. Skipping the case for previous pawn promotions for a moment, we can again rule out this game immediately.

So to come to my question: Is this optimal? What other information should be stored to speed up position search? Are there any investigations/material/statistics/practical experience on what works, and what doesn't work? Also considering from a perspective of tradeoff of required storage/implementational complexity vs gained speed.

Any input is greatly appreciated!
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Speed up arbitrary position search in database

Post by jdart »

The solution that immediately comes to mind is: hash code the position (Zobrist hash) and index the hash column in the DB.

--Jon
CheckersGuy
Posts: 273
Joined: Wed Aug 24, 2016 9:49 pm

Re: Speed up arbitrary position search in database

Post by CheckersGuy »

he solution that immediately comes to mind is: hash code the position (Zobrist hash) and index the hash column in the DB.

--Jon
Would be my first choice too. I might be wrong but the reason this may not be used is that hashtables arent well suited for storing something on a harddrive/disk.
Suppose we have a chess database, like SCID or Chessbase, i.e. where games are stored as a sequence of moves in a binary file on disk.

Suppose further, a user wants to search for a specific chess position (maybe parametrized, but let's stick to the simple case for now).

The naive search strategy would then be to linearly search the binary file, i.e. load game #1, execute the game by applying move by move, and comparing each position to the search pattern, load game #2, and continue until all games have been visited.

Obviously only very few games will match. So to speed up search, it makes sense to store additional information, to rule out non-matching games as quickly as possible. I am aware that Chessbase uses a "search booster" file, but no information about the stored information is available.

SCID apprently stores two kind of information: 1.) The sequence on how pawns left their home fields (apparently to speed up opening position search, i.e. if 1e4 e5 is indexed information stored for a game and we search for a position where black's e-pawn is still at e7, then we can immeditately rule out this game. 2.) The material of the final position. Suppose we search for a position where there is no black bishop, but in the final position of a game there is (still) a black bishop. Skipping the case for previous pawn promotions for a moment, we can again rule out this game immediately.

So to come to my question: Is this optimal? What other information should be stored to speed up position search? Are there any investigations/material/statistics/practical experience on what works, and what doesn't work? Also considering from a perspective of tradeoff of required storage/implementational complexity vs gained speed.

Any input is greatly appreciated!

I had a similiar problem for a different dataset and used a B-Tree to store the information https://en.wikipedia.org/wiki/B-tree. The set needs to be ordered (position A < position B). Some information, like the things u mentioned, could be used for that I guess[/quote]
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speed up arbitrary position search in database

Post by syzygy »

dkl wrote:Suppose further, a user wants to search for a specific chess position (maybe parametrized, but let's stick to the simple case for now).
The simple case might be too simple. Being able to look up a specific position quickly is of course useful, but that ability is not going to be of any help once you want to support slightly more complicated queries.
User avatar
phhnguyen
Posts: 1434
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Speed up arbitrary position search in database

Post by phhnguyen »

jdart wrote:The solution that immediately comes to mind is: hash code the position (Zobrist hash) and index the hash column in the DB.

--Jon
Hash code is perfect for searching "exact" positions. However, from my experience, finding exact positions is not useful and there are so few games have exact positions to other games, specially from middle games. Users prefer to search similar positions which few pieces may be omitted and/or in diffident cells. Other problem is that hash data may take too much space to store themselves and some pointers to point back games.

For me, I have used material counters / signatures. They don't take too much space and can help to limit number of involving games and positions.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Speed up arbitrary position search in database

Post by Dann Corbit »

Use an actual SQL database.
Store the games as a sequence of EPD positions that are uniquely indexed.
FastDB and Gigabase are good at caching.
Or MonetDB is another high speed option.

Even standard SQL stores like SQL Server, PostgreSQL would work well.

Eg:
Game has game header and then a child table like this:
unsigned short ply;
unsigned long long PositionID;
<whatever else you want to store for the position>
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Speed up arbitrary position search in database

Post by AlvaroBegue »

The first thing that comes to mind is to use a Bloom filter to encode the positions a game has gone through, and quickly discard games where the position didn't appear.

The idea is to compute m bits (say m=1024) per game. Start with all bits set to 0, compute k different hash functions (say k=7) of each position and mark the bits that correspond to the values of the hash functions modulo m for all the positions the game goes through. When you are looking for a specific position, you just need to compute the k hash functions modulo m and go through the games, checking if the corresponding k bits are set.

This procedure will not miss any games where the position did appear, and it will give a rate of false positives which is

p = (1 - (1 - 1/m)^(k * n))^k

for a game that goes through n positions. If n=100, m=1024 and k=7, then p=.00732. So of the games that don't contain the position, you only have to check move by move one in 137 or so (I don't know if n=100 is typical or not; it probably depends on the source of the games in your database).