Compression of chess databases

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Compression of chess databases

Post by Edmund »

Say I have a collection of games I want to store in a database using as little memory as possible.

Because the move choice in the opening is very restricted, I would like to save the first couple of moves by linking into a table with n entries.

Is there any algorithm that would tell me how to optimally populate the table with opening moves?

It sort of reminds me of Huffman encoding, but it is slightly different.
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: Compression of chess databases

Post by rvida »

Edmund wrote: Because the move choice in the opening is very restricted, I would like to save the first couple of moves by linking into a table with n entries.
For move encoding you don't need any table. Just write a legal move generator that generates moves in a reproducible order. Then you can store just the move's number in that list (effectively one move per byte). Since most positions don't have that many legal moves, you can get away with just 6 bits per move: if the move number is < 63, store it in 6 bits; if it is 63 or greater, store 63 and put the real move number in the following byte.
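
A minimal sketch of this encoding (the BitWriter type and all names are illustrative; a legal move generator with a reproducible move order is assumed):

Code:

#include <cstdint>
#include <vector>

// Accumulates bits MSB-first into a byte vector (illustrative helper).
struct BitWriter {
    std::vector<uint8_t> bytes;
    int free_bits = 0;                       // unused bits in the last byte

    void put(uint32_t value, int width) {    // append the low 'width' bits
        for (int i = width - 1; i >= 0; --i) {
            if (free_bits == 0) { bytes.push_back(0); free_bits = 8; }
            bytes.back() |= ((value >> i) & 1u) << --free_bits;
        }
    }
};

// Store the move's index into the reproducibly ordered legal-move list:
// 6 bits normally, with 63 as an escape for the rare longer move lists.
void encode_move_index(BitWriter& bw, int index) {
    if (index < 63) {
        bw.put(index, 6);
    } else {
        bw.put(63, 6);       // escape marker
        bw.put(index, 8);    // real index; no position has 256+ legal moves
    }
}
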
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: Compression of chess databases

Post by rvida »

rvida wrote: For move encoding you don't need any table ... if the move number is < 63, store it in 6 bits; if it is 63 or greater, store 63 and put the real move number in the following byte.
You can improve upon this with standard move ordering tricks (captures with good SEE first; quiet moves with good SEE second, sorted by PSQ gain; losing moves at the end). Most of the moves will end up in the low range (0..15), which enables using even fewer bits per move.

The resulting data should still be compressible with standard compression algorithms like LZMA, bzip2, etc.
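
For illustration, the ordering key might look like the sketch below; see() and psq_gain() are hypothetical stand-ins for a real engine's static exchange evaluator and piece-square tables, stubbed out here so the snippet compiles:

Code:

#include <algorithm>
#include <cstdint>
#include <vector>

using Move = uint16_t;               // packed from/to squares; illustrative
struct Position { /* board state; details omitted */ };

// Stubs: a real engine supplies these.
int see(const Position&, Move)      { return 0; }  // static exchange evaluation
int psq_gain(const Position&, Move) { return 0; }  // piece-square-table delta

// Sort so that moves likely to be played get the smallest indices:
// winning captures first, quiet moves by PSQ gain, losing moves last.
void order_for_encoding(const Position& pos, std::vector<Move>& moves) {
    auto key = [&](Move m) {
        int s = see(pos, m);
        if (s > 0)  return -1000000 - s;       // good captures up front
        if (s == 0) return -psq_gain(pos, m);  // quiets, best PSQ gain first
        return 1000000 - s;                    // losing moves at the end
    };
    std::stable_sort(moves.begin(), moves.end(),
                     [&](Move a, Move b) { return key(a) < key(b); });
}
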
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Compression of chess databases

Post by Edmund »

Thanks for your fast response. I am aware of the move compression you are talking about.

However, my question was focused particularly on the opening.

For example, in my database 1/30 of all games begin with the Sicilian:
1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3

Using the naive approach you suggested (which is probably the best option for the middlegame), those nine half-moves would take about 9 * 0.5 = 4.5 bytes. Just using a single bit for Sicilian/not Sicilian would already save on file size.

My question was how you could optimize this algorithm ...
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: Compression of chess databases

Post by rvida »

Edmund wrote: For example, in my database 1/30 of all games begin with the Sicilian ... Just using a single bit for Sicilian/not Sicilian would already save on file size. My question was how you could optimize this algorithm ...
Simply sort the games. Store in the first byte how many half-moves (if any) should be repeated from the previous game. Unfortunately this approach does not exploit transpositions :(
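
This is essentially front coding. A minimal sketch, assuming each game has already been flattened to a byte string of encoded moves (one byte per half-move, and games shorter than 256 half-moves, for simplicity); the extra tail-length byte is my addition so games can be delimited:

Code:

#include <algorithm>
#include <cstdint>
#include <vector>

using Game = std::vector<uint8_t>;   // one encoded half-move per byte

// Front coding: sort the games, then store for each game how many leading
// half-moves it shares with the previous one, followed by the differing tail.
std::vector<uint8_t> front_encode(std::vector<Game> games) {
    std::sort(games.begin(), games.end());
    std::vector<uint8_t> out;
    Game prev;
    for (const Game& g : games) {
        size_t common = 0;
        size_t limit = std::min({prev.size(), g.size(), size_t(255)});
        while (common < limit && prev[common] == g[common]) ++common;
        out.push_back(static_cast<uint8_t>(common));            // shared prefix
        out.push_back(static_cast<uint8_t>(g.size() - common)); // tail length
        out.insert(out.end(), g.begin() + common, g.end());     // tail bytes
        prev = g;
    }
    return out;
}
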
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Compression of chess databases

Post by Edmund »

rvida wrote: Simply sort the games. Store in the first byte how many half-moves (if any) should be repeated from the previous game. Unfortunately this approach does not exploit transpositions :(
Good idea. Transpositions are not wanted anyway when describing chess games, since the actual move order has to be preserved.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

rvida wrote: You can improve upon this with standard move ordering tricks ... The resulting data should still be compressible with standard compression algorithms like LZMA, bzip2, etc.
I don't think you can save any bits with heuristics, since there is always an off chance that one weird move takes more bits than the rest. You can only improve upon the compression that will be achieved later. When you use a compressor built from Lempel-Ziv + Huffman, a very short code could suffice to represent the whole Sicilian line. For example, if most of the database starts with the Sicilian, that is more than likely what happens, i.e. the whole Sicilian line taking only a few bits. When compressing my bitbases, whether the uncompressed representation takes 4 bits/pos or 8 bits/pos, the compressed file usually turns out to be the same size. So it may be a good idea to first store the database as it is and then use a block compressor such as ZIP.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

Another option is to use a compressed representation such as binary decision diagrams (BDDs). These have already been used for bitbases; there you have three terminal states (W/D/L). The squares (or moves, in this case) are encoded by testing one bit per level, so one square could take 8 levels, I think. http://en.wikipedia.org/wiki/Binary_decision_diagram
This method is good if you require fast reads, since no decompression is involved. BTW, the method that encodes moves by their location in the move list is going to be slow during decompression, since you have to generate every move to find out which move number 12 is, for instance. Oops, but that is exactly what I did in my GPU chess due to lack of memory ...
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: Compression of chess databases

Post by rvida »

Daniel Shawul wrote: I don't think you can save any bits with heuristics, since there is always an off chance that one weird move takes more bits than the rest.
I had a variable bit length encoding in mind.

Read one bit. If it is 0, then move_nr is encoded in the next 3 bits. If it is 1, read one more bit: if that is 0, then move_nr is in the next 4 bits; if it is 1, read one more bit, and so on...

With this scheme most moves should use only 4 bits per move. Moves which map to more bits should be rare and probably blunders anyway.

The worst case fits in 14 bits (6-bit escape prefix + 8-bit move number).
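
A sketch of this prefix scheme (reusing the BitWriter from the earlier sketch, plus a matching BitReader; all names are illustrative):

Code:

#include <cstdint>
#include <vector>

// Reads bits MSB-first; counterpart to the BitWriter above (illustrative).
struct BitReader {
    const std::vector<uint8_t>& bytes;
    size_t pos = 0;                                  // absolute bit position
    uint32_t get(int width) {
        uint32_t v = 0;
        while (width--) {
            v = (v << 1) | ((bytes[pos >> 3] >> (7 - (pos & 7))) & 1u);
            ++pos;
        }
        return v;
    }
};

// A unary prefix selects the payload width: 0 -> 3 bits, 10 -> 4 bits,
// 110 -> 5 bits, ..., 111110 -> 8 bits. Worst case 6 + 8 = 14 bits.
void encode_varlen(BitWriter& bw, int move_nr) {       // move_nr < 256
    for (int ones = 0; ; ++ones) {
        int width = 3 + ones;                          // payload width 3..8
        if (move_nr < (1 << width) || width == 8) {
            bw.put((1u << (ones + 1)) - 2, ones + 1);  // 'ones' 1-bits, then a 0
            bw.put(move_nr, width);
            return;
        }
    }
}

int decode_varlen(BitReader& br) {
    int ones = 0;
    while (br.get(1)) ++ones;               // unary prefix (at most 5 ones)
    return static_cast<int>(br.get(3 + ones));  // payload
}
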
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

rvida wrote: I had a variable bit length encoding in mind ... The worst case fits in 14 bits (6-bit escape prefix + 8-bit move number).
But that is what the Huffman or arithmetic coder of the compressor does. Variable-length encoding is better done with those, because doing it your way would be sub-optimal. Those methods do statistical analysis over the whole file (or a block of it) to determine which moves (or whole lines, if you use LZ or RLE before that) occur frequently, and assign the shortest bit lengths to the most frequent ones. Also, the database could contain other information, so encoding moves directly is a bit premature IMO. I also suspect that the final result will be the same if you encode the moves as you describe and then apply LZMA, because the algorithms are redundant. OTOH, SEE and other domain-specific heuristics will probably help compression, as those are complementary to what the compressor does ...
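
To illustrate the two-stage pipeline being discussed, a sketch of the second stage; zlib's one-shot compress() is used here only as a stand-in for LZMA/bzip2:

Code:

#include <cstdint>
#include <vector>
#include <zlib.h>   // link with -lz

// Second stage: feed the already move-index-encoded byte stream to a
// general-purpose compressor, which picks up repeated opening lines.
std::vector<uint8_t> deflate_stream(const std::vector<uint8_t>& in) {
    uLongf out_len = compressBound(in.size());
    std::vector<uint8_t> out(out_len);
    if (compress(out.data(), &out_len, in.data(), in.size()) != Z_OK)
        return {};                   // compression error
    out.resize(out_len);             // shrink to the actual compressed size
    return out;
}
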