Say I have a collection of games I want to store in a database using as little memory as possible.
Because the move-choice in the opening is very restricted, I would like to save the first couple of moves by linking into a table with n entries.
Is there any algorithm that would tell me how to optimally populate the table with opening moves?
It sort of reminds me of Huffman encoding, but it is slightly different.
Compression of chess databases
Moderators: hgm, Rebel, chrisw
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: Compression of chess databases
Edmund wrote: Because the move-choice in the opening is very restricted, I would like to save the first couple of moves by linking into a table with n entries.

For move encoding you don't need any table. Just write a legal move generator which generates moves in a reproducible order. Then you can store only the move number (you effectively get 1 move per byte). If you assume that most positions do not have too many moves, you can get away with just 6 bits/move: if the move number is < 63, store it in 6 bits; if it is 63 or greater, store 63 and the real move number in the following byte.
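The escape scheme above can be sketched as follows. This is a minimal illustration in Python: the list of move indices stands in for the output of a real reproducible legal move generator, so no chess logic is included.

```python
def encode(indices):
    """Pack move indices: 6 bits each; an index >= 63 is stored as the
    escape value 63 followed by the real index in a full 8-bit field."""
    fields = []
    for n in indices:
        if n < 63:
            fields.append((n, 6))
        else:
            fields.append((63, 6))      # escape marker
            fields.append((n, 8))       # real move number
    # flatten the (value, width) fields into bytes, MSB first
    out, acc, nbits = bytearray(), 0, 0
    for value, width in fields:
        acc = (acc << width) | value
        nbits += width
        while nbits >= 8:
            nbits -= 8
            out.append((acc >> nbits) & 0xFF)
    if nbits:                           # pad the final partial byte with zeros
        out.append((acc << (8 - nbits)) & 0xFF)
    return bytes(out), len(indices)

def decode(data, count):
    """Inverse of encode(); `count` is the number of moves stored."""
    acc = int.from_bytes(data, "big")
    total = len(data) * 8
    pos = 0
    def take(width):
        nonlocal pos
        pos += width
        return (acc >> (total - pos)) & ((1 << width) - 1)
    result = []
    for _ in range(count):
        n = take(6)
        if n == 63:                     # escape: real index follows in 8 bits
            n = take(8)
        result.append(n)
    return result
```

A real decoder would then replay the game: rebuild each position, generate its legal moves in the same fixed order, and pick the move at the decoded index.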
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: Compression of chess databases
rvida wrote: For move encoding you don't need any table. Just write a legal move generator which generates moves in a reproducible order. Then you can store only the move number.

You can improve upon this with standard move-ordering tricks (good SEE captures first, good SEE quiet moves next sorted by PSQ gain, losing moves at the end). Most of the moves will end up in the low range (0..15), which enables using even fewer bits per move.
The resulting data should still be compressible with standard compression algorithms like LZMA / Bzip2 etc.
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: Compression of chess databases
Thanks for your fast response. I am aware of the move compression you are talking about.
However, my question was particularly focused on the opening.
For example, in my database 1/30 of all games begin with the Sicilian:
1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3
Using the naive approach you suggested (which is probably the best option for the midgame), you would need about 0.5*9 = 4.5 bytes. Just using a single bit for Sicilian/not-Sicilian would already save on file size.
My question was how you could optimize this algorithm ...
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: Compression of chess databases
Edmund wrote: Thanks for your fast response. I am aware of the move compression you are talking about. However, my question was particularly focused on the opening. For example, in my database 1/30 of all games begin with the Sicilian: 1. e4 c5 2. Nf3 d6 3. d4 cxd4 4. Nxd4 Nf6 5. Nc3. Using the naive approach you suggested (which is probably the best option for the midgame), you would need about 0.5*9 = 4.5 bytes. Just using a single bit for Sicilian/not-Sicilian would already save on file size. My question was how you could optimize this algorithm ...

Simply sort the games. Store in the first byte how many half-moves (if any) should be repeated from the previous game. Unfortunately this approach does not exploit transpositions.
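The sorted-prefix idea can be sketched like this. It is a minimal illustration, with games represented as plain lists of move strings rather than a real binary game format.

```python
def common_prefix_len(a, b):
    """Number of leading half-moves two games share."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n

def compress(games):
    """Sort games, then store each as (shared half-moves with the
    previous game, remaining tail of moves)."""
    games = sorted(games)
    out, prev = [], []
    for g in games:
        k = common_prefix_len(prev, g)
        out.append((k, g[k:]))
        prev = g
    return out

def decompress(records):
    """Rebuild each game from the previous one plus the stored tail."""
    games, prev = [], []
    for k, tail in records:
        g = prev[:k] + tail
        games.append(g)
        prev = g
    return games
```

Sorting clusters games that share an opening, so a whole Sicilian prefix is stored once and every following game pays only one count byte for it.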
-
- Posts: 670
- Joined: Mon Dec 03, 2007 3:01 pm
- Location: Barcelona, Spain
Re: Compression of chess databases
rvida wrote: Simply sort the games. Store in the first byte how many half-moves (if any) should be repeated from the previous game. Unfortunately this approach does not exploit transpositions.

Good idea. Transpositions are not wanted anyway when describing chess games.
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Compression of chess databases
rvida wrote: You can improve upon this with standard move ordering tricks (good SEE captures first, good SEE quiet moves second sorted by PSQ gain, losing moves at the end). Most of the moves will end up in the low range (0..15) which enables using even fewer bits per move. Resulting data should still be compressible with standard compression algorithms like LZMA / Bzip2 etc.

I don't think you can save any bits with heuristics, since there is always an off chance that one weird move takes more bits than the rest. You can only improve upon the compression that will be applied later. When you use a compressor made from Lempel-Ziv + Huffman, a single bit could suffice to represent the whole Sicilian line. For example, if the whole database starts with the Sicilian, that is more than likely what happens, i.e. the whole Sicilian line taking a single bit. When compressing my bitbases from an uncompressed representation that takes, say, 4 bits/pos or 8 bits/pos, the compressed file usually turns out to be the same size. So it may be a good idea to first store the database as it is and then use a block compressor such as ZIP.
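The point that a general-purpose compressor already exploits repeated openings can be illustrated with Python's stdlib `lzma` module. The move numbers below are made up, and this is an illustration rather than a benchmark.

```python
import lzma

# A byte stream in which the same 9-byte "opening" prefix repeats many
# times, as it would in a database full of Sicilian games. LZMA turns
# each repeat into a short back-reference, so repeats cost almost
# nothing: far less than one byte per stored move.
opening = bytes([12, 28, 11, 30, 3, 17, 14, 22, 9])   # made-up move numbers
stream = opening * 1000 + bytes(range(100))           # 9100 bytes total
packed = lzma.compress(stream)

assert lzma.decompress(packed) == stream
assert len(packed) < len(stream) // 10                # >10x smaller here
```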
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Compression of chess databases
Another option is to use a compressed representation such as binary decision diagrams (BDDs). These have already been used for bitbases. You have three terminal states: W/D/L. The squares (or moves, in this case) are encoded by testing each bit, so one square could take 8 plies I think. http://en.wikipedia.org/wiki/Binary_decision_diagram
This method is good if you require fast reads, since no decompression is involved. BTW, the method that encodes moves by their location in the move list is going to be slow during decompression, since you have to generate every move to find out which move, say, number 12 is. Oops, but that is exactly what I did in my GPU chess due to lack of memory ...
-
- Posts: 481
- Joined: Thu Apr 16, 2009 12:00 pm
- Location: Slovakia, EU
Re: Compression of chess databases
Daniel Shawul wrote: I don't think you can save any bits with heuristics since there is always an off chance that one weird move takes more bits than the rest.

I had a variable bit-length encoding in mind.
Read one bit. If it is 0, then move_nr is encoded in the next 3 bits. If it is 1, read one more bit. If that is 0, then move_nr is in the next 4 bits; if it is 1, read one more bit, etc...
With this scheme most moves should use only 4 bits per move. Moves which map to more bits should be rare, and probably blunders anyway.
The worst case fits in 14 bits (6-bit escape prefix + 8-bit move number).
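The scheme described above can be sketched as follows. For clarity the bits are represented as '0'/'1' strings rather than packed bytes; a real encoder would pack them as in the earlier posts.

```python
def encode_move(n):
    """Encode move number n: a unary prefix of k ones and a terminating
    0 bit, followed by n in 3+k bits. Worst case is k = 5: a 6-bit
    prefix plus an 8-bit move number = 14 bits."""
    for k in range(6):
        width = 3 + k
        if n < (1 << width):
            return "1" * k + "0" + format(n, f"0{width}b")
    raise ValueError("move number out of range (max 255)")

def decode_moves(bits):
    """Decode a concatenated bit string back into move numbers."""
    out, i = [], 0
    while i < len(bits):
        k = 0
        while bits[i] == "1":           # count the unary escape prefix
            k += 1
            i += 1
        i += 1                          # skip the terminating 0 bit
        width = 3 + k
        out.append(int(bits[i:i + width], 2))
        i += width
    return out
```

Move numbers 0..7 cost 4 bits, 8..15 cost 6 bits, and so on; 8 bits for the payload is enough, since no chess position has more than 256 legal moves.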
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Compression of chess databases
rvida wrote: I had a variable bit-length encoding in mind. ... The worst case fits in 14 bits (6-bit escape prefix + 8-bit move number).

But that is what the Huffman or arithmetic encoder of the compressor does. Variable-length encoding is better done with those, because doing it your way would be sub-optimal. The first two methods do a statistical analysis over the whole file (or a block of it) to determine which moves (or whole lines, if you use LZ or RLE before that) occur frequently, and assign the shortest bit lengths to the most frequent ones. Also, the database could contain other information, so encoding moves directly is a bit premature IMO. I also suspect that the final result will be the same if you encode the moves as you described and then apply LZMA, because the algorithms are redundant. OTOH, SEE and other domain-specific heuristics will probably help compression, as those are complementary to what the compressor does ...