Compression of chess databases

Discussion of chess software programming and technical issues.


Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

phhnguyen wrote:
I divide compressing a chess database into two phases:
1) compress by encoding moves, e.g. changing from 2 bytes per move to 1 byte, 6 bits or 4 bits per move
2) compress further by applying a normal compressor such as a zip or bzip2 library/program to the data files
Don't play dumb here. If you had actually done that you would have got far more than 10%, but I guess we won't see data from you.
I am usually disappointed with phase 2 because of its poor rate.

Back to your work: if you store your moves in 4 bits each, you immediately halve the size without using any extra compressor. That is also faster and needs less code (you don't need to integrate any compression libraries).
It doesn't matter whether you use 8 bits/move or 4 bits/move; you will probably get the same result, as long as you use the same number of bits for every move and don't mess up the entropy.
I guess you don't understand my post.

I store one move in just 4 bits, which means I can store 2 moves in 1 byte. That way I need only half the file size without using any compressor.
I did understand what you said, but unfortunately you don't seem to understand the consequences of it. You did use compression when you applied variable-length encoding yourself first. Feeding that to a general compressor is plain stupid, for the same reason that compressing an already zipped file again with another compressor, say 7z, is. You will get more compression if you just apply the compressor to the raw data directly. You think you are being smart by applying a sub-optimal variable-length encoding yourself, but that is definitely not smart! Just use a byte or two for _all_ moves and apply any prediction you want, if that is even necessary. But if you can get 700% compression without doing anything, why bother? You are not going to store them in RAM, or are you?
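To make the point concrete, here is a rough sketch of what "one byte per move, then let the compressor do the rest" looks like, assuming zlib is available; the one-byte-per-move buffer here is purely illustrative, not Richard's actual format:

Code:

#include <zlib.h>
#include <vector>

// Compress a buffer where every move is stored as one byte (its index in
// the legal-move list) with zlib's deflate.  Error handling kept minimal.
std::vector<unsigned char> deflate_moves(const std::vector<unsigned char>& moveBytes)
{
    uLongf destLen = compressBound(moveBytes.size());
    std::vector<unsigned char> out(destLen);
    if (compress2(out.data(), &destLen,
                  moveBytes.data(), moveBytes.size(), Z_BEST_COMPRESSION) != Z_OK)
        out.clear();            // compression failed
    else
        out.resize(destLen);    // shrink to the actual compressed size
    return out;
}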
phhnguyen
Posts: 1437
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

Daniel Shawul wrote: I did understand what you said, but unfortunately you don't seem to understand the consequences of it. You did use compression when you applied variable-length encoding yourself first. Feeding that to a general compressor is plain stupid, for the same reason that compressing an already zipped file again with another compressor, say 7z, is. You will get more compression if you just apply the compressor to the raw data directly. You think you are being smart by applying a sub-optimal variable-length encoding yourself, but that is definitely not smart! Just use a byte or two for _all_ moves and apply any prediction you want, if that is even necessary. But if you can get 700% compression without doing anything, why bother? You are not going to store them in RAM, or are you?
Thanks for the discussion.

Let me explain a little about my purpose in compressing the database. I simply need a database of reasonable size in practice, so I won't compress it at any price in terms of computing power, space and programming effort. The compression rate means almost nothing to me; the final size of the database is what matters.

Dividing the compression process into two phases does not mean that I always implement both. If the first one is good and the second one is not good enough, I won't use the second one in my program (it will use only phase one to compress the database). Dropping the compressor saves me some code, a library dependency, and the time to decompress. So there is no messing up or disturbing of the compressor, and no attempt to be clever here.

When Richard Vida showed that he could save 50% of his data size by encoding a move into 4 bits and then compressing the data, I realized that we could get about the same size simply by using logic operators like SHIFT, AND, OR... to store 2 moves in 1 byte, and drop the compressor entirely.
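To be concrete, this is roughly what I mean by using SHIFT, AND, OR (just a sketch; it assumes the two 4-bit move codes have already been computed):

Code:

#include <cstdint>

// Pack two 4-bit move codes (each 0..15) into one byte, and unpack them again.
inline uint8_t pack2(uint8_t hi, uint8_t lo)
{
    return (uint8_t)(((hi & 0x0F) << 4) | (lo & 0x0F));
}

inline void unpack2(uint8_t b, uint8_t& hi, uint8_t& lo)
{
    hi = (uint8_t)(b >> 4);
    lo = (uint8_t)(b & 0x0F);
}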

I will integrate a compressor into my program only if it can compress further, i.e. learn more complicated patterns than my chess knowledge can. Otherwise, compression by move encoding is good enough for me.

Using a normal compressor is expensive. I avoid it whenever I can.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

phhnguyen wrote:
Daniel Shawul wrote: I did understand what you said, but unfortunately you don't seem to understand the consequences of it. You did use compression when you applied variable-length encoding yourself first. Feeding that to a general compressor is plain stupid, for the same reason that compressing an already zipped file again with another compressor, say 7z, is. You will get more compression if you just apply the compressor to the raw data directly. You think you are being smart by applying a sub-optimal variable-length encoding yourself, but that is definitely not smart! Just use a byte or two for _all_ moves and apply any prediction you want, if that is even necessary. But if you can get 700% compression without doing anything, why bother? You are not going to store them in RAM, or are you?
Thanks for the discussion.

Let me explain a little about my purpose in compressing the database. I simply need a database of reasonable size in practice, so I won't compress it at any price in terms of computing power, space and programming effort. The compression rate means almost nothing to me; the final size of the database is what matters.

Dividing the compression process into two phases does not mean that I always implement both. If the first one is good and the second one is not good enough, I won't use the second one in my program (it will use only phase one to compress the database). Dropping the compressor saves me some code, a library dependency, and the time to decompress. So there is no messing up or disturbing of the compressor, and no attempt to be clever here.

When Richard Vida showed that he could save 50% of his data size by encoding a move into 4 bits and then compressing the data, I realized that we could get about the same size simply by using logic operators like SHIFT, AND, OR... to store 2 moves in 1 byte, and drop the compressor entirely.

I will integrate a compressor into my program only if it can compress further, i.e. learn more complicated patterns than my chess knowledge can. Otherwise, compression by move encoding is good enough for me.

Using a normal compressor is expensive. I avoid it whenever I can.
No, Richard used 8 bits/move for all moves. That means no variable-length encoding, since the number of legal moves in a position is below 256. With your 4 bits/move you have to patch them up with escape bits for move indices >= 16. That is why you are getting only 10% when I got 700% feeding it raw PGN data. I mean that it will really mess up whatever pattern there is in the original database. So what he measured is the effect of prediction alone. If he had used 16 bits/move the upper bits would all be zero and the compressor would have no problem compressing it to almost the same size. Also, a PGN file contains more than moves; headers & comments may sometimes make up 90% of a file. And encoding by the order of moves in a specialized move generator (or search) may not necessarily increase compression, as I will explain later.
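To be concrete, "encoding by the order of moves" means something like the following rough sketch; the move list is assumed to come from your own move generator, in a deterministic order shared by the encoder and the decoder:

Code:

#include <cstdint>
#include <vector>

// moveList must be the legal moves of the current position, in the fixed
// order produced by the move generator (encoder and decoder must agree).
// Legal-move counts in chess stay below 256, so one byte always suffices
// and no escape bits are needed.
template <typename Move>
uint8_t encode_move_index(const std::vector<Move>& moveList, const Move& played)
{
    for (size_t i = 0; i < moveList.size(); ++i)
        if (moveList[i] == played)
            return (uint8_t)i;
    return 0xFF;                        // not found; cannot happen for a legal game
}

template <typename Move>
Move decode_move_index(const std::vector<Move>& moveList, uint8_t code)
{
    return moveList[code];
}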

Even using a fixed-size 8 bits/move may not improve things at all, since it adds entropy that was not present in the original PGN format. I will give you an example. In PGN, SAN moves are written in a compressed form, so for example Be5 may mean a bishop coming from anywhere; but if you encode by the order of the move, Ba1e5, Bb2e5, Bc3e5, etc. (i.e. all 14 of them) take different patterns.
General compressors do a lot of things by themselves, going as far as prediction with context modelling (e.g. PPM) or even neural nets, so make sure whatever prediction you use covers ground that is not covered by those. Otherwise it is a complete waste of time, which turned out to be the case for my bitbases. Using domain-specific tricks for PGN surely may improve things somewhat (of questionable worth), but forget about the 4 bits/move stuff, as any compressor will do a far better job.
phhnguyen
Posts: 1437
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

Daniel Shawul wrote: No, Richard used 8 bits/move for all moves. That means no variable-length encoding, since the number of legal moves in a position is below 256. With your 4 bits/move you have to patch them up with escape bits for move indices >= 16. That is why you are getting only 10%
Let me clarify: Richard stores a move in 8 bits, but the encoding he uses before saving is actually a 4-bit encoding. He did not bother to compress the database himself but let the compressor do the rest.

If I use the same approach but store 2 moves in 1 byte, even if I need to add some escapes, I still get close to his database size, but without using any compressor. It is NOT 10%, but 200%, i.e. a saving of 50% of the original size.

When I use a normal compressor to further compress my already compressed database, as I mentioned above, I may save an extra 10% of the current compressed size.
Daniel Shawul wrote: ...when I got 700% feeding it raw PGN data. I mean that it will really mess up whatever pattern there is in the original database. So what he measured is the effect of prediction alone. If he had used 16 bits/move the upper bits would all be zero and the compressor would have no problem compressing it to almost the same size. Also, a PGN file contains more than moves; headers & comments may sometimes make up 90% of a file. And encoding by the order of moves in a specialized move generator (or search) may not necessarily increase compression, as I will explain later.

Even using a fixed-size 8 bits/move may not improve things at all, since it adds entropy that was not present in the original PGN format. I will give you an example. In PGN, SAN moves are written in a compressed form, so for example Be5 may mean a bishop coming from anywhere; but if you encode by the order of the move, Ba1e5, Bb2e5, Bc3e5, etc. (i.e. all 14 of them) take different patterns.
General compressors do a lot of things by themselves, going as far as prediction with context modelling (e.g. PPM) or even neural nets, so make sure whatever prediction you use covers ground that is not covered by those. Otherwise it is a complete waste of time, which turned out to be the case for my bitbases. Using domain-specific tricks for PGN surely may improve things somewhat (of questionable worth), but forget about the 4 bits/move stuff, as any compressor will do a far better job.
As I said in my previous post, the compression rate means almost nothing; the final size of the database is what matters. You get a great rate because you compress the text file (the PGN file). In my case, after encoding and saving the moves in a binary file, I get a significantly smaller database (compared with compressed PGN files) without using any compressor.

At the moment, I am very happy with the size I save by encoding. I just wonder if I can do more.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

phhnguyen wrote:
Daniel Shawul wrote: No, Richard used 8 bits/move for all moves. That means no variable-length encoding, since the number of legal moves in a position is below 256. With your 4 bits/move you have to patch them up with escape bits for move indices >= 16. That is why you are getting only 10%
Let me clarify: Richard stores a move in 8 bits, but the encoding he uses before saving is actually a 4-bit encoding. He did not bother to compress the database himself but let the compressor do the rest.
What are you talking about now? He clearly mentioned that he used 8 bits/move, which is the best choice as it needs no extra bits. Try variable-length encoding and see what you get... We already said that 4 bits/move needs extra bits. It will not gain you anything, because now every move has a different bit length, and that will mess things up big time. I think you should read a bit of compression theory, because you seem to be stuck.
If I use the same approach but store 2 moves in 1 byte, even if I need to add some escapes, I still get close to his database size, but without using any compressor. It is NOT 10%, but 200%, i.e. a saving of 50% of the original size.
Now you say it is 200%, hmm... If you get the same size as Richard using 4 bits/move, then it means you saved nothing at all! You could have used 8 bits/move and saved some effort by not doing the variable-length encoding yourself. That will be done at the last stage by the compressor itself, and much better than what you do here. A statistical analysis is done over the whole file to see which bytes occur most frequently; those are assigned the shortest bit lengths. What you do is sub-optimal, to say the least.
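If you doubt that, just measure it. Here is a quick sketch that computes the byte histogram and the Shannon entropy of a buffer; that is roughly the bits/byte limit an entropy coder like the one in zip works towards:

Code:

#include <cmath>
#include <cstddef>
#include <vector>

// Byte histogram + Shannon entropy in bits per byte.  A byte-wise entropy
// coder cannot beat this bound, so it shows how much headroom the
// compressor's own Huffman stage still has on your data.
double entropy_bits_per_byte(const std::vector<unsigned char>& data)
{
    if (data.empty()) return 0.0;
    std::size_t count[256] = {0};
    for (unsigned char b : data) ++count[b];
    double h = 0.0;
    for (int i = 0; i < 256; ++i) {
        if (count[i] == 0) continue;
        double p = (double)count[i] / (double)data.size();
        h -= p * std::log2(p);
    }
    return h;   // multiply by data.size() / 8.0 for an estimated output size in bytes
}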
When I use a normal compressor to further compress my already compressed database, as I mentioned above, I may save an extra 10% of the current compressed size.
If you lose possible compression ratio by doing something the compressor does better, then gaining 10% is nothing. You probably lost more from that unnecessary step.
As I said in my previous post, the compression rate means almost nothing; the final size of the database is what matters. You get a great rate because you compress the text file (the PGN file). In my case, after encoding and saving the moves in a binary file, I get a significantly smaller database (compared with compressed PGN files) without using any compressor.
Again, whatever saving he gained came from prediction, _not_ because most moves were squeezed into 4 or 8 bits. Please understand that once and for all. It makes little difference to the compressor whether I use 8 bits/move or 16 bits/move, but using 5 bits for one move, 3 for another and 7 for the next messes up whatever pattern is there and hurts your compression. I am sure that if I did the prediction over the PGN file without encoding the moves in 8 bits (i.e. directly over the PGN), I would get about 2 MB for that file...
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

I did a quick compression without using prediction, just using the location of the move in the move generator's list, and I got 2.467 MB. So Richard got about a 20% improvement from prediction with a 1-ply + qsearch. See, prediction is already looking a bit overrated :) What do you get with your variable-length encoding? The original file was a 41 MB PGN database, which I transformed into a moves-only database. If I compress the SAN as it is I get 4 MB, and if I transform it to coordinate notation I get 4.7 MB. A coordinate notation using two bytes (move-generator order) gives 4.4 MB, which is bigger than the SAN file. That testifies that SAN by itself is a good representation for compression! Most SAN moves use more than 2 bytes, so you would expect them to compress worse than the coordinate notation, but that is not the case. So you see, just because you represent the move with fewer bits doesn't mean the final size will be smaller.
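For reference, the prediction trick itself is simple; something along these lines (only a sketch, with the scoring function standing in for whatever 1-ply + qsearch ordering is actually used):

Code:

#include <algorithm>
#include <cstdint>
#include <vector>

// Order the legal moves by a predictor and store the rank of the move that
// was actually played.  With a good predictor rank 0 dominates, which the
// final general-purpose compressor exploits.  'score' is any callable
// Move -> int (higher = more likely), e.g. a 1-ply + qsearch score.
template <typename Move, typename Score>
uint8_t encode_move_rank(std::vector<Move> moveList, const Move& played, Score score)
{
    std::stable_sort(moveList.begin(), moveList.end(),
                     [&](const Move& a, const Move& b) { return score(a) > score(b); });
    for (size_t i = 0; i < moveList.size(); ++i)
        if (moveList[i] == played)
            return (uint8_t)i;
    return 0xFF;                        // should not happen for a legal move
}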
phhnguyen
Posts: 1437
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

I have just tested on Olivier Deville's collection.

The collection has 50861 games; the rar file is 9.4 MB and the unpacked PGN is 42.5 MB.

After encoding into 6 bits/move (I have not implemented 4 bits like Richard yet - I am still worried about the heavy computation) and storing the data in a few binary files (those files already include some extra structures for management), the total size is 4.3 MB. Of that, the moves-only file is 3.4 MB.

Therefore:
- I need only about half the size (46%) of the rar file to store my data (4.3 MB vs 9.4 MB)
- I don't need any compressor, so my program may work with the database faster
- The binary form lets my program load, search... very fast (compared with using the PGN form)

If I change from 6-bit to 4-bit encoding like Richard, I may save another 1/3 of the size. It is still bigger than Richard's, but I think the gap is small, especially since I have added some extra file-management structures (for example, my index file is over 800 KB - it gives access to any game's information and moves almost immediately). My clear advantage is the absence of a compressor.
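For illustration, the index file is just an array of small fixed-size records, roughly like this (a sketch only; the field names and widths are illustrative, not my exact format):

Code:

#include <cstdint>

// One fixed-size record per game; game i sits at offset i * sizeof(GameIndexEntry)
// in the index file, so any game's moves and header can be located directly.
#pragma pack(push, 1)
struct GameIndexEntry {
    uint32_t moveDataOffset;   // byte offset of the game's moves in the moves-only file
    uint16_t moveCount;        // number of half-moves stored for the game
    uint32_t headerOffset;     // byte offset of the game's header record (players, result, ...)
    uint16_t flags;            // result, termination, variant flags, etc.
};
#pragma pack(pop)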

PS: I have compressed the above moves-only file (which is 3.4 MB) with a normal zip application. The zip file is 3 MB - an extra reduction of about 10% in size. I think that percentage is roughly the gap between mine and Richard's (the compressor can do a great job), but for speed and simpler code it is acceptable.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

phhnguyen wrote: I have just tested on Olivier Deville's collection.

The collection has 50861 games; the rar file is 9.4 MB and the unpacked PGN is 42.5 MB.

After encoding into 6 bits/move (I have not implemented 4 bits like Richard yet - I am still worried about the heavy computation) and storing the data in a few binary files (those files already include some extra structures for management), the total size is 4.3 MB. Of that, the moves-only file is 3.4 MB.
You are not listening, are you? Richard used 8 bits/move without escape bits. You can't use 4 bits per move without affecting the compression achievable with zip. Why are you afraid to report your number after you use 4 bits/move on average and then zip?? Haven't you been saying it is the final size that matters? Well, I gave you my 2.4 MB figure that I got by storing moves in 8 bits and then zipping. Zip your file that uses 4 bits/move on average and tell us the size. Not 6 bits/move or anything closer to 8 bits/move.
Therefore:
- I need only about half the size (46%) of the rar file to store my data (4.3 MB vs 9.4 MB)
- I don't need any compressor, so my program may work with the database faster
That was not what you were preaching. In fact, deflate will be just as fast as yours at decompression; it is optimized for decompression, which is why we also use it for bitbases. You used variable-length encoding, which will be as slow as the one deflate uses. The Lempel-Ziv stage that zip runs before the Huffman coding is extremely easy to decompress, so you don't gain anything there. Oh, by the way, since you don't have the LZ part, you lose big on compression ratio. Anyway, it is meaningless to compare decompression speeds since you don't do the LZ part. What you do is a sub-optimal entropy coder; what zip uses is LZ plus an optimal entropy coder.
- The binary form lets my program load, search... very fast (compared with using the PGN form)
Read above. Zip will be just as fast at decompressing.
If I change from 6-bit to 4-bit encoding like Richard, I may save another 1/3 of the size.
It is still bigger than Richard's, but I think the gap is small, especially since I have added some extra file-management structures (for example, my index file is over 800 KB - it gives access to any game's information and moves almost immediately). My clear advantage is the absence of a compressor.
He did not use 4 bits/move, and you will not be able to save anything either. In fact, you will lose to the method that just compresses the file with zip.
You have no clear advantage compared to deflate. It is one of the fastest decompressors out there, which is why we use it to probe compressed bitbases...
PS: I have compressed the above moves-only file (which is 3.4 MB) with a normal zip application. The zip file is 3 MB - an extra reduction of about 10% in size. I think that percentage is roughly the gap between mine and Richard's (the compressor can do a great job), but for speed and simpler code it is acceptable.
See, you lost. I got about 2.4 MB representing them with a plain 8 bits/move and no hassle. And that was against your representation that averaged 6 bits/move; I am willing to guess that 4 bits/move would be worse, not better...
phhnguyen
Posts: 1437
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

Thank you for replying.

I think we still have different views about the database and the methods of compressing it. I may have misunderstood some numbers and details, so the best thing now is for me to learn more; could you help me?

From what I have seen, Richard's database size is reasonable for me. However, I don't understand how you got the figure of 2.467 MB. How did you create the data? What is the size of your raw data before compressing? Which compressor did you use?

More details would be best. I would like to try it myself and make my own comparison and conclusions before committing to a method for a long time.

Thanks.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

phhnguyen wrote: Thank you for replying.

I think we still have different views about the database and the methods of compressing it. I may have misunderstood some numbers and details, so the best thing now is for me to learn more; could you help me?
Sorry, it is not about views; this is not a matter of opinion but of facts. That's why we need numbers.
From what I have seen, Richard's database size is reasonable for me. However, I don't understand how you got the figure of 2.467 MB. How did you create the data? What is the size of your raw data before compressing? Which compressor did you use?
I just zipped the raw data, which uses 8 bits/move. I get almost the same raw data size as Richard, that is 4.4 MB. He went on to use prediction and zipping to get a size of 1.9 MB, but I did not predict, so mine became 2.46 MB. So prediction gives only about a 20% improvement. On the other hand, you used 6 bits/move to get a size of 3.4 MB, which is in fact (6/8) * 4.4 MB, so that makes sense. BUT your zip compression was so bad that it only brought you down 10%, to 3 MB. This is what I have been saying all along. In fact this was generous to you, because there are only about 5 moves in that database that cannot be represented with 6 bits/move! So you only needed to patch up those 5 moves, and still your compression ratio was that bad. With 4 bits/move there are far more moves that won't fit, and the problem will worsen.

Note that I can improve further upon my compression ratio by using a compressor based on neural nets or something similar... But yours will be difficult to improve upon, and it has already lost anyway after making a mess of the ordering. It may even be possible to compress the SAN moves as they are (i.e. 32+ bits/move) down to what you get, i.e. 3 MB, since those are already at 4 MB anyway. It is not how many bits you spend representing a move that matters most, but the amount of disorder in the data.

Note that you are doing the entropy coding first, which is the problem. Applying LZ over that data will not gain much. Entropy coding is usually the _last_ step of a compression pipeline, e.g. RLE + Huffman, or LZ + Huffman; not the other way round, as you did it, i.e. variable-length coding + LZ.
More details would be best. I would like to try it myself and make my own comparison and conclusions before committing to a method for a long time.

Thanks.