phhnguyen wrote:It reduces the whole database by only about 10%. It may gain less if it is compressed in chunks.
Does anyone have better compression results?
10% of what? What was the original format, and which compression method did you use?
My largest database file contains only moves (no FEN, no game info), encoded at 6-8 bits per move (similar to your previous posts). The format is very simple, so we can consider it moves only.
I tested with zip (including the -9 option) and the tar app on my Mac, compressing the whole move file into one compressed file, and could reduce its size by only 10%.
If I compress it in chunks, I guess I may gain less than 10% because of ineffective chunk sizes and the extra management structures.
In general, good compression requires good ability to predict the next symbol. So what you need is a piece of code that computes the probability of each of the available moves (including resigning) being picked. You could do something like a depth-10 search for each move with some strong engine, and then use something like softmax (http://en.wikipedia.org/wiki/Softmax_ac ... n_function) to assign probabilities.
Once you have that, the best you can do is apply arithmetic coding to convert the probability distribution at each node to an actual representation of the data. Huffman encoding would give you reasonable results too, but it would use at least one bit per move.
If you want more compression, find a better predictor of the next move.
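A minimal sketch of the prediction step in Python, assuming hypothetical engine scores in centipawns for each legal move. The softmax temperature of 100 is an arbitrary illustrative choice, and the -log2(p) values shown are the ideal arithmetic-coding cost of each move, not the output of an actual coder:

```python
import math

# Sketch: turn (hypothetical) engine scores in centipawns into a
# probability for each legal move via softmax, then measure the ideal
# arithmetic-coding cost, -log2(p), of the move actually played.

def softmax(scores, temperature=100.0):
    # Temperature controls how sharply the best move dominates;
    # 100 cp is an illustrative choice, not a tuned value.
    exps = [math.exp(s / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Made-up scores for five legal moves in some position.
scores = [35, 30, 10, -50, -120]
probs = softmax(scores)
assert abs(sum(probs) - 1.0) < 1e-9

# The engine's top choice costs few bits to encode; an unlikely
# move costs more bits.
cost_best = -math.log2(probs[0])
cost_blunder = -math.log2(probs[4])
print(round(cost_best, 2), round(cost_blunder, 2))
```

With a good predictor, the move actually played lands on a high-probability entry most of the time, so the average cost per move drops well below a fixed 6-8 bits.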
AlvaroBegue wrote:In general, good compression requires good ability to predict the next symbol. ... If you want more compression, find a better predictor of the next move.
Thank you very much for the suggestions.
So the key here is to have "a better predictor of the next move".
However, it is not practical for me because:
- It is not easy to find one for chess. I know we can consider methods such as opening lines, a shallow search, SEE..., but they are hard to integrate with a compressor
- In terms of data, it is a very hard task for moves in my database since they are already encoded into 6-8 bits
- The gain is still tiny. I guess that from my experience, because current commercial compressors are so good (I hope someone can confirm I am wrong here). A few more percent of compression (on top of my current 10%) is not worth it for me. I need at least 20-30% to make it usable
I have been disappointed with my compression rate for a while; I am just asking here in case anyone has achieved a better rate in practice.
It's not so hard to integrate: simply encode each move as its index in a deterministically sorted move list, such that the most likely moves occupy the top N positions. If your move orderer is very good, the gain should be significant after you pass the resulting database through a good standard compression algorithm.
The main concern is that this will take a lot of extra processing time if you want to search your database, since for every move you have to order the moves of the new position. And with storage as cheap as it is, why bother eking out a few extra percentage points for a less usable database? If anything, the other direction makes more sense to me: spend more space on a comprehensive set of indexes, but search quickly on partial position information.
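A minimal sketch of this rank-based encoding, with a stand-in `score` table in place of a real move orderer (both the move strings and the scores here are made up):

```python
# Sketch: encode each played move as its rank in a deterministically
# ordered candidate list. With a good ordering heuristic, small
# indices dominate and the index stream compresses well.

def order_moves(moves, score):
    # Sort by heuristic score (descending), breaking ties
    # alphabetically so the ordering is fully deterministic and
    # reproducible on the decoding side.
    return sorted(moves, key=lambda m: (-score[m], m))

def encode(played, moves, score):
    return order_moves(moves, score).index(played)

def decode(index, moves, score):
    return order_moves(moves, score)[index]

# Hypothetical legal moves with made-up heuristic scores.
score = {"e2e4": 30, "d2d4": 28, "g1f3": 25, "b1c3": 10, "a2a3": -5}
moves = list(score)
idx = encode("d2d4", moves, score)  # a strong move gets a small index
assert decode(idx, moves, score) == "d2d4"
print(idx)  # 1: second-best according to the heuristic
```

The decoder re-derives the same ordering from the position, so only the index needs to be stored; ties must be broken deterministically, or decoding will diverge from encoding.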
phhnguyen wrote:It reduces the whole database by only about 10%. It may gain less if it is compressed in chunks. ... Do you have better experience? Thanks
Did you sort the games so that games starting with the same moves will appear together?
jwes wrote:
Did you sort the games so that games starting with the same moves will appear together?
Very good idea!!! Thanks.
I have not sorted games like that yet. It seems worth trying. I suspect that my 10% compression mainly comes from repeated opening patterns, so this idea may improve the rate.
I may also use the idea in other ways, such as sorting games by opening and grouping them in chunks, then cutting out the shared opening moves to save space. That would also be useful for searching by opening.
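One simple way to cut out the shared opening moves after sorting is front coding: store each game as (length of prefix shared with the previous game, remaining moves). A sketch in Python, where games are just lists of move strings (the move data here is made up):

```python
# Sketch of front coding for sorted games: sorting first makes the
# shared opening prefixes long, so little is stored per game.

def front_encode(games):
    out, prev = [], []
    for g in sorted(games):
        # Length of the prefix shared with the previous game.
        k = 0
        while k < min(len(prev), len(g)) and prev[k] == g[k]:
            k += 1
        out.append((k, g[k:]))
        prev = g
    return out

def front_decode(encoded):
    games, prev = [], []
    for k, suffix in encoded:
        g = prev[:k] + suffix
        games.append(g)
        prev = g
    return games

games = [["e4", "e5", "Nf3", "Nc6"],
         ["e4", "e5", "Nf3", "Nf6"],
         ["d4", "d5"]]
enc = front_encode(games)
assert front_decode(enc) == sorted(games)
```

The decoded order is the sorted order, so if the original game order matters, a per-game permutation index has to be stored alongside the encoded data.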
phhnguyen wrote:It reduces the whole database by only about 10%. It may gain less if it is compressed in chunks. ... Do you have better experience? Thanks
I think a 50% reduction can easily be achieved using a domain-specific predictor (i.e. a chess engine). Right now I am running some tests on a sample PGN database, and the prediction rates are very compression friendly. It looks like 3-3.5 bits per move is easily achievable while still maintaining the ability to do random access at game granularity.
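A back-of-the-envelope check on figures in that range: if one tallies how often the played move is the predictor's 1st, 2nd, 3rd, ... choice, the entropy of that rank distribution bounds the achievable bits per move. The frequencies below are made-up illustrative numbers, not measured data:

```python
import math

# Hypothetical rank frequencies: the played move is the predictor's
# top choice 45% of the time, second choice 18%, and so on; the last
# bucket lumps together all remaining moves.
ranks = [0.45, 0.18, 0.10, 0.07, 0.05, 0.04, 0.03, 0.08]
assert abs(sum(ranks) - 1.0) < 1e-9

# Shannon entropy of the rank distribution = ideal bits per move
# (ignoring the extra bits needed to pick within the last bucket).
bits = -sum(p * math.log2(p) for p in ranks)
print(round(bits, 2))
```

With a distribution like this the bound lands near 2.4 bits per move, so 3-3.5 bits including the cost of resolving the catch-all bucket is plausible.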
phhnguyen wrote:It reduces the whole database by only about 10%. It may gain less if it is compressed in chunks. ... Do you have better experience? Thanks
What exactly did you compress? A PGN/EPD file for me compresses by a factor of far more than 2-3 using ZIP, and I am sure other compressors will do far better. First make sure you have exhausted the other options before going for the ultimate option of prediction with a chess engine. In fact, we used that for bitbases, not PGNs, and we had reasons for that... Also note that you need a chess engine running a search during decompression; it is not as if you do the compression with prediction once and forget about it... The whole thing is messy.
phhnguyen wrote:- In terms of data, it is a very hard task for moves in my database since they are already encoded into 6-8 bits
Are you applying your compressor to data where the moves are encoded in a variable number of bits?
Most compressors are designed to compress bytes rather than bits, so it would most probably be better to feed them data with one move per byte.
Yet another reason why one shouldn't do premature variable-length encoding. The entropy coder is supposed to be the last step of compression; otherwise you risk adding disorder that could have been avoided. Many compressors do frequency calculations on bytes, so a bit array will destroy whatever patterns the original data has. That might actually be what happened here...
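This effect is easy to demonstrate with zlib: the same stream of (synthetic, randomly generated) move indices tends to compress worse once it is packed at 6 bits per move, because the byte boundaries no longer line up with the move boundaries:

```python
import random
import zlib

random.seed(0)
# Synthetic move indices with a skewed distribution (0 = most common);
# these are made-up stand-ins for real database content.
moves = [random.choice([0, 0, 0, 1, 1, 2, 3, 4, 5, 6, 7])
         for _ in range(20000)]

# One move per byte: the symbol statistics stay byte-aligned and
# visible to zlib's byte-oriented modeling.
per_byte = bytes(moves)

# The same moves packed at 6 bits each: identical content, but move
# boundaries now straddle byte boundaries.
acc, nbits, packed = 0, 0, bytearray()
for m in moves:
    acc = (acc << 6) | m
    nbits += 6
    while nbits >= 8:
        nbits -= 8
        packed.append((acc >> nbits) & 0xFF)
if nbits:
    packed.append((acc << (8 - nbits)) & 0xFF)

a = len(zlib.compress(per_byte, 9))
b = len(zlib.compress(bytes(packed), 9))
print(a, b)  # byte-aligned input compresses to fewer bytes
```

The packed file starts out smaller (6 bits vs 8 per move), but the general-purpose compressor recovers far less of the remaining redundancy from it, which matches the point about leaving entropy coding to the last step.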