Compression of chess databases

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Compression of chess databases

Post by gladius »

Daniel Shawul wrote: But that is what the Huffman or arithmetic encoder of the compressor does. Variable-length encoding is better done with those, because doing it your way would be sub-optimal.
A standard arithmetic encoder encodes per symbol and goes off the frequency of each symbol occurring. That is not a good predictor in general for a chess game.

If instead you use Richard's encoding, then it *is* a good predictor, as most moves will be <= 16. So, in fact, you could think of the move encoding scheme as a transform to help compression.

If you wanted to get really fancy, you could add an existing chess program (say Stockfish at depth 3), as a model into one of the http://en.wikipedia.org/wiki/PAQ family of compressors. It will be quite slow though :).
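A rough sketch of this transform, assuming the python-chess library; the move_key heuristic below is invented for illustration and is not Richard's actual ordering (which would use SEE or similar). Each move is replaced by its rank among the legal moves under that ordering, so most stored indices are small and heavily skewed, which the back-end compressor can then exploit.

# Sketch: rank-index move transform (assumes the python-chess library).
# Each move is stored as its rank among the legal moves under a crude
# ordering heuristic; good ordering makes most indices small.
import chess

def move_key(board, move):
    # Invented heuristic: promotions and captures first, then by squares.
    # A real encoder would order by SEE, history or an engine's ranking.
    return (not move.promotion, not board.is_capture(move),
            move.to_square, move.from_square, move.promotion or 0)

def encode_game(moves_san, start_fen=chess.STARTING_FEN):
    board = chess.Board(start_fen)
    indices = []
    for san in moves_san:
        ordered = sorted(board.legal_moves, key=lambda m: move_key(board, m))
        played = board.parse_san(san)
        indices.append(ordered.index(played))   # usually a small number
        board.push(played)
    return indices

def decode_game(indices, start_fen=chess.STARTING_FEN):
    board = chess.Board(start_fen)
    moves = []
    for idx in indices:
        ordered = sorted(board.legal_moves, key=lambda m: move_key(board, m))
        move = ordered[idx]
        moves.append(board.san(move))
        board.push(move)
    return moves

print(encode_game(["e4", "c5", "Nf3", "d6"]))  # mostly small indices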
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

gladius wrote:
Daniel Shawul wrote: But that is what the Huffman or arithmetic encoder of the compressor does. Variable-length encoding is better done with those, because doing it your way would be sub-optimal.
A standard arithmetic encoder encodes per symbol and goes off the frequency of each symbol occurring. That is not a good predictor in general for a chess game.

If instead you use Richard's encoding, then it *is* a good predictor, as most moves will be <= 16. So, in fact, you could think of the move encoding scheme as a transform to help compression.

If you wanted to get really fancy, you could add an existing chess program (say Stockfish at depth 3), as a model into one of the http://en.wikipedia.org/wiki/PAQ family of compressors. It will be quite slow though :).
Brother, please! :) It seems you do not understand what I said at all. Huffman or, even better, arithmetic encoding are variable-length encoding schemes that are applied after you first do transformations to kind of 'prepare' the data. Now SEE is good for preparation, but not the Huffman-like encoding of the move, since that is redundant. For general compression, Lempel-Ziv (as in LZMA) or a Burrows-Wheeler transform (bzip2) will prepare and complement Huffman. So a whole Sicilian line could be represented extremely compactly, since a single (offset, length) pair will be enough. Arithmetic coding is even better than Huffman since it doesn't necessarily waste a whole bit to encode a character!

Now prediction, as with SEE, is good, and it is not new. You should know I used SEE, search, eval and what have you to predict the results in my bitbases to help compression. But each and every one failed or gave only minimal improvement. It also means more decompression time, since you have to repeat the prediction during decompression, so more access time. In the end I decided against it.

So if you apply some redundant transformation like a Huffman encoding of the move and then apply LZMA, don't expect any gain, as LZMA already has that style of entropy coding built in. Try predictors that complement it, such as SEE. So you see I have some experience (albeit a bad one), so better prepare your arguments well :)

Bye
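A minimal sketch of that pipeline in Python, with the one-byte-per-index packing assumed purely for illustration: the transform output goes straight into LZMA, whose built-in match finder and range coder already do the entropy coding, so a hand-rolled Huffman pass in front of it would indeed be redundant.

# Sketch: feed transform output directly to LZMA; its range coder does the
# entropy coding, so a separate Huffman pass beforehand buys nothing.
import lzma

def compress_indices(indices):
    # Assumption: every rank index fits in one byte.
    raw = bytes(indices)
    return lzma.compress(raw, preset=9 | lzma.PRESET_EXTREME)

def decompress_indices(blob):
    return list(lzma.decompress(blob))

indices = [12, 3, 0, 1, 7, 0, 2] * 100   # toy data: skewed and repetitive
blob = compress_indices(indices)
print(len(indices), "indices ->", len(blob), "compressed bytes")
assert decompress_indices(blob) == indices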
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: Compression of chess databases

Post by gladius »

Daniel Shawul wrote: Brother, please! :) It seems you do not understand what I said at all. Huffman or, even better, arithmetic encoding are variable-length encoding schemes that are applied after you first do transformations to kind of 'prepare' the data.
I was going off of your description of Huffman/arithmetic encoding, "The first two methods do statistical analysis over the whole file (or a block of it) to determine...", which, to me, implied that it was encoding the moves without any preprocessing.

Regardless, I agree that doing the variable-length encoding is probably not going to be a win over using a simple preprocessor and then a standard compression algorithm.

I'm surprised you found no help by using a good predictor for compressing the bitbases though. How was the prediction accuracy? Were you using it as input into the arithmetic coding model? Or as a pre-process step?
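For concreteness, a toy sketch of the 'pre-process step' variant of that question; the predict function is only a placeholder, not Scorpio's actual SEE or search predictor, and the bitbase is kept one value per byte just to keep the example short. The stored value is XORed with the prediction, so a good predictor leaves a near-all-zero residual that a standard compressor squeezes well, and the same XOR during decompression restores the original bits.

# Sketch: predictor used as a pre-process step for a bitbase.
# Store prediction XOR actual, which is mostly zeros if the predictor is good.
import zlib

def predict(position_index):
    # Placeholder predictor; a real one would run SEE or a shallow search
    # on the position reconstructed from this index.
    return position_index % 2

def preprocess(bits):
    return bytes([predict(i) ^ b for i, b in enumerate(bits)])

def postprocess(residual):
    return [predict(i) ^ r for i, r in enumerate(residual)]

# Toy bitbase in which the predictor is wrong about 1% of the time.
bits = [predict(i) ^ (1 if i % 97 == 0 else 0) for i in range(10000)]
residual = preprocess(bits)
packed = zlib.compress(residual, 9)
print("raw:", len(bits), "values  compressed residual:", len(packed), "bytes")
assert postprocess(zlib.decompress(packed)) == bits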
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Compression of chess databases

Post by Daniel Shawul »

gladius wrote:
Daniel Shawul wrote: Brother, please! :) It seems you do not understand what I said at all. Huffman or, even better, arithmetic encoding are variable-length encoding schemes that are applied after you first do transformations to kind of 'prepare' the data.
I was going off of your description of Huffman/arithmetic encoding, "The first two methods do statistical analysis over the whole file (or a block of it) to determine...", which, to me, implied that it was encoding the moves without any preprocessing.
I mentioned "statistical analysis" to emphasize that the shortest codes are assigned to the most frequent symbols, i.e. by studying the file, rather than statically as Richard did it. If you directly encode the moves, the assignment will be sub-optimal because the bit lengths are then assigned based on the SEE ordering. Note that SEE is used in both cases. It is clear that statistical analysis will give better results for any file, while the static scheme depends on the accuracy of the predictor. There is a static version of Huffman for text files that encodes based on the general frequency of letters ('e' taking the shortest bit length).
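A toy illustration of that distinction (the data and helper below are made up): measured frequencies from the file drive the code lengths, whereas a static scheme like Richard's fixes the lengths in advance from the assumed SEE ordering.

# Toy sketch: Huffman code lengths derived from measured frequencies
# (statistical analysis), as opposed to a fixed, pre-assigned table.
import heapq
from collections import Counter

def huffman_code_lengths(freqs):
    # Standard Huffman merging; returns symbol -> code length in bits.
    heap = [(f, [sym]) for sym, f in freqs.items()]
    heapq.heapify(heap)
    lengths = {sym: 0 for sym in freqs}
    while len(heap) > 1:
        f1, syms1 = heapq.heappop(heap)
        f2, syms2 = heapq.heappop(heap)
        for s in syms1 + syms2:
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, syms1 + syms2))
    return lengths

move_indices = [0, 0, 1, 0, 2, 0, 1, 3, 0, 1, 0, 5, 0, 2, 1, 0]  # toy data
print(huffman_code_lengths(Counter(move_indices)))
# Frequent indices get short codes, rare ones long codes.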
Regardless, I agree that doing the variable-length encoding is probably not going to be a win over using a simple preprocessor and then a standard compression algorithm.

I'm surprised you found no help by using a good predictor for compressing the bitbases though. How was the prediction accuracy? Were you using it as input into the arithmetic coding model? Or as a pre-process step?
I tried shallow searches of up to 3 plies (I think) and didn't get much improvement. It already does a one-ply search, i.e. it stores only the white-to-move or black-to-move bitbases, not both. Taking both 5-men bitbases, I may have got some improvement, but dropping one of the sides gave no improvement in size. That is without even considering the penalty of doing a 3-ply search. I am sure many would shy away from using Scorpio bitbases if I added a 3-ply search predictor. SEE is better in terms of speed but obviously gives worse results.
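A minimal sketch of that one-ply idea, assuming the python-chess library; probe_btm below stands in for the real bitbase lookup (here it is faked with a checkmate test, so probe_wtm degenerates into a mate-in-1 detector). Only the black-to-move table needs to be stored, and white-to-move values are reconstructed by trying each white move and probing the stored side.

# Sketch: store only the black-to-move table and derive white-to-move values
# with a one-ply search over it.
import chess

def probe_btm(board):
    # Stand-in for a lookup in the stored black-to-move table:
    # True means black, to move, is lost.
    return board.is_checkmate()

def probe_wtm(board):
    # White to move wins iff some white move reaches a black-to-move
    # position that the stored side scores as lost for black.
    assert board.turn == chess.WHITE
    for move in board.legal_moves:
        board.push(move)
        lost_for_black = probe_btm(board)
        board.pop()
        if lost_for_black:
            return True
    return False

board = chess.Board("6k1/5ppp/8/8/8/8/8/R5K1 w - - 0 1")  # back-rank mate in 1
print(probe_wtm(board))  # True: Ra8# reaches a lost black-to-move position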
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Compression of chess databases

Post by diep »

Edmund wrote:Say I have a collection of games I want to store in a database using as little memory as possible.

Because the move choice in the opening is very restricted, I would like to save the first couple of moves by linking into a table with n entries.

Is there any algorithm that would tell me how to optimally populate the table with opening moves?

It sort of reminds me of Huffman encoding, but it is slightly different.
What was posted so far is all 1980s stuff.

You can go 2 ways. Either use a clever format that stores things in not too many bits and be happy with that, or really search for compression algorithms.

The known compression algorithms of course hammer down the size of such a database in a manner that no clever format will be able to match.

For anything I did in computer chess, especially bit-tables, but also games, the clearly superior algorithm is PPM*, with the * (asterisk) standing for all sorts of variations on the PPM algorithm. It is the multidimensional compression idea of PPM that kicks so much butt. Literally following it is never a good idea in fields like this; it's about the spirit and an open mind.

Now realize a big winner for many modern compressors is sophisticated, huge dictionary sizes. This is tricky to use, as it means you need to decompress everything from the start.

The bad news is that there aren't many compressors that can decompress blockwise and still achieve good compression.

Yet if you combine that with PPM*, you'll beat the hell out of anything compression-wise in computer chess, be it games, be it EGTBs.

By the way, in such supercompression attempts, don't go by the compression benchmark lists first. Those turned out totally inferior for my EGTBs. Usually someone with some parameter tuning can get the very tiny benchmark set really tiny, but when I tried the same settings, the resulting compressed Diep files were factors bigger.

Supercompression versus normal fast compression makes a huge difference in size for huge databases. It is really a huge difference; a factor of 4 is quite common.

I had one of the guys hosting a supercompression website (I might recall the chap's name if you're interested) compare all compressors on a few smaller Diep EGTBs (around 300 MB uncompressed). I was quite shocked to see how an old program out of New Zealand kicked the hell out of anything I thought was good :)

Realize, however, the big problem: all those compressors aren't much faster to decompress than they compress.

If it takes 24 hours on one core to compress, say, 200 MB, it will also take half a day to decompress that 200 MB.

So it would be quite a challenge to show up with something good that isn't just doing some braindead run-length encoding.

The idea of ordering the moves a bit is actually not so stupid. That will get you magnificent compression even with relatively braindead compression algorithms.

What's relatively fast is the Kadatch compressor. If you send him an email, maybe he wants to contribute it. It uses old methods, yet it's very fast and it can work blockwise, so you can unpack just a 64 KB block, for example, which is probably what you want, without needing to decompress everything else, and it kicks the hell out of old slow junk like gzip, bzip2 and such.
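To make the blockwise idea concrete, a small sketch using Python's lzma module; the 64 KB block size and the container layout are assumptions for illustration. Each block is compressed independently and an offset table is kept, so any single block can be decompressed without touching the rest of the file, at some cost in ratio compared to one big stream.

# Sketch: compress data in independent 64 KB blocks and keep an offset table
# so that any block can be decompressed on its own (random access).
import lzma

BLOCK = 64 * 1024

def compress_blocks(data):
    blobs = [lzma.compress(data[i:i + BLOCK]) for i in range(0, len(data), BLOCK)]
    offsets, pos = [], 0
    for b in blobs:
        offsets.append(pos)
        pos += len(b)
    return b"".join(blobs), offsets, pos

def read_block(container, offsets, total, index):
    end = offsets[index + 1] if index + 1 < len(offsets) else total
    return lzma.decompress(container[offsets[index]:end])

data = bytes([i % 7 for i in range(5 * BLOCK + 123)])      # toy "database"
container, offsets, total = compress_blocks(data)
assert read_block(container, offsets, total, 3) == data[3 * BLOCK:4 * BLOCK]
print(len(data), "->", total, "bytes in", len(offsets), "independent blocks")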

The interesting thing would be to build a PPM compressor that takes a huge time to compress and stores things in a simple-to-decompress manner. Of course that will eat some overhead, but you still end up with 25% of the size of most other compressors and half the size of 7-zip.

Note that 7-zip does a pretty good job, yet of course it uses a dictionary, and I bet that's a problem for you :)

Note that compression is a field much like computer chess: you can get anything you want, customized as you like, as long as you are prepared to pay. If you want something that's world class it won't cost that much more, and you'll kick butt with your code then. You get what you pay for.

Want the world's strongest engine? No problem. Just pay someone a salary. Peanuts. A big range of programmers can deliver it to you then; just know whom you pick.

Usually they pick the guy who never invented an algorithm for that - a bad idea. Same with compression...

Without payment, be happy with 1980s algorithms combined a bit, and except for Kadatch I don't know any compressor that can compress buckets very well. You can try to adapt 7-zip to it, but I'm not sure that's a good idea: when it loses the dictionary I bet it will suffer, since it compresses things fast.

If you want to roll something yourself, realize that a PPM-type algorithm that works per bucket and can decompress fast, while taking a huge time to compress, would be something you could really sell to companies storing data...
User avatar
phhnguyen
Posts: 1437
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

rvida wrote: ...
Resulting data should still be compressible with standard compression algorithms like LZMA / Bzip2 etc.
I am not interested in the result of standard compression for the database (6-8 bits / move). It reduces the whole database by only around 10%. It may gain less if it is compressed in chunks.

Does anyone have better compression results?
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Compression of chess databases

Post by jwes »

phhnguyen wrote:
rvida wrote: ...
Resulting data should still be compressible with standard compression algorithms like LZMA / Bzip2 etc.
I am not interested in the result of standard compression for the database (6-8 bits / move). It reduces the whole database by only around 10%. It may gain less if it is compressed in chunks.

Does anyone have better compression results?
The real question is "How are you going to use this compressed database?"
You could use different layouts and compression schemes depending on acceptable access times.
User avatar
phhnguyen
Posts: 1437
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: Compression of chess databases

Post by phhnguyen »

jwes wrote:
phhnguyen wrote:
rvida wrote: ...
Resulting data should still be compressible with standard compression algorithms like LZMA / Bzip2 etc.
I am not interested in the result of standard compression for the database (6-8 bits / move). It reduces the whole database by only around 10%. It may gain less if it is compressed in chunks.

Does anyone have better compression results?
The real question is "How are you going to use this compressed database?"
You could use different layouts and compression schemes depending on acceptable access times.
Actually I am trying to answer the first question, "Can we compress the database to a better size?" If not, there is no point in going further.

For me, reducing the size by 10% is not enough when compressing the whole database (using a commercial app to get an estimate). At least 30% is needed to be worth considering.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Compression of chess databases

Post by syzygy »

phhnguyen wrote:
jwes wrote:The real question is "How are you going to use this compressed database?"
You could use different layouts and compression schemes depending on acceptable access times.
Actually I am trying to answer the first question, "Can we compress the database to a better size?" If not, there is no point in going further.
The answer to that question is practically always "yes". The real question is the one that J. Wesley Cleveland poses.
User avatar
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: Compression of chess databases

Post by rvida »

phhnguyen wrote: It reduces the whole database by only around 10%. It may gain less if it is compressed in chunks.
Does anyone have better compression results?
10% of what? What was the original format and which compression method did you use?