syzygy request for info

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: syzygy request for info

Post by elcabesa »

I'm just studying your code for pleasure and it looks like a very interesting compression logic for a very specific domain.
As you can understand reading code is important to understand details but a documentation about the format, algorithm used or design choices can gave us a deeper understanding of your beautiful work
Since I wasn't able to found any kind of this documentation, I wanted to know if I missed it in this wide ocean called internet :)

up to now this is what i think to have understood.
you use simmetric transformation to reduce the number of positions to be stored, and you use an negamax like search to fuirther reduce the number of positions for winning capture. You encode the files using Huffman coding.
I still have to study it but it's a very beautiful work
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: syzygy request for info

Post by syzygy »

elcabesa wrote: Fri Jun 08, 2018 11:07 pm you use simmetric transformation to reduce the number of positions to be stored, and you use an negamax like search to fuirther reduce the number of positions for winning capture. You encode the files using Huffman coding.
And in between are a search for a permutation of the pieces that improves compression and a process of building a dictionary consisting of a tree of paired symbols. The latter is based on Re-Pair compression.

Interesting papers:
http://papersdb.cs.ualberta.ca/~papersd ... ases10.pdf
http://www.larsson.dogma.net/dcc99.pdf
http://www.eecs.harvard.edu/~michaelm/E210/huffman.pdf
https://jespertk.dk/master_thesis/report/report.pdf
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: syzygy request for info

Post by elcabesa »

thank you for the papers, I'll try to study them.

have you ever thinked about using arithmetic coding instead of Huffman? and in case you have rejected it, why? is it complex/slower to decomress? it does't give enough compression boost?
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: syzygy request for info

Post by syzygy »

elcabesa wrote: Sat Jun 09, 2018 1:16 am thank you for the papers, I'll try to study them.

have you ever thinked about using arithmetic coding instead of Huffman? and in case you have rejected it, why? is it complex/slower to decomress? it does't give enough compression boost?
It is a lot slower than Huffman, and the Re-Pair step ensures that the symbol frequencies are such that Huffman will do OK.

I have recently had another at variations of arithmetic coding like "Finite State Entropy" coding. The problem is that those approaches tend to need rather large tables which might not perform well. The probing code needs to decompress only about 32 bytes at a time, so those tables cannot be assumed to be in the L1 cache (and may not be in the CPU cache at all).

Huffman decoding basically just needs a small table with one integer per codeword length. This is also its weakness: each symbol is encoded with an integral number of bits. But if you in some way want to allow a fractional number of bits per symbol, you cannot avoid needing a table that somehow encodes that fractional number for each symbol. And decoding will somehow have to loop through that table (with about 4k elements) or it will need to do a lookup in a large table for mapping a relatively large integer to the right symbol. (16k elements if you want to be able to assign fractional codeword lengths to the 4k symbols with a resolution of 1/16th bit.)

Having said that, it would be interesting to know how much theoretically could be saved by using an arithmetic coding variant.

Maybe I am too pessimistic about the performance of modern arithmetic coding variants. One would really need to measure it in a realistic setting (so in a chess engine doing a deep search in the late middlegame and getting lots of TB hits in a large variety of TBs).
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: syzygy request for info

Post by elcabesa »

I have no hands-on experience with tablebase or aritmetic coding so at the moment it was just curiosity.
Sesse
Posts: 300
Joined: Mon Apr 30, 2018 11:51 pm

Re: syzygy request for info

Post by Sesse »

syzygy wrote: Sat Jun 09, 2018 2:04 am Having said that, it would be interesting to know how much theoretically could be saved by using an arithmetic coding variant.

Maybe I am too pessimistic about the performance of modern arithmetic coding variants. One would really need to measure it in a realistic setting (so in a chess engine doing a deep search in the late middlegame and getting lots of TB hits in a large variety of TBs).
Just a simple Huffman → arithmetic change would be easy to calculate if you have the data. Huffman needs, for each symbol i (with probability p_i), sum(-p_i * ceil(log2(p_i))) bits per symbol. Arithmetic is the same, just drop the ceil.

However, arithmetic coding (and its modern cousin rANS) has another advantage that's not immediately obvious; it's trivial to switch probability distributions whenever you want. This opens up for adaptive models (e.g. if the bit encoded was a 1, increase p_1 a bit), and also for having different distributions depending on context.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: syzygy request for info

Post by syzygy »

Sesse wrote: Sat Jun 09, 2018 4:03 pm
syzygy wrote: Sat Jun 09, 2018 2:04 am Having said that, it would be interesting to know how much theoretically could be saved by using an arithmetic coding variant.

Maybe I am too pessimistic about the performance of modern arithmetic coding variants. One would really need to measure it in a realistic setting (so in a chess engine doing a deep search in the late middlegame and getting lots of TB hits in a large variety of TBs).
Just a simple Huffman → arithmetic change would be easy to calculate if you have the data. Huffman needs, for each symbol i (with probability p_i), sum(-p_i * ceil(log2(p_i))) bits per symbol. Arithmetic is the same, just drop the ceil.
Huffman does better than that, since not all lengths will be rounded up. My generator already outputs "raw" Huffman size before the compressed data is split over 32- or 64-byte blocks. It would be easy to add a similar estimation for arithmetic coding. In practice, one will have to accept suboptimal precision (like 1/16th of a bit accuracy), but Huffman would certainly be beaten in terms of compression ratio. The question is if it is worth the slowdown in this particular use case. (That question is easier to answer for DTZ tables which only need to be probed at the root.)
However, arithmetic coding (and its modern cousin rANS) has another advantage that's not immediately obvious; it's trivial to switch probability distributions whenever you want. This opens up for adaptive models (e.g. if the bit encoded was a 1, increase p_1 a bit), and also for having different distributions depending on context.
But to make use of that one would need to come up with a completely new compression scheme. For WDL, I could imagine a run-length type scheme where the probabilities depend on the last couple of symbols, but I don't think it will be easy to come up with a good scheme. That scheme should also be able to cope with W50 and L50 (so 5 values in total). If anyone can beat the current recursive-pairing based scheme, I will be interested. (What it means to beat the current scheme is not entirely well defined, though.)
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: syzygy request for info

Post by elcabesa »

Just question for you Ronald, what are the symbols that you pass to the huffman compressor? W,CW,D,CL,L; DTZ? or you take the generated table and create a statistic model from the bynary stream?
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: syzygy request for info

Post by syzygy »

elcabesa wrote: Sat Jun 09, 2018 6:12 pm Just question for you Ronald, what are the symbols that you pass to the huffman compressor? W,CW,D,CL,L; DTZ? or you take the generated table and create a statistic model from the bynary stream?
The 4K symbols produced by the recursive-pairing compression step. Each symbol is either a table value or a pair of other symbols.
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: syzygy request for info

Post by elcabesa »

thank you, I have tu study the code, and read the papers :)