## syzygy request for info

**Moderators:** bob, hgm, Harvey Williamson


### Re: syzygy request for info

I'm just studying your code for pleasure, and it looks like a very interesting compression scheme for a very specific domain.

As you can understand, reading code is important for understanding the details, but documentation about the format, the algorithms used, or the design choices would give us a deeper understanding of your beautiful work.

Since I wasn't able to find any such documentation, I wanted to know whether I had missed it in this wide ocean called the internet.

Up to now, this is what I think I have understood: you use symmetric transformations to reduce the number of positions to be stored, you use a negamax-like search to further reduce the number of positions for winning captures, and you encode the files using Huffman coding.

I still have to study it, but it's a very beautiful work.

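The symmetry reduction mentioned above can be illustrated with a toy sketch: for pawnless positions the board has 8 symmetries (mirrors and rotations), so each position only needs to be stored in one canonical orientation. This is an illustrative sketch in Python, not Syzygy's actual indexing; the square numbering (a1 = 0, h8 = 63) and function names are assumptions.

```python
def transforms(sq):
    """Return sq under each of the 8 board symmetries
    (identity, horizontal/vertical mirrors, diagonal flips)."""
    f, r = sq % 8, sq // 8
    coords = [(f, r), (7 - f, r), (f, 7 - r), (7 - f, 7 - r),
              (r, f), (7 - r, f), (r, 7 - f), (7 - r, 7 - f)]
    return [x + 8 * y for (x, y) in coords]

def canonical(squares):
    """Map a list of piece squares to its lexicographically
    smallest image under the 8 symmetries, so all symmetric
    versions of a position collapse to one stored entry."""
    images = []
    for i in range(8):
        images.append(tuple(transforms(sq)[i] for sq in squares))
    return min(images)
```

Any position and its mirror image then index the same table entry, which cuts the stored position count by up to a factor of 8 for pawnless endgames.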

### Re: syzygy request for info

And in between are a search for a permutation of the pieces that improves compression and a process of building a dictionary consisting of a tree of paired symbols. The latter is based on Re-Pair compression.
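The Re-Pair idea can be sketched in a few lines: repeatedly replace the most frequent adjacent pair of symbols with a fresh symbol, recording the pairing rule in a dictionary. This is a toy Python illustration of the general algorithm, not the Syzygy generator's implementation; the function names and the 4096-symbol cap are assumptions.

```python
from collections import Counter

def repair(seq, max_symbols=4096):
    """Toy Re-Pair: repeatedly replace the most frequent adjacent
    pair with a new symbol. Returns (sequence, rules), where rules
    maps each new symbol -> (left, right)."""
    seq = list(seq)
    rules = {}
    next_sym = max(seq) + 1
    while next_sym < max_symbols:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair worth replacing
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and seq[i] == a and seq[i + 1] == b:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        rules[next_sym] = (a, b)
        seq = out
        next_sym += 1
    return seq, rules

def expand(sym, rules):
    """Recursively expand a symbol back to its original values:
    each symbol is either a raw value or a pair of other symbols."""
    if sym not in rules:
        return [sym]
    a, b = rules[sym]
    return expand(a, rules) + expand(b, rules)
```

The resulting symbols (raw values plus pairing rules) are then what gets Huffman-coded.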

Interesting papers:

http://papersdb.cs.ualberta.ca/~papersd ... ases10.pdf

http://www.larsson.dogma.net/dcc99.pdf

http://www.eecs.harvard.edu/~michaelm/E210/huffman.pdf

https://jespertk.dk/master_thesis/report/report.pdf

### Re: syzygy request for info

Thank you for the papers, I'll try to study them.

Have you ever thought about using arithmetic coding instead of Huffman? And in case you rejected it, why? Is it more complex/slower to decompress? Does it not give enough of a compression boost?


### Re: syzygy request for info

It is a lot slower than Huffman, and the Re-Pair step ensures that the symbol frequencies are such that Huffman will do OK.

I have recently had another look at variations of arithmetic coding like "Finite State Entropy" coding. The problem is that those approaches tend to need rather large tables, which might not perform well. The probing code needs to decompress only about 32 bytes at a time, so those tables cannot be assumed to be in the L1 cache (and may not be in the CPU cache at all).

Huffman decoding basically just needs a small table with one integer per codeword length. This is also its weakness: each symbol is encoded with an integral number of bits. But if you in some way want to allow a fractional number of bits per symbol, you cannot avoid needing a table that somehow encodes that fractional number for each symbol. And decoding will somehow have to loop through that table (with about 4k elements) or it will need to do a lookup in a large table for mapping a relatively large integer to the right symbol. (16k elements if you want to be able to assign fractional codeword lengths to the 4k symbols with a resolution of 1/16th bit.)
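The small-table decoding described above works with canonical Huffman codes: one symbol count and one base code per codeword length is enough to decode. This is an illustrative Python sketch of that standard technique, not the actual Syzygy decoder; all names are made up.

```python
def build_tables(lengths):
    """From per-symbol code lengths, build the tiny per-length tables
    a canonical Huffman decoder needs: for each length, the first
    code of that length and the index of its first symbol."""
    max_len = max(lengths.values())
    count = [0] * (max_len + 1)
    for l in lengths.values():
        count[l] += 1
    # symbols in canonical order: sorted by (length, symbol)
    syms = sorted(lengths, key=lambda s: (lengths[s], s))
    first_code = [0] * (max_len + 1)
    first_sym = [0] * (max_len + 1)
    code, idx = 0, 0
    for l in range(1, max_len + 1):
        first_code[l] = code
        first_sym[l] = idx
        code = (code + count[l]) << 1
        idx += count[l]
    return syms, count, first_code, first_sym

def decode_one(bits, tables):
    """Decode one symbol from a list of bits using only the
    per-length tables. Returns (symbol, bits consumed)."""
    syms, count, first_code, first_sym = tables
    code = 0
    for l, bit in enumerate(bits, 1):
        code = (code << 1) | bit
        if l < len(count) and count[l] and code - first_code[l] < count[l]:
            return syms[first_sym[l] + (code - first_code[l])], l
    raise ValueError("incomplete code")
```

The tables scale with the maximum codeword length, not with the number of symbols, which is why this fits comfortably in L1 cache.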

Having said that, it would be interesting to know how much theoretically could be saved by using an arithmetic coding variant.

Maybe I am too pessimistic about the performance of modern arithmetic coding variants. One would really need to measure it in a realistic setting (so in a chess engine doing a deep search in the late middlegame and getting lots of TB hits in a large variety of TBs).

### Re: syzygy request for info

I have no hands-on experience with tablebases or arithmetic coding, so at the moment it was just curiosity.

### Re: syzygy request for info

> **syzygy wrote:** ↑Sat Jun 09, 2018 12:04 am
> Having said that, it would be interesting to know how much theoretically could be saved by using an arithmetic coding variant.
>
> Maybe I am too pessimistic about the performance of modern arithmetic coding variants. One would really need to measure it in a realistic setting (so in a chess engine doing a deep search in the late middlegame and getting lots of TB hits in a large variety of TBs).

Just a simple Huffman → arithmetic change would be easy to calculate if you have the data. Huffman needs, for each symbol i (with probability p_i), sum(p_i * ceil(-log2(p_i))) bits per symbol. Arithmetic is the same, just drop the ceil.
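The estimate described in this post can be computed directly from symbol frequencies. A small Python sketch (the function name is an assumption); note the rounded-up figure is an upper bound on Huffman, since a real Huffman code does somewhat better:

```python
from math import ceil, log2

def bits_per_symbol(freqs):
    """Estimate average bits/symbol two ways from raw frequencies:
    'rounded' = sum p_i * ceil(-log2 p_i), an upper bound on Huffman;
    'ideal'   = sum p_i * -log2 p_i, the entropy that ideal
    arithmetic coding approaches."""
    total = sum(freqs.values())
    rounded = ideal = 0.0
    for f in freqs.values():
        p = f / total
        rounded += p * ceil(-log2(p))
        ideal += p * -log2(p)
    return rounded, ideal
```

For a skewed distribution like {a: 3, b: 1} the gap is visible (1.25 vs about 0.81 bits/symbol); for power-of-two probabilities the two coincide and arithmetic coding gains nothing.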

However, arithmetic coding (and its modern cousin rANS) has another advantage that's not immediately obvious: it's trivial to switch probability distributions whenever you want. This opens the door to adaptive models (e.g. if the bit encoded was a 1, increase p_1 a bit), and also to having different distributions depending on context.
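The adaptive-model idea mentioned here can be sketched with the classic shift-based probability update that many arithmetic/rANS coders use: after each bit, nudge the estimated probability toward the value just seen. A minimal sketch, with the 12-bit fixed-point scale and learning-rate shift chosen arbitrarily for illustration:

```python
class AdaptiveBitModel:
    """Minimal adaptive binary model: p1 estimates P(bit = 1) in
    12-bit fixed point (0..4096), updated after every coded bit."""
    def __init__(self, shift=5):
        self.p1 = 1 << 11       # start at P(1) = 0.5
        self.shift = shift      # smaller shift = faster adaptation

    def update(self, bit):
        # move p1 a fraction of the way toward the observed bit
        if bit:
            self.p1 += (4096 - self.p1) >> self.shift
        else:
            self.p1 -= self.p1 >> self.shift
```

An encoder would code each bit against the current `p1` and then call `update`, so runs of identical values quickly become cheap to encode.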

### Re: syzygy request for info

> **Sesse wrote:** ↑Sat Jun 09, 2018 2:03 pm
> Just a simple Huffman → arithmetic change would be easy to calculate if you have the data. Huffman needs, for each symbol i (with probability p_i), sum(p_i * ceil(-log2(p_i))) bits per symbol. Arithmetic is the same, just drop the ceil.

Huffman does better than that, since not all lengths will be rounded up. My generator already outputs "raw" Huffman size before the compressed data is split over 32- or 64-byte blocks. It would be easy to add a similar estimation for arithmetic coding. In practice, one will have to accept suboptimal precision (like 1/16th of a bit accuracy), but Huffman would certainly be beaten in terms of compression ratio. The question is if it is worth the slowdown in this particular use case. (That question is easier to answer for DTZ tables, which only need to be probed at the root.)

> **Sesse wrote:**
> However, arithmetic coding (and its modern cousin rANS) has another advantage that's not immediately obvious; it's trivial to switch probability distributions whenever you want. This opens up for adaptive models (e.g. if the bit encoded was a 1, increase p_1 a bit), and also for having different distributions depending on context.

But to make use of that one would need to come up with a completely new compression scheme. For WDL, I could imagine a run-length type scheme where the probabilities depend on the last couple of symbols, but I don't think it will be easy to come up with a good scheme. That scheme should also be able to cope with W50 and L50 (so 5 values in total). If anyone can beat the current recursive-pairing based scheme, I will be interested. (What it means to beat the current scheme is not entirely well defined, though.)

### Re: syzygy request for info

Just a question for you, Ronald: what are the symbols that you pass to the Huffman compressor? W, CW, D, CL, L? DTZ? Or do you take the generated table and create a statistical model from the binary stream?

### Re: syzygy request for info

The 4K symbols produced by the recursive-pairing compression step. Each symbol is either a table value or a pair of other symbols.

### Re: syzygy request for info

Thank you, I have to study the code and read the papers.