mcostalba wrote: Browsing through the sources I realized that there is a lot of 'content' in this code: Huffman coding, combinatorics, special encodings, piece configurations, etc.
And then there is the generator, with some ideas that have no counterpart in the probing code.
Can I ask how much time it took to design and write all this? Did you start from scratch or use some reference material (books, internet, other TB sources)?
I think I wrote a 4-piece generator not too long after Steven Edwards released his tables. I think mainly because I did not like his indexing scheme having (unused) room for pawns on ranks 1 and 8. This must have been in 1998 or maybe 1999.
In 2000, I generated all 4-piece tables for suicide chess (there are many more of them than for regular chess, because the king is a regular piece and you can promote to it). I think they were DTM and they had some primitive Huffman compression (fixed number of values in a fixed-size block; not very good because different parts may have very different compression ratios). I had 13 separate encoding functions (also covering 4-piece suicide piece combinations).
The generator was based on the regular move generator. I'm not sure if it used any retrograde analysis.
In 2005, I rewrote everything and generated 5-piece tables for regular and suicide chess. They were WDL and DTZ (but the 50-move rule was not taken into account). I decided I needed a compression scheme with a fixed (per-table) dictionary generated "offline". (Compression schemes like Lempel-Ziv generate a dictionary as the input is being compressed/decompressed, but that only works well with large enough blocks and I wanted to keep the blocks small.) I found this nice paper with a very promising title by Larsson and Moffat:
Offline Dictionary-Based Compression. It introduces "RE-PAIR" (recursive pairing) compression. I did not implement their linear-time algorithm as it seemed to require too much memory, but I did use their basic idea.
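For illustration, here is a minimal Python sketch of the recursive-pairing idea (the naive quadratic version, not the paper's linear-time algorithm; all names are mine, not the generator's):

```python
from collections import Counter

def repair(seq):
    """Naive RE-PAIR: repeatedly replace the most frequent adjacent
    pair of symbols with a fresh symbol until no pair occurs twice.
    Returns the compressed sequence and the rule dictionary."""
    rules = {}                      # new symbol -> (left, right)
    next_sym = max(seq) + 1
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):         # replace non-overlapping occurrences
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, next_sym = out, next_sym + 1
    return seq, rules

def expand(sym, rules):
    """Recursively expand a symbol back into original symbols."""
    if sym not in rules:
        return [sym]
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)
```

The point of the "offline" dictionary is that the rules are stored once per table, so each small block can then be decoded independently.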
Another possibility would have been a run-length encoding scheme as used in the Chinook tables (for checkers). I don't think that would have worked very well for DTZ values and I'm not sure how well it would work for WDL in chess. The RE-PAIR scheme in a way is not so different from RLE but more flexible (certainly more flexible than a fixed RLE-encoding).
The 2005 implementation used "don't cares" for illegal positions and, in suicide chess, for positions with captures (capturing being mandatory in suicide chess). The Chinook tables also do the latter (capturing is also mandatory in checkers). I don't know if I took the idea from a Chinook paper or if it was obvious to do anyway. The RE-PAIR scheme made it relatively easy to make use of "don't cares" to improve compression.
Hmm, maybe you can read more about it in this paper.

(Well, maybe it includes an acknowledgement. But the references suggest it does not.)
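The way don't cares help a pair-based compressor can be sketched in a few lines of Python (a hypothetical illustration with None marking a don't-care slot, not the generator's actual heuristic):

```python
def fill_dont_cares(values):
    """A don't-care slot (None) may be given ANY value without
    corrupting the table, so fill it with the previous real value:
    this lengthens runs and creates more repeated pairs for the
    compressor to exploit."""
    out = list(values)
    prev = None
    for i, v in enumerate(out):
        if v is None:
            out[i] = prev           # leading Nones stay None for now
        else:
            prev = v
    # back-fill any leading don't cares from the first real value
    i = 0
    while i < len(out) and out[i] is None:
        i += 1
    for j in range(i):
        out[j] = out[i] if i < len(out) else 0
    return out
```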
I also looked around for a good Huffman decoding algorithm and found a paper explaining canonical Huffman codes and how to decode them efficiently.
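A small Python sketch of canonical Huffman codes (codes of equal length are consecutive integers, so only the code lengths need to be stored; the dict lookup here stands in for the more efficient first-code-per-length comparison, and the names are mine):

```python
def canonical_codes(symbol_lengths):
    """Assign canonical codes from (symbol, code_length) pairs."""
    syms = sorted(symbol_lengths, key=lambda sl: (sl[1], sl[0]))
    codes, code, prev_len = {}, 0, 0
    for sym, length in syms:
        code <<= (length - prev_len)   # extend to the next code length
        codes[sym] = (code, length)
        code += 1
        prev_len = length
    return codes

def decode(bits, codes):
    """Decode a bit sequence by accumulating bits until the
    (value, length) pair matches an assigned code."""
    inv = {vl: s for s, vl in codes.items()}
    out, acc, n = [], 0, 0
    for b in bits:
        acc, n = (acc << 1) | b, n + 1
        if (acc, n) in inv:
            out.append(inv[(acc, n)])
            acc, n = 0, 0
    return out
```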
The file format then stored a fixed number of values in variable-sized blocks (of up to 256 bytes) with a complicated indexing structure. I had 51 separate encoding functions (covering also suicide chess piece combinations) which fully took into account all symmetries. The generator also tried to find a good ordering of the pieces to improve compression, but the encoding functions did not leave as much flexibility for reordering as the current more generic encoding functions. The idea to try different orderings came from this nice
master's thesis by Jesper Kristensen. I'm not so sure that ODDBs are really useful, but it has many good ideas. (The idea of using "don't cares" for broken/illegal positions is there too, I just noticed.)
The generator was still based on the regular move generator, but used some retrograde analysis to keep track of the positions that it should look at in the next iteration. It used 2 bytes per position, which meant that 5-piece tables with pawns required about 1.5 GB of RAM.
In 2008, there was an interesting discussion on the CCRL forum that made me realise how the 50-move rule could be taken into account without losing the possibility to convert "cursed" wins. It involved 5 values for WDL and a somewhat complicated scoring of DTZ. The 5 values for WDL would go well with my compression scheme.
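Roughly, the five values and their relation to DTZ look like this (a hypothetical sketch in the convention that later became standard for these tables; the real scoring of DTZ is more involved, as noted above):

```python
# The five WDL values: a "cursed win" is won without the 50-move rule
# but only drawn with it; a "blessed loss" is the mirror case.
LOSS, BLESSED_LOSS, DRAW, CURSED_WIN, WIN = -2, -1, 0, 1, 2

def wdl_from_dtz(dtz):
    """Hypothetical classifier: dtz = signed plies to the next zeroing
    move (capture or pawn move) on the best line, 0 meaning a draw.
    A win needing more than 100 plies to the next zeroing move cannot
    be converted before the 50-move rule kicks in."""
    if dtz == 0:
        return DRAW
    if dtz > 0:
        return WIN if dtz <= 100 else CURSED_WIN
    return LOSS if dtz >= -100 else BLESSED_LOSS
```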
In September 2011 I finally decided to do another rewrite of my generator, this time taking into account the 50-move rule and doing 6-piece tables. 32 GB of RAM in a PC was already realistic then.
The generator was rewritten from scratch, this time with:
- a straightforward indexing scheme that allows the generator to work on the index directly instead of setting up a position and using a regular move generator;
- multi-threading;
- 1 byte per position;
- pawn tables generated file by file;
- much smarter retrograde analysis;
- an interesting trick to deal with tables for which DTZ does not fit in 1 byte (save the intermediate result to disk, "reduce" the values and continue).
In addition, the generator keeps track of capture information during generation (e.g. is the position drawn by a capture?), which is then used to improve compression (don't cares and partial don't cares).
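The "reduce" trick might look roughly like this (a hypothetical sketch; the real generator writes the finished values to disk rather than keeping them in a dict):

```python
def reduce_step(table, shift):
    """One 'reduce' pass: DTZ values below `shift` are finished and
    frozen out (saved to disk in the real generator); the remaining
    values are shifted down so further iterations again fit in one
    byte per position. None marks positions not yet resolved."""
    frozen = {i: v for i, v in enumerate(table)
              if v is not None and v < shift}
    remaining = [v - shift if v is not None and v >= shift else None
                 for v in table]
    return frozen, remaining
```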
The compressor converts the generated data from the simple indexing scheme to a more compact indexing scheme (after determining a good order for the pieces) and does all the mapping, sorting, etc. and also uses the don't care information (like winning captures, drawing captures, as I explained earlier). DTZ now does not include information already in the WDL table, improving compression a lot. I also improved the heuristics for filling in (partial) don't cares. The new file format and index structure (variable number of values in fixed-size blocks) are also better.
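With a variable number of values per fixed-size block, finding the block that holds a given position index takes an index table; a minimal sketch (hypothetical layout, not the actual file format) of the lookup:

```python
import bisect

def locate(idx, block_first):
    """block_first[b] holds the index of the first value stored in
    block b (blocks are fixed-size on disk but hold a variable number
    of values). Binary search finds the block containing `idx`; the
    second return value is the offset of the value within that block,
    used while sequentially decoding the block."""
    b = bisect.bisect_right(block_first, idx) - 1
    return b, idx - block_first[b]
```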
I generated my 6-piece tables in September 2012. So the generator was created over a period of a year, but obviously with the help of the earlier experiences (and little fragments of code may have survived here and there).
In addition to the papers I mentioned I have of course read Ken Thompson's and Lewis Stiller's papers. And a lot is written here and there in newsgroups and internet forums.