mcostalba wrote: Browsing through the sources I realized that there is a lot of 'content' in this code: Huffman coding, combinatorics, special encodings, piece configurations, etc.
And then there is the generator, with some ideas that have no counterpart in the probing code.
Can I ask how much time it took to design and write all this? Did you start from scratch or use some reference material (books, internet, other TB sources)?
I think I wrote a 4-piece generator not too long after Steven Edwards released his tables. I think mainly because I did not like his indexing scheme having (unused) room for pawns on ranks 1 and 8. This must have been in 1998 or maybe 1999.
In 2000, I generated all 4-piece tables for suicide chess (there are many more of them than for regular chess, because the king is a regular piece and you can promote to it). I think they were DTM and they had some primitive Huffman compression (fixed number of values in a fixed-size block; not very good because different parts may have very different compression ratios). I had 13 separate encoding functions (also covering 4-piece suicide piece combinations).
The generator was based on the regular move generator. I'm not sure if it used any retrograde analysis.
In 2005, I rewrote everything and generated 5-piece tables for regular and suicide chess. They were WDL and DTZ (but the 50-move rule was not taken into account). I decided I needed a compression scheme with a fixed (per-table) dictionary generated "offline". (Compression schemes like Lempel-Ziv generate a dictionary as the input is being compressed/decompressed, but that only works well with large enough blocks and I wanted to keep the blocks small.) I found this nice paper with a very promising title by Larsson and Moffat:
Offline Dictionary-Based Compression. It introduces "RE-PAIR" (recursive pairing) compression. I did not implement their linear-time algorithm as it seemed to require too much memory, but I did use their basic idea.
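For illustration, here is a minimal Python sketch of the recursive-pairing idea (the naive quadratic version, not the paper's linear-time algorithm; all names are mine, not the generator's):

```python
from collections import Counter

def repair(seq):
    """Naive RE-PAIR: repeatedly replace the most frequent adjacent
    pair of symbols with a fresh symbol until no pair occurs twice.
    Returns the compressed sequence and the rule dictionary."""
    rules = {}                      # new symbol -> (left, right)
    next_sym = max(seq) + 1
    while True:
        pairs = Counter(zip(seq, seq[1:]))
        if not pairs:
            break
        pair, freq = pairs.most_common(1)[0]
        if freq < 2:
            break
        rules[next_sym] = pair
        out, i = [], 0
        while i < len(seq):         # replace non-overlapping occurrences
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(next_sym)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq, next_sym = out, next_sym + 1
    return seq, rules

def expand(sym, rules):
    """Recursively expand a symbol back into original symbols."""
    if sym not in rules:
        return [sym]
    left, right = rules[sym]
    return expand(left, rules) + expand(right, rules)
```

The point of the "offline" dictionary is that the rules are stored once per table, so each small block can then be decoded independently.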
Another possibility would have been a run-length encoding scheme as used in the Chinook tables (for checkers). I don't think that would have worked very well for DTZ values and I'm not sure how well it would work for WDL in chess. The RE-PAIR scheme in a way is not so different from RLE but more flexible (certainly more flexible than a fixed RLE-encoding).
The 2005 implementation used "don't cares" for illegal positions and, in suicide chess, for positions with captures (capturing being mandatory in suicide chess). The Chinook tables also do the latter (capturing is also mandatory in checkers). I don't know if I took the idea from a Chinook paper or if it was obvious to do anyway. The RE-PAIR scheme made it relatively easy to make use of "don't cares" to improve compression.
Hmm, maybe you can read more about it in this paper.

(Well, maybe it includes an acknowledgement. But the references suggest it does not.)
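The way don't cares help a pair-based compressor can be sketched in a few lines of Python (a hypothetical illustration with None marking a don't-care slot, not the generator's actual heuristic):

```python
def fill_dont_cares(values):
    """A don't-care slot (None) may be given ANY value without
    corrupting the table, so fill it with the previous real value:
    this lengthens runs and creates more repeated pairs for the
    compressor to exploit."""
    out = list(values)
    prev = None
    for i, v in enumerate(out):
        if v is None:
            out[i] = prev           # leading Nones stay None for now
        else:
            prev = v
    # back-fill any leading don't cares from the first real value
    i = 0
    while i < len(out) and out[i] is None:
        i += 1
    for j in range(i):
        out[j] = out[i] if i < len(out) else 0
    return out
```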
I also looked around for a good Huffman decoding algorithm and found a paper explaining canonical Huffman codes and how to decode them efficiently.
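A small Python sketch of canonical Huffman codes (codes of equal length are consecutive integers, so only the code lengths need to be stored; the dict lookup here stands in for the more efficient first-code-per-length comparison, and the names are mine):

```python
def canonical_codes(symbol_lengths):
    """Assign canonical codes from (symbol, code_length) pairs."""
    syms = sorted(symbol_lengths, key=lambda sl: (sl[1], sl[0]))
    codes, code, prev_len = {}, 0, 0
    for sym, length in syms:
        code <<= (length - prev_len)   # extend to the next code length
        codes[sym] = (code, length)
        code += 1
        prev_len = length
    return codes

def decode(bits, codes):
    """Decode a bit sequence by accumulating bits until the
    (value, length) pair matches an assigned code."""
    inv = {vl: s for s, vl in codes.items()}
    out, acc, n = [], 0, 0
    for b in bits:
        acc, n = (acc << 1) | b, n + 1
        if (acc, n) in inv:
            out.append(inv[(acc, n)])
            acc, n = 0, 0
    return out
```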
The file format then stored a fixed number of values in variable-sized blocks (of up to 256 bytes) with a complicated indexing structure. I had 51 separate encoding functions (covering also suicide chess piece combinations) which fully took into account all symmetries. The generator also tried to find a good ordering of the pieces to improve compression, but the encoding functions did not leave as much flexibility for reordering as the current more generic encoding functions. The idea to try different orderings came from this nice
master's thesis by Jesper Kristensen. I'm not so sure that ODDBs are really useful, but it has many good ideas. (The idea of using "don't cares" for broken/illegal positions is there too, I just noticed.)
The generator was still based on the regular move generator, but used some retrograde analysis to keep track of the positions that it should look at in the next iteration. It used 2 bytes per position, which meant that 5-piece tables with pawns required about 1.5 GB of RAM.
In 2008, there was an interesting discussion on the CCRL forum that made me realise how the 50-move rule could be taken into account without losing the possibility to convert "cursed" wins. It involved 5 values for WDL and a somewhat complicated scoring of DTZ. The 5 values for WDL would go well with my compression scheme.
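Roughly, the five values and their relation to DTZ look like this (a hypothetical sketch in the convention that later became standard for these tables; the real scoring of DTZ is more involved, as noted above):

```python
# The five WDL values: a "cursed win" is won without the 50-move rule
# but only drawn with it; a "blessed loss" is the mirror case.
LOSS, BLESSED_LOSS, DRAW, CURSED_WIN, WIN = -2, -1, 0, 1, 2

def wdl_from_dtz(dtz):
    """Hypothetical classifier: dtz = signed plies to the next zeroing
    move (capture or pawn move) on the best line, 0 meaning a draw.
    A win needing more than 100 plies to the next zeroing move cannot
    be converted before the 50-move rule kicks in."""
    if dtz == 0:
        return DRAW
    if dtz > 0:
        return WIN if dtz <= 100 else CURSED_WIN
    return LOSS if dtz >= -100 else BLESSED_LOSS
```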
In September 2011 I finally decided to do another rewrite of my generator, this time taking into account the 50-move rule and doing 6-piece tables. 32 GB of RAM in a PC was already realistic then.
The generator was rewritten from scratch, this time with:
- a straightforward indexing scheme that allows the generator to work on the index directly instead of setting up a position and using a regular move generator;
- multi-threading;
- 1 byte per position;
- pawn tables generated file by file;
- much smarter retrograde analysis;
- an interesting trick to deal with tables for which DTZ does not fit in 1 byte (save the intermediate result to disk, "reduce" the values and continue).
In addition, the generator keeps track of capture information during generation (e.g. is the position drawn by a capture?), which is then used to improve compression (don't cares and partial don't cares).
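The "reduce" trick might look roughly like this (a hypothetical sketch; the real generator writes the finished values to disk rather than keeping them in a dict):

```python
def reduce_step(table, shift):
    """One 'reduce' pass: DTZ values below `shift` are finished and
    frozen out (saved to disk in the real generator); the remaining
    values are shifted down so further iterations again fit in one
    byte per position. None marks positions not yet resolved."""
    frozen = {i: v for i, v in enumerate(table)
              if v is not None and v < shift}
    remaining = [v - shift if v is not None and v >= shift else None
                 for v in table]
    return frozen, remaining
```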
The compressor converts the generated data from the simple indexing scheme to a more compact indexing scheme (after determining a good order for the pieces) and does all the mapping, sorting, etc. and also uses the don't care information (like winning captures, drawing captures, as I explained earlier). DTZ now does not include information already in the WDL table, improving compression a lot. I also improved the heuristics for filling in (partial) don't cares. The new file format and index structure (variable number of values in fixed-size blocks) are also better.
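With a variable number of values per fixed-size block, finding the block that holds a given position index takes an index table; a minimal sketch (hypothetical layout, not the actual file format) of the lookup:

```python
import bisect

def locate(idx, block_first):
    """block_first[b] holds the index of the first value stored in
    block b (blocks are fixed-size on disk but hold a variable number
    of values). Binary search finds the block containing `idx`; the
    second return value is the offset of the value within that block,
    used while sequentially decoding the block."""
    b = bisect.bisect_right(block_first, idx) - 1
    return b, idx - block_first[b]
```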
I generated my 6-piece tables in September 2012. So the generator was created over a period of a year, but obviously with the help of the earlier experiences (and little fragments of code may have survived here and there).
In addition to the papers I mentioned I have of course read Ken Thompson's and Lewis Stiller's papers. And a lot is written here and there in newsgroups and internet forums.