Syzygy bases from memory

Re: Syzygy bases from memory

Post by hgm »

Pio wrote: Wed Jun 30, 2021 1:22 am
The number of file accesses with my method will be approximately 20 for looking up whether the position is a draw or not but each read is independent of the other and could be done in parallel.
Memory accesses in such a large data set (which exceeds cache size) are quite expensive. Anything more than 1 should be considered very undesirable. This was also my criticism of binary search through a sorted list of indexes.
I know how Huffman encoding works (since I have implemented it myself using min-heap and binary tree) and run-length encoding is trivial but I do not see how you use it.

Please tell me how much you can compress the 4 billion positions of which we are only interested in the 1% draws and how many accesses to the file system you need and if they can be done in parallel.

Many thanks!!!
About compressing:

The first thing that came to my mind (and I do not doubt there exist better methods) for encoding sparse sets was run-length encoding: when the positions of interest (say wins/losses) are randomly scattered through the EGT with density d, the frequency of N draws followed by a win/loss is d*(1-d)^N. I.e. it decreases geometrically with N. For d ~ 10% it approximately halves each time N increases by 6. Factors of 2 in frequency are important for designing Huffman encodings, where ideally you would use one extra bit for encoding something that is half as frequent. 6 is a bit of an unlucky number, though, and we cannot expect perfect compression for every possible density anyway, so let us round it to 8.

With 3 bits we could encode run lengths 0-7. But then we would have no codes left for run length 8 and larger. Run lengths 8-15 are approximately half as frequent as 0-7, though (0.43 times as frequent when d=0.1, to be exact). So we would want to use a code for those with one more bit, and for 16-23 two more bits, etc. So we could use a Huffman code where 1 stands for 8 draws, and 0nnn stands for 0-7 draws (as indicated by the nnn) followed by a win/loss. This requires one 4-bit code for every win/loss, which on average will also specify about 3 draws; the remaining draws (about 60% of the positions, 6 per win/loss) will be encoded as groups of 8 through a single bit, i.e. taking 1/8 bit each, so that the average size per win/loss is 4 + 6/8 ≈ 4.75 bit. That is about 0.48 bit per position at d=0.1, which is close to the information-theoretical value.
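To make that concrete, here is a minimal sketch of such a coder. It is purely illustrative, not any actual EGT format; the BitStream helpers and the names encode() and decodeCode() are made up:

#include <stdint.h>

typedef struct { uint8_t *buf; long bitPos; } BitStream;  /* buf must be zero-initialized */

static void putBit(BitStream *s, int b) {
    if(b) s->buf[s->bitPos >> 3] |= 0x80 >> (s->bitPos & 7);
    s->bitPos++;
}

static int getBit(BitStream *s) {
    int b = s->buf[s->bitPos >> 3] >> (7 - (s->bitPos & 7)) & 1;
    s->bitPos++;
    return b;
}

/* pos[i] = 1 for win/loss, 0 for draw; assumes the sequence ends in a win/loss */
void encode(const uint8_t *pos, long n, BitStream *out) {
    int run = 0;                                  /* draws since the last win/loss */
    for(long i = 0; i < n; i++) {
        if(pos[i] == 0) {                         /* draw: extend the run */
            if(++run == 8) putBit(out, 1), run = 0;   /* '1' = 8 draws */
        } else {                                  /* win/loss: emit '0nnn' */
            putBit(out, 0);
            putBit(out, run >> 2 & 1), putBit(out, run >> 1 & 1), putBit(out, run & 1);
            run = 0;
        }
    }
}

/* decode one code; returns the number of draws it represents,
   and sets *winLoss if those draws are followed by a win/loss  */
int decodeCode(BitStream *in, int *winLoss) {
    if(getBit(in)) { *winLoss = 0; return 8; }    /* '1' = 8 draws, no win/loss */
    int n = getBit(in);
    n = n << 1 | getBit(in);
    n = n << 1 | getBit(in);
    *winLoss = 1;                                 /* '0nnn' = n draws + a win/loss */
    return n;
}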

A disadvantage is that there is no direct mapping of an index to a bit (or bit group) in the compressed set. I think that actual EGT formats ameliorate this by dividing the EGT into sections large enough not to deviate too much from the average statistics, but otherwise as small as possible. On a probe they then only decompress that (complete) section. When the sections have variable length, a 'directory table' can indicate where in the data set the section for a given index range starts.
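A probe into such a sectioned table could then look roughly like the sketch below, reusing the BitStream and decodeCode() from the previous sketch; SECTION_SIZE, the directory layout and the function name are again just assumptions:

#define SECTION_SIZE 4096   /* positions per section (assumed) */

/* returns 1 if 'index' is a win/loss, 0 if it is a draw */
int probeSection(long index, const long *directory, uint8_t *data) {
    long section = index / SECTION_SIZE;
    long offset  = index % SECTION_SIZE;             /* position within the section */
    BitStream in = { data, directory[section] };     /* bit offset where the section starts */
    long decoded = 0;
    for(;;) {                                        /* decode codes until we reach 'offset' */
        int winLoss, draws = decodeCode(&in, &winLoss);
        if(offset < decoded + draws) return 0;       /* it is one of the draws */
        decoded += draws;
        if(winLoss && decoded++ == offset) return 1; /* it is the win/loss itself */
    }
}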

About accesses:

If d ~ 1%, as in your example, you can use 100 bits per listed position before you need more space than the uncompressed EGT. The information content for d=0.01 is 0.08 bit per position, which still leaves 8 bits per listed position.
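(For reference, the information content follows from the binary entropy function H(d) = -d*log2(d) - (1-d)*log2(1-d); H(0.01) ≈ 0.066 + 0.014 ≈ 0.08 bit, and dividing by d gives the ~8 bits per listed position mentioned above.)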

What I would do is put the positions of interest in a hash table. In your example the positions are represented by a 32-bit 'key'. Let's take a hash table with 1M buckets, and use the 20 low-order bits of the key to map a position to a bucket. The positions that map to a bucket can then be identified by their upper 12 bits (a 1.5-byte 'signature'). A cache line of 64 bytes could then contain 42 such signatures of the positions we want to list that map into that bucket, so that the total table can hold 42M positions. Which incidentally is about the number of positions you want to list (1% of 4 billion ≈ 40M).
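In code the mapping could be as simple as this (names purely illustrative):

unsigned bucketOf(unsigned key)    { return key & 0xFFFFF; }  /* low 20 bits: bucket number (1M buckets) */
unsigned signatureOf(unsigned key) { return key >> 20; }      /* high 12 bits: the stored 'signature'    */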

But the positions map randomly, so not every bucket will get exactly 40 positions. The number of positions mapping into a bucket will have a Poisson distribution with average 40, and thus standard deviation sqrt(40) ≈ 6.3. So I double the number of buckets to 2M. On average each bucket will now get 20 positions, with standard deviation sqrt(20) ≈ 4.5. So we would have to be unlucky by 4.7 standard deviations to get more than 41 positions in one bucket. The positions that do not fit can be stored by another method; there should be very, very few of those, so it doesn't really matter which method. When, on probing, we then fail to find the sought signature in a completely filled bucket, we continue the search in that 'overflow table'.
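The overflow estimate is easy to verify by just summing the Poisson tail; a throwaway check (numbers as above):

#include <math.h>
#include <stdio.h>

int main(void) {
    double lambda = 20.0;        /* average number of positions per bucket */
    double p = exp(-lambda);     /* P(X = 0)                               */
    double tail = 1.0;
    for(int k = 0; k <= 41; k++) {
        tail -= p;               /* remove P(X = k) from the tail          */
        p *= lambda / (k + 1);   /* P(X = k+1) from P(X = k)               */
    }
    printf("P(more than 41 positions in a bucket) = %g\n", tail);  /* ~1e-5 */
    return 0;
}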

The buckets themselves are organized as small hash tables. Let us put 41 entries in a bucket (because that is a prime), and hash the given 12-bit signature to a number between 1 and 40. This determines the number of the entry in the bucket where it should be stored. Of course there will be collisions, so if we find a different signature there we should look for it in another entry; only if we find an invalid signature can we conclude that the signature is not in the bucket. We use 'add-the-hash' rehashing (modulo 41) to step through the bucket. This guarantees probing of the entire bucket (because the size is prime). Since buckets are on average half empty, we should on average stumble on an invalid signature after two tries, and can then conclude the position is not in the list.

Since the signatures occupy a fractional number of bytes, packing them would be necessary. To prevent the unpacking from slowing down the probing, the signatures can be split into a byte (stored in bytes 0-40 of the bucket) plus a 4-bit 'extension' (packed in pairs in bytes 41-61). Initially we only compare the first byte for a match. In 99.6% of the cases that already excludes a match, so we don't have to examine the extension bits at all. If the byte does match we know where to find the corresponding extension, and have to test that for a match too.
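Putting the last two paragraphs together, a probe could look roughly like the sketch below. Everything in it is an assumption for illustration: I keep the 20-bit bucket number / 12-bit signature split of the 1M-bucket example for simplicity (with 2M buckets the bucket number takes one more bit of the key), store the high 8 bits of each signature in bytes 0-40 and the 4-bit extensions in bytes 41-61, reserve the byte value 0xFF to mark an empty entry (keys whose signature starts with 0xFF would then have to go into the overflow table), and assume a hypothetical probeOverflowTable() as fallback.

#include <stdint.h>

int probeOverflowTable(uint32_t key);     /* hypothetical secondary lookup */

/* table = buckets of one 64-byte cache line each;
   returns 1 if the position is in the list, 0 if it is not */
int probePosition(uint32_t key, const uint8_t (*table)[64])
{
    const uint8_t *bucket = table[key & 0xFFFFF]; /* low 20 bits select the bucket   */
    unsigned sig  = key >> 20;                    /* 12-bit signature                */
    uint8_t  high = sig >> 4;                     /* first byte, stored in byte 0-40 */
    uint8_t  ext  = sig & 0xF;                    /* 4-bit extension, in byte 41-61  */
    int h = sig % 40 + 1;                         /* in-bucket hash, 1..40           */
    int i = h;
    for(int tries = 0; tries < 41; tries++) {
        uint8_t stored = bucket[i];
        if(stored == 0xFF) return 0;              /* invalid entry: not in the list  */
        if(stored == high) {                      /* first byte matches (rare)       */
            uint8_t nibbles   = bucket[41 + i/2];                  /* two packed extensions */
            uint8_t storedExt = (i & 1 ? nibbles : nibbles >> 4) & 0xF;
            if(storedExt == ext) return 1;        /* full 12-bit match: listed       */
        }
        i = (i + h) % 41;                         /* 'add-the-hash' rehash           */
    }
    return probeOverflowTable(key);               /* bucket was completely filled    */
}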

This approach required a 128MB hash table, somewhat larger than your Bloom filter. But that is very much coincidental, because the numbers pan out a bit unfavorably here. By abandoning powers of 2 we could tweak the table size to better match the number of positions that need to be stored, to get a better average filling fraction of the buckets. That of course comes at the expense of a higher probability of overflow. But in the example this was unnecessarily low to non-existent, while in practice having 1% of the buckets overflow and putting the positions that did not fit into a 100-times-smaller secondary table would be entirely acceptable.