Tablebase suggestion

syzygy
Posts: 5892
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

syzygy wrote: Wed Feb 04, 2026 3:52 pm So it seems that if 58 K-slices can fit in RAM, which corresponds to about 300 GB, the K-slice approach should be as efficient as the regular approach that requires 2.38 TB of RAM. With less RAM you have to be a bit more careful to do things in a clever order.
In the case of black moving and the wK being fixed, RAM for 19 K-slices should be enough to not lose efficiency.

Ignoring the wK, we let the bK zig-zag a1-h1-h2-a2-a3-h3...-a8. We load the K-slices that we need and we drop those that we no longer need.
When we are in g2, we have loaded 19 slices (a1-h1, a2-h2, f3, g3, h3).
[d]8/8/8/8/8/5kkk/kkkkkkkk/kkkkkkkk b - - 0 1
Then at f2 we drop h1 and load e3. So still 19 slices.
[d]8/8/8/8/8/4kkkk/kkkkkkkk/kkkkkkk1 b - - 0 1
And it seems it continues this way, i.e. 19 slices at once in RAM means we don't need to reload anything.
The presence of the wK somewhere in a1-d1-d4 only reduces the number of slices you need to load. (And if the wK is on the diagonal, the bK only needs to visit the lower half of the board including the diagonal, so even less to do.)
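A quick way to sanity-check the 19-slice claim is to simulate the sliding window directly. The sketch below is a toy model, not generator code: it assumes that the step for the bK on a given square needs that square's slice plus the slices of all king-adjacent squares resident in RAM, that a slice is loaded at its first use and dropped after its last use, and that the wK is ignored as above. Under those assumptions it reports the peak number of resident slices, both for the zig-zag order and for a plain row-by-row order.
[code]
# Toy model of the sliding window of bK K-slices (wK ignored).  Assumption:
# the step for bK on square s needs resident the slices of s and of every
# square a king can move to from s; a slice is loaded at first use and
# dropped after its last use.

def neighbors(sq):
    f, r = sq % 8, sq // 8
    return {nf + 8 * nr
            for nf in range(max(0, f - 1), min(8, f + 2))
            for nr in range(max(0, r - 1), min(8, r + 2))}

def peak_resident(order):
    need = [neighbors(sq) for sq in order]
    last_use = {}
    for i, slices in enumerate(need):
        for q in slices:
            last_use[q] = i               # remember each slice's last use
    resident, peak = set(), 0
    for i, slices in enumerate(need):
        resident |= slices                # load whatever this step still misses
        peak = max(peak, len(resident))
        resident = {q for q in resident if last_use[q] > i}   # drop finished slices
    return peak

row_by_row = list(range(64))                               # a1..h1, a2..h2, ...
zigzag = [8 * r + (f if r % 2 == 0 else 7 - f)
          for r in range(8) for f in range(8)]             # a1..h1, h2..a2, a3..h3, ...
print(peak_resident(zigzag), peak_resident(row_by_row))    # 19 19
[/code]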

The case of white moving in a1-d1-d4 with the bK "fixed" in an orbit seems more complicated.
But can't we still do exactly the same?
wK starts in a1, moves to d1. Then when it moves to e1, the board flips horizontally and the wK is back on d1 but the black king is in another location. The wK continues to a1, then moves up (corresponding to h1-h2). Instead of ending up in a2, the board flips diagonally and the wK is now b1 and then moves up to b2. If I'm not mistaken, the loaded king slices remain as valid as before. It just looks a lot more chaotic.
If this indeed works, then 19 K-slices are again enough.
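To make the flips concrete, here is a small illustration (with my own square numbering a1 = 0 ... h8 = 63, not actual generator code) of how the walking wK is reflected back into the a1-d1-d4 triangle, with the same reflection applied to the bK. It reproduces the two cases mentioned above: e1 mirrors back to d1, and a2 flips along the diagonal to b1.
[code]
# Illustration of the reflections (not generator code).  Squares are 0..63 with
# a1 = 0, h1 = 7, a8 = 56.  When the wK is on the a1-h8 diagonal, a full
# canonicalization would also look at the bK to pick a half; omitted here.

def flip_h(sq): return sq ^ 0x07                     # mirror files a<->h
def flip_v(sq): return sq ^ 0x38                     # mirror ranks 1<->8
def flip_d(sq): return ((sq >> 3) | (sq << 3)) & 63  # mirror along a1-h8

def normalize(wk, bk):
    if wk & 0x04:                    # wK on files e-h: flip left-right
        wk, bk = flip_h(wk), flip_h(bk)
    if wk & 0x20:                    # wK on ranks 5-8: flip top-bottom
        wk, bk = flip_v(wk), flip_v(bk)
    if (wk >> 3) > (wk & 7):         # wK above the diagonal: flip along it
        wk, bk = flip_d(wk), flip_d(bk)
    return wk, bk

E1, A2, D6 = 4, 8, 43
print(normalize(E1, D6))   # (3, 44): wK back on d1, bK mirrored from d6 to e6
print(normalize(A2, D6))   # (1, 29): diagonal flip puts the wK on b1, bK on f4
[/code]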

19 K-slices take up 97.9 GB, so 128 GB of RAM would be comfortable.
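For reference, these sizes follow if a K-slice (both kings fixed) covers the 62·61·60·59·58·57 placements of the six remaining distinct pieces at 1 bit per position, with GB/TB read as binary units. That interpretation is my own reconstruction, but it reproduces the figures used in this thread:
[code]
# Assumed slice model: both kings fixed, six remaining (distinct) pieces on the
# 62 other squares, 1 bit per position; GB/TB read as binary units.
positions_per_slice = 62 * 61 * 60 * 59 * 58 * 57    # 44,261,653,680
slice_bytes = positions_per_slice // 8               # ~5.15 GiB per K-slice
GiB, TiB = 2**30, 2**40
print(19 * slice_bytes / GiB)    # ~97.9  -> the 19 K-slices above
print(58 * slice_bytes / GiB)    # ~298.8 -> "about 300 GB" for 58 K-slices
print(462 * slice_bytes / TiB)   # ~2.32  -> all 462 K-slices of one side
[/code]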
syzygy
Posts: 5892
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

Ender8pieces wrote: Wed Feb 04, 2026 5:29 pm
syzygy wrote: Wed Feb 04, 2026 3:58 pm
Disk I/O will be the limiting factor, so compression of sparse bitmaps will speed things up.
...
A few hundred full-drive writes will finish the SSD.
It is difficult to be certain without the raw numbers, but I’d like to address the points as candidly as possible:

While Disk I/O is indeed the primary bottleneck, compression has a double-edged effect. It might save bandwidth, but as you noted, it significantly impacts drive endurance and can cause slowdowns due to redundant writes (Write Amplification). A potential middle ground could be running compressed during the initial, data-heavy iterations and switching to raw/uncompressed once the updates become sparse.
With compression you'll also need to write a lot less. Splitting things up into 462 K-slices may indeed have the advantage that fewer K-slices need to be rewritten in later iterations. In principle each K-slice could be further subdivided (on disk) to increase this effect.

I guess your point is that keeping things uncompressed on SSD allows you to only rewrite the SSD blocks that have changed data (this would only be relevant for the full W and B tables). But SSD blocks are quite large, and rewriting a single byte means rewriting the whole block. Anyway, in my view HDDs are preferred here, and then compression is a total win. And this is just an implementation detail.
If we assume 1 byte per position in a raw file, writing to a single byte within a 4KB sector necessitates writing the entire sector.
Note that SSD blocks are much bigger than 4 KB, at least 128 KB and nowadays apparently often more than 512 KB. And you have to make sure that the OS understands what you're doing. You can't just write the changed byte, you'll have to know which portions of the file correspond to a physical block and write that block in one go.
In the early iterations, this is frequent, making the difference between compressed and raw formats negligible in terms of disk wear.
Compressed is much smaller. Big difference.
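To give an idea of the orders of magnitude in the sparse case, here is a toy gap/varint encoding of a bitmap's set bits. It is purely an illustration of why sparse bitmaps compress so well, not a proposal for the actual on-disk format:
[code]
# Toy compression of a sparse bitmap: store the gaps between consecutive 1-bits
# as variable-length integers.  Purely illustrative.
import random

def varint(n):                       # 7 bits per byte, high bit = "more follows"
    out = bytearray()
    while True:
        out.append((n & 0x7F) | (0x80 if n > 0x7F else 0))
        n >>= 7
        if not n:
            return bytes(out)

def gap_encode(set_bits):            # set_bits: sorted indices of the 1-bits
    prev, out = -1, bytearray()
    for i in set_bits:
        out += varint(i - prev)
        prev = i
    return bytes(out)

nbits = 10_000_000                                         # 1.25 MB raw bitmap
ones = sorted(random.sample(range(nbits), nbits // 1000))  # 0.1% of bits set
print(len(gap_encode(ones)), "bytes vs", nbits // 8, "bytes raw")  # ~20 KB vs 1.25 MB
[/code]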
syzygy
Posts: 5892
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

syzygy wrote: Wed Feb 04, 2026 5:45 pm Ignoring the wK, we let the bK zig-zag a1-h1-h2-a2-a3-h3...-a8.
I think zig-zag does not add anything. We can also do a1->h1, a2->h2, a3->h3, etc.
Can we show that 19 is optimal?

If we have RAM for 20 K-slices, loading/saving can be overlapped with calculating more easily.
Ender8pieces
Posts: 15
Joined: Wed Jan 21, 2026 9:16 am
Full name: JS

Re: Tablebase suggestion

Post by Ender8pieces »

Maybe considering SSD vs. HDD or raw vs. compressed is somewhat premature. These are implementation details that depend on the specific task and hardware budget. One could even consider a hybrid approach: for example running the initial, heavy iterations on HDDs to preserve flash endurance, and then switching to Enterprise SSDs for the sparse, late-game iterations where random access latency becomes the primary bottleneck.

I haven’t fully delved into the 5D transition analysis yet, but it is fascinating to see the problem converge into a well-defined, abstract mathematical challenge. It reminds me of cache-locality optimizations in other fields. Techniques like Hilbert space-filling curves, which are commonly used to linearize multi-dimensional data while preserving locality, might be highly relevant here to minimize "jumps" between slices, though I haven't mapped it out specifically for this case.
syzygy
Posts: 5892
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

Ender8pieces wrote: Wed Feb 04, 2026 8:11 pm where random access latency becomes the primary bottleneck.
I don't see random disk access becoming the bottleneck. There will be disk seeks to access each of the 462 parts of a table, but that is not much.

I expect the bottleneck to be sequential disk throughput. Multiple drives in a RAID configuration would help. An SSD or multiple SSDs would help even more, but as mentioned the amounts of data being written are very considerable. Without compression, the largest pawnless tables are 37.2 TB with 1 byte per position (some might need 2 bytes/position). Compression helps, but you can't expect a miracle.
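Rough arithmetic on that, with an assumed (not measured) 250 MB/s of sustained sequential throughput per HDD:
[code]
# Back-of-the-envelope: one full sequential pass over the largest table.
# The 250 MB/s figure is an assumption for illustration only.
table_bytes = 37.2e12        # 37.2 TB at 1 byte per position, both sides
drive_rate = 250e6           # assumed sustained sequential bytes/s per HDD
print(table_bytes / drive_rate / 3600)   # ~41 hours per pass on a single drive
[/code]
With several drives in RAID this divides by the number of drives, and compression reduces the volume itself, but the generation needs many such passes.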

It would be interesting to know how Marc Bourzutschky's system stores the generated tables and the intermediate results. Almost certainly he did not keep the generated tables.

What we really need is new storage technology...
I haven’t fully delved into the 5D transition analysis yet, but it is fascinating to see the problem converge into a well-defined, abstract mathematical challenge. It reminds me of cache-locality optimizations in other fields. Techniques like Hilbert space-filling curves, which are commonly used to linearize multi-dimensional data while preserving locality, might be highly relevant here to minimize "jumps" between slices, though I haven't mapped it out specifically for this case.
I am reasonably sure that 19 K-slices as I described is optimal.
But it is still possible that the whole thing does not work because I overlooked something (or am hallucinating).
If it works, then the largest 8-piece pawnless tables can apparently be generated with somewhat reasonable efficiency on a "low-end" system with 128 GB of RAM. It would be good to have ECC memory, though.

Storing and distributing those tables will be a very major problem if the goal is to create a complete set. In fact, it really is a hopeless task if you also want to generate the pawnful tables.
But it might be nice to outdo Bourzutschky and generate KQRBvKQRN.
Ender8pieces
Posts: 15
Joined: Wed Jan 21, 2026 9:16 am
Full name: JS

Re: Tablebase suggestion

Post by Ender8pieces »

syzygy wrote: Wed Feb 04, 2026 9:44 pm ....
I am reasonably sure that 19 K-slices as I described is optimal.
But it is still possible that the whole thing does not work because I overlooked something (or am hallucinating).
If it works, then the largest 8-piece pawnless tables can apparently be generated with somewhat reasonable efficiency on a "low-end" system with 128 GB of RAM. It would be good to have ECC memory, though.

Storing and distributing those tables will be a very major problem if the goal is to create a complete set. In fact, it really is a hopeless task if you also want to generate the pawnful tables.
But it might be nice to outdo Bourzutschky and generate KQRBvKQRN.
As for Marc Bourzutschky’s work: from what I’ve gathered, Marc calculated multiple 8-piece pawnless tables, but not all of them, and he did not store them for distribution. Why did he stop there? Was it strictly a storage/IO limitation at the time? Also, why were single-pawn tables (7+1) or more complex 8-piece endings with pawns excluded? I couldn't find a definitive reference for his specific 8-man coverage and limitations, so any light you can shed on this would be invaluable.

As for the 19-slice model: to clarify the memory layout, are these 19 K-slices out of the standard 462, or 924 (turn-dependent), or 64? If you load bKa2, does that include all possible wK squares under symmetry?
syzygy
Posts: 5892
Joined: Tue Feb 28, 2012 11:56 pm

Re: Tablebase suggestion

Post by syzygy »

Ender8pieces wrote: Thu Feb 05, 2026 3:08 pm As for Marc Bourzutschky’s work: from what I’ve gathered, Marc calculated multiple 8-piece pawnless tables, but not all of them, and he did not store them for distribution. Why did he stop there?
The available data suggests that what stopped him was the generator's RAM requirement of 1 bit per position (a requirement I did not read anywhere, but which seems to fit with the data). KQRBvKQRN requires 2.32 TB, which does not fit in 1.5 TB and indeed seems not to have been generated. KQRRvKQRB requires only 1.16 TB, which does fit in 1.5 TB. He might have generated KQRBvKQRB because you can split it into positions with same-colored bishops and positions with opposite-colored bishops. He probably did not generate KQRNvKQRN.

If he indeed did not store the generated pawnless tables, that explains why he did not generate the pawnful tables (apart from the "opposing pawns" ones). Even if he did store the pawnless tables at least temporarily, the form in which they were stored may have been unsuitable to initialize promotions at the start of generating a pawnful table.
As for the 19-slice model: to clarify the memory layout, are these 19 K-slices out of the standard 462, or 924 (turn-dependent), or 64? If you load bKa2, does that include all possible wK squares under symmetry?
It seems it is enough to be able to hold in RAM 19 K-slices (of a single color) out of 2x462 K-slices at 1 bit per position. It has yet to be proven that this works, e.g. by someone providing an actual implementation. The amount of data to be streamed to/from disk would still be the same. If instead you can only hold 9 K-slices, it can still be made to work, but there would be a lot more streaming to/from disk.

Marc's generator seems to require holding 462 K-slices (of 1 color) in RAM.
I think his generator used Thompson's 1996 algorithm, which basically generates WDL plus DTZ for white losses only. You then need to run it again to get DTZ for black losses, and you need to do a 1-ply search to get DTZ for winning positions.

The Wu-Beal algorithm has the same RAM requirement but generates full DTZ (or with a bit more work DTM). I think according to the Wu-Beal paper it streams more data than Thompson's algorithm run twice (once for each side), but it also creates a more complete table.

In the end some of the information generated will be thrown away again during compression, so generating the complete table might be overkill. DTZ storing only losses but for both sides is conceptually quite nice because the 1-ply search to get DTZ for a winning position would only require you to check the winning moves, which might be few (and checking whether a move is winning can be done by probing WDL, which is cheap(er)). However, it is annoying for tables that are almost completely won for one side, because then (1) nearly every move will be winning, so there is no reduction in moves to search and, worse, (2) the side with the losses will typically have far more legal positions than the side with the wins -> compressed size is much larger than a half-sided syzygy DTZ table would be. (E.g. a randomly picked KQQQQvK position with white to move will probably be illegal because black is in check, whereas the same position with black to move will be legal because white is not in check (the btm position might be unreachable but that is typically not checked during generation).)
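As a sketch of that 1-ply search (purely conceptual; probe_wdl, probe_dtz_of_loss and the move helpers are placeholders, not an existing API, and cursed wins/blessed losses are ignored):
[code]
# Conceptual probe of a DTZ table that stores values for the losing side only
# (both colors).  All helper functions are hypothetical placeholders.
def probe_dtz(pos):
    wdl = probe_wdl(pos)
    if wdl < 0:
        return probe_dtz_of_loss(pos)     # losses are stored directly
    if wdl == 0:
        return 0                          # draw: no DTZ needed
    best = None                           # win: 1-ply search over winning moves
    for mv in legal_moves(pos):
        child = make_move(pos, mv)
        if probe_wdl(child) >= 0:         # opponent not lost -> not a winning move
            continue
        if is_zeroing(mv):                # zeroing winning move: DTZ = 1
            return 1
        dtz = probe_dtz_of_loss(child) + 1
        best = dtz if best is None else min(best, dtz)
    return best
[/code]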

19 K-slices require about 98 GB, so fit easily on a machine with 128 GB.
A machine with 256 GB of RAM could hold 2 sets of 19 K-slices (or 19 K-slices with 2 bits per position), which might help speed up the generation by modifying the algorithm to overlap some of the stages (my generator works very similarly to the Wu-Beal algorithm, but overlaps the second half of one iteration with the first half of the next iteration, which halves the number of iterations). This will need a better look.

But first the basic idea should be tested, e.g. on some 5- or 6-piece tables.