Ender8pieces wrote: ↑Thu Feb 05, 2026 8:20 pm
I’ve been running a quantitative check on the 19-slice flow, and I’d like to get your thoughts regarding the "Side to Move" (STM) dependency.
When we are processing a slice to be written for White to Move (WTM), we essentially iterate through all legal Black responses. However, when the iteration reaches the Black to Move (BTM) stage, we need to read those same source slices again to calculate the values for the other side.
So each slice is read twice per iteration, once for WTM and once for BTM, or we could hold 38 slices (19 for each side) out of the 924 total slices in RAM to avoid the re-read.
The first half of the iteration is:
Code: Select all
S1 = Load(W, WIN_IN_N(n));      // wtm positions that are win-in-n
R1 = Predecessor(S1);           // btm positions with a move into S1
S2 = Load(B, UNKNOWN);          // btm positions not yet resolved
S2 = S2 & R1;                   // unresolved candidates for loss-in-n
----
R2 = Load(W, WIN_IN<=N(n));     // wtm positions that are win-in-k, k <= n
S3 = ProveSuccessor(S2, R2);    // keep candidates whose moves all land in R2
B = Add(S3, LOSE_IN_N(n));      // record the proven btm losses
----
R3 = Predecessor(S3);           // wtm positions with a move into S3
S4 = Load(W, UNKNOWN);          // wtm positions not yet resolved
S4 = S4 & R3;                   // unresolved positions that win by moving into S3
W = Add(S4, WIN_IN_N(n+1));     // record the new wtm wins
(and then we do the same steps with W and B reversed)
R1 is the set of btm positions that lead, by a black move, to a wtm position that is a win-in-n. They are potentially loss-in-n. (This counts in moves; when generating DTZ it is better to count in plies, at least if you care about the 50-move rule.)
S2 is the subset of R1 of not-yet-resolved positions.
R2 is the set of wtm positions that are win-in-k with k<=n. R2 is used to check which positions in S2 are really loss-in-n.
S3 is that set of btm loss-in-n positions. (So S3 is a subset of S2, which is a subset of R1.)
R3 is the set of wtm positions that lead, by a white move, to a btm position that is a loss-in-n. Those that are not yet resolved (R3 & S4) are WIN_IN_(N+1).
[So R1 and R3 are constructed in RAM and then written to disk when they are dropped from memory. R2 is loaded from disk and then just dropped.]
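To make the streaming concrete, this is roughly what the S2 = Load(B, UNKNOWN) & R1 step looks like in C, assuming the slices are streamed as flat arrays of 64-bit words (the buffer layout and I/O here are only an illustration, not how any particular generator does it):
Code: Select all
#include <stdint.h>
#include <stdio.h>

/* AND the R1 candidate bitmap (built in RAM by Predecessor) with the
   UNKNOWN bitmap streamed in from B, producing S2 in place; then write
   the finished S2 slice to disk before the buffer is reused. */
static int make_s2_slice(uint64_t *r1, size_t nwords,
                         FILE *b_unknown, FILE *s2_out)
{
    uint64_t buf[4096];
    for (size_t done = 0; done < nwords; ) {
        size_t want = nwords - done < 4096 ? nwords - done : 4096;
        if (fread(buf, sizeof buf[0], want, b_unknown) != want)
            return -1;
        for (size_t i = 0; i < want; i++)
            r1[done + i] &= buf[i];   /* keep only not-yet-resolved candidates */
        done += want;
    }
    return fwrite(r1, sizeof r1[0], nwords, s2_out) == nwords ? 0 : -1;
}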
What you mean is that W is read in both the first and the second part of the iteration: first to create R1 (and from it S2), then again as R2.
However, we cannot use the R2 K-slices to calculate S3 from S2 until enough of S2 has been created.
For example, we would normally already have dropped the a1 K-slices before we calculate the c3 K-slice, but both a1 and c3 are needed to calculate the b2 K-slice of S3.
So you would need to keep the R2 K-slices quite a bit longer in memory. Judging from the a1-b2-c3 example, we need a1 up to d4, so 3*8+4 = 28 K-slices instead of 19 K-slices. So in total 19+28 = 47 K-slices simultaneously in RAM, which means 242.2 GB, so still doable if you have 256 GB.
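Just to make that count explicit, with the K-slices in row-major order by king square (a1, b1, ..., h1, a2, ...), a1 through d4 is indeed 28 slices:
Code: Select all
#include <stdio.h>

/* K-slice index of a king square in row-major order: a1=0, b1=1, ..., h8=63. */
static int slice_index(char file, char rank)
{
    return (file - 'a') + 8 * (rank - '1');
}

int main(void)
{
    /* R2 K-slices must stay resident from a1 until d4 has been handled. */
    int r2_window = slice_index('d', '4') - slice_index('a', '1') + 1;
    printf("R2 window: %d K-slices, resident total: %d\n",
           r2_window, 19 + r2_window);   /* prints 28 and 47 */
    return 0;
}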
But I think there is a better way of using more memory (2 bits per position):
Wu-Beal treats two plies at once by counting moves instead. This is problematic for DTZ50. Bourzutschky's generator does the same (see the readme.txt).
My generator counts in plies, but it works on two plies at once.
To translate this to Wu-Beal, it loads WIN_IN_(N-1) and WIN_IN_N into S1 (without distinguishing, so 1 bit per position).
The ProveSuccessor(S2, R2) step that produces S3 then checks whether a position marked as a potential loss is LOSS_IN_N or LOSS_IN_(N+1).
For the first part of the algorithm, this is free. (You just have a bigger subset in S1.)
The middle part is more tricky. Loading R2 from W is still the same amount of streaming (all of W) but you need two bits per position (or one 3-valued "trit" with 5 trits per byte). ProveSuccessor(S2,R2) is a bit more work but not that much. S3 will again be a 3-valued tritmap (or more simply 2 bits per position). Same for R3 and S4.
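On the trit option: 3^5 = 243 <= 256, so five trits do indeed fit in a byte. A minimal sketch of such packing (helper names made up):
Code: Select all
#include <stdint.h>
#include <assert.h>

static const int POW3[5] = { 1, 3, 9, 27, 81 };

/* Read trit i (0..4) from a packed byte. */
static int trit_get(uint8_t b, int i)
{
    return (b / POW3[i]) % 3;
}

/* Return the byte with trit i (0..4) replaced by v (0..2). */
static uint8_t trit_set(uint8_t b, int i, int v)
{
    return (uint8_t)(b + (v - trit_get(b, i)) * POW3[i]);
}

int main(void)
{
    uint8_t b = 0;
    b = trit_set(b, 0, 2);
    b = trit_set(b, 4, 1);
    assert(trit_get(b, 0) == 2 && trit_get(b, 4) == 1);
    return 0;
}
The division and modulo make lookups slower than plain 2-bit shifts, so the trits only pay off if the 20% space saving (5 instead of 4 positions per byte) really matters.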
Can we do more?
If we also load LOSS_IN_(N-1) and LOSS_IN_N as part of S1, the Predecessor(S1) operation also finds WIN_IN_N and WIN_IN_(N+1).
Suppose we have 2 bits per position in S1, so values 0-3:
0: neither LOSS_IN_(N-1)/N nor WIN_IN_(N-1)/N
1: one of the two wins
2: LOSS_IN_(N-1)
3: LOSS_IN_N
Now Predecessor(S1) constructs K-slices with values 0-3:
0: no predecessor
1: potential loss
2: WIN_IN_N
3: WIN_IN_(N+1)
All this provided the position is still unknown, so we still must AND with the UNKNOWN bitmap from B to obtain the S2 that goes to disk.
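Note that the encoding is chosen so that the value simply carries over from S1 to the predecessor: 1 -> potential loss, 2 -> WIN_IN_N, 3 -> WIN_IN_(N+1). The only subtlety is merging when one position is reached from several S1 positions; my reading (not necessarily how any existing generator does it) is that a proven win must beat a potential loss, and WIN_IN_N must beat WIN_IN_(N+1):
Code: Select all
/* Marks in the Predecessor(S1) output, as listed above:
   0 = no predecessor, 1 = potential loss, 2 = WIN_IN_N, 3 = WIN_IN_(N+1). */
static unsigned merge_mark(unsigned cur, unsigned add)
{
    /* Priority: WIN_IN_N > WIN_IN_(N+1) > potential loss > nothing. */
    static const unsigned prio[4] = { 0, 1, 3, 2 };
    return prio[add] > prio[cur] ? add : cur;
}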
Then, in the 2nd part of the iteration, we stream through S2 (while loading 2-bit R2 K-slices). Each win can be added directly to B. Each potential loss still has to be verified and its precise dtz determined; it is then added to B.
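In C that second part could look roughly like this, over a 2-bit packed S2 K-slice (add_win_to_B, add_loss_to_B and verify_loss are hypothetical hooks into the rest of the generator; verify_loss is the per-position ProveSuccessor against the resident 2-bit R2 K-slices):
Code: Select all
#include <stdint.h>
#include <stddef.h>

/* Hypothetical hooks into the rest of the generator. */
void add_win_to_B(uint64_t pos, int dtz);
void add_loss_to_B(uint64_t pos, int dtz);
int  verify_loss(uint64_t pos, int n);  /* returns n, n+1, or -1 if not a loss */

/* Resolve one S2 K-slice stored as 2-bit values: 0 = nothing,
   1 = potential loss, 2 = WIN_IN_N, 3 = WIN_IN_(N+1). */
void resolve_s2_slice(const uint8_t *s2, uint64_t base, size_t npos, int n)
{
    for (size_t i = 0; i < npos; i++) {
        unsigned v = (s2[i >> 2] >> ((i & 3) * 2)) & 3;
        if (v == 2 || v == 3)
            add_win_to_B(base + i, v == 2 ? n : n + 1); /* dtz already known */
        else if (v == 1) {
            int dtz = verify_loss(base + i, n);  /* all moves must land in R2 */
            if (dtz >= 0)
                add_loss_to_B(base + i, dtz);    /* LOSS_IN_N or LOSS_IN_(N+1) */
        }
    }
}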
Now the 3rd part of the iteration is the 1st part of the half iteration for the other side to move.
To make this work we have to stagger the dtz: n-1/n for wtm, then n/n+1 for btm, then n+1/n+2 for wtm. (As my generator does.)
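As a sketch of the resulting driver loop (do_half_iteration is a hypothetical stand-in for the three parts above):
Code: Select all
/* Hypothetical half-iteration: the three parts above for one side to move,
   working on the staggered dtz pair (n-1, n). */
void do_half_iteration(int wtm, int n_lo, int n_hi);

/* Staggered driver: wtm handles n-1/n, btm then handles n/n+1, and so on. */
void iterate(int max_n)
{
    int wtm = 1;
    for (int n = 1; n <= max_n; n++, wtm = !wtm)
        do_half_iteration(wtm, n - 1, n);
}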
This might be the best use of having sufficient memory for 2 bits per position in each of the 19 K-slices in RAM.
It shouldn't be too difficult to implement both strategies in the same generator.