Ender8pieces wrote: ↑Sun May 31, 2026 10:15 am
I observed that the K4Nk2r table took around 6 hours for 111 iterations (actually 2 tables with 48 and 63 iterations respectively).
Or even 222 iterations if you count win-in-N and loss-in-(N+1) as two iterations.
The main loop (until the 50-move rule):
Code: Select all
int n;
for (n = 1; n <= DRAW_RULE; n++) {
more_wb_next = more_lw && calc_L(WHITE, n, more_lw);
more_ww_next = more_lb && calc_L(BLACK, n, more_lb);
more_lb = more_ww && calc_W(WHITE, n, more_ww);
more_lw = more_wb && calc_W(BLACK, n, more_wb);
more_wb = more_wb_next;
more_ww = more_ww_next;
}
calc_L(stm, n) calculates stm losses with dtz=N, calc_W(stm,n) calculates stm wins with dtz=N.
This seems quite surprising since it does not include disk reads. If it did include disk reads, it should take ~1,000 seconds for 111 iterations on a disk with 10 GB/s throughput, assuming the database is 100 GB on disk. That is roughly 20 minutes, which is much less than 6 hours.
Compressed it is a lot less I/O than that. Uncompressed it is a lot more.
One bit per position for one side to move is 49.6 GB uncompressed, so I can see how you arrive at 111 x 100 GB.
But calc_W(wtm, N):
- reads all btm losses for dtz = N-1
- reads all cumulative wtm wins for dtz <= N-1
- writes all wtm wins for dtz = N
- updates all cumulative wtm wins for dtz <= N (*)
and calc_L(wtm, N):
- reads all btm wins for dtz = N-1
- writes all candidate wtm losses for dtz = N
- reads all candidate wtm losses for dtz = N
- reads all cumulative btm wins for dtz <= N-1 to check whether all successor positions of candidate wtm losses lose.
- must somehow also check that all captures from candidate wtm losses lose (there are several options for this). (**)
- writes all wtm losses for dtz = N.
(*) The straightforward way is to write back the updated cumulative wins at each calc_W() iteration. Unlike the dtz=N wins/losses slices, these cumulative wins slices do not typically get sparser for higher iterations, so compressing does not get faster. Since decompressing is faster than compressing, I do not write them back at every iteration but instead use the smaller (if compressed) dtz=N win slices as deltas and update the cumulative slices only when reading the deltas starts to get too much. (I learned that deciding when to update is an example of the "ski-rental-problem".)
(**) Checking captures can be done in multiple ways:
- by probing subtables over and over again. Very slow (but perhaps optimal once there are very few positions left to be resolved).
- by generating a separate bitmap with positions with a drawing capture (or one that leads to a cursed loss) and reading this bitmap in every calc_L() iteration (another 50 GB uncompressed). As far as calc_L() is concerned, this bitmap could be merged into the "cumulative wins" bitmap, but this does not work for calc_W().
- by generating smaller sub-K-slice bitmaps for the positions after the capture, basically an uncompressed form of tablebase once loaded into memory, and testing captures against those. This is what my pawnless generator does. (Adds a bit to RAM usage but not much, less than a 21st full K-slice.)
For pawnful tables, you also need to check pawn moves (including promotions). Now generating sub-K-slice bitmaps does not work because pushing a pawn does not remove a piece from the board and therefore does not drastically reduce the number of positions like a capture does. So the pawnful generator generates K-slice bitmaps for positions with a non-losing pawn move. Since these have to be read anyway during each calc_L() pass, I also add the positions with a non-losing capture to them, so now there is no need for the sub-K-slices.
So you easily get to 10 x 50 GB I/O per iteration, if uncompressed, of which 3 or 4 are write steps. If you choose 3, then you need to do many reads of full-sized uncompressed dtz=N win slices, so you should want to write back the cumulative wins at each iteration, i.e. 4 write steps.
So per iteration 200 GB written. With 111 iteration this is 22.2 TB. For a full-size TB, you can multiple the size of each slice by 48 and the number of iterations maybe by 5. That is 5328 TB. It seems the better consumer SSDs are nowadays rated for 600 TB write endurance. So you'll need 9 SSDs in the process and figure out how not to lose any data while replacing SSDs, for a single large 8-men TB. This is getting expensive.
So let's be clear: it is nonsensical to consider not using compression here. Let's just forget about that, I think we've discussed this enough. If you really want to burn SSDs, then take my code and remove the compression yourself.
I assume this ratio would be preserved in larger databases, so it seems the bottleneck is not the disk (hence more RAM won't help), but the retrograde analysis itself—which I assume is mostly limited by the RAM reads of all the neighboring positions.
With good compression, I/O may indeed not be a major bottleneck even with slow HDDs (but with slow HDDs it would help to do more scheduling of reads and writes so computation does not have to wait for I/O to finish).
The retrograde analysis does a lot of index calculations. The indexing scheme for generation that I use now is a lot more expensive than what I used in my old generator, but from the 7-piece generation effort I learned that memory bandwidth is the main limiting factor, especially as the number of available cores goes up.
Of course compression/decompression also takes time. This can still be tuned (different compression levels, different compression methods).
With a modest amount of extra RAM, it should be possible to hold all or at least most of the "candidate losses" compressed in RAM (if I rearrange calc_L() a bit to consume those slices whenever the stm king has done a sweep of the board). Keeping all the cumulative wins slices compressed in RAM would be helpful too, but that will usually be too much (perhaps for pawnful tables this might often fit). So I should allow the generator to keep part of its working tree in a RAM-based filesystem and the rest on disk.
On the computation side there are:
- compression/decompression of temporary files
- index calculations. My previous calculator used a very simple and fast indexing scheme during generation. My new generator uses a more expensive scheme (I think iterating through positions is now pretty fast; most of the cycles probably go into calculating the position index after a move/unmove). But I learned from the 7-piece generation effort that memory bandwidth is more likely to be the bottleneck than calculation, especially with more and more cores being available.
- so memory bandwidth.
What also takes a lot of time (and is limited by memory bandwidth) is the final compression phase. Here I/O basically plays no role. I think something can be gained algorithmically if more memory is available.