disk-based 8-piece TB generator

Discussion of chess software programming and technical issues.

Moderator: Ras

Koistinen
Posts: 32
Joined: Sun May 23, 2021 10:05 pm
Location: Stockholm, Sweden
Full name: Urban Koistinen

Re: disk-based 8-piece TB generator

Post by Koistinen »

syzygy wrote: Fri May 29, 2026 11:43 pm
Koistinen wrote: Fri May 29, 2026 4:15 pm Nice this is going again. I did some estimation for how long it would take using 150 * 24TB hdds for just white wins and estimate 5 hours per 8-man.

Still takes about 3 years just to compute them all! (5*4795h is about 3 years)
5 hours per 8-man on average is definitely way too optimistic. KNNNNvKRR didn't take too long, but it is 48x smaller than some others.

But 4795 seems to be overcounting it.

The total number of tables for 3-8 pieces:

Code: Select all

3-piece		5
4-piece		30
5-piece		110
6-piece		365
7-piece		1001
8-piece		2520
The 8-piece tables can be nicely split into:
- 868 pawnless tables
- 792 tables with exactly one pawn
- 860 tables with 2+ pawns.

The 868 pawnless tables are now the "easy" ones.
The 792 tables with one pawn are the hard ones (although they need the least RAM to generate). They will take up the most space.
The 860 tables with 2+ pawns can be generated in reasonable RAM, which should mean that they can be generated relatively quickly.

All the 8-men pawnless tables can be generated independently and thus in parallel (you just need to download the right 7-men WDL tables, which isn't a big deal).
When that is done, all 8-men tables with 1 pawn can be generated independently (but you need to download the right 8-men WDL tables, which starts to be an issue).
What that is done, all 8-men tables with 2 pawns, then 3 pawns, etc.

The number of 8-men subtables you need to generate a specific pawnful table isn't too large, but if each table is multiple TBs, it is a lot of data. Still, this seems within reach if you don't live in a backward country like Germany.

The real problem is: who wants to host 2 PB of data.
Don't know what the retarded germans got to do with it.

I think my counting method is better when generating one side's wins at a time.
For storing the results, your approach is better.

I can afford to host 2-3PB but am looking into if perhaps some university would care to. Potentially less tax that way.
syzygy
Posts: 6007
Joined: Tue Feb 28, 2012 11:56 pm

Re: disk-based 8-piece TB generator

Post by syzygy »

Koistinen wrote: Sat May 30, 2026 10:55 am
syzygy wrote: Fri May 29, 2026 11:43 pm
Koistinen wrote: Fri May 29, 2026 4:15 pm Nice this is going again. I did some estimation for how long it would take using 150 * 24TB hdds for just white wins and estimate 5 hours per 8-man.

Still takes about 3 years just to compute them all! (5*4795h is about 3 years)
5 hours per 8-man on average is definitely way too optimistic. KNNNNvKRR didn't take too long, but it is 48x smaller than some others.

But 4795 seems to be overcounting it.

The total number of tables for 3-8 pieces:

Code: Select all

3-piece		5
4-piece		30
5-piece		110
6-piece		365
7-piece		1001
8-piece		2520
The 8-piece tables can be nicely split into:
- 868 pawnless tables
- 792 tables with exactly one pawn
- 860 tables with 2+ pawns.

The 868 pawnless tables are now the "easy" ones.
The 792 tables with one pawn are the hard ones (although they need the least RAM to generate). They will take up the most space.
The 860 tables with 2+ pawns can be generated in reasonable RAM, which should mean that they can be generated relatively quickly.

All the 8-men pawnless tables can be generated independently and thus in parallel (you just need to download the right 7-men WDL tables, which isn't a big deal).
When that is done, all 8-men tables with 1 pawn can be generated independently (but you need to download the right 8-men WDL tables, which starts to be an issue).
What that is done, all 8-men tables with 2 pawns, then 3 pawns, etc.

The number of 8-men subtables you need to generate a specific pawnful table isn't too large, but if each table is multiple TBs, it is a lot of data. Still, this seems within reach if you don't live in a backward country like Germany.

The real problem is: who wants to host 2 PB of data.
Don't know what the retarded germans got to do with it.
Internet download/upload speeds are 10 years behind.
Or, more positively: in many countries internet speeds are getting to the point where downloading or uploading a few terabytes of data becomes feasible.
I think my counting method is better when generating one side's wins at a time.
For storing the results, your approach is better.
My method is very close to one that you proposed in 2010:
https://kirill-kryukov.com/chess/discus ... php?t=5163
But I use 1 bit per position to calculate just DTZ=N instead of 2 bits to calculate DTZ=N and N+1 (easier to implement, less RAM and probably not significantly slower if at all).
I can afford to host 2-3PB but am looking into if perhaps some university would care to. Potentially less tax that way.
So maybe this can be done :)
User avatar
Ajedrecista
Posts: 2246
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Disk-based 8-piece TB generator.

Post by Ajedrecista »

Hello Ronald and Urban:

Thanks to both for your answers. I think you are counting different things: Ronald counts different types of endgames (as I would count) and Urban is counting computations that he needs in order to get the tablebases. Let me show you what I understand from the easy example of 4-man EGTB:

Code: Select all

4-man EGTB.

Ronald's    Urban's
typical ——  needed
endgames    computations

 1 KQQk ——  1 QQ
 2 KQRk ——  2 QR
 3 KQBk ——  3 QB
 4 KQNk ——  4 QN
 5 KQPk ——  5 QP
 6 KRRk ——  6 RR
 7 KRBk ——  7 RB
 8 KRNk ——  8 RN
 9 KRPk ——  9 RP
10 KBBk —— 10 BB
11 KBNk —— 11 BN
12 KBPk —— 12 BP
13 KNNk —— 13 NN
14 KNPk —— 14 NP
15 KPPk —— 15 PP

16 KQkq —— 16 Qq

17 KQkr —— 17 Qr
        —— 18 Rq

18 KQkb —— 19 Qb
        —— 20 Bq

19 KQkn —— 21 Qn
        —— 22 Nq

20 KQkp —— 23 Qp
        —— 24 Pq

21 KRkr —— 25 Rr

22 KRkb —— 26 Rb
        —— 27 Br

23 KRkn —— 28 Rn
        —— 29 Nr

24 KRkp —— 30 Rp
        —— 31 Pr

25 KBkb —— 32 Bb

26 KBkn —— 33 Bn
        —— 34 Nb

27 KBkp —— 35 Bp
        —— 36 Pb

28 KNkn —— 37 Nn

29 KNkp —— 38 Np
        —— 39 Pn

30 KPkp —— 40 Pp
The table was written by hand, but I hope no typos. The tablebase metrics are also different: Ronald's typical WDL differenciates between {win, draw, lose} in a given position; while Urban's W/DL is a {win, no win} output, without distinguishing between drawn and lost positions. I do not know if there are differences in how each of you treat cursed wins (wins longer than the fifty-move rule).

------------

Taking advantage of NULP's results for 8-man, I copied them to Excel and toyed with filters. I confirm Ronald's results of 868 pawnless tables (I filtered p and got 1652 results, so 2520 - 1652 = 868), with circa 3.81763e+16 - 3.39903e+16 ~ 4.186e+15 positions or around 10.965% of the positions of 8-man EGTB using NULP's results.

I can also confirm 792 tables with exactly one pawn, using =LEN(A1)-LEN(SUBSTITUTE(A1,"p","")) formula (thanks, Google!) to count how many 'p' are in each cell and filter to 1. This gave circa 2.03920e+16 positions or around 53.415% of the positions of 8-man EGTB using NULP's results. Expanding, I get around 27.119%, 7.308%, 1.102%, 0.088% and 0.003% of the positions for endgames with 2, 3, 4, 5 and 6 pawns, without taking into account how many pawns are per side. Logically: 2520 - 868 - 792 = 860 tables with two or more pawns.

I agree that the table of dependencies is not easy-peasy. I guess that for each of the 792 tables with exactly one pawn, there are pawnless tables that must be pre-computed (replacing 'p' by either 'q', 'r', 'b' and 'n'?) and do something similar recursively with more pawns. It is easier to tell than to make it work, but both of you are very smart.

Regards from Spain.

Ajedrecista.
Ender8pieces
Posts: 22
Joined: Wed Jan 21, 2026 9:16 am
Full name: JS

Re: disk-based 8-piece TB generator

Post by Ender8pieces »

I observed that the K4Nk2r table took around 6 hours for 111 iterations (actually 2 tables with 48 and 63 iterations respectively).

This seems quite surprising since it does not include disk reads. If it did include disk reads, it should take ~1,000 seconds for 111 iterations on a disk with 10 GB/s throughput, assuming the database is 100 GB on disk. That is roughly 20 minutes, which is much less than 6 hours.

I assume this ratio would be preserved in larger databases, so it seems the bottleneck is not the disk (hence more RAM won't help), but the retrograde analysis itself—which I assume is mostly limited by the RAM reads of all the neighboring positions.

I wonder if a GPU could help in such cases and if that implementation is possible.

Anyway, right now it seems very feasible, and maybe a community project can be adapted if a central server becomes available.
syzygy
Posts: 6007
Joined: Tue Feb 28, 2012 11:56 pm

Re: disk-based 8-piece TB generator

Post by syzygy »

Ender8pieces wrote: Sun May 31, 2026 10:15 am I observed that the K4Nk2r table took around 6 hours for 111 iterations (actually 2 tables with 48 and 63 iterations respectively).
Or even 222 iterations if you count win-in-N and loss-in-(N+1) as two iterations.
The main loop (until the 50-move rule):

Code: Select all

  int n;
  for (n = 1; n <= DRAW_RULE; n++) {
    more_wb_next = more_lw && calc_L(WHITE, n, more_lw);
    more_ww_next = more_lb && calc_L(BLACK, n, more_lb);

    more_lb = more_ww && calc_W(WHITE, n, more_ww);
    more_lw = more_wb && calc_W(BLACK, n, more_wb);

    more_wb = more_wb_next;
    more_ww = more_ww_next;
  }
calc_L(stm, n) calculates stm losses with dtz=N, calc_W(stm,n) calculates stm wins with dtz=N.
This seems quite surprising since it does not include disk reads. If it did include disk reads, it should take ~1,000 seconds for 111 iterations on a disk with 10 GB/s throughput, assuming the database is 100 GB on disk. That is roughly 20 minutes, which is much less than 6 hours.
Compressed it is a lot less I/O than that. Uncompressed it is a lot more.

One bit per position for one side to move is 49.6 GB uncompressed, so I can see how you arrive at 111 x 100 GB.
But calc_W(wtm, N):
- reads all btm losses for dtz = N-1
- reads all cumulative wtm wins for dtz <= N-1
- writes all wtm wins for dtz = N
- updates all cumulative wtm wins for dtz <= N (*)

and calc_L(wtm, N):
- reads all btm wins for dtz = N-1
- writes all candidate wtm losses for dtz = N
- reads all candidate wtm losses for dtz = N
- reads all cumulative btm wins for dtz <= N-1 to check whether all successor positions of candidate wtm losses lose.
- must somehow also check that all captures from candidate wtm losses lose (there are several options for this). (**)
- writes all wtm losses for dtz = N.

(*) The straightforward way is to write back the updated cumulative wins at each calc_W() iteration. Unlike the dtz=N wins/losses slices, these cumulative wins slices do not typically get sparser for higher iterations, so compressing does not get faster. Since decompressing is faster than compressing, I do not write them back at every iteration but instead use the smaller (if compressed) dtz=N win slices as deltas and update the cumulative slices only when reading the deltas starts to get too much. (I learned that deciding when to update is an example of the "ski-rental-problem".)

(**) Checking captures can be done in multiple ways:
- by probing subtables over and over again. Very slow (but perhaps optimal once there are very few positions left to be resolved).
- by generating a separate bitmap with positions with a drawing capture (or one that leads to a cursed loss) and reading this bitmap in every calc_L() iteration (another 50 GB uncompressed). As far as calc_L() is concerned, this bitmap could be merged into the "cumulative wins" bitmap, but this does not work for calc_W().
- by generating smaller sub-K-slice bitmaps for the positions after the capture, basically an uncompressed form of tablebase once loaded into memory, and testing captures against those. This is what my pawnless generator does. (Adds a bit to RAM usage but not much, less than a 21st full K-slice.)

For pawnful tables, you also need to check pawn moves (including promotions). Now generating sub-K-slice bitmaps does not work because pushing a pawn does not remove a piece from the board and therefore does not drastically reduce the number of positions like a capture does. So the pawnful generator generates K-slice bitmaps for positions with a non-losing pawn move. Since these have to be read anyway during each calc_L() pass, I also add the positions with a non-losing capture to them, so now there is no need for the sub-K-slices.

So you easily get to 10 x 50 GB I/O per iteration, if uncompressed, of which 3 or 4 are write steps. If you choose 3, then you need to do many reads of full-sized uncompressed dtz=N win slices, so you should want to write back the cumulative wins at each iteration, i.e. 4 write steps.

So per iteration 200 GB written. With 111 iteration this is 22.2 TB. For a full-size TB, you can multiple the size of each slice by 48 and the number of iterations maybe by 5. That is 5328 TB. It seems the better consumer SSDs are nowadays rated for 600 TB write endurance. So you'll need 9 SSDs in the process and figure out how not to lose any data while replacing SSDs, for a single large 8-men TB. This is getting expensive.

So let's be clear: it is nonsensical to consider not using compression here. Let's just forget about that, I think we've discussed this enough. If you really want to burn SSDs, then take my code and remove the compression yourself.
I assume this ratio would be preserved in larger databases, so it seems the bottleneck is not the disk (hence more RAM won't help), but the retrograde analysis itself—which I assume is mostly limited by the RAM reads of all the neighboring positions.
With good compression, I/O may indeed not be a major bottleneck even with slow HDDs (but with slow HDDs it would help to do more scheduling of reads and writes so computation does not have to wait for I/O to finish).

The retrograde analysis does a lot of index calculations. The indexing scheme for generation that I use now is a lot more expensive than what I used in my old generator, but from the 7-piece generation effort I learned that memory bandwidth is the main limiting factor, especially as the number of available cores goes up.

Of course compression/decompression also takes time. This can still be tuned (different compression levels, different compression methods).

With a modest amount of extra RAM, it should be possible to hold all or at least most of the "candidate losses" compressed in RAM (if I rearrange calc_L() a bit to consume those slices whenever the stm king has done a sweep of the board). Keeping all the cumulative wins slices compressed in RAM would be helpful too, but that will usually be too much (perhaps for pawnful tables this might often fit). So I should allow the generator to keep part of its working tree in a RAM-based filesystem and the rest on disk.

On the computation side there are:
- compression/decompression of temporary files
- index calculations. My previous calculator used a very simple and fast indexing scheme during generation. My new generator uses a more expensive scheme (I think iterating through positions is now pretty fast; most of the cycles probably go into calculating the position index after a move/unmove). But I learned from the 7-piece generation effort that memory bandwidth is more likely to be the bottleneck than calculation, especially with more and more cores being available.
- so memory bandwidth.

What also takes a lot of time (and is limited by memory bandwidth) is the final compression phase. Here I/O basically plays no role. I think something can be gained algorithmically if more memory is available.
Ender8pieces
Posts: 22
Joined: Wed Jan 21, 2026 9:16 am
Full name: JS

Re: disk-based 8-piece TB generator

Post by Ender8pieces »

OK, I hear you. We discussed this in the past, but now I fully understand your point.

Let me see if I have this right: the compressed files are roughly 10% the size of the uncompressed ones.
In the first iterations, both the compressed and uncompressed approaches will write a lot of data (though uncompressed obviously writes significantly more). In the later iterations, the compressed approach writes very little because of your "delta update" trick (which actually resolves my main concern).

Your application of the ski-rental problem here is very interesting; I used to work on similar optimization instances.

Do you have any rough estimate for the total write volume (in TB) for the full project?
Also, do you have a rough estimate of the total time required to generate the full tablebase, assuming a single machine similar to the one you are running on (and enough SSD capacity)?

Regarding memory bandwidth: could the index function be designed to support reading in "blocks of bits" instead of random bits? Perhaps the pieces with the highest mobility should be the least in the index hierarchy.

As I say, a GPU might be useful for this , but I it wouldn't be a simple implementation.