New 6-piece tablebases

Discussion of chess software programming and technical issues.



Re: New 6-piece tablebases

Post by syzygy »

mar wrote:If I understand correctly you first probe WDL and then DTZ table to make progress?
Correct. During the search only the WDL tables are accessed. Once a position with 6 or fewer pieces is on the board, the DTZ tables are used to make progress.

To make sure the engine makes progress towards a tablebase win, the search should treat tablebase positions that are won or lost similarly to mate positions, but using a different range of values.

Of course pure DTZ-optimal play (i.e. minimaxing the number of moves to the next capture or pawn move by either side) can be very unnatural. It might therefore be desirable to let the engine do searches on the winning moves until it becomes clear that insufficient progress is being made, and only then switch to DTZ-optimal play (e.g. by detecting repetitions and monitoring the 50-move counter).
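The switch described above could be sketched as a small root-level policy. This is purely illustrative: the function name, parameters and thresholds are hypothetical assumptions, not taken from any actual probing code.

```c
#include <assert.h>

/* Hypothetical root policy: keep searching normally on TB-winning moves,
 * but fall back to DTZ-optimal play once progress stalls. `hmc` is the
 * halfmove clock in plies, `dtz` the DTZ value (in plies) of the move the
 * search wants to play, `reps` how often the position has repeated. */
static int should_switch_to_dtz(int hmc, int dtz, int reps)
{
    /* Repetitions signal that the search is just shuffling pieces. */
    if (reps >= 2)
        return 1;
    /* If dawdling much longer would push the next zeroing move past the
     * 50-move rule (100 plies), stop trusting the normal search; the
     * margin of 10 plies is an arbitrary illustrative safety buffer. */
    return hmc + dtz >= 90;
}
```

The exact margin matters little; the point is that the engine only gives up its "natural" moves when the repetition count or the 50-move counter proves that no progress is being made.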

The main reason for DTZ is clearly size, but also important is the fact that DTZ plays well with the 50-move rule. Actually it is not really DTZ but what I would call DTZ50+. For a position that is won or lost under the 50-move rule, my tables give the number of moves to the next zeroing move with optimal play respecting the 50-move rule (the "DTZ50" value). For a position that is drawn under the 50-move rule, but won or lost without the 50-move rule, my tables give the number of moves either to a position that is won or lost under the 50-move rule, or to a zeroing move leading to a position that is drawn under the 50-move rule but won or lost without it.

This may look a bit complicated, but the result is that when ignoring the 50-move rule, a winning path can be found for all positions that are won, and when respecting the 50-move rule, all positions that are won under the 50-move rule can be won and all positions that are drawn under the 50-move rule (but otherwise lost) can be drawn.

Similarly, the WDL tables are in fact WDL50+ tables. The probe returns -2,-1,0,1,2 depending on whether the position is lost under the 50-move rule, lost but drawn under the 50-move rule, drawn, won but drawn under the 50-move rule, won under the 50-move rule.
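As a sketch, an engine might map those five probe values onto search scores like this; the constants and names are illustrative assumptions, not taken from the actual probing code.

```c
#include <assert.h>

/* Illustrative mapping of WDL50+ probe results to search scores.
 * Cursed wins (1) and blessed losses (-1) score as draws when the
 * 50-move rule is respected. The TB-win score range sits below real
 * mate scores so that a forced mate is always preferred. */
#define VALUE_MATE   32000                 /* assumed constant */
#define VALUE_TB_WIN (VALUE_MATE - 1000)   /* assumed constant */

static int wdl_to_score(int wdl, int ply)
{
    if (wdl == 2)  return  VALUE_TB_WIN - ply;  /* prefer quicker wins */
    if (wdl == -2) return -VALUE_TB_WIN + ply;
    return 0;  /* draw, cursed win (1) or blessed loss (-1) */
}
```

An engine that ignores the 50-move rule would instead treat 1 like 2 and -1 like -2 (perhaps with a slightly lower score range).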
syzygy
Posts: 5774
Joined: Tue Feb 28, 2012 11:56 pm

Re: New 6-piece tablebases

Post by syzygy »

rvida wrote:
syzygy wrote: The tables use custom compression. Total compressed size:

Code: Select all

                       WDL          DTZ
up to 5 pieces       378 MB        561 MB
up to 6 pieces       68.3 GB       81.9 GB
Is the custom compression scheme any better than LZMA (which is used in Gaviota)?
My compression scheme allows for compressed blocks of just 64 bytes that can be decompressed quickly, which means that there is no need to cache uncompressed data.

In addition, my compression scheme gives me more control over what to do with "don't care" values. The value of illegal positions can be set to whatever compresses best (which is not necessarily the value occurring with the highest frequency, but depends on the neighbouring values). To further improve compression, the WDL probing code does a round of captures and probes the resulting positions (which should be fast, because tables with one piece fewer are much smaller and may be assumed to be cached in RAM). This means that for positions with a winning capture I can store whatever I want. For drawn positions with a drawing capture I can store either a loss value or a draw value. (And since WDL50+ is 5-valued there are even more possibilities.) Theoretically this is possible with other compression schemes as well, but it is an integral part of my compression scheme.
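A toy version of this capture-resolution logic might look as follows; `resolve_wdl` and its arguments are hypothetical, and the real probing code handles more cases (five WDL50+ values, en passant, etc.).

```c
#include <assert.h>

/* Toy capture resolution for a 3-valued WDL probe (-2 loss, 0 draw, 2 win).
 * `cap_vals` holds the WDL values reachable by each capture, already from
 * the side to move's perspective; `stored` is the raw table value, which
 * may be a "don't care" whenever a capture achieves at least as much. */
static int resolve_wdl(const int *cap_vals, int n, int stored)
{
    int best = -2;
    for (int i = 0; i < n; i++)
        if (cap_vals[i] > best)
            best = cap_vals[i];
    if (best == 2)
        return 2;  /* winning capture: stored value is ignored entirely */
    /* Otherwise the true value is at least what the best capture gives,
     * so a don't-care stored below `best` is harmlessly corrected here. */
    return best > stored ? best : stored;
}
```

Because the probe always takes the maximum, the generator is free to pick, among all admissible don't-care values, whichever one makes the surrounding block compress best.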

Getting the DTZ tables small was done by making them 1-sided (only wtm or only btm, whichever is smaller), storing draws as don't cares, and not storing whether a position wins or loses (since that information is available from the WDL tables). In addition, if a capture or pawn move is winning, I store a don't care. Where possible I store distance in number of moves, but where necessary for accurate 50-move-rule play I store distance in plies.

I believe the capture resolution idea is from Chinook. It is also used for the Robbobases, and I would be surprised if the Shredderbases (5-piece WDL, 441 MB 2-sided and 157 MB 1-sided) did not use it as well. Chinook uses a run-length encoding scheme and sets the don't-care value to the most frequent of W, D, L (the Robbobases are said to do the same). My scheme is a bit more complicated.

Post by lucasart »

syzygy wrote:
lucasart wrote:How does the size of the 5-men tables compare with: the Nalimov TB? the Gaviota TB? And in terms of "average probing speed"?
The 5-piece Nalimov TB are over 7 GB. I believe the 6-piece are over 1.2 TB (and those lack the 5v1 tables).
The 5-piece Gaviota TB are 6.5 GB with LZMA compression, see here.

Average probing speed of my tables should be considerably better than that of the Nalimov tables, but to be honest I have not measured it. Probing 1.2 TB of Nalimov 6-piece tables will of course lead to much more disk thrashing than probing 68.3 GB of tables. And 68.3 GB on SSD is much more affordable than 1.2 TB on SSD.
Woaw! So you're being modest, but in reality, your TB beats the pants out of everything out there. And yes, probing speed will surely be linked to size (disk access much slower than RAM). The size of your TB is more comparable to the size of a bitbase than a TB, so it's really a big improvement over Nalimov and Gaviota.

Would it be possible to modify your code to generate a bitbase only:
- one bit per position (disregarding 50-move counter)
- bit just says if it's a draw or not.

If not a draw, good engines will (in the vast majority of cases) be able to find the win on their own. The real difficulty is to understand that certain positions are a "no progress situation", despite a material advantage or an apparent advantage in terms of eval score.

A very small bitbase with only draw info, that could fit in RAM for 5-men, would be really nice.

Post by syzygy »

lucasart wrote:The size of your TB is more comparable to the size of a bitbase than a TB, so it's really a big improvement over Nalimov and Gaviota.
I believe most available "bitbases" are really WDL tables. So it's fair to compare my WDL tables to bitbases.
Would it be possible to modify your code to generate a bitbase only:
- one bit per position (disregarding 50-move counter)
- bit just says if it's a draw or not.
That is possible, but I don't think it would reduce their size by much. E.g. with KQvKR, the positions where the black K or R can immediately capture the white Q are already stored as "don't care". The few positions where black can mate or win the white Q in more than 1 ply don't take up much space, and the space they do take up seems worthwhile, since the evaluation will usually not catch these situations. Leaving them out would either result in erroneous results or force the engine to search a few more plies at each node where it could otherwise immediately return a result.

For tables that really have only two values such as KPvK or KBNvK, there would be no difference at all.

The story is similar for the additional size due to the 50-move rule: for most tables there is hardly any impact, and for tables such as KBBvKN having the information is quite useful.

If I find some time I might try to generate some draw/non-draw tables just to give some numbers.
A very small bitbase with only draw info, that could fit in RAM for 5-men, would be really nice.
Nowadays 378 MB should fit in RAM ;-).

What would really decrease the size is to make the WDL tables single sided. But that would affect probing speed quite significantly. Depending on how much RAM is available it might still be an overall win for 6-piece tables, but I have not tried it out.

Post by Houdini »

syzygy wrote:The probing code is fully multithreaded as well.
Ronald,

Congrats for your new EGTB system!
Could you comment on the multi-threaded performance? Suppose 8 threads are simultaneously probing, how much of the probing code can run simultaneously?

All the EGTB solutions I've tested so far (Gaviota, Nalimov, Scorpio) have a serious problem with multi-threaded access, too much of the code is behind a single mutex (critical section) and when 8 threads probe there are 7 threads waiting while the first thread is doing its probe.

Robert

Post by Daniel Shawul »

Houdini wrote:All the EGTB solutions I've tested so far (Gaviota, Nalimov, Scorpio) have a serious problem with multi-threaded access, too much of the code is behind a single mutex (critical section) and when 8 threads probe there are 7 threads waiting while the first thread is doing its probe.
Hah.. this is false, at least for Scorpio. I have code that allows up to 8 searchers to do probes simultaneously. The LRU cache has locks, but that is just about it. Please let's not spread FUD.

Post by syzygy »

Houdini wrote:Could you comment on the multi-threaded performance? Suppose 8 threads are simultaneously probing, how much of the probing code can run simultaneously?
Multi-threaded access is completely simultaneous, except for a single lock that is taken when accessing a particular table for the first time (this is negligible).

One of the reasons for not needing the mutex that the other systems have is that I let the OS do all the caching. I simply mmap() the compressed tables into memory and directly access the compressed data. For 5-piece tables that should never be a problem (it's just 378 MB). For 6-piece tables I don't know how well e.g. Windows handles it.

Accessing 6-piece tables on 32-bit systems would need some work, because mmap()ing them will run into problems, but on 32-bit systems it seems to make more sense to stay with 5-piece tables anyway. I think the decompression code currently expects a 64-bit processor, but that can be changed quite easily.
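A minimal sketch of that mmap() approach on POSIX systems, assuming a hypothetical helper name; the real code also parses the table header and handles errors differently:

```c
#include <assert.h>
#include <fcntl.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Map a (compressed) table file read-only into memory. After this,
 * any number of threads can probe the returned pointer concurrently
 * without locks; the OS page cache does all the caching. */
static const uint8_t *map_table(const char *path, size_t *size)
{
    int fd = open(path, O_RDONLY);
    if (fd < 0)
        return NULL;
    struct stat st;
    if (fstat(fd, &st) < 0) {
        close(fd);
        return NULL;
    }
    *size = (size_t)st.st_size;
    void *data = mmap(NULL, *size, PROT_READ, MAP_SHARED, fd, 0);
    close(fd);  /* the mapping remains valid after closing the fd */
    return data == MAP_FAILED ? NULL : (const uint8_t *)data;
}
```

Since the mapping is read-only and shared, pages are faulted in on demand and evicted under memory pressure, which is exactly the lock-free caching behaviour described above.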

Post by Houdini »

Daniel Shawul wrote:Hah.. this is false, at least for Scorpio. I have code that allows up to 8 searchers to do probes simultaneously. The LRU cache has locks, but that is just about it. Please let's not spread FUD.
AFAIK with large cache sizes a lot of time is spent traversing the LRU cache, and that is exactly what you are locking.
Do you have any performance data that would suggest otherwise?

Robert

Post by Houdini »

syzygy wrote:Multi-threaded access is completely simultaneous, except for a single lock that is taken when accessing a particular table for the first time (this is negligible).
That's great, I'm looking forward to testing your system.

Cheers,
Robert

Post by Daniel Shawul »

Houdini wrote:
AFAIK with large cache sizes a lot of time is spent traversing the LRU cache, and that is exactly what you are locking.
Do you have any performance data that would suggest otherwise?
You ask me for data to support your claim? You are the one who said all 3 TBs can only probe with one thread while the rest are stuck, so let's see it.
My caches have always been small and they store _decompressed_ data. So a hit is very fast, while if you let the OS cache it you have to decompress a block of data every time. I tested this a long time ago with up to 4 threads and found no problems, and the LRU cache did speed things up.