hash collisions

jdart · Post by **jdart** » Tue Jan 28, 2020 5:09 am

It is pretty well known now that with the common 64-bit hash codes in use for A-B engines, hash collisions are rare but certainly do occur. On a very big machine with a lot of threads they may even start to be frequent. Engines protect themselves from this in various ways, mostly pretty effective.

I recently got a note from TCEC team and they were basically saying that since they are going to be running on 176 cores for the next season, they are concerned that engines be able to handle this well. They mention this testing technique suggested by the Demmolito author:

I found a crashing bug linked to HT races. I don't know if this is the only bug causing crashes, but it sure is the most likely candidate (ie. rare crashes, whose probability of occurrence increases with nb of threads).

This might be of interest to other programmers. The debugging technique consists of simulating in single threaded search, the effect of SMP races. The important thing to note is that, even if HT read/writes are racing, the read and write operations are still atomic on 64-bit words. So the result of an HT race is not completely random, but of the form e'=(key1,data2) mixing two entries e1=(key1,data1) and e2=(key2,data2) that were racing. This can be simulated in single threaded mode (which is 100% deterministic), with as high probability of crashing as one desires, by reducing the number of bits used (ie. generate more collisions). Example using only 8-bits instead of 64-bit, and demolito crashes very easily:

if (((e->keyXorData ^ e->data) & 0xFFull) == (key & 0xFFull)) {...}

Or by randomizing the hash move. My first thought was, well, I don't have this issue, but I can make problems occur with the suggested method. One of them is this scenario: a hash move is fetched when the side to move is in check, and it would be a legal move except it is not an evasion. Furthermore that move is also a checking move. The program executes it, and in the next ply, it does not detect the illegality because of the check - it just generates evasion moves, when it fact it also has a capture of the opposing King, which would ordinarily result in illegality detection. I am pretty sure there are a few other rare conditions like this. The easiest fix would be to just go to to a 128-bit hash code, as I think Steven Edwards did for his program back once upon a time, but that does increase memory usage/reduce hash entries for a fixed amount of memory.

Tony P. · Post by **Tony P.** » Tue Jan 28, 2020 7:30 am

CPW suggests some remedies for data races... I have no idea how much overhead they add in practice, though.

As for collisions, maybe store >1 move per hash entry?

mar · Post by **mar** » Tue Jan 28, 2020 8:28 am

If his engine crashes, then he should fix his hashmove validation.

A better test is to generate all possible moves during search (in hashmove format of course, typically 16 bits), generate all legal moves per node and make sure your validator correctly classifies these as legal/illegal.
After a couple of games in this mode your hashmove validation becomes robust.

YUFe · Post by **YUFe** » Tue Jan 28, 2020 1:30 pm

mar wrote: ↑Tue Jan 28, 2020 8:28 am If his engine crashes, then he should fix his hashmove validation.

I agree. Do not botch/fudge it, fix it.

Daniel Shawul · Post by **Daniel Shawul** » Tue Jan 28, 2020 5:38 pm

I think lockless hashing was pretty safe at modest number of threads most people have removed the need for legality checking of the hash move. Now it may be necessary to bring that back in.

jstanback · Post by **jstanback** » Tue Jan 28, 2020 5:42 pm

Wasp had 3 crashes in the recent TCEC tournament where it used 128 threads and a 16 Gbyte hash table. I was unable to duplicate the crashes using 32 threads and an 8 Gbyte hash table even after many tries searching the position prior to the crash to fill the hash table and then the position Wasp crashed on and running searches up to 10 hours on each crash position. Wasp stores 64 bits of position hash in the HT and uses the lockless approach where the move and score are xor'd with some of those bits. Others suggested that I do the test proposed by the Demolito author where I use few or no bits other than the hash address to see if Wasp crashes. I tried it, and Wasp crashed almost immediately on any position. My hash move legality tests were very poor, only checking that there was a piece of the right color on the from_square and not a piece of that color on the to_square and that the move does not leave the king in check. I was ignoring the fact that the move from the hash table can be just about anything, say b2f6, yet in the current position the piece on b2 could be a pawn or rook, or the path to f6 might be blocked. Wasp did not test for those things. I was relying on the total bits of hash verification which was seemingly enough. But with a 16G hash table (1G positions for wasp) and 4 probes per position, once the hash table is full (true after the first few moves in TCEC) I really only had 32 bits of verification. So every 4 billion hash probes (roughly every 80 seconds on TCEC hardware) I could expect a hash collision. My poor hash move legality test would catch many of the bad moves, but it's a little surprising that Wasp didn't crash more often at TCEC and that I couldn't get it to crash in my testing.

Anyway, I fixed the hash move legality function and Wasp no longer crashes even with no additional bits of verification and long searches where zillions of bad moves are fetched from the HT. It's very helpful to have Wasp run on big hardware at TCEC, even though it's no fun to see my engine crash.

John

Tony P. · Post by **Tony P.** » Tue Jan 28, 2020 8:42 pm

Daniel Shawul wrote: ↑Tue Jan 28, 2020 5:38 pm I think lockless hashing was pretty safe at modest number of threads most people have removed the need for legality checking of the hash move. Now it may be necessary to bring that back in.

So it looks like the SF devs were right when they decided not to store the full hash key in the TT (currently, only the first 16 bits are stored if I'm not mistaken), making the storage more economical (a 3-entry struct in 32 bytes), but to check the TT move for legality instead. I see that it was discussed 8 years ago: Stockfish hash implementation.

Uri Blass · Post by **Uri Blass** » Tue Jan 28, 2020 9:24 pm

Tony P. wrote: ↑Tue Jan 28, 2020 8:42 pm
Daniel Shawul wrote: ↑Tue Jan 28, 2020 5:38 pm I think lockless hashing was pretty safe at modest number of threads most people have removed the need for legality checking of the hash move. Now it may be necessary to bring that back in.
So it looks like the SF devs were right when they decided not to store the full hash key in the TT (currently, only the first 16 bits are stored if I'm not mistaken), making the storage more economical (a 3-entry struct in 32 bytes), but to check the TT move for legality instead. I see that it was discussed 8 years ago: Stockfish hash implementation.

I would like to know result of tablebases test about these hash collisions for engines including stockfish.

The idea is to take a lot of random positions from the 7 piece syzygy tablebases in order to test engines when the engines do not use tablebases
and may use let say 2^25 hash entries.

The question is what is the probability to have a wrong mate score because of hash collision.

Wrong mate score can be one of the following:
1)A move with mate score when the side to move does not have a mate with the relevant move
2)A move with mate score that is shorter than the real shortest mate.

In order to check if hash collision is the problem of wrong scores analysis should be compared to analysis when you store the full position in the TT(with the same number of hash entries) when in all cases you use a single core so results can be reproducable.

Note that you may need to change the code of the engines in order to store the full position in the TT.

chrisw · Post by **chrisw** » Tue Jan 28, 2020 9:45 pm

YUFe wrote: ↑Tue Jan 28, 2020 1:30 pm
mar wrote: ↑Tue Jan 28, 2020 8:28 am If his engine crashes, then he should fix his hashmove validation.
I agree. Do not botch/fudge it, fix it.

Me too. Under commercial conditions it is completely unacceptable to leave a potential program crash bug that could be fixed. Try supplying a Japanese publisher on that basis. The rule is NO bugs at all, under any circumstances. Is just plain lazy programming.

D Sceviour · Post by **D Sceviour** » Tue Jan 28, 2020 9:54 pm

Daniel Shawul wrote: ↑Tue Jan 28, 2020 5:38 pm I think lockless hashing was pretty safe at modest number of threads most people have removed the need for legality checking of the hash move. Now it may be necessary to bring that back in.

Hash verification might be worth trying. In a recent 4 CPU gauntlet game with Deep Shredder-Schooner, black was showing a black to mate score and then played Kh7????

[d]5r1k/rb3P2/1n2B1pP/3pP1B1/1pnP4/1Pp5/2P5/1K1R3R b - - 4 36

It would be nice to know how this happened. It is not reproducible from 1 CPU to 8 CPU. I will try some hash position and move verification when using multiple threads to find if this makes a difference.

hash collisions

hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions

Re: hash collisions