lockless hashing

syzygy · Post by **syzygy** » Thu May 12, 2016 9:33 pm

bob wrote:Yes, if somewhat unsafely. Today 8 byte-aligned writes of size 8 are done atomically. Who knows about tomorrow? And there are other architectures with a more relaxed memory model (the Alpha was probably the most aggressive) where such an assumption might blow up.

Tomorrow your unsafely implemented xor-trick might get miscompiled as well thanks to undefined behavior

.
The Alpha already might have misexecuted it.

However, 8 byte hash entries are really problematic in terms of collisions, which means to get rid of the lockless code, you still have to vet the move to be sure it is OK to make, assuming you have potential crash conditions (such as castling with no king on e1, which would produce a second one, etc... I have an 8 byte version but I was not happy with the number of collisions.

I think Evert proposes a 16-byte entry with hash move and 32-bit hash key in the same 8-byte word. So if the 32-bit hash key matches, then the hash move will be unaffected by SMP corruption provided that access to the 8-byte word is atomic.

hgm · Post by **hgm** » Thu May 12, 2016 11:44 pm

bob wrote:How does that solve the problem of non-atomic stores for things > 8 bytes long (or even for things less than 8 bytes long if not properly aligned)???

'That' referring to XOR'ing the pstEval into the key to get an effectivey longer key? It doesn't. it is just a trick for getting a longer key, if you are not confident that 40 bits gets the crash frequency low enough, and you don;t want to bother verifying the hash key.

As to the non-atomic store problem: you don't have to solve it if you can live with it. And as the move seems the only critical part, you can put it in the same atomically accessed word as the signature. And just live with the occaionally corrupted score, depth, age, etc.

bob · Post by **bob** » Fri May 13, 2016 3:04 am

hgm wrote:
bob wrote:How does that solve the problem of non-atomic stores for things > 8 bytes long (or even for things less than 8 bytes long if not properly aligned)???
'That' referring to XOR'ing the pstEval into the key to get an effectivey longer key? It doesn't. it is just a trick for getting a longer key, if you are not confident that 40 bits gets the crash frequency low enough, and you don;t want to bother verifying the hash key.

As to the non-atomic store problem: you don't have to solve it if you can live with it. And as the move seems the only critical part, you can put it in the same atomically accessed word as the signature. And just live with the occaionally corrupted score, depth, age, etc.

It is not about "a longer key". It is about being certain that the best move goes along with this hash signature, rather than being written in an interleaved way with other threads... If an out-of-order interleaved store is done, the table key won't xor with the best move and stuff and then produce a match with the current real signature. That's all this is trying to address, and it is only an issue when one uses threads. The more threads, the bigger the problem becomes. With 20 it is a killer if an illegal move will cause problems for your program. If not, you can ignore this completely.

bob · Post by **bob** » Fri May 13, 2016 3:06 am

syzygy wrote:
bob wrote:Yes, if somewhat unsafely. Today 8 byte-aligned writes of size 8 are done atomically. Who knows about tomorrow? And there are other architectures with a more relaxed memory model (the Alpha was probably the most aggressive) where such an assumption might blow up.
Tomorrow your unsafely implemented xor-trick might get miscompiled as well thanks to undefined behavior .
The Alpha already might have misexecuted it.

However, 8 byte hash entries are really problematic in terms of collisions, which means to get rid of the lockless code, you still have to vet the move to be sure it is OK to make, assuming you have potential crash conditions (such as castling with no king on e1, which would produce a second one, etc... I have an 8 byte version but I was not happy with the number of collisions.
I think Evert proposes a 16-byte entry with hash move and 32-bit hash key in the same 8-byte word. So if the 32-bit hash key matches, then the hash move will be unaffected by SMP corruption provided that access to the 8-byte word is atomic.

Anything can happen. The key is it works today, even on the latest alpha that was produced before it went out of production. One can always twerk around with "volatile" to limit the compiler's ability to reorder things, if that becomes necessary. I'll probably re-visit this again one day but the only issue is lockless works everywhere it has been tried, even on old Crays. Doing tricky compiler stuff can easily become non-portable.

Certainly combining the key and move in one 8-byte value will work assuming one maintains 8 byte alignment to keep the write atomic. One might run into trouble with 32 bit architectures, still.

hgm · Post by **hgm** » Fri May 13, 2016 8:43 am

On 32-bit architectures with non-atomic 64-bit access you would indeed have to take different measures. You could use a 32-bit signature in one word, and then XOR it with the word containing the move, and leave the other data in the entry unchecked.

You could also put 4 bytes of the signature in one word, and combine two more key bytes with the move in a second word. With 32-bit atomic stores only 64-bit operations will likely have a performance penalty anyway, so it would be an advantage to be able to single out potential hits by a 32-bit compare, and worry only about the rest of the key when that matches, even if what you have to do than is a bit more complex. (Like separating key andmove from the other word.) This would reduce the number of undetected data corruptions by 2^16 = 65k, which could be low enough.

flok · Post by **flok** » Fri May 13, 2016 12:02 pm

I don't like lockless hashing. It expects behaviour from the cpu/system that may be different for a new generation of that processor and it is also not portable to other systems at all.

In my program I use read/write locks in my transposition table implementation.
I measured what the overhead is. For that I measured how long it takes to reach depth 9 in a single-thread situation. This I repeated 512 times.

with locking:

Code: Select all

took&#58; 1652.27s, average&#58; 3.23 &#40;higher due to overhead&#41;
512     3158.22 37.48

without locking:

Code: Select all

took&#58; 1642.15s, average&#58; 3.21 &#40;higher due to overhead&#41;
512     3137.52 29.08

So without locking is +/- 0.6% faster.

The second row by the way is: number-of-iterations, avg-time (in ms), std-dev.

Note that I have multiple locks:

Code: Select all

rw_lock **locks = new rw_locks&#91;256&#93;;
...
locks&#91;hash % 256&#93; -> r_lock&#40;) or w_lock&#40;)

The number of locks depends on the number of threads.

jdart · Post by **jdart** » Fri May 13, 2016 12:36 pm

I actually store and use a full 64-bit hash code. I do not try to save bits by storing fewer signature bits.

--Jon

hgm · Post by **hgm** » Fri May 13, 2016 2:08 pm

My point was that if you store the redundant bits anyway, you might make them actually useful by XOR'ing them with the pstEval. Or with part of the Pawn hash key, or material key.

Storing 64 key bits instead of 32 is not without cost, though. It means typically that you will have only 4 entries per 64-byte bucket, while with a 32-bit signature you could have 7. That is a factor 1.75 more hash entries. Now it is true that the time-to-depth typically only scales like the 12th root of the hash size, but 1.75^(1/12) = 1.0477, which means a nearly 5% speed-up. Which again would correspond to a 5-Elo gain. It would be difficult to achieve such a speed-up / Elo gain merely by code optimization.

bob · Post by **bob** » Fri May 13, 2016 3:59 pm

hgm wrote:My point was that if you store the redundant bits anyway, you might make them actually useful by XOR'ing them with the pstEval. Or with part of the Pawn hash key, or material key.

Storing 64 key bits instead of 32 is not without cost, though. It means typically that you will have only 4 entries per 64-byte bucket, while with a 32-bit signature you could have 7. That is a factor 1.75 more hash entries. Now it is true that the time-to-depth typically only scales like the 12th root of the hash size, but 1.75^(1/12) = 1.0477, which means a nearly 5% speed-up. Which again would correspond to a 5-Elo gain. It would be difficult to achieve such a speed-up / Elo gain merely by code optimization.

Seven? Or five? I get 5.

Cray Blitz used 12 byte hash entries, as did Crafty at one point. 12 bytes per entry gives 5 entries per 64 bytes. Not sure where your 7 comes from.

bob · Post by **bob** » Fri May 13, 2016 4:00 pm

flok wrote:I don't like lockless hashing. It expects behaviour from the cpu/system that may be different for a new generation of that processor and it is also not portable to other systems at all.

In my program I use read/write locks in my transposition table implementation.
I measured what the overhead is. For that I measured how long it takes to reach depth 9 in a single-thread situation. This I repeated 512 times.

with locking:
Code: Select all
took&#58; 1652.27s, average&#58; 3.23 &#40;higher due to overhead&#41;
512     3158.22 37.48
without locking:
Code: Select all
took&#58; 1642.15s, average&#58; 3.21 &#40;higher due to overhead&#41;
512     3137.52 29.08
So without locking is +/- 0.6% faster.

The second row by the way is: number-of-iterations, avg-time (in ms), std-dev.

Note that I have multiple locks:
Code: Select all
rw_lock **locks = new rw_locks&#91;256&#93;;
...
locks&#91;hash % 256&#93; -> r_lock&#40;) or w_lock&#40;)
The number of locks depends on the number of threads.

There is no measurable lock overhead when only using one thread. The overhead comes from contention, which requires multiple threads.

lockless hashing

Re: lockless hashing

Re: lockless hashing

Re: lockless hashing

Re: lockless hashing

Re: lockless hashing

Re: lockless hashing

Re: lockless hashing

Re: lockless hashing

Re: lockless hashing

Re: lockless hashing