atomic TT

bob · Post by **bob** » Mon Sep 14, 2015 9:21 pm

sje wrote:
bob wrote:256 locks = 8 cache lines. So there is contention when you get to a significant number of cores. but more importantly, you generate a TON of cache traffic. My 20 core box will definitely show the problem as you have 20 cores fighting over 8 cache blocks. Number of locks is irrelevant here, it is the cache blocks that are the issue.
A while back I ran a test on a quad core box with and without spinlocked regions. The test without spinlocks showed some data errors as expected. But it also ran only about three percent faster than the locking version. This was done doing perft() stashing/fetching perft(2) and deeper sums at every opportunity.

With the spinlocked region approach, I don't have to know processor dependent details about 64 bit atomicity, I don't have to be concerned with 32 bit vs 64 bit architecture diferences, and with 128 bit signatures, I don't need to worry about false positives or move verification. So, for only a few speed percentage points, I get peace of mind.

Yes, but on 16 cores and up, there is going to be more than 3% cost. It adds up in a BIG hurry when you have 8 blocks of cache and 20 cores sharing them. There you see a ton of forwarding traffic which becomes a bottleneck quickly.

sje · Post by **sje** » Mon Sep 14, 2015 10:06 pm

bob wrote:
sje wrote:With the spinlocked region approach, I don't have to know processor dependent details about 64 bit atomicity, I don't have to be concerned with 32 bit vs 64 bit architecture diferences, and with 128 bit signatures, I don't need to worry about false positives or move verification. So, for only a few speed percentage points, I get peace of mind.
Yes, but on 16 cores and up, there is going to be more than 3% cost. It adds up in a BIG hurry when you have 8 blocks of cache and 20 cores sharing them. There you see a ton of forwarding traffic which becomes a bottleneck quickly.

If I ever get a box with more than 16 cores, then I can try this:

Code: Select all

class FatSpinlock&#58; public Spinlock
&#123;
  private&#58;
    char padding&#91;CacheLineSize - sizeof&#40;Spinlock&#41;&#93;;
&#125;;

Which will give each spinlock its own cache line.

syzygy · Post by **syzygy** » Mon Sep 14, 2015 10:30 pm

lucasart wrote:But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ?

Not really, but you can get close using CMPXCHG16B:
http://stackoverflow.com/questions/1817 ... umber-in-c

Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?

If it works, the compiler will likely generate explicit locking instructions. Not efficient.

* Stockfish style: allow races and dont care, just check for legality of the TT move

The correct C++11 way to do this uses atomics with relaxed semantics. That basically tells the compiler that race conditions are by design and are OK. In principle that should not introduce any overhead and the program will not contain "undefined behaviour".

On a processor with TSX it might be possible to wrap all hash table accesses with a single mutex without measurable overhead. But I have not tried this (don't have such a processor). Not terribly useful, as it would kill performance on all other processors (unless disabled) and it does not solve any actually existing problem.

syzygy · Post by **syzygy** » Mon Sep 14, 2015 10:34 pm

Joost Buijs wrote:
lucasart wrote:Seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and dont care, just check for legality of the TT move
* Cilk Chess style: atomic lock for each TT entry

But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ? Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
The Stockfish style looks very dirty and prone to errors.
If you check the validity of the move there is a chance the entry is ok, but how do you check this in case of an upperbound when there is no move available?

Unless something has changed recently, SF has tens of thousands of hash collisions per second. The very few extra caused by races really won't matter.

Rein Halbersma · Post by **Rein Halbersma** » Mon Sep 14, 2015 10:44 pm

syzygy wrote: The correct C++11 way to do this uses atomics with relaxed semantics. That basically tells the compiler that race conditions are by design and are OK. In principle that should not introduce any overhead and the program will not contain "undefined behaviour".

relaxed memory order does not introduce harmless data races. it only relaxes sequential consistency. data races are always undefined behavior in C++11.

the few experts on relaxed atomics discourage any casual user from using them, as they are very hard to get right. see Herb Sutter's talk http://herbsutter.com/2013/02/11/atomic ... -hardware/

Dann Corbit · Post by **Dann Corbit** » Mon Sep 14, 2015 10:59 pm

I imagine that there is a similar way to do this in Linux:
https://msdn.microsoft.com/en-us/library/bb514094.aspx

syzygy · Post by **syzygy** » Mon Sep 14, 2015 11:05 pm

Rein Halbersma wrote:
syzygy wrote: The correct C++11 way to do this uses atomics with relaxed semantics. That basically tells the compiler that race conditions are by design and are OK. In principle that should not introduce any overhead and the program will not contain "undefined behaviour".
relaxed memory order does not introduce harmless data races. it only relaxes sequential consistency. data races are always undefined behavior in C++11.

The data races are not "data races" anymore within the meaning of the C++11 standard and are therefore not undefined behaviour.

the few experts on relaxed atomics discourage any casual user from using them, as they are very hard to get right. see Herb Sutter's talk http://herbsutter.com/2013/02/11/atomic ... -hardware/

Sure, casual users will mess up.

Locking is not a real option. Ignoring data corruption caused by smp races is just fine. Crafty's approach is fine as well. But to get either "right" under C++11, you need relaxed atomics. Without atomics, demons may fly out of your nose. With stricter than relaxed atomics, performance goes down the drain. (Of course pre-C++11 it was simply impossible to avoid undefined behaviour here.)

bob · Post by **bob** » Mon Sep 14, 2015 11:26 pm

syzygy wrote:
Joost Buijs wrote:
lucasart wrote:Seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and dont care, just check for legality of the TT move
* Cilk Chess style: atomic lock for each TT entry

But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ? Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
The Stockfish style looks very dirty and prone to errors.
If you check the validity of the move there is a chance the entry is ok, but how do you check this in case of an upperbound when there is no move available?
Unless something has changed recently, SF has tens of thousands of hash collisions per second. The very few extra caused by races really won't matter.

Surely not using 64 bit Zobrist...

bob · Post by **bob** » Mon Sep 14, 2015 11:28 pm

syzygy wrote:
lucasart wrote:But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ?
Not really, but you can get close using CMPXCHG16B:
http://stackoverflow.com/questions/1817 ... umber-in-c

Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
If it works, the compiler will likely generate explicit locking instructions. Not efficient.

* Stockfish style: allow races and dont care, just check for legality of the TT move
The correct C++11 way to do this uses atomics with relaxed semantics. That basically tells the compiler that race conditions are by design and are OK. In principle that should not introduce any overhead and the program will not contain "undefined behaviour".

On a processor with TSX it might be possible to wrap all hash table accesses with a single mutex without measurable overhead. But I have not tried this (don't have such a processor). Not terribly useful, as it would kill performance on all other processors (unless disabled) and it does not solve any actually existing problem.

How is a race condition not "undefined behavior"?

hgm · Post by **hgm** » Mon Sep 14, 2015 11:41 pm

cdani wrote:Now storing and retrieving the whole uint64_t, I can be sure that at least this uint64_t is coherent. In fact now I don't find moves not related to the position, previously happened various times in a single game. That was my simple objective.

Ah OK, now I see. At least it guarantees you that the move belongs to the position. You probably care less about wrong scores or bounds, as the cannot crash your engine so easily.

atomic TT

Re: atomic TT

Re: atomic TT

Re: atomic TT

Re: atomic TT

Re: atomic TT

Re: atomic TT

Re: atomic TT

Re: atomic TT

Re: atomic TT

Re: atomic TT