Yes, but on 16 cores and up, there is going to be more than 3% cost. It adds up in a BIG hurry when you have 8 blocks of cache and 20 cores sharing them. There you see a ton of forwarding traffic which becomes a bottleneck quickly.sje wrote:A while back I ran a test on a quad core box with and without spinlocked regions. The test without spinlocks showed some data errors as expected. But it also ran only about three percent faster than the locking version. This was done doing perft() stashing/fetching perft(2) and deeper sums at every opportunity.bob wrote:256 locks = 8 cache lines. So there is contention when you get to a significant number of cores. but more importantly, you generate a TON of cache traffic. My 20 core box will definitely show the problem as you have 20 cores fighting over 8 cache blocks. Number of locks is irrelevant here, it is the cache blocks that are the issue.
With the spinlocked region approach, I don't have to know processor dependent details about 64 bit atomicity, I don't have to be concerned with 32 bit vs 64 bit architecture diferences, and with 128 bit signatures, I don't need to worry about false positives or move verification. So, for only a few speed percentage points, I get peace of mind.
atomic TT
Moderators: hgm, Rebel, chrisw
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: atomic TT
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Re: atomic TT
If I ever get a box with more than 16 cores, then I can try this:bob wrote:Yes, but on 16 cores and up, there is going to be more than 3% cost. It adds up in a BIG hurry when you have 8 blocks of cache and 20 cores sharing them. There you see a ton of forwarding traffic which becomes a bottleneck quickly.sje wrote:With the spinlocked region approach, I don't have to know processor dependent details about 64 bit atomicity, I don't have to be concerned with 32 bit vs 64 bit architecture diferences, and with 128 bit signatures, I don't need to worry about false positives or move verification. So, for only a few speed percentage points, I get peace of mind.
Code: Select all
class FatSpinlock: public Spinlock
{
private:
char padding[CacheLineSize - sizeof(Spinlock)];
};
-
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: atomic TT
Not really, but you can get close using CMPXCHG16B:lucasart wrote:But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ?
http://stackoverflow.com/questions/1817 ... umber-in-c
If it works, the compiler will likely generate explicit locking instructions. Not efficient.Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
The correct C++11 way to do this uses atomics with relaxed semantics. That basically tells the compiler that race conditions are by design and are OK. In principle that should not introduce any overhead and the program will not contain "undefined behaviour".* Stockfish style: allow races and dont care, just check for legality of the TT move
On a processor with TSX it might be possible to wrap all hash table accesses with a single mutex without measurable overhead. But I have not tried this (don't have such a processor). Not terribly useful, as it would kill performance on all other processors (unless disabled) and it does not solve any actually existing problem.
-
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: atomic TT
Unless something has changed recently, SF has tens of thousands of hash collisions per second. The very few extra caused by races really won't matter.Joost Buijs wrote:The Stockfish style looks very dirty and prone to errors.lucasart wrote:Seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and dont care, just check for legality of the TT move
* Cilk Chess style: atomic lock for each TT entry
But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ? Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
If you check the validity of the move there is a chance the entry is ok, but how do you check this in case of an upperbound when there is no move available?
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
Re: atomic TT
relaxed memory order does not introduce harmless data races. it only relaxes sequential consistency. data races are always undefined behavior in C++11.syzygy wrote: The correct C++11 way to do this uses atomics with relaxed semantics. That basically tells the compiler that race conditions are by design and are OK. In principle that should not introduce any overhead and the program will not contain "undefined behaviour".
the few experts on relaxed atomics discourage any casual user from using them, as they are very hard to get right. see Herb Sutter's talk http://herbsutter.com/2013/02/11/atomic ... -hardware/
-
- Posts: 12540
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: atomic TT
I imagine that there is a similar way to do this in Linux:
https://msdn.microsoft.com/en-us/library/bb514094.aspx
https://msdn.microsoft.com/en-us/library/bb514094.aspx
-
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: atomic TT
The data races are not "data races" anymore within the meaning of the C++11 standard and are therefore not undefined behaviour.Rein Halbersma wrote:relaxed memory order does not introduce harmless data races. it only relaxes sequential consistency. data races are always undefined behavior in C++11.syzygy wrote: The correct C++11 way to do this uses atomics with relaxed semantics. That basically tells the compiler that race conditions are by design and are OK. In principle that should not introduce any overhead and the program will not contain "undefined behaviour".
Sure, casual users will mess up.the few experts on relaxed atomics discourage any casual user from using them, as they are very hard to get right. see Herb Sutter's talk http://herbsutter.com/2013/02/11/atomic ... -hardware/
Locking is not a real option. Ignoring data corruption caused by smp races is just fine. Crafty's approach is fine as well. But to get either "right" under C++11, you need relaxed atomics. Without atomics, demons may fly out of your nose. With stricter than relaxed atomics, performance goes down the drain. (Of course pre-C++11 it was simply impossible to avoid undefined behaviour here.)
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: atomic TT
Surely not using 64 bit Zobrist...syzygy wrote:Unless something has changed recently, SF has tens of thousands of hash collisions per second. The very few extra caused by races really won't matter.Joost Buijs wrote:The Stockfish style looks very dirty and prone to errors.lucasart wrote:Seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and dont care, just check for legality of the TT move
* Cilk Chess style: atomic lock for each TT entry
But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ? Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
If you check the validity of the move there is a chance the entry is ok, but how do you check this in case of an upperbound when there is no move available?
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: atomic TT
How is a race condition not "undefined behavior"?syzygy wrote:Not really, but you can get close using CMPXCHG16B:lucasart wrote:But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ?
http://stackoverflow.com/questions/1817 ... umber-in-c
If it works, the compiler will likely generate explicit locking instructions. Not efficient.Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
The correct C++11 way to do this uses atomics with relaxed semantics. That basically tells the compiler that race conditions are by design and are OK. In principle that should not introduce any overhead and the program will not contain "undefined behaviour".* Stockfish style: allow races and dont care, just check for legality of the TT move
On a processor with TSX it might be possible to wrap all hash table accesses with a single mutex without measurable overhead. But I have not tried this (don't have such a processor). Not terribly useful, as it would kill performance on all other processors (unless disabled) and it does not solve any actually existing problem.
-
- Posts: 27796
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: atomic TT
Ah OK, now I see. At least it guarantees you that the move belongs to the position. You probably care less about wrong scores or bounds, as the cannot crash your engine so easily.cdani wrote:Now storing and retrieving the whole uint64_t, I can be sure that at least this uint64_t is coherent. In fact now I don't find moves not related to the position, previously happened various times in a single game. That was my simple objective.