
atomic TT

Posted: Sun Sep 13, 2015 10:22 pm
by lucasart
Seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and don't care, just check the TT move for legality
* Cilk Chess style: atomic lock for each TT entry

But, at least on x86-64, is it not possible to atomically load or store an entire 128-bit TT entry? Would a simple `std::atomic<TTEntry>` not do the job? Or maybe through compiler intrinsics? Has anyone tried that?

Re: atomic TT

Posted: Sun Sep 13, 2015 10:50 pm
by kbhearn
Everything I've found suggests that movdqa (the aligned 16-byte move) is NOT guaranteed to be atomic on x86-64, so std::atomic<> on a 16-byte class would introduce a lock. You can always test it with .is_lock_free() on a given implementation - if the compiler could get away with it I'm sure it'd try :)

Re: atomic TT

Posted: Sun Sep 13, 2015 10:57 pm
by sje
A fourth way to ensure atomicity:

4) Lock the enclosing transposition table region during entry access to avoid all race hazards.

This is what Symbolic does for each of its six different transposition tables. Each table is divided into 2^8 = 256 equal-sized contiguous regions, with each region guarded by its own spinlock. Because the region count is rather larger than the thread count and the per-access time is small, there is very little contention; locking and unlocking a spinlock requires little time when there is only one thread accessing the particular region.

The 2^8 number was not chosen at random; it is based on my tests with up to 16 active threads where I could not detect any difference between 256 region total times and 512 region total times. Using 128 spinlocks caused only a very slight increase in total time.
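A minimal sketch of the region-lock scheme described above, assuming a simple two-word entry and a table indexed by the low bits of the key (all names here are mine, not Symbolic's):

```cpp
#include <atomic>
#include <cstdint>
#include <vector>

constexpr std::size_t kRegions = 256;  // 2^8 regions, as described above

struct SpinLock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    void lock()   { while (f.test_and_set(std::memory_order_acquire)) { } }
    void unlock() { f.clear(std::memory_order_release); }
};

struct TTEntry { std::uint64_t key = 0, data = 0; };

class Table {
public:
    explicit Table(std::size_t n) : entries_(n) {}

    void store(std::uint64_t key, std::uint64_t data) {
        std::size_t idx = key % entries_.size();
        SpinLock& l = locks_[region(idx)];
        l.lock();
        entries_[idx] = {key, data};   // write under the region lock
        l.unlock();
    }

    bool probe(std::uint64_t key, std::uint64_t& data_out) {
        std::size_t idx = key % entries_.size();
        SpinLock& l = locks_[region(idx)];
        l.lock();
        TTEntry e = entries_[idx];     // read under the region lock
        l.unlock();
        if (e.key != key) return false;
        data_out = e.data;
        return true;
    }

private:
    // Map an index to its contiguous region [0, kRegions).
    std::size_t region(std::size_t idx) const {
        return idx * kRegions / entries_.size();
    }
    std::vector<TTEntry> entries_;
    SpinLock locks_[kRegions];
};
```

Note the spinlocks here sit back to back in one array, so several of them share a cache line; padding each lock out to its own line is a further refinement.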

Re: atomic TT

Posted: Sun Sep 13, 2015 11:00 pm
by kbhearn
It does seem, though, that since there are single-instruction 16-byte moves, a single LOCK-prefixed operation might be used, which I imagine is still an improvement over a software lock.

http://stackoverflow.com/questions/7646 ... ory-access
is related

Re: atomic TT

Posted: Sun Sep 13, 2015 11:34 pm
by lucasart
OK, so the short answer is no. I was a bit over-optimistic here.

I will try to fit 4 TT entries and an atomic lock in a cache line, and see how it goes, performance-wise.
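One way such a bucket could be laid out (the 12-byte entry split below is purely an assumption for illustration; 4 x 12 bytes of entries plus a 4-byte lock leaves 12 bytes of padding in the 64-byte line):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical 12-byte entry so that four of them plus a lock fit in
// one cache line. Field split is an assumption, not from any engine.
struct TTEntry {
    std::uint32_t key32;   // upper bits of the hash key
    std::uint16_t move;
    std::int16_t  score;
    std::uint8_t  depth;
    std::uint8_t  bound;
    std::uint16_t gen;     // aging counter
};

struct alignas(64) Bucket {
    TTEntry entries[4];             // 48 bytes
    std::atomic<std::uint32_t> lock{0};  // 4 bytes; rest of the line is padding
};

static_assert(sizeof(TTEntry) == 12, "entry must stay 12 bytes");
static_assert(sizeof(Bucket) == 64, "one bucket per cache line");
```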

Re: atomic TT

Posted: Mon Sep 14, 2015 12:36 am
by bob
sje wrote:A fourth way to ensure atomicity:

4) Lock the enclosing transposition table region during entry access to avoid all race hazards.

This is what Symbolic does for each of its six different transposition tables. Each table is divided into 2^8 = 256 equal-sized contiguous regions, with each region guarded by its own spinlock. Because the region count is rather larger than the thread count and the per-access time is small, there is very little contention; locking and unlocking a spinlock requires little time when there is only one thread accessing the particular region.

The 2^8 number was not chosen at random; it is based on my tests with up to 16 active threads where I could not detect any difference between 256 region total times and 512 region total times. Using 128 spinlocks caused only a very slight increase in total time.
One thing you MUST do to make this work well... each 4-byte lock MUST be in a separate 64-byte block of memory, so that you don't share locks within a cache line... The read-for-ownership type overhead becomes untenable...
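The padding described above can be sketched like this: give each spinlock its own 64-byte line via alignas, trading memory for the absence of false sharing between threads spinning on different regions:

```cpp
#include <atomic>

// Each lock occupies a full 64-byte cache line; two threads spinning
// on different region locks never invalidate each other's line.
struct alignas(64) PaddedLock {
    std::atomic_flag f = ATOMIC_FLAG_INIT;
    // remaining bytes of the line are padding
};

static_assert(sizeof(PaddedLock) == 64, "one lock per cache line");

PaddedLock locks[256];  // 16 KiB of locks instead of 1 KiB, but no sharing
```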

Re: atomic TT

Posted: Mon Sep 14, 2015 7:51 am
by sje
bob wrote:One thing you MUST do to make this work well... each 4-byte lock MUST be in a separate 64-byte block of memory, so that you don't share locks within a cache line... The read-for-ownership type overhead becomes untenable...
This is apparently not a problem with 256 locks, even when locks share residency in a cache line. If it were otherwise, there would be a big time difference between 256 regions and 512 regions -- there was none.

I started the testing with 64 regions and doubled that until there was no more time savings.

Perhaps the region count should be computed once at startup, as 16 times the maximum in-use thread count or core count.

I've also run tests with up to 256 threads running on from one to 16 cores. It works surprisingly well, in part because a lock is held for only a few nanoseconds and so there's little chance of a timer interrupt suspending the locking thread.

Re: atomic TT

Posted: Mon Sep 14, 2015 8:01 am
by hgm
Lock prefixes, software locks, legality checking, all are outrageously more expensive than the checksum / lockless hashing trick. This really should not be an issue.
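For reference, the checksum / lockless hashing trick mentioned above (popularized by Crafty) can be sketched as follows, assuming a two-word entry: the stored key is XORed with the data, so a torn entry (key half from one write, data half from another) fails the probe check by itself:

```cpp
#include <cstdint>

struct TTEntry {
    std::uint64_t xkey;  // key ^ data
    std::uint64_t data;  // packed move / score / depth / bound
};

void store(TTEntry& e, std::uint64_t key, std::uint64_t data) {
    e.data = data;
    e.xkey = key ^ data;   // two plain 8-byte stores, no lock needed
}

bool probe(const TTEntry& e, std::uint64_t key, std::uint64_t& data_out) {
    if ((e.xkey ^ e.data) != key)
        return false;      // torn entry, or a different position
    data_out = e.data;
    return true;
}
```

The cost is exactly the one extra 64-bit XOR per store and per probe; a race can only ever produce a failed probe, never a wrong hit (up to the usual key-collision probability).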

Re: atomic TT

Posted: Mon Sep 14, 2015 8:57 am
by Joost Buijs
hgm wrote:Lock prefixes, software locks, legality checking, all are outrageously more expensive than the checksum / lockless hashing trick. This really should not be an issue.
The lockless hashing trick only costs one extra 64-bit XOR instruction.
Empirically I cannot find much difference in speed when I compare it with 'lock cmpxchg16b'.
When you only look at TT access times there is a difference, but in the engine overall you hardly notice it.
So in practice both approaches will work.

Software locks are a non-starter: in the past I tried locking the TT with a critical section, and that totally destroyed performance.

Re: atomic TT

Posted: Mon Sep 14, 2015 9:15 am
by Joost Buijs
lucasart wrote:Seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and don't care, just check the TT move for legality
* Cilk Chess style: atomic lock for each TT entry

But, at least on x86-64, is it not possible to atomically load or store an entire 128-bit TT entry? Would a simple `std::atomic<TTEntry>` not do the job? Or maybe through compiler intrinsics? Has anyone tried that?
The Stockfish style looks very dirty and error-prone.
If you check the validity of the move there is a chance the entry is OK, but how do you check this for an upper-bound entry, when there is no move available?