atomic TT

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

atomic TT

Post by lucasart »

It seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and don't care, just check the TT move for legality
* Cilk Chess style: atomic lock for each TT entry

But, at least on x86-64, is it not possible to atomically load or store an entire 128-bit TT entry? Would a simple `std::atomic<TTEntry>` not do the job? Or maybe through compiler intrinsics? Has anyone tried that?
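For concreteness, a minimal sketch of what I mean, assuming a hypothetical 16-byte entry layout (the field names are illustrative, not from any particular engine):

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical 16-byte TT entry; the exact fields don't matter here.
struct TTEntry {
    uint64_t key;    // Zobrist key
    uint16_t move;   // packed move
    int16_t  score;
    int16_t  eval;
    uint8_t  depth;
    uint8_t  bound;  // bound type + age
};
static_assert(sizeof(TTEntry) == 16, "entry should be 16 bytes");

std::atomic<TTEntry> entry;

// Does the target handle 16-byte loads/stores without a hidden lock?
bool lock_free = entry.is_lock_free();
```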
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: atomic TT

Post by kbhearn »

Everything I've found suggests that movdqa (aligned 16-byte move) is NOT guaranteed to be atomic on x86-64, so std::atomic<> on a 16-byte class would introduce a lock. You can always test it with .is_lock_free() on a given implementation - if the compiler could get away with it I'm sure it'd try :)
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: atomic TT

Post by sje »

A fourth way to ensure atomicity:

4) Lock the enclosing transposition table region during entry access to avoid all race hazards.

This is what Symbolic does for each of its six different transposition tables. Each table is divided into 2^8 = 256 equal-sized contiguous regions, each guarded by its own spinlock. Because the region count is much larger than the thread count and the per-access time is small, there is very little contention; locking and unlocking a spinlock takes little time when only one thread is accessing the particular region.

The 2^8 figure was not chosen at random; it is based on my tests with up to 16 active threads, in which I could not detect any difference in total time between 256 regions and 512 regions. Using 128 spinlocks caused only a very slight increase in total time.
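A minimal sketch of the scheme, assuming 256 regions and a plain test-and-set spinlock (names are mine, not Symbolic's):

```cpp
#include <atomic>
#include <cstddef>

constexpr size_t kRegions = 256;  // 2^8 contiguous regions

struct SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
    void lock()   { while (flag.test_and_set(std::memory_order_acquire)) {} }
    void unlock() { flag.clear(std::memory_order_release); }
};

SpinLock region_locks[kRegions];

// Map a table index to the lock guarding its contiguous region;
// assumes table_size is a multiple of kRegions.
SpinLock& lock_for(size_t index, size_t table_size) {
    return region_locks[index / (table_size / kRegions)];
}
```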
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: atomic TT

Post by kbhearn »

It does seem, though, that since there are single-instruction 16-byte moves, a single LOCK-prefixed operation might be used, which I imagine is still an improvement over a software lock.

Related: http://stackoverflow.com/questions/7646 ... ory-access
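To illustrate, a sketch of a 16-byte atomic access via the GCC/Clang __atomic builtins, which on x86-64 (built with -mcx16) can lower to a lock cmpxchg16b sequence; that lowering is a toolchain assumption, not something any engine here is confirmed to do:

```cpp
using u128 = unsigned __int128;  // GCC/Clang extension

// 16-byte atomic load/store; with -mcx16 on x86-64 these can compile
// to lock cmpxchg16b loops (possibly via libatomic) instead of a mutex.
u128 load_entry(u128* slot) {
    return __atomic_load_n(slot, __ATOMIC_RELAXED);
}

void store_entry(u128* slot, u128 value) {
    __atomic_store_n(slot, value, __ATOMIC_RELAXED);
}
```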
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: atomic TT

Post by lucasart »

OK, so the short answer is no. I was a bit over-optimistic here.

I will try to fit 4 TT entries and an atomic lock in a cache line, and see how it goes, performance-wise.
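Something along these lines, assuming packed 15-byte entries and a one-byte atomic flag (the layout is a guess at this point, nothing measured):

```cpp
#include <atomic>
#include <cstdint>

#pragma pack(push, 1)
struct PackedEntry {   // 15 bytes when packed
    uint64_t key;
    uint16_t move;
    int16_t  score;
    uint8_t  depth;
    uint8_t  bound;
    uint8_t  age;
};
#pragma pack(pop)

// Four entries plus a lock, padded to exactly one 64-byte cache line.
struct alignas(64) Bucket {
    PackedEntry entry[4];                      // 60 bytes
    std::atomic_flag lock = ATOMIC_FLAG_INIT;  // typically 1 byte
};
static_assert(sizeof(Bucket) == 64, "bucket should fill one cache line");
```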
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: atomic TT

Post by bob »

sje wrote:A fourth way to ensure atomicity:

4) Lock the enclosing transposition table region during entry access to avoid all race hazards.

This is what Symbolic does for each of its six different transposition tables. Each table is divided into 2^8 = 256 equal-sized contiguous regions, each guarded by its own spinlock. Because the region count is much larger than the thread count and the per-access time is small, there is very little contention; locking and unlocking a spinlock takes little time when only one thread is accessing the particular region.

The 2^8 figure was not chosen at random; it is based on my tests with up to 16 active threads, in which I could not detect any difference in total time between 256 regions and 512 regions. Using 128 spinlocks caused only a very slight increase in total time.
One thing you MUST do to make this work well... each 4-byte lock MUST be in a separate 64-byte block of memory, so that you don't share locks within a cache line... The read-for-ownership overhead becomes untenable otherwise...
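A sketch of that padding, with alignas doing the spacing (my rendering of the idea, not Crafty's code):

```cpp
#include <atomic>

// One lock per 64-byte cache line, so threads spinning on different
// locks never fight over the same line via read-for-ownership traffic.
struct alignas(64) PaddedLock {
    std::atomic<int> lock{0};
    // the rest of the line is padding inserted by alignas(64)
};
static_assert(sizeof(PaddedLock) == 64, "one lock per cache line");

PaddedLock region_locks[256];
```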
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: atomic TT

Post by sje »

bob wrote:One thing you MUST do to make this work well... each 4-byte lock MUST be in a separate 64-byte block of memory, so that you don't share locks within a cache line... The read-for-ownership overhead becomes untenable otherwise...
This is apparently not a problem with 256 locks, even when some locks share a cache line. If it were otherwise, there would be a big time difference between 256 regions and 512 regions -- there was none.

I started the testing with 64 regions and doubled that until there was no more time savings.

Perhaps the region count should be a one-time calculation at startup, set to 16 times the maximum in-use thread count or core count.
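For example, a sketch of such a startup calculation (rounding up to a power of two is my addition, so region selection can be a cheap mask):

```cpp
#include <algorithm>
#include <cstddef>
#include <thread>

// Pick the region count once at startup: 16x the core count,
// rounded up to a power of two.
size_t region_count() {
    size_t n = 16 * std::max(1u, std::thread::hardware_concurrency());
    size_t p = 1;
    while (p < n) p <<= 1;
    return p;
}
```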

I've also run tests with up to 256 threads running on from one to 16 cores. It works surprisingly well, in part because a lock is held for only a few nanoseconds and so there's little chance of a timer interrupt suspending the locking thread.
hgm
Posts: 27787
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: atomic TT

Post by hgm »

Lock prefixes, software locks, legality checking, all are outrageously more expensive than the checksum / lockless hashing trick. This really should not be an issue.
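For reference, a minimal sketch of that trick as commonly described for Crafty (structure and names are illustrative): store key XOR data alongside the data, so a torn write from a racing thread simply fails verification on probe.

```cpp
#include <cstdint>

// Lockless hashing: the stored "key" field is key ^ data. If another
// thread tears the entry between the two writes, verification fails
// and the probe is treated as a miss instead of returning garbage.
struct XorEntry {
    uint64_t key_xor_data;
    uint64_t data;  // move, score, depth, bound packed into 64 bits
};

void store(XorEntry& e, uint64_t key, uint64_t data) {
    e.data = data;
    e.key_xor_data = key ^ data;
}

bool probe(const XorEntry& e, uint64_t key, uint64_t& data_out) {
    uint64_t d = e.data;
    if ((e.key_xor_data ^ d) != key)  // torn write or different position
        return false;
    data_out = d;
    return true;
}
```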
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: atomic TT

Post by Joost Buijs »

hgm wrote:Lock prefixes, software locks, legality checking, all are outrageously more expensive than the checksum / lockless hashing trick. This really should not be an issue.
The lockless hashing trick only takes one extra 64-bit XOR instruction.
Empirically I cannot find much difference in speed when I compare it with 'lock cmpxchg16b'.
When you look only at TT access times there is a difference, but in the engine overall you hardly notice it.
So in practice both approaches will work.

Software locks are not an option: in the past I tried locking the TT with a critical section, and that totally destroyed performance.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: atomic TT

Post by Joost Buijs »

lucasart wrote:It seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and don't care, just check the TT move for legality
* Cilk Chess style: atomic lock for each TT entry

But, at least on x86-64, is it not possible to atomically load or store an entire 128-bit TT entry? Would a simple `std::atomic<TTEntry>` not do the job? Or maybe through compiler intrinsics? Has anyone tried that?
The Stockfish style looks very dirty and error-prone.
If you check the validity of the move there is a chance the entry is OK, but how do you check this for an upper-bound entry, where no move is available?