atomic TT

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 3:48 am

Re: atomic TT

Post by kbhearn » Tue Sep 15, 2015 2:05 am

mar wrote:
kbhearn wrote:just because it's a race, and c++ will not guarantee anything in racy code. Demons may fly out your nose and whatnot.
I think it was more about potential reordering than races. I think the conclusion was that this can't happen in practice unless the compiler is broken.

There's no reliable/easy way to detect races at compile time, it could be detected at runtime but that would be a huge performance hit so very unlikely
(you could as well use some interpreted language instead and get the same performance).
Hardware could detect this (in fact CPU already has to synchronize caches somehow to maintain consistency if writes happen to cache lines holding
the same piece of physical memory - certainly in the case of atomics),
but IMHO this would be only practically useful for debugging.

So the only UB I see here is on hw level because of the race itself, I don't see what C++ has to do with it.

That being said I would prefer to finally see alignment attributes being standardized.
in practice, i think that is the conclusion that the UB is irrelevant because the compiler isn't smart enough to try to make cross-thread optimisations detecting and exploiting the potential race or its illegality (if there even is any such thing) so you'll just get the code you might expect each thread to run and what the hardware will do in the event of the race.

In terms of the contract the standard provides though, it is UB whether that be because of hypothetical future cross-thread optimisation or freedom when generating code for future unknown hardware or (perhaps most likely) just because precise language on the topic might be overly constricting to unknown/unrelated scenarios with no real gain to be had by specifying it.

Joost Buijs
Posts: 953
Joined: Thu Jul 16, 2009 8:47 am
Location: Almere, The Netherlands

Re: atomic TT

Post by Joost Buijs » Tue Sep 15, 2015 5:14 am

syzygy wrote:
Joost Buijs wrote:
lucasart wrote:Seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and dont care, just check for legality of the TT move
* Cilk Chess style: atomic lock for each TT entry

But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ? Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
The Stockfish style looks very dirty and prone to errors.
If you check the validity of the move there is a chance the entry is ok, but how do you check this in case of an upperbound when there is no move available?
Unless something has changed recently, SF has tens of thousands of hash collisions per second. The very few extra caused by races really won't matter.
Well, I don't know what you think is a collision, but tens of thousands per second is ridiculous.
And don't underestimate the number of TT races, because they are not few when you don't take precautions to avoid them.
Especially on a large number of cores there are many of them.
Maybe this is one of the reasons the Stockfish SMP does not perform the way it should.

syzygy
Posts: 4450
Joined: Tue Feb 28, 2012 10:56 pm

Re: atomic TT

Post by syzygy » Tue Sep 15, 2015 8:17 pm

lucasart wrote:I don't understand how the C++ standard allows the compiler to break lock-less hashing xor trick. Anyone can explain (in lay man words) ?
See here.

Crafty's HashProbe() contains:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word1 = htable->word1;
    word2 = htable->word2 ^ word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    ...
A C99 or C11 compiler may legally rewrite this as:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word2 = htable->word2 ^ htable->word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    word1 = htable->word1;
    ...
A C99 compiler does not know about threads and therefore only needs to care about preserving single threaded semantics.

A C11 compiler knows about threads, but may assume that no data races are present.

If I am not mistaken, one can prevent the reload from memory in C11 by declaring the word1 and word2 fields as atomic and accessing them using memory_order_relaxed.

This might still leave other opportunities for the compiler to miscompile...

syzygy
Posts: 4450
Joined: Tue Feb 28, 2012 10:56 pm

Re: atomic TT

Post by syzygy » Tue Sep 15, 2015 8:37 pm

Joost Buijs wrote:
syzygy wrote:
Joost Buijs wrote:
lucasart wrote:Seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and dont care, just check for legality of the TT move
* Cilk Chess style: atomic lock for each TT entry

But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ? Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
The Stockfish style looks very dirty and prone to errors.
If you check the validity of the move there is a chance the entry is ok, but how do you check this in case of an upperbound when there is no move available?
Unless something has changed recently, SF has tens of thousands of hash collisions per second. The very few extra caused by races really won't matter.
Well, I don't know what you think is a collision, but tens of thousands per second is ridiculous.
I am thinking of false hash hits.

SF only uses 16 bits to distinguish positions within a TT bucket. Each bucket (of 32 bytes) contains 3 positions. So if the TT is full, a position that is not in the TT will be matched to one of the positions in its bucket about once every 22000 (2^16 / 3) probes.

So tens of thousands per second is a bit on the high side, but at 22Mnps you will be getting towards 1000 false hits per second (it will be less because there are also true hits).

I don't believe this is optimal at large time controls, but that has to be convincingly proven before it will be changed to something more sensible.
And don't underestimate the number of TT races, because they are not few when you don't take precautions to avoid them.
They are very rare relative to false hash hits if the TT table is large, unless you're in an endgame position with few unique positions in the search tree. A TT race occurs only when two threads happen to hit the same TT entry at the same time. Very large TT tables are standard nowadays.

User avatar
lucasart
Posts: 3037
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart
Contact:

Re: atomic TT

Post by lucasart » Tue Sep 15, 2015 9:35 pm

syzygy wrote:
lucasart wrote:I don't understand how the C++ standard allows the compiler to break lock-less hashing xor trick. Anyone can explain (in lay man words) ?
See here.

Crafty's HashProbe() contains:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word1 = htable->word1;
    word2 = htable->word2 ^ word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    ...
A C99 or C11 compiler may legally rewrite this as:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word2 = htable->word2 ^ htable->word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    word1 = htable->word1;
    ...
A C99 compiler does not know about threads and therefore only needs to care about preserving single threaded semantics.

A C11 compiler knows about threads, but may assume that no data races are present.

If I am not mistaken, one can prevent the reload from memory in C11 by declaring the word1 and word2 fields as atomic and accessing them using memory_order_relaxed.

This might still leave other opportunities for the compiler to miscompile...
What if htable->word1 changes under our feet when we break from the for loop, just after having set word2. We would end up with inconsistent word1 and word2, while the if test was passed. I don't see how memory_order_relaxed prevent that. Don't we need memory_order_acquire here ? Or at lease memory_order_consume ? This could lead to really subtle bugs, that only occurs randomly & rarely on non x86 platforms like ARM…
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

syzygy
Posts: 4450
Joined: Tue Feb 28, 2012 10:56 pm

Re: atomic TT

Post by syzygy » Tue Sep 15, 2015 10:17 pm

lucasart wrote:
syzygy wrote:
lucasart wrote:I don't understand how the C++ standard allows the compiler to break lock-less hashing xor trick. Anyone can explain (in lay man words) ?
See here.

Crafty's HashProbe() contains:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word1 = htable->word1;
    word2 = htable->word2 ^ word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    ...
A C99 or C11 compiler may legally rewrite this as:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word2 = htable->word2 ^ htable->word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    word1 = htable->word1;
    ...
A C99 compiler does not know about threads and therefore only needs to care about preserving single threaded semantics.

A C11 compiler knows about threads, but may assume that no data races are present.

If I am not mistaken, one can prevent the reload from memory in C11 by declaring the word1 and word2 fields as atomic and accessing them using memory_order_relaxed.

This might still leave other opportunities for the compiler to miscompile...
What if htable->word1 changes under our feet when we break from the for loop, just after having set word2. We would end up with inconsistent word1 and word2, while the if test was passed.
That is exactly why the second version is broken.

The first version (that is in Crafty) is broken because the compiler may compile it into code equivalent to the second version, i.e. with a reload of htable->word1 from memory after setting word2. (If the compiler does not produce the reload, then the code will run fine, but the C standard does not guarantee this.)
I don't see how memory_order_relaxed prevent that.
I am pretty sure that memory_order_relaxed already prevents the compiler from loading the value from memory twice if the C code reads the variable only once. So you're guaranteed that word1 will have the same value everywhere and the same for word2. That is all you need. Thanks to the xor-trick you don't care whether word1 and word2 are read and written atomically, but the trick breaks down if one of the local variables word1 and word2 may "magically" change its value between lines.

User avatar
lucasart
Posts: 3037
Joined: Mon May 31, 2010 11:29 am
Full name: lucasart
Contact:

Re: atomic TT

Post by lucasart » Tue Sep 15, 2015 10:25 pm

After carefully considering the options, I will go ahead with lock-less hashing, but implemented correctly using atomics with whatever is the correct and least strict memory order:
* Steven's solution of 256 spinlocks has the same asymptotic efficiency than locking the entire table (even though for typical hash size, threads, search time, it may perform ok) Not very elegant IMHO.
* Stockfish solution is to ignore the problem, and requires to write a move_is_legal() function which is a lot of code
* An atomic lock per entry (or per cluster) may have bad asymptotic behaviour, as pointed out by Bob. More importantly, the atomic_flag of the spinlock wastes precious space in TT entries, that I want to pack efficiently.
* Lockless hashing (done correctly so the above reordering cannot occur) has no memory overhead, and the speed overhead is ridiculously small, I doubt you can even measure it. Plus is guarantees (assuming 64 bit keys) that collisons or races that pas the xor test are so rare they can be assumed to never happen, so I do not even need to write a move_is_legal() function.

Thanks for all the help and clarifications.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

syzygy
Posts: 4450
Joined: Tue Feb 28, 2012 10:56 pm

Re: atomic TT

Post by syzygy » Tue Sep 15, 2015 11:07 pm

lucasart wrote:so I do not even need to write a move_is_legal() function.
As long as your program does not crash on an illegal move.

bob
Posts: 20470
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: atomic TT

Post by bob » Tue Sep 15, 2015 11:33 pm

syzygy wrote:
lucasart wrote:
syzygy wrote:
lucasart wrote:I don't understand how the C++ standard allows the compiler to break lock-less hashing xor trick. Anyone can explain (in lay man words) ?
See here.

Crafty's HashProbe() contains:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word1 = htable->word1;
    word2 = htable->word2 ^ word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    ...
A C99 or C11 compiler may legally rewrite this as:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word2 = htable->word2 ^ htable->word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    word1 = htable->word1;
    ...
A C99 compiler does not know about threads and therefore only needs to care about preserving single threaded semantics.

A C11 compiler knows about threads, but may assume that no data races are present.

If I am not mistaken, one can prevent the reload from memory in C11 by declaring the word1 and word2 fields as atomic and accessing them using memory_order_relaxed.

This might still leave other opportunities for the compiler to miscompile...
What if htable->word1 changes under our feet when we break from the for loop, just after having set word2. We would end up with inconsistent word1 and word2, while the if test was passed.
That is exactly why the second version is broken.

The first version (that is in Crafty) is broken because the compiler may compile it into code equivalent to the second version, i.e. with a reload of htable->word1 from memory after setting word2. (If the compiler does not produce the reload, then the code will run fine, but the C standard does not guarantee this.)
I don't see how memory_order_relaxed prevent that.
I am pretty sure that memory_order_relaxed already prevents the compiler from loading the value from memory twice if the C code reads the variable only once. So you're guaranteed that word1 will have the same value everywhere and the same for word2. That is all you need. Thanks to the xor-trick you don't care whether word1 and word2 are read and written atomically, but the trick breaks down if one of the local variables word1 and word2 may "magically" change its value between lines.
However, GCSE optimization does suggest this will NEVER happen. The compiler knows better than to load the same thing twice, and since there is no "register jam" in that code, it would almost certainly only happen if the compiler is broken. Yes it COULD do that. But doing so would be a pretty lousy way of producing object code...

bob
Posts: 20470
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: atomic TT

Post by bob » Tue Sep 15, 2015 11:35 pm

lucasart wrote:After carefully considering the options, I will go ahead with lock-less hashing, but implemented correctly using atomics with whatever is the correct and least strict memory order:
* Steven's solution of 256 spinlocks has the same asymptotic efficiency than locking the entire table (even though for typical hash size, threads, search time, it may perform ok) Not very elegant IMHO.
* Stockfish solution is to ignore the problem, and requires to write a move_is_legal() function which is a lot of code
* An atomic lock per entry (or per cluster) may have bad asymptotic behaviour, as pointed out by Bob. More importantly, the atomic_flag of the spinlock wastes precious space in TT entries, that I want to pack efficiently.
* Lockless hashing (done correctly so the above reordering cannot occur) has no memory overhead, and the speed overhead is ridiculously small, I doubt you can even measure it. Plus is guarantees (assuming 64 bit keys) that collisons or races that pas the xor test are so rare they can be assumed to never happen, so I do not even need to write a move_is_legal() function.

Thanks for all the help and clarifications.
Note that this fixes "false collisions" caused by SMP behavior. But REAL collisions still happen. If a bad move will cause problems, it still needs to be tested.

Post Reply