atomic TT

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: atomic TT

Post by lucasart »

In DiscoCheck, I never tested for legality of the hash move. I chose to keep enough bits from the key so that probability of collision is practically zero. With 64 bits, you probably have more chance to crash due to a bit flip caused by a gust of solar wind, than by a zobrist collision :lol:
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: atomic TT

Post by bob »

lucasart wrote:In DiscoCheck, I never tested for legality of the hash move. I chose to keep enough bits from the key so that probability of collision is practically zero. With 64 bits, you probably have more chance to crash due to a bit flip caused by a gust of solar wind, than by a zobrist collision :lol:
I actually detect an occasional collision. I don't currently display them, but have in the past (at least detected by signature match, but illegal move) and one every game or two or three was not uncommon... Sometimes a few more.
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: atomic TT

Post by Joost Buijs »

syzygy wrote:
Joost Buijs wrote:
syzygy wrote:
Joost Buijs wrote:
lucasart wrote:Seems there are a few ways of managing TT races:
* Crafty style: lockless hashing trick
* Stockfish style: allow races and dont care, just check for legality of the TT move
* Cilk Chess style: atomic lock for each TT entry

But, at least on x86-64, is it not possible to atomically load or store an entire 128bit TT entry ? Would a simple `std::atomic<TTEntry>` not do the job ? Or maybe through compiler intrinsics ? Anyone tried that ?
The Stockfish style looks very dirty and prone to errors.
If you check the validity of the move there is a chance the entry is ok, but how do you check this in case of an upperbound when there is no move available?
Unless something has changed recently, SF has tens of thousands of hash collisions per second. The very few extra caused by races really won't matter.
Well, I don't know what you think is a collision, but tens of thousands per second is ridiculous.
I am thinking of false hash hits.

SF only uses 16 bits to distinguish positions within a TT bucket. Each bucket (of 32 bytes) contains 3 positions. So if the TT is full, a position that is not in the TT will be matched to one of the positions in its bucket about once every 22000 (2^16 / 3) probes.

So tens of thousands per second is a bit on the high side, but at 22Mnps you will be getting towards 1000 false hits per second (it will be less because there are also true hits).

I don't believe this is optimal at large time controls, but that has to be convincingly proven before it will be changed to something more sensible.
And don't underestimate the number of TT races, because they are not few when you don't take precautions to avoid them.
They are very rare relative to false hash hits if the TT table is large, unless you're in an endgame position with few unique positions in the search tree. A TT race occurs only when two threads happen to hit the same TT entry at the same time. Very large TT tables are standard nowadays.
I never looked at the Stockfish TT code in detail, but I can imagine that when you only use 16 bits to distinguish between positions in a bucket you will have a lot of - what you call - false hash hits.
That still doesn't mean you can ignore TT races because they are fewer than - in the Stockfish case - false hash hits, or because the tables are so large nowadays.
Specially in the endgame it can wreck you, and my point is that it is so easy to solve by using a lockless scheme or an interlocked function.
syzygy
Posts: 5562
Joined: Tue Feb 28, 2012 11:56 pm

Re: atomic TT

Post by syzygy »

lucasart wrote:In DiscoCheck, I never tested for legality of the hash move. I chose to keep enough bits from the key so that probability of collision is practically zero. With 64 bits, you probably have more chance to crash due to a bit flip caused by a gust of solar wind, than by a zobrist collision :lol:
The probability is much higher.

If you store 32 bits in a TT entry (and use the other 32 bits to index into the TT), you will have 1 false hit per 4 billion probes. If you do 10 million probes per second, that is one false hit per 400 seconds. It is worse if you use buckets of entries.

Storing all 64 bits in a TT entry is not going to help much, as the bits you use for indexing into the TT won't help you to distinguish between positions mapped to the same TT entry.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: atomic TT

Post by hgm »

syzygy wrote:Storing all 64 bits in a TT entry is not going to help much, as the bits you use for indexing into the TT won't help you to distinguish between positions mapped to the same TT entry.
This is why I XOR these bits with the incremental PST eval to derive the index.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: atomic TT

Post by wgarvin »

bob wrote:
syzygy wrote:
lucasart wrote:
syzygy wrote:
lucasart wrote:I don't understand how the C++ standard allows the compiler to break lock-less hashing xor trick. Anyone can explain (in lay man words) ?
See here.

Crafty's HashProbe() contains:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word1 = htable->word1;
    word2 = htable->word2 ^ word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    ...
A C99 or C11 compiler may legally rewrite this as:

Code: Select all

  for &#40;entry = 0; entry < 4; entry++, htable++) &#123;
    word2 = htable->word2 ^ htable->word1;
    if &#40;word2 == temp_hashkey&#41;
      break;
  &#125;
  if &#40;entry < 4&#41; &#123;
    word1 = htable->word1;
    ...
A C99 compiler does not know about threads and therefore only needs to care about preserving single threaded semantics.

A C11 compiler knows about threads, but may assume that no data races are present.

If I am not mistaken, one can prevent the reload from memory in C11 by declaring the word1 and word2 fields as atomic and accessing them using memory_order_relaxed.

This might still leave other opportunities for the compiler to miscompile...
What if htable->word1 changes under our feet when we break from the for loop, just after having set word2. We would end up with inconsistent word1 and word2, while the if test was passed.
That is exactly why the second version is broken.

The first version (that is in Crafty) is broken because the compiler may compile it into code equivalent to the second version, i.e. with a reload of htable->word1 from memory after setting word2. (If the compiler does not produce the reload, then the code will run fine, but the C standard does not guarantee this.)
I don't see how memory_order_relaxed prevent that.
I am pretty sure that memory_order_relaxed already prevents the compiler from loading the value from memory twice if the C code reads the variable only once. So you're guaranteed that word1 will have the same value everywhere and the same for word2. That is all you need. Thanks to the xor-trick you don't care whether word1 and word2 are read and written atomically, but the trick breaks down if one of the local variables word1 and word2 may "magically" change its value between lines.
However, GCSE optimization does suggest this will NEVER happen. The compiler knows better than to load the same thing twice, and since there is no "register jam" in that code, it would almost certainly only happen if the compiler is broken. Yes it COULD do that. But doing so would be a pretty lousy way of producing object code...
You know, I feel certain that we've had this conversation before! :lol:

Ah yes, here it is:
http://talkchess.com/forum/viewtopic.ph ... 328#546328
wgarvin wrote:
bob wrote:2. I am certainly betting that the compiler, in its "common subexpression" optimization pass, will realize that going BACK to memory is about a thousand times worse than recomputing a typical common subexpression. And it will NOT generate multiple memory accesses.
Okay, wait a second there -- if we're talking about a compiler that is compiling "single threaded" code, and it knows it just accessed the memory location recently, why would it NOT reload it from memory instead of recomputing the expression? L1 cache is extremely fast, and rematerialization might look cheaper to it than spilling since it doesn't have to spill the value it loaded the first time if it just re-loads it. Compilers, even smart ones, don't always make optimal decisions about what to keep live in a register, and e.g. I think clang aggressively splits live ranges to try and make better use of registers around loops etc.
...
http://talkchess.com/forum/viewtopic.ph ... 473#546473
wgarvin wrote:
hgm wrote:Or is the problem the specific way it is programmed in Crafty, not making a copy, but following the pointer twice, relying on the optimizer having made the copy in the register anyway. The latter seems risky. Doing a second memory (= L1 cache) access is actually the fastest way to do this on many x86 CPUs; code that only uses architectural registers often runs very slow there, and you seem to need a mix of memory operands and register operands to be able to reach maximum throughput. If I would hand-optimize the code I probably would opt for a second memory access. Especially if the 64-bit data part was in a register, but I needed only a single byte of it that stored the draft. Just reading the required byte from memory would definitely consume fewer CPU resources than shifting and masking the value stored in the register.
Yes, that was the point I was trying to make about L1 being so fast -- the compiler likely knows that reading from it a second time is quite cheap.

But without the volatile, even making a copy of the hash entry into a local variable, may not protect you from compiler optimizations. By alias analysis, it will know that nothing else has written to the original hash slot (since it considers only the "single threaded" code you are executing, and isn't aware of other threads). The compiler might choose to keep each member of the structure in a register, so the "copy" will consist of read(s) from the hash table into registers. It might then decide to split the live range of those registers and/or spill them to the stack. At that point it will notice that it can re-create the value by reading it straight from the original memory, and if it does that it doesn't have to actually spill the register to memory (so it saves a store if it rematerializes from the original location instead of loading from the local variable on the stack). Since it knows the original location was recently accessed, it knows it will hit in L1. It has to preserve the abstract C machine behavior only for a "single threaded" machine; it doesn't know or care that you might actually have other threads. It could easily turn the one read from your source code into two or more reads in the machine code.
...
Edit: though I was mostly describing C++9x/C99 compilers, I think the principle still applies. Though C11/C++11 introduce the memory-ordering stuff and may make it easier to write threaded programs which do subtle stuff while still being legal C or C++ programs with defined semantics. But I don't have any experience using that new stuff. :)
User avatar
Rebel
Posts: 6991
Joined: Thu Aug 18, 2011 12:04 pm

Re: atomic TT

Post by Rebel »

syzygy wrote:
lucasart wrote:In DiscoCheck, I never tested for legality of the hash move. I chose to keep enough bits from the key so that probability of collision is practically zero. With 64 bits, you probably have more chance to crash due to a bit flip caused by a gust of solar wind, than by a zobrist collision :lol:
The probability is much higher.

If you store 32 bits in a TT entry (and use the other 32 bits to index into the TT), you will have 1 false hit per 4 billion probes. If you do 10 million probes per second, that is one false hit per 400 seconds. It is worse if you use buckets of entries.

Storing all 64 bits in a TT entry is not going to help much, as the bits you use for indexing into the TT won't help you to distinguish between positions mapped to the same TT entry.
It's (probably) for this reason I do 2 extra checks before rewarding the pruning of the tree, aside from the depth (standard) the ply as well as the iteration depth have to match. It's costly, 6-7% but eng-eng play gives a small positive score, in the order of roughly 51%. The disadvantage is of course such a system doesn't allow the TT contents to be used for the next move, except for move ordering. For the record, I store the full hash key.
90% of coding is debugging, the other 10% is writing bugs.