Lockless hash: Thank you Bob Hyatt!


hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Another way

Post by hgm »

Indeed. When I tried to apply the lockless hashing in one of my engines I encountered two problems, both of which were easily solved once you realise you are dealing with a checksum. The first was that I only use a 32-bit signature, so that XORing the two words would not detect corruption in the half of the data word that does not overlap the signature. And the chances of two entries having the same depth and move (if these are stored in the bytes that are XORed with the signature) but a different score are not that small. But the signature can be XORed with both the low and the high half of the other word (or with that word multiplied by 0x100000001LL, if the signature is in the high half), to check for modifications in both halves.
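For illustration, here is a minimal sketch (my own, not hgm's actual code; all names are made up) of how a 32-bit signature kept in the high half of word1 can be made sensitive to both halves of the data word by folding in that word multiplied by 0x100000001:

Code: Select all

#include <stdint.h>

/* word1 holds the 32-bit signature in its high half and other packed data
   in its low half; word2 holds the rest of the entry */
typedef struct { uint64_t word1, word2; } TT32Entry;

static void tt32_store(TT32Entry *e, uint32_t sig, uint32_t packed, uint64_t data)
{
    uint64_t w1 = ((uint64_t)sig << 32) | packed;
    /* data * 0x100000001 adds data into both halves, so a change in either
       half of word2 disturbs the signature half of word1 */
    e->word1 = w1 ^ (data * 0x100000001ULL);
    e->word2 = data;
}

static int tt32_probe(const TT32Entry *e, uint32_t sig, uint32_t *packed, uint64_t *data)
{
    uint64_t w1 = e->word1 ^ (e->word2 * 0x100000001ULL);  /* undo the fold */
    if ((uint32_t)(w1 >> 32) != sig)
        return 0;                      /* torn or foreign entry: automatic miss */
    *packed = (uint32_t)w1;
    *data   = e->word2;
    return 1;
}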

The other problem is when the hash entry occupies more than two words. In that case you have to XOR (or add, or subtract) the word containing the signature with the data in all other words of the entry, to make the signature sensitive to changes in every part of it. Like word1 ^= FAC1 * word2 ^ FAC2 * word3 if the signature is in the high bytes of word1, with FAC1 and FAC2 some judiciously chosen constants (which map all bits of the words to the high part that corresponds to the signature in word1).
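As a sketch of what that could look like for an entry of three 64-bit words (FAC1 and FAC2 here are just example odd constants, not values from the post):

Code: Select all

#include <stdint.h>

#define FAC1 0x9E3779B97F4A7C15ULL   /* illustrative constants: odd          */
#define FAC2 0xC2B2AE3D27D4EB4FULL   /* multipliers that spread bits upward  */

typedef struct { uint64_t word1, word2, word3; } TT3Entry;

static void tt3_store(TT3Entry *e, uint64_t word1, uint64_t word2, uint64_t word3)
{
    /* fold the other words into word1, whose high bytes hold the signature */
    e->word1 = word1 ^ FAC1 * word2 ^ FAC2 * word3;
    e->word2 = word2;
    e->word3 = word3;
}

static uint64_t tt3_unfold(const TT3Entry *e)
{
    /* returns the original word1 only if word2 and word3 are the ones that
       were stored together with it; otherwise the signature comes out spoiled */
    return e->word1 ^ FAC1 * e->word2 ^ FAC2 * e->word3;
}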

The latter also helps if you have 'irregularly stored' entries. I sometimes use a storage scheme like

Code: Select all

typedef struct {
  int signature[7];        // 7 x 4 = 28 bytes
  short int score[7];      // 7 x 2 = 14 bytes
  short int move[7];       // 7 x 2 = 14 bytes
  unsigned char depth[7];  // 7 x 1 =  7 bytes
  char filler;             // 1 byte of padding, to make the bucket 64 bytes
} HashBucket;
In that case you could do the lockless hashing per slot by signature[i] ^= score[i] ^ move[i] ^ depth[i];
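A sketch of how that per-slot folding might look in code (my interpretation, using the HashBucket layout above; the function names are made up):

Code: Select all

static void bucket_store(HashBucket *b, int i,
                         unsigned sig, short score, short move, unsigned char depth)
{
    b->score[i] = score;
    b->move[i]  = move;
    b->depth[i] = depth;
    /* fold the other fields of slot i into its 32-bit signature */
    b->signature[i] = (int)(sig ^ (unsigned short)score ^ (unsigned short)move ^ depth);
}

static int bucket_probe(const HashBucket *b, int i, unsigned sig)
{
    unsigned stored = (unsigned)b->signature[i]
                    ^ (unsigned short)b->score[i]
                    ^ (unsigned short)b->move[i]
                    ^ b->depth[i];
    return stored == sig;   /* a torn slot spoils the signature -> miss */
}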

The main trick of lockless hashing is that it hides the checksum in the signature, so that you don't have to store and test those separately, because a faulty checksum spoils the signature and thus forces an automatic miss.
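For reference, the basic two-word form of the trick the thread title refers to looks roughly like this (a sketch, not anyone's actual engine code):

Code: Select all

#include <stdint.h>

typedef struct { uint64_t key_xor_data, data; } TT2Slot;

static void tt2_store(TT2Slot *slot, uint64_t key, uint64_t data)
{
    slot->key_xor_data = key ^ data;   /* the checksum is hidden in the key */
    slot->data         = data;
}

static int tt2_probe(const TT2Slot *slot, uint64_t key, uint64_t *data)
{
    uint64_t d = slot->data;
    if ((slot->key_xor_data ^ d) != key)
        return 0;                      /* torn or unrelated entry: a miss   */
    *data = d;
    return 1;
}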
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Another way

Post by bob »

hgm wrote:Indeed. When I tried to apply the lockless hashing in one of my engines I encountered two problems, both of which were easily solved once you realise you are dealing with a checksum. The first was that I only use a 32-bit signature, so that XORing the two words would not detect corruption in the half of the data word that does not overlap the signature. And the chances of two entries having the same depth and move (if these are stored in the bytes that are XORed with the signature) but a different score are not that small. But the signature can be XORed with both the low and the high half of the other word (or with that word multiplied by 0x100000001LL, if the signature is in the high half), to check for modifications in both halves.

The other problem is when the hash entry occupies more than two words. In that case you have to XOR (or add, or subtract) the word containing the signature with the data in all other words of the entry, to make the signature sensitive to changes in every part of it. Like word1 ^= FAC1 * word2 ^ FAC2 * word3 if the signature is in the high bytes of word1, with FAC1 and FAC2 some judiciously chosen constants (which map all bits of the words to the high part that corresponds to the signature in word1).

The latter also helps if you have 'irregularly stored' entries. I sometimes use a storage scheme like

Code: Select all

typedef struct {
  int signature[7];        // 7 x 4 = 28 bytes
  short int score[7];      // 7 x 2 = 14 bytes
  short int move[7];       // 7 x 2 = 14 bytes
  unsigned char depth[7];  // 7 x 1 =  7 bytes
  char filler;             // 1 byte of padding, to make the bucket 64 bytes
} HashBucket;
In that case you could do the lockless hashing per slot by signature[i] ^= score[i] ^ move[i] ^ depth[i];

The main trick of lockless hashing is that it hides the checksum in the signature, so that you don't have to store and test those separately, because a faulty checksum spoils the signature and thus forces an automatic miss.


I remember when we first ran into this issue: 1983, when we started using 2 CPUs on the Cray. We were always sensitive to bad moves from the TT, because they would corrupt things (Cray Blitz was not a copy/make type program). So I simply validity-checked the best move from the TT, and if it failed the test, I printed an error message and zeroed the move so it would not be played.

We would occasionally see that "bad move" message, and for a while we thought it was simply an actual hash collision. But then we found a position where it happened a lot, and fairly quickly; running with just one CPU we had no problems. We sat around a table that night and I remember thinking out loud: "You know, this thing has 1024 banks of memory, and memory is interleaved by words (8-byte words). It is possible that two threads store at the same time and you end up with half of each entry, or that one thread stores at the same time another thread probes, and the probe gets halves of two different entries."

We thought about it, agreed it was a potential problem, and tried a simple lock. On the Cray that was VERY fast, as it had a 32-bit semaphore register shared among all CPUs (2 at the time, but more later) where you could test/set any single bit; if the bit was already set, that CPU just hung until it became zero and then atomically set it to 1, etc. And the problem vanished. But by the time we hit the YMP that single lock was becoming noticeable, and by the time we were running on the prototype C90 at Cray (16 CPUs), Harry proposed the XOR trick to eliminate the lock and we were off and onto other issues.
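The semaphore-bit lock described above corresponds to what we would today write as a spinlock; a rough present-day analogue in C11 (a sketch, not the Cray code) is:

Code: Select all

#include <stdatomic.h>

static atomic_flag tt_lock = ATOMIC_FLAG_INIT;

static void lock_tt(void)
{
    /* spin until the flag was clear, atomically setting it as we pass */
    while (atomic_flag_test_and_set_explicit(&tt_lock, memory_order_acquire))
        ;
}

static void unlock_tt(void)
{
    atomic_flag_clear_explicit(&tt_lock, memory_order_release);
}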

The DEC Alpha was much worse, because it used out-of-order writes from a single CPU, which means even a one-CPU version could corrupt the TT with a little effort.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Another way (addendum)

Post by bob »

Unfortunately, the Alpha is a dead end. However, the concept of out-of-order memory writes will very likely show up again in future architectures, and when it does, the lockless hashing still has a chance of failing. Normally, on a single CPU (Intel/AMD today), if you write to two different memory addresses the two writes are guaranteed to be done in the order in which they appear in the instruction stream. That might not be true for future generations.

The Alpha gave us a memory-barrier (MB) instruction that said "before passing this point, complete all memory writes". But that is only a partial solution to the problem. For example, if you use a lock and you do the stores for the ttable entry followed by the store to clear the lock, the lock might get cleared before all of the ttable entry has been written.
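A sketch of that scenario with C11 atomics, where giving the lock-clearing store release semantics plays the role of the barrier (illustrative names, not anyone's engine code):

Code: Select all

#include <stdatomic.h>
#include <stdint.h>

typedef struct { uint64_t word1, word2; } TTSlot;

static atomic_int tt_entry_lock;   /* 0 = free, 1 = held */

static void tt_store_locked(TTSlot *slot, uint64_t word1, uint64_t word2)
{
    while (atomic_exchange_explicit(&tt_entry_lock, 1, memory_order_acquire))
        ;                          /* spin until we own the lock            */
    slot->word1 = word1;           /* plain stores to the entry...          */
    slot->word2 = word2;
    /* release ordering: the entry stores must be visible to other CPUs
       before the lock appears free again */
    atomic_store_explicit(&tt_entry_lock, 0, memory_order_release);
}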

I'll wait until we see another out-of-order writer before going any deeper, since today it is not relevant unless you use an old Alpha box somewhere...