Here is the issue: First, let's take a normal hash table. With Crafty, an entry is 16 bytes. I store the Zobrist signature in 8 bytes, the other 8 bytes has the score/bound, type (EXACT, LOWER, UPPER, WORTHLESS), the draft (depth remaining below this position), and the best move.
In order to store this in the hash table, you have to do something like this:
htable->word1 = key;
htable->word2 = score+move+etc...
Whether you do it with two stores, a memcpy() or whatever does not matter. The PC has no way to store 16 bytes in memory in one cycle, so the write to memory is done in 8 byte "chunks". (and I am not even discussing write-back buffers inside the CPU cache, we can ignore that for the moment.)
Now remember that we address the table by taking the lower N bits of the signature/key and using that as a table index.
So assume we have position P1 and P2. Both have different signatures, but they happen to match in the lower N bits. Both have different scores, best moves, etc.
On one processor, we try to store p1.key and p1.info into table entry X. On another processor, we try to store p2 key and p2 info into the same table entry. The first processor could store p1.key and then get an interrupt. While it is handling that, the second processor stores p2.key and p2.info. And then after the first finishes handling the interrupt, it finishes and stores p1.info. Since both store in the same entry, we now have p2.key and p1.info, which is bad. Interrupts are not the only way for this to happen. Both share the same memory and bus to access this memory, so just perfectly-ordered stores can still produce this because of timing. And then the write-combining buffers in cache can further add to the difficulties.
When you are in position p2, and you do a lookup, you find the p2 signature, but the data you get does not go with p2, it goes with p1. How this affects your program depends on lots of things. You just retrieved a move that might be illegal. If that will break your search, then you are going to crash. If it doesn't, then you might not even notice or care about this.
The simple solution is that before you store a position.key or position.info, you modify the key by XORing it with the info that goes with it. Now lets do this again.
You store p1.key', the modified key. The other processor then stores p2.key' and p2.info, and then the first stores p1.info. When you come back and do a probe, the first thing you do is take the table key and XOR it with the table info, before you try to match the current signature with the table signature. And now the match won't happen, because p2.key XOR p1.info will not produce the original p2.key.
My problem was with the pawn hash table. When I first wrote the code, this was not an issue to speak of, because nothing in the table would cause me to crash if it was invalid. But over the years, this changed. One of the things I store in a pawn hash entry is an 8 bit mask for white and black indicating which files have passed pawns on them. Then in the EvaluatePassedPawn() code I don't have to hunt for passed pawns, I just take this mask, find the first 1-bit which gives me the file #, then I can AND the pawns bitboard for the correct side with the right mask for that file and then use MSB (for white) or LSB (for black) to find the most advanced passed pawn.
Unfortunately, MSB(n) or LSB(n) returns an 8 since valid files are 0-7. And years ago I decided "I know that these file flags are correct so there is no need to verify them when using them...". So what was happening was that in some cases, I would get a false match (right key, wrong data as above) and then get an invalid file number. Which would make me use an invalid mask. Which would give me bogus data, and eventually this produced an ugly subscript (should be between 0-63, but it was in the billions, and I would get a setfault. And it was impossible to reproduce.
My pawn hash entry is 32 bytes long. 8 bytes for the pawn signature, 24 bytes for scores, king safety pawn shelter info, passed pawns, candidate pawns, weak pawns, etc. The fix was to XOR the signature with the other 3 8-byte chunks just as we did in the "lockless hashing" code for the normal hash table.
Now I can't get a signature match unless the data that goes with the signature is present, which solved the problem. I found this when I found one position that Crafty could simply not search without crashing, every. Several pawn structures had different signatures, but accessed the same entry, which broke the entry as above. And caused the crash (it took a very specific set of bad data to actually make it crash rather than just compute bogus scores). After the fix, the problem is gone, and I could detect no difference in NPS when using my 8-core box. By the way, for reference, when I first found this I wanted to verify this was what was happening (it could have just been a wild memory store that unluckily clobbered the data in this entry). I used a simple spinlock (Lock() / Unlock() in Crafty). But since this has to be done on _every_ pawn hash lookup and every store (the lookup is a killer) NPS dropped from about 20M on this box to around 5M. Which shows that a lock used every node is simply unusable...
Hope that explanation is clear. Crafty now uses "lockless hashing" on both the normal hash table and the pawn hash table... And that long-term problem is gone. On a dual-cpu this could still happen, but was less likely for obvious reasons. I first saw this when running on a quad-quad last year, but could never get it to reproduce. I then ran into it again on a dual I7 late last year. And had seen it on the dual-quad I use regularly. But when I say "I had seen this" I mean maybe once in a hundred games or less. So it wasn't glaring. But it was there... "was" being the important word. Not "is".
