SMP hashing problem

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

SMP hashing problem

Post by bob »

This keeps coming up, and I have received 5 emails so far asking "what is this problem you mentioned in the comments in main.c recently?"

Here is the issue: First, let's take a normal hash table. With Crafty, an entry is 16 bytes. I store the Zobrist signature in 8 bytes; the other 8 bytes hold the score/bound, the type (EXACT, LOWER, UPPER, WORTHLESS), the draft (depth remaining below this position), and the best move.

In order to store this in the hash table, you have to do something like this:

htable->word1 = key;
htable->word2 = score+move+etc...

Whether you do it with two stores, a memcpy(), or whatever does not matter. The PC has no way to store 16 bytes to memory in one cycle, so the write to memory is done in 8-byte "chunks". (And I am not even discussing the write-back buffers inside the CPU cache; we can ignore those for the moment.)

Now remember that we address the table by taking the lower N bits of the signature/key and using that as a table index.

So assume we have two positions, P1 and P2. Both have different signatures, but they happen to match in the lower N bits. Both have different scores, best moves, etc.

On one processor, we try to store p1.key and p1.info into table entry X. On another processor, we try to store p2.key and p2.info into the same table entry. The first processor could store p1.key and then get an interrupt. While it is handling that, the second processor stores p2.key and p2.info. And then, after the first finishes handling the interrupt, it stores p1.info. Since both store in the same entry, we now have p2.key and p1.info, which is bad. Interrupts are not the only way for this to happen. Both processors share the same memory and bus, so even perfectly-ordered stores can still produce this because of timing. And the write-combining buffers in cache can further add to the difficulties.

When you are in position p2, and you do a lookup, you find the p2 signature, but the data you get does not go with p2, it goes with p1. How this affects your program depends on lots of things. You just retrieved a move that might be illegal. If that will break your search, then you are going to crash. If it doesn't, then you might not even notice or care about this.

The simple solution is that before you store position.key and position.info, you modify the key by XORing it with the info that goes with it. Now let's do this again.

You store p1.key', the modified key. The other processor then stores p2.key' and p2.info, and then the first stores p1.info. When you come back and do a probe, the first thing you do is take the table key and XOR it with the table info, before you try to match the current signature with the table signature. And now the match won't happen, because p2.key' XOR p1.info will not produce the original p2.key.
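
Here is a minimal C sketch of that store/probe scheme. The names (HashEntry, hash_store, hash_probe, TABLE_BITS) are illustrative, not Crafty's actual code:

#include <stdint.h>

typedef struct {
    uint64_t word1;             /* Zobrist key XORed with the data word */
    uint64_t word2;             /* score/bound, type, draft, best move packed together */
} HashEntry;

#define TABLE_BITS 20
#define TABLE_SIZE (1ULL << TABLE_BITS)

static HashEntry table[TABLE_SIZE];

void hash_store(uint64_t key, uint64_t data) {
    HashEntry *h = &table[key & (TABLE_SIZE - 1)];  /* index = lower N bits of key */
    h->word1 = key ^ data;                          /* store the modified key */
    h->word2 = data;
}

int hash_probe(uint64_t key, uint64_t *data) {
    HashEntry *h = &table[key & (TABLE_SIZE - 1)];
    uint64_t w1 = h->word1;
    uint64_t w2 = h->word2;
    if ((w1 ^ w2) != key)       /* matches only if key and data were stored together */
        return 0;
    *data = w2;
    return 1;
}

Any interleaving of two processors' 8-byte stores leaves word1 and word2 out of sync, so the XOR test simply rejects the entry instead of returning another position's data.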

My problem was with the pawn hash table. When I first wrote the code, this was not an issue to speak of, because nothing in the table would cause me to crash if it was invalid. But over the years, this changed. One of the things I store in a pawn hash entry is an 8-bit mask for white and one for black indicating which files have passed pawns on them. Then, in the EvaluatePassedPawn() code, I don't have to hunt for passed pawns: I just take this mask, find the first 1-bit, which gives me the file number, then I can AND the pawn bitboard for the correct side with the right mask for that file and use MSB (for white) or LSB (for black) to find the most advanced passed pawn.

Unfortunately, MSB(n) or LSB(n) can return 8 (for an empty mask), while valid files are 0-7. And years ago I decided "I know that these file flags are correct, so there is no need to verify them when using them...". So what was happening was that in some cases I would get a false match (right key, wrong data, as above) and then get an invalid file number. Which would make me use an invalid mask. Which would give me bogus data, and eventually this produced an ugly subscript (it should be between 0 and 63, but it was in the billions) and I would get a segfault. And it was impossible to reproduce.
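
As a sketch, the mask-driven lookup looks roughly like this, including the sanity check that was skipped; file_mask, the a1=0 square numbering, and the GCC bit-scan builtins are assumptions for illustration:

#include <stdint.h>

extern const uint64_t file_mask[8];               /* the 8 squares of each file */

void score_white_passers(uint64_t white_pawns, uint8_t passer_files) {
    while (passer_files) {
        int file = __builtin_ctz(passer_files);   /* first 1-bit gives the file number */
        passer_files &= passer_files - 1;         /* clear that bit */
        uint64_t on_file = white_pawns & file_mask[file];
        if (!on_file)                             /* the check that was skipped: */
            continue;                             /* bad data would hit MSB(0) here */
        int sq = 63 - __builtin_clzll(on_file);   /* MSB = most advanced white pawn */
        /* ... evaluate the passed pawn on square sq ... */
        (void)sq;
    }
}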

My pawn hash entry is 32 bytes long: 8 bytes for the pawn signature, 24 bytes for scores, king safety pawn shelter info, passed pawns, candidate pawns, weak pawns, etc. The fix was to XOR the signature with the other three 8-byte chunks, just as we did in the "lockless hashing" code for the normal hash table.
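
A sketch of the same trick extended to the 32-byte entry; the field layout here is illustrative:

#include <stdint.h>

typedef struct {
    uint64_t sig;       /* pawn signature ^ data[0] ^ data[1] ^ data[2] */
    uint64_t data[3];   /* scores, pawn shelter info, passer/candidate/weak masks */
} PawnEntry;

void pawn_store(PawnEntry *e, uint64_t sig, const uint64_t d[3]) {
    e->data[0] = d[0];
    e->data[1] = d[1];
    e->data[2] = d[2];
    e->sig = sig ^ d[0] ^ d[1] ^ d[2];            /* signature validates all three words */
}

int pawn_probe(const PawnEntry *e, uint64_t sig, uint64_t d[3]) {
    d[0] = e->data[0];
    d[1] = e->data[1];
    d[2] = e->data[2];
    /* a match is only possible if all four 8-byte chunks belong together */
    return (e->sig ^ d[0] ^ d[1] ^ d[2]) == sig;
}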

Now I can't get a signature match unless the data that goes with the signature is present, which solved the problem. I found this when I hit one position that Crafty simply could not search without crashing, ever. Several pawn structures had different signatures but accessed the same entry, which broke the entry as above. And that caused the crash (it took a very specific set of bad data to actually make it crash rather than just compute bogus scores). After the fix, the problem is gone, and I could detect no difference in NPS when using my 8-core box. By the way, for reference: when I first found this, I wanted to verify that this was what was happening (it could have just been a wild memory store that unluckily clobbered the data in this entry). I used a simple spinlock (Lock() / Unlock() in Crafty). But since this has to be done on _every_ pawn hash lookup and every store (the lookup is a killer), NPS dropped from about 20M on this box to around 5M. Which shows that a lock used every node is simply unusable...
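
For reference, the verification lock was of this general shape (a sketch using C11 atomics; Crafty's actual Lock()/Unlock() macros are its own). Every pawn hash probe and store then pays a bus-locked read-modify-write plus any spinning, which is where the 20M-to-5M NPS drop comes from:

#include <stdatomic.h>

static atomic_flag pawn_lock = ATOMIC_FLAG_INIT;

static inline void Lock(void) {
    while (atomic_flag_test_and_set_explicit(&pawn_lock, memory_order_acquire))
        ;                        /* spin until the flag is released */
}

static inline void Unlock(void) {
    atomic_flag_clear_explicit(&pawn_lock, memory_order_release);
}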

Hope that explanation is clear. Crafty now uses "lockless hashing" on both the normal hash table and the pawn hash table... And that long-term problem is gone. On a dual-cpu this could still happen, but was less likely for obvious reasons. I first saw this when running on a quad-quad last year, but could never get it to reproduce. I then ran into it again on a dual I7 late last year. And had seen it on the dual-quad I use regularly. But when I say "I had seen this" I mean maybe once in a hundred games or less. So it wasn't glaring. But it was there... "was" being the important word. Not "is". :)
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

bob wrote: [...] The PC has no way to store 16 bytes to memory in one cycle, so the write to memory is done in 8-byte "chunks".
Actually, DDR2 RAM does store more than 8 bytes at once.

It depends largely on whether we look at the memory controllers from Intel or from AMD. Huge differences.

Let's start with intel.

It's a 128-bit path for Nehalem with DDR3, so that's 16 * 2^2 = 64 bytes at once.

DDR4 stores 128 bytes at once, and DDR5 stores 256 bytes at once.

The DDR2 RAM you've got stores in chunks of 32 bytes.

AMD is totally different there, and much better from a latency viewpoint for what we do: it has 2 channels of 64 bits rather than 1 channel of 128 bits like Intel. AMD is far superior from our latency viewpoint.

Additionally, processors have store buffers, which they write out at once on PCs, so on PCs there is in practice even less risk. The odds of a bitflip in memory with non-ECC RAM, which about everyone has in their PCs except the university guys, are in fact much bigger.

Only if you also take into account special hardware that has more issues with memory controllers, such as some shared-memory supercomputers, is it possible to get write errors.

I measured that on the Origin3800 supercomputer, after technical discussions with SGI on this subject. SGI was very helpful providing technical data here in 2003. I did a lot of research into collisions and errors.

Without the XOR, I measured 1 write error per 200 billion nodes, using a hash table of 100GB on 200 processors. Note that this included hash stores in qsearch, especially for that occasion (on the supercomputer I tended to store in qsearch only locally for production; specially for this test it was done globally, which slowed down the search a little). The number of write errors was smaller than the number of hash collisions I measured. Considering the article you wrote on hash collisions, in which you didn't really care about a lot of collisions, and the fact that the write error occurs ONLY ON SUPERCOMPUTERS, on the PC I could measure no errors at all.

Note that I'm using a more limited form of the XOR in Diep's hash table.

I have no clue why you want to store that much in your hash table, Bob; maybe it can be done more efficiently.

In Diep I just use some 8-12 bytes for pawn entries and evaluation entries. Let me look. Currently it is 16 bytes for the transposition table.

The reason Diep needs that many bytes for a transposition-table entry is that I also store the evaluation besides the search score, and each score uses up 20 bits. So that's already 40 bits for the 2 scores.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: SMP hashing problem

Post by michiguel »

bob wrote: [...] So what was happening was that in some cases I would get a false match (right key, wrong data, as above) and then get an invalid file number. Which would make me use an invalid mask. Which would give me bogus data, and eventually this produced an ugly subscript (it should be between 0 and 63, but it was in the billions) and I would get a segfault. And it was impossible to reproduce.
I think it is a good idea to use assert() and to systematically run tests with a debug version. Some of these ugly problems can be caught earlier. This is particularly useful when something depends on an assumption that you may want to change later, forgetting that the assumption was critical for something else.

One of the things about computer chess I hate is that sometimes I spend more time debugging than writing the code I introduced. When this is a hobby and you have limited time, it is really annoying. Without assert(), this would have been intolerable for me. Some people manage well without them, though.
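
For instance, a sketch of the check that would have caught the bad file number early; LSB here is a hypothetical first-set-bit helper, and the assert compiles away under -DNDEBUG:

#include <assert.h>

extern int LSB(unsigned mask);       /* hypothetical: index of the first 1-bit */

int first_passer_file(unsigned passer_mask) {
    int file = LSB(passer_mask);
    assert(file >= 0 && file <= 7);  /* fires immediately on a corrupted entry */
    return file;
}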

Miguel

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:
bob wrote: [...] The PC has no way to store 16 bytes to memory in one cycle, so the write to memory is done in 8-byte "chunks".
Actually, DDR2 RAM does store more than 8 bytes at once. [...] The DDR2 RAM you've got stores in chunks of 32 bytes.
All of that is well and good, but practically it does _not_ work like that. I ran into this identical problem on multiple platforms, all the way up to a dual quad-core I7. I debugged and fixed it on our dual quad-core 2.33ghz xeon (core-2) cluster.

The pawn hash table is aligned on a 32-byte starting address. If it really stored 32 bytes at once, this problem could _never_ have happened. Yet I could make it happen at will on the dual Xeon, once I found a position that was problematic. Ditto for the dual I7 I was testing on...


[...] I have no clue why you want to store that much in your hash table, Bob; maybe it can be done more efficiently.
?? My hash table is 16 bytes per entry. The _pawn_ hash table has 32-byte entries, but there I store lots of information that is relatively expensive to compute but can't be used directly in EvaluatePawns(), because those scores require piece information and then the pawn hashing would not work.

In Diep I just use some 8-12 bytes for pawn entries and evaluation entries. Let me look. Currently it is 16 bytes for the transposition table.

The reason Diep needs that many bytes for a transposition-table entry is that I also store the evaluation besides the search score, and each score uses up 20 bits. So that's already 40 bits for the 2 scores.
I used to use 12 bytes, because that was what we did in Cray Blitz. The current approach wastes 4 bytes per entry, since I store the entire 64-bit hash signature, but it is also faster, as measured in tests when I decided to "go simpler"...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

michiguel wrote:
bob wrote: [...] So what was happening was that in some cases I would get a false match (right key, wrong data, as above) and then get an invalid file number. Which would make me use an invalid mask. Which would give me bogus data, and eventually this produced an ugly subscript (it should be between 0 and 63, but it was in the billions) and I would get a segfault. And it was impossible to reproduce.
I think it is a good idea to use assert() and to systematically run tests with a debug version. Some of these ugly problems can be caught earlier. This is particularly useful when something depends on an assumption that you may want to change later, forgetting that the assumption was critical for something else.
The problem is, this was happening with a very rare frequency, typically once every 100 games or so, and not in the games played on my cluster. I even tried 32,000 games using 2 CPUs for Crafty, and it never failed (I inserted checks for the bad values, specifically, once I knew what was happening)...


diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

bob wrote:
diep wrote: [...]
All of that is well and good, but practically it does _not_ work like that. I ran into this identical problem on multiple platforms, all the way up to a dual quad-core I7. I debugged and fixed it on our dual quad-core 2.33ghz xeon (core-2) cluster.

The pawn hash table is aligned on a 32-byte starting address. If it really stored 32 bytes at once, this problem could _never_ have happened. Yet I could make it happen at will on the dual Xeon, once I found a position that was problematic. Ditto for the dual I7 I was testing on...
Hi Bob, on memory controllers we have the luxury of being able to ask an expert.

Actually, I'm in the lucky position here of easily being able to quote experts.

Note that Nehalem, thanks to its triple-channel memory controller, moves 96 bytes at once to and from the memory controllers.

What I also asked is how this works:
"it is completely independent of the platform"
"you just go by I/O pin"
Vincent says: (3:59:01 AM) so it is also the case for amd?
Vincent says: (3:59:05 AM) or just for intel
Michael says: (3:59:08 AM) because it is done internally on the memory
Michael says: (3:59:26 AM) no matter who supplies the memory controller
Michael says: (3:59:40 AM) Intel, AMD. VIA, HP
Michael says: (3:59:44 AM) doesn't matter

Thanks to Michael Schuette (director of technology at OCZ).

He complained, by the way, about a memory test of mine. As it uses just 8 bytes, the memory controller internally has to throw away a lot of bytes, and that throwing away costs some time, making it slower to get 8 bytes than 32, 64, or 128 bytes on DDR2, DDR3, and DDR4 respectively.

Heh - try to find such experts at any uni :)

So the point is that with DDR2/DDR3 RAM, a big bunch of bytes gets read and written at once, not just a few. That makes sense, of course; otherwise they wouldn't get the bandwidth that they get.

The processors work differently with respect to alignment and cache-line sizes. Now I must avoid running into trouble, as not all tests I have written in my life are public source code (understatement).

How the store buffers work differs from processor to processor.
If you use entries of 16 bytes and you are sure they are aligned, it obviously cannot happen on most PC processors. It would be interesting to figure out better how this works on Nehalem.

If you get 'weird results', then please try to reproduce it so I can take a look. There must be a bug somewhere.

Speaking of bugs, I removed a parallel search bug from Diep that had been in there since shortly before the 1999 world championship :)

Though it hardly ever occurred. Possibly it never crashed because of it, thanks to other code.

As we both know, nothing can ever be claimed bug-free in parallel searching engines; that's the thing :)

Maybe you simply measured a bitflip, who knows?

Vincent
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

More explanation of how the memory controller works:

Vincent says: (4:26:57 AM) we write nonstop with all cores to RAM
Vincent says: (4:27:13 AM) now suppose 2 cores write at same time to the same adress in RAM
Vincent says: (4:27:34 AM) one entry is 16 bytes 
Vincent says: (4:27:50 AM) core1 writes (a1,a2)
Vincent says: (4:27:59 AM) core2 writes (b1,b2)
Vincent says: (4:28:15 AM) can we get in ram stored (a1,b2) or (b1,a2) ?
Vincent says: (4:28:22 AM) as that would mean big problem
Michael says: (4:28:51 AM) no. that's not really how it works
Michael says: (4:29:37 AM) all cores share the same two memory controllers via crossbar
Michael says: (4:30:11 AM) and that is essentially the same thing as if they only had one memory controller / channel
Michael says: (4:30:25 AM) so they cannot write "simultaneously" to memory

Anyway, I knew this already back in 2002 or so; I just asked to confirm it.

Note that on some HPC processors like the R14000 you can get a write error, as they work in a more complex way than PC processors.

Thanks,
Vincent
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote:
bob wrote: [...]

If you use entries of 16 bytes and you are sure they are aligned, it obviously cannot happen on most PC processors. [...]

Maybe you simply measured a bitflip, who knows?

Vincent
I just realized that you and I are talking about a _different_ problem. A cache controller writing multiple quadwords at once is not the issue. What I described was this:

We have two processors, A and B. A searches to reach position Pa, while B searches to reach position Pb. Pa and Pb are completely different positions, but the lower N bits of Pa and Pb are the same so they both hash to the same memory address (table entry).

Processor A needs to store two words in the hash table, both of which belong to position Pa. Processor B needs to store two words in the hash table, both of which go with position Pb. And they are going to store at the same memory address. A stores the first 8 bytes of the hash signature, which is all you can write to memory with one instruction on X86 (a mov instruction with an 8-byte register, or movq on Linux). A then gets "distracted" for any reason: an interrupt, or hyperthreading is enabled and the other thread executes an instruction, whatever. While that is happening, B stores _both_ words of the Pb entry to memory. A then comes along and stores the second word of Pa. Now we have the first word of Pb and the second word of Pa, and that's broken.

This is not an issue of writing a complete 64-byte / 128-byte L2 cache line to memory in one operation. It is an issue of the CPU being unable to write more than 8 bytes at once (or, for more recent processors, no more than 16), which then fails on my 32-byte hash entries. So it isn't about the cache-to-memory write so much as it is about the CPU instruction set's ability to write to cache/memory.

And this happens _often_. Even on the 16-byte hash table entries, and it happens _much_ more often on the 32-byte pawn hash entries. Forget about memory controllers and such and figure out how to store 16 bytes with one instruction on X86. You can't... And that is a problem for the above scenarios...
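
To make that concrete, here is a sketch; the assembly in the comments is illustrative of what a compiler emits for the two 8-byte assignments:

#include <stdint.h>

typedef struct { uint64_t word1, word2; } HashEntry;

void store_entry(HashEntry *h, uint64_t w1, uint64_t w2) {
    h->word1 = w1;   /* mov [mem], rax    : A writes Pa's first word          */
                     /* <- B's two stores for Pb can both complete right here */
    h->word2 = w2;   /* mov [mem+8], rbx  : A writes Pa's second word         */
}
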
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP hashing problem

Post by bob »

diep wrote: More explanation of how the memory controller works: [...]
Vincent says: (4:28:15 AM) can we get in ram stored (a1,b2) or (b1,a2) ?
Michael says: (4:28:51 AM) no. that's not really how it works
Michael says: (4:30:25 AM) so they cannot write "simultaneously" to memory
[...]
Michael is not thinking of what I am talking about. If the CPU stores only A1, then pauses for any reason, from an interrupt to a hyper-threading context swap, the other core can store B1 and B2, and then the first core later comes back to the current process and stores A2.

But for the hardware, I still believe he is wrong, because of the "write-combine" buffers Intel uses. A1 could be on the tail end of one of those buffers and get written out _before_ A2 gets written out. Other cores can then overwrite A1 and A2 before A2 gets written, and we are right back to this problem. The X86 actually has load/store fence operations you can use to help with this, if you are writing in asm. But even those do not solve the problem that two consecutive mov [mem], rax and mov [mem+8], rbx do not necessarily get executed close together, even though they appear consecutively in the program being run. And without that guarantee, we have to take evasive action ourselves...
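
As a sketch of that last point, using _mm_sfence from the SSE intrinsics: the fence orders this core's two stores, but it does not make the pair atomic, so another core can still write the entry in between:

#include <stdint.h>
#include <xmmintrin.h>   /* _mm_sfence */

typedef struct { uint64_t word1, word2; } HashEntry;

void store_with_fence(HashEntry *h, uint64_t w1, uint64_t w2) {
    h->word1 = w1;
    _mm_sfence();    /* A1 is globally visible before A2 starts...        */
    h->word2 = w2;   /* ...but B can still store B1 and B2 in between     */
}
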
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: SMP hashing problem

Post by diep »

bob wrote:
diep wrote: [...]

I just realized that you and I are talking about a _different_ problem. A cache controller writing multiple quadwords at once is not the issue. What I described was this:

We have two processors, A and B. A searches to reach position Pa, while B searches to reach position Pb. Pa and Pb are completely different positions, but the lower N bits of Pa and Pb are the same, so they both hash to the same memory address (table entry). [...] While that is happening, B stores _both_ words of the Pb entry to memory. A then comes along and stores the second word of Pa. Now we have the first word of Pb and the second word of Pa, and that's broken. [...]
Suppose we have 2 different entries both going to the same slot in the hash table:

{a1,a2} and {b1,b2}

What we want to know is the odds of writing {a1,b2} or {b1,a2}.

This is why we use the XOR trick: to avoid it.

In supercomputer processors like the R14000, an interrupt can indeed cause you to write {a1,b2} or {b1,a2}. You guessed that right. This is, however, not how modern microprocessors work; they are way more sophisticated than that. At a philosophical level you can understand why: if it worked like the above, it would of course be impossible to keep a PC stable online, and the bandwidth to the RAM would be too little.

On PC hardware as it is now, this is not possible, PROVIDED that you write the entire entry within 1 cacheline. It then gets stored to RAM at once, without interruption.

Of course the scenario where you KILL a process is not interesting to discuss, as then Crafty is dead. However, even in THAT case it should go OK according to the PC paper specs, as the flush operation is atomic: either the entire cacheline is written or nothing at all.

Memory bitflips CAN occur, however. If you have ECC RAM you can take a look at how many bitflips happen, as it generates statistics about this. That is very easy to do in Linux.

So with non-ECC RAM, which the majority of machines use, you could sometimes get a bitflip, in which case you can get a 'seemingly' different entry, which gets reported to you as a write error...

Thanks,
Vincent