bob wrote:
diep wrote:
bob wrote:
Sven Schüle wrote:
bob wrote:
Sven Schüle wrote:
diep wrote:
zamar wrote:
AFAIK hash key collisions and multithreading are the only sources of illegal moves in the TT.
Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.
We have no control over thread execution, so in an extreme case this could happen:
thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)
[sleep(1second) = situation when OS decides to pause thread execution]
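The schedule above can be replayed deterministically in plain C (a minimal sketch; the entry layout and the values are invented for illustration, not engine code):

```c
#include <assert.h>
#include <stdint.h>

/* A 16-byte hash entry written as two separate 64-bit stores. */
typedef struct { uint64_t lock; uint64_t data; } Entry;

Entry slot;  /* one shared table slot */

/* Replay the schedule sequentially: thread1 writes half of entry A,
   is paused, thread2 writes both halves of entry B, then thread1
   wakes up and finishes entry A. */
void torn_write(void) {
    slot.lock = 0x1111111111111111ULL;  /* thread1: first half of A  */
    slot.lock = 0x2222222222222222ULL;  /* thread2: first half of B  */
    slot.data = 0x2222222222222222ULL;  /* thread2: second half of B */
    slot.data = 0x1111111111111111ULL;  /* thread1: second half of A */
}
```

After `torn_write()` the slot holds B's first half and A's second half; neither entry was ever stored intact, even though every individual 64-bit store was atomic.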
What you write is impossible in hardware.
It was impossible even in hardware from the 90s, which already wrote 32 bytes at once.
Modern hardware has cachelines of 64 bytes, which is 512 bits, and it really writes a whole cacheline at once, not 64 bits.
So only when an entry straddles two cachelines, and two different threads start writing to RAM at the same time, can it happen.
As all of this is atomically optimized, it still hardly ever happens even when two writes collide. That's why the odds of it are smaller than the odds of a collision.
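Whether an entry can straddle two cachelines is purely an alignment question; a hypothetical check, assuming 64-byte lines and 16-byte entries:

```c
#include <assert.h>
#include <stdint.h>

enum { CACHE_LINE = 64, ENTRY_SIZE = 16 };

/* 1 if an entry starting at byte offset addr spans two cachelines. */
int straddles(uint64_t addr) {
    return addr / CACHE_LINE != (addr + ENTRY_SIZE - 1) / CACHE_LINE;
}
```

With the table allocated cacheline-aligned and an entry size that divides 64, `straddles()` is always 0, which is the failure mode alignment does rule out; it does not rule out the instruction-level interleaving discussed below.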
Can you explain how it works in the case of cache misses on both CPUs? Especially, what happens on the memory bus if CPU2 performs both instructions "store first 64 bits of hash element" and "store second 64 bits of hash element" before CPU1 performs its "store second 64 bits of hash element" instruction? Can "cache coherency" prevent this from happening?
Sven
No it can't. This is known in architecture as a "write after write hazard condition" and cache coherency has no effect on it at all. It is a simple race condition. Nothing can solve this, and on some machines, even the hardware will cause this to happen (the alpha does out-of-order memory writes, for example).
Thanks again for confirming also this part of the story. I think the facts are now clear to most readers.
Sven
There are some older versions of Crafty without the "lockless hashing" code. One only has to run it on a multi-cpu box to see the problem. Doesn't happen every move, but it happens a bunch. More cores = more bad moves from hash.
if you do this to store a 16 byte entry:
mov hash[rax], rbx
mov hash+8[rax], rcx
you have a problem: the first mov can execute, writing the first 8 bytes to the hash; that thread can then get suspended; a second thread executes both of its instructions; and then the first comes back, writes its second 8 bytes, and breaks the entry. There are other ways for it to occur as well, but the basic idea is that the two writes get split, with two other writes in between. More threads make this more likely, particularly since a hash transaction is generally going to be a cache miss anyway, sort of creating a pseudo-synchronization point because the access time on such entries is so slow. In Cray Blitz we originally locked entries, since semaphores were so fast there, but we (Harry Nelson's idea) started on the lockless approach to reduce semaphore usage when we got to 32 cpus in 1993 or 1994...
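The lockless approach mentioned here is usually described as XORing the two halves of the entry together on store, so that a torn entry fails validation on probe. A minimal sketch (the names and layout are mine, not Crafty's actual code):

```c
#include <assert.h>
#include <stdint.h>

typedef struct { uint64_t word1; uint64_t word2; } HashEntry;

/* Store: fold the position key into word1, so an entry whose two
   halves came from different stores no longer validates. */
void lockless_store(HashEntry *e, uint64_t key, uint64_t data) {
    e->word1 = key ^ data;
    e->word2 = data;
}

/* Probe: trust the entry only if both halves belong to the same
   store for this key; otherwise treat it as a miss. */
int lockless_probe(const HashEntry *e, uint64_t key, uint64_t *data) {
    if ((e->word1 ^ e->word2) != key)
        return 0;   /* torn write, or a different position */
    *data = e->word2;
    return 1;
}
```

A torn entry built from two different stores will, barring a chance collision, fail the XOR check and simply be ignored, so no lock is needed on either the store or the probe path.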
Bob, attach odds to each of these scenarios.
First of all, you assume simple code here. If you vectorize this, it's already a single SIMD instruction, say SSE2 or newer. In fact your Core2 Xeons have SSSE3, which is a lot newer than SSE2.
Still takes two stores for most people. I looked at my code, and the compiler (intel) doesn't use any SSE/MMX stuff at all in storing hash entries...
So then it can't happen.
Secondly, some OSes can stop a thread at any point; others cannot.
Not sure what that means. If interrupts are enabled, you are going to lose some time in the context-switch to the OS to handle the interrupt. I don't know of any O/S that keeps interrupts disabled while a user program is running. How would you preempt to execute another process there? The typical unix box gets 100 timer interrupts per second anyway, just for starters.
So in those you are safe as well.
Now let's ignore the above realities. Let's focus upon odds.
What are the odds that one of the 8 threads you run gets interrupted?
100%. I run 8 threads on 8 cores. At the very least, the O/S is going to have to handle the real-time clock interrupts (typically 100/sec for unix), not to mention misc network traffic and such. So some thread will get interrupted quite frequently overall.
Very little of course. Only for a few milliseconds, and then it executes a couple of thousand instructions of another program and restarts again.
Right?
Right, but that is all that is needed to split the first pair of writes, and do the second pair of writes in between them. I'll try to dig up the non-lockless code and run it on 8 cores and give some counts. It is not millions of times per minute, but it happens with regularity and can show up hundreds of times per minute in positions like fine70 which are beating on the same basic hash positions over and over...
So we speak of small interruptions here. If it's 1 millisecond, that's already a lot.
Say one every 20 milliseconds.
So we speak of 50 occasions a second where one of the threads stops there.
However, the odds that it stops at exactly *this spot* are really tiny.
Not quite. First, almost all hash probes are cache misses. So we get a "cluster effect" where every thread is waiting on memory at some points. Now it is pretty easy for this to happen since the first has accessed the thing and the second core (or another one) is waiting...
Say 1 in a million.
How many nps does Crafty get at 8 cores? A million, or 20 million?
From that, 2 million writes a second to the hashtable or so?
So that's 250k stores a second a thread.
You've got 7 other threads.
How many stores can they do in 1 millisecond? 250 each (250k per second), so 250 * 7 = 1750 stores.
Your hashtable has 500 million entries or so?
Mine here does.
The odds that one of those 1750 stores by accident hammers that exact slot out of 500 million are: 1750 / 500 million = 3.5 / 1 million = 0.0000035.
Also close to 1 in a million.
So we are looking at something with odds of about 1 in 10^12.
That's worse than the 10^10 claim I made, Bob.
And this is for our theoretical worst-case scenario. It just won't happen a lot; definitely not dozens of times a minute.
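The back-of-the-envelope arithmetic above can be restated in one place (these are diep's round numbers, not measurements; Bob's counts later in the thread dispute the conclusion):

```c
#include <assert.h>

/* Back-of-the-envelope odds of a torn write, per diep's numbers. */
double torn_write_odds(void) {
    double stores_per_sec   = 2.0e6;   /* total hash stores per second */
    double threads          = 8.0;
    double per_thread       = stores_per_sec / threads;       /* 250k/s */
    double window           = 1.0e-3;  /* a 1 ms interruption           */
    double rival_stores     = per_thread * window * (threads - 1.0); /* 1750 */
    double table_entries    = 500.0e6;
    double p_same_slot      = rival_stores / table_entries;   /* 3.5e-6 */
    double p_interrupt_here = 1.0e-6;  /* the "1 in a million" guess    */
    return p_same_slot * p_interrupt_here;                    /* 3.5e-12 */
}
```

The estimate hinges on the "1 in a million" guess for interruption at exactly the wrong instruction; Bob's point is that cache misses cluster the threads at precisely that spot, so the factors are not independent.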
You can easily measure write errors, Bob. Why not do it?
Just add an 8-bit CRC to each hashtable entry and check it first when scanning the hashtable. Report exactly the entry and what it was filled with, so that you don't count the same write error a thousand times.
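A hypothetical version of the per-entry check byte suggested here, using a simple XOR fold in place of a real CRC-8 (all names and the layout are invented for the sketch):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef struct {
    uint64_t key;    /* position signature          */
    uint32_t data;   /* move/score/depth, etc.      */
    uint8_t  crc;    /* check byte over the payload */
} CheckedEntry;

/* XOR-fold the payload bytes into one check byte. A real CRC-8
   catches more error patterns; this keeps the sketch short. */
uint8_t check8(const CheckedEntry *e) {
    const uint8_t *p = (const uint8_t *)e;
    uint8_t c = 0;
    for (size_t i = 0; i < offsetof(CheckedEntry, crc); i++)
        c ^= p[i];
    return c;
}

void checked_store(CheckedEntry *e, uint64_t key, uint32_t data) {
    e->key = key;
    e->data = data;
    e->crc = check8(e);
}

/* A probe trusts the entry only if the stored check byte matches,
   so a torn write can be counted instead of followed. */
int checked_probe(const CheckedEntry *e) {
    return e->crc == check8(e);
}
```

Each probe that fails the check can be logged with the entry's contents, which is exactly the kind of counter being argued about here.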
Already done, and the problem was verified to be exactly what I claimed. I'll post some numbers...
You'll see that under some conditions you will get 0 forever, if you have settings that avoid this write error from happening.
Yet the once-per-10^10-stores chance of a write error is what I stated, and that's what I measured with Diep.
You've got ECC.
Note that my original measurements were on a supercomputer with 200 CPUs and on a dual K7 at home.
For today's hardware the odds are far smaller than on those machines.
This 10^10 is for a write error, say Abcd or ABCd or ABcd.
Note that for AbCD you need two of these occasions to happen at the same slot, with a few other constraints, so the odds of that are worse than the square of the odds of a simple write error.
So it is easy to see that the odds of that are far worse than 1 in 10^20; in short, it will never happen in our lifetime.
Here's a sample run with Crafty version 22.9, which did NOT lock the hash table. I modified next.c to display the "illegal move from hash table" error (the code is surrounded by a #if defined(DEBUG) and I just removed the #if for this test)...
time=2:47 mat=1 n=1149177757 fh=94% nps=6.9M
ext-> check=124.1M qcheck=118.5M reduce=552.1M/197.0M
predicted=0 evals=657.3M 50move=0 EGTBprobes=0 hits=0
SMP-> splits=39193 aborts=1569 data=42/512 elap=2:47
Search took 2:47 on fine 70. If I "grep 'bad move' log.004 | wc -l" I get this:
1550
so 1,550 "broken writes" were detected solely by the hash move being illegal. When I stored more info back when testing, the actual number of broken writes was much higher, but since a move can be legal in a large number of different positions, this test doesn't catch 'em all.
That was for one move.
From the opening position it is not so dramatic: same time limit, but just 47 broken writes. The output looks like this as it runs...
starting thread 7
14-> 1.27 0.08 1. Nf3 Nf6 2. Nc3 Nc6 3. e4 e5 4. d4
Bb4 5. d5 Bxc3+ 6. bxc3 Na5 7. Bg5
h6 8. Bd2 Nxe4 9. Nxe5 Nxd2 10. Qxd2
(s=2)
15 1.55 0.13 1. Nf3 Nf6 2. Nc3 Nc6 3. e4 e5 4. d4
Bb4 5. d5 Bxc3+ 6. bxc3 Na5 7. Nxe5
Nxe4 8. Qg4 Nxc3 9. Qxg7
15-> 1.78 0.13 1. Nf3 Nf6 2. Nc3 Nc6 3. e4 e5 4. d4
Bb4 5. d5 Bxc3+ 6. bxc3 Na5 7. Nxe5
Nxe4 8. Qg4 Nxc3 9. Qxg7
16 5.25 0.12 1. Nf3 Nf6 2. e3 Nc6 3. d4 e6 4. Bd3
d5 5. O-O Bd6 6. Nc3 O-O 7. Bd2 Nb4
8. Ne5 Nxd3 9. cxd3
bad move from hash table, ply=22
bad move from hash table, ply=22
bad move from hash table, ply=24
16 12.96 0.41 1. e4 Nf6 2. e5 Nd5 3. Nc3 e6 4. Nxd5
exd5 5. d4 Bb4+ 6. c3 Be7 7. Qb3 c6
8. Bd2 O-O 9. O-O-O
16-> 12.96 0.41 1. e4 Nf6 2. e5 Nd5 3. Nc3 e6 4. Nxd5
exd5 5. d4 Bb4+ 6. c3 Be7 7. Qb3 c6
8. Bd2 O-O 9. O-O-O (s=2)
bad move from hash table, ply=24
17 22.36 0.29 1. e4 e5 2. Nf3 Bd6 3. Nc3 Nf6 4. Bc4
Nc6 5. O-O O-O 6. d4 exd4 7. Nxd4 Ne5
8. Bd5 Neg4 9. f4 Nxd5 10. Nxd5
17-> 22.93 0.29 1. e4 e5 2. Nf3 Bd6 3. Nc3 Nf6 4. Bc4
Nc6 5. O-O O-O 6. d4 exd4 7. Nxd4 Ne5
8. Bd5 Neg4 9. f4 Nxd5 10. Nxd5
18 26.51 0.26 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. O-O O-O 6. d3 d6 7. Na4 Na5
8. Nxe5 Nxc4 9. Nxc4 Bg4
18-> 28.81 0.26 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. O-O O-O 6. d3 d6 7. Na4 Na5
8. Nxe5 Nxc4 9. Nxc4 Bg4
bad move from hash table, ply=15
19 35.09 0.25 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. O-O O-O 6. d3 d6 7. Bg5 h6 8.
Be3 Bxe3 9. fxe3 Be6 10. Bxe6 fxe6
19-> 41.95 0.25 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. O-O O-O 6. d3 d6 7. Bg5 h6 8.
Be3 Bxe3 9. fxe3 Be6 10. Bxe6 fxe6
(s=2)
bad move from hash table, ply=28
bad move from hash table, ply=32
bad move from hash table, ply=28
bad move from hash table, ply=28
bad move from hash table, ply=23
bad move from hash table, ply=10
bad move from hash table, ply=14
20 1:02 0.16 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. d3 O-O 6. Bg5 Na5 7. Nxe5 d6
8. Nxf7 Rxf7 9. Bxf7+ Kxf7 10. O-O
Be6 11. Qf3
<etc>