I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
Speaking of the hash table
Moderators: hgm, Rebel, chrisw
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Speaking of the hash table
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Speaking of the hash table
Bob, attach odds to every one of these scenarios.bob wrote:There are some older versions of Crafty without the "lockless hashing" code. One only has to run it on a multi-cpu box to see the problem. Doesn't happen every move, but it happens a bunch. More cores = more bad moves from hash.Sven Schüle wrote:Thanks again for confirming also this part of the story. I think the facts are now clear to most readers.bob wrote:No it can't. This is known in architecture as a "write after write hazard condition" and cache coherency has no effect on it at all. It is a simple race condition. Nothing can solve this, and on some machines, even the hardware will cause this to happen (the Alpha does out-of-order memory writes, for example).Sven Schüle wrote:Can you explain how it works in case of cache misses on both CPUs? Especially, what happens at the memory bus if CPU2 performs both instructions "store first 64 bits of hash element" and "store second 64 bits of hash element" before CPU1 performs its "store second 64 bits of hash element" instruction? Can "cache coherency" prevent this from happening?diep wrote:What you write is impossible in the hardwarezamar wrote:AFAIK hash key collisions and multithreading are the only sources of illegal moves in TT.
Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.
We have no control over thread scheduling, so in an extreme case the following could happen:
thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)
[sleep(1second) = situation when OS decides to pause thread execution]
In fact this was already impossible on hardware from the 90s, which wrote 32 bytes at once.
Today's hardware really has cachelines of 64 bytes, which is 512 bits, and it really writes 1536 bits at once, not 64.
So only when an entry straddles 2 cachelines and 2 different threads start to write to the RAM can it happen.
As all of this is atomically optimized, even in case 2 threads write simultaneously, it still hardly ever happens. That's why the odds of it are smaller than the odds of a collision.
Sven
If you do this to store a 16-byte entry:
mov hash[rax], rbx
mov hash+8[rax], rcx
you have a problem, because the first mov can be executed, writing the first 8 bytes to the hash, then that process can get suspended, and the second process can execute both instructions; then the first comes back, writes the 2nd 8 bytes, and breaks the entry. There are other ways for it to occur as well, but the basic idea is that the two writes get split, with two other writes in between. More threads make this more likely, particularly since a hash transaction is generally going to be a cache miss anyway, which sort of creates a pseudo-synchronization point because the access time on such entries is so slow. In Cray Blitz we originally locked entries, since semaphores were so fast there, but we (Harry Nelson's idea) started on the lockless approach to reduce semaphore usage when we got to 32 cpus in 1993 or 1994...
First of all, you assume simple code here. If you vectorize this, it's already a single SIMD instruction, say SSE2 or newer. In fact your Core 2 Xeons have SSSE3, which is a lot newer than SSE2.
So then it can't happen.
Secondly, some OSes can stop a thread at any point, others cannot.
So on those you are safe as well.
Now let's ignore the above realities and focus on the odds.
What are the odds that one of the 8 threads you run gets interrupted?
Very small, of course. It's paused only for a few milliseconds, during which the core runs a couple of thousand instructions of another program, and then it restarts.
Right?
So we're talking about short interruptions here. If it's 1 millisecond, that's a lot.
Say that happens every 20 milliseconds.
So there are 50 occasions a second where one of the threads stops there.
However, the odds that it stops specifically at *this spot* are really tiny.
Say 1 in a million.
How many nps does Crafty get at 8 cores? A million, or 20 million?
From that, say 2 million writes a second to the hashtable?
So that's 250k stores a second per thread.
You've got 7 other threads.
How many stores can they do in 1 millisecond?
250 * 7 = 1750 stores.
Your hashtable has 500 million entries or so? Mine here does.
The odds of those 1750 stores by accident hammering that slot out of 500 million are: 1750 / 500 million = 3.5 / 1 million = 0.0000035.
Also close to 1 in a million.
So we're looking at something with odds of about 1 in 10^12.
That's even rarer than the 1-in-10^10 claim I made, Bob.
And this is for our theoretical worst-case scenario. It just won't happen a lot. Definitely not a dozen times a minute.
You can easily measure write errors, Bob. Why not do it?
Just add an 8-bit CRC to each hashtable entry and check it first when scanning the hashtable. Report exactly which entry it was and what it got filled with, so that you don't count the same write error 1000 times.
You'll see that with some settings you will always get 0, if those settings keep this write error from happening.
Yet a write error once in 10^10 stores is what I stated, and that's what I measured with Diep.
You've got ECC, so the memory itself is not the cause.
Note that my original measurements were on a supercomputer with 200 CPUs, and on a dual K7 at home.
On today's hardware the odds are far smaller than on those machines.
This 10^10 is for a write error: say Abcd, ABCd, or ABcd.
Note that for AbCD you need 2 of these events to happen at the same slot, with a few more constraints, so the odds of that are worse than the square of the odds of a simple write error.
So it's easy to see the odds of that are far less than 1 in 10^20; in short, it will never happen in our lifetime.
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Speaking of the hash table
This formula for collisions is of course totally wrong, which in itself is not so interesting; if you play a game, you'll typically see more collisions in fewer nodes than when you do a single search, in one position, lasting as long as the whole game did.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
So the formula would also need the size of the search space your search visits, and a probability for the 'entropy' you encounter within that search space.
In Diep I use a 128-bit Zobrist key, so I'm not interested in collisions at all.
Note that I store about 80 bits of it.
I stated that the odds of a write error are 1 in 10^10 hashtable writes, according to my own measurements some years ago.
Bob vaguely remembers it as a dozen per minute.
And I also indicated that the odds of the write error being a random 64 bits in the center of an entry are less than 1 in 10^20, so that won't happen in our lifetime.
An AbCD error would be nasty for Diep if it could happen, which is why I had figured out whether it could occur and consulted the hardware engineers there.
They indicated it can never happen on x86 hardware, and could happen sometimes, though very seldom, on HPC-type CPUs, because there other CPUs can hammer into the memory through a hub or some other sort of interconnect between the RAM and the CPU which x86 doesn't have.
If we do the math, it's obvious the hardware engineers were right.
-
- Posts: 5557
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Speaking of the hash table
The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64-29) = 1 in 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
-
- Posts: 5557
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Speaking of the hash table
A 16-byte write to a single cache line by a single SSE instruction is not guaranteed to be atomic. The Intel manual specifically states that x87 and SSE instructions accessing data larger than a quadword may be implemented using multiple memory accesses.diep wrote:First of all, you assume simple code here. If you vectorize this, it's already a single SIMD instruction, say SSE2 or newer. In fact your Core 2 Xeons have SSSE3, which is a lot newer than SSE2.
So then it can't happen.
Another thing is that you're ignoring what is probably the bigger problem: one thread reading a hash entry while another thread is updating the same hash entry. Even if the update is atomic, things can go wrong at the reading side. I'm pretty sure that most engines do not first copy the full cache line that holds the relevant bucket to local memory (let alone atomically, which is not even possible) and only then compare hash keys.
So what can easily happen is this:
1. Thread 1 probes the hash table, compares hash keys, finds a matching entry.
2. Thread 2 overwrites that same entry (let's say it happens atomically, it doesn't matter).
3. Thread 1 reads the (wrong!) hash move from the entry.
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Speaking of the hash table
Your math is not accurate either.syzygy wrote:The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64-29) = 1 in 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
I've had very different results with collisions. If you search static positions starting with a cleared hashtable, I got about 1 or 0 collisions per 200 billion nodes with 64 bits. That was with a 120 GB hashtable and 200 processors.
If you count the number of collisions during blitz games, I had on average 1 collision per blitz game with a 64-bit hashkey.
That was 5-minute games on a dual K7 at 2.1GHz, you know, with a 400MB hashtable and 16 bytes per entry back then (I can't remember it exactly but could look it up). Diep got 200k nps on 2 processors there, so slightly under 200k stores per second into the hashtable. Do the math.
600 seconds * 0.2M nps = 120M nodes on average. A lot less!
If I counted collisions with less aggressive extensions, like the supercomputer version used, I suddenly had 0 collisions.
So even the way you extend matters there, as silly extensions easily get you 127-ply lines, at which point the position is totally different from the root position you started searching from.
So the search space you cover in your hashtable, and the entropy you get within that search space, matter more for the number of collisions than anything else.
A hashkey of 64 bits represents about 10^19 positions. So if your search space is not overwhelmingly larger than 10^19, you won't easily get collisions, of course.
With the aggressive pruning most engines use today, I doubt many engines get a really large search space.
Note that the 64-bit numbers Diep has are of better randomness than anything you've ever seen on the planet.
Not only were they generated by a very good RNG, but I also hand-edited them at 4 AM one weekend, where I went randomly through the list of random numbers in overwrite mode and typed 0..f into them.
So no discussion is possible there about whether I used a good or bad RNG.
It's really impossible to make a generic formula that calculates the number of collisions. All I can say is that the search space is rather important, and that most collisions I had appeared, to my big amazement, in the far endgame in the blitz games.
The explanation for why it happens there is rather simple, of course. You search deeper there (remember, this was 2003, and in blitz we didn't get the search depths we get today) and therefore move more pieces to a different square.
-
- Posts: 5557
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Speaking of the hash table
I'm sorry, but it is, except that I'm ignoring actual hash hits. Hash hits of course cannot lead to collisions. So you have to look at the probes that do not hit.diep wrote:Your math is not accurate either.syzygy wrote:The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64-29) = 1 in 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
See also here for a case where my formula predicted 244 collisions per million probes, whereas the number measured was 240 collisions per million nodes.
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Speaking of the hash table
You're not using the search space, nor the entropy at which engines search.syzygy wrote:I'm sorry, but it is, except that I'm ignoring actual hash hits. Hash hits of course cannot lead to collisions. So you have to look at the probes that do not hit.diep wrote:Your math is not accurate either.syzygy wrote:The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64-29) = 1 in 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
See also here for a case where my formula predicted 244 collisions per million probes, whereas the number measured was 240 collisions per million nodes.
So it's not even close to accurate.
You show small numbers. With prime numbers we see the same problem: it's easy to theorize about small numbers, but when you get to the larger ones, suddenly all those prophets look like beginners.
Last edited by diep on Mon Dec 10, 2012 9:32 pm, edited 1 time in total.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Speaking of the hash table
Here is what happens.diep wrote:Ok, we are 1 step further now. You are no longer denying that the core hammers 64+ bytes at once to the memory controller.syzygy wrote:What I wrote still stands.diep wrote:I edited my post, read it.syzygy wrote:But you can't write 16 bytes atomically using TWO write instructions. That the 16 bytes are in the same cacheline does not make it different.diep wrote:If you do a normal store of the entire entry, you will simply not run into trouble except for 1 case.
The only way to write 16 bytes atomically on x86-64 is using "lock cmpxchg16b".
The first write by the first core makes the cache line dirty. The first write by the second core can force the first core to write the dirty cache line to memory (or maybe to a higher level cache) before the second write by the first core has taken place.It can only happen in between cachelines as it simply hasn't written that cacheline yet when it's still busy in this cacheline.
The execution of two consecutive instructions is NOT atomic. Even the execution of most single instructions is not atomic. Single 8-byte writes to a single cacheline are atomic, though.
The Kiiski scenario of AbCD cannot occur.
The short story is this: yes, cache lines essentially are read from and written to main memory as a whole. But no, the programmer has no control on when they are written back. They can be written back between two instructions writing to that same cache line.
Progress!
You're still not getting the idea though, are you?
AbCD as Joona writes it cannot occur.
Suppose our process P0 does not get an interrupt, as an interrupt would suspend the process.
It's busy writing to its local store buffer.
The store buffer has A, and B, which is in the same cacheline, hasn't been written yet. The thread gets stopped for 1 millisecond.
Our 'A' is still in the store buffer. It has not been written to the memory controller yet.
Then after a while our thread P0 resumes. It writes B. Still nothing happens; it's still local.
Then it writes C, which is in the next cacheline. Our core says to itself: "heh, what the hell, I'm referencing a 2nd cacheline now, so I can write the first 64 bytes to the memory controller."
This is globally what happens.
Two different threads want to do this at the same time:
mov hash[rax], rbx
mov hash+8[rax], rcx
which simply writes two 8-byte pieces of a hash entry back to memory.
The first core writes the first word, which fetches that 64-byte line into cache, and then updates 8 bytes of it. Before that CPU executes the second mov, a context switch (for an interrupt or whatever) occurs. A second core then requests that same block of memory; it gets forwarded from the CPU that currently has the partially modified block. The second CPU writes both words to the same area and invalidates the copy at the first. When the first tries to write the 2nd 64 bits, it requests the line from memory, but the second cache forwards the current data. The first then writes the second 64-bit chunk, and we are now left with the first 64-bit chunk from the 2nd CPU and the 2nd 64-bit chunk from the first, which is obviously wrong. And it happens often enough to be a pain in the ass, although it might not cause any crashes or blunders...