I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
Speaking of the hash table
Moderators: hgm, Rebel, chrisw
-
- Posts: 931
- Joined: Tue Mar 09, 2010 3:46 pm
- Location: New York
- Full name: Álvaro Begué (RuyDos)
Re: Speaking of the hash table
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Speaking of the hash table
Bob, attach odds to every one of these scenarios.bob wrote:There are some older versions of Crafty without the "lockless hashing" code. One only has to run it on a multi-cpu box to see the problem. Doesn't happen every move, but it happens a bunch. More cores = more bad moves from hash.Sven Schüle wrote:Thanks again for confirming also this part of the story. I think the facts are now clear to most readers.bob wrote:No it can't. This is known in architecture as a "write after write hazard condition" and cache coherency has no effect on it at all. It is a simple race condition. Nothing can solve this, and on some machines, even the hardware will cause this to happen (the Alpha does out-of-order memory writes, for example).Sven Schüle wrote:Can you explain how it works in case of cache misses on both CPUs? Especially, what happens at the memory bus if CPU2 performs both instructions "store first 64 bits of hash element" and "store second 64 bits of hash element" before CPU1 performs its "store second 64 bits of hash element" instruction? Can "cache coherency" prevent this from happening?diep wrote:What you write is impossible in the hardwarezamar wrote:AFAIK hash key collisions and multithreading are the only sources of illegal moves in TT.
Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.
We have no control over thread scheduling, so in an extreme case the following could happen:
thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)
[sleep(1second) = situation when OS decides to pause thread execution]
In fact this was already impossible on hardware from the 90s, which wrote 32 bytes at once.
Today's hardware really has cachelines of 64 bytes, which is 512 bits, and it really writes 1536 bits at once, not 64.
So only when an entry straddles 2 cachelines and 2 different threads start to write to the RAM can it happen.
As all of this is atomically optimized, even in case 2 threads write simultaneously, it still hardly ever happens. That's why the odds of it are smaller than the odds of a collision.
Sven
If you do this to store a 16-byte entry:
mov hash[rax], rbx
mov hash+8[rax], rcx
you have a problem, because the first mov can be executed, writing the first 8 bytes to the hash, then that process can get suspended, and the second process can execute both instructions; then the first comes back, writes the 2nd 8 bytes, and breaks the entry. There are other ways for it to occur as well, but the basic idea is that the two writes get split, with two other writes in between. More threads make this more likely, particularly since a hash transaction is generally going to be a cache miss anyway, which sort of creates a pseudo-synchronization point because the access time on such entries is so slow. In Cray Blitz we originally locked entries, since semaphores were so fast there, but we (Harry Nelson's idea) started on the lockless approach to reduce semaphore usage when we got to 32 cpus in 1993 or 1994...
First of all, you assume simple code here. If you vectorize this, it's already a single SIMD instruction, say SSE2 or newer. In fact your Core 2 Xeons have SSSE3, which is a lot newer than SSE2.
So then it can't happen.
Secondly, some OSes can stop a thread at any point, others cannot.
So on those you are safe as well.
Now let's ignore the above realities and focus on the odds.
What are the odds that one of the 8 threads you run gets interrupted?
Very small, of course. It's paused only for a few milliseconds, during which the core runs a couple of thousand instructions of another program, and then it restarts.
Right?
So we're talking about short interruptions here. If it's 1 millisecond, that's a lot.
Say that happens every 20 milliseconds.
So there are 50 occasions a second where one of the threads stops there.
However, the odds that it stops specifically at *this spot* are really tiny.
Say 1 in a million.
How many nps does Crafty get at 8 cores? A million, or 20 million?
From that, say 2 million writes a second to the hashtable?
So that's 250k stores a second per thread.
You've got 7 other threads.
How many stores can they do in 1 millisecond?
250 * 7 = 1750 stores.
Your hashtable has 500 million entries or so? Mine here does.
The odds of those 1750 stores by accident hammering that slot out of 500 million are: 1750 / 500 million = 3.5 / 1 million = 0.0000035.
Also close to 1 in a million.
So we're looking at something with odds of about 1 in 10^12.
That's even rarer than the 1-in-10^10 claim I made, Bob.
And this is for our theoretical worst-case scenario. It just won't happen a lot. Definitely not a dozen times a minute.
You can easily measure write errors, Bob. Why not do it?
Just add an 8-bit CRC to each hashtable entry and check it first when scanning the hashtable. Report exactly which entry it was and what it got filled with, so that you don't count the same write error 1000 times.
You'll see that with some settings you will always get 0, if those settings keep this write error from happening.
Yet a write error once in 10^10 stores is what I stated, and that's what I measured with Diep.
You've got ECC, so the memory itself is not the cause.
Note that my original measurements were on a supercomputer with 200 CPUs, and on a dual K7 at home.
On today's hardware the odds are far smaller than on those machines.
This 10^10 is for a write error: say Abcd, ABCd, or ABcd.
Note that for AbCD you need 2 of these events to happen at the same slot, with a few more constraints, so the odds of that are worse than the square of the odds of a simple write error.
So it's easy to see the odds of that are far less than 1 in 10^20; in short, it will never happen in our lifetime.
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Speaking of the hash table
This formula for collisions is of course totally wrong, which in itself is not so interesting; if you play a game, you'll typically see more collisions in fewer nodes than when you do a single search, in one position, lasting as long as the whole game did.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
So the formula would also need the size of the search space your search visits, and a probability for the 'entropy' you encounter within that search space.
In Diep I use a 128-bit Zobrist key, so I'm not interested in collisions at all.
Note that I store about 80 bits of it.
I stated that the odds of a write error are 1 in 10^10 hashtable writes, according to my own measurements some years ago.
Bob vaguely remembers it as a dozen per minute.
And I also indicated that the odds of the write error being a random 64 bits in the center of an entry are less than 1 in 10^20, so that won't happen in our lifetime.
An AbCD error would be nasty for Diep if it could happen, which is why I had figured out whether it could occur and consulted the hardware engineers there.
They indicated it can never happen on x86 hardware, and could happen sometimes, though very seldom, on HPC-type CPUs, because there other CPUs can hammer into the memory through a hub or some other sort of interconnect between the RAM and the CPU which x86 doesn't have.
If we do the math, it's obvious the hardware engineers were right.
-
- Posts: 5557
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Speaking of the hash table
The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64-29) = 1 in 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
-
- Posts: 5557
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Speaking of the hash table
A 16-byte write to a single cache line by a single SSE instruction is not guaranteed to be atomic. The Intel manual specifically states that x87 and SSE instructions accessing data larger than a quadword may be implemented using multiple memory accesses.diep wrote:First of all, you assume simple code here. If you vectorize this, it's already a single SIMD instruction, say SSE2 or newer. In fact your Core 2 Xeons have SSSE3, which is a lot newer than SSE2.
So then it can't happen.
Another thing is that you're ignoring what is probably the bigger problem: one thread reading a hash entry while another thread is updating the same hash entry. Even if the update is atomic, things can go wrong at the reading side. I'm pretty sure that most engines do not first copy the full cache line that holds the relevant bucket to local memory (let alone atomically, which is not even possible) and only then compare hash keys.
So what can easily happen is this:
1. Thread 1 probes the hash table, compares hash keys, finds a matching entry.
2. Thread 2 overwrites that same entry (let's say it happens atomically, it doesn't matter).
3. Thread 1 reads the (wrong!) hash move from the entry.
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Speaking of the hash table
Your math is not accurate either.syzygy wrote:The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64-29) = 1 in 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
I've had very different results with collisions. If you search static positions starting with a cleared hashtable, I got about 1 or 0 collisions per 200 billion nodes with 64 bits. That was with a 120 GB hashtable and 200 processors.
If you count the number of collisions during blitz games, I had on average 1 collision per blitz game with a 64-bit hashkey.
That was 5-minute games on a dual K7 at 2.1GHz, you know, with a 400MB hashtable and 16 bytes per entry back then (I can't remember it exactly but could look it up). Diep got 200k nps on 2 processors there, so slightly under 200k stores per second into the hashtable. Do the math.
600 seconds * 0.2M nps = 120M nodes on average. A lot less!
If I counted collisions with less aggressive extensions, like the supercomputer version used, I suddenly had 0 collisions.
So even the way you extend matters there, as silly extensions easily get you 127-ply lines, at which point the position is totally different from the root position you started searching from.
So the search space you cover in your hashtable, and the entropy you get within that search space, matter more for the number of collisions than anything else.
A hashkey of 64 bits represents about 10^19 positions. So if your search space is not overwhelmingly larger than 10^19, you won't easily get collisions, of course.
With the aggressive pruning most engines use today, I doubt many engines get a really large search space.
Note that the 64-bit numbers Diep has are of better randomness than anything you've ever seen on the planet.
Not only were they generated by a very good RNG, but I also hand-edited them at 4 AM one weekend, where I went randomly through the list of random numbers in overwrite mode and typed 0..f into them.
So no discussion is possible there about whether I used a good or bad RNG.
It's really impossible to make a generic formula that calculates the number of collisions. All I can say is that the search space is rather important, and that most collisions I had appeared, to my big amazement, in the far endgame in the blitz games.
The explanation for why it happens there is rather simple, of course. You search deeper there (remember, this was 2003, and in blitz we didn't get the search depths we get today) and therefore move more pieces to a different square.
-
- Posts: 5557
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Speaking of the hash table
I'm sorry, but it is, except that I'm ignoring actual hash hits. Hash hits of course cannot lead to collisions. So you have to look at the probes that do not hit.diep wrote:Your math is not accurate either.syzygy wrote:The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64-29) = 1 in 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
See also here for a case where my formula predicted 244 collisions per million probes, whereas the number measured was 240 collisions per million nodes.
-
- Posts: 1822
- Joined: Thu Mar 09, 2006 11:54 pm
- Location: The Netherlands
Re: Speaking of the hash table
You're not using the search space, nor the entropy at which engines search.syzygy wrote:I'm sorry, but it is, except that I'm ignoring actual hash hits. Hash hits of course cannot lead to collisions. So you have to look at the probes that do not hit.diep wrote:Your math is not accurate either.syzygy wrote:The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.AlvaroBegue wrote:I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.diep wrote:Yeah, it's unlikely it's collisions, as with 64 bits only 1 in 200 billion lookups should be a collision.
With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), a bucket size of 4 = 2^2, and a 64-bit hash key, the probability of a collision per node is 1 in 2^(64-29) = 1 in 34.3 billion nodes. With slightly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
See also here for a case where my formula predicted 244 collisions per million probes, whereas the number measured was 240 collisions per million nodes.
So it's not even close to accurate.
You show small numbers. With prime numbers we see the same problem: it's easy to theorize about small numbers, but when you get to the larger ones, suddenly all those prophets look like beginners.
Last edited by diep on Mon Dec 10, 2012 9:32 pm, edited 1 time in total.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Speaking of the hash table
Here is what happens.diep wrote:Ok, we are 1 step further now. You are no longer denying that the core hammers 64+ bytes at once to the memory controller.syzygy wrote:What I wrote still stands.diep wrote:I edited my post, read it.syzygy wrote:But you can't write 16 bytes atomically using TWO write instructions. That the 16 bytes are in the same cacheline does not make it different.diep wrote:If you do a normal store of the entire entry, you will simply not run into trouble except for 1 case.
The only way to write 16 bytes atomically on x86-64 is using "lock cmpxchg16b".
The first write by the first core makes the cache line dirty. The first write by the second core can force the first core to write the dirty cache line to memory (or maybe to a higher level cache) before the second write by the first core has taken place.It can only happen in between cachelines as it simply hasn't written that cacheline yet when it's still busy in this cacheline.
The execution of two consecutive instructions is NOT atomic. Even the execution of most single instructions is not atomic. Single 8-byte writes to a single cacheline are atomic, though.
The Kiiski scenario of AbCD cannot occur.
The short story is this: yes, cache lines essentially are read from and written to main memory as a whole. But no, the programmer has no control on when they are written back. They can be written back between two instructions writing to that same cache line.
Progress!
You're still not getting the idea though, are you?
AbCD as Joona writes it cannot occur.
Suppose our process P0 does not get an interrupt, as an interrupt would suspend the process.
It's busy writing to its local store buffer.
The store buffer has A, and B, which is in the same cacheline, hasn't been written yet. The thread gets stopped for 1 millisecond.
Our 'A' is still in the store buffer. It has not been written to the memory controller yet.
Then after a while our thread P0 resumes. It writes B. Still nothing happens; it's still local.
Then it writes C, which is in the next cacheline. Our core says to itself: "heh, what the hell, I'm referencing a 2nd cacheline now, so I can write the first 64 bytes to the memory controller."
This is globally what happens.
Two different threads want to do this at the same time:
mov hash[rax], rbx
mov hash+8[rax], rcx
which simply writes two 8-byte pieces of a hash entry back to memory.
The first core writes the first word, which fetches that 64-byte line into cache, and then updates 8 bytes of it. Before that CPU executes the second mov, a context switch (for an interrupt or whatever) occurs. A second core then requests that same block of memory; it gets forwarded from the CPU that currently has the partially modified block. The second CPU writes both words to the same area and invalidates the copy at the first. When the first tries to write the 2nd 64 bits, it requests the line from memory, but the second cache forwards the current data. The first then writes the second 64-bit chunk, and we are now left with the first 64-bit chunk from the 2nd CPU and the 2nd 64-bit chunk from the first, which is obviously wrong. And it happens often enough to be a pain in the ass, although it might not cause any crashes or blunders...