Speaking of the hash table

syzygy · Post by **syzygy** » Mon Dec 10, 2012 9:14 pm

diep wrote:First of all you assume simple code here. If you vectorize this it's 1 instruction already of SIMD. Say SSE2 or newer. In fact yoru core2 Xeons have SSSE which is a lot newer than SSE2.

So then it can't happen.

A 16 byte write to a single cache line by a single SSE instruction is not guaranteed to be atomic. The Intel manual specifically states that x87 and SSE instructions accessing data larger than a quadword may be implemented using multiple memory accesses.

Another thing is that you're ignoring what is probably the bigger problem: one thread reading a hash entry while another thread is updating the same hash entry. Even if the update is atomic, things can go wrong at the reading side. I'm pretty sure that most engines do not first copy the full cache line that holds the relevant bucket to local memory (let alone atomically, which is not even possible) and only then compare hash keys.

So what can easily happen is this:
1. Thread 1 probes the hash table, compares hash keys, finds a matching entry.
2. Thread 2 overwrites that same entry (let's say it happens atomically, it doesn't matter).
3. Thread 1 reads the (wrong!) hash move from the entry.

diep · Post by **diep** » Mon Dec 10, 2012 9:18 pm

syzygy wrote:
AlvaroBegue wrote:
diep wrote:Yeah it's unlikely it's collissions as with 64 bits only 1 in a 200 billion lookups should be a collission.
I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.
The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.

With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), bucket size of 4 = 2^2, hash key of 64 bits, the probability of a collision per node is 1 in 2 ^ (64 - 29) = 34.3 billion nodes. With slighly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)

Your math is not accurate either.

I've had very different results with collissions. If you search static positions starting with a cleared hashtable, then i got like 1 or 0 collissions with 64 bits at each 200 billion nodes. That was with a 120 GB hashtable and 200 processors.

If you count number of collissions during blitz games i had 1 collission on average each blitz game with a 64 bits hashkey.

That was 5 minute games at dual k7 2.1Ghz you know with a 400MB hashtable and 16 bytes per entry back then (can't remember that exactly but could look it up). Diep got 200k nps at 2 processors there. So slightly under 200k stores per second into the hashtable. Do the math.

600 seconds * 0.2M nps = 120M nodes on average. A lot less!

If i counted collissions with less agressive extensions, like the supercomputer version used, i had 0 collissions suddenly.

So even the method how you extend matters there as silly extensions get you easily 127 ply lines at which point the position is total different from the start position you start search.

So the search space you cover in your hashtable and the entropy you get within that search space is more important for the number of collissions than anything else there.

A hashkey of 64 bits represents 10^19 positions. So if your search space is not overwhelming larger than 10^19 you won't get collissions easily of course.

Todays agressive pruning most use, i doubt many engines get a really large search space.

Note that those 64 bits numbers diep have are of a better randomness than anything you've ever seen on the planet.

Not only generated by a very good RNG but also hand editted at 4 AM in one weekend where i randomly through the list of random numbers in overwrite mode wrote 0..f into the numbers

So there is no discussions there possible on whether i used a good or bad RNG.

It's really impossible to make a formula that's generic which calculats the number of collissions. All i can say is that the search space is rather important and that most collissions i had appeared in the far endgame (to my big amazement) in the blitzgames.

The explanation why it happens there is rather simple of course. You search deeper there (remember this was 2003 and in blitz didn't get the search depths we get today) and move more pieces therefore to a different square.

syzygy · Post by **syzygy** » Mon Dec 10, 2012 9:30 pm

diep wrote:
syzygy wrote:
AlvaroBegue wrote:
diep wrote:Yeah it's unlikely it's collissions as with 64 bits only 1 in a 200 billion lookups should be a collission.
I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.
The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.

With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), bucket size of 4 = 2^2, hash key of 64 bits, the probability of a collision per node is 1 in 2 ^ (64 - 29) = 34.3 billion nodes. With slighly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
Your math is not accurate either.

I'm sorry but it is, except that I'm ignoring actual hash hits. Hash hits of course cannot lead to collisions. So you have to look at the probes that do not hit.

See also here for a case where my formula predicted 244 collisions per million probes, whereas the number measured was 240 collisions per million nodes.

diep · Post by **diep** » Mon Dec 10, 2012 9:31 pm

syzygy wrote:
diep wrote:
syzygy wrote:
AlvaroBegue wrote:
diep wrote:Yeah it's unlikely it's collissions as with 64 bits only 1 in a 200 billion lookups should be a collission.
I don't know what that number means. The probability of finding a collision after you have added N different positions is a function that increases with N (of course). The value of 0.5 is reached when N is about 5.1e9 (5.1 American billions). See http://en.wikipedia.org/wiki/Birthday_p ... lity_table for details.
The Birthday paradox is not very relevant here. Because multiple positions hash to the same bucket, not all previously searched entries are present in the hash table. So even if two positions in the search tree collide, this does not mean there is a hash collision.

With a 2GB hash table, 16 bytes per position (i.e. 2^27 buckets), bucket size of 4 = 2^2, hash key of 64 bits, the probability of a collision per node is 1 in 2 ^ (64 - 29) = 34.3 billion nodes. With slighly different parameters you get even closer to Vincent's number. The assumption here is that the hash table is filled, which is completely reasonable if we are looking at billions of probes. (But if it is only partially full, there is of course even less chance of a collision.)
Your math is not accurate either.
I'm sorry but it is, except that I'm ignoring actual hash hits. Hash hits of course cannot lead to collisions. So you have to look at the probes that do not hit.

See also here for a case where my formula predicted 244 collisions per million probes, whereas the number measured was 240 collisions per million nodes.

You're not using the search space nor entropy at which engines search.

So it's not even close to accurate.

You show small numbers. In prime numbers we see the same problem. It's easy to theoretize about small numbers, but when you get to larger ones suddenly all those profets look like beginners.

bob · Post by **bob** » Mon Dec 10, 2012 9:32 pm

diep wrote:
syzygy wrote:
diep wrote:
syzygy wrote:
diep wrote:If you do a normal store of the entire entry, you willl simply not run into troubles except for 1 case.
But you can't write 16 bytes atomically using TWO write instructions. That the 16 bytes are in the same cacheline does not make it different.

The only way to write 16 bytes atomically on x86-64 is using "lock cmpxchgb16".

It can only happen in between cachelines as it simply hasn't written that cacheline yet when it's still busy in this cacheline.
The first write by the first core makes the cache line dirty. The first write by the second core can force the first core to write the dirty cache line to memory (or maybe to a higher level cache) before the second write by the first core has taken place.

The execution of two consecutive instructions is NOT atomic. Even the instruction of most single instructions is not atomic. Single 8-byte writes to a single cacheline are atomic, though.
I editted my post read it.

The Kiiski scenario of AbCD cannot occur.
What I wrote still stands.

The short story is this: yes, cache lines essentially are read from and written to main memory as a whole. But no, the programmer has no control on when they are written back. They can be written back between two instructions writing to that same cache line.
Ok we are 1 step further now. You are no longer denying that the core hammers 64+ bytes at once to the memory controller.

Progress!

You're still not getting the idea though is it?

AbCD as Joona wrotes it cannot occur.

Suppose we do not have an interrupt, as interrupts kills our proces, of this proces P0.

It's busy writing to the storebuffer local.

Storebuffer has A and it doesn't write B yet which is in the same cacheline.The thread gets stopped for 1 millisecond.

Our 'A' is still in the storebuffer. It's not written to the memory controller yet.

Then after a while our thread P0 resumes. It writes a B. Still nothing happens. It's still local.

It writes then a C which is the next cacheline. Our core says to itself: "heh what the hell, i'm referencing a 2nd cacheline now, so i can write the first 64 bytes to the memory controller"

This is globally what happens.

Here is what happens.

Two different threads want to do this at the same time:

mov hash[rax]. rbx
mov hash+8[rax], rcx

which simply writes two 8-byte pieces of a hash entry back to memory.

the first core writes the first entry, which fetches that 64 byte line into cache, and then updates 8 bytes somewhere. Before that CPU executes the second mov, a context switch (for an interrupt or whatever) occurs. A second core then requests that same block of memory. It gets forwarded from the cpu that currently has the partially modified block. The second CPU writes both words to the same area, and invalidates the copy at the first. When the first tries to write the 2nd 64 bits, it requests it from memory but the second cache forwards the current data. The first then writes the second 64 bit chunk, and we are now left with the first 64 bit chunk from the 2nd cpu, and the 2nd 64 bit chunk from the first, which is obviously wrong. And it happens often enough to be a pain in the ass, although it might not cause any crashes or blunders...

diep · Post by **diep** » Mon Dec 10, 2012 9:35 pm

bob wrote:
diep wrote:
syzygy wrote:
diep wrote:
syzygy wrote:
diep wrote:If you do a normal store of the entire entry, you willl simply not run into troubles except for 1 case.
But you can't write 16 bytes atomically using TWO write instructions. That the 16 bytes are in the same cacheline does not make it different.

The only way to write 16 bytes atomically on x86-64 is using "lock cmpxchgb16".

It can only happen in between cachelines as it simply hasn't written that cacheline yet when it's still busy in this cacheline.
The first write by the first core makes the cache line dirty. The first write by the second core can force the first core to write the dirty cache line to memory (or maybe to a higher level cache) before the second write by the first core has taken place.

The execution of two consecutive instructions is NOT atomic. Even the instruction of most single instructions is not atomic. Single 8-byte writes to a single cacheline are atomic, though.
I editted my post read it.

The Kiiski scenario of AbCD cannot occur.
What I wrote still stands.

The short story is this: yes, cache lines essentially are read from and written to main memory as a whole. But no, the programmer has no control on when they are written back. They can be written back between two instructions writing to that same cache line.
Ok we are 1 step further now. You are no longer denying that the core hammers 64+ bytes at once to the memory controller.

Progress!

You're still not getting the idea though is it?

AbCD as Joona wrotes it cannot occur.

Suppose we do not have an interrupt, as interrupts kills our proces, of this proces P0.

It's busy writing to the storebuffer local.

Storebuffer has A and it doesn't write B yet which is in the same cacheline.The thread gets stopped for 1 millisecond.

Our 'A' is still in the storebuffer. It's not written to the memory controller yet.

Then after a while our thread P0 resumes. It writes a B. Still nothing happens. It's still local.

It writes then a C which is the next cacheline. Our core says to itself: "heh what the hell, i'm referencing a 2nd cacheline now, so i can write the first 64 bytes to the memory controller"

This is globally what happens.
Here is what happens.

Two different threads want to do this at the same time:

mov hash[rax]. rbx
mov hash+8[rax], rcx

which simply writes two 8-byte pieces of a hash entry back to memory.

the first core writes the first entry, which fetches that 64 byte line into cache, and then updates 8 bytes somewhere. Before that CPU executes the second mov, a context switch (for an interrupt or whatever) occurs. A second core then requests that same block of memory. It gets forwarded from the cpu that currently has the partially modified block. The second CPU writes both words to the same area, and invalidates the copy at the first. When the first tries to write the 2nd 64 bits, it requests it from memory but the second cache forwards the current data. The first then writes the second 64 bit chunk, and we are now left with the first 64 bit chunk from the 2nd cpu, and the 2nd 64 bit chunk from the first, which is obviously wrong. And it happens often enough to be a pain in the ass, although it might not cause any crashes or blunders...

Bob, add chances to every occasion.

First of all you assume simple code here. If you vectorize this it's 1 instruction already of SIMD. Say SSE2 or newer. In fact yoru core2 Xeons have SSSE which is a lot newer than SSE2.

So then it can't happen.

Secondly some OS-es can stop at any point, others cannot.

So in those you are safe as well.

Now let's ignore the above realities. Let's focus upon odds.

What's odds that a thread gets interrupted out of the 8 you run?

Very little of course. Only for a few milliseconds and then it's doing a copule of thousands of instructions of another program and restarts again.

Right?

So we speak of small interrupts here. If it's 1 millisecond it's a lot.

That say each 20 milliseconds.

So we speak of 50 occassions a second that one of the threads thread.
stops there.

however the odds it stops specifically at *this spot* is really tiny.

Say 1 in a million.

How many nps does crafty get at 8 cores. Mlillion or 20?
From this 2 million writes a second to hashtable or so?
So that's 250k stores a second a thread.

You've got 7 other threads.

How many stores in 1 millisecond can they do?

250 * 7 = 1750 stores.

Your hashtable has 500 million entries or so?
I have here.

Odds of 1750 stores by accident hammering onto that slot out of 500 million is : 1750 / 500 million = 1.75 / 0.5 million = 3.5 / 1 million = 0.0000035

Also close to 1 in a million.

So we look at something having odds 1 in 10^12 or so.

That's worse than the 10^10 claim i did do Bob.

And this for our theoretic worst case scenario. It just won't happen a lot. Definitely not dozen times a minute.

You can easily measure write errors Bob. Why not do it?
Just add a 8 bits CRC to each hashtable entry and check this first when scanning hashtable. report exactly the entry and what it has been filled with so that you don't count the same write error a 1000 times Smile

You'll see that in some conditions you will get 0 ever if you have settings that avoid this write error from happening.

Yet my once in a 10^10 stores chance of a write error is what i stated and thats what i measured with Diep.

You've got ECC

Note that my original measurements were on a supercomputer with 200 cpu's and on a dual k7 at home.

For todays hardware odds is way smaller than those machines.

This 10^10 is for a write error. Say Abcd or ABCd or ABcd.

Note that for AbCD you need 2 of these occassions to happen at the same slot with another few constraints so the odds for that is more than the square of the odds of a simple write error.

So odds for that is easy to see is far less than 10^20, in short will never happen in our lifetime.

syzygy · Post by **syzygy** » Mon Dec 10, 2012 9:40 pm

diep wrote:You're not using the search space nor entropy at which engines search.

So it's not even close to accurate.

It is true that I assume a truely and uniform random mapping from positions to 64-bit keys, but that is pretty much optimal. In so far as this assumption is false, the probability of a collision will be higher.

You show small numbers. In prime numbers we see the same problem. It's easy to theoretize about small numbers, but when you get to larger ones suddenly all those profets look like beginners.

Isn't it funny that Gauss at age 15 or 16 could correctly predict the growth of pi(x) (= number of primes smaller than x) based on a relatively small table of primes?

diep · Post by **diep** » Mon Dec 10, 2012 9:46 pm

syzygy wrote:
diep wrote:You're not using the search space nor entropy at which engines search.

So it's not even close to accurate.
It is true that I assume a truely and uniform random mapping from positions to 64-bit keys, but that is pretty much optimal. In so far as this assumption is false, the probability of a collision will be higher.

You show small numbers. In prime numbers we see the same problem. It's easy to theoretize about small numbers, but when you get to larger ones suddenly all those profets look like beginners.
Isn't it funny that Gauss at age 15 or 16 could correctly predict the growth of pi(x) (= number of primes smaller than x) based on a relatively small table of primes?

You seem to ignore each time the SEARCH SPACE into any of your equations and also the fact that i want 0 collissions.

So i'm not interested in models where you get thousands

If i go play blitz, and todays diep has bunches of extensions that it didn't have at the supercomputer (which got so few collissions), and you get at least each blitz game a few collissions with 64 bits. Now i know most programmers do not care about that.

I do so i'm using a hashkey of 128 bits there from which roughly a bit or 80 end up stored (+indexed) into the transpositiontable.

This where according to your math it's like once each 10^10.

The actual measurement shows simply it's more similar to once each 10^8 when the search space you cover during the searches is rather large.

In fact if you do the math correctly you'll see that the loading factor of the hashtable wasn't even high.

A difference of 10^8 versus 10^10 is pretty shocking if your intentions is get 0 collisions a game...

10^8 nodes is something we get in every search now...

Note that 10^8 is closer to the number Alvaro links to than the weird estimate your function gives.

Please note that diep is using hashtable also in its qsearch and also is giving checks there and has agressive extensions in mainsearch (of course).

p.s. the formula's to predict number of primes are off by factors always, especially the n / ln n formula...
...which is similar to getting to a car dealer which has a price card: "roughly 1000 euro", you get there and you need to pay 1 million euro...

syzygy · Post by **syzygy** » Mon Dec 10, 2012 11:27 pm

diep wrote:Note that 10^8 is closer to the number Alvaro links to than the weird estimate your function gives.

So you're now arguing that my estimate is too optimistic? It's a bit hard to follow you.

Anyway, what are your parameters:
- number of entries in the hash table
- bucket size
- number of unique hash bits stored in an entry (so size of hash key minus the number of bits used to find the bucket index)

diep wrote:p.s. the formula's to predict number of primes are off by factors always, especially the n / ln n formula...

Not at all. The prime number theorem states that pi(x) / (x / log(x)) converges to 1 as x approaches infinity.

Also there are quite precise estimate for the error term pi(x) - (x / log(x)).

bob · Post by **bob** » Tue Dec 11, 2012 12:21 am

diep wrote:
bob wrote:
Sven Schüle wrote:
bob wrote:
Sven Schüle wrote:
diep wrote:
zamar wrote:AFAIK hash key collisions and multithreading are the only source of illegal moves in TT.

Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.

We have no control of the thread execution, so in extreme case it could happen:

thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)

[sleep(1second) = situation when OS decides to pause thread execution]
What you write is impossible in the hardware

Even impossible in fact in hardware from 90s.

They wrote 32 bytes at once

It really has cachelines of 64 bytes which is 512 bits and it really writes 1536 bits at once not 64.

So only when an entry overlaps at 2 cachelines and 2 different threads start to write to the RAM then it can happen.

As all this is atomic optimized, even in case 2 write simultaneous, it still hardly happens. That's why odds for it is smaller than a collision.
Can you explain how it works in case of cache misses on both CPUs? Especially, what happens at the memory bus if CPU2 performs both instructions "store first 64 bit of hash element" and "store second 64 bit of hash element" before CPU1 performs its "store second 64 bit of hash element" instruction? Can "cache coherency" prevent this to happen?

Sven
No it can't. This is known in architecture as a "write after write hazard condition" and cache coherency has no effect on it at all. It is a simple race condition. Nothing can solve this, and on some machines, even the hardware will cause this to happen (the alpha does out-of-order memory writes, for example).
Thanks again for confirming also this part of the story. I think the facts are now clear to most readers.

Sven
There are some older versions of Crafty without the "lockless hashing" code. One only has to run it on a multi-cpu box to see the problem. Doesn't happen every move, but it happens a bunch. More cores = more bad moves from hash.

if you do this to store a 16 byte entry:

mov hash[rax], rbx
mov hash+8[rax], rcx

you have a problem, because the first move can be executed, writing the first 8 bytes to the hash, then that process can get suspended, and the second can execute both instructions, then the first comes back and writes the 2nd 8 bytes and breaks the entry. There are other ways for it to occur as well, but the basic idea is that the two writes get split, with two other writes in between. More threads mean this becomes more likely, particularly when a hash transaction is generally going to be a cache miss anyway, sort of creating a pseudo-synchronization point since the access time is so slow on such entries. In Cray Blitz, we originally locked entries since semaphores were so fast there, but we (Harry Nelson's idea) started on the lockless approach to reduce semaphore usage when we got to 32 cpus in 1993 or 1994...
Bob, add chances to every occasion.

First of all you assume simple code here. If you vectorize this it's 1 instruction already of SIMD. Say SSE2 or newer. In fact yoru core2 Xeons have SSSE which is a lot newer than SSE2.

Still takes two stores for most people. I looked at my code, and the compiler (intel) doesn't use any SSE/MMX stuff at all in storing hash entries...

So then it can't happen.

Secondly some OS-es can stop at any point, others cannot.

Not sure what that means. If interrupts are enabled, you are going to lose some time in the context-switch to the OS to handle the interrupt. I don't know of any O/S that keeps interrupts disabled while a user program is running. How would you preempt to execute another process there? The typical unix box gets 100 timer interrupts per second anyway, just for starters.

So in those you are safe as well.

Now let's ignore the above realities. Let's focus upon odds.

What's odds that a thread gets interrupted out of the 8 you run?

100%. I run 8 thread on 8 cores. At the very least, the O/S is going to have to handle the real-time clock interrupts (typically 100/sec for unix). Not to mention misc network traffic and such. So some thread will get interrupted quite frequently overall.

Very little of course. Only for a few milliseconds and then it's doing a copule of thousands of instructions of another program and restarts again.

Right?

Right, but that is all that is needed to split the first pair of writes, and do the second pair of writes in between them. I'll try to dig up the non-lockless code and run it on 8 cores and give some counts. It is not millions of times per minute, but it happens with regularity and can show up hundreds of times per minute in positions like fine70 which are beating on the same basic hash positions over and over...

So we speak of small interrupts here. If it's 1 millisecond it's a lot.

That say each 20 milliseconds.

So we speak of 50 occassions a second that one of the threads thread.
stops there.

however the odds it stops specifically at *this spot* is really tiny.

Not quite. First, almost all hash probes are cache misses. So we get a "cluster effect" where every thread is waiting on memory at some points. Now it is pretty easy for this to happen since the first has accessed the thing and the second core (or another one) is waiting...

Say 1 in a million.

How many nps does crafty get at 8 cores. Mlillion or 20?
From this 2 million writes a second to hashtable or so?
So that's 250k stores a second a thread.

You've got 7 other threads.

How many stores in 1 millisecond can they do?

250 * 7 = 1750 stores.

Your hashtable has 500 million entries or so?
I have here.

Odds of 1750 stores by accident hammering onto that slot out of 500 million is : 1750 / 500 million = 1.75 / 0.5 million = 3.5 / 1 million = 0.0000035

Also close to 1 in a million.

So we look at something having odds 1 in 10^12 or so.

That's worse than the 10^10 claim i did do Bob.

And this for our theoretic worst case scenario. It just won't happen a lot. Definitely not dozen times a minute.

You can easily measure write errors Bob. Why not do it?
Just add a 8 bits CRC to each hashtable entry and check this first when scanning hashtable. report exactly the entry and what it has been filled with so that you don't count the same write error a 1000 times

Already done, and the problem was verified to be exactly what I claimed. I'll post some numbers...

You'll see that in some conditions you will get 0 ever if you have settings that avoid this write error from happening.

Yet my once in a 10^10 stores chance of a write error is what i stated and thats what i measured with Diep.

You've got ECC

Note that my original measurements were on a supercomputer with 200 cpu's and on a dual k7 at home.

For todays hardware odds is way smaller than those machines.

This 10^10 is for a write error. Say Abcd or ABCd or ABcd.

Note that for AbCD you need 2 of these occassions to happen at the same slot with another few constraints so the odds for that is more than the square of the odds of a simple write error.

So odds for that is easy to see is far less than 10^20, in short will never happen in our lifetime.

Speaking of the hash table

Re: Speaking of the hash table

Re: Speaking of the hash table

Re: Speaking of the hash table

Re: Speaking of the hash table

Re: Speaking of the hash table

Re: Speaking of the hash table

Re: Speaking of the hash table

Re: Speaking of the hash table

Re: Speaking of the hash table

Re: Speaking of the hash table