Speaking of the hash table

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote: You're not using the search space nor the entropy at which engines search.

So it's not even close to accurate.
It is true that I assume a truly uniform random mapping from positions to 64-bit keys, but that is pretty much optimal. Insofar as this assumption is false, the probability of a collision will be higher.
You show small numbers. In prime numbers we see the same problem. It's easy to theorize about small numbers, but when you get to larger ones suddenly all those prophets look like beginners.
Isn't it funny that Gauss at age 15 or 16 could correctly predict the growth of pi(x) (= the number of primes smaller than x) based on a relatively small table of primes?
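For concreteness, the collision rate in this idealized model is the standard birthday bound (a textbook result, not a number from this thread): hashing n positions to uniformly random k-bit keys gives

\[
E[\text{collisions}] \approx \binom{n}{2}\,2^{-k} \approx \frac{n^2}{2^{k+1}},
\qquad\text{e.g. } k = 64,\; n = 10^8:\quad E \approx \frac{10^{16}}{2^{65}} \approx 2.7\times 10^{-4}.
\]

Anything that makes the mapping less than uniform can only raise this floor, which is the direction of syzygy's caveat above.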
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
diep wrote: You're not using the search space nor the entropy at which engines search.

So it's not even close to accurate.
It is true that I assume a truly uniform random mapping from positions to 64-bit keys, but that is pretty much optimal. Insofar as this assumption is false, the probability of a collision will be higher.
You show small numbers. In prime numbers we see the same problem. It's easy to theorize about small numbers, but when you get to larger ones suddenly all those prophets look like beginners.
Isn't it funny that Gauss at age 15 or 16 could correctly predict the growth of pi(x) (= the number of primes smaller than x) based on a relatively small table of primes?
Each time you seem to leave the SEARCH SPACE out of your equations, and also the fact that I want 0 collisions.

So I'm not interested in models where you get thousands :)

If I go play blitz (and today's Diep has bunches of extensions that it didn't have on the supercomputer, which got so few collisions), you get at least a few collisions each blitz game with 64 bits. Now I know most programmers do not care about that.

I do, so I'm using a hash key of 128 bits there, of which roughly 80 bits end up stored (+indexed) in the transposition table.

This is where, according to your math, it's like once in each 10^10 stores.

The actual measurement simply shows it's more like once in each 10^8 when the search space you cover during the searches is rather large.

In fact, if you do the math correctly, you'll see that the load factor of the hash table wasn't even high.

A difference of 10^8 versus 10^10 is pretty shocking if your intention is to get 0 collisions a game...

10^8 nodes is something we get in every search now...

Note that 10^8 is closer to the number Alvaro links to than the weird estimate your function gives.

Please note that Diep also uses the hash table in its qsearch, also gives checks there, and of course has aggressive extensions in the main search.

p.s. the formulas to predict the number of primes are always off by factors, especially the n / ln n formula...
...which is like going to a car dealer whose price card says "roughly 1000 euro", and when you get there you need to pay 1 million euro...
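To make the 128-bit/80-bit arithmetic concrete: Diep's actual layout is not given, so the split, the field names, and the sizes below are purely hypothetical, a C sketch of how index bits plus stored verification bits can add up to roughly 80 participating key bits.

#include <stdint.h>

/* Hypothetical layout, not Diep's actual code: a 128-bit key kept as
   two 64-bit halves.  With a table of 2^27 entries, the 27 low bits of
   key_lo select the slot and 53 bits of key_hi are stored in the entry
   for verification, so 27 + 53 = 80 key bits take part in each match. */
#define INDEX_BITS  27
#define VERIFY_BITS 53

typedef struct {
    uint64_t verify;   /* top VERIFY_BITS of key_hi                   */
    int16_t  score;
    uint8_t  depth;    /* draft                                       */
    uint8_t  flags;    /* bound type; best move omitted in sketch     */
} Entry;

static Entry *table;   /* allocated elsewhere with 1 << INDEX_BITS entries */

int tt_probe(uint64_t key_lo, uint64_t key_hi, Entry *out)
{
    Entry *e = &table[key_lo & (((uint64_t)1 << INDEX_BITS) - 1)];
    if (e->verify != key_hi >> (64 - VERIFY_BITS))
        return 0;      /* different position (or a surviving collision) */
    *out = *e;
    return 1;
}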
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:Note that 10^8 is closer to the number Alvaro links to than the weird estimate your function gives.
So you're now arguing that my estimate is too optimistic? It's a bit hard to follow you.

Anyway, what are your parameters:
- number of entries in the hash table
- bucket size
- number of unique hash bits stored in an entry (so the size of the hash key minus the number of bits used to find the bucket index)
diep wrote: p.s. the formulas to predict the number of primes are always off by factors, especially the n / ln n formula...
Not at all. The prime number theorem states that pi(x) / (x / log(x)) converges to 1 as x approaches infinity.

Also, there are quite precise estimates for the error term pi(x) - x / log(x).
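Spelled out, the two statements appealed to here:

\[
\lim_{x \to \infty} \frac{\pi(x)}{x / \ln x} = 1 \quad\text{(the prime number theorem)},
\qquad
\pi(x) = \mathrm{li}(x) + O\!\left(x\,e^{-c\sqrt{\ln x}}\right) \text{ for some } c > 0,
\]

so x / ln x is asymptotically exact, even though at small x it runs noticeably low (still roughly 5% low around x = 10^9), which is presumably the factor diep has in mind.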
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

diep wrote:
bob wrote:
Sven Schüle wrote:
bob wrote:
Sven Schüle wrote:
diep wrote:
zamar wrote: AFAIK hash key collisions and multithreading are the only sources of illegal moves in TT.

Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.

We have no control over thread execution, so in an extreme case this could happen:

thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)

[sleep(1second) = a situation where the OS decides to pause thread execution]
What you write is impossible in the hardware :)

In fact it was even impossible in hardware from the 90s.

That hardware wrote 32 bytes at once :)

It really has cache lines of 64 bytes, which is 512 bits, and it really writes 1536 bits at once, not 64.

So only when an entry overlaps 2 cache lines and 2 different threads start to write to the RAM can it happen.

As all this is atomically optimized, even in case 2 threads write simultaneously, it still hardly happens. That's why the odds for it are smaller than for a collision.
Can you explain how it works in the case of cache misses on both CPUs? Especially, what happens at the memory bus if CPU2 performs both instructions "store first 64 bits of hash element" and "store second 64 bits of hash element" before CPU1 performs its "store second 64 bits of hash element" instruction? Can "cache coherency" prevent this from happening?

Sven
No it can't. This is known in architecture as a "write after write" hazard, and cache coherency has no effect on it at all. It is a simple race condition. Nothing can solve this, and on some machines even the hardware will cause it to happen (the Alpha does out-of-order memory writes, for example).
Thanks again for confirming this part of the story as well. I think the facts are now clear to most readers.

Sven
There are some older versions of Crafty without the "lockless hashing" code. One only has to run one of them on a multi-CPU box to see the problem. It doesn't happen every move, but it happens a bunch. More cores = more bad moves from hash.

if you do this to store a 16-byte entry:

mov hash[rax], rbx      ; store the first 8 bytes of the entry
mov hash+8[rax], rcx    ; store the second 8 bytes of the entry

you have a problem, because the first mov can be executed, writing the first 8 bytes to the hash, then that process can get suspended, a second process can execute both instructions, and then the first comes back, writes the 2nd 8 bytes, and breaks the entry. There are other ways for it to occur as well, but the basic idea is that the two writes get split, with two other writes in between. More threads mean this becomes more likely, particularly since a hash transaction is generally going to be a cache miss anyway, sort of creating a pseudo-synchronization point because the access time is so slow on such entries. In Cray Blitz we originally locked entries, since semaphores were so fast there, but we started on the lockless approach (Harry Nelson's idea) to reduce semaphore usage when we got to 32 CPUs in 1993 or 1994...
Bob, attach odds to every one of these occasions.

First of all, you assume simple code here. If you vectorize this, it's already 1 SIMD instruction. Say SSE2 or newer. In fact your Core 2 Xeons have SSSE3, which is a lot newer than SSE2.
Still takes two stores for most people. I looked at my code, and the compiler (intel) doesn't use any SSE/MMX stuff at all in storing hash entries...


So then it can't happen.

Secondly, some OSes can stop a thread at any point, others cannot.
Not sure what that means. If interrupts are enabled, you are going to lose some time in the context-switch to the OS to handle the interrupt. I don't know of any O/S that keeps interrupts disabled while a user program is running. How would you preempt to execute another process there? The typical unix box gets 100 timer interrupts per second anyway, just for starters.


So in those you are safe as well.

Now let's ignore the above realities. Let's focus upon odds.

What are the odds that one of the 8 threads you run gets interrupted?
100%. I run 8 threads on 8 cores. At the very least, the O/S is going to have to handle the real-time clock interrupts (typically 100/sec for unix). Not to mention misc network traffic and such. So some thread will get interrupted quite frequently overall.


Very little of course. Only for a few milliseconds, and then it's doing a couple of thousand instructions of another program and restarts again.

Right?
Right, but that is all that is needed to split the first pair of writes, and do the second pair of writes in between them. I'll try to dig up the non-lockless code and run it on 8 cores and give some counts. It is not millions of times per minute, but it happens with regularity and can show up hundreds of times per minute in positions like fine70 which are beating on the same basic hash positions over and over...



So we speak of small interrupts here. If it's 1 millisecond it's a lot.

Say that happens every 20 milliseconds.

So we speak of 50 occasions a second where one of the threads stops there.

However, the odds that it stops specifically at *this spot* are really tiny.
Not quite. First, almost all hash probes are cache misses. So we get a "cluster effect" where every thread is waiting on memory at some points. Now it is pretty easy for this to happen since the first has accessed the thing and the second core (or another one) is waiting...



Say 1 in a million.

How many NPS does Crafty get at 8 cores, a million or 20?
From this, 2 million writes a second to the hash table or so?
So that's 250k stores a second per thread.

You've got 7 other threads.

How many stores in 1 millisecond can they do?

250 * 7 = 1750 stores.

Your hash table has 500 million entries or so?
That's what I have here.

The odds of 1750 stores accidentally hammering that one slot out of 500 million are: 1750 / 500 million = 3.5 / 1 million = 0.0000035.

Also close to 1 in a million.

So we look at something having odds 1 in 10^12 or so.

That's even rarer than the 1 in 10^10 claim I made, Bob.

And this is for our theoretical worst-case scenario. It just won't happen a lot. Definitely not a dozen times a minute.

You can easily measure write errors, Bob. Why not do it?
Just add an 8-bit CRC to each hash table entry and check it first when scanning the hash table. Report exactly which entry it is and what it has been filled with, so that you don't count the same write error 1000 times :)
Already done, and the problem was verified to be exactly what I claimed. I'll post some numbers...


You'll see that under some conditions you will never get one, if your settings prevent this write error from happening.

Yet my once-in-10^10-stores chance of a write error is what I stated, and that's what I measured with Diep.

You've got ECC :)

Note that my original measurements were on a supercomputer with 200 CPUs and on a dual K7 at home.

For today's hardware the odds are way smaller than on those machines.

This 10^10 is for a write error. Say Abcd or ABCd or ABcd.

Note that for AbCD you need 2 of these occasions to happen at the same slot, with a few more constraints, so the odds against it are more than the square of the odds against a simple write error.

So the odds for that, it is easy to see, are far below 1 in 10^20; in short, it will never happen in our lifetime.
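The "lockless hashing" referred to throughout this exchange is the XOR scheme published by Hyatt and Mann. A minimal sketch of the idea in C, not Crafty's actual code:

#include <stdint.h>

/* Each entry is two 64-bit words; the key is stored XORed with the
   data, so an entry assembled from two different positions fails the
   consistency test on probe and is simply treated as a miss. */
typedef struct {
    uint64_t word1;    /* packed data: move, score, depth, bound, ... */
    uint64_t word2;    /* key ^ data                                  */
} TTEntry;

void tt_store(TTEntry *e, uint64_t key, uint64_t data)
{
    e->word1 = data;        /* two plain stores, no lock taken */
    e->word2 = key ^ data;
}

int tt_probe(const TTEntry *e, uint64_t key, uint64_t *data)
{
    uint64_t d = e->word1;
    if ((e->word2 ^ d) != key)   /* torn write or other position */
        return 0;
    *data = d;
    return 1;
}

A torn entry whose two words came from different positions fails the XOR test with near certainty, which is why corrupted entries become harmless misses instead of illegal hash moves.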
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

bob wrote: (full quote of the previous post snipped)
Here's a sample run with Crafty version 22.9, which did NOT lock the hash table. I modified next.c to display the "illegal move from hash table" error (the code is surrounded by a #if defined(DEBUG) and I just removed the #if for this test)...

time=2:47 mat=1 n=1149177757 fh=94% nps=6.9M
ext-> check=124.1M qcheck=118.5M reduce=552.1M/197.0M
predicted=0 evals=657.3M 50move=0 EGTBprobes=0 hits=0
SMP-> splits=39193 aborts=1569 data=42/512 elap=2:47


Search took 2:47 on fine 70. If I "grep 'bad move' log.004 | wc -l" I get this:

1550

so 1,550 "broken writes" that were detected solely by the hash move being illegal". When I stored more info back when testing, the actual number of broken writes was much higher, but since a move can be legal in a large number of different positions, this test doesn't catch 'em all.

That was for one move.

From the opening position, it is not so dramatic, same time limit but just 47 broken writes. Output looks like this as it runs...

starting thread 7
14-> 1.27 0.08 1. Nf3 Nf6 2. Nc3 Nc6 3. e4 e5 4. d4
Bb4 5. d5 Bxc3+ 6. bxc3 Na5 7. Bg5
h6 8. Bd2 Nxe4 9. Nxe5 Nxd2 10. Qxd2
(s=2)
15 1.55 0.13 1. Nf3 Nf6 2. Nc3 Nc6 3. e4 e5 4. d4
Bb4 5. d5 Bxc3+ 6. bxc3 Na5 7. Nxe5
Nxe4 8. Qg4 Nxc3 9. Qxg7
15-> 1.78 0.13 1. Nf3 Nf6 2. Nc3 Nc6 3. e4 e5 4. d4
Bb4 5. d5 Bxc3+ 6. bxc3 Na5 7. Nxe5
Nxe4 8. Qg4 Nxc3 9. Qxg7
16 5.25 0.12 1. Nf3 Nf6 2. e3 Nc6 3. d4 e6 4. Bd3
d5 5. O-O Bd6 6. Nc3 O-O 7. Bd2 Nb4
8. Ne5 Nxd3 9. cxd3
bad move from hash table, ply=22
bad move from hash table, ply=22
bad move from hash table, ply=24
16 12.96 0.41 1. e4 Nf6 2. e5 Nd5 3. Nc3 e6 4. Nxd5
exd5 5. d4 Bb4+ 6. c3 Be7 7. Qb3 c6
8. Bd2 O-O 9. O-O-O
16-> 12.96 0.41 1. e4 Nf6 2. e5 Nd5 3. Nc3 e6 4. Nxd5
exd5 5. d4 Bb4+ 6. c3 Be7 7. Qb3 c6
8. Bd2 O-O 9. O-O-O (s=2)
bad move from hash table, ply=24
17 22.36 0.29 1. e4 e5 2. Nf3 Bd6 3. Nc3 Nf6 4. Bc4
Nc6 5. O-O O-O 6. d4 exd4 7. Nxd4 Ne5
8. Bd5 Neg4 9. f4 Nxd5 10. Nxd5
17-> 22.93 0.29 1. e4 e5 2. Nf3 Bd6 3. Nc3 Nf6 4. Bc4
Nc6 5. O-O O-O 6. d4 exd4 7. Nxd4 Ne5
8. Bd5 Neg4 9. f4 Nxd5 10. Nxd5
18 26.51 0.26 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. O-O O-O 6. d3 d6 7. Na4 Na5
8. Nxe5 Nxc4 9. Nxc4 Bg4
18-> 28.81 0.26 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. O-O O-O 6. d3 d6 7. Na4 Na5
8. Nxe5 Nxc4 9. Nxc4 Bg4
bad move from hash table, ply=15
19 35.09 0.25 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. O-O O-O 6. d3 d6 7. Bg5 h6 8.
Be3 Bxe3 9. fxe3 Be6 10. Bxe6 fxe6
19-> 41.95 0.25 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. O-O O-O 6. d3 d6 7. Bg5 h6 8.
Be3 Bxe3 9. fxe3 Be6 10. Bxe6 fxe6
(s=2)
bad move from hash table, ply=28
bad move from hash table, ply=32
bad move from hash table, ply=28
bad move from hash table, ply=28
bad move from hash table, ply=23
bad move from hash table, ply=10
bad move from hash table, ply=14
20 1:02 0.16 1. e4 e5 2. Nf3 Nc6 3. Nc3 Nf6 4. Bc4
Bc5 5. d3 O-O 6. Bg5 Na5 7. Nxe5 d6
8. Nxf7 Rxf7 9. Bxf7+ Kxf7 10. O-O
Be6 11. Qf3

<etc>
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

diep wrote:
bob wrote:
diep wrote:
zamar wrote: Multithreading. The typical hash table entry size is 128 bits.

Thread1.write(64bits)
Thread2.write(64bits)
Thread2.write(64bits)
Thread1.write(64bits)

leads to corrupted entry
First of all, this can only happen when you have a parallel searching engine.

Did Rebel already get SMP?

Secondly, this happens even more seldom than collisions on PC hardware, as it can only happen with stores :)

Around 1 in 200 billion stores :)

In case Ed has his hash table aligned and a multiple of entries forms a cache line, you can prove it cannot happen on a PC.

Note that in the case of an i7 (not Sandy Bridge), it presents itself as having a 64-byte cache line; however, the memory controller only stores 192 bytes at a time. Sandy Bridge stores 256 bytes at once.

It has 4 memory channels.

Note that when reading it can abort prematurely, reading fewer bytes, which is why it has such fast latency to the RAM for us.

It can abort after reading 32 bytes.

So the only way it can go wrong is when an entry is at the end of a cache line, in this case 256 bytes, and by accident 1 cache line gets updated quicker than another.

The odds of this happening are really tiny, especially if your hash table is some gigabytes in size.

So it cannot happen the way you represented it, as it doesn't write 64 bits. It writes 256 bytes at once.
It can happen far more frequently than that. This has been discussed in the past. With 8 threads (cores) writing pairs of 64-bit words, ordering gets mangled and you can store a position where the first 8 bytes are from position P1 and the second 8 are from position P2, leading to this. I've seen dozens per minute in some positions. With the lockless hashing I use, however, it never happens.
Bob, you're making this story up.

You did not see dozens of write errors a minute.

I've been testing this extensively, and you need dozens of billions of stores to get one.

Of course I have ECC machines here. Maybe you had a bad DIMM during your tests, or is your memory serving you badly again?

Also realize the Alpha is an HPC processor. Today we only have x64 CPUs, which is a different architecture.

Between the Alpha and the memory controller there are several steps to connect to other memory. That step can sometimes cause the AbCD case, which is not possible on x86/x64 hardware as they connect directly to the RAM.

Even then, this also happens VERY SELDOM on the Alpha.

At the time I was in contact with the technical design teams of memory controllers to verify how often all this could occur on different processors, among which the Alpha and the R14000, and each time I got the same answers from the engineers.

Even then, on x86 I've been testing this extensively, and if I got 1 write error or 1 or 2 collisions, that was a lot in every single test.

This testing ran for months in 2003.

So of all the persons here, it seems I'm the only one who really measured the number of write errors and the number of collisions.

p.s. note that the Alphas were even used for serving storage later on, up to 2007. Petabyte level; that's where I worked at the time.
I didn't make ANYTHING up. Please feel free to download Crafty 22.9, which does NOT use lockless hashing at all. Compile it with -DCPUS=8, and then simply run it. Look in next.c for "bad move from hash" and remove the #if test that disables those two lines. Run with 8 cores and you will see dozens of those errors per minute, up to tens of thousands if you use fine 70...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

syzygy wrote:
Sven Schüle wrote:
syzygy wrote:
diep wrote: I don't know what idiocy you are busy with, but you are speaking about a scenario that happens once in 10^20 occasions, and even less often if your processes get all the system time from your CPU. Your RAM will have failed long before that, and your CPU's warranty will have expired long before that too :)

(...)
It seems we're in agreement on the non-atomicity of two consecutive writes.

As long as the program checks the validity of the hash move, it is of course safe to ignore the rare smp corruption.
If two consecutive writes are non-atomic (which is also my understanding), then why would a third CPU get a consistent hash element returned when it starts to read after the first half of the write has arrived in memory? In other words, I don't even believe that "smp corruption" is so rare if no special care is taken about it.
That third CPU indeed won't get a consistent hash entry. I agree it's not once-in-a-lifetime rare as Vincent seems to suggest, but it's rare enough that it won't measurably affect the quality of the search. To affect the search you probably need many corruptions per second.
It is unlikely that memory is even involved in this. Remember that all recent Intel and AMD caches use forwarding, so that if one cache asks for something that is in another, it comes from that cache. Memory writes are always delayed until a cache block is replaced anyway (write-back). But this doesn't change a thing about this problem: the cache block can be passed around with exactly the same effect...

My previous post in this thread shows just how frequently this can actually happen. And it gets worse on 12-core boxes and beyond, obviously...
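How frequent the interleaving is can be tested directly. The toy program below is not from the thread: it shares one two-word "entry" between two writer threads using exactly the two-store pattern under discussion, and counts how often a reader sees halves from different writers. It is deliberately racy, hence formally undefined behavior; the point is that on a multicore box the counter starts climbing immediately.

#include <pthread.h>
#include <stdint.h>
#include <stdio.h>

/* One 16-byte "hash entry" stored as two separate 64-bit halves. */
static volatile uint64_t half1, half2;
static volatile int stop;

static void *writer(void *arg)
{
    uint64_t v = (uint64_t)(uintptr_t)arg;
    while (!stop) {
        half1 = v;    /* first store...                              */
        half2 = v;    /* ...second store; the other writer's stores
                         can land in between the two                 */
    }
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    long torn = 0, probes = 100000000;
    pthread_create(&t1, NULL, writer, (void *)(uintptr_t)1);
    pthread_create(&t2, NULL, writer, (void *)(uintptr_t)2);
    for (long i = 0; i < probes; i++)
        if (half1 != half2)   /* halves written by different threads */
            torn++;
    stop = 1;
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("inconsistent entries seen in %ld probes: %ld\n", probes, torn);
    return 0;
}

(The reader races too, so this counts read-side tearing as well; either way, every hit is a window in which the entry was inconsistent.)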
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
diep wrote:Note that 10^8 is closer to the number Alvaro links to than the weird estimate your function gives.
So you're now arguing that my estimate is too optimistic? It's a bit hard to follow you.

Anyway, what are your parameters:
- number of entries in the hash table
- bucket size
- number of unique hash bits stored in an entry (so the size of the hash key minus the number of bits used to find the bucket index)
diep wrote: p.s. the formulas to predict the number of primes are always off by factors, especially the n / ln n formula...
Not at all. The prime number theorem states that pi(x) / (x / log(x)) converges to 1 as x approaches infinity.

Also, there are quite precise estimates for the error term pi(x) - x / log(x).
What I'm showing, as I have experimented with this big time, is that the formula you propose is already wrong when I turn on extensions (which in today's Diep are turned on by default) and play games with it.

If I just analyze 1 position and turn off extensions, 64 bits is going to give a collision once in 200 billion stores.

In short, the formula is not accurate, as it doesn't take into account *which* part of the search space you investigate.

Then further there is a difference, be it smaller, between playing games and analyzing static positions. *Where* you analyze also defines the search space you can investigate.

These 2 factors influence the probability of a collision more than anything else.

10^8 versus 10^11 it is. That's 3 orders of magnitude difference.

Saying your formula is accurate when it always predicts 10^10 is therefore not correct, as extensions matter the most for how many collisions you're going to get with a 64-bit hash key.

Needless to write down here why I'm using more than 64 bits...
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

bob wrote: (nested quotes snipped)
I didn't make ANYTHING up. Please feel free to download Crafty 22.9, which does NOT use lockless hashing at all. Compile it with -DCPUS=8, and then simply run it. Look in next.c for "bad move from hash" and remove the #if test that disables those two lines. Run with 8 cores and you will see dozens of those errors per minute, up to tens of thousands if you use fine 70...
I don't know why you're posting this nonsense, Bob.

I've already tested this extensively with a lockless hash table. With a CRC you can easily measure this far faster than completely messing up the bus by locking nonstop.

In Diep there are more collisions than write errors with a 64-bit hash key!

So you're just making this story up.

Note that your normal Crafty is using a lockless hash table! Don't bring in other idiotic factors you're not using in practice! Changing the discussion just to 'save face' is what they do in Asia, Bob; over here you're busted.

Long before you wrote/posted about Tim Mann inventing the XOR manner of doing things lockless, others were already using an 8-bit CRC...
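The per-entry check diep describes is easy to sketch. He gives neither the CRC polynomial nor his entry layout, so everything below is an illustrative stand-in, with a plain XOR fold in place of a true 8-bit CRC:

#include <stdint.h>

/* Illustrative stand-in, not Diep's code: seal each entry with an
   8-bit checksum over its contents; a mismatch on probe marks a torn
   or otherwise corrupted entry, which can be counted. */
typedef struct {
    uint64_t key;     /* hash verification bits                        */
    uint32_t data;    /* packed move/score/depth (layout hypothetical) */
    uint8_t  check;   /* 8-bit checksum over key and data              */
} Entry;

static uint8_t fold8(uint64_t x)
{
    /* XOR all 8 bytes of x down into one byte */
    x ^= x >> 32; x ^= x >> 16; x ^= x >> 8;
    return (uint8_t)x;
}

void entry_seal(Entry *e)
{
    e->check = fold8(e->key ^ e->data);
}

int entry_ok(const Entry *e)
{
    return e->check == fold8(e->key ^ e->data);
}

Counting entry_ok() failures, and logging each corrupted entry once, is the measurement diep says he ran at length; an 8-bit check misses a corruption with probability about 1/256, which is fine for estimating a rate.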
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Speaking of the hash table

Post by Sven »

diep wrote: I've already tested this extensively with a lockless hash table. With a CRC you can easily measure this far faster than completely messing up the bus by locking nonstop.

In Diep there are more collisions than write errors with a 64-bit hash key!

[...]

Note that your normal Crafty is using a lockless hash table! Don't bring in other idiotic factors you're not using in practice! Changing the discussion just to 'save face' is what they do in Asia, Bob; over here you're busted.

Long before you wrote/posted about Tim Mann inventing the XOR manner of doing things lockless, others were already using an 8-bit CRC...
Vincent, it seems you accept that the problem discussed here is solved with lockless hashing but persists without. Right?

Sven