Speaking of the hash table

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

Rebel wrote: Now that we are discussing the TT, I have a question about a phenomenon I never understood --> illegal moves in the TT. How is this possible?

I have them, but not many: fewer than 0.1 per mille (1 in 10,000) of the hash hits.

It's not because of collisions: I can test with 48-bit and 64-bit keys and the percentage remains the same.
64-bit keys still produce collisions. Otherwise there would be no way for this to happen, unless the program has a bug and stores an illegal move...

In an SMP search it is easier still, since entries are stored in parallel and the words of an entry can be written in arbitrary order when storing it to the table.
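To make the failure mode concrete, here is a minimal sketch (the 16-byte layout and the field names are illustrative, not any particular engine's format) of how two unordered stores leave a mixed entry whose signature still matches:

#include <stdint.h>

/* Illustrative 16-byte entry: signature in one word, everything else
   packed into the other. */
typedef struct {
    uint64_t key;   /* Zobrist signature of the stored position   */
    uint64_t data;  /* packed best move, score, depth, bound type */
} TTEntry;

/* Two threads hash positions P1 and P2 to the same slot and store
   without a lock. One possible interleaving:

       thread 1:  slot->key  = key_P1;
       thread 2:  slot->key  = key_P2;
       thread 2:  slot->data = data_P2;
       thread 1:  slot->data = data_P1;

   The slot ends up with P2's signature and P1's move. A later probe of
   P2 sees a matching key and tries P1's move, which may be illegal in
   P2's position. */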
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

diep wrote:
zamar wrote: Multithreading. A typical hash table entry is 128 bits.

Thread1.write(64bits)
Thread2.write(64bits)
Thread2.write(64bits)
Thread1.write(64bits)

leads to a corrupted entry
First of all, this can only happen when you have a parallel searching engine.

Did Rebel already get SMP?

Secondly, on PC hardware this happens even more seldom than collisions, as it can only happen with stores :)

Around 1 in 200 billion stores :)

If Ed has his hash table aligned so that a whole number of entries forms a cache line, you can prove it cannot happen on a PC.

Note that an i7 (not Sandy Bridge) presents itself as having a 64-byte cache line; however, the memory controller only stores 192 bytes at a time. Sandy Bridge stores 256 bytes at once.

It has 4 memory channels.

Note that when reading it can abort prematurely, reading fewer bytes, which is why the read latency to RAM is so good for us.

It can abort after reading 32 bytes.

So the only way it can go wrong is when an entry sits at the end of a cache line, in this case 256 bytes, and by accident one cache line gets updated more quickly than the other.

The odds of this happening are really, really tiny, especially if your hash table is some gigabytes in size.

So it cannot happen the way you represented it, as the hardware doesn't write 64 bits; it writes 256 bytes at once.
It can happen far more frequently than that. This has been discussed in the past. With 8 threads (cores) writing pairs of 64-bit words, ordering gets mangled and you can store a position where the first 8 bytes are from position P1 and the second 8 are from position P2, leading to exactly this. I've seen dozens per minute in some positions. With the lockless hashing I use, however, it never happens.
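For reference, a minimal sketch of the XOR trick behind this kind of lockless hashing (the scheme Hyatt and Mann published for Crafty; the entry layout and names here are illustrative): the first word stores key ^ data, so a torn entry fails the probe's signature check instead of handing back a foreign move.

#include <stdint.h>

/* Same two-word entry as before, still written without any lock. */
typedef struct {
    uint64_t key_xor_data;  /* Zobrist key XORed with the data word */
    uint64_t data;          /* packed move, score, depth, bound     */
} TTEntry;

static void tt_store(TTEntry *e, uint64_t key, uint64_t data) {
    /* Two plain stores; another thread can still tear them apart... */
    e->key_xor_data = key ^ data;
    e->data         = data;
}

static int tt_probe(const TTEntry *e, uint64_t key, uint64_t *data_out) {
    uint64_t d = e->data;
    /* ...but a torn entry almost surely fails this check, because the
       XOR no longer cancels out, so it is rejected like any mismatch. */
    if ((e->key_xor_data ^ d) != key)
        return 0;
    *data_out = d;
    return 1;
}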
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

diep wrote:
zamar wrote: AFAIK hash key collisions and multithreading are the only sources of illegal moves in the TT.

Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.

We have no control over thread execution, so in the extreme case this could happen:

thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)

[sleep(1second) = a point where the OS decides to pause the thread's execution]
What you write is impossible in the hardware :)

In fact it was impossible even in the hardware from the '90s.

They wrote 32 bytes at once :)

It really has cache lines of 64 bytes, which is 512 bits, and it really writes 1536 bits at once, not 64.

So it can only happen when an entry overlaps 2 cache lines and 2 different threads start writing to the RAM.

As all this is optimized to be atomic, even when 2 threads write simultaneously it still hardly happens. That's why the odds of it are smaller than those of a collision.
The cache reads/writes 32 (nowadays 64) bytes in one large gulp, but the CPUs don't do that: they issue individual stores...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

Sven Schüle wrote:
diep wrote:
zamar wrote: AFAIK hash key collisions and multithreading are the only sources of illegal moves in the TT.

Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.

We have no control over thread execution, so in the extreme case this could happen:

thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)

[sleep(1second) = a point where the OS decides to pause the thread's execution]
What you write is impossible in the hardware :)

In fact it was impossible even in the hardware from the '90s.

They wrote 32 bytes at once :)

It really has cache lines of 64 bytes, which is 512 bits, and it really writes 1536 bits at once, not 64.

So it can only happen when an entry overlaps 2 cache lines and 2 different threads start writing to the RAM.

As all this is optimized to be atomic, even when 2 threads write simultaneously it still hardly happens. That's why the odds of it are smaller than those of a collision.
Can you explain how this works in the case of cache misses on both CPUs? Especially, what happens on the memory bus if CPU2 performs both instructions "store first 64 bits of hash element" and "store second 64 bits of hash element" before CPU1 performs its "store second 64 bits of hash element" instruction? Can "cache coherency" prevent this from happening?

Sven
No, it can't. This is known in computer architecture as a "write-after-write hazard", and cache coherency has no effect on it at all. It is a simple race condition. Nothing can solve this, and on some machines even the hardware will cause it to happen (the Alpha does out-of-order memory writes, for example).
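The race is easy to demonstrate. Below is a small self-contained test (a sketch for illustration, not code from any engine): two writer threads each store matched pairs into a shared two-word slot, and the main thread counts how often it reads halves from different writers. Relaxed atomics are used only to keep the C well-defined; the pair of stores is still not atomic, which is the whole point.

#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>

/* One two-word "hash entry". Each writer always stores a matched pair. */
static _Atomic uint64_t word0, word1;

static void *writer(void *arg) {
    uint64_t v = (uint64_t)(uintptr_t)arg;
    for (;;) {
        /* Two separate stores: each one is atomic, the pair is not. */
        atomic_store_explicit(&word0, v, memory_order_relaxed);
        atomic_store_explicit(&word1, v, memory_order_relaxed);
        v += 2;  /* thread 1 writes odd values, thread 2 even ones */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, writer, (void *)(uintptr_t)1);
    pthread_create(&t2, NULL, writer, (void *)(uintptr_t)2);

    unsigned long torn = 0;
    const unsigned long probes = 100000000UL;
    for (unsigned long i = 0; i < probes; i++) {
        uint64_t a = atomic_load_explicit(&word0, memory_order_relaxed);
        uint64_t b = atomic_load_explicit(&word1, memory_order_relaxed);
        if ((a ^ b) & 1)  /* different parity: halves from different writers */
            torn++;
    }
    printf("torn entries seen: %lu out of %lu probes\n", torn, probes);
    return 0;  /* exiting main also ends the writer threads */
}

On any multi-core x86 box this reports torn pairs almost immediately; coherency orders each individual store, but nothing orders the pair.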
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Speaking of the hash table

Post by mcostalba »

We experienced these collisions (actually races) indirectly, because for long months (more than a year, even) we had a subtle bug, triggered by a race on TT access, that led to a crash. The bug was rare but quite constant during tests: about once every 2-3 days (50-70 hours) on a couple of quads with 4 threads per engine.

For the bug itself a simple TT race was not enough (although it was a prerequisite; we never crashed in single-threaded tests): the race had to produce a "special" corrupted TT move in a "special" position, where the net result was that the king remained in check with the opponent side to move. This makes me think that the frequency of TT races was (and is) much higher than that.

We know it was a race and not an alias (key collision) because once it was eventually fixed (with a better TT move legality check) we experienced no more crashes in SF.
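A hedged sketch of the kind of guard described above: before trusting a hash move, verify it against the current position. The helper names below are hypothetical placeholders for whatever the engine's move machinery provides, not Stockfish's actual API:

/* Hypothetical guard before searching a hash move first; the helpers
   (is_pseudo_legal, is_legal) are illustrative names, not any
   particular engine's real functions. */
Move ttMove = entry->move;

if (ttMove != MOVE_NONE
    && is_pseudo_legal(pos, ttMove)   /* right piece on the from-square,
                                         sane target square, valid
                                         promotion encoding, ...        */
    && is_legal(pos, ttMove))         /* does not leave the own king in
                                         check                          */
{
    /* safe to try ttMove first */
}
else
    ttMove = MOVE_NONE;               /* treat the entry's move as corrupt */

The point is that the check must be strict enough to reject every corrupted move, including the "special" ones that only break deeper invariants such as leaving the king in check.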
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Speaking of the hash table

Post by Sven »

bob wrote:
diep wrote:
zamar wrote: Multithreading. A typical hash table entry is 128 bits.

Thread1.write(64bits)
Thread2.write(64bits)
Thread2.write(64bits)
Thread1.write(64bits)

leads to a corrupted entry
First of all, this can only happen when you have a parallel searching engine.

Did Rebel already get SMP?

Secondly, on PC hardware this happens even more seldom than collisions, as it can only happen with stores :)

Around 1 in 200 billion stores :)

[...]
It can happen far more frequently than that. This has been discussed in the past. With 8 threads (cores) writing pairs of 64-bit words, ordering gets mangled and you can store a position where the first 8 bytes are from position P1 and the second 8 are from position P2, leading to exactly this. I've seen dozens per minute in some positions. With the lockless hashing I use, however, it never happens.
Thanks for your confirmation; that was my understanding as well so far, but Vincent's replies did raise some doubts, if just for a few microseconds ;-)

Sven
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Speaking of the hash table

Post by Sven »

bob wrote:
Sven Schüle wrote:
diep wrote:
zamar wrote: AFAIK hash key collisions and multithreading are the only sources of illegal moves in the TT.

Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.

We have no control over thread execution, so in the extreme case this could happen:

thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)

[sleep(1second) = a point where the OS decides to pause the thread's execution]
What you write is impossible in the hardware :)

In fact it was impossible even in the hardware from the '90s.

They wrote 32 bytes at once :)

It really has cache lines of 64 bytes, which is 512 bits, and it really writes 1536 bits at once, not 64.

So it can only happen when an entry overlaps 2 cache lines and 2 different threads start writing to the RAM.

As all this is optimized to be atomic, even when 2 threads write simultaneously it still hardly happens. That's why the odds of it are smaller than those of a collision.
Can you explain how this works in the case of cache misses on both CPUs? Especially, what happens on the memory bus if CPU2 performs both instructions "store first 64 bits of hash element" and "store second 64 bits of hash element" before CPU1 performs its "store second 64 bits of hash element" instruction? Can "cache coherency" prevent this from happening?

Sven
No, it can't. This is known in computer architecture as a "write-after-write hazard", and cache coherency has no effect on it at all. It is a simple race condition. Nothing can solve this, and on some machines even the hardware will cause it to happen (the Alpha does out-of-order memory writes, for example).
Thanks again for confirming this part of the story as well. I think the facts are now clear to most readers.

Sven
Rebel
Posts: 6995
Joined: Thu Aug 18, 2011 12:04 pm

Re: Speaking of the hash table

Post by Rebel »

Joost Buijs wrote: This must be some kind of bug; maybe there is something wrong with your Zobrist numbers.
Zobrist numbers are okay, but I found the 2 (!!) bastards in the meantime.

1. Checkmate situations

2. Promotions

The strength increase is 0.0 Elo, but it's nice to have everything clean.
I saw this happen once, when I replaced my 64-bit random function with another one that contained a small bug.

I use 64-bit keys and I can let my program run for ages before I get a wrong move from the hash table. I don't test for legality anymore, because it really is not necessary.
Indeed, that's how it is supposed to be.
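Since a weak or buggy generator is exactly what bites here, a minimal sketch of Zobrist key initialization using splitmix64 (one well-distributed 64-bit generator; the table shape below is illustrative, as it varies by engine):

#include <stdint.h>

/* splitmix64: a small, well-distributed 64-bit PRNG. A biased or buggy
   generator produces correlated keys and thus far more false matches. */
static uint64_t rng_state = 0x9E3779B97F4A7C15ULL;  /* arbitrary seed */

static uint64_t splitmix64(void) {
    uint64_t z = (rng_state += 0x9E3779B97F4A7C15ULL);
    z = (z ^ (z >> 30)) * 0xBF58476D1CE4E5B9ULL;
    z = (z ^ (z >> 27)) * 0x94D049BB133111EBULL;
    return z ^ (z >> 31);
}

/* One key per (piece, square), plus side-to-move, castling-rights and
   en-passant terms; the exact layout differs from engine to engine. */
static uint64_t zobrist_psq[12][64];
static uint64_t zobrist_stm;

static void init_zobrist(void) {
    for (int p = 0; p < 12; p++)
        for (int s = 0; s < 64; s++)
            zobrist_psq[p][s] = splitmix64();
    zobrist_stm = splitmix64();
}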
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

bob wrote:
diep wrote:
zamar wrote: Multithreading. A typical hash table entry is 128 bits.

Thread1.write(64bits)
Thread2.write(64bits)
Thread2.write(64bits)
Thread1.write(64bits)

leads to a corrupted entry
First of all, this can only happen when you have a parallel searching engine.

Did Rebel already get SMP?

Secondly, on PC hardware this happens even more seldom than collisions, as it can only happen with stores :)

Around 1 in 200 billion stores :)

If Ed has his hash table aligned so that a whole number of entries forms a cache line, you can prove it cannot happen on a PC.

Note that an i7 (not Sandy Bridge) presents itself as having a 64-byte cache line; however, the memory controller only stores 192 bytes at a time. Sandy Bridge stores 256 bytes at once.

It has 4 memory channels.

Note that when reading it can abort prematurely, reading fewer bytes, which is why the read latency to RAM is so good for us.

It can abort after reading 32 bytes.

So the only way it can go wrong is when an entry sits at the end of a cache line, in this case 256 bytes, and by accident one cache line gets updated more quickly than the other.

The odds of this happening are really, really tiny, especially if your hash table is some gigabytes in size.

So it cannot happen the way you represented it, as the hardware doesn't write 64 bits; it writes 256 bytes at once.
It can happen far more frequently than that. This has been discussed in the past. With 8 threads (cores) writing pairs of 64-bit words, ordering gets mangled and you can store a position where the first 8 bytes are from position P1 and the second 8 are from position P2, leading to exactly this. I've seen dozens per minute in some positions. With the lockless hashing I use, however, it never happens.
Bob, you're making this story up.

You did not see "dozens of write errors a minute".

I've been testing this extensively, and you need dozens of billions of stores to get one.

Of course I have ECC machines here. Maybe you had a bad DIMM during your tests, or your memory again serves you badly?

Also realize the Alpha is an HPC processor. Today we only have x64 CPUs, which is a different architecture.

Between the Alpha and the memory controller there are several steps to connect to other memory. Those steps can sometimes cause the AbCD pattern, which is not possible on x86/x64 hardware, as those connect directly to the RAM.

Even then, this happens VERY SELDOM on the Alpha.

At the time I was in contact with the technical design teams for memory controllers, to verify how often all this could occur on different processors, among them the Alpha and the R14000, and each time I got the same answers from the engineers.

Even then, I tested this extensively on x86, and if I got 1 write error or 1 or 2 collisions, that was a lot in every single test.

This testing ran for months in 2003.

So of all the people here it seems I'm the only one who really measured the number of write errors and the number of collisions.

P.S. Note that Alphas were later even used for serving storage, up to 2007, at the petabyte level; that's where I worked at the time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Speaking of the hash table

Post by bob »

Sven Schüle wrote:
bob wrote:
Sven Schüle wrote:
diep wrote:
zamar wrote: AFAIK hash key collisions and multithreading are the only sources of illegal moves in the TT.

Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.

We have no control over thread execution, so in the extreme case this could happen:

thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)

[sleep(1second) = a point where the OS decides to pause the thread's execution]
What you write is impossible in the hardware :)

In fact it was impossible even in the hardware from the '90s.

They wrote 32 bytes at once :)

It really has cache lines of 64 bytes, which is 512 bits, and it really writes 1536 bits at once, not 64.

So it can only happen when an entry overlaps 2 cache lines and 2 different threads start writing to the RAM.

As all this is optimized to be atomic, even when 2 threads write simultaneously it still hardly happens. That's why the odds of it are smaller than those of a collision.
Can you explain how this works in the case of cache misses on both CPUs? Especially, what happens on the memory bus if CPU2 performs both instructions "store first 64 bits of hash element" and "store second 64 bits of hash element" before CPU1 performs its "store second 64 bits of hash element" instruction? Can "cache coherency" prevent this from happening?

Sven
No, it can't. This is known in computer architecture as a "write-after-write hazard", and cache coherency has no effect on it at all. It is a simple race condition. Nothing can solve this, and on some machines even the hardware will cause it to happen (the Alpha does out-of-order memory writes, for example).
Thanks again for confirming this part of the story as well. I think the facts are now clear to most readers.

Sven
There are some older versions of Crafty without the "lockless hashing" code. One only has to run one of them on a multi-CPU box to see the problem. It doesn't happen every move, but it happens a bunch. More cores = more bad moves from the hash.

If you do this to store a 16-byte entry:

mov hash[rax], rbx
mov hash+8[rax], rcx

you have a problem, because the first move can be executed, writing the first 8 bytes to the hash, then that process can get suspended, and the second can execute both instructions, then the first comes back and writes the 2nd 8 bytes and breaks the entry. There are other ways for it to occur as well, but the basic idea is that the two writes get split, with two other writes in between. More threads mean this becomes more likely, particularly when a hash transaction is generally going to be a cache miss anyway, sort of creating a pseudo-synchronization point since the access time is so slow on such entries. In Cray Blitz, we originally locked entries since semaphores were so fast there, but we (Harry Nelson's idea) started on the lockless approach to reduce semaphore usage when we got to 32 cpus in 1993 or 1994...