Speaking of the hash table

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Speaking of the hash table

Post by Sven »

diep wrote:
zamar wrote:AFAIK hash key collisions and multithreading are the only source of illegal moves in TT.

Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.

We have no control over thread execution, so in an extreme case this could happen:

thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)

[sleep(1second) = the situation where the OS decides to pause thread execution]
What you write is impossible in the hardware :)

In fact it was even impossible in hardware from the 90s.

That hardware wrote 32 bytes at once :)

A modern CPU really has cachelines of 64 bytes, which is 512 bits, and it really writes 1536 bits at once, not 64.

So only when an entry overlaps 2 cachelines and 2 different threads start to write to the RAM can it happen.

As all this is atomically optimized, even in case 2 cores write simultaneously, it still hardly ever happens. That's why the odds of it are smaller than those of a collision.
Can you explain how this works in the case of cache misses on both CPUs? Especially, what happens on the memory bus if CPU2 performs both instructions "store first 64 bits of hash element" and "store second 64 bits of hash element" before CPU1 performs its "store second 64 bits of hash element" instruction? Can "cache coherency" prevent this from happening?

Sven
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

zamar wrote:AFAIK hash key collisions and multithreading are the only source of illegal moves in TT.
You forget to mention bugs.
Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.
You are correct, except that OS interruptions are not even needed for this to happen.

If one core executes two instructions each writing 64 bits into the same cacheline, there is no guarantee that in between those writes another core does not write to the same cacheline.

Also note that it is not even necessary that the hash entry in memory gets corrupted. It is possible that a thread retrieving the data stored in the hash entry retrieves data that is half from before and half from after a hash entry update by another thread. So the hash key retrieved might belong to the position being searched, while the hash move retrieved might belong to the position with which the entry was just overwritten.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Speaking of the hash table

Post by hgm »

Bob invented lockless hashing for this. It is almost free, so why not use it?
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:Note that in case of an i7 (not sandy bridge), it shows itself as having 64 bytes cacheline, however the memory controller only stores 192 bytes at a time. Sandy Bridge stores 256 bytes at once.
You are forgetting that cache lines can be stolen by other cores between two instructions and even halfway single (non-atomic) instructions.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

Sven Schüle wrote:
diep wrote:
zamar wrote:AFAIK hash key collisions and multithreading are the only source of illegal moves in TT.

Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.

We have no control over thread execution, so in an extreme case this could happen:

thread1.write(64bits)
sleep(1second)
thread2.write(64bits)
thread2.write(64bits)
sleep(1second)
thread1.write(64bits)

[sleep(1second) = the situation where the OS decides to pause thread execution]
What you write is impossible in the hardware :)

In fact it was even impossible in hardware from the 90s.

That hardware wrote 32 bytes at once :)

A modern CPU really has cachelines of 64 bytes, which is 512 bits, and it really writes 1536 bits at once, not 64.

So only when an entry overlaps 2 cachelines and 2 different threads start to write to the RAM can it happen.

As all this is atomically optimized, even in case 2 cores write simultaneously, it still hardly ever happens. That's why the odds of it are smaller than those of a collision.
Can you explain how this works in the case of cache misses on both CPUs? Especially, what happens on the memory bus if CPU2 performs both instructions "store first 64 bits of hash element" and "store second 64 bits of hash element" before CPU1 performs its "store second 64 bits of hash element" instruction? Can "cache coherency" prevent this from happening?

Sven
Oh, this works in a very complex way. AMD uses a totally different concept there from Intel, and the i7 is again very different from the Core 2.

But to summarize: this is why those CPUs have so few cores. Cache coherency is one of their big scaling weaknesses.

To see it simply, use the 90s concept, which is a lot easier to follow.

See it as a buffer there, literally called a store buffer, which first had to be filled; only when storing into the 2nd cacheline would it throw the entire cacheline to the memory controller in order to store it. Don't quote me on that, as it seems the Opteron already only takes action after you reference the 3rd cacheline.

See it as the CPU having several cachelines open at the same time, synchronizing atomically between them, and writing to the memory controller only when forced.

Now if 2 different cores are busy with the same cacheline, you hit a bottleneck on all CPUs.

Now the i7 is more clever there, as it can already fetch without first storing to the memory controller. So only the L3 is involved.

At AMD, if you have 2 CPUs busy with the same cacheline, it's a long wait, as things first get synchronized to the RAM.

The AMD concept was IP taken over from the DEC Alpha. Namely, every cacheline has a master/slave relationship: a bit describing which memory controller owns that cacheline (those 64 bytes).

This appeared to work rather well. The initial Opterons kicked butt, and the 4-socket Opteron released soon after as well, in April 2004.

I do not know what AMD is doing with today's Bulldozer. Probably it is still similar to the original concept, with a few new tricks they patented.

Intel had a problem there up to the i7, where they started doing things within the SRAM (the L3).

Basically see it as 2 loops the data moves through before hitting the RAM.

I'm sure some sites document this a little.

Please note that some memory experts have big criticism of nearly every article written about memory controllers. Not a single one really documents the global concepts of AMD versus Intel well.

It's a patents thing.
That tells you something about what is happening there.

David Kanter usually tries at realworldtech; note that his memory articles always catch big criticism, but at least he shines a light. He's a big Intel fanboy - let me warn you there (he admitted owning a considerable share in Intel).

I tried to figure out what happens in the L3 cache of the i7, as when you work together with different cores that's mighty interesting - for example for exchanges, which have to process in real time data arriving over the ethernet every few microseconds during surges.

Despite their memory controller working great for 1 and 2 sockets, Intel had big problems with 4-socket motherboards. This is probably why the 4-socket motherboards were not so popular (together with the high price - the 8-socket i7 thing with 80 cores was initially in the shop for $200k at oracle.com). Intel then undertook steps to solve the cache coherency between 4 sockets, but I don't know what they did to reduce the traffic there.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
zamar wrote:AFAIK hash key collisions and multithreading are the only source of illegal moves in TT.
You forget to mention bugs.
Not 100% sure, but I think that cache line alignment doesn't protect you from the issue.
You are correct, except that OS interruptions are not even needed for this to happen.

If one core executes two instructions each writing 64 bits into the same cacheline, there is no guarantee that in between those writes another core does not write to the same cacheline.

Also note that it is not even necessary that the hash entry in memory gets corrupted. It is possible that a thread retrieving the data stored in the hash entry retrieves data that is half from before and half from after a hash entry update by another thread. So the hash key retrieved might belong to the position being searched, while the hash move retrieved might belong to the position with which the entry was just overwritten.
If you write 64 bits within the same cacheline, it goes OK :)

If our hash entry has a few bytes in cacheline A and the remaining bytes in cacheline B, then we have ourselves a ballgame.

However, even then this ballgame goes pretty well OK in most cases. So we really need the event that a thread gets stopped in between writing the 2 cachelines.

So cacheline A has been written, yet not cacheline B.

That's not 64 bits. That's, in the case of AMD, 64 bytes.
Or in the case of 3-channel DDR3 we speak of 192 bytes.

So say our thread gets stopped right before writing to the 2nd cacheline. As soon as it has executed the instruction, our atomically correct memory controllers follow a protocol that ensures it goes correct.

So the odds of this occurring are therefore that low. Somewhere around 1 in 200 billion nodes, I measured for Diep.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
diep wrote:Note that in case of an i7 (not sandy bridge), it shows itself as having 64 bytes cacheline, however the memory controller only stores 192 bytes at a time. Sandy Bridge stores 256 bytes at once.
You are forgetting that cache lines can be stolen by other cores between two instructions and even halfway single (non-atomic) instructions.
This case goes 100% atomically correct.

There is just 1 case that goes wrong, which I have described several times here now.
That's why the odds are so small that this goes wrong.

Just measure it.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:
syzygy wrote:
diep wrote:Note that in case of an i7 (not sandy bridge), it shows itself as having 64 bytes cacheline, however the memory controller only stores 192 bytes at a time. Sandy Bridge stores 256 bytes at once.
You are forgetting that cache lines can be stolen by other cores between two instructions and even halfway single (non-atomic) instructions.
This case goes 100% atomically correct.
So are you saying that in between two writes to the same cacheline, that cacheline cannot be stolen by another core?

If the cacheline is stolen, only the first write will have been committed to memory. The full cacheline is written, but the hash entry still has only half the new data. It is then overwritten by the other core, after which the first core writes the other half. The result is a corrupted hash entry.

And of course Joona is right that if you allow for OS scheduling in between two instructions, it is even more clear that hash entries can get corrupted due to SMP. There is no way you can prevent the OS from interrupting your thread in between the two writes. And unless you set affinities, the second write can be executed by a completely different core.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
diep wrote:
syzygy wrote:
diep wrote:Note that in case of an i7 (not sandy bridge), it shows itself as having 64 bytes cacheline, however the memory controller only stores 192 bytes at a time. Sandy Bridge stores 256 bytes at once.
You are forgetting that cache lines can be stolen by other cores between two instructions and even halfway single (non-atomic) instructions.
This case goes 100% atomically correct.
So are you saying that in between two writes to the same cacheline, that cacheline cannot be stolen by another core?

If the cacheline is stolen, only the first write will have been committed to memory. The full cacheline is written, but the hash entry still has only half the new data. It is then overwritten by the other core, after which the first core writes the other half. The result is a corrupted hash entry.

And of course Joona is right that if you allow for OS scheduling in between two instructions, it is even more clear that hash entries can get corrupted due to SMP. There is no way you can prevent the OS from interrupting your thread in between the two writes. And unless you set affinities, the second write can be executed by a completely different core.
This is entirely correct for PC processors.

The exception is with some HPC processors, where it can occur if your program hits a control-break or a Control-C, for example.

However, in such a case your program terminates anyway...

So those big exceptions we can safely ignore here for a chess program.

Joona is incorrect that it can happen with 64 bits. Different CPUs cannot interleave like that: one writes just 64 bits to a memory cacheline, then another one writes 64 bits, then the first one writes 64 bits again. This scenario is normally speaking not possible if you store to the hashtable in a normal sequential manner.

This 64-bit problem would only happen if you have messed-up code.

Like: you write 64 bits here, then go write in another hashtable entry referencing another area of the RAM (in short, telling the CPU to already flush this to the memory controller),
then again write the next 64 bits, go write in another hashtable entry,
and then write the 3rd 64 bits.

This is of course theoretic nonsense which we can safely ignore.

If you do a normal store of the entire entry, you will simply not run into trouble except for 1 case.

It can only happen in between cachelines, as the CPU simply hasn't written that cacheline yet while it's still busy in this one.

It's very easy to understand why what Joona says cannot happen.

CPUs simply would not be able to deliver the bandwidth to the memory controller that you need them to have.

To make it more clear: say we have 2 CPUs and the cacheline split falls within the entry.

Say our entry is 4 integers of 64 bits (32 bytes).

CPU 0 normally writes ABCD and CPU 1 writes abcd.

Now suppose the cacheline split happens not to be aligned with these 32 bytes (it is of course possible that things aren't aligned).

Then, depending on where the cacheline starts or stops, we can basically have ABCD, abcd, ABcd, abCD, Abcd, ABCd, etc.

We can NOT have: AbCD, aBcd, AbCd, etc.

In Diep's hashtable, with the Zobrist trick, I obviously only check whether the start of the entry is consistent with the end of the entry. In-between checks are pretty useless.

So I'm currently storing 24 bytes or so (could become 32 soon - the luxury of hardware progress). And I'm using 32-bit integers for now to store.

So I have 6 quadwords. I just XOR quadword 0 with quadword 5,

and I XOR quadword 1 with 4, etc.

There is no need to check for cases like AbCD; that's nonsense :)

Note that AbCD could, on some HPC processors (which you do not have at home), occur when giving a control-break to the program.
Last edited by diep on Sun Dec 09, 2012 8:22 pm, edited 1 time in total.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:If you do a normal store of the entire entry, you will simply not run into trouble except for 1 case.
But you can't write 16 bytes atomically using TWO write instructions. That the 16 bytes are in the same cacheline does not make a difference.

The only way to write 16 bytes atomically on x86-64 is using "lock cmpxchg16b".
It can only happen in between cachelines, as the CPU simply hasn't written that cacheline yet while it's still busy in this one.
The first write by the first core makes the cacheline dirty. The first write by the second core can force the first core to write the dirty cacheline back to memory (or maybe to a higher-level cache) before the second write by the first core has taken place.

The execution of two consecutive instructions is NOT atomic. Even the execution of most single instructions is not atomic. Aligned 8-byte writes to a single cacheline are atomic, though.