Speaking of the hash table


diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
diep wrote:If you do a normal store of the entire entry, you will simply not run into trouble except for 1 case.
But you can't write 16 bytes atomically using TWO write instructions. That the 16 bytes are in the same cacheline does not make it different.

The only way to write 16 bytes atomically on x86-64 is using "lock cmpxchg16b".
It can only happen between cache lines, as the core simply hasn't written that cache line yet while it is still busy with this one.
The first write by the first core makes the cache line dirty. The first write by the second core can force the first core to write the dirty cache line to memory (or maybe to a higher level cache) before the second write by the first core has taken place.

The execution of two consecutive instructions is NOT atomic. Even the execution of most single instructions is not atomic. Single 8-byte writes to a single cache line are atomic, though.
I edited my post, read it.

The Kiiski scenario of AbCD cannot occur.

You are confusing atomic execution of instructions with how the caches and protocols store onto the memory controller.

They use buffering. So they either write ABCD at once, or, if A is the last word of a cache line, they write A to it. They cannot first write A, then flush the cache line, then write B, etc.

This ain't gonna happen if A and B are in the same cache line. It simply hasn't flushed that cache line yet.

This is why the occurrence of a write error is so rare. You can just test this, you know.

Be patient when testing though; when I tested, it occurred really seldom :)

In fact I had about the same number of write errors as I had collisions.

And as most engines store fewer bits than Diep does and search more nodes than they do hash stores/reads, it's easy to see that there are more collisions for them than there are write errors.
Last edited by diep on Sun Dec 09, 2012 8:28 pm, edited 1 time in total.
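
To make the quoted cmpxchg16b remark concrete: storing a full 16-byte entry in one atomic shot on x86-64 looks roughly like the sketch below. This is a minimal illustration, assuming GCC or Clang on x86-64 built with -mcx16, a 16-byte-aligned slot, and a hypothetical two-word entry layout (none of this is Diep's or Stockfish's actual code).

Code:

#include <stdint.h>
#include <string.h>
#include <stdbool.h>

/* Hypothetical 16-byte hash entry; not any particular engine's layout. */
typedef struct { uint64_t key; uint64_t data; } HashEntry;

/* Publish a full 16-byte entry in one shot via lock cmpxchg16b.
   Requires `slot` to be 16-byte aligned and a build with -mcx16. */
static void store_entry_16b(HashEntry *slot, HashEntry newval)
{
    unsigned __int128 desired, expected;
    memcpy(&desired, &newval, sizeof desired);
    memcpy(&expected, slot, sizeof expected);      /* seed value; may be stale */
    while (!__atomic_compare_exchange_n((unsigned __int128 *)slot,
                                        &expected, desired, false,
                                        __ATOMIC_RELAXED, __ATOMIC_RELAXED))
        ;  /* on failure, expected is refreshed with the slot's current value */
}

The price is a locked read-modify-write (plus a retry loop) on every hash store, which is why engines normally prefer plain stores combined with some validation scheme instead.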
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:
syzygy wrote:
diep wrote:If you do a normal store of the entire entry, you will simply not run into trouble except for 1 case.
But you can't write 16 bytes atomically using TWO write instructions. That the 16 bytes are in the same cacheline does not make it different.

The only way to write 16 bytes atomically on x86-64 is using "lock cmpxchg16b".
It can only happen between cache lines, as the core simply hasn't written that cache line yet while it is still busy with this one.
The first write by the first core makes the cache line dirty. The first write by the second core can force the first core to write the dirty cache line to memory (or maybe to a higher level cache) before the second write by the first core has taken place.

The execution of two consecutive instructions is NOT atomic. Even the execution of most single instructions is not atomic. Single 8-byte writes to a single cache line are atomic, though.
I edited my post, read it.

The Kiiski scenario of AbCD cannot occur.
What I wrote still stands.

The short story is this: yes, cache lines essentially are read from and written to main memory as a whole. But no, the programmer has no control on when they are written back. They can be written back between two instructions writing to that same cache line.
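
To make the point concrete: the store most engines actually perform is two separate 8-byte writes, and the cache line can be pulled away between them. A minimal sketch, with a hypothetical two-word entry (the field names are illustrative):

Code:

#include <stdint.h>

/* Hypothetical two-word entry; with a 16-byte-aligned table both words
   sit in the same cache line. */
typedef struct { uint64_t key; uint64_t data; } HashEntry;

static void store_entry_plain(HashEntry *slot, uint64_t key, uint64_t data)
{
    slot->key  = key;    /* first 8-byte store, atomic on its own */
    /* Another core may request the cache line right here; the line then
       leaves this core carrying the new key but the old data, and a reader
       can observe a torn, half-old half-new entry of the kind under
       discussion. */
    slot->data = data;   /* second 8-byte store, also atomic on its own */
}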
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
diep wrote:
syzygy wrote:
diep wrote:If you do a normal store of the entire entry, you will simply not run into trouble except for 1 case.
But you can't write 16 bytes atomically using TWO write instructions. That the 16 bytes are in the same cacheline does not make it different.

The only way to write 16 bytes atomically on x86-64 is using "lock cmpxchg16b".
It can only happen between cache lines, as the core simply hasn't written that cache line yet while it is still busy with this one.
The first write by the first core makes the cache line dirty. The first write by the second core can force the first core to write the dirty cache line to memory (or maybe to a higher level cache) before the second write by the first core has taken place.

The execution of two consecutive instructions is NOT atomic. Even the execution of most single instructions is not atomic. Single 8-byte writes to a single cache line are atomic, though.
I edited my post, read it.

The Kiiski scenario of AbCD cannot occur.
What I wrote still stands.
It is very naive to believe that, if this core has a local store buffer, another core would just squeeze in these 64 bits while the memory controller doesn't have the data from this local store buffer yet :)

Imagine what would happen if all cores were busy at the atomic level with just 64 bits of RAM.

We'd have no bandwidth to the RAM then!

What happens is that this core writes into a cache line, or into 2, and only *after* that does it get thrown to the memory controller, which then atomically ensures correctness. Namely, either the cache line as core0 wrote it gets written to the RAM, or the one as core2 wrote it.

So the scenario AbCD simply cannot occur.

Also, what cannot occur: suppose we have a 3rd thread writing WXYZ.

We cannot have a mix of ABCD, abcd, WXYZ.

We can prove that, control codes excepted, this cannot occur.

The transposition table entries are SMALLER than a cache line, namely say between 10 and 32 bytes for most of us.

And the cache lines are 64+ bytes (some are even 128, 192, or 256).

So we can only mix up at most 64 bytes from core 1 with 64 bytes from core 2. In short, a mixture within, say, 32 bytes coming from 3 different cores at the same time is not possible, as these 2 cache lines can be written by at most 2 cores.
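
Whether a single entry can span two cache lines at all depends only on the table's alignment and the entry size. A small helper to check it, assuming 64-byte lines (an assumption; some CPUs use other sizes):

Code:

#include <stdint.h>
#include <stddef.h>

#define CACHE_LINE 64   /* assumed line size; not universal */

/* Returns 1 if the size-byte object at addr straddles a line boundary. */
static int crosses_cache_line(const void *addr, size_t size)
{
    uintptr_t first = (uintptr_t)addr / CACHE_LINE;
    uintptr_t last  = ((uintptr_t)addr + size - 1) / CACHE_LINE;
    return first != last;
}

With a 64-byte-aligned table and an entry size that divides 64 (16 bytes, for example), no entry ever straddles a line; that alignment condition is presumably what the "when being aligned" remark about Fritz later in the thread relies on.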
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
diep wrote:
syzygy wrote:
diep wrote:If you do a normal store of the entire entry, you will simply not run into trouble except for 1 case.
But you can't write 16 bytes atomically using TWO write instructions. That the 16 bytes are in the same cacheline does not make it different.

The only way to write 16 bytes atomically on x86-64 is using "lock cmpxchg16b".
It can only happen between cache lines, as the core simply hasn't written that cache line yet while it is still busy with this one.
The first write by the first core makes the cache line dirty. The first write by the second core can force the first core to write the dirty cache line to memory (or maybe to a higher level cache) before the second write by the first core has taken place.

The execution of two consecutive instructions is NOT atomic. Even the execution of most single instructions is not atomic. Single 8-byte writes to a single cache line are atomic, though.
I edited my post, read it.

The Kiiski scenario of AbCD cannot occur.
What I wrote still stands.

The short story is this: yes, cache lines essentially are read from and written to main memory as a whole. But no, the programmer has no control on when they are written back. They can be written back between two instructions writing to that same cache line.
Ok we are 1 step further now. You are no longer denying that the core hammers 64+ bytes at once to the memory controller.

Progress!

You're still not getting the idea though, are you?

AbCD as Joona wrote it cannot occur.

Suppose we do not have an interrupt of this process P0, as interrupts kill our process.

It's busy writing to its local store buffer.

The store buffer has A, and it hasn't written B yet, which is in the same cache line. The thread gets stopped for 1 millisecond.

Our 'A' is still in the store buffer. It's not written to the memory controller yet.

Then after a while our thread P0 resumes. It writes a B. Still nothing happens. It's still local.

Then it writes a C, which is in the next cache line. Our core says to itself: "heh, what the hell, I'm referencing a 2nd cache line now, so I can write the first 64 bytes to the memory controller"

This is roughly what happens.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:It is very naive to believe that, if this core has a local store buffer, another core would just squeeze in these 64 bits while the memory controller doesn't have the data from this local store buffer yet :)
The store buffer will be flushed obviously. The point is you have no control on when this happens. There are no guarantees.

You undoubtedly know that if two threads increment the same counter located in memory, some of the increments can be lost. Between the memory read and the memory write that make up the increment, the cache line can ping pong between cores. The same holds for two memory writes.
Imagine what would happen if all cores were busy at the atomic level with just 64 bits of RAM.

We'd have no bandwidth to the RAM then!
All cores working on the same cache line kills performance, so you have to avoid that situation. Threads should as much as possible work on different cache lines (but all cores only reading is fine).
What happens is that this core writes into a cache line, or into 2, and only *after* that does it get thrown to the memory controller, which then atomically ensures correctness. Namely, either the cache line as core0 wrote it gets written to the RAM, or the one as core2 wrote it.
The problem is that the cache line can get thrown to the memory controller *between* the two writes. This can happen when another core needs that cache line.
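
The lost-increment effect mentioned above is easy to reproduce; a minimal sketch using POSIX threads (compile with -pthread, the iteration count is arbitrary):

Code:

#include <pthread.h>
#include <stdio.h>

#define ITERATIONS 1000000

static volatile long counter = 0;    /* shared, deliberately not atomic */

static void *worker(void *arg)
{
    (void)arg;
    for (int i = 0; i < ITERATIONS; i++)
        counter = counter + 1;       /* separate load and store, not atomic */
    return NULL;
}

int main(void)
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* Almost always prints less than 2000000: increments are lost when the
       cache line moves between the cores in the middle of a read-modify-write,
       just as a hash entry can be torn between two stores. */
    printf("counter = %ld (expected %d)\n", counter, 2 * ITERATIONS);
    return 0;
}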
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:Ok we are 1 step further now. You are no longer denying that the core hammers 64+ bytes at once to the memory controller.
I never denied that... just read what I wrote earlier.
You're still not getting the idea though, are you?
Your idea is flawed.
AbCD as Joona wrote it cannot occur.

Suppose we do not have an interrupt of this process P0, as interrupts kill our process.

It's busy writing to its local store buffer.

The store buffer has A, and it hasn't written B yet, which is in the same cache line. The thread gets stopped for 1 millisecond.

Our 'A' is still in the store buffer. It's not written to the memory controller yet.
No!! If a thread gets interrupted by the OS and a context switch occurs, the store buffer is flushed (and half the hash entry is committed to memory). What do you think, that the OS is saving the core's store buffer as part of the context switch? Wrong.

In addition, even without an OS context switch the memory stores to the cache line can be committed and ownership of the cache line can go to another core *between* the two writes.
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: Speaking of the hash table

Post by diep »

syzygy wrote:
diep wrote:It is very naive to believe that, if this core has a local store buffer, another core would just squeeze in these 64 bits while the memory controller doesn't have the data from this local store buffer yet :)
The store buffer will be flushed obviously. The point is you have no control on when this happens. There are no guarantees.

You undoubtedly know that if two threads increment the same counter located in memory, some of the increments can be lost. Between the memory read and the memory write that make up the increment, the cache line can ping pong between cores. The same holds for two memory writes.
Imagine what would happen if all cores were busy at the atomic level with just 64 bits of RAM.

We'd have no bandwidth to the RAM then!
All cores working on the same cache line kills performance, so you have to avoid that situation. Threads should as much as possible work on different cache lines (but all cores only reading is fine).
What happens is that this core writes into a cache line, or into 2, and only *after* that does it get thrown to the memory controller, which then atomically ensures correctness. Namely, either the cache line as core0 wrote it gets written to the RAM, or the one as core2 wrote it.
The problem is that the cache line can get thrown to the memory controller *between* the two writes. This can happen when another core needs that cache line.
I don't know what idiocy you are busy with, but you are speaking about a scenario that happens once in every 10^20 occasions, and even less often if your processes get all the system time from your CPU. Your RAM will have failed long before then, and your CPU's warranty will have expired long before that as well :)

Additionally, in the case of Stockfish it already loses a factor 1000 in nps when it doesn't get the full system time for all cores, so we will already have had a quadrillion other problems terminating our program before this happens :)

Let's not busy ourselves here with theories about things like radiation from space that can cause a bitflip and produce our aBcd, as that's more likely than what we speak about here.

In the end it is all analogue technology at the microscopic level that WILL fail after a period of time. For example, radiation from space will cause a bitflip once in a while.

Now in your box that hurts more than in mine, as I've got ECC, so if you're going to try to measure it, you will with 99.9999999999999999% certainty be measuring bitflips caused by radiation from outer space :)

We are not busy here with idiocies that never happen in our lifetime for a chess program that runs at most a few hours on a bunch of cores.

You know it won't happen. You know it won't fail. You zoom in on 1 unlikely event because you don't know how the store buffer and cache coherency function, which avoid a lot of problems here.

This whereas for the normal write error we speak about, which happens once in every 10^10 occasions, it is not possible to get AbCD.

Of course, with special interrupts that sometimes happen on machines it can occur more often in theory, however our program then already gets a break anyway.

The normal write errors you can measure if you let your program run on 200+ cores, as I did: in some overnight analysis I had a write error once every 200 billion nodes (= nearly 200 billion writes) and one or 2 collisions.

In that case it is the thing I described.

Namely, you can expect to get Abcd or ABcd or ABCd, or the same with the capitals and lowercase reversed.

Any other scenario you don't need to prepare for, except if you intend to live eternally and run your current box forever :)

In that sense this forum always is a fools' forum (my words), as Frans Morsch already indicated around 1999 when he noted that CCC hadn't understood how the store buffer functions, guaranteeing that Fritz, when aligned, couldn't get any write error at all...
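
Diep's write-error versus collision counts above come from instrumenting the probe path. One hedged way to do that is to store a small redundant check word in each entry so that a torn (half old, half new) entry can be recognized and counted; the layout and names below are purely illustrative, not Diep's actual instrumentation:

Code:

#include <stdint.h>

/* Illustrative 16-byte entry with a redundant check word. */
typedef struct {
    uint64_t key;
    uint32_t data;
    uint32_t check;                   /* low 32 bits of key ^ data */
} HashEntry;

static unsigned long write_errors;    /* torn entries seen while probing */

static int probe(const HashEntry *slot, uint64_t pos_key, uint32_t *data_out)
{
    HashEntry e = *slot;              /* plain, possibly torn, read */
    if (e.key != pos_key)
        return 0;                     /* empty slot or another position */
    if (e.check != (uint32_t)(e.key ^ e.data)) {
        write_errors++;               /* halves written by different threads */
        return 0;
    }
    *data_out = e.data;
    return 1;
}

Genuine key collisions (a different position producing the same 64-bit key) slip past such a check and still have to be caught by, for example, verifying that the hash move is legal in the current position.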
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

diep wrote:I don't know what idiocy you are busy with, but you are speaking about a scenario that happens once in every 10^20 occasions, and even less often if your processes get all the system time from your CPU. Your RAM will have failed long before then, and your CPU's warranty will have expired long before that as well :)

(...)
It seems we're in agreement on the non-atomicity of two consecutive writes.

As long as the program checks the validity of the hash move, it is of course safe to ignore the rare smp corruption.
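
Besides checking the hash move, there is the well-known XOR trick often called lockless hashing (described by Hyatt and Mann): store the key XORed with the data word, so that any torn entry fails the key comparison and is simply treated as a miss. A sketch with an illustrative layout, not necessarily what any engine in this thread uses:

Code:

#include <stdint.h>

/* Illustrative 16-byte entry for the XOR validation scheme. */
typedef struct {
    uint64_t xkey;    /* position key XOR data word */
    uint64_t data;    /* move, score, depth, bound packed together */
} HashEntry;

static void tt_store(HashEntry *slot, uint64_t key, uint64_t data)
{
    slot->xkey = key ^ data;   /* two plain 8-byte stores, no lock needed */
    slot->data = data;
}

static int tt_probe(const HashEntry *slot, uint64_t key, uint64_t *data_out)
{
    uint64_t xkey = slot->xkey;
    uint64_t data = slot->data;
    if ((xkey ^ data) != key)  /* empty, another position, or a torn entry */
        return 0;
    *data_out = data;
    return 1;
}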
Sven
Posts: 4052
Joined: Thu May 15, 2008 9:57 pm
Location: Berlin, Germany
Full name: Sven Schüle

Re: Speaking of the hash table

Post by Sven »

syzygy wrote:
diep wrote:I don't know what idiocy you are busy with, but you are speaking about a scenario that happens once in every 10^20 occasions, and even less often if your processes get all the system time from your CPU. Your RAM will have failed long before then, and your CPU's warranty will have expired long before that as well :)

(...)
It seems we're in agreement on the non-atomicity of two consecutive writes.

As long as the program checks the validity of the hash move, it is of course safe to ignore the rare smp corruption.
If two consecutive writes are non-atomic (which is also my understanding), then why would a third CPU get a consistent hash element returned when starting to read after the first half of the write has arrived in memory? In other words, I don't even believe that "smp corruption" is so rare if no special care is taken about it.

Sven
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: Speaking of the hash table

Post by syzygy »

Sven Schüle wrote:
syzygy wrote:
diep wrote:I don't know what idiocy you are busy with, but you are speaking about a scenario that happens once in every 10^20 occasions, and even less often if your processes get all the system time from your CPU. Your RAM will have failed long before then, and your CPU's warranty will have expired long before that as well :)

(...)
It seems we're in agreement on the non-atomicity of two consecutive writes.

As long as the program checks the validity of the hash move, it is of course safe to ignore the rare smp corruption.
If two consecutive writes are non-atomic (which is also my understanding), then why would a third CPU get a consistent hash element returned when starting to read after the first half of the write has arrived in memory? In other words, I don't even believe that "smp corruption" is so rare if no special care is taken about it.
That third CPU indeed won't get a consistent hash entry. I agree it's not once-in-a-lifetime rare as Vincent seems to suggest, but it's rare enough that it won't measurably affect the quality of the search. To affect the search you probably need many corruptions per second.