syzygy wrote: diep wrote: It is very naive to believe that, if this core has a local store buffer, another core would just squeeze in these 64 bits while the memory controller doesn't yet have the data from that local store buffer.
The store buffer will be flushed, obviously. The point is that you have no control over when this happens. There are no guarantees.
You undoubtedly know that if two threads increment the same counter located in memory, some of the increments can be lost. Between the memory read and the memory write that make up the increment, the cache line can ping pong between cores. The same holds for two memory writes.
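The lost increment can be replayed deterministically. Below is a minimal sketch, not real multithreaded code: the names `Step` and `run_schedule` are made up for illustration, and each core's increment is modelled as a separate load and store so that the interleaving can be chosen by hand.

```cpp
#include <cstdint>
#include <vector>

// One step of a simulated core: an increment is a load followed by a store.
enum class Step { Load0, Store0, Load1, Store1 };

// Replays a fixed interleaving of loads and stores from two cores
// against one shared counter and returns the final counter value.
std::uint64_t run_schedule(const std::vector<Step>& schedule) {
    std::uint64_t counter = 0;       // the shared memory location
    std::uint64_t reg[2] = {0, 0};   // private register of each core
    for (Step s : schedule) {
        switch (s) {
            case Step::Load0:  reg[0] = counter;     break;
            case Step::Store0: counter = reg[0] + 1; break;
            case Step::Load1:  reg[1] = counter;     break;
            case Step::Store1: counter = reg[1] + 1; break;
        }
    }
    return counter;
}
```

With the schedule Load0, Store0, Load1, Store1 both increments survive. With Load0, Load1, Store0, Store1 both cores read the same old value and one increment is lost. In real code an indivisible read-modify-write such as `std::atomic<std::uint64_t>::fetch_add` avoids this.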
Imagine what would happen if all cores were busy, at the atomic level, with just 64 bits of RAM.
We'd have no bandwidth to the RAM then!
All cores working on the same cache line kills performance, so you have to avoid that situation. Threads should, as far as possible, work on different cache lines (all cores only reading the same line is fine).
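One common way to keep threads on different cache lines is to pad per-thread data to a full line. This is a sketch under the assumption of 64-byte cache lines; `PerThreadCounter`, `kCacheLine` and `same_line` are hypothetical names, not taken from any engine.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, as on common x86-64 processors.
constexpr std::size_t kCacheLine = 64;

// Hypothetical per-thread counter, aligned and padded to one full line,
// so two neighbouring counters never share a cache line.
struct alignas(kCacheLine) PerThreadCounter {
    std::uint64_t nodes = 0;
};

// True when both addresses fall into the same cache line.
inline bool same_line(const void* a, const void* b) {
    return reinterpret_cast<std::uintptr_t>(a) / kCacheLine ==
           reinterpret_cast<std::uintptr_t>(b) / kCacheLine;
}
```

An array of `PerThreadCounter` gives every thread its own line, at the cost of 56 unused bytes per counter. Plain `std::uint64_t` counters packed next to each other would mostly end up on one line.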
What happens is that this core writes into a cache line, or into two, and only *after* that does it get thrown to the memory controller, which then atomically ensures correctness: either the cache line as core0 wrote it gets written to RAM, or the cache line as core2 wrote it.
The problem is that the cache line can get thrown to the memory controller *between* the two writes. This can happen when another core needs that cache line.
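The scenario syzygy describes can be modelled without threads. This is a minimal sketch, assuming a hypothetical two-word entry whose 64-bit words are stored one after the other and in order; `Entry`, `observed` and `is_torn` are illustrative names only.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>

// Hypothetical two-word hash entry: word 0 is the key, word 1 the data.
using Entry = std::array<std::uint64_t, 2>;

// What another core sees if it takes the cache line after the writer
// has completed `stores_done` of its two 64-bit stores, in order.
Entry observed(const Entry& before, const Entry& after, std::size_t stores_done) {
    Entry seen = before;
    for (std::size_t i = 0; i < stores_done && i < seen.size(); ++i)
        seen[i] = after[i];
    return seen;
}

// An entry is torn when it matches neither the old nor the new contents.
bool is_torn(const Entry& seen, const Entry& before, const Entry& after) {
    return seen != before && seen != after;
}
```

If the line leaves the writing core after zero or two stores, the other core sees a consistent entry. If it leaves after exactly one store, the other core sees the new key with the old data. How often that happens on real hardware is exactly what the two posters disagree about; the sketch says nothing about the probability.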
I don't know what idiocy you are busy with, but you are talking about a scenario that happens once in every 10^20 occasions, and even less often if your processes get all the system time from your CPU. Your RAM will have failed long before then, and your CPU's warranty will have expired long before that too.
Additionally, in the case of Stockfish, it already loses a factor of 1000 in nps when it doesn't get the full system time on all cores, so a quadrillion other problems would have terminated our program before this happens.
Let's not busy ourselves here with theory. Things like radiation from space, which can cause a bit flip and produce our aBcd, are more likely than what we are talking about here.
In the end it is all analogue technology at the microscopic level that WILL fail after a period of time, for example through radiation from space causing a bit flip once in a while.
Now in your box that hurts more than in mine, as I've got ECC; so if you try to measure it, what you will measure, with 99.9999999999999999% certainty, is bit flips caused by radiation from outer space.
We are not busy here with idiocies that never happen in our lifetime, for a chess program that runs for at most a few hours on a bunch of cores.
You know it won't happen. You know it won't fail. You zoom in on one unlikely event because you don't know how the store buffer and cache coherency function, which avoid a lot of problems here.
Whereas with the normal write error we are talking about, which happens once in every 10^10 occasions, it is not possible to get AbCD.
Of course, with the special interrupts that sometimes occur on machines, it can in theory happen more often; however, our program then already gets a break.
The normal write errors you can measure if you let your program run on 200+ cores or so, as I did: in some overnight analysis I had one write error per 200 billion nodes (= nearly 200 billion writes) and one or two collisions.
In such a case this is the thing I described.
Namely, you can expect to get Abcd or ABcd or ABCd, or the same with capitals and lower case reversed.
Any other scenario you don't need to prepare for, unless you intend to live forever and run your current box forever.
In that sense this forum has always been a fools' forum (my words), as Frans Morsch already indicated around 1999, when he noted that CCC hadn't understood how the store buffer functions, which guaranteed that Fritz, when aligned, couldn't get any write error at all...