c++11 std::atomic and memory_order_relaxed

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

c++11 std::atomic and memory_order_relaxed

Post by kbhearn »

I thought i had an idea how this worked in theory, but having just recently done some reading one of the examples in the specification has me quite confused:

from http://www.open-std.org/jtc1/sc22/wg21/ ... mics.order
[Note: The requirements do allow r1 == r2 == 42 in (x, y initially zero):

Thread 1:
r1 = x.load( memory_order_relaxed );
if ( r1 == 42 ) y.store( r1, memory_order_relaxed );
Thread 2:
r2 = y.load( memory_order_relaxed );
if ( r2 == 42 ) x.store( 42, memory_order_relaxed );

Implementations are discouraged from allowing such behavior. —end note]
My understanding of memory_order_relaxed was that the atomic operations can be rearranged in any order but pulling numbers out of thin air would be forbidden. what rearrangement of loads and stores leads to r1 == r2 == 42 in this situation?
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: c++11 std::atomic and memory_order_relaxed

Post by syzygy »

I think the point is (only) that the result r1 == r2 == 42 does not contradict any of the requirements imposed by the C++11 standard:
- consistent with single threaded behaviour of both threads;
- consistent with the modification order of both r1 and r2.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: c++11 std::atomic and memory_order_relaxed

Post by syzygy »

I had second look. The idea seems to be as follows.

Code: Select all

    r1 = x.load( memory_order_relaxed );
    if ( r1 == 42 ) y.store( r1, memory_order_relaxed );
can be rewritten as

Code: Select all

    r1 = x.load( memory_order_relaxed );
    if ( r1 == 42 ) y.store( 42, memory_order_relaxed );
Now the store to y can be performed speculatively before the load and the comparison take place. It needs to be rolled back without other threads ever seeing it if the comparison results in false.

Similar for thread 2.

I suppose you would need hardware with some sort of double-threaded speculative execution that at the moment probably does not exist for this to make sense (and to allow the nonsensical result..).

Similar examples are discussed here:
http://lwn.net/Articles/586838/
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: c++11 std::atomic and memory_order_relaxed

Post by bob »

syzygy wrote:I had second look. The idea seems to be as follows.

Code: Select all

    r1 = x.load( memory_order_relaxed );
    if ( r1 == 42 ) y.store( r1, memory_order_relaxed );
can be rewritten as

Code: Select all

    r1 = x.load( memory_order_relaxed );
    if ( r1 == 42 ) y.store( 42, memory_order_relaxed );
Now the store to y can be performed speculatively before the load and the comparison take place. It needs to be rolled back without other threads ever seeing it if the comparison results in false.

Similar for thread 2.

I suppose you would need hardware with some sort of double-threaded speculative execution that at the moment probably does not exist for this to make sense (and to allow the nonsensical result..).

Similar examples are discussed here:
http://lwn.net/Articles/586838/
I don't believe any hardware ever has, nor ever will, speculatively perform a write to memory. Lots of reasons.

(1) who would want to speculatively mark a cache block as dirty, because it would be quite difficult to "undirty it" later since other valid writes could have also set the block to dirty;

(2) who would want to deal with I/O where a speculative write changes a buffer that is written out. Now you not only have to undo the write to cache/memory, you also have to undo the file-system write which is way beyond the CPU's ability to do internally.

Hennessy/Patterson has a good discussion on speculation and where it should be avoided and why. Memory writes are at the top of the list. Right above memory reads that would produce a TLB miss or, even worse, a "page invalid" page fault.

I think your first guess was correct, that r1=r2=42 is legal, even if not possible, given existing (or future) hardware.

I suppose a compiler could change this (I am using normal memory reads to avoid excessive typing with the relaxed order stuff):

r1 = x
temp = y
if (r1 == 42)
y = 42
else
y = temp

Now, good branch prediction might correctly guess that most of the time x == 42 and do the write and the read, speculating on the write before the read completes and the conditional is actually evaluated.
I can't imagine a compiler doing that, but I suppose one might go that far for some obscure reason, and the standard says "this is ok". Even if it is slower, even if it might break a program that does asynchronous I/O, etc. In other words, the new standard is just like the old standard. Leaves a lot of stupid holes.
User avatar
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: c++11 std::atomic and memory_order_relaxed

Post by hgm »

I think the problem we are dealing with here is that the C standard does not require causality. One thus can have a loop of events that all cause each other, without violating the standard.

But in real life of course there are more compelling laws than the C++11 standard, and physics could not care less about what the C standard allows. Don't expect causality to be violated just because some standard allows it.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: c++11 std::atomic and memory_order_relaxed

Post by syzygy »

bob wrote:
syzygy wrote:I had second look. The idea seems to be as follows.

Code: Select all

    r1 = x.load( memory_order_relaxed );
    if ( r1 == 42 ) y.store( r1, memory_order_relaxed );
can be rewritten as

Code: Select all

    r1 = x.load( memory_order_relaxed );
    if ( r1 == 42 ) y.store( 42, memory_order_relaxed );
Now the store to y can be performed speculatively before the load and the comparison take place. It needs to be rolled back without other threads ever seeing it if the comparison results in false.

Similar for thread 2.

I suppose you would need hardware with some sort of double-threaded speculative execution that at the moment probably does not exist for this to make sense (and to allow the nonsensical result..).

Similar examples are discussed here:
http://lwn.net/Articles/586838/
I don't believe any hardware ever has, nor ever will, speculatively perform a write to memory.
Processors have been performing speculative writes since many years, but you probably mean writes that actually land in DRAM. But that wouldn't be necessary for the above scenario to occur. It is sufficient that two threads see each other's speculative writes, for example in a shared cache or store buffer.

It also does not have to be the processor logic that decides by itself to perform such writes speculatively. It could be the compiler explicitly making use of a speculative store instruction of the processor.

Again, as far as I know the scenario is not possible on hardware that exists today. It would also not seem to make much sense to involve two hardware threads in one speculative execution that must be rolled back for both threads if anything goes wrong. But maybe I am wrong and someone can come up with a valid reason to do this.
Lots of reasons.

(...)
Your processor already performs speculative stores. Of course stores performed by one (logical) core are not visible to other cores. See e.g. US6065103.

The hardware logic for implementing this speculative execution is now being reused to implement support for transactional memory in Haswell processors. This allows the implementaiton of explicit speculative stores (i.e. by the programmer).
See e.g. http://cs.brown.edu/~mph/HerlihyM93/her ... tional.pdf
bob wrote:I suppose a compiler could change this (I am using normal memory reads to avoid excessive typing with the relaxed order stuff):

r1 = x
temp = y
if (r1 == 42)
y = 42
else
y = temp
This wouldn't be legal in C11/C++11, as it would create a (non-speculative) store to y where previously there wasn't one. C99 allows this, C11 not anymore.
5.1.2.4.27 wrote:NOTE 13 Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard, since such an assignment might overwrite another assignment by a different thread in cases in which an abstract machine execution would not have encountered a data race. This includes implementations of data member assignment that overwrite adjacent members in separate memory locations. We also generally preclude reordering of atomic loads in cases in which the atomics in question may alias, since this may violate the "visible sequence" rules.
C99 does not have a concept of threads, so does not forbid it. POSIX effectively does forbid it, but in the past gcc developers have argued that POSIX threads are none of their concern:
http://gcc.gnu.org/ml/gcc/2007-10/msg00270.html

Another example:
http://gcc.gnu.org/ml/gcc/2007-10/msg00275.html

A system with such a compiler is not POSIX-compliant.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: c++11 std::atomic and memory_order_relaxed

Post by syzygy »

hgm wrote:But in real life of course there are more compelling laws than the C++11 standard, and physics could not care less about what the C standard allows. Don't expect causality to be violated just because some standard allows it.
Better not rely too much on causality if you use an optimising compiler:
Dangerous Optimizations and the Loss of Causality
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: c++11 std::atomic and memory_order_relaxed

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:I had second look. The idea seems to be as follows.

Code: Select all

    r1 = x.load( memory_order_relaxed );
    if ( r1 == 42 ) y.store( r1, memory_order_relaxed );
can be rewritten as

Code: Select all

    r1 = x.load( memory_order_relaxed );
    if ( r1 == 42 ) y.store( 42, memory_order_relaxed );
Now the store to y can be performed speculatively before the load and the comparison take place. It needs to be rolled back without other threads ever seeing it if the comparison results in false.

Similar for thread 2.

I suppose you would need hardware with some sort of double-threaded speculative execution that at the moment probably does not exist for this to make sense (and to allow the nonsensical result..).

Similar examples are discussed here:
http://lwn.net/Articles/586838/
I don't believe any hardware ever has, nor ever will, speculatively perform a write to memory.
Processors have been performing speculative writes since many years, but you probably mean writes that actually land in DRAM. But that wouldn't be necessary for the above scenario to occur. It is sufficient that two threads see each other's speculative writes, for example in a shared cache or store buffer.

It also does not have to be the processor logic that decides by itself to perform such writes speculatively. It could be the compiler explicitly making use of a speculative store instruction of the processor.

Again, as far as I know the scenario is not possible on hardware that exists today. It would also not seem to make much sense to involve two hardware threads in one speculative execution that must be rolled back for both threads if anything goes wrong. But maybe I am wrong and someone can come up with a valid reason to do this.
Lots of reasons.

(...)
Your processor already performs speculative stores. Of course stores performed by one (logical) core are not visible to other cores. See e.g. US6065103.

The hardware logic for implementing this speculative execution is now being reused to implement support for transactional memory in Haswell processors. This allows the implementaiton of explicit speculative stores (i.e. by the programmer).
See e.g. http://cs.brown.edu/~mph/HerlihyM93/her ... tional.pdf
bob wrote:I suppose a compiler could change this (I am using normal memory reads to avoid excessive typing with the relaxed order stuff):

r1 = x
temp = y
if (r1 == 42)
y = 42
else
y = temp
This wouldn't be legal in C11/C++11, as it would create a (non-speculative) store to y where previously there wasn't one. C99 allows this, C11 not anymore.
5.1.2.4.27 wrote:NOTE 13 Compiler transformations that introduce assignments to a potentially shared memory location that would not be modified by the abstract machine are generally precluded by this standard, since such an assignment might overwrite another assignment by a different thread in cases in which an abstract machine execution would not have encountered a data race. This includes implementations of data member assignment that overwrite adjacent members in separate memory locations. We also generally preclude reordering of atomic loads in cases in which the atomics in question may alias, since this may violate the "visible sequence" rules.
C99 does not have a concept of threads, so does not forbid it. POSIX effectively does forbid it, but in the past gcc developers have argued that POSIX threads are none of their concern:
http://gcc.gnu.org/ml/gcc/2007-10/msg00270.html

Another example:
http://gcc.gnu.org/ml/gcc/2007-10/msg00275.html

A system with such a compiler is not POSIX-compliant.
By speculative writes, I mean ANY writes. How on earth do you do a speculative write on CPU 0, which goes to its L1 cache, and from there promptly gets forwarded to another core's L1 cache where it is read, and has now polluted THAT cores execution as well. I am not aware of ANY processors that speculatively do writes, if you mean a write that actually leaves the CPU logic and gets delivered to L1 where someone else can read it. If you mean speculative writes in terms of intel, that's not the same thing. The writes are done, everything is done, but the instruction that commits the write to cache/memory sits in the reorder buffer until the cpu verifies that the write actually needs to be executed. At that point, the write is actually queued up from the CPU to L1, and away it goes.

None of that "speculative write" stuff is visible outside the CPU until the speculation has been proven to be correct. Not before. There is simply no way to unwind a write once it lands in L1 because it can be used by another core the next cycle and then can't be undone.

The "transaction" concept is not new, it's been around forever. Trying to make it more efficient has been an ongoing research topic for years. Nothing prevents this stuff from happening in Intel, other than efficiency and the small detail that it does not support such a model at all. But so long as the CPU keeps everything to itself, there is no problem, the minute any change is made visible, it can't be undone, which represents a problem for such a "transaction-based architecture."

The example given simply can't happen without out-of-order speculative writes that can't be undone today, which means they are never going to happen with today's architectures. Instruction retiring in Intel is already a complicated process, but at least it is all inside the CPU and no other core can see any of the results until the speculation has been eliminated AND the instructions doing the weird stuff have been retired.
User avatar
hgm
Posts: 27796
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: c++11 std::atomic and memory_order_relaxed

Post by hgm »

syzygy wrote:Better not rely too much on causality if you use an optimising compiler:
Dangerous Optimizations and the Loss of Causality
Causality is a law of nature. If anyone claims it doesn't apply, he is wrong. It is as simple as that. Doesn't matter if he writes a book about it, or an encyclopedia. Events cannot cause themselves, directly or indirectly.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: c++11 std::atomic and memory_order_relaxed

Post by syzygy »

hgm wrote:
syzygy wrote:Better not rely too much on causality if you use an optimising compiler:
Dangerous Optimizations and the Loss of Causality
Causality is a law of nature. If anyone claims it doesn't apply, he is wrong. It is as simple as that. Doesn't matter if he writes a book about it, or an encyclopedia. Events cannot cause themselves, directly or indirectly.
As I have pointed out above, computer architectures are very much feasible within the laws of physics for which an appropriate C++11-compliant compiler could produce code that on some executions will give r1 == r2 == 42.