c++11 std::atomic and memory_order_relaxed

hgm · Post by **hgm** » Fri Apr 04, 2014 9:42 am

syzygy wrote:In the case we have here the two threads must synchronise to see whether both conditions hold true. This is a bit annoying and complicating matters, but not a theoretical obstacle.

And then of course they could never get 42 for any of the variables. BTW, I don't think at all that this is even theoretically possible. IIRC there are no general algorithms to recognize and resolve deadlocks (this is related to the stopping problem of Turing machines). The only feasible defense would be to ignore the speculative writes, and use the last non-speculative value from before it.

Your code example is not a problem since the compiler (or processor) can simply decide not to attempt to execute it speculatively.

One thing is for sure: making compilers insert speculative writes in the code could never serve any purpose other than wreck multi-threaded behavior at the cost of a huge performance penalty. Because writes to memory cannot be undone unless you first do a read of that same location to make a copy that you could use for restoring it, in case you speculated wrong. And of course you would need locks around that read, because otherwise the copy you make could in fact be a speculatively written value by another thread. So you can forget about that. If there is any speculative execution, it can only serve any purpose when the hardware does it.

And indeed, all existing hardware does decide not to execute writes speculatively. Speculative writing is a totally non-sensical idea, that serves absolutely no purpose, for completely general logical reasons. There will never be any 42, unless you invent time travel first.

bob · Post by **bob** » Fri Apr 04, 2014 6:07 pm

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:There's one thing wrong with this. "memory order relaxed" does NOT cover what is known as a "control dependency". if (c) a is a classic control dependency where a is not executed unless c is true. The alpha is a classic "relaxed memory order" architecture. but it does NOT include violating control dependencies. To do so violates the basic premise of programming, in fact, that the program ALWAYS produces the same results as when it is executed one instruction at a time, in the order they are written. It would seem this might even produce a WAR hazard, where the write gets speculatively done before a previous read, which would kill most any program on the planet.

This really is NOT going to happen architecturally.
I agree this single threaded execution will always behave as one would expect it to behave, but of course only from the point of view of that single thread. This principle is not violated.

Also note that it could be the compiler that is inserting the speculative store (knowing that it will not affect single-threaded execution AND that it will not violate the requirements on multithreaded execution imposed by the C++11 standard).
That was my original guess, if you recall.
Not really. This was about compiler/hardware combinations doing stuff all along. To do the actual speculation you need hardware support. You said hardware won't do that, now you suddenly insert "try to". But this is just getting tiring again.

About semantics of multithreaded programming: it is completely irrational to expect a language to allow only one result from a multithreaded program. It would mean you can't efficiently program a multithreaded alpha-beta anymore.

You don't need hardware support to do speculation. A compiler can change this:

if (c) x=y
else x=0

However, if most of the time, via PGO, it notices that c is true, it could well choose to do this:

x=y
if (!c) x=0

That's software speculation. That was not uncommon when static branch prediction was used. Speculatively make the assignment then undo it if wrong.

So, I did NOT say hardware speculation was needed. In fact, I said, and still maintain, that speculation that would do a write to L1 cache is badly broken and no program will work correctly there, nor will it be possible to "roll back". The instant a speculative store hits L1, it can be forwarded to another L1, and one cycle later that cache might make the decision to replace that block in cache, which means it gets written back to memory. There is no practical way to unroll any of that once the speculative store hits L1. So it won't happen.

A compiler is free to move things around and do all the speculative stores it wants, knowing that other threads will see those speculative stores instantly.

My take is that the C++11 standard is simply broken. Again, the purpose is to REMOVE ambiguities, NOT add 'em.

Here is my FIRST response on the topic, since you don't seem to remember it:

I think your first guess was correct, that r1=r2=42 is legal, even if not possible, given existing (or future) hardware.

I suppose a compiler could change this (I am using normal memory reads to avoid excessive typing with the relaxed order stuff):

I think that is post #4 from the top of the thread. I snipped the compiler-modified source example to keep the above quote short. I said no hardware, present or future would do that. I stick by that. A compiler can do most anything it wants, and if the standard is written ambiguously, the compiler writers will apparently be happy to hide behind the "but the standard says this is ok.." even if it really should NOT be.

I don't expect a compiler to produce the same results every time on a multi-threaded program. However, if that were guaranteed, alpha/beta would still work just fine, speedups would still be there, the compiler would somehow magically eliminate all the races.

What I DO expect the compiler to do, is something rational. For example, given this simple bit of code:

x = 0;
if (c) x=1;

I expect x to be zero or one. NOTHING else. Yet the C++11 standard you are quoting says that this is too much to ask. I think that is ridiculous.

BTW this argument IS ridiculous. You are taking a LOUSY standard, and trying to justify it by inventing hardware that will never exist. Better to just say "the standard is broken here".

bob · Post by **bob** » Fri Apr 04, 2014 6:12 pm

hgm wrote:
syzygy wrote:In the case we have here the two threads must synchronise to see whether both conditions hold true. This is a bit annoying and complicating matters, but not a theoretical obstacle.
And then of course they could never get 42 for any of the variables. BTW, I don't think at all that this is even theoretically possible. IIRC there are no general algorithms to recognize and resolve deadlocks (this is related to the stopping problem of Turing machines). The only feasible defense would be to ignore the speculative writes, and use the last non-speculative value from before it.

Your code example is not a problem since the compiler (or processor) can simply decide not to attempt to execute it speculatively.
One thing is for sure: making compilers insert speculative writes in the code could never serve any purpose other than wreck multi-threaded behavior at the cost of a huge performance penalty. Because writes to memory cannot be undone unless you first do a read of that same location to make a copy that you could use for restoring it, in case you speculated wrong. And of course you would need locks around that read, because otherwise the copy you make could in fact be a speculatively written value by another thread. So you can forget about that. If there is any speculative execution, it can only serve any purpose when the hardware does it.

And indeed, all existing hardware does decide not to execute writes speculatively. Speculative writing is a totally non-sensical idea, that serves absolutely no purpose, for completely general logical reasons. There will never be any 42, unless you invent time travel first.

that simply shows the standard is broken. Just as surely as the current C standard is broken where it says "char variables may be either signed or unsigned at the discretion of the implementation" even though the standard REQUIRES both signed and unsigned chars to be implemented. If you have both, "char x" should mean the same thing across all platforms. I got burned by the IBM RX6000 c compiler assuming unsigned was the default. My byte-oriented chess board suddenly had nothing but white pieces...

A standard is a standard. I don't like the "int" and "long" problem. But thankfully, we were later given int32_t and int64_t so that I don't get zapped by micro$oft making a long 32 bits even on 64 bit systems...

kbhearn · Post by **kbhearn** » Fri Apr 04, 2014 6:51 pm

In fact it's not allowed to speculatively write to x if x is atomic (unless perhaps it can guarantee spec-determined multithreaded behavior will behave as if that write never happened). in the original example, the 42 is allowed to show up because the result is acceptable under the guarantees the spec will give which allows for seeing the future:

1) single-thread sequencing for thread 1 and thread 2 are followed (r1 and r2 both read a 42, and then the thread writes a 42)

2) modification order for both x and y are acceptable (write in one thread and then read in the other)

3) there exists no total order (none of the inter-thread happens before conditions are met because relaxed offers no synchronisation) so the apparent inconsistency between 1 and 2 is acceptable under the standard, though the note did say this behavior of pulling a 42 out where there was none is discouraged if the compiler even bothered to detect this multithreaded conundrum. for performance reasons they did later add memory_order_consume for relaxed memory architectures which is less demanding than acquire but observes dependency chains (causes an inter-thread happens before with the value read), but it gives me a headache trying to figure out just what exactly that guarantees (if anything beyond said dependency chain) let alone what code it would generate on architectures i don't really understand.

bob · Post by **bob** » Fri Apr 04, 2014 7:01 pm

kbhearn wrote:In fact it's not allowed to speculatively write to x if x is atomic (unless perhaps it can guarantee spec-determined multithreaded behavior will behave as if that write never happened). in the original example, the 42 is allowed to show up because the result is acceptable under the guarantees the spec will give which allows for seeing the future:

1) single-thread sequencing for thread 1 and thread 2 are followed (r1 and r2 both read a 42, and then the thread writes a 42)

2) modification order for both x and y are acceptable (write in one thread and then read in the other)

3) there exists no total order (none of the inter-thread happens before conditions are met because relaxed offers no synchronisation) so the apparent inconsistency between 1 and 2 is acceptable under the standard, though the note did say this behavior of pulling a 42 out where there was none is discouraged if the compiler even bothered to detect this multithreaded conundrum. for performance reasons they did later add memory_order_consume for relaxed memory architectures which is less demanding than acquire but observes dependency chains (causes an inter-thread happens before with the value read), but it gives me a headache trying to figure out just what exactly that guarantees (if anything beyond said dependency chain) let alone what code it would generate on architectures i don't really understand.

How do BOTH read a 42 when the values are initialized to zero?

I think it is just a poor part of the standard. The idea is to specify as much as possible to remove ambiguity, not specify things that actually allow completely impossible behavior to be acceptable.

hgm · Post by **hgm** » Fri Apr 04, 2014 8:15 pm

I think the confusion is that some people believe that just because the standard allows it, there could actually be implementations where this happens. I guess they must also believe that the compiler could make demons fly out of their nose, in some future CPU architecture...

syzygy · Post by **syzygy** » Fri Apr 04, 2014 9:35 pm

hgm wrote:
syzygy wrote:In the case we have here the two threads must synchronise to see whether both conditions hold true. This is a bit annoying and complicating matters, but not a theoretical obstacle.
And then of course they could never get 42 for any of the variables.

Of course they can.

1. Both threads speculatively store 42 in separate variables. They are both betting that the condition evaluates to true.
2. Both threads read the value 42 from the variable written by the other thread.
3. Both threads evaluate the condition. Both conditions are true.
4. The speculative stores can now be committed.

BTW, I don't think at all that this is even theoretically possible. IIRC there are no general algorithms to recognize and resolve deadlocks (this is related to the stopping problem of Turing machines).

There is nothing complicated about this at all. The compiler just decides to speculatively execute, in advance, the store that must be executed if the condition is true. A few instructions later it will be known whether the condition is true. If it is, then everything is fine. If it is not, the speculative store must be undone (and also any other speculative executions depending on it).

Ignoring thread 2, we have:

r1 = x.load( memory_order_relaxed );
if ( r1 == 42 ) y.store( r1, memory_order_relaxed );

The compiler first rewrites this as:

r1 = x.load( memory_order_relaxed );
if ( r1 == 42 ) y.store( 42, memory_order_relaxed );

Then it rewrites this as:

transaction.begin();
y.store( 42, memory_order_relaxed ); // speculative
r1 = x.load( memory_order_relaxed );
if (r1 == 42) transaction.commit();
else transaction.abort();

bob · Post by **bob** » Fri Apr 04, 2014 9:46 pm

bob wrote:
kbhearn wrote:In fact it's not allowed to speculatively write to x if x is atomic (unless perhaps it can guarantee spec-determined multithreaded behavior will behave as if that write never happened). in the original example, the 42 is allowed to show up because the result is acceptable under the guarantees the spec will give which allows for seeing the future:

1) single-thread sequencing for thread 1 and thread 2 are followed (r1 and r2 both read a 42, and then the thread writes a 42)

2) modification order for both x and y are acceptable (write in one thread and then read in the other)

3) there exists no total order (none of the inter-thread happens before conditions are met because relaxed offers no synchronisation) so the apparent inconsistency between 1 and 2 is acceptable under the standard, though the note did say this behavior of pulling a 42 out where there was none is discouraged if the compiler even bothered to detect this multithreaded conundrum. for performance reasons they did later add memory_order_consume for relaxed memory architectures which is less demanding than acquire but observes dependency chains (causes an inter-thread happens before with the value read), but it gives me a headache trying to figure out just what exactly that guarantees (if anything beyond said dependency chain) let alone what code it would generate on architectures i don't really understand.
How do BOTH read a 42 when the values are initialized to zero?

I think it is just a poor part of the standard. The idea is to specify as much as possible to remove ambiguity, not specify things that actually allow completely impossible behavior to be acceptable.

Apparently the following is reasonable, here.

Let's take a standard hardwood floor. Which generally uses case-hardened nails because the common nails used in framing a house are too soft. The standard for such a nail might read like this:

"The nail will have the following dimensions [xxxxx omitted here] and be case-hardened sufficiently so that if the nail is hit square on the head, within 5 degrees of the axis of the nail, it will successfully penetrate the oak hardwood and anchor the floor to the sub-flooring."

Seems perfectly rational to me. But now the quacks step in. What if the nail is hit outside that 5 degree angle? That's undefined by the standard so ANYTHING could happen. The nail could explode, it could start a fire, it could produce an EMP that disables every computer within a mile, demons could fly out of your nose, etc.

In reality only a couple of outcomes are possible if you don't hit it correctly (and having had a dad that built houses, all of these have happened to me). 1. The nail flies off on a tangent; 2; the nail breaks; 3. The nail bends (but not much since it is case-hardened, so much bending turns into a break. There are no other actions possible, but because that nail standard did not lay out all the bad things, it is therefore ok to claim that ANYTHING could happen.

We all know what computers do when you add +1 to hex ffffffff. You get zero. No demons, no nothing, you just get 0. But SOME have a very vivid imagination, obviously. And they hide behind the "standard" to justify that imagination.

bob · Post by **bob** » Fri Apr 04, 2014 9:50 pm

syzygy wrote:
hgm wrote:
syzygy wrote:In the case we have here the two threads must synchronise to see whether both conditions hold true. This is a bit annoying and complicating matters, but not a theoretical obstacle.
And then of course they could never get 42 for any of the variables.
Of course they can.

1. Both threads speculatively store 42 in separate variables. They are both betting that the condition evaluates to true.
2. Both threads read the value 42 from the variable written by the other thread.
3. Both threads evaluate the condition. Both conditions are true.
4. The speculative stores can now be committed.

BTW, I don't think at all that this is even theoretically possible. IIRC there are no general algorithms to recognize and resolve deadlocks (this is related to the stopping problem of Turing machines).
There is nothing complicated about this at all. The compiler just decides to speculatively execute, in advance, the store that must be executed if the condition is true. A few instructions later it will be known whether the condition is true. If it is, then everything is fine. If it is not, the speculative store must be undone (and also any other speculative executions depending on it).

Ignoring thread 2, we have:
r1 = x.load( memory_order_relaxed );
if ( r1 == 42 ) y.store( r1, memory_order_relaxed );
The compiler first rewrites this as:
r1 = x.load( memory_order_relaxed );
if ( r1 == 42 ) y.store( 42, memory_order_relaxed );
Then it rewrites this as:
transaction.begin();
y.store( 42, memory_order_relaxed ); // speculative
r1 = x.load( memory_order_relaxed );
if (r1 == 42) transaction.commit();
else transaction.abort();

And you just violated a primary principle used in computer architecture design, that if some instruction is subject to a control dependency, the instruction is NOT executed until the dependency has been resolved and removed. In YOUR architecture, things like

if (a) x=x/a;

would be quite free to produce a divide by zero result, which is nonsense and no architecture today would even consider doing so. No one would program on such a machine, it would be hopelessly random in its behavior.

syzygy · Post by **syzygy** » Fri Apr 04, 2014 9:54 pm

kbhearn wrote:I thought i had an idea how this worked in theory, but having just recently done some reading one of the examples in the specification has me quite confused:

from http://www.open-std.org/jtc1/sc22/wg21/ ... mics.order

Since this document is from 2007 it is not clear to me whether this actually applies to the current standard.

But the situation seems to be in fact worse. Here is more recent document:
http://www.open-std.org/jtc1/sc22/wg21/ ... n3710.html
It gives this example:

Thread 1:
r1 = x.load(memory_order_relaxed);
y.store(r1, memory_order_relaxed);

Thread 2:
r2 = y.load(memory_order_relaxed);
x.store(r2, memory_order_relaxed);

If x and y are initially 0, this code can result in x == y == 42.

This famously allows both r1 and r2 to have final values of 42, or any other "out of thin air" value. This occurs if each load sees the store in the other thread. It effectively models an execution in which the compiler speculates that both atomic variables will have a value of 42, speculatively stores the resulting values, and then performs the loads to confirm that the speculation was correct and nothing needs to be undone.

Pretty much my explanation for the example from the 2007 document.

No known implementations actually produce such results. However, it is extraordinarily hard to write specifications that present them without also preventing legitimate compiler and hardware optimizations.

c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed