c++11 std::atomic and memory_order_relaxed

syzygy · Post by **syzygy** » Fri Apr 04, 2014 9:56 pm

bob wrote:And you just violated a primary principle used in computer architecture design, that if some instruction is subject to a control dependency, the instruction is NOT executed until the dependency has been resolved and removed. In YOUR architecture, things like

if (a) x=x/a;

would be quite free to produce a divide by zero result, which is nonsense and no architecture today would even consider doing so. No one would program on such a machine, it would be hopelessly random in its behavior.

Dadadada. Same for your software speculation, but you did not think of that did you.

Do you not think the compiler can recognise that storing the value of 42 does not lead to a division by zero?

I suppose you simply did not read (and why am I still replying to you...):

The compiler first rewrites this as:

Then it rewrites this as:

Do you understand the example now, or is it still too difficult.

hgm · Post by **hgm** » Fri Apr 04, 2014 10:36 pm

syzygy wrote:Of course they can.

1. Both threads speculatively store 42 in separate variables. They are both betting that the condition evaluates to true.
2. Both threads read the value 42 from the variable written by the other thread.
3. Both threads evaluate the condition. Both conditions are true.
4. The speculative stores can now be committed.

And (4) is where you still go wrong. Of course they cannot be committed. Because they still depend on a speculative evaluation of the condition. They cannot be committed ('retired', in Intel micro-architectural speak) before the evaluation of the condition is committed, and the condition cannot be committed as long as the speculative load is not committed. Deadlock!

You consistently ignore that executing loads speculatively (on speculatively written data) does no longer make progress dependent on resolution of the compare, but on resolution of that earlier load (and thus resolution of the store in the other CPU).

And if you drop the requirement of waiting till the load can be committed (i.e. not treating the load of speculatively written data as speculative, but as real data), you will have a totally broken architecture that never can do anything useful, as all sort of never executed stores would be seen by all other cores. Overwriting their data, and even modifying their code.

There is nothing complicated about this at all. The compiler just decides to speculatively execute, in advance, the store that must be executed if the condition is true. A few instructions later it will be known whether the condition is true. If it is, then everything is fine. If it is not, the speculative store must be undone (and also any other speculative executions depending on it).

Ignoring thread 2, we have:
r1 = x.load( memory_order_relaxed );
if ( r1 == 42 ) y.store( r1, memory_order_relaxed );
The compiler first rewrites this as:
r1 = x.load( memory_order_relaxed );
if ( r1 == 42 ) y.store( 42, memory_order_relaxed );
Then it rewrites this as:
transaction.begin();
y.store( 42, memory_order_relaxed ); // speculative
r1 = x.load( memory_order_relaxed );
if (r1 == 42) transaction.commit();
else transaction.abort();

And how do you suppose this abort works? By magic? How would you know what to write back? How would you prevent other CPUs would see the 42 that is not supposed to be there? (Especially threads that would not be so friendly to store it back in a way that makes the condition true after the fact. But, say, send it to the printer?) Lock the bus between begin and abort?

syzygy · Post by **syzygy** » Fri Apr 04, 2014 10:50 pm

hgm wrote:
syzygy wrote:Of course they can.

1. Both threads speculatively store 42 in separate variables. They are both betting that the condition evaluates to true.
2. Both threads read the value 42 from the variable written by the other thread.
3. Both threads evaluate the condition. Both conditions are true.
4. The speculative stores can now be committed.
And (4) is where you still go wrong. Of course they cannot be committed. Because they still depend on a speculative evaluation of the condition. They cannot be committed ('retired', in Intel micro-architectural speak)

Look. I am not talking about out of order speculation. I am talking about speculation at the programming level. Explicit speculation inserted by the compiler.

And how do you suppose this abort works?

You have never heard about transactional memory?

I believe current hardware implementations are pessimistic in that speculative stores won't be visible to other threads until the transaction is committed, but there is no fundamental reason why e.g. two cores on the same cpu (or maybe two hyperthreads on the same core) could not implement an optimistic version.

hgm · Post by **hgm** » Fri Apr 04, 2014 11:14 pm

syzygy wrote:You have never heard about transactional memory?

No. Is that another physics-violating theoretical construct?

I believe current hardware implementations are pessimistic in that speculative stores won't be visible to other threads until the transaction is committed, but there is no fundamental reason why e.g. two cores on the same cpu (or maybe two hyperthreads on the same core) could not implement an optimistic version.

That it would deadlock both cores seems reason enough, for me.

For reasons that are still obscure (as nothing can ever be gained by it, and it always thoroughly wrecks things, most of all performance) you want speculatively executed stores to be seen by other cores. Then you have two choices:

1. Those other cores treat loading that data also as speculation, and must thus wait until the core that stored it commits it before it can commit any of the stuff that depended on it.
2. The other cores treat data from a speculative write as real.

(1) Leads to a deadlock, the causality chain load1 -> test1 -> store1 -> load2 -> test2 -> store2 -> load1 being looped, and all waiting for each other to be committed.
(2) Leads to total corruption of the operation of other cores, as they start acting on data that was never supposed to be there. Programs like:

Code: Select all

int start = false;

Thread 1&#58;
if&#40;false&#41; start = true;

Thread2&#58;
if&#40;start&#41; GlobalNuclearWar&#40;);

would not have the expected effect, because the compiler could apparently speculatively store 42 to start in thread 1.

I understand of course that adopting such a standard would make writing a universal compiler just a job of minutes. You just let it speculate that each thread would crash, and starts filling the memory (in particular the part holding the code of other threads) with random data. So that these threads crash as well, and fill our code section with random data. Low and behold! The speculation was true! What a perfect translation of our SMP program.]!

bob · Post by **bob** » Fri Apr 04, 2014 11:28 pm

syzygy wrote:
bob wrote:And you just violated a primary principle used in computer architecture design, that if some instruction is subject to a control dependency, the instruction is NOT executed until the dependency has been resolved and removed. In YOUR architecture, things like

if (a) x=x/a;

would be quite free to produce a divide by zero result, which is nonsense and no architecture today would even consider doing so. No one would program on such a machine, it would be hopelessly random in its behavior.
Dadadada. Same for your software speculation, but you did not think of that did you.

Do you not think the compiler can recognise that storing the value of 42 does not lead to a division by zero?

I suppose you simply did not read (and why am I still replying to you...):
The compiler first rewrites this as:

Then it rewrites this as:
Do you understand the example now, or is it still too difficult.

Speculation is pretty well defined. Your example will NEVER happen unless a compiler is broken. Ergo, for MOST of us it will NEVER happen. Why do you keep this nonsense alive. Storing before testing is no different than dividing before testing. it can STILL break something, and no compiler will break the control dependency. I don't know of a compiler (C, C++, etc) around that is not AWARE there might be other threads. They don't particularly worry about such, but the guys that write the compilers certainly know about it. And they are not going to intentionally break code for nonsensical reasons. Except perhaps for Apple, maybe.

None of what you are writing is "too difficult". It is "too stupid".

You can NOT do hardware speculation as you define it. You end up with multiple problems.

1. Potential for deadlock;

2. Potential for indefinite postponement if one core is waiting on the other to commit stuff, but the other is working on another process entirely;

This is nonsense. A compiler CAN do that kind of speculation, but it would NOT be assisted by the hardware. The compiler is free to do a store whenever it wants, so long as it does not violate the semantics of the code it is compiling. So it could look far ahead in the source, find all the stores, and do them up front. Then undo them when it becomes apparent they would not have been done. World's first "unoptimizing compiler" that produces slower code than the direct sequential translation. Wonderful stuff. And all utter nonsense.

bob · Post by **bob** » Fri Apr 04, 2014 11:37 pm

syzygy wrote:
kbhearn wrote:I thought i had an idea how this worked in theory, but having just recently done some reading one of the examples in the specification has me quite confused:

from http://www.open-std.org/jtc1/sc22/wg21/ ... mics.order
Since this document is from 2007 it is not clear to me whether this actually applies to the current standard.

But the situation seems to be in fact worse. Here is more recent document:
http://www.open-std.org/jtc1/sc22/wg21/ ... n3710.html
It gives this example:
Thread 1:
r1 = x.load(memory_order_relaxed);
y.store(r1, memory_order_relaxed);

Thread 2:
r2 = y.load(memory_order_relaxed);
x.store(r2, memory_order_relaxed);
If x and y are initially 0, this code can result in x == y == 42.
This famously allows both r1 and r2 to have final values of 42, or any other "out of thin air" value. This occurs if each load sees the store in the other thread. It effectively models an execution in which the compiler speculates that both atomic variables will have a value of 42, speculatively stores the resulting values, and then performs the loads to confirm that the speculation was correct and nothing needs to be undone.
Pretty much my explanation for the example from the 2007 document.
No known implementations actually produce such results. However, it is extraordinarily hard to write specifications that present them without also preventing legitimate compiler and hardware optimizations.

Notice bolding above. Hardware will NEVER do this.

And it would take a REALLY inferior compiler to do it as well.

Again, the C++11 standard is simply poor in that specific part. Just like it is poor in other parts.

syzygy · Post by **syzygy** » Sat Apr 05, 2014 12:39 am

hgm wrote:
syzygy wrote:You have never heard about transactional memory?
No. Is that another physics-violating theoretical construct?

Ok, so not much reason to continue this discussion.
From the beginning of this thread:

syzygy wrote:The hardware logic for implementing this speculative execution is now being reused to implement support for transactional memory in Haswell processors. This allows the implementaiton of explicit speculative stores (i.e. by the programmer).
See e.g. http://cs.brown.edu/~mph/HerlihyM93/her ... tional.pdf

Haswell implements it. Not in a way that would be needed for the example to "work", but I already said this:

syzygy wrote:I suppose you would need hardware with some sort of double-threaded speculative execution that at the moment probably does not exist for this to make sense (and to allow the nonsensical result..).

For reasons that are still obscure (as nothing can ever be gained by it, and it always thoroughly wrecks things, most of all performance) you want speculatively executed stores to be seen by other cores.

I am not "wanting" this. I am simply explaining how the strange result could happen. You insisted:

hgm wrote:But in real life of course there are more compelling laws than the C++11 standard, and physics could not care less about what the C standard allows. Don't expect causality to be violated just because some standard allows it.

so I explain that there is no law of physics that would prevent it.

1. Those other cores treat loading that data also as speculation, and must thus wait until the core that stored it commits it before it can commit any of the stuff that depended on it.
2. The other cores treat data from a speculative write as real.

The idea is simply that the two transactions get committed if and only if both conditions are true. Both inevitably evaluate the condition. If one or both of them are false, abort and re-execute whatever is needed, for example non-speculatively.

syzygy · Post by **syzygy** » Sat Apr 05, 2014 12:46 am

bob wrote:Your example will NEVER happen unless a compiler is broken.

You seem to have forgotten what this thread was all about.

The OP mentions some strange behaviour that is not explicitly forbidden by the standard (or at least not by a draft of the standard).

Some people argue that it would be physically impossible to happen. Either because of laws of physics imposing causality, or because the hardware could not be built.

I simply explain that such hardware can very well be built. Hardware on which a compiler could make the example happen. It would be a simple variation of existing hardware.

That's it. Of course the usual tactic applies: argue that something is false that I have simply never said. Nowhere did I write that hardware would do this all by itself. Nowhere did I write that a compiler will or should do it.

syzygy · Post by **syzygy** » Sat Apr 05, 2014 1:05 am

bob wrote:2. Potential for indefinite postponement if one core is waiting on the other to commit stuff, but the other is working on another process entirely;

This is a complete non-issue.

In multithreaded programming potential for deadlock is everywhere. One thread takes a lock and never releases it, another thread will hang.

If you have transactions that must synchronise when committing, one thread will also hang if the other thead never attempts to commit. Same story.

The programmer must write a program that works. The compiler must generate code that conforms to the program. If the compiler sees a way of speeding up the program using speculative execution that requires some synchronisation at commit time, that is just fine. (But preferably the compiler will avoid generating code that appears to break causality.)

Rules for committing data

With thread-level speculative execution, tasks are committed according to the following rules:

- Before a task is committed, the data is in a speculative state.
- Tasks are committed in program order.
- Therefore, a later task in program order can only be committed when all the earlier tasks have been committed. If a thread running a task encounters a conflict, all the threads running later tasks must roll back and retry. Eventually, all tasks are committed.

Oh gosh, speculatively executing threads on Blue Gene/Q may have to wait for other threads to commit first. That might lead to deadlock, yikes!!

bob · Post by **bob** » Sat Apr 05, 2014 2:38 am

syzygy wrote:
bob wrote:Your example will NEVER happen unless a compiler is broken.
You seem to have forgotten what this thread was all about.

The OP mentions some strange behaviour that is not explicitly forbidden by the standard (or at least not by a draft of the standard).

Some people argue that it would be physically impossible to happen. Either because of laws of physics imposing causality, or because the hardware could not be built.

I simply explain that such hardware can very well be built. Hardware on which a compiler could make the example happen. It would be a simple variation of existing hardware.

That's it. Of course the usual tactic applies: argue that something is false that I have simply never said. Nowhere did I write that hardware would do this all by itself. Nowhere did I write that a compiler will or should do it.

Such hardware can not, will not be built. Two of us have explained the problems to you repeatedly, but you seem to ignore what is written. A speculative write can NEVER get out of a cpu, because the only way out is to L1 cache, and once it get there, all is lost. Your "multi-mode speculative core model" has deadlock/indefinite-postponement problems and is also not going to happen. In short, the ONLY way this happens, is with a broken compiler. Exactly as the person that originated the post you keep quoting said. Exactly as I have said. Exactly as HGM has said.

So let's get off the hardware bandwagon here, it is NOT going to happen. All we have here is a stupid section of a standard that leaves things open that (a) should be clearly defined and (b) which can only happen with a broken compiler. You broached the hardware topic as "possible". No one else has chimed in to support that because it simply will not happen.

I'm more than happy to leave the conclusion as "stupid C++11 specification shortcoming that should have been made much more precise."

c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed

Re: c++11 std::atomic and memory_order_relaxed