c++11 std::atomic and memory_order_relaxed

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: c++11 std::atomic and memory_order_relaxed

Post by hgm »

syzygy wrote:Since this document is from 2007 it is not clear to me whether this actually applies to the current standard.

But the situation seems to be in fact worse. Here is a more recent document:
http://www.open-std.org/jtc1/sc22/wg21/ ... n3710.html
Actually I wonder why you call that 'worse'. In fact, that more recent document quotes the old wording of the standard as follows:
[ Note: The requirements do allow r1 == r2 == 42 in the following example, with x and y initially zero:

// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);

However, implementations should not allow such behavior. -end note ]
Seems to me this note is just as much part of the C++11 standard as the mentioned 'requirements', which are just a section of the standard. But unlike the 2007 paper, which only uses the word 'discouraged', this outright forbids the 42. The paper proposes to amend this to the following, which is even more explicit.
[ Note: The recommendation similarly disallows r1 == r2 == 42 in the following example, with x and y initially zero:

// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);

-end note ]
So what exactly are we arguing about? It seems that the people developing the standard have already realized that it was broken, and did not provide the definition of a workable programming language. It is not only that this kind of self-confirming speculation would cause violations of the standard as a side effect, when the speculation has to be retracted. Even the behavior in the case where it 'worked' is already a violation!
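For anyone who wants to play with it, here is a minimal, self-contained C++11 rendering of the example from the note. The loop and the printout are just a test harness added for illustration; on mainstream compilers and hardware the 42/42 outcome is not observed in practice, which is exactly why the argument is about the wording of the standard rather than about observed behavior.

#include <atomic>
#include <iostream>
#include <thread>

int main() {
    for (int i = 0; i < 100000; ++i) {
        std::atomic<int> x(0), y(0);
        int r1 = 0, r2 = 0;

        std::thread t1([&] {                                   // Thread 1
            r1 = x.load(std::memory_order_relaxed);
            if (r1 == 42) y.store(r1, std::memory_order_relaxed);
        });
        std::thread t2([&] {                                   // Thread 2
            r2 = y.load(std::memory_order_relaxed);
            if (r2 == 42) x.store(42, std::memory_order_relaxed);
        });
        t1.join();
        t2.join();

        if (r1 == 42 && r2 == 42)
            std::cout << "out-of-thin-air result observed\n";  // never seen in practice
    }
    std::cout << "done\n";
}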
Harald
Posts: 318
Joined: Thu Mar 09, 2006 1:07 am

Re: c++11 std::atomic and memory_order_relaxed

Post by Harald »

This topic is interesting and complicated and I even learned something.
But it is also a little annoying to follow the 'discussion'.

I just made two google searches:

(1) 'c++11 std::atomic and memory order relaxed'
There is a lot of information. Is there a good starting point for reading
an understandable and correct introduction? Some must-read standard papers?

(2) 'cpu speculative cache threads 42'
These papers discuss the topic at a very deep level of detail, and it looks
like they have their own language and typical graphics.

Is it possible to have some of this in this thread? As opposed to
'I know something and you are stupid' being answered with
'No, I know something and you are stupid'?

Harald
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: c++11 std::atomic and memory_order_relaxed

Post by syzygy »

hgm wrote:
syzygy wrote:What "my" system requires is that speculative dependencies between transactions are detected (such as reads of speculative values), and that cyclic dependencies between threads are permitted but must result in these threads being committed all together or not at all.
The problem is that it cannot know that. Your 42 example hinges on the fact that when there is a test on speculatively transferred data that happens to evaluate to true, you know that committing the data will clear all dependencies in the cycle, and eventually propagate back to you to confirm the original speculation. But you cannot know that. Part of that dependency cycle is in the other CPU. And that other CPU could have been speculating on something that is completely independent of whether you will commit the store or not. The only thing you know is that you obtained a value (42) which at any time in the future can be withdrawn. Committing your own results and sending the missiles flying, just based on the blind gamble that doing so will prevent the source of the speculative data from retracting it, will in general not conform to the standard.
There is nothing "cannot know" about it. As I wrote several times before:
syzygy wrote:Just to be clear "my" system commits both threads because:
- both conditions evaluate to true (yes, based on forwarded speculative values);
- assuming the speculative stores are valid, everything else is valid as well.
How would this go in practice?

Code: Select all

  transaction.begin();                    // start of the hypothetical transaction
  y.store( 42, memory_order_relaxed );    // speculative store, forwarded to the
                                          // other (also speculating) thread
  r1 = x.load( memory_order_relaxed );    // may read the other thread's
                                          // speculative store of 42
  if (r1 == 42)
    transaction.commit();                 // may still fail, see below
  else
    transaction.abort();
What might not be clear to you is that transaction.commit() may fail.
The "r1 == 42" check verifies whether the speculative store was valid (yes, based on forwarded speculative values).
In a next step, transaction.commit() verifies whether "everything else" is valid. This includes verifying that dependent transactions are ready to commit as well.

So both threads have to reach transaction.commit(). If one of them does not, the other will abort as well.

Just to pre-empt I repeat that the deadlock argument is a non-issue. Either the compiler has proven that each transaction finishes in finite time, or the system times them out.

Just to pre-empt I also repeat that transactions won't irrecoverably start nuclear wars. Either the compiler has proven that they don't, or the system prevents them from doing it.

Just to pre-empt I note that systems already exist where transactions at commit() time have to wait for dependent transactions to commit() (and if those other transactions abort() then the commit() turns into an abort(), obviously). These systems detect at runtime the dependencies that are created at runtime.

Just to pre-empt I repeat that a thread T1 becomes dependent on a thread T2 if it has been forwarded a speculatively stored value from T2. (There are more ways to get dependencies, but this is the only one occurring here.)

Please read carefully. And maybe try to be open to the idea that this just might work.

Just to pre-empt: I am NOT advocating that such systems should be built that lead to nonsensical results as in the example that triggered this thread. I do not WANT this.
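To make the "commit all together or not at all" rule concrete, here is a toy, single-threaded model of just the decision logic. Transaction, ready_to_commit and commit_cycle are made-up names for this sketch; it does not model real speculation or value forwarding, only the rule that a dependency cycle either commits as a whole or aborts as a whole.

#include <iostream>
#include <vector>

struct Transaction {
    bool ready_to_commit = false;  // did its own check (e.g. r1 == 42) pass,
                                   // evaluated on the forwarded speculative value?
};

// A cycle of mutually dependent transactions commits only if every member
// reached commit(); a single abort() aborts them all.
bool commit_cycle(const std::vector<const Transaction*>& cycle) {
    for (const Transaction* t : cycle)
        if (!t->ready_to_commit)
            return false;          // abort the whole cycle
    return true;                   // commit the whole cycle
}

int main() {
    Transaction t1, t2;            // t1 read t2's speculative store and vice versa
    t1.ready_to_commit = true;     // "r1 == 42" held on the forwarded value
    t2.ready_to_commit = true;     // "r2 == 42" held on the forwarded value
    std::cout << (commit_cycle({ &t1, &t2 }) ? "commit both\n" : "abort both\n");
}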
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: c++11 std::atomic and memory_order_relaxed

Post by syzygy »

hgm wrote:
syzygy wrote:Since this document is from 2007 it is not clear to me whether this actually applies to the current standard.

But the situation seems to be in fact worse. Here is a more recent document:
http://www.open-std.org/jtc1/sc22/wg21/ ... n3710.html
Actually I wonder why you call that 'worse'. In fact, that more recent document quotes the old wording of the standard as follows:
[ Note: The requirements do allow r1 == r2 == 42 in the following example, with x and y initially zero:

// Thread 1:
r1 = x.load(memory_order_relaxed);
if (r1 == 42) y.store(r1, memory_order_relaxed);
// Thread 2:
r2 = y.load(memory_order_relaxed);
if (r2 == 42) x.store(42, memory_order_relaxed);

However, implementations should not allow such behavior. -end note ]
Seems to me this note is just as much part of the C++11 standard as the mentioned 'requirements', which are just a section of the standard. But unlike the 2007 paper, which only uses the word 'discouraged', this outright forbids the 42. The paper proposes to amend this to the following, which is even more explicit.
But the 2013 paper is not part of the C++11 standard from 2011.

What I meant is that the 2013 paper confirms that the C++11 standard has various problems, and that they seem to be even worse than the one we have been discussing.

From the 2007 note:
On the other hand,

Thread 1:
r1 = y.load( memory_order_relaxed );
x.store( r1, memory_order_relaxed );
Thread 2:
r2 = x.load( memory_order_relaxed );
y.store( r2, memory_order_relaxed );

may not produce r1 = r2 = 42, since there is no sequence of evaluations that results in the computation of 42.
From the 2013 note:
Consider the following example, where x and y are atomic variables initialized to zero, and ri are local variables:

Thread 1:
r1 = x.load(memory_order_relaxed);
y.store(r1, memory_order_relaxed);

Thread 2:
r2 = y.load(memory_order_relaxed);
x.store(r2, memory_order_relaxed);

Effectively Thread 1 copies x to y, and thread 2 copies y to x. The section 1.10 specification allows each load to see either the initializing store of zero, or the store in the other thread.

This famously allows both r1 and r2 to have final values of 42, or any other "out of thin air" value. This occurs if each load sees the store in the other thread. It effectively models an execution in which the compiler speculates that both atomic variables will have a value of 42, speculatively stores the resulting values, and then performs the loads to confirm that the speculation was correct and nothing needs to be undone.
According to the 2007 note this was disallowed by the requirement:
2. If an evaluation A is included in the sequence, then all evaluations that assign to the same variable and are sequenced before or happens-before A must be included.
So either this requirement did not make it into C++11 or it is simply not sufficient. I have not even tried to parse this requirement, so this is all I can say.

It seems C++14 might fix these problems at the cost of making the implementation of memory_order_relaxed more expensive on certain architectures.

I do not completely understand this. I very much doubt (but I might be wrong) that, within the C++11 memory model, a compiler could on these existing architectures sensibly compile the example from the opening post in such a way that it produces the nonsensical result. If this is correct, then the reason for "unrelaxing" memory_order_relaxed is a formal one: we do not know how to write it down better, so we have to add assumptions that make the language less attractive, but at least allow us to formally prove that things are fine. This is not very satisfactory.

Simply not allowing certain examples is probably not an ideal solution, because there are probably always other examples one could come up with. So general requirements are necessary, but these are difficult to find.
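For what it's worth, one way to sidestep the question entirely in one's own code is to use release stores and acquire loads for this kind of exchange; the happens-before rules then already exclude the self-supporting cycle. A minimal sketch (the names mirror the example above):

#include <atomic>
#include <iostream>
#include <thread>

std::atomic<int> x(0), y(0);
int r1 = 0, r2 = 0;

void thread1() {
    r1 = x.load(std::memory_order_acquire);
    if (r1 == 42)
        y.store(r1, std::memory_order_release);
}

void thread2() {
    r2 = y.load(std::memory_order_acquire);
    if (r2 == 42)
        x.store(42, std::memory_order_release);
}

int main() {
    std::thread t1(thread1), t2(thread2);
    t1.join();
    t2.join();
    // With release/acquire, r1 == r2 == 42 would require each load to read a
    // value from a store that happens after it, i.e. a cycle in happens-before,
    // which the memory model rules out; the only possible outcome here is 0/0.
    std::cout << "r1=" << r1 << " r2=" << r2 << "\n";
}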
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: c++11 std::atomic and memory_order_relaxed

Post by syzygy »

Harald wrote:Is it possible to have some of this in this thread? As opposed to
'I know something and you are stupid' being answered with
'No, I know something and you are stupid'?
The posts to/from Bob certainly do not add anything useful, so ignore those.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: c++11 std::atomic and memory_order_relaxed

Post by syzygy »

hgm wrote:So what exactly are we arguing about? It seems that the people developing the standard have already realized that it was broken, and did not provide the definition of a workable programming language. It is not only that this kind of self-confirming speculation would cause violations of the standard as a side effect, when the speculation has to be retracted. Even the behavior in the case where it 'worked' is already a violation!
Nobody is expecting gcc or g++ to produce the nonsensical results, and certainly not on x86. So the practical aspect of this discussion is very limited.

The OP wondered why the 2007 document said that the nonsensical result was allowed by the requirements.

My first reaction was: this only means that the result does not violate the standard. (Maybe this was not accurate, because the 2007 paper is not necessarily part of the standard. But as far as I understand now, the 2011 standard does in fact allow it.)

My second reaction was: at least in theory combinations of compiler + hardware could produce this result (and it would be within the standard, at least in so far as my initial reaction was correct).

Then this got contested in all kinds of funny ways. But my point still stands. It is very limited and it is wrong to read anything into it that I did not write.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: c++11 std::atomic and memory_order_relaxed

Post by bob »

syzygy wrote:
hgm wrote:
syzygy wrote:What "my" system requires is that speculative dependencies between transactions are detected (such as reads of speculative values), and that cyclic dependencies between threads are permitted but must result in these threads being committed all together or not at all.
The problem is that it cannot know that. Your 42 example hinges on the fact that when there is a test on speculatively transferred data that happens to evaluate to true, you know that committing the data will clear all dependencies in the cycle, and eventually propagate back to you to confirm the original speculation. But you cannot know that. Part of that dependency cycle is in the other CPU. And that other CPU could have been speculating on something that is completely independent of whether you will commit the store or not. The only thing you know is that you obtained a value (42) which at any time in the future can be withdrawn. Committing your own results and sending the missiles flying, just based on the blind gamble that doing so will prevent the source of the speculative data from retracting it, will in general not conform to the standard.
There is nothing "cannot know" about it. As I wrote several times before:
syzygy wrote:Just to be clear "my" system commits both threads because:
- both conditions evaluate to true (yes, based on forwarded speculative values);
- assuming the speculative stores are valid, everything else is valid as well.
How would this go in practice?

Code: Select all

  transaction.begin();
  y.store( 42, memory_order_relaxed ); // speculative
  r1 = x.load( memory_order_relaxed );
  if (r1 == 42)
    transaction.commit();
  else
    transaction.abort();
What might not be clear to you is that transaction.commit() may fail.
The "r1 == 42" check verifies whether the speculative store was valid (yes, based on forwarded speculative values).
In a next step, transaction.commit() verifies whether "everything else" is valid. This includes verifying that dependent transactions are ready to commit as well.

So both threads have to reach transaction.commit(). If one of them does not, the other will abort as well.

Just to pre-empt I repeat that the deadlock argument is a non-issue. Either the compiler has proven that each transaction finishes in finite time, or the system times them out.
So now the compiler executes the program BEFORE it compiles it, to prove the above? Please. This is NOT a compiler issue. This is a user programming issue. As in normal transaction processing, you can do whatever you want, and commit once you have finished. And the commit can fail. And YOU have to deal with that in your programming. The compiler doesn't help. The hardware doesn't help.

In the case of transactional memory, there is simply some hardware support to help with the commit and with the commit-fail or commit-succeed handling, to make it more efficient. But it does NOT solve any speculation problems for you; you get to solve those all by yourself. The TM model simply has some hardware support to help deal with atomicity.

Just to pre-empt I also repeat that transactions won't irrecoverably start nuclear wars. Either the compiler has proven that they don't, or the system prevents them from doing it.

Just to pre-empt I note that systems already exist where transactions at commit() time have to wait for dependent transactions to commit() (and if those other transactions abort() then the commit() turns into an abort(), obviously). These systems detect at runtime the dependencies that are created at runtime.

Just to pre-empt I repeat that a thread T1 becomes dependent on a thread T2 if it has been forwarded a speculatively stored value from T2. (There are more ways to get dependencies, but this is the only one occurring here.)

Please read carefully. And maybe try to be open to the idea that this just might work.

Just to pre-empt: I am NOT advocating that such systems should be built that lead to nonsensical results as in the example that triggered this thread. I do not WANT this.
If we could just leave the hardware issue out of this, I would have no problem with the discussion. As I have repeatedly stated, transaction processing has been around forever. This transactional memory approach just gives us a more efficient way of doing the commits, without needing a host of locks and complex programming. But it does not automate the process. In the original example given, there was NO transactional memory mentioned, nor was there any transaction model discussed that had something like a commit operation. It was just a simple program where it was pointed out that the C++ standard allowed a result that is actually impossible unless the compiler has gone too far. Doing a speculative store (a meaningless concept in terms of hardware, since a store is NEVER speculated to its ultimate conclusion where L1 is updated) is unacceptable in the context of threads, because stores can affect other threads and hence cannot be done even if they would produce correct results in a sequential program.

About the only reason I can think of to do the store BEFORE the conditional test would be to push the conditional test forward in time, freeing up a cycle while part of the tested expression is still being computed. Stores are done outside the normal CPU arithmetic pipelines, so doing them first really offers nothing, and I am not even sure it would save a clock cycle, since the instruction has long since been fetched and is held in the reorder buffer waiting for dependencies to resolve. I can see practically no possible benefit to doing the store first and then having to undo it if the condition is false, unless the processor is relying on old-school static branch prediction. For example, one model used in the past was "if a branch goes backward in memory, it is likely a loop, so assume it will be taken; if a branch goes forward in memory, assume it will not be taken."

Given this:

if (c)
x = y;

I'll rewrite more like asm:

if (!c) goto l1;
x = y;
l1:

If c is true most of the time, that !c will be false and the branch falls through most of the time; with a static scheme that guesses the wrong direction for it, it is going to get mis-predicted frequently, resulting in pipeline restarts. If you change that to

t = x;
x = y;
if (!c) goto l1;
l2:

and further forward in the program:

l1: x=t;
goto l2;

We put it further forward to make the conditional branch prediction "not taken" since it is in the forward direction.

If c is true most of the time, then the path goes t = x, x = y, conditional branch not taken (the forward prediction is therefore correct), and we have done the assignment with no mis-prediction. We introduced a bit of overhead with t = x, but we have to get x into L1 anyway, so we only added one L1 cache hit penalty, which is generally very small.
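For readability, here is the same transformation rendered as compilable C++. The function names are just for illustration; nobody is claiming a real compiler emits exactly this.

// Original form: the store to x is guarded by the test.
void original(int& x, int y, bool c) {
    if (c)
        x = y;
}

// Transformed form: store first, undo if the test turns out false.
// This is only safe if nothing else can observe x in between.
void transformed(int& x, int y, bool c) {
    int t = x;   // save the old value
    x = y;       // "speculative" store
    if (!c)
        x = t;   // condition was false: put the old value back
}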

But that is COMPLETELY unsafe in a parallel program if x is shared, because we are changing the value of x two times when it should not be changed at all. In standard C, that would be perfectly acceptable, however, since x should be protected by a lock if it is shared. But in C++11, since we are using a class that provides atomicity automatically, the above is NFG as an optimization.

And, to make matters worse, modern branch prediction doesn't use a static mechanism. Everyone uses branch history, either local, global or both, which means that branch will be correctly predicted most of the time, whether it is taken or not taken, because the prediction is based on dynamic observation rather than a static guess.

So, we are back to the 2x42 model. Hardware won't ever do that. Ever. Compilers can do it easily. The guy writing that explanation specifically mentioned that compilers should NOT do such nonsense, knowing that the hardware never will.

I hope I did not screw up the above example; the c and !c get hard to read after a few minutes...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: c++11 std::atomic and memory_order_relaxed

Post by bob »

syzygy wrote:
bob wrote:You started the hardware idea.
Just read
http://talkchess.com/forum/viewtopic.ph ... 733#564733
http://talkchess.com/forum/viewtopic.ph ... 844#564844
Read both posts. Very slowly, very carefully. Then read them again.
Now the store to y can be performed speculatively before the load and the comparison take place. It needs to be rolled back without other threads ever seeing it if the comparison results in false.

Similar for thread 2.

I suppose you would need hardware with some sort of double-threaded speculative execution that at the moment probably does not exist for this to make sense (and to allow the nonsensical result..).
You REALLY should read your OWN posts before telling me to do so. What does the bold sentence above say? Any reference to hardware?

So your point with referring me to a post that shows your reference to hardware would be???
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: c++11 std::atomic and memory_order_relaxed

Post by bob »

syzygy wrote:
hgm wrote:So what exactly are we arguing about? It seems that the people developing the standard have already realized that it was broken, and did not provide the definition of a workable programming language. It is not only that this kind of self-confirming speculation would cause violations of the standard as a side effect, when the speculation has to be retracted. Even the behavior in the case where it 'worked' is already a violation!
Nobody is expecting gcc or g++ to produce the nonsensical results, and certainly not on x86. So the practical aspect of this discussion is very limited.

The OP wondered why the 2007 document said that the nonsensical result was allowed by the requirements.

My first reaction was: this only means that the result does not violate the standard. (Maybe this was not accurate, because the 2007 paper is not necessarily part of the standard. But as far as I understand now, the 2011 standard does in fact allow it.)

My second reaction was: at least in theory combinations of compiler + hardware could produce this result (and it would be within the standard, at least in so far as my initial reaction was correct).

Then this got contested in all kinds of funny ways. But my point still stands. It is very limited and it is wrong to read anything into it that I did not write.
The most "relaxed" memory model on planet earth is the (was the) DEC alpha. It was free to reorder stores and loads until it got tired of doing so. But it was NOT free to violate control dependencies, which means the example code would STILL produce the correct answers for two reasons.

(1) Different cores can do anything they want speculatively, but that does NOT include writes to L1. No architecture to date has had a CPU-to-CPU bus in addition to all the cache buses and such, which means there is no way to "share" something done speculatively except to write it to L1, and once written to L1 it becomes irreversible.

(2) Violating dependencies is not allowed, from the simple RAW, WAR and WAW hazards caused by data dependencies up through and including control dependencies, which cover the current example being discussed.

Many have criticized the Alpha for its highly relaxed memory model, but it executes correctly written code flawlessly. And the example being discussed here will NOT cause any problems, and will never produce the 2x42 result. Hardware is out of the picture. Only software can break this, and such a compiler would be broken. Hence the continued updates to the standard to fix what should never have been passed through in the first place.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: c++11 std::atomic and memory_order_relaxed

Post by bob »

syzygy wrote:
Harald wrote:Is it possible to have some of this in this thread? As opposed to
'I know something and you are stupid' being answered with
'No, I know something and you are stupid'?
The posts to/from Bob certainly do not add anything useful, so ignore those.
:)

Your posts have not added ANYTHING but mistakes, so "pot, kettle"? Your architectural examples are ALL broken. For clear and easy to see reasons. Yet you blindly forge ahead...

"Bob's" posts simply point out the numerous holes in your arguments, as do HGM's. You are in over your head and drowning.