c++11 std::atomic and memory_order_relaxed

bob · Post by **bob** » Sun Apr 06, 2014 7:40 pm

Harald wrote:This topic is interesting and complicated and I even learned something.
But it is also a little annoying to follow the 'discussion'.

I just made two google searches:

(1) '++11 std::atomic and memory order relaxed'
There is a lot of information. Is there a good starting point to read
an understandable and correct introduction? Some must-read standard papers?

(2) 'cpu speculative cache threads 42'
These papers discuss the topic in a very deep detail level and it looks
like they have their own language and typical graphics.

Is it possible to have some of this in this thread? In opposition to
'I know something and you are stupid' that is answered with
'No, I know something and you are stupid'?

Harald

I ran through the first page of your last search. Didn't see anything particularly applicable to violating control dependencies. Very first one was a CMU dissertation dealing with a compiler's generating speculative threads of execution. Which is not the same thing as this discussion at all.

I think the reason this is not addressed is easy to explain: (1) the original C++11 document had a serious flaw that has now been fixed; (2) hardware will not speculate as in the 2x42 example because it violates a control dependency; (3) a compiler could do anything it wants, but 2x42 is broken, which the latest versions of the standard seem to have agreed on and fixed.

Therefore this discussion is pretty pointless. The only conclusion of any interest is the one I posted some time back: "The C++11 standard was broken and allowed dangerous compiler behavior." We now know this has been fixed. This never was an architecture problem; no architecture would dare try speculatively writing to memory (writing to your own internal registers is fine of course, Intel does this everywhere).

syzygy · Post by **syzygy** » Sun Apr 06, 2014 9:02 pm

I am not going to address Bob anymore.

However, if anyone else has any questions I am more than willing to answer them.

bob · Post by **bob** » Sun Apr 06, 2014 9:26 pm

syzygy wrote:I am not going to address Bob anymore.

However, if anyone else has any questions I am more than willing to answer them.

With PLENTY of incorrect answers, don't forget. Don't address anyone that points out your obvious nonsense. I suppose you also will ignore HGM?

syzygy · Post by **syzygy** » Sun Apr 06, 2014 10:51 pm

bob wrote:
syzygy wrote:I am not going to address Bob anymore.

However, if anyone else has any questions I am more than willing to answer them.
With PLENTY of incorrect answers, don't forget. Don't address anyone that points out your obvious nonsense. I suppose you also will ignore HGM?

Do you have problems understanding "anyone else"?

bob · Post by **bob** » Mon Apr 07, 2014 12:00 am

syzygy wrote:
bob wrote:
syzygy wrote:I am not going to address Bob anymore.

However, if anyone else has any questions I am more than willing to answer them.
With PLENTY of incorrect answers, don't forget. Don't address anyone that points out your obvious nonsense. I suppose you also will ignore HGM?
Do you have problems understanding "anyone else"?

Not NEARLY as many problems as you have understanding "hardware speculative store and why it doesn't happen as you define it." My original statement, this is a compiler issue, NOT a hardware issue still stands as the ONLY correct stuff in this thread. No hardware speculates as you suggest, none likely ever will. And I doubt any C++ compiler is so broken as to allow that to happen either. The guy was commenting about the C++11 specifications, you get off into never-never-land about non-existent hardware and such. He asked,

My understanding of memory_order_relaxed was that the atomic operations can be rearranged in any order but pulling numbers out of thin air would be forbidden. what rearrangement of loads and stores leads to r1 == r2 == 42 in this situation?

The answer is "this can ONLY happen with a broken compiler. Hardware today will not do this, hardware of the future will not do this."

And off you went on unrelated topics such as transactional memory, which is completely unrelated to his question, and it only got worse from there.

Michel · Post by **Michel** » Mon Apr 07, 2014 9:05 pm

My understanding of memory_order_relaxed was that the atomic operations can be rearranged in any order but pulling numbers out of thin air would be forbidden. what rearrangement of loads and stores leads to r1 == r2 == 42 in this situation?

The answer is "this can ONLY happen with a broken compiler."

That is not so clear right? There are two points:

(1) The compiler can produce code that in single threaded mode is completely equivalent to the naive translation but in multithreaded mode may produce r1=42, r2=42.

(2) This is an extreme example. But the issue seems to be that it is tricky (a) to make this example illegal and (b) still allow common compiler optimizations (speculative stores).

I am not sure if I understand Ronald's point correctly, but it seems to be that the compiler issues could be resolved in case you had hardware
with transaction support. This is similar to concurrent database access

http://docs.oracle.com/cd/B19306_01/ser ... onsist.htm

Perhaps something like "serializable isolation"?

syzygy · Post by **syzygy** » Mon Apr 07, 2014 10:10 pm

Michel wrote:(1) The compiler can produce code that in single threaded mode is completely equivalent to the naive translation but in multithreaded mode may produce r1=42, r2=42.

(2) This is an extreme example. But the issue seems to be that it is tricky (a) to make this example illegal and (b) still allow common compiler optimizations (speculative stores).

If the compiled code for Thread 1 looks like:

Code:

tmp = y;
y = 42;
r1 = x;
if (r1 != 42) y = tmp;

then this would work fine in single-threaded mode, but it would violate the requirements of C++11, because in some executions it includes (visible) stores to global variables that the abstract machine does not perform.
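For comparison, the source program that this compiled code would correspond to can be sketched in standard C++. This is a minimal sketch, assuming the example under discussion is the usual two-thread one with x and y both initially zero; the second thread is reconstructed by symmetry, and the harness names `run_litmus_once` and `count_thin_air` are made up for illustration. In the abstract machine neither store of 42 can execute unless a load has already returned 42, so the counter is expected to stay at zero on any ordinary compiler and CPU.

```cpp
#include <atomic>
#include <thread>
#include <utility>

// Hypothetical harness: runs the two-thread program once, returns {r1, r2}.
std::pair<int, int> run_litmus_once() {
    std::atomic<int> x{0}, y{0};
    int r1 = 0, r2 = 0;

    std::thread t1([&] {
        r1 = x.load(std::memory_order_relaxed);
        if (r1 == 42) y.store(42, std::memory_order_relaxed);
    });
    std::thread t2([&] {
        r2 = y.load(std::memory_order_relaxed);
        if (r2 == 42) x.store(42, std::memory_order_relaxed);
    });

    // join() synchronizes with thread completion, so reading r1 and r2
    // afterwards is race-free.
    t1.join();
    t2.join();
    return std::make_pair(r1, r2);
}

// Counts how many of `runs` executions ended with r1 == r2 == 42.
int count_thin_air(int runs) {
    int hits = 0;
    for (int i = 0; i < runs; ++i) {
        std::pair<int, int> r = run_litmus_once();
        if (r.first == 42 && r.second == 42) ++hits;
    }
    return hits;
}
```

Calling `count_thin_air(1000)` repeats the experiment; a nonzero result would mean a store became visible that the abstract machine never performed.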

To "correct" this violation, the store would have to be made invisible retroactively.

Such things happen in transactional memory systems with eager versioning. A speculative / transactional store to a shared variable by a first transaction is visible to another second transaction (i.e. another thread running in transactional mode). If the first transaction aborts, then the second transaction must also be aborted. The transactional memory system takes care of this.

So for Thread 1 we get:

Code:

transaction.start();
y = 42; // speculatively
r1 = x;
if (r1 == 42)
  transaction.commit();
else
  transaction.abort();

The call transaction.start() tells the system that all stores to global variables may have to be rolled back. In an eager versioning system the old values will typically be recorded in an undo log. A call to transaction.abort() will roll back the transaction. A call to transaction.commit() will check whether any conflicts occurred that require a rollback. If not, the undo log is cleared and the stores are final. If the transaction is dependent on other transactions, transaction.commit() may have to wait for the fate of those other transactions to become known.
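The undo-log idea can be sketched with a toy class. This is only an illustration under heavy assumptions: it is single-threaded, it models no conflict detection and no dependencies between transactions, and the names `UndoLogTransaction` and `thread1_body` are hypothetical rather than taken from any real transactional memory system.

```cpp
#include <utility>
#include <vector>

// Toy, single-threaded model of eager versioning with an undo log.
class UndoLogTransaction {
public:
    // Eager store: remember the old value, then write in place.
    void store(int& location, int value) {
        undo_log_.emplace_back(&location, location);
        location = value;
    }

    // Commit: the in-place values become final, so the log is dropped.
    // (A real system would first check for conflicts.)
    void commit() { undo_log_.clear(); }

    // Abort: restore the old values in reverse order of the stores.
    void abort() {
        for (auto it = undo_log_.rbegin(); it != undo_log_.rend(); ++it)
            *it->first = it->second;
        undo_log_.clear();
    }

private:
    std::vector<std::pair<int*, int>> undo_log_;
};

// Thread 1 of the example, written against the toy class.
// Returns r1; y keeps the value 42 only if the transaction commits.
int thread1_body(const int& x, int& y) {
    UndoLogTransaction tx;
    tx.store(y, 42);  // speculative store, written in place
    int r1 = x;
    if (r1 == 42)
        tx.commit();
    else
        tx.abort();
    return r1;
}
```

With x equal to 0 the store to y is rolled back and y keeps its old value; with x equal to 42 the transaction commits and y stays at 42.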

What needs to be added to this for the example to work is that two transactions that are dependent on each other in the sense that both read a value speculatively written by the other are allowed to proceed and commit (necessarily simultaneously). This is probably a bad idea, because it allows these weird results, but there is no technical obstacle to it.

(Btw, what happens if y is read non-transactionally by a third thread? Answer: either the transactional system simply does not allow this, in which case the compiler "optimisation" is illegal, or the transactional system detects it and aborts the transaction being executed by Thread 1. There are probably many more of these small issues that can be resolved by a little bit of thinking and have been resolved in actual systems.)

This is similar to concurrent database access

Transactional memory has its roots in database transactions, but it is quite different. Its main purpose is to make multithreaded programming easier. Currently the "easy" way to do multithreaded programming is to have one big lock for locking all shared data. This can be terribly inefficient if most accesses do not "touch" each other. The solution is to refine the locks, but this makes life much more complicated.

With transactional memory, the programmer can again pretend that all shared data is behind one big lock, usually by encapsulating all accesses like this:

Code:

  atomic {
    // access shared data
  }

This could be implemented as

Code:

  singleglobalmutex.lock();
  // access shared data
  singleglobalmutex.unlock();

but that would be inefficient. With transactional memory the system does not lock anything. Instead, it lets all the threads entering an atomic block open a transaction. In transactional mode, it monitors all accesses to see if there is a conflict (usually on a cacheline basis). Usually there will not be a conflict and the transactions can commit.

In chess this could be used to access the shared transposition table (TT) in a completely safe manner without locking overhead. On Haswell processors with TSX this can already be done today.

One way to do this on Haswell is by simply putting TT accesses behind a lock:

Code:

  ttmutex.lock();
  // access TT
  ttmutex.unlock();

and link to a recent glibc. Glibc will use TSX to "elide" the lock by replacing it with a transaction (and then take the lock if the transaction fails). (I am not sure whether glibc already does this in any official releases.)
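A minimal sketch of the "TT behind a lock" variant in C++11 follows. The entry layout, the depth-preferred replacement rule and the names `TTEntry` and `LockedTT` are assumptions for illustration only. The code itself is an ordinary mutex-protected table; whether the lock is actually elided depends on the C library build and the CPU, and nothing in the code relies on it.

```cpp
#include <cstddef>
#include <cstdint>
#include <mutex>
#include <vector>

// Hypothetical entry layout; real engines pack more fields.
struct TTEntry {
    std::uint64_t key = 0;
    int score = 0;
    int depth = -1;  // -1 marks an empty slot
};

class LockedTT {
public:
    // `size` must be greater than zero.
    explicit LockedTT(std::size_t size) : table_(size) {}

    void store(std::uint64_t key, int score, int depth) {
        std::lock_guard<std::mutex> guard(ttmutex_);
        TTEntry& e = table_[key % table_.size()];
        if (depth >= e.depth) {  // depth-preferred replacement
            e.key = key;
            e.score = score;
            e.depth = depth;
        }
    }

    // Returns true and fills `out` if the slot holds this key.
    bool probe(std::uint64_t key, TTEntry& out) {
        std::lock_guard<std::mutex> guard(ttmutex_);
        const TTEntry& e = table_[key % table_.size()];
        if (e.depth < 0 || e.key != key) return false;
        out = e;
        return true;
    }

private:
    std::mutex ttmutex_;
    std::vector<TTEntry> table_;
};
```

Because every access goes through the same mutex, the program is correct with or without elision; elision would only change how much the threads serialize in practice.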

It can also be done with self-made spinlocks by adding the XACQUIRE and XRELEASE prefixes to the locking instructions in the assembly. Older processors will ignore these; processors with TSX will recognise them and "elide" the lock.
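Instead of hand-written assembly, GCC exposes the same hints as extra memory-order flags on its `__atomic` builtins. The sketch below assumes a GCC-compatible compiler; the macro names `LOCK_ACQUIRE`, `LOCK_RELEASE` and `SPIN_PAUSE`, the class `ElidableSpinLock` and the helper `locked_increments` are hypothetical. Where the HLE flags are not defined, it falls back to a plain acquire/release spinlock, so it stays correct without any elision.

```cpp
#include <thread>
#include <vector>

#if defined(__x86_64__) || defined(__i386__)
#define SPIN_PAUSE() __builtin_ia32_pause()
#else
#define SPIN_PAUSE() ((void)0)
#endif

// GCC defines the HLE flags on x86 targets; elsewhere use plain
// acquire/release so the lock still works, just without the hint.
#ifdef __ATOMIC_HLE_ACQUIRE
#define LOCK_ACQUIRE (__ATOMIC_ACQUIRE | __ATOMIC_HLE_ACQUIRE)
#define LOCK_RELEASE (__ATOMIC_RELEASE | __ATOMIC_HLE_RELEASE)
#else
#define LOCK_ACQUIRE __ATOMIC_ACQUIRE
#define LOCK_RELEASE __ATOMIC_RELEASE
#endif

class ElidableSpinLock {
public:
    void lock() {
        // Exchange returns the previous value: 0 means we got the lock.
        while (__atomic_exchange_n(&locked_, 1, LOCK_ACQUIRE) != 0)
            SPIN_PAUSE();
    }
    void unlock() { __atomic_store_n(&locked_, 0, LOCK_RELEASE); }

private:
    int locked_ = 0;
};

// Hypothetical check: several threads bump a plain counter under the lock.
long locked_increments(int threads, int iters) {
    ElidableSpinLock lock;
    long counter = 0;
    std::vector<std::thread> pool;
    for (int t = 0; t < threads; ++t)
        pool.emplace_back([&] {
            for (int i = 0; i < iters; ++i) {
                lock.lock();
                ++counter;
                lock.unlock();
            }
        });
    for (auto& th : pool) th.join();
    return counter;
}
```

If the lock provides mutual exclusion, `locked_increments(4, 20000)` returns exactly 80000, whether or not the hardware elides anything.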

TSX implements a "lazy versioning" form of transactional memory. Stores only become visible after the transaction has committed.

bob · Post by **bob** » Mon Apr 07, 2014 10:17 pm

Michel wrote:
My understanding of memory_order_relaxed was that the atomic operations can be rearranged in any order but pulling numbers out of thin air would be forbidden. what rearrangement of loads and stores leads to r1 == r2 == 42 in this situation?
The answer is "this can ONLY happen with a broken compiler."

That is not so clear right? There are two points:

(1) The compiler can produce code that in single threaded mode is completely equivalent to the naive translation but in multithreaded mode may produce r1=42, r2=42.

Correct, but note that this standard is C++11, which explicitly addresses threads as well. The author of the quote that is being bandied about here (the 2x42 example) specifically says that this is a compiler issue and he originally stated "the standard does not prohibit this, but the implementation (the compiler) should avoid doing so." It was later changed to (eventually) forbidding this behavior by the compiler completely. It never was a hardware issue because NO past or existing hardware would do this, except possibly for the very rare non-cache-coherent processor (first version of IBM's Blue Gene for example).

(2) This is an extreme example. But the issue seems to be that it is tricky (a) to make this example illegal and (b) still allow common compiler optimizations (speculative stores).

I am not sure if I understand Ronald's point correctly, but it seems to be that the compiler issues could be resolved in case you had hardware
with transaction support. This is similar to concurrent database access

Transactional memory doesn't solve this by itself. It requires effort by the programmer as well, just as in the concurrent database access problem, where you eventually must do a "commit" to make the changes visible, knowing that the commit can fail on occasion, and again it is the programmer's responsibility to deal with that. This example is pure C/C++ reading and writing memory. Hardware can't possibly speculate memory writes, even the non-cache-coherent processors. But a compiler certainly can, as has been explained repeatedly. Ronald simply could not get off his fantasy "hardware fix" that would take care of this.

http://docs.oracle.com/cd/B19306_01/ser ... onsist.htm

Perhaps something like "serializable isolation"?

There are several approaches to deal with this. The problem has been around for 50 years now. But the solutions are always "programming" solutions. Transactional memory makes the "programming solution" easier, but it does not do away with it. The programmer remains intimately in charge of making it work, just with less effort in locking and such.

Turns out the entire discussion has been moot, because the latest version of that analysis (C++11/beyond) has now clarified the 2x42 as "this is forbidden". That is only done because it was ALWAYS a compiler issue, not a hardware problem, since speculative writes in hardware are never done. At least speculative writes that actually reach L1, because at that instant, they can no longer be "undone".

kbhearn · Post by **kbhearn** » Mon Apr 07, 2014 11:16 pm

Technically the quote in the first example still almost exists in the current working draft.

http://isocpp.org/files/papers/N3690.pdf

'discouraged from allowing' changed to 'should not allow'. Regardless, my question was answered a couple dozen postings ago as to how it doesn't break the requirements put forth in the standard. The elaborations that in practical terms it'd never happen anyway were also appreciated. I'd ask that the circle of insults stop but I think the two of you must rather enjoy it.

bob · Post by **bob** » Mon Apr 07, 2014 11:22 pm

kbhearn wrote:Technically the quote in the first example still almost exists in the current working draft.

http://isocpp.org/files/papers/N3690.pdf

'discouraged from allowing' changed to 'should not allow'. Regardless, my question was answered a couple dozen postings ago as to how it doesn't break the requirements put forth in the standard. The elaborations that in practical terms it'd never happen anyway were also appreciated. I'd ask that the circle of insults stop but I think the two of you must rather enjoy it.

Seems that "should not allow" is pretty clear? It would seem to me if it says "should not allow" and your compiler allows that, it violates the standard?

Which makes perfect sense since current hardware can not pull this off...
