Questions on volatile keyword and memory barriers

bob · Post by **bob** » Tue Aug 18, 2009 9:57 pm

Zach Wegner wrote:
bob wrote:Our alpha is now officially retired and on the way to wherever retired machines go around here. But the barrier discussion was simply one case in point where something works fine on x86/x86-64 but fails miserably on another architecture like the alpha, where a real fence instruction needs to be added to that barrier function, the volatile trick is not good enough there.
So what's the problem? Just add the fence instruction. Then you don't need volatile anymore.

The problem is that the volatile function was given as a "functional barrier". Which it is on X86, but is not on an alpha. That was the point both of us were making. You do not want to use a solution that is architecturally sensitive and not know that. This was being suggested as a way to avoid the use of volatile, and I simply pointed out that it would only work on x86.

And you can't just add the fence instruction, you still need a volatile keyword so that the compiler will not eliminate memory reads due to optimization, and fail to notice that a value has been changed by another thread. If you go the fence approach, you still need either the __volatile__ asm() option, or else volatile on the declarations for the variables that can be changed spontaneously. The fence just avoids the severe re-ordering for reads and writes that the alpha can do. And which one day I will bet X86 will do as well since it is currently a noticable bottleneck and is getting worse as processor speeds go up.

Zach Wegner · Post by **Zach Wegner** » Tue Aug 18, 2009 10:10 pm

OK, I meant adding the fence instruction to GCP's function, and eliminating the "volatile" from variable declarations.

bob · Post by **bob** » Tue Aug 18, 2009 11:50 pm

Zach Wegner wrote:OK, I meant adding the fence instruction to GCP's function, and eliminating the "volatile" from variable declarations.

That is OK, but the point is there is no universal barrier that will work using just the __volatile__ asm() stuff. My goal was to make that clear so that someone doesn't just copy the code that was posted and expect it to work anywhere, since it was very specific to x86 in a non-obvious way.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Wed Aug 19, 2009 3:16 pm

wgarvin wrote: This wide variety of semantics across different hardware is one of the reasons its so hard to write correct, portable multithreaded code (especially if you make your own synchronization primitives rather than just using semaphore/mutex implementations supplied by the OS).

For sure. As I said: use the POSIX locking functions and you can forget about volatile completely.

There are very good reasons to never use volatile but to rely on barriers instead (either by instantiating them directly or relying on them via POSIX locks), the original poster even gave them.

Also notice that their AMD64 row doesn't have a checkmark for "loads reordered after stores" but the x86 and x86 OOStore rows do have that checkmark. So you could easily end up with code that has a race condition between a load and a store, and if you do all your testing on an AMD chip the issue will never be exposed. But on Intel, you might get weird crashes or erroneous behaviour, and it might occur rarely (e.g. once in 10,000 games) making it very difficult to track down.

Intel has recently clarified in their documentation that the same guarantees that are given in the AMD64 model are in fact given on all their CPUs, and AMD has confirmed this is the same for theirs, and updated their architecture manuals. I quoted these in the previous discussion about this kind of thing we had here.

So your "works on AMD breaks on Intel" scenario simply doesn't exist.

It also won't exist if you use the OS barrier functions. And it doesn't affect the code snippet I posted, which works correctly even on architectures with out-of-order memory writes.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Wed Aug 19, 2009 3:24 pm

bob wrote: It does solve the optimization issue since it is gcc, but it doesn't emit any sort of barrier instruction to deal with the out-of-order-writes, which breaks this.

What part of the code breaks exactly due to out of order writes? I don't think the snipped I posted is sensitive to them.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Wed Aug 19, 2009 3:25 pm

The last thing I wanted to point out is that volatile *also* won't generate any hardware barriers. So those can never be an argument to prefer volatile over using barriers.

bob · Post by **bob** » Wed Aug 19, 2009 7:49 pm

Gian-Carlo Pascutto wrote:
wgarvin wrote: This wide variety of semantics across different hardware is one of the reasons its so hard to write correct, portable multithreaded code (especially if you make your own synchronization primitives rather than just using semaphore/mutex implementations supplied by the OS).
For sure. As I said: use the POSIX locking functions and you can forget about volatile completely.

There are very good reasons to never use volatile but to rely on barriers instead (either by instantiating them directly or relying on them via POSIX locks), the original poster even gave them.

Also notice that their AMD64 row doesn't have a checkmark for "loads reordered after stores" but the x86 and x86 OOStore rows do have that checkmark. So you could easily end up with code that has a race condition between a load and a store, and if you do all your testing on an AMD chip the issue will never be exposed. But on Intel, you might get weird crashes or erroneous behaviour, and it might occur rarely (e.g. once in 10,000 games) making it very difficult to track down.
Intel has recently clarified in their documentation that the same guarantees that are given in the AMD64 model are in fact given on all their CPUs, and AMD has confirmed this is the same for theirs, and updated their architecture manuals. I quoted these in the previous discussion about this kind of thing we had here.

So your "works on AMD breaks on Intel" scenario simply doesn't exist.

It also won't exist if you use the OS barrier functions. And it doesn't affect the code snippet I posted, which works correctly even on architectures with out-of-order memory writes.

As I mentioned, I compiled your code, using gcc, on an older alpha 21164 box and it definitely did _not_ work. The compiler did not know to emit a MB/WMB instruction at all, which will cause it to not work. More recent versions of GCC might well do that, I do not know. It would be a dozen times better if we had a POSIX "barrier()" function that is supplied by the vendor so that it does the right things for each possible architecture.

It did appear that this version of GCC understood the __volatile__ on the asm() stuff, and did not reorder instructions and move them either before or after the barrier() function, but that isn't quite enough on the alpha. May well be that newer versions of GCC know to produce a MB/WMB on the alpha if you have a volatile asm() block in it, can't speak to that since this was the only alpha we had remaining and it is now history (I would not want to take the time to build a new GCC on it anyway, that is a painful process).

bob · Post by **bob** » Wed Aug 19, 2009 7:52 pm

Gian-Carlo Pascutto wrote:
bob wrote: It does solve the optimization issue since it is gcc, but it doesn't emit any sort of barrier instruction to deal with the out-of-order-writes, which breaks this.
What part of the code breaks exactly due to out of order writes? I don't think the snipped I posted is sensitive to them.

What I meant was that there was no MB/WMB instructions produced, so that when you reach the barrier() call, and _assume_ all writes have been done, so that you can then clear a lock or whatever, you can run into the classic race we have been talking about. your example was simple enough to have no problems since it really didn't do anything. But what is needed to make it work, at least on the version of gcc on that old box, is to add a WMB instruction (sort of like a sfence in the Intel world) in the asm() code to guarantee that all previous writes have been done before passing the barrier and proceeding.

bob · Post by **bob** » Wed Aug 19, 2009 7:59 pm

Gian-Carlo Pascutto wrote:The last thing I wanted to point out is that volatile *also* won't generate any hardware barriers. So those can never be an argument to prefer volatile over using barriers.

Here's the thing.

1. volatile tells the compiler to read the value every time it is used in the code, to not keep copies in a register to avoid the memory access. That is all I care about. I don't want the compiler to produce an infinite loop incorrectly, thinking that since a value is not modified inside the loop, it is a constant for the duration of the loop.

2. A barrier is a different issue. There is a case to be made for using a barrier, to be sure. Because you can tailor the barrier to the architecture. And, say, on the alpha, you can include the needed WMB instruction, while on X86 you do not.

However, not using volatile still leads to some issues that need addressing. Suppose you are not doing a lock, but you do occasionally test a value to see if it has changed (say an "abort search flag".) Without volatile, you end up inserting barriers at inconvenient locations and break other optimizations, where all I want to guarantee is that when I test the value, I test the most current value, not one saved in a register for what could be several minutes.

Since the store is being done in another thread, and I am simply testing the flag in this thread, barriers are very inconvenient since I have to stick a barrier() before each place where I check the variable. But volatile is not, in the above context, because I just check the variable knowing I will get the most recent change. If I hit a race where A is writing and B is trying to read, I don't care, the next time B reads it will get the new value. But I don't want barrier() calls scattered all over my code to make it harder to read and more likely that I will forget to use one (the often-quoted "sin of omission" in parallel programming).

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Wed Aug 19, 2009 9:17 pm

bob wrote: Without volatile, you end up inserting barriers at inconvenient locations and break other optimizations, where all I want to guarantee is that when I test the value, I test the most current value, not one saved in a register for what could be several minutes.

Volatile will basically inhibit optimization on all operations involving the variable. The barrier can put be there only when it's needed. This is a disadvantage of volatile, not an advantage.

But I don't want barrier() calls scattered all over my code to make it harder to read and more likely that I will forget to use one (the often-quoted "sin of omission" in parallel programming).

You can just as well make the opposite argument and say that using volatile will hide the fact that it's parallel from the code and make it more likely that you forget to lock something, whereas the barrier forces you to think it over on every access.

In Deep Sjeng I just lock, and for those few variables who need to be checked every node so locking them is too expensive (is_aborted), I use accessor functions.

Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers

Re: Questions on volatile keyword and memory barriers