volatile?

Discussion of chess software programming and technical issues.

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob » Sun Mar 23, 2014 4:34 pm

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:It is NOT ok to retrieve old values. The caches on Intel SPECIFICALLY prevent this by their snooping and inter-cache forwarding. Where is this stuff coming from? On Intel, the value you read will be the LAST value written by any other CPU. That's guaranteed.
I was wrong to say that Intel guarantees the illusion that a read will always give the last value written by any other CPU. Not even this illusion holds true in its full generality.

Suppose memory locations x and y are initialised to 0.

Now CPU1 performs a write and a read:
mov $1, [x]
mov [y], %eax

At roughly the same time, CPU2 also performs a write and a read:
mov $1, [y]
mov [x], %eax

Now %eax for both CPU1 and CPU2 may be 0.

How can that be? If CPU1 reads 0 from [y], it must have executed the read before CPU2 executed the write, right? So CPU1 must have executed the write even earlier, and CPU2 must have executed the read even later. That means that CPU2 can only have read 1. But in reality, it may read a 0.
That is a trivial example that is well known. Has absolutely nothing to do with current discussion which would only be about ONE variable. You will NEVER get an old value with Intel.
This is what you wrote:
bob wrote:On Intel, the value you read will be the LAST value written by any other CPU. That's guaranteed.
It is wrong.
Sorry, it is absolutely correct. Just look up their MESIF cache coherency protocol; it will explain EXACTLY why it is guaranteed to be true. That is the very definition of "cache-coherent NUMA".
Look. I gave an example, which you acknowledge, where the value read is NOT the LAST value written by any other CPU.

The value read is 0. At the time the value is being read, the value 1 had already been written. Capisce?
Your example is simply wrong.

When ANY core writes to a variable, it is INSTANTLY invalidated in all other caches. There is absolutely no way another CPU can get an old value after that, because its cache has to do another read since it no longer has the value. The cache with the newly modified data will forward a copy to the requestor, so the requestor ends up with the last value written. It happens that way every last time. If a CPU does a read BEFORE a value is modified, certainly it will get the old value. If it reads the value AFTER another core has written to that address, it will NEVER get the old value.

So what, exactly, are you talking about? This is an actual guarantee by the cache coherency protocol. Please give a sensible example. In your case, you are depending on a race. A race where the last write is NOT done last. I don't care about that case. Whenever the write is done, from that point on everyone will get the new value, IF they do a read after the write. If they did a read before the write, they get the old value. If they do ANOTHER read after the write, they get the new value.

Your example is not doing reads/writes to the SAME value...

syzygy
Posts: 4907
Joined: Tue Feb 28, 2012 10:56 pm

Re: volatile?

Post by syzygy » Sun Mar 23, 2014 5:12 pm

bob wrote:Your example is simply wrong.
Sheesh.
Your example is not doing reads/writes to the SAME value...
I hope you meant to the same LOCATION.
My example is exactly about reads/writes to the same memory location.

I will give the example again.

Suppose memory locations x and y are initialised to 0.

Now CPU1 performs a write and a read at times A and B with A < B:
A. mov $1, [x] (a write to [x])
B. mov [y], %eax (a read from [y])

At roughly the same time, CPU2 also performs a write and a read at time C and D with C < D:
C. mov $1, [y] (a write to [y])
D. mov [x], %eax (a read from [x])

Now it may actually happen that %eax for both CPU1 and CPU2 ends up containing 0. (This comes straight from the Intel manuals; you can look it up yourself.)

Your statement (about reads and writes to the SAME memory location):
bob wrote:On Intel, the value you read will be the LAST value written by any other CPU. That's guaranteed.
If this is true, then the following statements must be true as well:

(1) CPU1 reads 0 from [y], so it must have executed the read from [y] before CPU2 executed the write to [y]. It follows that B < C. This is about the same memory location [y].

(2) CPU2 reads 0 from [x], so it must have executed the read from [x] before CPU1 executed the write to [x]. It follows that D < A. This is about the same memory location [x].

Do you agree?

So we have A < B and C < D.
We also have B < C. This means A < B < C < D. So A < D.
But this is in contradiction to D < A. We cannot have A < D and D < A.

So we come to a contradiction. Statements (1) and (2) cannot both be true. This means that your statement is false.

Either CPU1 reads 0 from [y] after CPU2 writes 1 to [y]
or CPU2 reads 0 from [x] after CPU1 writes 1 to [x]
or both.
Conclusion: on x86, later reads from a memory location can be reordered before earlier writes to that SAME memory location. (And this you can find directly in Intel's manual as well...)

syzygy
Posts: 4907
Joined: Tue Feb 28, 2012 10:56 pm

Re: volatile?

Post by syzygy » Sun Mar 23, 2014 5:53 pm

syzygy wrote:Conclusion: on x86, later reads from a memory location can be reordered before earlier writes to that SAME memory location. (And this you can find directly in Intel's manual as well...)
This is lacking in precision.

On x86, later reads from a memory location obviously cannot be reordered before earlier writes to that SAME memory location by the same processor. Otherwise single-threaded code would simply stop working.

But if you look at two processors, then it can appear that a CPU reads a value from a memory location that according to the program order of both CPUs had already been overwritten by another CPU.

It is true that the example relies on both CPUs reading from and writing to different memory locations. But the effective result remains the same.

rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 1:48 pm

Re: volatile?

Post by rbarreira » Sun Mar 23, 2014 10:15 pm

bob wrote: OK, some specific points.

1. If you choose to lock EVERY shared access, you still have a problem. For example, the simple spin lock in Crafty where a thread waits on work. Do you want to lock to READ the value, as well as to write it? How does that solve the problem? You acquire the lock, you read the value, you release the lock, and if the value is zero, you repeat. That's known as a "loop". Without volatile, the "value" is a loop invariant, and the compiler will lift the load out of the loop and fetch it once. Now the code doesn't work because it will never see the changed value.

2. My knowledge about compilers is not "so 90s". I actually (1) tested this on gcc yesterday, and (2) looked at SOME of the compiler source to see if it had any recognition of pthread_mutex_lock built in. You seem to have either a reading problem OR a comprehension problem. I clearly pointed out that by presenting the compiler with the procedure source, it definitely CAN optimize across the procedure call. But NOT if it is buried in a library where it can't see it, or if it is compiled as a separate file that it can't see.

Try again.
What version of gcc did you use? If you truly have seen gcc optimizing out checks on a variable that's accessed after a call to the standard pthread_mutex_lock then you have found a library and/or compiler bug.

syzygy
Posts: 4907
Joined: Tue Feb 28, 2012 10:56 pm

Re: volatile?

Post by syzygy » Sun Mar 23, 2014 10:32 pm

rbarreira wrote:What version of gcc did you use? If you truly have seen gcc optimizing out checks on a variable that's accessed after a call to the standard pthread_mutex_lock then you have found a library and/or compiler bug.
He's not calling the standard pthread_mutex_lock. He copied and pasted from the pthreads library source code into his own program.

It seems Bob has come to realise that volatile is indeed not needed, because values will be reloaded after a pthread_mutex_lock(). He has now retreated into a corner where he is arguing that that has nothing to do with pthreads, but is just a case of "luck" because pthread_mutex_lock just happens to be a library call (and not e.g. a macro that the compiler could analyse).

As the rest of us understand, this is not a case of "luck" but simply a case of a POSIX conformant implementation. It is POSIX that guarantees this property of pthread_mutex_lock().

But as we have seen in the UB threads, some people have difficulty with thinking in terms of "what does the standard guarantee me".

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob » Mon Mar 24, 2014 12:30 am

rbarreira wrote:
bob wrote: OK, some specific points.

1. If you choose to lock EVERY shared access, you still have a problem. For example, the simple spin lock in Crafty where a thread waits on work. Do you want to lock to READ the value, as well as to write it? How does that solve the problem? You acquire the lock, you read the value, you release the lock, and if the value is zero, you repeat. That's known as a "loop". Without volatile, the "value" is a loop invariant, and the compiler will lift the load out of the loop and fetch it once. Now the code doesn't work because it will never see the changed value.

2. My knowledge about compilers is not "so 90s". I actually (1) tested this on gcc yesterday, and (2) looked at SOME of the compiler source to see if it had any recognition of pthread_mutex_lock built in. You seem to have either a reading problem OR a comprehension problem. I clearly pointed out that by presenting the compiler with the procedure source, it definitely CAN optimize across the procedure call. But NOT if it is buried in a library where it can't see it, or if it is compiled as a separate file that it can't see.

Try again.
What version of gcc did you use? If you truly have seen gcc optimizing out checks on a variable that's accessed after a call to the standard pthread_mutex_lock then you have found a library and/or compiler bug.
What, EXACTLY, are you talking about? I clearly pointed out that ANY global variable access before a procedure call will be reloaded AFTER the procedure call, UNLESS the compiler can see the procedure in its entirety to verify that no global variables are modified. If the compiler sees the code, it will avoid the reload after the procedure call. And it might even inline the call as well, depending on size.

So I have no idea what you are talking about. GCC has no code I can find that looks specifically for "pthread_mutex_lock". It doesn't need any such code, because it can't see that procedure's source and therefore can't avoid reloading globals after the call. Not because it is pthread_mutex_lock, but because it is a procedure it knows nothing about and can't see inside of.

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob » Mon Mar 24, 2014 12:35 am

syzygy wrote:
rbarreira wrote:What version of gcc did you use? If you truly have seen gcc optimizing out checks on a variable that's accessed after a call to the standard pthread_mutex_lock then you have found a library and/or compiler bug.
He's not calling the standard pthread_mutex_lock. He copied and pasted from the pthreads library source code into his own program.

It seems Bob has come to realise that volatile is indeed not needed, because values will be reloaded after a pthread_mutex_lock(). He has now retreated into a corner where he is arguing that that has nothing to do with pthreads, but is just a case of "luck" because pthread_mutex_lock just happens to be a library call (and not e.g. a macro that the compiler could analyse).

As the rest of us understand, this is not a case of "luck" but simply a case of a POSIX conformant implementation. It is POSIX that guarantees this property of pthread_mutex_lock().

But as we have seen in the UB threads, some people have difficulty with thinking in terms of "what does the standard guarantee me".
I gave you the example. I gave you the way to cause the problem. I told you if you include the pthread_mutex_lock() source, the compiler will optimize right across it because there is no check inside the compiler for that specific procedure name. So sorry, your explanation is pure bullshit. I told you how to verify it. There is nothing in the posix standard that says you can't compile with the library source code. Or against the library binary. Your claiming otherwise is pure nonsense.

rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 1:48 pm

Re: volatile?

Post by rbarreira » Mon Mar 24, 2014 1:00 am

bob wrote:
syzygy wrote:
rbarreira wrote:What version of gcc did you use? If you truly have seen gcc optimizing out checks on a variable that's accessed after a call to the standard pthread_mutex_lock then you have found a library and/or compiler bug.
He's not calling the standard pthread_mutex_lock. He copied and pasted from the pthreads library source code into his own program.

It seems Bob has come to realise that volatile is indeed not needed, because values will be reloaded after a pthread_mutex_lock(). He has now retreated into a corner where he is arguing that that has nothing to do with pthreads, but is just a case of "luck" because pthread_mutex_lock just happens to be a library call (and not e.g. a macro that the compiler could analyse).

As the rest of us understand, this is not a case of "luck" but simply a case of a POSIX conformant implementation. It is POSIX that guarantees this property of pthread_mutex_lock().

But as we have seen in the UB threads, some people have difficulty with thinking in terms of "what does the standard guarantee me".
I gave you the example. I gave you the way to cause the problem. I told you if you include the pthread_mutex_lock() source, the compiler will optimize right across it because there is no check inside the compiler for that specific procedure name. So sorry, your explanation is pure bullshit. I told you how to verify it. There is nothing in the posix standard that says you can't compile with the library source code. Or against the library binary. Your claiming otherwise is pure nonsense.
Could you post the source code for pthread_mutex_lock you used and where you found it please?

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob » Mon Mar 24, 2014 1:19 am

syzygy wrote:
bob wrote:Your example is simply wrong.
Sheesh.
Your example is not doing reads/writes to the SAME value...
I hope you meant to the same LOCATION.
My example is exactly about reads/writes to the same memory location.

I will give the example again.

Suppose memory locations x and y are initialised to 0.

Now CPU1 performs a write and a read at times A and B with A < B:
A. mov $1, [x] (a write to [x])
B. mov [y], %eax (a read from [y])

At roughly the same time, CPU2 also performs a write and a read at time C and D with C < D:
C. mov $1, [y] (a write to [y])
D. mov [x], %eax (a read from [x])

Now it may actually happen that %eax for both CPU1 and CPU2 ends up containing 0. (This comes straight from the Intel manuals; you can look it up yourself.)

Your statement (about reads and writes to the SAME memory location):
bob wrote:On Intel, the value you read will be the LAST value written by any other CPU. That's guaranteed.
If this is true, then the following statements must be true as well:

(1) CPU1 reads 0 from [y], so it must have executed the read from [y] before CPU2 executed the write to [y]. It follows that B < C. This is about the same memory location [y].

(2) CPU2 reads 0 from [x], so it must have executed the read from [x] before CPU1 executed the write to [x]. It follows that D < A. This is about the same memory location [x].

Do you agree?

So we have A < B and C < D.
We also have B < C. This means A < B < C < D. So A < D.
But this is in contradiction to D < A. We cannot have A < D and D < A.

So we come to a contradiction. Statements (1) and (2) cannot both be true. This means that your statement is false.

Either CPU1 reads 0 from [y] after CPU2 writes 1 to [y]
or CPU2 reads 0 from [x] after CPU1 writes 1 to [x]
or both.
Conclusion: on x86, later reads from a memory location can be reordered before earlier writes to that SAME memory location. (And this you can find directly in Intel's manual as well...)
Now CPU1 performs a write and a read at times A and B with A < B:
A. mov $1, [x] (a write to [x])
B. mov [y], %eax (a read from [y])

At roughly the same time, CPU2 also performs a write and a read at time C and D with C < D:
C. mov $1, [y] (a write to [y])
D. mov [x], %eax (a read from [x])
What you do is write to x, then read from y in one CPU. In the other you write to y, then read from x.

That has absolutely NOTHING to do with what I wrote. I specifically discussed a SINGLE volatile variable, written/read by different CPUs. Not two different CPUs writing to two different addresses and then reading the other address. That's a well-known artifact where Intel can reorder reads and writes, although it cannot reorder writes as the Alpha loves to do.

So what is the point for the above? Who was talking about two DIFFERENT memory addresses (x and y)?

How about getting back on topic. And no, I didn't mean "different values" I meant different addresses, which is exactly what x and y represent.

This was about cache coherency. Once you do a write, everyone else AFTER that point is guaranteed to get the new value. Before the write, obviously they get the old value. That's what volatile is all about: trying to access something that can change spontaneously, while preventing the compiler from optimizing the access away. Is it a race condition? Certainly. Will it work if used correctly? Absolutely. Otherwise a basic atomic lock wouldn't work, and we know they do. xchg, for example...

bob
Posts: 20923
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob » Mon Mar 24, 2014 1:25 am

rbarreira wrote:
bob wrote:
syzygy wrote:
rbarreira wrote:What version of gcc did you use? If you truly have seen gcc optimizing out checks on a variable that's accessed after a call to the standard pthread_mutex_lock then you have found a library and/or compiler bug.
He's not calling the standard pthread_mutex_lock. He copied and pasted from the pthreads library source code into his own program.

It seems Bob has come to realise that volatile is indeed not needed, because values will be reloaded after a pthread_mutex_lock(). He has now retreated into a corner where he is arguing that that has nothing to do with pthreads, but is just a case of "luck" because pthread_mutex_lock just happens to be a library call (and not e.g. a macro that the compiler could analyse).

As the rest of us understand, this is not a case of "luck" but simply a case of a POSIX conformant implementation. It is POSIX that guarantees this property of pthread_mutex_lock().

But as we have seen in the UB threads, some people have difficulty with thinking in terms of "what does the standard guarantee me".
I gave you the example. I gave you the way to cause the problem. I told you if you include the pthread_mutex_lock() source, the compiler will optimize right across it because there is no check inside the compiler for that specific procedure name. So sorry, your explanation is pure bullshit. I told you how to verify it. There is nothing in the posix standard that says you can't compile with the library source code. Or against the library binary. Your claiming otherwise is pure nonsense.
Could you post the source code for pthread_mutex_lock you used and where you found it please?
You can download the pthread library source from dozens of places. You can download ALL library sources from dozens of places. At one point this was a separate library, then a second library, and then it was moved into glibc, where I believe it still is. The version I have is from an older system, after POSIX threads was finally cleaned up and debugged. I have a couple of these tucked away because I like to show them when teaching the parallel programming course, as a way of showing where they work well and where they are not so good (such as in my chess program).
