volatile?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: volatile?

Post by AlvaroBegue »

bob wrote: My example is trivial. A thread needs to wait for work to be made available. I do NOT want it to block; it takes too long to wake it up. So I simply have it spin with "while (!work);", where work has to be volatile. No lock needed, no nothing. When someone wants to give this thread work, they simply store a pointer into "work" and the thread starts searching within nanoseconds.

How to do that without a volatile spin loop? Barriers/condition variables/mutex locks don't offer any high-performance alternative. I can think of lots of programming examples where locks with no volatile value STILL have issues unless you require that ALL accesses to a value, whether they are reads OR writes, must use a lock. Even more of a performance loss.
Would making that variable atomic (as in C11 or C++11) work? Would it be any less efficient?
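Something like this is what I have in mind (a minimal C11 sketch; the work_item type and the function names are invented for illustration):

Code: Select all

#include <stdatomic.h>
#include <stddef.h>

struct work_item;                               /* hypothetical work descriptor */

static _Atomic(struct work_item *) work = NULL; /* shared handoff slot */

/* The waiting thread: the same busy-wait as "while (!work);",
   but on an atomic instead of a volatile. */
struct work_item *wait_for_work(void)
{
    struct work_item *w;
    while ((w = atomic_load_explicit(&work, memory_order_acquire)) == NULL)
        ;                                       /* spin */
    return w;
}

/* The producer: the release store makes everything written
   to *w before the call visible to the spinning thread. */
void give_work(struct work_item *w)
{
    atomic_store_explicit(&work, w, memory_order_release);
}

On x86-64 the acquire load compiles to a plain mov, so I would expect it to spin just as fast as the volatile version, only with behaviour that is defined by the standard.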
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

bob wrote:There is no contradiction. Once ANY cpu does the write, NO cpu will get the old value. If a cpu does a read BEFORE the write, certainly they will get the old value. Whatever point you are trying to make is clearly ill-conceived.
If you're simply too lazy or otherwise unable to look at this in detail, I cannot help you.
I only assume you realize that the mov $1, x is NOT "the write".
Obviously this is the write. Load instructions immediately after this instruction on the same CPU correctly see the new value, at a time when other CPUs still read the old value.
With my volatile example, it might miss one read due to the race, but the next read will get the correct value, once the write is done.
Aha! So while the write "is in progress", the old value can still be read by some CPUs while other CPUs already see the new value. Woohoo. So much for your original statement...
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

bob wrote:
syzygy wrote:So we have agreement. My statement was correct all along.

Next time just don't immediately start blabbering "this is all wrong" without having a clue. Thank you.
I take issue with the following:
For multithreaded programs using volatile is not necessary if you are properly using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads.
If you want to add "using volatile is not necessary AND you don't care about high performance...", then I would agree.
I meant exactly what I wrote. Nothing more, nothing less.

I did not write that always using locks when accessing shared data is the most efficient way. I did not write that, I did not imply that. Do not pretend that I did.

When you originally wrote "all wrong" you were clearly claiming that not using volatile would result in incorrect behaviour (the compiler not reloading from memory etc.). As you now slowly seem to accept, POSIX guarantees that values are reloaded when necessary. That's all.

If you read the beginning of this thread, you might understand why I made that statement. I simply responded to a question from Lucas. Sanpei apparently consistently uses locks, but the shared variables are still marked volatile. That is not necessary.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

AlvaroBegue wrote:
bob wrote: My example is trivial. A thread needs to wait for work to be made available. I do NOT want it to block; it takes too long to wake it up. So I simply have it spin with "while (!work);", where work has to be volatile. No lock needed, no nothing. When someone wants to give this thread work, they simply store a pointer into "work" and the thread starts searching within nanoseconds.

How to do that without a volatile spin loop? Barriers/condition variables/mutex locks don't offer any high-performance alternative. I can think of lots of programming examples where locks with no volatile value STILL have issues unless you require that ALL accesses to a value, whether they are reads OR writes, must use a lock. Even more of a performance loss.
Would making that variable atomic (as in C11 or C++11) work? Would it be any less efficient?
I am going to make an attempt at porting my tablebase generator and chess engine to C11. I am currently using a lot of inline assembly for lock-free synchronisation. It seems C11 allows me to do the same within the language standard. The result should have about the same performance and be portable to systems with arbitrary memory models.
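For instance, the usual inline-asm idioms seem to map over directly (a generic sketch of the kind of translation I mean, not the generator's actual code; the names are invented):

Code: Select all

#include <stdatomic.h>
#include <stdint.h>

static _Atomic uint64_t positions_done = 0;

/* Was: inline asm "lock xadd". The compiler now emits the locked
   instruction on x86 and whatever fences other targets need. */
void count_position(void)
{
    atomic_fetch_add_explicit(&positions_done, 1, memory_order_relaxed);
}

/* Was: inline asm "lock cmpxchg". Store v only if the slot still
   holds 0, i.e. nobody else published a value first. */
int publish_if_unset(_Atomic uint64_t *slot, uint64_t v)
{
    uint64_t expected = 0;
    return atomic_compare_exchange_strong_explicit(
        slot, &expected, v,
        memory_order_release, memory_order_relaxed);
}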
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

syzygy wrote:More examples are here:
Welcome to the world of special relativity! Two observers (cores) might see two events (memory writes) in a different order. Even on an x86.
Two CPUs may see writes by two other CPUs in a different order.
Let's say CPU1 writes to x. CPU2 writes to y.

CPU3 observes the write to x before the write to y.
This can only mean that the write to x happened before the write to y, right? As you are saying, the writes propagate immediately.

CPU4 observes the write to y before the write to x.
So this can only mean that the write to x happened after the write to y.

Funny, isn't it?
This one is actually not correct for x86.
Intel Manual 8.2.3.7 wrote:As noted in Section 8.2.3.5, the memory-ordering model allows stores by two processors to be seen in different orders by those two processors. However, any two stores must appear to execute in the same order to all processors other than those performing the stores.
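For reference, the example above is the classic IRIW ("independent reads of independent writes") litmus test. A C11 sketch of what the four CPUs run (variable names invented):

Code: Select all

#include <stdatomic.h>

static _Atomic int x = 0, y = 0;
static int r1, r2, r3, r4;

void cpu1(void) { atomic_store_explicit(&x, 1, memory_order_relaxed); }
void cpu2(void) { atomic_store_explicit(&y, 1, memory_order_relaxed); }

void cpu3(void)                 /* might see the write to x first */
{
    r1 = atomic_load_explicit(&x, memory_order_relaxed);
    r2 = atomic_load_explicit(&y, memory_order_relaxed);
}

void cpu4(void)                 /* might see the write to y first */
{
    r3 = atomic_load_explicit(&y, memory_order_relaxed);
    r4 = atomic_load_explicit(&x, memory_order_relaxed);
}

/* The outcome r1==1, r2==0, r3==1, r4==0 means cpu3 and cpu4
   disagreed on the order of the two stores. Relaxed atomics allow
   this in general; per the quoted section 8.2.3.7, x86 forbids it
   for ordinary stores, and making every access memory_order_seq_cst
   forbids it on any architecture. */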
hgm
Posts: 27817
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: volatile?

Post by hgm »

What the manual says seems equivalent to saying: "the order of stores is globally consistent, except that processors might misplace their own stores in this ordering". So I wonder how much this really has to do with cache coherence, and how much of it is just due to out-of-order execution.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

hgm wrote:What the manual says seems equivalent to saying: "the order of stores is globally consistent, except that processors might misplace their own stores in this ordering". So I wonder how much this really has to do with cache coherence, and how much of it is just due to out-of-order execution.
It seems to me that the main factor is that you don't want a read that can be served now out of cache, or even out of main memory, to have to wait for a store that, at least logically, has already been performed by another processor. That other processor already sees its own stores (in its store buffer, or in its cache), but it will take a bit of time to synchronise the rest of the system with respect to that store.

The very real effect is that at the same time t one processor sees the last value it wrote, while another processor still sees the old value. Usually this won't be noticeable, because CPUs aren't clocking all their operations and comparing the results, but it becomes noticeable in the case of write/read interactions as in the examples.

The Wikipedia entry for MESI mentions (as necessary improvements upon the naive implementation) store buffers and invalidation buffers:
A store buffer is used when writing to an invalid cache line. Since the write will proceed anyway, the CPU issues a read-invalid message (hence the cache line in question and all other CPUs' cache lines which store that address of memory are invalidated) and then pushes the write into the store buffer, to be executed when the cache line finally arrives. (When trying to read cache lines, a CPU will scan its own store buffer, in case it has something ready to write to the cache.)

Consequently, a CPU can from its point of view have written something, but it isn't yet in the cache and so other CPUs *cannot see this* - they cannot scan the store buffer of other CPUs.

With regard to invalidation, CPUs implement invalidate queues, whereby incoming invalidate requests are instantly acknowledged but not in fact acted upon; they instead simply enter an invalidation queue, and their processing occurs as soon as possible (but not necessarily instantly). As such, a CPU can have in its cache a line which is invalid, but where it doesn't yet know that line is invalid - the invalidation queue contains the invalidation which hasn't yet been acted upon. (The invalidation queue is on the other "side" of the cache; the CPU can't scan it, as it can the store buffer).
So if the written value is in the store buffer, it has not yet entered the memory system. But from the point of view of the writing processor, the new value was already written.

If the invalidate request is in a reading processor's invalidation queue, the newly stored value may really be in DRAM. The processors that have processed the invalidate requests see it. The processors that have not yet processed it will still see the old value in their own cache.

Bob may now say "I told you, there can only be one value in DRAM at a time". He's right :-)
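The store-buffer effect itself shows up in the even simpler store-buffering litmus test, which does fail on x86 (a C11 sketch; names invented):

Code: Select all

#include <stdatomic.h>

static _Atomic int x = 0, y = 0;
static int r1, r2;

void cpu1(void)
{
    atomic_store_explicit(&x, 1, memory_order_relaxed); /* sits in the store buffer */
    r1 = atomic_load_explicit(&y, memory_order_relaxed);
}

void cpu2(void)
{
    atomic_store_explicit(&y, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&x, memory_order_relaxed);
}

/* r1==0 && r2==0 is possible even on x86: each load is served from
   cache before the other CPU's store has left its store buffer.
   An mfence between store and load on each side (or seq_cst atomics)
   rules that outcome out. */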
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

syzygy wrote:
bob wrote:There is no contradiction. Once ANY cpu does the write, NO cpu will get the old value. If a cpu does a read BEFORE the write, certainly they will get the old value. Whatever point you are trying to make is clearly ill-conceived.
If you're simply too lazy or otherwise unable to look at this in detail, I cannot help you.
I only assume you realize that the mov $1, x is NOT "the write".
Obviously this is the write. Load instructions immediately after this instruction on the same CPU correctly see the new value, at a time when other CPUs still read the old value.
With my volatile example, it might miss one read due to the race, but the next read will get the correct value, once the write is done.
Aha! So while the write "is in progress", the old value can still be read by some CPUs while other CPUs already see the new value. Woohoo. So much for your original statement...
My original statement: ONCE the write is done, everybody gets the new value. That has not changed one scintilla. You want to quibble about internal reordering. Go ahead. Volatile works EXACTLY as I said. If you want to eliminate the reordering issues, use a lock prefix or the exchange instruction, which has no such problem at all, but at significant overhead.

It is quite obvious you don't understand how even high-performance spin locks use both plain reads AND locked read/writes to avoid soaking up the cache controller's time that could go to useful work. All thanks to volatile.
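The pattern is the classic test-and-test-and-set spin lock; a minimal sketch, written here with C11 atomics in place of volatile plus inline asm (names invented):

Code: Select all

#include <stdatomic.h>

static _Atomic int lock = 0;            /* 0 = free, 1 = held */

void spin_acquire(void)
{
    for (;;) {
        /* The locked read/write (an xchg on x86): only one CPU
           can get the 0; everyone else gets the 1. */
        if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
            return;
        /* The plain reads: spin locally on the cached line, which
           generates no bus traffic until the holder's release store
           invalidates it. This is what keeps the cache controller
           free for useful work. */
        while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
            ;                           /* read-only spin */
    }
}

void spin_release(void)
{
    atomic_store_explicit(&lock, 0, memory_order_release);
}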
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

hgm wrote:What the manual says seems equivalent to saying: "the order of stores is globally consistent, except that processors might misplace their own stores in this ordering". So I wonder how much this really has to do with cache coherence, and how much of it is just due to out-of-order execution.
The only thing that matters for cache coherence is when the cache sees the write; NOBODY will get the old value after that. And if you want to go all the way to using the lock prefix, or an xchg instruction, it works correctly for ALL cases, albeit at a significant performance hit...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

syzygy wrote:
hgm wrote:What the manual says seems equivalent to saying: "the order of stores is globally consistent, except that processors might misplace their own stores in this ordering". So I wonder how much this really has to do with cache coherence, and how much of it is just due to out-of-order execution.
It seems to me that the main factor is that you don't want a read that can be served now out of cache, or even out of main memory, to have to wait for a store that, at least logically, has already been performed by another processor. That other processor already sees its own stores (in its store buffer, or in its cache), but it will take a bit of time to synchronise the rest of the system with respect to that store.
Not at all. Once the cache has it, there are no old copies around. A store instantly invalidates that data in other caches. The headache is between the CPU executing a non-atomic write like mov and the time that elapses before the CPU actually delivers the data to cache. Once it gets there, everything is coherent, within whatever time frame the cache bus can deliver the invalidate to other caches. Doesn't change the use of volatile as I have been explaining (and using).



The very real effect is that at the same time t one processor sees the last value it wrote, while another processor still sees the old value. Usually this won't be noticeable, because CPUs aren't clocking all their operations and comparing the results, but it becomes noticeable in the case of write/read interactions as in the examples.

The Wikipedia entry for MESI mentions (as necessary improvements upon the naive implementation) store buffers and invalidation buffers:
A store buffer is used when writing to an invalid cache line. Since the write will proceed anyway, the CPU issues a read-invalid message (hence the cache line in question and all other CPUs' cache lines which store that address of memory are invalidated) and then pushes the write into the store buffer, to be executed when the cache line finally arrives. (When trying to read cache lines, a CPU will scan its own store buffer, in case it has something ready to write to the cache.)
MESI is "long-gone" in Intel. MESIF is the game now, which helps even on reads.

Consequently, a CPU can from its point of view have written something, but it isn't yet in the cache and so other CPUs *cannot see this* - they cannot scan the store buffer of other CPUs.

With regard to invalidation, CPUs implement invalidate queues, whereby incoming invalidate requests are instantly acknowledged but not in fact acted upon; they instead simply enter an invalidation queue, and their processing occurs as soon as possible (but not necessarily instantly). As such, a CPU can have in its cache a line which is invalid, but where it doesn't yet know that line is invalid - the invalidation queue contains the invalidation which hasn't yet been acted upon. (The invalidation queue is on the other "side" of the cache; the CPU can't scan it, as it can the store buffer).
So if the written value is in the store buffer, it has not yet entered the memory system. But from the point of view of the writing processor, the new value was already written.

If the invalidate request is in a reading processor's invalidation queue, the newly stored value may really be in DRAM. The processors that have processed the invalidate requests see it. The processors that have not yet processed it will still see the old value in their own cache.

Bob may now say "I told you, there can only be one value in DRAM at a time". He's right :-)
No, I said this works. And if you are concerned with that invalidate delay, just use a lock prefix. Poof, it is gone. That's why spin locks use the xchg instruction: it absolutely guarantees that one CPU gets the 0 and the rest get the 1 the first one wrote back. Without fail, guaranteed.

Fortunately, one can do clever things with volatile and not have to take the performance hit. It doesn't matter to me when there is one writer and one reader, the writer writes a new non-zero value at time t, and the reader gets the old value for N cycles. That doesn't hurt a thing, and it is faster than what the xchg will do; with xchg the delay becomes zero, but the overhead of doing the read/write (or just the write) atomically is bad.
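The one-writer/one-reader case in a nutshell (a minimal C11 sketch; the stop flag and function names are invented):

Code: Select all

#include <stdatomic.h>

static _Atomic int stop = 0;    /* one writer, any number of readers */

/* Polled by the searcher, e.g. once per node. A relaxed load is a
   plain mov on x86; reading a stale 0 for N cycles just means a few
   extra nodes get searched, which is harmless and far cheaper than
   an xchg on every poll. */
int should_stop(void)
{
    return atomic_load_explicit(&stop, memory_order_relaxed);
}

/* The single writer. No lock prefix needed: with only one writer
   there is no race to win. */
void stop_search(void)
{
    atomic_store_explicit(&stop, 1, memory_order_relaxed);
}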