volatile?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:Look at my two (identical) statements.
syzygy wrote:For multithreaded programs using volatile is not necessary if you are properly using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads.
syzygy wrote:If "lock" is based on phtreads primitives or a similar library, which is the case I am talking about, then "lock" acts as a memory barrier. The compiler will reload "variable". No need for making it volatile.
They are correct.

Agreed?
No. If lock is based on ANY external procedure call, the compiler will reload, after the call, any global variables it referenced prior to the call. Even with no lock at all, the compiler will reload ANY global variable across an external call, UNLESS the compiler has access to the source code for the procedures being called so that it can directly determine which global variables (if any) might be modified.
Do you understand "using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads"?

POSIX guarantees that it works. Just treat it as a black box. Use the pthread primitives the way you are supposed to use them. Then it works. No volatile needed. Just check the POSIX standard.

And SURE the compiler will reload any global variable referenced prior to external calls, but what does that have to do with anything? Do you mean that not needing "volatile" is just a matter of LUCK because pthreads JUST SO HAPPENED to implement pthread_mutex_lock() as a function call?

Maybe that is not a matter of luck? Maybe the implementor was implementing a standard? And he chose this way to implement a (compiler) memory barrier?

Read those two sentences again. They are based directly on the POSIX standard. Is the POSIX standard lying?
I will repeat, you CAN do without volatile. You CAN do without a lawn mower. Both will cause extra work/overhead, however. But used correctly, volatile GREATLY improves efficiency, eliminates the need for unnecessary locking and unlocking, and the resulting cache thrashing. Period. And it works perfectly, done correctly. You have your head stuck so far up the standards bunghole you can't see any other way to do things more efficiently. I am not bound by such a small box. Volatile is part of the C standard. It works quite well.

Anything CAN be broken, if one tries hard enough.
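
As a concrete illustration of the compiler-reload point both sides keep circling, here is a minimal sketch (hypothetical names, not from either poster). Without volatile and without an opaque external call, the compiler is free to hoist the load of a global out of a loop; an opaque call forces a reload.

Code: Select all

/* flag is set by another thread; it is neither volatile nor atomic. */
int flag;

void external_call(void);  /* defined in another translation unit */

void spin_may_hang(void)
{
    /* The compiler sees nothing in the loop body that can change flag,
       so it may load flag once into a register and spin forever. */
    while (!flag)
        ;
}

void spin_reloads(void)
{
    /* external_call() is opaque to the compiler, so it must assume the
       call can modify flag, and it reloads flag on every iteration. */
    while (!flag)
        external_call();
}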
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:Look at my two (identical) statements.
syzygy wrote:For multithreaded programs using volatile is not necessary if you are properly using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads.
syzygy wrote:If "lock" is based on phtreads primitives or a similar library, which is the case I am talking about, then "lock" acts as a memory barrier. The compiler will reload "variable". No need for making it volatile.
They are correct.

Agreed?
No. If lock is based on ANY external procedure call, the compiler will reload, after the call, any global variables it referenced prior to the call. Even with no lock at all, the compiler will reload ANY global variable across an external call, UNLESS the compiler has access to the source code for the procedures being called so that it can directly determine which global variables (if any) might be modified.
Do you understand "using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads"?

POSIX guarantees that it works. Just treat it as a black box. Use the pthread primitives the way you are supposed to use them. Then it works. No volatile needed. Just check the POSIX standard.

And SURE the compiler will reload any global variable referenced prior to external calls, but what does that have to do with anything? Do you mean that not needing "volatile" is just a matter of LUCK because pthreads JUST SO HAPPENED to implement pthread_mutex_lock() as a function call?

Maybe that is not a matter of luck? Maybe the implementor was implementing a standard? And he chose this way to implement a (compiler) memory barrier?

Read those two sentences again. They are based directly on the POSIX standard. Is the POSIX standard lying?
I will repeat, you CAN do without volatile.
OK!

So we have agreement. My statement was correct all along.

Next time just don't immediately start blabbering "this is all wrong" without having a clue. Thank you.
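
For readers following along, a minimal sketch of the pthreads pattern syzygy is defending, with hypothetical names: the shared variable is only touched between pthread_mutex_lock() and pthread_mutex_unlock(), which POSIX defines as synchronizing functions, so no volatile is required.

Code: Select all

#include <pthread.h>

static pthread_mutex_t mtx = PTHREAD_MUTEX_INITIALIZER;
static int shared;  /* plain int, no volatile */

void set_shared(int v)
{
    pthread_mutex_lock(&mtx);   /* acts as a compiler and CPU memory barrier */
    shared = v;
    pthread_mutex_unlock(&mtx); /* publishes the store to the next locker */
}

int get_shared(void)
{
    pthread_mutex_lock(&mtx);   /* forces a fresh load of shared */
    int v = shared;
    pthread_mutex_unlock(&mtx);
    return v;
}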
lauriet
Posts: 199
Joined: Sun Nov 03, 2013 9:32 am

Re: volatile?

Post by lauriet »

WOW !!!
Here we go again.
Another example of how C/C++ is confusing, not clearly defined, indeterminate, and of how really clever people keep arguing about how it and its compilers work.

Well, cut off my nuts and call me aunty, but I'll stick with the Wirth family of languages :D
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

lauriet wrote:WOW !!!
Here we go again.
Another example of how C/C++ is confusing, not clearly defined, indeterminate, and of how really clever people keep arguing about how it and its compilers work.

Well, cut off my nuts and call me aunty, but I'll stick with the Wirth family of languages :D
C/C++ before C11/C++11 is simply undefined for multithreading. C11/C++11 define it.

If you don't use C11/C++11, or don't use its facilities for multithreading, you have to fall back on something like POSIX. Then you again have a well-defined multithreaded environment to work in.

In all cases, you have to either stick to the rules of the relevant standard, or really know what you're doing and accept that the program won't be portable across compilers and/or platforms.

I'm pretty sure the situation is not "better" for Pascal or any other somewhat efficient language. Even inefficient languages can give anyone a headache when trying to get a multithreaded algorithm to work correctly.
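
For concreteness, a sketch of what using the C11 facilities can look like: a flag-and-payload handoff with <stdatomic.h>. All names here are illustrative, not from any post above.

Code: Select all

#include <stdatomic.h>
#include <stdbool.h>

atomic_bool ready;  /* starts false */
int payload;        /* ordinary data guarded by the flag */

void producer(void)
{
    payload = 42;
    /* release: all earlier stores become visible to a matching acquire load */
    atomic_store_explicit(&ready, true, memory_order_release);
}

void consumer(void)
{
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;  /* well-defined spin under C11, no volatile needed */
    /* payload is guaranteed to read 42 here */
}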
phenri
Posts: 284
Joined: Tue Aug 13, 2013 9:44 am

Re: volatile?

Post by phenri »

lauriet wrote:WOW !!!
Here we go again.
Another example of how C/C++ is confusing, not clearly defined, indeterminate, and of how really clever people keep arguing about how it and its compilers work.

Well, cut off my nuts and call me aunty, but I'll stick with the Wirth family of languages :D
lol, 12 pages, all that for this... it deserves a comic book inspired by this thread, with Bob and RDM as the main actors. :lol: But we need a conclusion. Otherwise it will turn into a TV show.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

If you have nothing to say...
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: volatile?

Post by syzygy »

bob wrote:
syzygy wrote:Now CPU1 performs a write and a read at times A and B with A < B:
A. mov $1, [x] (a write to [x])
B. mov [y], %eax (a read from [y])

At roughly the same time, CPU2 also performs a write and a read at time C and D with C < D:
C. mov $1, [y] (a write to [y])
D. mov [x], %eax (a read from [x])
What you do is write to x, read from y in one cpu. In the other you write to y, then read from x.
Correct. One CPU writes to x, another reads from x. Same for y.
That has absolutely NOTHING to do with what I wrote.
Oh yes it does:
bob wrote:On Intel, the value you read will be the LAST value written by any other CPU. That's guaranteed.
I understand this as "the value a CPU reads from x will be the LAST value written by any other CPU to x" and "the value a CPU reads from y will be the LAST value written by any other CPU to y".

If you apply this to the example, you get a contradiction.

More examples are here:
Welcome to the world of special relativity! Two observers (cores) might see two events (memory writes) in different order. Even on an x86.
Two CPUs may see writes by two other CPUs in different order.
Let's say CPU1 writes to x. CPU2 writes to y.

CPU3 observes the write to x before the write to y.
This can only mean that the write to x happened before the write to y, right? As you are saying, the writes propagate immediately.

CPU4 observes the write to y before the write to x.
So this can only mean that the write to x happened after the write to y.

Funny, isn't it?
The technical explanation is that, instead of one bus serializing all memory accesses, each core uses its own memory cache. Writes propagate from one cache to another with finite speed (measured in clock cycles). Reads snoop around the caches before going to shared memory. All this caching is done because a trip to main memory is very expensive: hundreds of cycles.
Another example:
Consider the following example from the x86 spec. Initially x == y == 0.

Code: Select all

Processor 0     Processor 1
mov [_x], 1     mov [_y], 1
mov r1, [_x]    mov r3, [_y]
mov r2, [_y]    mov r4, [_x]
The result r2 == 0 and r4 == 0 is allowed.

Let’s analyze that. First of all, because reads and writes to the same location are never reordered, r1 and r3 must both be 1.

If r2 == 0, we would naturally conclude that the write to y at processor 1 hasn’t happened yet. It means: processor 1 hasn’t executed its code yet. When it does, it should see the earlier write to x (processor 0 saw it!). Guess what: sometimes it doesn’t.

The technical reason for this behavior is that the write to x is immediately visible at processor 0, before it propagates to processor 1. Similarly, the write to y is immediately visible at processor 1, before it propagates to processor 0. So it is possible for both processors to see each other’s writes delayed, even though they see their own writes immediately.
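
The quoted example is the classic store-buffering litmus test. A sketch of the same test in C11 atomics (thread creation omitted): with the default seq_cst ordering the outcome r2 == 0 && r4 == 0 is forbidden, because all seq_cst operations fall into one total order; with relaxed ordering, or with the plain x86 mov instructions above, it is allowed.

Code: Select all

#include <stdatomic.h>

atomic_int x, y;        /* both start at 0 */
int r1, r2, r3, r4;

void processor0(void)
{
    atomic_store(&x, 1);   /* seq_cst by default */
    r1 = atomic_load(&x);
    r2 = atomic_load(&y);
}

void processor1(void)
{
    atomic_store(&y, 1);
    r3 = atomic_load(&y);
    r4 = atomic_load(&x);
}

/* Under seq_cst, one of the two stores is first in the total order, and
   the other thread's final load must see it, so r2 and r4 cannot both
   be 0. Replace the operations with memory_order_relaxed and the
   forbidden outcome becomes possible again, as on bare x86. */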
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: volatile?

Post by michiguel »

AlvaroBegue wrote:[Disclaimer: I haven't implemented a multi-threaded chess engine, so perhaps I don't know anything.]
zamar wrote:Of course this doesn't make any sense. But the standard allows a C++ compiler to do almost anything as long as the result in single-threaded mode would be correct.

So:
- Mathematically speaking, there exists a standard-compliant compiler that would miscompile SF's TTEntry handling.
- Practically speaking, I wouldn't worry about this.
The standard allows for worse things than that:
Some part of the C++11 standard wrote:The execution of a program contains a data race if it contains two conflicting actions in different threads, at least one of which is not atomic, and neither happens before the other. Any such data race results in undefined behavior.
So at this point you are in undefined behavior land and anything can happen, whether it would be correct in single-threaded code or not.

It's hard to imagine a practical situation where a compiler will screw you over on this, but I have an uneasy feeling when I rely on my lack of imagination for safety.

So how bad is it to actually lock accesses to the hash table? I can see two possible problems:
(1) Too much contention for mutexes.
(2) Locking and unlocking operations are simply too slow, regardless of contention.

(1) can be ameliorated by dividing the transposition table into chunks and having a separate mutex for each chunk. So perhaps (2) is really the problem. Does anyone have some measure of how bad it is?
SJ Edwards proposed something like that a while ago. I implemented a similar concept yesterday and the penalty, on a single core, is 4% if I use a mutex, and less than 1% if I use a spin lock (I did not time these runs many times, so this is very approximate). This really depends on the implementation, I guess. With four threads, it is very difficult to see a difference. Sounds like a very safe way of doing things. I will test it for a while, and maybe I will keep it.

It is something like

Code: Select all

{
    mythread_spinx_lock   (&HT_frag[po->hashsig & (MAX_HTLOCKS - 1)].lock);
    // store in HT
    mythread_spinx_unlock (&HT_frag[po->hashsig & (MAX_HTLOCKS - 1)].lock);
    return;
}
Where MAX_HTLOCKS is 256, but asserted to always be a power of two.

Miguel
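
A hypothetical layout consistent with Miguel's snippet above: mythread_spinx_t stands in for his spinlock wrapper (here guessed to be a POSIX spinlock), and the cache-line padding is an added guess to keep the per-fragment locks from false sharing, not something his post states.

Code: Select all

#include <pthread.h>

typedef pthread_spinlock_t mythread_spinx_t;   /* guess at the wrapper type */

#define MAX_HTLOCKS 256   /* must stay a power of two so the mask works */

typedef struct {
    mythread_spinx_t lock;
    /* pad to one lock per cache line, assuming the lock fits in 64 bytes */
    char pad[64 - sizeof(mythread_spinx_t)];
} ht_frag_t;

static ht_frag_t HT_frag[MAX_HTLOCKS];

/* po->hashsig & (MAX_HTLOCKS - 1) selects the fragment; this mask is
   why MAX_HTLOCKS is asserted to be a power of two. */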
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:Now CPU1 performs a write and a read at times A and B with A < B:
A. mov $1, [x] (a write to [x])
B. mov [y], %eax (a read from [y])

At roughly the same time, CPU2 also performs a write and a read at time C and D with C < D:
C. mov $1, [y] (a write to [y])
D. mov [x], %eax (a read from [x])
What you do is write to x, read from y in one cpu. In the other you write to y, then read from x.
Correct. One CPU writes to x, another reads from x. Same for y.
That has absolutely NOTHING to do with what I wrote.
Oh yes it does:
bob wrote:On Intel, the value you read will be the LAST value written by any other CPU. That's guaranteed.
I understand this as "the value a CPU reads from x will be the LAST value written by any other CPU to x" and "the value a CPU reads from y will be the LAST value written by any other CPU to y".

If you apply this to the example, you get a contradiction.

More examples are here:
Welcome to the world of special relativity! Two observers (cores) might see two events (memory writes) in different order. Even on an x86.
Two CPUs may see writes by two other CPUs in different order.
Let's say CPU1 writes to x. CPU2 writes to y.

CPU3 observes the write to x before the write to y.
This can only mean that the write to x happened before the write to y, right? As you are saying, the writes propagate immediately.

CPU4 observes the write to y before the write to x.
So this can only mean that the write to x happened after the write to y.

Funny, isn't it?
The technical explanation is that, instead of one bus serializing all memory accesses, each core uses its own memory cache. Writes propagate from one cache to another with finite speed (measured in clock cycles). Reads snoop around the caches before going to shared memory. All this caching is done because a trip to main memory is very expensive: hundreds of cycles.
Another example:
Consider the following example from the x86 spec. Initially x == y == 0.

Code: Select all

Processor 0     Processor 1
mov [_x], 1     mov [_y], 1
mov r1, [_x]    mov r3, [_y]
mov r2, [_y]    mov r4, [_x]
The result r2 == 0 and r4 == 0 is allowed.

Let’s analyze that. First of all, because reads and writes to the same location are never reordered, r1 and r3 must both be 1.

If r2 == 0, we would naturally conclude that the write to y at processor 1 hasn’t happened yet. It means: processor 1 hasn’t executed its code yet. When it does, it should see the earlier write to x (processor 0 saw it!). Guess what: sometimes it doesn’t.

The technical reason for this behavior is that the write to x is immediately visible at processor 0, before it propagates to processor 1. Similarly, the write to y is immediately visible at processor 1, before it propagates to processor 0. So it is possible for both processors to see each other’s writes delayed, even though they see their own writes immediately.
There is no contradiction. Once ANY cpu does the write, NO cpu will get the old value. If a cpu does a read BEFORE the write, certainly it will get the old value. Whatever point you are trying to make is clearly ill-conceived.

Stop "naturally concluding" anything. Once a write is done, the data is invalidated in ALL other caches. NOBODY can fetch the old value once the write has been done. I don't see why that is so hard to grasp. Nobody is talking about the race between a read and a write. Just what happens AFTER the write has been completed. It works perfectly.

I assume you realize that the mov $1, x is NOT "the write". That is a request to the CPU to write the value. "The write" is done when the CPU actually does the write to L1. Nobody cares about the reordering, the delay between the mov and L1 getting updated, etc. That's a simple race. With my volatile example, it might miss one read due to the race, but the next read will get the correct value once the write is done.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: volatile?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:Look at my two (identical) statements.
syzygy wrote:For multithreaded programs using volatile is not necessary if you are properly using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads.
syzygy wrote:If "lock" is based on phtreads primitives or a similar library, which is the case I am talking about, then "lock" acts as a memory barrier. The compiler will reload "variable". No need for making it volatile.
They are correct.

Agreed?
No. If lock is based on ANY external procedure call, the compiler will reload, after the call, any global variables it referenced prior to the call. Even with no lock at all, the compiler will reload ANY global variable across an external call, UNLESS the compiler has access to the source code for the procedures being called so that it can directly determine which global variables (if any) might be modified.
Do you understand "using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads"?

POSIX guarantees that it works. Just treat it as a black box. Use the pthread primitives the way you are supposed to use them. Then it works. No volatile needed. Just check the POSIX standard.

And SURE the compiler will reload any global variable referenced prior to external calls, but what does that have to do with anything? Do you mean that not needing "volatile" is just a matter of LUCK because pthreads JUST SO HAPPENED to implement pthread_mutex_lock() as a function call?

Maybe that is not a matter of luck? Maybe the implementor was implementing a standard? And he chose this way to implement a (compiler) memory barrier?

Read those two sentences again. They are based directly on the POSIX standard. Is the POSIX standard lying?
I will repeat, you CAN do without volatile.
OK!

So we have agreement. My statement was correct all along.

Next time just don't immediately start blabbering "this is all wrong" without having a clue. Thank you.
I take issue with the following:

For multithreaded programs using volatile is not necessary if you are properly using the synchronisation primitives of a thread library, e.g. pthreads or C++11 threads.
If you want to add "using volatile is not necessary AND you don't care about high performance...", then I would agree.

My example is trivial. A thread needs to wait for work to be made available. I do NOT want it to block; it takes too long to wake it up. So I simply have it spin: "while (!work);" and work has to be volatile. No lock needed, no nothing. When someone wants to give this thread work, they simply store a pointer into "work" and the thread starts searching within nanoseconds.

How to do that without a volatile spin loop? Barriers/condition variables/mutex locks don't offer any high-performance alternative. I can think of lots of programming examples where locks with no volatile value STILL has issues unless you require that ALL accesses to a value, whether they are reads OR writes, must use a lock. More of a performance loss.