mar wrote: Compilers should IMHO really only optimize code in a way that produces the same output given the same input.
No, that would have way too much impact on optimisation possibilities. And in any case "same output" is not well-defined. Do you mean the same output as the previous version of the compiler at the same optimisation level? Or the same output as the same version of the compiler at a lower optimisation level?
Compilers may do whatever they want as long as they stick to the standards they are advertised as supporting (e.g. C99, pthreads) and as long as they honour any other explicit guarantees that the compiler writers made.
Here is part of the race condition. This is where the 128 bit entry is stored, using the pointer "replace" which points to some specific hash entry we have chosen to overwrite. Two writes to memory. This code is found in HashStore() in Crafty.
/*
 ************************************************************
 *                                                          *
 *   Now that we know which entry to replace, we simply     *
 *   stuff the values and exit. Note that the two 64 bit    *
 *   words are xor'ed together and stored as the signature  *
 *   for the "lockless-hash" approach.                      *
 *                                                          *
 ************************************************************
 */
replace->word1 = word1;
replace->word2 = temp_hashkey ^ word1;
There is another part of the race, because it is possible that one thread stores an entry while another thread reads it. That code is found in HashProbe().
Again, I iterate across one cache block, checking each of the four entries in a single "bucket" for a hash match. The second potential race occurs when one or more threads are storing at the same location where a probe is being done. It is possible that the above code fetches the old word1 value, but a new word2 value stored by someone else.
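In outline, the probe side of the trick might look like the following. This is a sketch with invented type and variable names, not Crafty's actual HashProbe() code; it only illustrates the decode-and-compare step described above:

```c
#include <stdint.h>

/* One 128-bit hash entry: word1 holds the data (best move, score,
   draft, age, ...), word2 holds the signature XORed with word1. */
typedef struct {
    uint64_t word1;
    uint64_t word2;
} HASH_ENTRY;

/* Scan one 4-entry bucket for a match against hashkey. Return a
   pointer to the matching entry, or NULL if nothing decodes. */
HASH_ENTRY *probe_bucket(HASH_ENTRY *bucket, uint64_t hashkey) {
    for (int i = 0; i < 4; i++) {
        uint64_t word1 = bucket[i].word1;
        uint64_t word2 = bucket[i].word2;
        /* A torn entry (old word1 paired with a new word2, or vice
           versa) fails this test and is simply ignored. */
        if ((word1 ^ word2) == hashkey)
            return &bucket[i];
    }
    return 0;
}
```

Note that a torn read is indistinguishable from a plain miss here, which is exactly why the race is harmless: the entry just fails to match.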
OK, code given and explained. Now on to the big question: how can the compiler detect the race condition?
1. Can it prove that two different threads store at exactly the same memory address? Absolutely not. It can't even predict which table entry will be overwritten since that is a run-time decision made by HashStore().
2. Can it prove that two different threads will have the same pointer value when they use different pointer variables? How? Again, not possible at compile time.
3. If it COULD prove (1) or (2) above, could it prove there is no atomic lock in place? The lock could be acquired and released OUTSIDE the hash procedure. Here's an example from Crafty: I select a move using the SAME NextMove() procedure that is used in the non-parallel search, and simply acquire/release a lock before calling NextMove(). So when NextMove() is compiled, the compiler sees no lock and has no way to tell that the call is actually safe.
4. What if my hashing algorithm had each thread write into a different bucket member? No race? Can the compiler determine that they write into different entries? It can't even determine if said code is executed by multiple threads or not until run-time.
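The lock-outside-the-callee pattern from point (3) can be sketched as follows (hypothetical names; Crafty's actual NextMove() is far more involved). The point is that the callee contains no synchronization at all, so a compiler analyzing it in isolation cannot tell whether its callers serialize access:

```c
#include <pthread.h>

static pthread_mutex_t move_lock = PTHREAD_MUTEX_INITIALIZER;

/* Callee: identical to the serial version, no locking anywhere.
   Compiled on its own, nothing here reveals that every parallel
   caller holds move_lock around the call. */
int next_move(int *move_list, int *next) {
    return move_list[(*next)++];
}

/* Parallel caller: the lock lives entirely outside the callee. */
int next_move_parallel(int *move_list, int *next) {
    pthread_mutex_lock(&move_lock);
    int move = next_move(move_list, next);
    pthread_mutex_unlock(&move_lock);
    return move;
}
```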
So, in summary, I see absolutely no way a compiler can recognize a race condition at compile time, given the C language as it stands today, using classic posix threads, classic fork(), or even classic MPI with shared memory. It is just not possible, in any form. Which means the standard is utter nonsense if you try to take it the way most here are.

In reality, what "undefined" means is that there are several possible outcomes, and the result can't be predicted. One example: two threads do i++ at nearly the same instant, with i starting at 0. Afterwards, i will be either 1 or 2, depending on how the race plays out. It is possible both fetch, increment and store, so that both read 0, increment to 1, and store 1. It is possible one finishes its store just before the second does its fetch, giving 2. That is classic "undefined" behavior. The compiler doesn't try to wreck things; it does EXACTLY what the program specifies, but the results are not guaranteed to be correct.

Would the above race kill a program? Depends on what i is used for. If it is a simple node counter, you would be off by one; in a 2B node tree, who cares? If it were an important value, it would cause a problem. A smart programmer will allow races where the value is non-critical, but not where it is. He accepts undefined behavior, and his program runs just as well today as yesterday.
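The i++ race described above can be demonstrated directly with POSIX threads (a sketch; strictly speaking the standard calls this undefined, so the bounds in the comments hold only in practice on common hardware):

```c
#include <pthread.h>

/* Shared counter, incremented by two threads with no synchronization.
   Each increment is a load, an add, and a store, so updates from the
   two threads can interleave and be lost. */
static long counter;

static void *bump_many(void *arg) {
    long n = *(long *)arg;
    for (long k = 0; k < n; k++)
        counter++;              /* unsynchronized read-modify-write */
    return 0;
}

/* Run two threads of n increments each and return the final count.
   With no race it would always be 2n; with the race it is usually
   somewhere below that, since some increments get lost. */
long race_demo(long n) {
    pthread_t t1, t2;
    counter = 0;
    pthread_create(&t1, 0, bump_many, &n);
    pthread_create(&t2, 0, bump_many, &n);
    pthread_join(t1, 0);
    pthread_join(t2, 0);
    return counter;
}
```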
Unless you are Apple, of course.
As I wrote, multiple times, there are two key parts here:
(1) the compiler can not detect the race. It can detect a _potential_ race, in some cases, but it can't be sure locks are not used before the procedure being compiled is called, and it certainly can't predict the execution timing to show they can/will happen simultaneously;
(2) the race doesn't cause any problems for the lockless hashing algorithm at all. An actual race will just render that entry as worthless until it gets overwritten with a matching pair of quadwords.
The decision I made is that (a) this is a very rare race condition, since the instruction path through a single node is several thousand instructions long, and the race requires two threads to synch up so that two of those several thousand instructions get executed at exactly the same instant in time; and (b) it is so unlikely that it is more desirable to simply lose an occasional hash match than to acquire/release locks, which just fry cache controllers. That potential performance hit is far worse than occasionally failing to find a possibly useful hash entry and having to replicate the work in light of the hash miss.
The ONLY way a compiler can break this is to assume it has found a race condition and then do something unexpected, you say? STILL not a problem, because unless the two quadwords make sense AND they match, so that the first can be decoded using the second, it won't hurt a thing. In short, this is not going to happen, for many reasons.
Again, this comes from someone who has written compilers, and who has debugged gcc multiple times over the years, back when they screwed up the long long support early on as I started to use it.
rbarreira wrote: Disclaimer: I'm not an expert on the C standard. But I'll try to shed some light on at least part of the question. Someone correct me if I'm wrong.
Let's assume that:
a) The compiler knows the only places that read and write the hash table are in the probing / hash insertion functions. (this is trivially true for a sufficiently advanced compiler)
b) The compiler is allowed to assume that code only ever runs on one thread simultaneously.
If this is the case, I think the compiler is allowed to eliminate the two XOR operations entirely, since they cancel out. This is because (AFAIK), the compiler only has to generate code that behaves "as if" the code was not optimized at all.
Absolutely not.
1. There are OTHER places that fiddle with the hash table, and they depend on that layout.
2. It is perfectly allowable to write the hash to disk and then restore it at some future point, possibly on a different operating system. It MUST be consistent.
3. Volatile forbids the compiler from changing anything about the variable's usage. That XOR might be critical to make a disk controller do the right thing, for example.
4. The compiler doesn't get to see ALL the code at once. I don't even believe it is physically possible to compile the Linux kernel as one large file, as one example.
The question is whether assumption b) above is true. I don't know enough about the standard at this moment to answer that. The reason why I think b) might be true is that the code is written without any library calls (like pthread, which immediately tells the compiler of potential side-effects).
PS: BTW, I think that the "volatile" keyword probably makes this optimization illegal, as the compiler is then not allowed to assume anything about the usage of that memory (e.g. it could be used by an external hardware device or memory mapping that the compiler can't analyze).
a) The compiler knows the only places that read and write the hash table are in the probing / hash insertion functions. (this is trivially true for a sufficiently advanced compiler)
b) The compiler is allowed to assume that code only ever runs on one thread simultaneously.
If this is the case, I think the compiler is allowed to eliminate the two XOR operations entirely, since they cancel out. This is because (AFAIK), the compiler only has to generate code that behaves "as if" the code was not optimized at all.
Yes, this is the general reasoning a compiler could make: at the least, it can locally optimize code as if it were single-threaded when it doesn't see any explicit synchronization primitives, and the elimination of two cancelling XORs is an allowed optimization. Cancellation of the XORs in one thread, combined with a write in another thread, would indeed break the hashing. Thanks, I didn't think of something so simple.
Of course, in real code the insert() and lookup() would have to be very close to each other for such inter-procedural optimizations to kick in, and you would need a very small ratio of hash size to number of cores in order to get bitten, but this thread is about the principle, not the frequency.
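Concretely, the transformation discussed here might look like the following (a sketch with invented names; no shipping compiler is claimed to do this today). Under a single-thread assumption, the written form and the rewritten form are observationally equivalent, which is what makes the rewrite legal under the as-if rule, and it is exactly the torn-entry detection that gets lost:

```c
#include <stdint.h>

typedef struct { uint64_t word1, word2; } ENTRY;

/* Written form: signature is XORed with the data on store,
   and XORed back out on probe. */
void store_xor(ENTRY *e, uint64_t key, uint64_t data) {
    e->word1 = data;
    e->word2 = key ^ data;
}
int probe_xor(const ENTRY *e, uint64_t key) {
    return (e->word1 ^ e->word2) == key;
}

/* Hypothetical rewritten form: if the compiler may assume
   single-threaded execution and can see both functions, the two
   XORs cancel. Correct for one thread; fatal for the lockless
   scheme, because a torn {word1, word2} pair now goes undetected. */
void store_plain(ENTRY *e, uint64_t key, uint64_t data) {
    e->word1 = data;
    e->word2 = key;
}
int probe_plain(const ENTRY *e, uint64_t key) {
    return e->word2 == key;
}
```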
You can not remove XORs on volatile data. You can't remove anything. If the XORs happened back to back, with NO intervening code, you STILL can not eliminate them if the target is a volatile variable.
Rein Halbersma wrote: Race detection is not really the issue. The problems arise if a compiler optimizes as if it were single-threaded code (for which there are no data races, by definition). If a multi-threaded program has a data race, such optimizations can lead to unexpected behavior.
Ah I finally get your point. I was a bit stuck on races.
As I said, in order to eliminate the xor trick the compiler would have to analyze the probing and storing code and go through every code path to take every possibility into account.
It could then say, ok this isn't necessary at all, so let's eliminate that and store something else (it would have to store "unxored" 1234 instead of 5678 for example).
Of course there could be many other tricks that the compiler must be able to detect statically, like pointer aliasing among other things.
This optimization would be very very expensive and would imply whole program optimization.
Another possibility would be instrumentation and PGO. But this would rely on exercising every possibility in the PGO test run, so it's unreliable too.
If you export the hashtable pointer (or any pointer that points somewhere inside) to a shared library, the compiler can't optimize anymore because it might break things.
So I still very much doubt you'll find a compiler that might break this anytime soon (if ever).
You stopped thinking too quickly.
volatile int *p;   /* pointer to volatile data: accesses through p may not be elided */
p = /* some address that talks to a device controller directly */;
*p ^= mask;        /* flip bits in the device register... */
*p ^= mask;        /* ...and flip them back */
can NOT be removed or optimized away. Flipping bits in a device controller register might be exactly what is needed to kick off an I/O operation. Volatile changes everything; otherwise we'd have no O/S written in C, and the above would have to be done only in asm, where the assembler doesn't optimize whatsoever.
syzygy wrote: Unless I'm mistaken, the C standard allows the compiler to load a value multiple times. Since the C99 standard does not know about threads, it may load the hash entry, do the xor-trick to verify that the entry can be used, and load the hash entry again. Between the two loads the hash entry may change its content due to other threads. There you go.
bob wrote: You can not remove XORs on volatile data. You can't remove anything. If the XORs happened back to back, with NO intervening code, you STILL can not eliminate them if the target is a volatile variable.
So is crafty's hashtable declared as volatile? I don't see it.
Lockless hashing addresses the need to store 128 bits of information when none of the bits may be changed arbitrarily. It protects against such arbitrarily changed bits by doing this:
I want to store quad-words A and B. A is the signature; B is the best move, score, draft, entry type, age, etc.
Rather than storing A and B directly, I first compute A' = A ^ B and then directly store A' and B.
When I do a hash probe, I require that B must go with the signature, because an illegal hash move or a bogus score could be an issue. So to do a probe, I grab the two quad-words for that entry; given that I just fetched X' and Y, I compute X = X' ^ Y and then compare the signature X against A to see if they match. If any part of this 128 bit entry gets changed, the signature will not match, the position is simply ignored, and the search continues.

The most common problem occurs when two threads do a store at the same time, one trying to store {A', B} while the other stores {C', D}. The two processors can interleave the writes to memory so that I can get any of the following: {A', B}, {A', D}, {C', B}, {C', D}. The first and last are OK: the entry is valid and the XOR will decode it correctly (A = A' ^ B, or C = C' ^ D). The other two combinations fail. For {A', D}, for example, A' ^ D won't produce A unless we have pure luck and B == D.
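The four interleavings can be enumerated directly (a small sketch with made-up values, just to check the claim above):

```c
#include <stdint.h>

/* The scheme stores the pair {A XOR B, B}; a probe recovers the
   signature by XORing the two stored words back together. */
static uint64_t decode(uint64_t w1, uint64_t w2) { return w1 ^ w2; }

/* Given two racing stores {A', B} and {C', D} with A' = A ^ B and
   C' = C ^ D, count how many of the four possible interleavings of
   the two memory writes decode to one of the real signatures. */
int valid_interleavings(uint64_t a, uint64_t b, uint64_t c, uint64_t d) {
    uint64_t w1s[2] = { a ^ b, c ^ d };   /* possible first words  */
    uint64_t w2s[2] = { b, d };           /* possible second words */
    int valid = 0;
    for (int i = 0; i < 2; i++)
        for (int j = 0; j < 2; j++) {
            uint64_t sig = decode(w1s[i], w2s[j]);
            if (sig == a || sig == c)
                valid++;
        }
    return valid;
}
```

With distinct data words only the two matched pairs decode; if by "pure luck" B == D, all four interleavings decode, exactly as stated above.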
Whether we talk about two 8-byte stores as above, or 16 1-byte stores, this works exactly the same. If you turn it into 128 1-bit stores, it still works exactly the same. All that is required is for the compiler to write the 128 bit entry to memory. Since the memory is global/shared, it has no choice, unless it somehow recognizes that this is a race where it can then introduce any sort of random undefined behavior. Even then, to cause harm it would somehow have to write 128 bits of data to memory so that the first 64 XORed with the last 64 still match a valid signature, while the other 64 bits are bogus enough to cause a problem.
I see absolutely no way a compiler can do this. Even if it arbitrarily stores just one half of the entry, nothing breaks.
Thanks Bob, for summarizing your paper so succinctly. Yes, indeed, if those were the 4 possibilities then I find it hard to spot a logical error in your argument. However, the Boehm paper posted by Ronald was a real eye-opener: it has an example of two threads writing the same value to a global variable, where the compiler applied a register optimization and instruction reordering (again, using the no-UB assumption) with the very surprising effect that the global was not modified at all.
It is that kind of optimization magic that could conceivably mess up the logic, but so far I haven't been able to conjure up something.
Was he using the volatile modifier? That's a key. If not, anything can happen. But with volatile, I see no way that the original value can last through two simultaneous modify attempts. As I said, you DO need to know what you are doing when you invite races, but if you do, they need not be crippling.
In my case, I have two choices...
(1) acquire/release a lock on each probe or store. This is a performance killer: with a significant number of threads, and typically one hash probe and one hash store per node, the number of lock accesses is enormous.
(2) fix it so that a broken entry causes no harm, using the lockless xor trick. Now if you get parts of two different entries in an entry, the entry won't match any normal signature and you simply lose the benefit of that hit.
(2) is MUCH cheaper than (1) in terms of performance. Hence tolerating a race condition since any of its possible "undefined results" can be detected and not break anything at all.
bob wrote: You can not remove XORs on volatile data. You can't remove anything. If the XORs happened back to back, with NO intervening code, you STILL can not eliminate them if the target is a volatile variable.
So is crafty's hashtable declared as volatile? I don't see it.
For hashing, no. Why? I don't object to the "let's keep it in a register once it has been loaded" optimization. In general, volatile is used when variables are being changed between threads and the new value is important. Here it is not: suppose you store before I probe? I get a benefit if it matches. If you store after I probe, I get no benefit, but nothing breaks.
In this case (HashStore() and HashProbe()), the XORs are FAR removed from each other. One in HashProbe() called from Search(), the other in HashStore() called from elsewhere in Search(), and table entries are modified from some other places for various things like flagging all entries as bad near a 50-move draw, making sure the PV is in the hash before starting the next iteration, and such. The compiler can't even tell if that data gets written to disk somewhere else and then gets read back in later. You can only optimize if you can absolutely prove with a data dependency graph that there is no intervening stuff between the first XOR and the second (and every other use of course).
syzygy wrote:
No, that would have way too much impact on optimisation possibilities. And in any case "same output" is not well-defined. Do you mean the same output as the previous version of the compiler at the same optimisation level? Or the same output as the same version of the compiler at a lower optimisation level?
Compilers may do whatever they want as long as they stick to the standards they are advertised as supporting (e.g. C99, pthreads) and as long as they honour any other explicit guarantees that the compiler writers made.
I meant that a function that writes to memory should write the same values after optimization.
Consider the following: say you have an application that uses a simple symmetric cipher to encrypt a memory buffer, and at some other point decrypts the same buffer using the same cipher. Say it does this only to protect some sensitive data in memory. The compiler may detect the pair and optimize both away. I wouldn't like it at all if the compiler optimized that away.
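The concern can be made concrete with a toy XOR cipher (a sketch, not production crypto; this is also why real scrubbing code uses primitives like C11's memset_s, which the compiler is forbidden to remove):

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* A toy symmetric cipher: encrypting and decrypting are the same
   operation, and applying it twice is a no-op. That is precisely
   what invites the feared optimization: if the compiler can prove
   the buffer is not read between the two calls, the as-if rule
   lets it delete both, leaving the plaintext in memory throughout. */
void xor_cipher(uint8_t *buf, size_t len, uint8_t key) {
    for (size_t i = 0; i < len; i++)
        buf[i] ^= (uint8_t)(key + i);   /* trivial position-dependent keystream */
}
```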