How could a compiler break the lockless hashing method?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

How could a compiler break the lockless hashing method?

Post by Rein Halbersma »

Over in the strcpy() thread, Hyatt's lockless hashing method was discussed. It is my understanding that this XOR trick, beautiful and innovative as it is, exhibits a data race and therefore undefined behavior as defined by the C Standard. Furthermore, as the classic paper on Double-Checked Locking by Alexandrescu and Meyers shows, modern compilers go out of their way to optimize programs and will typically find such data races even if the offending code is spread out over separately compiled modules. Finally, Ronald de Man posted a paper by Hans Boehm on how modern compilers, in the face of "benign" data races, can have the most stunning effects. E.g. two threads simultaneously writing the same value to a global variable can have the effect that this variable stays in its initial state! Bob, while not conceding that his lockless hashing even exhibits undefined behavior, or that a compiler could even detect it, also challenged me to show how a compiler could actually break the XOR trick. Now I am not knowledgeable enough about low-level compiler optimizations, and after doing some thinking on it, I came up empty.

So I am posing this question as a challenge to this forum: using compiler optimizations as shown in the Alexandrescu/Meyers and Boehm papers, how could the lockless hashing be broken by an optimizing compiler that assumes that a conforming program does not exhibit undefined behavior?

NOTE: please only serious attempts to answer this technical question. Please, no debate here on whether it is undefined behavior, or whether it could be detected. The Standard and the quoted papers IMO leave no doubt about that. And even if you don't agree with that, consider that point settled for the purposes of this thread. I want to find an answer to Bob's challenge because it has merit.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

Rein Halbersma wrote:Over in the strcpy() thread, Hyatt's lockless hashing method was discussed. It is my understanding that this XOR trick, beautiful and innovative as it is, exhibits a data race and therefore undefined behavior as defined by the C Standard. Furthermore, as the classic paper on Double-Checked Locking by Alexandrescu and Meyers shows, modern compilers go out of their way to optimize programs and will typically find such data races even if the offending code is spread out over separately compiled modules. Finally, Ronald de Man posted a paper by Hans Boehm on how modern compilers, in the face of "benign" data races, can have the most stunning effects. E.g. two threads simultaneously writing the same value to a global variable can have the effect that this variable stays in its initial state! Bob, while not conceding that his lockless hashing even exhibits undefined behavior, or that a compiler could even detect it, also challenged me to show how a compiler could actually break the XOR trick. Now I am not knowledgeable enough about low-level compiler optimizations, and after doing some thinking on it, I came up empty.

So I am posing this question as a challenge to this forum: using compiler optimizations as shown in the Alexandrescu/Meyers and Boehm papers, how could the lockless hashing be broken by an optimizing compiler that assumes that a conforming program does not exhibit undefined behavior?

NOTE: please only serious attempts to answer this technical question. Please, no debate here on whether it is undefined behavior, or whether it could be detected. The Standard and the quoted papers IMO leave no doubt about that. And even if you don't agree with that, consider that point settled for the purposes of this thread. I want to find an answer to Bob's challenge because it has merit.
Let me fill in the blanks.

Lockless hashing depends on the need to store 128 bits of information; the issue is that none of the bits can be changed arbitrarily.

Lockless hashing protects against these arbitrarily changed bits by doing this:

I want to store quad-words A and B. A is the signature; B is the best move, score, draft, entry type, age, etc.

Rather than storing A and B directly, I first compute A' = A ^ B and then directly store A' and B.

When I do a hash probe, I require that B must go with the signature, because an illegal hash move or a bogus score could be an issue. So to do a probe, I grab the two quad-words for that entry, call them X' and Y, compute X = X' ^ Y, and then compare the signatures A and X to see if they match. If any part of this 128-bit entry gets changed, the signature will not match, the position is simply ignored, and the search continues. The most common problem occurs when two threads do a store at the same time, one trying to store {A', B} while the other stores {C', D}. The two processors can interleave the writes to memory so that I can get any of the following: {A', B}, {A', D}, {C', B}, {C', D}. The first or last is OK: the entry is valid and the XOR will decode the signature correctly (A = A' ^ B, or C = C' ^ D). The other two combinations fail. For {A', D}, for example, A' ^ D won't produce A unless we have pure luck and B == D.
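A minimal sketch in C of the store/probe scheme just described (illustrative only, not Crafty's actual code; the entry layout, names and indexing are assumptions):

Code:

#include <stdint.h>

typedef struct {
    uint64_t key_xor_data;            /* A' = A ^ B                          */
    uint64_t data;                    /* B: packed move/score/draft/age/...  */
} HashEntry;

extern HashEntry table[];             /* shared transposition table          */

void ht_store(uint64_t idx, uint64_t sig, uint64_t data)
{
    table[idx].key_xor_data = sig ^ data;   /* store A'                      */
    table[idx].data         = data;         /* store B                       */
}

int ht_probe(uint64_t idx, uint64_t sig, uint64_t *data)
{
    uint64_t x = table[idx].key_xor_data;   /* fetch X'                      */
    uint64_t y = table[idx].data;           /* fetch Y                       */
    if ((x ^ y) != sig)                     /* signature mismatch: the entry */
        return 0;                           /* was torn, so ignore it        */
    *data = y;                              /* X' ^ Y matched, so Y belongs  */
    return 1;                               /* with this signature           */
}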

Whether we talk about two 8-byte stores as above, or 16 1-byte stores, this works exactly the same. If you turn it into 128 1-bit stores, it still works exactly the same. All that is required is for the compiler to write the 128-bit entry to memory. Since the memory is global/shared, it has no choice, unless it somehow recognizes that this is a race and then introduces some sort of random undefined behavior. The only way that could cause a problem would be if it somehow wrote 128 bits of data to memory such that the first 64 XORed with the last 64 still match a valid signature, while the other 64 bits are bogus.

I see absolutely no way a compiler can do this. Even if it arbitrarily stores just one half of the entry, nothing breaks.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: How could a compiler break the lockless hashing method?

Post by Rein Halbersma »

bob wrote: Let me fill in the blanks.

Lockless hashing depends on the need to store 128 bits of information; the issue is that none of the bits can be changed arbitrarily.

Lockless hashing protects against these arbitrarily changed bits by doing this:

I want to store quad-words A and B. A is the signature; B is the best move, score, draft, entry type, age, etc.

Rather than storing A and B directly, I first compute A' = A ^ B and then directly store A' and B.

When I do a hash probe, I require that B must go with the signature, because an illegal hash move or a bogus score could be an issue. So to do a probe, I grab the two quad-words for that entry, call them X' and Y, compute X = X' ^ Y, and then compare the signatures A and X to see if they match. If any part of this 128-bit entry gets changed, the signature will not match, the position is simply ignored, and the search continues. The most common problem occurs when two threads do a store at the same time, one trying to store {A', B} while the other stores {C', D}. The two processors can interleave the writes to memory so that I can get any of the following: {A', B}, {A', D}, {C', B}, {C', D}. The first or last is OK: the entry is valid and the XOR will decode the signature correctly (A = A' ^ B, or C = C' ^ D). The other two combinations fail. For {A', D}, for example, A' ^ D won't produce A unless we have pure luck and B == D.

Whether we talk about two 8-byte stores as above, or 16 1-byte stores, this works exactly the same. If you turn it into 128 1-bit stores, it still works exactly the same. All that is required is for the compiler to write the 128-bit entry to memory. Since the memory is global/shared, it has no choice, unless it somehow recognizes that this is a race and then introduces some sort of random undefined behavior. The only way that could cause a problem would be if it somehow wrote 128 bits of data to memory such that the first 64 XORed with the last 64 still match a valid signature, while the other 64 bits are bogus.

I see absolutely no way a compiler can do this. Even if it arbitrarily stores just one half of the entry, nothing breaks.
Thanks Bob, for summarizing your paper so succinctly. Yes, indeed, if those were the 4 possibilities, then I find it hard to spot a logical error in your argument. However, the Boehm paper posted by Ronald was a real eye-opener: it has an example of two threads writing the same value to a global variable, where the compiler applied a register optimization and instruction reordering (again, using the no-UB assumption) with the very surprising effect that the global was not modified at all.

It is that kind of optimization magic that could conceivably mess up the logic, but so far I haven't been able to conjure up something.
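For flavor, here is a hedged sketch of the kind of transformation Boehm describes (this is not the paper's exact example; the function and variable names are mine). Register promotion, which is legal for sequential code under pre-C11 assumptions, can introduce a write-back of a variable's old value, and that spurious store can undo another thread's write and leave the variable at its initial value:

Code:

int found = 0;                    /* shared flag, initially 0 */

/* What the programmer wrote: this function only ever writes 1 to 'found'. */
void scan(const int *data, int n, int key)
{
    for (int i = 0; i < n; i++)
        if (data[i] == key)
            found = 1;
}

/* What a compiler may legally produce for sequential code
   (promotion of 'found' into a register across the loop): */
void scan_promoted(const int *data, int n, int key)
{
    int tmp = found;              /* read the old value (possibly 0)     */
    for (int i = 0; i < n; i++)
        if (data[i] == key)
            tmp = 1;
    found = tmp;                  /* unconditional write-back: can store
                                     the stale 0 and wipe out a concurrent
                                     'found = 1' from another thread     */
}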
rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: How could a compiler break the lockless hashing method?

Post by rbarreira »

Disclaimer: I'm not an expert on the C standard. But I'll try to shed some light on at least part of the question. Someone correct me if I'm wrong.

Let's assume that:

a) The compiler knows the only places that read and write the hash table are in the probing / hash insertion functions. (this is trivially true for a sufficiently advanced compiler)

b) The compiler is allowed to assume that the code only ever runs on one thread at a time.

If this is the case, I think the compiler is allowed to eliminate the two XOR operations entirely, since they cancel out. This is because (AFAIK), the compiler only has to generate code that behaves "as if" the code was not optimized at all.

The question is whether assumption b) above is true. I don't know enough about the standard at this moment to answer that. The reason why I think b) might be true is that the code is written without any library calls (like pthread, which immediately tells the compiler of potential side-effects).

PS: BTW, I think that the "volatile" keyword probably makes this optimization illegal, as the compiler is then not allowed to assume anything about the usage of that memory (e.g. it could be used by an external hardware device or memory mapping that the compiler can't analyze).
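To make the volatile point concrete, a hedged sketch (reusing the hypothetical entry layout from the sketch earlier in the thread): with volatile-qualified members, every access must actually be performed, so the compiler cannot cache the loads or fold the two XORs away. Note, however, that volatile does not make the data race defined behavior under the C11 memory model; it only blocks this class of optimization.

Code:

#include <stdint.h>

typedef struct {
    volatile uint64_t key_xor_data;   /* signature ^ packed data       */
    volatile uint64_t data;           /* packed move/score/draft/age   */
} HashEntry;

extern HashEntry table[];             /* shared transposition table    */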
Last edited by rbarreira on Sun Dec 08, 2013 8:33 pm, edited 1 time in total.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: How could a compiler break the lockless hashing method?

Post by mar »

I think that no compiler to date can break lockless hashing. The compiler has no way to detect potential races IMHO; how many threads actually run depends on a runtime variable. I wouldn't like a compiler that breaks code that will run single-threaded.
Reordering shouldn't hurt either. I think it can't break on x86/x64 or ARM, the mainstream today.
It may not work on other hardware though, namely on hardware that would be able to detect races in hardware and to generate faults.
So I personally will continue using it until it breaks (if it breaks).
Even if the compiler optimizes the XORing away, it probably won't break anything either, because I always check the legality of moves coming from the hashtable.
EDIT: I doubt the compiler would eliminate the XORing, because it doesn't know that it can do that. The hashtable holds the signature XORed with the other word, and the probing code has the real signature as input. The compiler can't be clever enough to know that (maybe a super-advanced one might be in the future, but it'd have to do exhaustive, expensive analysis of everything).
I think we're talking science fiction here, but it may become true someday...
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: How could a compiler break the lockless hashing method?

Post by Rein Halbersma »

mar wrote:I think that no compiler to date can break lockless hashing. The compiler has no way to detect potential races IMHO; how many threads actually run depends on a runtime variable.
Race detection is not really the issue. The problems arise if a compiler optimizes as if it were single-threaded code (for which there are no data races, by definition). If a multi-threaded program has a data race, such optimizations can lead to unexpected behavior.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: How could a compiler break the lockless hashing method?

Post by Rein Halbersma »

rbarreira wrote: Let's assume that:

a) The compiler knows the only places that read and write the hash table are in the probing / hash insertion functions. (this is trivially true for a sufficiently advanced compiler)

b) The compiler is allowed to assume that the code only ever runs on one thread at a time.

If this is the case, I think the compiler is allowed to eliminate the two XOR operations entirely, since they cancel out. This is because (AFAIK), the compiler only has to generate code that behaves "as if" the code was not optimized at all.
Yes, this is the general reasoning a compiler could make, or at least it can locally optimize code as if it were single-threaded when it doesn't see any explicit synchronization primitives. Also, the elimination of two cancelling XORs is an allowed optimization. Cancellation of the XORs in one thread, combined with a write from another thread, would indeed break the hashing. Thanks, I didn't think of something so simple.

Of course, in real code, insert() and lookup() would have to be very close to each other for such inter-procedural optimizations to kick in, and you would need a very small ratio of hash size to number of cores in order to get bitten, but this thread is about the principle, not the frequency.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: How could a compiler break the lockless hashing method?

Post by mar »

Rein Halbersma wrote:Race detection is not really the issue. The problems arise if a compiler optimizes as if it were single-threaded code (for which there are no data races, by definition). If a multi-threaded program has a data race, such optimizations can lead to unexpected behavior.
Ah I finally get your point. I was a bit stuck on races.
As I said, in order to eliminate the XOR trick the compiler would have to analyze the probing and storing code and go through every code path to take every possibility into account.
It could then say: OK, this isn't necessary at all, so let's eliminate it and store something else (it would have to store the "un-XORed" 1234 instead of 5678, for example).
Of course there are many other complications the compiler would have to rule out statically, pointer aliasing among other things.
This optimization would be very very expensive and would imply whole program optimization.
Another possibility would be instrumentation and PGO. But this would rely on exercising every possibility in the PGO test run, so it's unreliable as well.
If you export the hashtable pointer (or any pointer that points somewhere inside) to a shared library, the compiler can't optimize anymore because it might break things.
So I still very much doubt you'll find a compiler that breaks this anytime soon (if ever).
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: How could a compiler break the lockless hashing method?

Post by Rein Halbersma »

mar wrote: This optimization would be very very expensive and would imply whole program optimization.
I think it's easy enough to trigger, assuming something like

Code:

/* slot(e) is a placeholder for whatever maps an entry to its table index */
void insert(Entry e)
{
    hash[slot(e)].key_value = e.key ^ e.value;
    hash[slot(e)].value = e.value;
}

bool matching_signature(Entry e)
{
    return e.key == (hash[slot(e)].key_value ^ hash[slot(e)].value);
}
a code fragment like this

Code:

insert(e);
// ... some non hash related stuff in between
if (matching_signature(e)) // optimized to true as if in single-threaded code
   read_and_process(e);    // data race if other thread has written
would only need simple inlining and common subexpression elimination to make matching_signature always return true in that fragment. A write from another thread to the same hash entry would then corrupt it without being detected.
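To spell that out, here is a hedged sketch of what the optimizer could plausibly turn the fragment into after inlining insert() and matching_signature() (assuming it can prove the code in between does not touch hash[]); this illustrates the reasoning and is not actual compiler output:

Code:

insert(e);
// ... some non hash related stuff in between

/* After inlining, the single-threaded view gives
 *   hash[slot(e)].key_value ^ hash[slot(e)].value
 *     == (e.key ^ e.value) ^ e.value
 *     == e.key,
 * so the signature check folds to 'true' and disappears: */
read_and_process(e);   // runs unconditionally; if another thread has
                       // overwritten hash[slot(e)] in the meantime,
                       // nothing detects it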
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: How could a compiler break the lockless hashing method?

Post by syzygy »

Unless I'm mistaken, the C standard allows the compiler to load a value multiple times. Since the C99 standard does not know about threads, it may load the hash entry, do the xor-trick to verify that the entry can be used, and load the hash entry again. Between the two loads the hash entry may change its content due to other threads. There you go.
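To illustrate that double-load scenario with a hedged sketch (entry layout and names as in the earlier sketches, not from any engine's code): in a single-threaded view a second load of the same location must return the same value, so the compiler is free to re-read the entry after the check instead of keeping the validated copy.

Code:

#include <stdint.h>

typedef struct { uint64_t key_xor_data, data; } HashEntry;
extern HashEntry table[];             /* shared transposition table */

uint64_t probe(uint64_t idx, uint64_t sig)
{
    /* What the programmer intends: validate a local copy, then use it. */
    uint64_t kv = table[idx].key_xor_data;
    uint64_t v  = table[idx].data;
    if (sig == (kv ^ v))
        return v;                 /* return the value that was checked */
    return 0;

    /* What the compiler may generate instead, e.g. under register
     * pressure (re-doing the loads):
     *
     *   if (sig == (table[idx].key_xor_data ^ table[idx].data))
     *       return table[idx].data;   // re-read: may now hold data
     *                                 // just written by another thread
     *   return 0;
     */
}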