How could a compiler break the lockless hashing method?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

rbarreira
Posts: 900
Joined: Tue Apr 27, 2010 3:48 pm

Re: How could a compiler break the lockless hashing method?

Post by rbarreira »

bob wrote:
rbarreira wrote:
wgarvin wrote: I don't believe the compiler writers would bother to do this stuff if it didn't actually speed up some actual programs out there. Their job is to generate the best possible code while complying with the spec. Our job is to write programs, but if we want those programs to work reliably, we also have to comply with the spec.
It's actually trivial to give examples of where not using Bob's favorite instruction (1-operand imul) results in better performance. The easiest would be a situation where the compiler is using the edx register and doesn't want it clobbered. A slightly more complicated one would be auto-vectorization of code, where the pmulld SSE instruction would be used.
Read my post AGAIN. If you do a divide, you ARE going to use edx. Period. Therefore a single-operand mul makes perfect sense since you need a quadword to do a 32 bit idiv anyway. If you don't do that form of imul, you get to move the number to eax, then follow that with a cdq (another instruction that sign-extends eax into edx), then do the idiv.

So with a divide, you use edx or else don't do the divide. Unless you decide to go to SSE or whatever. Which is yet another can of worms.
But if I write "x * y / 2" then idiv is not needed. So that's just wrong.
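To make the point concrete, a minimal sketch; the function name is illustrative, and the assembly in the comment is typical gcc/clang x86-64 output, which varies with compiler version and flags:

Code: Select all

  /* Division by a constant power of two needs no idiv.  For signed
     ints, x / 2 truncates toward zero, so compilers typically emit a
     shift plus a sign fix-up rather than a divide:
         imul  edi, esi      ; x * y (wraps in 32 bits)
         mov   eax, edi
         shr   eax, 31       ; 1 if the product is negative
         add   eax, edi
         sar   eax, 1        ; == x * y / 2; no idiv, edx untouched
  */
  int mul_half(int x, int y)
  {
      return x * y / 2;
  }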
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

hgm wrote:
bob wrote:Making all wheels rubber would represent a problem. Keeping the main wheels steel but applying power through a rubber wheel does NOT represent a problem. But they sure got taught a lesson, eh?
And all regular trains get a 20% higher energy bill because of the extra resistance they have to overcome, due to the spikes...

But we are really drifting very far now from the topic of this thread, which was not strcpy but the lockless hashing. The strcpy() is just a side track to refute arguments that compiler writers would not do intentionally malicious things, and that all things defined as UB would have an upside in terms of increased performance for at least some users. Not so! In the strcpy case everyone obviously suffers. No need to discuss it further. Compiler writers can be purely malicious if the standard allows them.

The question I worry about, however, is whether similarly malicious future compiler writers could abuse the standard in a similar way to force people to use expensive locks on code like the lockless hashing, which would work perfectly without them (i.e. making a one-time copy from a volatile hash table, and verifying and using the data only from that copy), by arguing that the absence of such locks falls under the standard's definition of UB, and using that as an excuse to summon nasal demons?

I can safely answer "no" for my code, because I generally compile everything separately so that the compiler cannot see from procedure to procedure. The only advantage I have ever seen for "big compiles" is to let the compiler see everything and inline where appropriate. For example, if you take your eval and break it into EvaluateRooks(), EvaluateKnights() and such, you introduce procedure call overhead. Since there is only one instance of the calling procedure, the compiler will likely inline those since it makes the program smaller, and allows it to then optimize across the inlined functions when possible.

But there is a solution that many use. Just put such things into one file, as in Evaluate() in Crafty. All that is there is evaluation code, so the compiler can, if it deems it worthwhile, turn it into one long inline procedure with no calls at all.

Detecting races due to missing locks is hopeless, because you can enter a procedure with a lock already set, rendering the race condition not possible. If the compiler can't see that code, it would not be able to handle it correctly. I'd hope it would not break things.

I do exactly this in my NextMove() procedure usage. A normal NextMove() call gets a local tree structure passed to it, but at a split point, I acquire a lock for a specific shared split block, then pass NextMove() the address of that split block with the instruction "give me the next move to search." After it returns I release the lock. Looking at just NextMove() you might wonder "how does this get synchronized outside split points?" I wrote it this way so that NextMove() is the same for all "find next move" requests. The code that calls it handles the synchronization before and after calling it, which gives me that one special-purpose SearchSMP() function with everything else exactly the same. Clean design, works perfectly, so long as no compiler demands that you lock/unlock immediately adjacent to where you access/modify something that might be shared with others. If a compiler ever went that far, it would break so many programs, including the linux kernel for certain, and the windows kernel almost certainly, because that is a pretty common programming practice.
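For readers who have not seen it, a minimal sketch of the XOR-validation idea the thread is about (after the Hyatt/Mann lockless-hashing scheme); the struct layout and names are illustrative, not Crafty's actual code:

Code: Select all

  #include <stdint.h>

  /* Illustrative sketch of lockless hashing.  Each entry holds two
     64-bit words; instead of the raw key we store key ^ data.  A torn
     entry (key half written by one thread, data half by another) then
     fails the validity check and is simply ignored, so the table
     needs no lock. */
  typedef struct {
      uint64_t key_xor_data;
      uint64_t data;
  } HashEntry;

  void hash_store(volatile HashEntry *e, uint64_t key, uint64_t data)
  {
      e->key_xor_data = key ^ data;
      e->data = data;
  }

  /* Make a one-time local copy of both words, then verify and use
     only the copy, as hgm describes above. */
  int hash_probe(const volatile HashEntry *e, uint64_t key, uint64_t *out)
  {
      uint64_t d = e->data;
      uint64_t k = e->key_xor_data;
      if ((k ^ d) != key)
          return 0;            /* wrong entry, or torn write */
      *out = d;
      return 1;
  }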
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

rbarreira wrote:
bob wrote:
rbarreira wrote:
bob wrote:
rbarreira wrote:
bob wrote: What I am asking of the compiler is "get out of the way and let the hardware do what it will do." Don't try to make assumptions.
That is not a reasonable request of you, because not only is there different hardware out there but even on x86 there are many different ways to calculate "x * 2":

- the compiler can use imul
- the compiler can use add
- the compiler can use shl
- the compiler can use lea
- the compiler could conceivably use SIMD (SSE) instructions, if it's multiplying several values in a loop.

Why do you assume only one of those is a valid response by the compiler?
If you notice * and / in the same expression, imul with one operand is a natural choice: "do the right thing." My students do this on class assignments without having to be told, because they want to make sure that my input won't break their code. Multiplying two numbers > 65535 is problematic, unless you then divide by something.

So, "do the right thing" as opposed to "do something." I mean, for an int multiply, it COULD use the floating point processor. Should it?
So you're asking compilers to detect the presence of divisions in an expression, and use a different instruction for multiplication if that is the case?

I should stop arguing with you but this is starting to become hilarious, so I probably won't.
Nope. If you "do the right thing" it will work as I want. Just do the math. Don't try to get cute with the optimizing and think I mean something I don't, just because there is a potential for integer overflow there. BTW don't you think it a little inconsistent to argue as you do yet still have code with potential undefined behavior in it? Or do you write in C and check EVERY add, subtract, multiply and divide operation for overflow before performing it? I'll bet if you look at gcc, it doesn't either, in its source code. Meanwhile, back at the ranch, the compiler optimizer guys are breaking things as quickly as they can for optimizations that are REALLY not very significant.

I realize there is a definite risk in fixed-word-length architectures. I agree to that when I choose to use that architecture. If it is going to overflow, let it overflow. If it is not, let it complete. Don't arbitrarily toss the code out in some cases, and leave it alone in others. Trying to handle just a subset of overflows is a REAL problem because it makes compiler output unpredictable. Don't assume that if someone writes if (a + 1 > a) they really mean if (1). If they meant if(1) they would likely have written if(1). Overflow can occur on any arithmetic operation. There is nothing the compiler can do, other than to check the overflow flag after every instruction, and nobody would tolerate that slow-down. So leave it alone rather than assuming I meant something different from what I wrote. If I write it one way, compile it that way. If you want to produce a warning, fine: "Hey, you wrote if (a + 1 > a), which is always true unless an overflow occurs; you might consider writing if(1) instead. UNLESS you think an overflow might occur, in which case you wrote exactly the right thing."

Much better.

Why don't you do what I did? Compile a few big programs with and without -fwrapv, and see if you can find any speed difference. I tried several chess engines and found absolutely no difference in speed. For the record, I compiled (a) crafty; (b) stockfish; (c) gnu chess 6.0something (whatever is current on macports); (d) fruit; (e) Toga 2; and (f) an older version of Arasan. If this were a useful optimization, wouldn't you expect at least one of those integer-only programs to show an improvement? I mean, we KNOW it breaks some silly programs, like the infinite loop example. Where is that benefit to justify taking such a risk? Surely there must be some actual programs that this helps, somewhere. I might even try building gcc itself with -fwrapv to see if it makes the compiler run slower. However, something tells me "no". I'll report back on this a bit later; it does take a while to compile.
But what is the right thing? You just told me it depends on the operators present in the expression. By your criteria, any compiler that compiles "x * 2 / 2" to use the shl or add instructions is not doing "the right thing". So what other expressions shouldn't use shl or add?

Yes, my programs probably have bugs, some related to UB and some not. In neither case would I blame the compiler if those bugs caused my program to break. But note that you don't have to do checking in every expression. Only an actual int overflow is UB, not any addition.

I can't do performance experiments right now, nor do I want to prove that every single gcc optimization works well. I'm not a gcc developer.
OK, a summary.

A compiler CAN do whatever it wants with a * b / c. It can do the entire thing (at least a*b) in one 32 bit register. That is perfectly OK to me. But since the divide only gives us one choice (idiv divides the edx:eax register pair by its source operand), doing the multiply that produces a 64 bit result is actually simpler, and there is no other choice for the idiv anyway.

Either way works. Given a*b/c, I would do the 64 bit version on 32 bit hardware since it is far more accurate, assuming a*b/c is a 32 bit value after computing. If it is bigger, it wraps. If you do it all in 32 bit mode, it wraps MUCH quicker, but that would be acceptable, since as a programmer, I understand that a fixed-word-length machine has integer math limitations.

My point about the adds is that if you don't test, you don't know whether you suffered "undefined behavior" or not. And, in fact, you will not EVER suffer, because integer addition and subtraction are precisely defined on Intel hardware, even if they DO overflow. So you know what you will get in every single add. But IF you happen to use a couple of constants, you now have no idea, because the compiler steps in and intentionally breaks your code, because it assumes that overflow won't occur, on a hardware platform where it is guaranteed that it WILL happen on occasion. Multiply two variables and you get the expected wrapped overflow. Multiply two constants and the choices increase, depending on the optimizer. You might get the same answer. You might get a different answer (0x40000000 * 2 / 2 != 0x40000000, for example), or you might get an error, or you might have demons fly out of your nose. All because of your choice to use constants, or because you let the compiler see what the variable values are rather than keeping them hidden until run-time. That I don't find acceptable. Any integer arithmetic operator can overflow, yet the compiler changes just a few of them to do something different from all the rest. Inconsistent is the word I like there. In addition to "unacceptable".
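A minimal way to see the difference described above; the outputs in the comment are typical gcc behavior (they match the runs posted below), not something the standard guarantees:

Code: Select all

  #include <stdio.h>

  /* At -O0 the multiply wraps in hardware and the divide halves the
     wrapped value; at -O2 the compiler typically cancels * 2 / 2 and
     returns x unchanged.  Both outcomes are permitted, because the
     signed overflow is UB. */
  int twice_halved(int x)
  {
      return x * 2 / 2;
  }

  int main(void)
  {
      printf("%d\n", twice_halved(0x40000000));
      /* typically -1073741824 at -O0, 1073741824 at -O2 */
      return 0;
  }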
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: How could a compiler break the lockless hashing method?

Post by syzygy »

bob wrote:A compiler CAN do whatever it wants with a * b / c. It can do the entire thing (at least a*b) in one 32 bit register. That is perfectly OK to me. But since the divide only gives us one choice (idiv divides the edx:eax register pair by its source operand), doing the multiply that produces a 64 bit result is actually simpler, and there is no other choice for the idiv anyway.

Either way works. Given a*b/c, I would do the 64 bit version on 32 bit hardware since it is far more accurate, assuming a*b/c is a 32 bit value after computing. If it is bigger, it wraps. If you do it all in 32 bit mode, it wraps MUCH quicker, but that would be acceptable, since as a programmer, I understand that a fixed-word-length machine has integer math limitations.
So, in your view, even for unsigned 32-bit ints a compiler has a lot of freedom in evaluating a * b / c? It does not need to give the same outcome every time?

You would find it acceptable if a * b / c is evaluated as a 64-bit multiplication followed by a division, followed by a truncation to 32-bit, but you would also find it acceptable if a * b / c is evaluated as a 32-bit multiplication followed by a division?

And if the hardware allows completely different ways of evaluating a multiplication followed by a division, then you're fine if the compiler uses that? As long as it is somehow close enough to "what you wanted"?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

rbarreira wrote:
bob wrote:
rbarreira wrote:
wgarvin wrote: I don't believe the compiler writers would bother to do this stuff if it didn't actually speed up some actual programs out there. Their job is to generate the best possible code while complying with the spec. Our job is to write programs, but if we want those programs to work reliably, we also have to comply with the spec.
It's actually trivial to give examples of where not using Bob's favorite instruction (1-operand imul) results in better performance. The easiest would be a situation where the compiler is using the edx register and doesn't want it clobbered. A slightly more complicated one would be auto-vectorization of code, where the pmulld SSE instruction would be used.
Read my post AGAIN. If you do a divide, you ARE going to use edx. Period. Therefore a single-operand mul makes perfect sense since you need a quadword to do a 32 bit idiv anyway. If you don't do that form of imul, you get to move the number to eax, then follow that with a cdq (another instruction that sign-extends eax into edx), then do the idiv.

So with a divide, you use edx or else don't do the divide. Unless you decide to go to SSE or whatever. Which is yet another can of worms.
But if I write "x * y / 2" then idiv is not needed. So that's just wrong.
What are you going to do? Shift a signed integer to the right? And get undefined behavior? Or possibly try to optimize out the idiv and use a shrd instruction? Either one works for me. The shrd would be faster, but hey, aren't you advocating "no undefined behavior"???

How are you going to divide by 2?
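The two choices genuinely differ for negative values, which is the crux here; a small illustration, assuming two's-complement hardware with an arithmetic right shift (the common case):

Code: Select all

  #include <stdio.h>

  /* x / 2 truncates toward zero; an arithmetic right shift floors.
     That is why compilers insert a sign fix-up before the shift, and
     why C leaves right-shifting a negative value to the
     implementation rather than pinning it down. */
  int main(void)
  {
      int x = -3;
      printf("%d %d\n", x / 2, x >> 1);   /* -1 -2 on typical hardware */
      return 0;
  }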
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

syzygy wrote:
bob wrote:A compiler CAN do whatever it wants with a * b / c. It can do the entire thing (at least a*b) in one 32 bit register. That is perfectly OK to me. But since the divide only gives us one choice (idiv divides the edx:eax register pair by its source operand), doing the multiply that produces a 64 bit result is actually simpler, and there is no other choice for the idiv anyway.

Either way works. Given a*b/c, I would do the 64 bit version on 32 bit hardware since it is far more accurate, assuming a*b/c is a 32 bit value after computing. If it is bigger, it wraps. If you do it all in 32 bit mode, it wraps MUCH quicker, but that would be acceptable, since as a programmer, I understand that a fixed-word-length machine has integer math limitations.
So, in your view, even for unsigned 32-bit ints a compiler has a lot of freedom in evaluating a * b / c? It does not need to give the same outcome every time?

You would find it acceptable if a * b / c is evaluated as a 64-bit multiplication followed by a division, followed by a truncation to 32-bit, but you would also find it acceptable if a * b / c is evaluated as a 32-bit multiplication followed by a division?

And if the hardware allows completely different ways of evaluating a multiplication followed by a division, then you're fine if the compiler uses that? As long as it is somehow close enough to "what you wanted"?
As I said, "do the right thing." If you do a * b / c, using the x86 hardware to give a more accurate result seems reasonable. Otherwise, use the pure 32 bit versions if you insist on being that strict in your thinking. Most of the time there is no / in the multiply calculations so this is moot anyway. The majority are subscript calculations, converting anything above 1 dimension to a linear index. But on occasion, the 64 bit approach is sensible.

I can't imagine where I would ever want an overflow when doing multiplies, except maybe in the LCG PRNGs. So if it can do that mul/div more accurately, fine. The more common optimization problem deals with addition/subtraction. But if I am prepared to accept an overflow, as in a * b + c for the PRNG I mentioned, I accept it. The thing works as intended, so long as the compiler simply does those two operations. I don't care how it multiplies and adds. I know it will overflow. I know how it will overflow. Apparently that is acceptable to me. So just do it.
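As an aside, the LCG case has a clean fix inside the standard: with unsigned state, wraparound is fully defined, so the intended overflow needs no indulgence from the compiler. A sketch, using Knuth's MMIX constants as an illustrative choice:

Code: Select all

  #include <stdint.h>

  /* An a * x + c generator of the kind mentioned above.  Unsigned
     overflow wraps modulo 2^64 by definition, at every optimization
     level, on every conforming compiler. */
  static uint64_t lcg_state = 1;

  uint64_t lcg_next(void)
  {
      lcg_state = lcg_state * 6364136223846793005ULL
                            + 1442695040888963407ULL;
      return lcg_state;
  }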
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

AlvaroBegue wrote:So you think mapping `x * 2 / 2' to `x' is a valid optimization, but mapping it to `(x + x) / 2' or `(x * 2) >> 1' (with the appropriate shift that propagates the sign bit) is not. Is that what you are saying? Also, do you think it's OK for the compiler to give you different results if you store `x * 2' in a local variable and then divide that one by 2?

I really think by now you don't know what you want anymore, and there is a slight chance you might end up seeing the point of leaving things undefined and letting the compiler give you different answers for different levels of optimization, given that your program invokes undefined behavior. But I am not keeping my hopes up.
No, I don't think mapping x * 2 / 2 to x is OK. I wrote it that way for a reason. If you want to play the "macro expansion card", write better macros. If I tell it to do an operation, I generally have a reason. If it can avoid duplicated work with common subexpressions and such, fine, so long as it honors what I asked it to do and how variables are declared (this optimization has to be disabled if there are volatile variables, of course). I don't see why the compiler has to handle crappy macro output and try to make it better. That's the programmer's responsibility...

No, I do not accept the "different answers for different levels of optimization." Yes, I understand that uninitialized variables can cause that. The compiler produces warnings; I should fix them. But for the overflow example given before, the silent optimization of removing the code is unacceptable. Any optimization that changes the semantics of my code is unacceptable too; whether the compiler thinks of it as "undefined behavior" or not really doesn't matter.

BTW, I just discovered, after testing something in the strcpy() thread here, that Mavericks doesn't even do the overlapping test right. It aborts on strcpy(a, a+n), which is the perfectly safe direction of overlap in strcpy(), but it ignores and screws up on strcpy(a+n, a): no abort, just bogus output. Do you think that is acceptable? Break the part that is known to work, leave "undefined" the part that is known to be bad? hmmm...
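For completeness: the standard makes strcpy() with any overlap undefined, in both directions; the defined way to shift a string within its own buffer is memmove(). A small sketch:

Code: Select all

  #include <stdio.h>
  #include <string.h>

  int main(void)
  {
      char buf[32] = "abcdef";
      /* drop the first two characters; memmove permits overlap */
      memmove(buf, buf + 2, strlen(buf + 2) + 1);
      printf("%s\n", buf);    /* cdef */
      return 0;
  }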
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:Do you have any clue what you are asking from the compiler when you write x * 2 / 2?

If x is a signed int, and you want signed overflow to wrap, which so far you have always wanted, then what you are asking is to get -12768 when x = 20000 and ints are 16-bit.

If you do not understand this, you should not touch a compiler.
What I am asking of the compiler is "get out of the way and let the hardware do what it will do." Don't try to make assumptions.

x * 2 / 2 is done correctly. The compiler can replace it with x if it wants, or it can do the multiply/divide which STILL produces x. But the compiler comes up with a different answer when IT chooses what should happen. That I do not like.
Let's switch to 32-bit ints. We take x = 1500000000.

Code: Select all

  int p = x * 2 / 2;
  int q = (x * 2) / 2;
  int r = x * 2;
  int s = r / 2;
What should be the values of p, q, r, s?
Let's ask the compiler first.

run 1. gcc, no optimization:

-647483648 -647483648 -1294967296 -647483648

run 2. gcc, -O3:

1500000000 1500000000 -1294967296 -647483648

My answer:

1500000000 1500000000 for the first two. third = -1294967296. Normal signed overflow wrap. For the last, _I_ would optimize it a bit differently, namely I would have done the multiply using imul to get the answer in two registers as a 64 bit value. Displaying EAX gives the right answer for #3 above. I would then have done a simple expression substitution and replaced r with the result calculated (and still in both registers) and got right back to the 1500000000 again.

BTW, for the divides the compiler in unoptimized mode produces the undefined behavior right shift. Go figure. It is undefined if I do it, but apparently not if the compiler does it. How exactly does that work, again???
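The evaluation described here (a 64-bit intermediate product, as one-operand imul leaves in edx:eax) can be spelled out in portable C; a sketch with illustrative names:

Code: Select all

  #include <stdint.h>
  #include <stdio.h>

  /* Multiply at 64-bit width, divide, then truncate to 32 bits.  For
     x = 1500000000, x * 2 / 2 comes back as 1500000000 even though
     the 32-bit product wraps. */
  int mul_div_wide(int x, int m, int d)
  {
      return (int)(((int64_t)x * m) / d);
  }

  int main(void)
  {
      int x = 1500000000;
      printf("%d\n", mul_div_wide(x, 2, 2));   /* 1500000000 */
      return 0;
  }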
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: How could a compiler break the lockless hashing method?

Post by syzygy »

bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:Do you have any clue what you are asking from the compiler when you write x * 2 / 2?

If x is a signed int, and you want signed overflow to wrap, which so far you have always wanted, then what you are asking is to get -12768 when x = 20000 and ints are 16-bit.

If you do not understand this, you should not touch a compiler.
What I am asking of the compiler is "get out of the way and let the hardware do what it will do." Don't try to make assumptions.

x * 2 / 2 is done correctly. The compiler can replace it with x if it wants, or it can do the multiply/divide which STILL produces x. But the compiler comes up with a different answer when IT chooses what should happen. That I do not like.
Let's switch to 32-bit ints. We take x = 1500000000.

Code: Select all

  int p = x * 2 / 2;
  int q = (x * 2) / 2;
  int r = x * 2;
  int s = r / 2;
What should be the values of p, q, r, s?
Let's ask the compiler first.

run 1. gcc, no optimization:

-647483648 -647483648 -1294967296 -647483648

run 2. gcc, -O3:

1500000000 1500000000 -1294967296 -647483648

My answer:

1500000000 1500000000 for the first two. third = -1294967296. Normal signed overflow wrap. For the last, _I_ would optimize it a bit differently, namely I would have done the multiply using imul to get the answer in two registers as a 64 bit value. Displaying EAX gives the right answer for #3 above. I would then have done a simple expression substitution and replaced r with the result calculated (and still in both registers) and got right back to the 1500000000 again.
Wow, you would have let -1294967296 / 2 = 1500000000 !!

Of course this is allowed, because x * 2 gives UB so anything can happen. If I don't overlook anything, your proposed implementation of this code is C99-compliant (i.e. it gives the results specified by C99 when no overflow occurs).

So basically you accept that anything can happen when signed integers overflow. You don't care at all about having a programming language with well-defined semantics.

Most likely you don't even care about getting well-defined results from unsigned arithmetic.
bob wrote:BTW, for the divides the compiler in unoptimized mode produces the undefined behavior right shift. Go figure. It is undefined if I do it, but apparently not if the compiler does it. How exactly does that work, again???
Can you REALLY not answer this yourself?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: How could a compiler break the lockless hashing method?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:
syzygy wrote:Do you have any clue what you are asking from the compiler when you write x * 2 / 2?

If x is a signed int, and you want signed overflow to wrap, which so far you have always wanted, then what you are asking is to get -12768 when x = 20000 and ints are 16-bit.

If you do not understand this, you should not touch a compiler.
What I am asking of the compiler is "get out of the way and let the hardware do what it will do." Don't try to make assumptions.

x * 2 / 2 is done correctly. The compiler can replace it with x if it wants, or it can do the multiply/divide which STILL produces x. But the compiler comes up with a different answer when IT chooses what should happen. That I do not like.
Let's switch to 32-bit ints. We take x = 1500000000.

Code: Select all

  int p = x * 2 / 2;
  int q = (x * 2) / 2;
  int r = x * 2;
  int s = r / 2;
What should be the values of p, q, r, s?
Let's ask the compiler first.

run 1. gcc, no optimization:

-647483648 -647483648 -1294967296 -647483648

run 2. gcc, -O3:

1500000000 1500000000 -1294967296 -647483648

My answer:

1500000000 1500000000 for the first two. third = -1294967296. Normal signed overflow wrap. For the last, _I_ would optimize it a bit differently, namely I would have done the multiply using imul to get the answer in two registers as a 64 bit value. Displaying EAX gives the right answer for #3 above. I would then have done a simple expression substitution and replaced r with the result calculated (and still in both registers) and got right back to the 1500000000 again.
Wow, you would have let -1294967296 / 2 = 1500000000 !!
Nope. I would have let 3000000000 / 2 = 1500000000. The previous calculation produced 3000000000, which would be wrapped to the -129etc number. But I still have the RIGHT answer in EDX:EAX. That is a "common subexpression elimination" optimization. r and the previous instruction's result should be the same value. My approach simply gives the most accurate answer.


Of course this is allowed, because x * 2 gives UB so anything can happen. If I don't overlook anything, your proposed implementation of this code is C99-compliant (i.e. it gives the results specified by C99 when no overflow occurs).

So basically you accept that anything can happen when signed integers overflow. You don't care at all about having a programming language with well-defined semantics.
No, I accept that I want the compiler to do "the best thing". Obviously a series of overflows like the one in this sample is not exactly likely. But for the last two assignments, my suggested evaluation produces a better number.
Most likely you don't even care about getting well-defined results from unsigned arithmetic.
I care about getting CONSISTENT behavior, as in the LCG PRNGs I had mentioned. What the hardware does is perfectly acceptable to me. No more, no less. Certainly not deciding an expression can be replaced by a "1", or that an expression can be eliminated completely.
bob wrote:BTW, for the divides the compiler in unoptimized mode produces the undefined behavior right shift. Go figure. It is undefined if I do it, but apparently not if the compiler does it. How exactly does that work, again???
Can you REALLY not answer this yourself?
No, I really can't figure it out. Actually, I already knew the answer. This one was actually handled slightly better by the standards committee: shifting a negative value right is not undefined behavior, it is implementation-defined. Seems we get way too many claims of UB.