How could a compiler break the lockless hashing method?

wgarvin · Post by **wgarvin** » Sat Dec 14, 2013 7:04 pm

syzygy wrote:Any (correct) lock implementation on x86 relies on some form of "locked" instruction. It could be xcgh (with implicit lock prefix) or "lock cmpxchg" or "lock inc", but something like this is required.

Of course, but back in the days of single-core processors, if you were't dealing with memory-mapped I/O or something, if you were just synchronizing access to a data structure shared between your threads... You could ensure it was aligned and elide the LOCK prefix, relying on the atomicity of the read/write to cacheline. So I thought a multi-threaded app that self-patched those LOCKs to NOPs in startup, was pretty neat. But yeah, with multi-core CPUs (edit: Or any machine that can run more than one stream of instructions at once, i guess) you do need the stronger guarantee that LOCK (or XCHG) provides.

wgarvin · Post by **wgarvin** » Sat Dec 14, 2013 7:24 pm

syzygy wrote:
wgarvin wrote:I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
I use such instructions very heavily in my (multi-threaded) tablebase generator. Without them the correctness of the generated table would be dependent on luck.

I should clarify my statement a bit. I didn't mean I won't use it where it's needed! That would be nuts. All I meant was I try to not to write code that has to do a lot of accesses to shared state that would need this kind of protection. Of course it depends on your problem domain, sometimes it just is the right solution, e.g. your tb gen case.

syzygy wrote: I think on my Sandybridge cpu the overhead for single-threaded generation was just 1 or 2 percent.

Nice! By "single-threaded" I assume you mean also that there was no lock prefix on the insns.

abulmo · Post by **abulmo** » Sun Dec 15, 2013 12:11 am

bob wrote:
abulmo wrote:gcc invoked the -ftrapv option is an example of such a compiler.
That appears to not even work on gcc 4.7.3... I wrote a small test that overflowed multiple times, and absolutely nothing happened other than the printf() results showed the effects of the expected overflows...

The -ftrapv option is broken on gcc for x86-64 target. Your program should abort if compiled with -m32.
The -ftrapv option also works on clang.

bob · Post by **bob** » Sun Dec 15, 2013 1:15 am

syzygy wrote:
bob wrote:
wgarvin wrote:
syzygy wrote:[...] with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but also that has been improved. It still has a cost, but the cost is small in the uncontended case.
Thanks, I just learned something. Its been a while since I wrote any asm with a lock prefix on it. Back when most machines were still single core, I saw a neat trick: a multi-threaded program that detected at startup that it was running on a single-core machine and patched most of its lock instructions with NOPs. It did run noticably faster.

I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
Certainly the xchg reg,men is the simplest way to implement an atomic lock so long as one does not spin on the xchg instruction (see Crafty Lock() macro for details, an approach called a "shadow lock".)
Any (correct) lock implementation on x86 relies on some form of "locked" instruction. It could be xcgh (with implicit lock prefix) or "lock cmpxchg" or "lock inc", but something like this is required.

Yes, but according to docs, 8 bytes is ALL that can be "locked"... BTW there are lockless semaphores that work, so long as the architecture guarantees in-order writes from a single processor, which X86 does. Then you can safely acquire a lock without needing an atomic exchange or whatever.

bob · Post by **bob** » Sun Dec 15, 2013 1:17 am

wgarvin wrote:
syzygy wrote:Any (correct) lock implementation on x86 relies on some form of "locked" instruction. It could be xcgh (with implicit lock prefix) or "lock cmpxchg" or "lock inc", but something like this is required.
Of course, but back in the days of single-core processors, if you were't dealing with memory-mapped I/O or something, if you were just synchronizing access to a data structure shared between your threads... You could ensure it was aligned and elide the LOCK prefix, relying on the atomicity of the read/write to cacheline. So I thought a multi-threaded app that self-patched those LOCKs to NOPs in startup, was pretty neat. But yeah, with multi-core CPUs (edit: Or any machine that can run more than one stream of instructions at once, i guess) you do need the stronger guarantee that LOCK (or XCHG) provides.

As I have mentioned, there IS a way of doing locks without atomic operations, so long a in-order writes are guaranteed from a single CPU, which Intel does. Doesn't work worth a crap on alpha, and we may one day see in-order writes go on X86 since they are a bit of a pain to the hardware.

bob · Post by **bob** » Sun Dec 15, 2013 1:20 am

wgarvin wrote:
syzygy wrote:
wgarvin wrote:I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
I use such instructions very heavily in my (multi-threaded) tablebase generator. Without them the correctness of the generated table would be dependent on luck.
I should clarify my statement a bit. I didn't mean I won't use it where it's needed! That would be nuts. All I meant was I try to not to write code that has to do a lot of accesses to shared state that would need this kind of protection. Of course it depends on your problem domain, sometimes it just is the right solution, e.g. your tb gen case.

syzygy wrote: I think on my Sandybridge cpu the overhead for single-threaded generation was just 1 or 2 percent.
Nice! By "single-threaded" I assume you mean also that there was no lock prefix on the insns.

There is definitely overhead, it just is not that significant. That "read for ownership" operation is more complex than just a plain read.

syzygy · Post by **syzygy** » Sun Dec 15, 2013 2:34 am

wgarvin wrote:
syzygy wrote:
wgarvin wrote:I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
I use such instructions very heavily in my (multi-threaded) tablebase generator. Without them the correctness of the generated table would be dependent on luck.
I should clarify my statement a bit. I didn't mean I won't use it where it's needed! That would be nuts. All I meant was I try to not to write code that has to do a lot of accesses to shared state that would need this kind of protection. Of course it depends on your problem domain, sometimes it just is the right solution, e.g. your tb gen case.

Don't worry, I did not misunderstand. My point is only that they have become quite cheap. Obviously Intel had to do something about their cost now that multicore CPUs have become prevalent.

syzygy wrote:I think on my Sandybridge cpu the overhead for single-threaded generation was just 1 or 2 percent.
Nice! By "single-threaded" I assume you mean also that there was no lock prefix on the insns.

Yes, single-threaded with atomic updates versus single-threaded with non-atomic updates. The atomic updates are implemented using inline assembly that uses lock cmpxchg on byte values.

syzygy · Post by **syzygy** » Sun Dec 15, 2013 2:49 am

bob wrote:Yes, but according to docs, 8 bytes is ALL that can be "locked"...

The exception is lock cmpxchg16b. It can be used to construct atomic operations on 16 bytes. See Intel Manual Vol. 2A, 3-150.

BTW there are lockless semaphores that work, so long as the architecture guarantees in-order writes from a single processor, which X86 does. Then you can safely acquire a lock without needing an atomic exchange or whatever.

Hmm, maybe you are right (didn't Leslie Lamport come up with something like that)? I will look into this.

edit: It's Lamport's Bakery algorithm:

Lamport wrote:Unlike any previous algorithm, and almost all subsequent algorithms, the bakery algorithm works regardless of what value is obtained by a read that overlaps a write. If the write changes the value from 0 to 1, a concurrent read could obtain the value 7456 (assuming that 7456 is a value that could be in the memory location). The algorithm still works. I didn't try to devise an algorithm with this property. I discovered that the bakery algorithm had this property after writing a proof of its correctness and noticing that the proof did not depend on what value is returned by a read that overlaps a write.

I don't know how many people realize how remarkable this algorithm is. Perhaps the person who realized it better than anyone is Anatol Holt, a former colleague at Massachusetts Computer Associates. When I showed him the algorithm and its proof and pointed out its amazing property, he was shocked. He refused to believe it could be true. He could find nothing wrong with my proof, but he was certain there must be a flaw. He left that night determined to find it. I don't know when he finally reconciled himself to the algorithm's correctness.

(...)

What is significant about the bakery algorithm is that it implements mutual exclusion without relying on any lower-level mutual exclusion. Assuming that reads and writes of a memory location are atomic actions, as previous mutual exclusion algorithms had done, is tantamount to assuming mutually exclusive access to the location. So a mutual exclusion algorithm that assumes atomic reads and writes is assuming lower-level mutual exclusion. Such an algorithm cannot really be said to solve the mutual exclusion problem. Before the bakery algorithm, people believed that the mutual exclusion problem was unsolvable--that you could implement mutual exclusion only by using lower-level mutual exclusion. Brinch Hansen said exactly this in a 1972 paper. Many people apparently still believe it.

syzygy · Post by **syzygy** » Sun Dec 15, 2013 5:13 pm

wgarvin wrote:The least they ought to have done, was define some semantics for it with unsigned integer types, providing that it might return ANY bit pattern, but would not trap or otherwise blow up. I have a hard time imagining a C-capable platform that could not meet that requirement.

If you allow signed overflow to return ANY bit pattern, it seems difficult to imagine any program (written by a programmer well aware that signed overflow can return ANY bit pattern) that on purpose lets integers overflow. In other words, any overflow would be a bug and can easily lead to crashes, buffer overflows and any other kind of unexpected behavior. There does not seem to be much to gain here.

Maybe what could be done is defining in the C standard certain behaviors that compilers could optionally support, e.g. via compiler flags.

mvk · Post by **mvk** » Sun Dec 15, 2013 8:05 pm

syzygy wrote: If you allow signed overflow to return ANY bit pattern, it seems difficult to imagine any program (written by a programmer well aware that signed overflow can return ANY bit pattern) that on purpose lets integers overflow. In other words, any overflow would be a bug and can easily lead to crashes, buffer overflows and any other kind of unexpected behavior. There does not seem to be much to gain here.

I found a case of signed overflow in my program where this would still be ok.
The functions in Rookie's evaluator, besides spitting out a total score, also provide a broken-down score. With breakdown defined something like this:

Code: Select all

struct evaluate_breakdown {
        int     material[2]; // separate white and black
        int     king_safety[2];
        int     pawn_structure[2];
        int     piece_placement[2];
        int     other[2];
};

The evaluation consists of many little functions, and each of them would update part of the breakdown structure besides returning their score contribution. The returned total score is in my case not calculated from the breakdown fields: both are maintained separately. (When I made it I considered the breakdown more as an add-on/side-effect. There is an assert somewhere near the end to check that these are consistent). So the structure works as an accumulator: it is zeroed outside evaluate_full(), and the evaluation subfunctions add and subtract their contributions.

Now the thing is: the breakdown is really only of interest at the end of the PV. So what my program does is zero this structure only prior to evaluations from PV nodes, and otherwise just let it run. This can cause overflows outside the PV, but that is ok because there I don't look at it...

I found this today with -ftrapv, and for now I just stripped the whole breakdown thing from the code. (There are better ways to preserve the functionality, but I'm not really looking at the information anymore).

Mind that this case has no uninitialized variables, because there is a zeroing before the first update. So it is really signed overflow for a case where it doesn't matter. I would prefer C to treat signed overflow as IDB instead of UB.

How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?

Re: How could a compiler break the lockless hashing method?