Of course, but back in the days of single-core processors, if you were't dealing with memory-mapped I/O or something, if you were just synchronizing access to a data structure shared between your threads... You could ensure it was aligned and elide the LOCK prefix, relying on the atomicity of the read/write to cacheline. So I thought a multi-threaded app that self-patched those LOCKs to NOPs in startup, was pretty neat. But yeah, with multi-core CPUs (edit: Or any machine that can run more than one stream of instructions at once, i guess) you do need the stronger guarantee that LOCK (or XCHG) provides.syzygy wrote:Any (correct) lock implementation on x86 relies on some form of "locked" instruction. It could be xcgh (with implicit lock prefix) or "lock cmpxchg" or "lock inc", but something like this is required.
How could a compiler break the lockless hashing method?
Moderator: Ras
-
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: How could a compiler break the lockless hashing method?
-
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: How could a compiler break the lockless hashing method?
I should clarify my statement a bit. I didn't mean I won't use it where it's needed! That would be nuts. All I meant was I try to not to write code that has to do a lot of accesses to shared state that would need this kind of protection. Of course it depends on your problem domain, sometimes it just is the right solution, e.g. your tb gen case.syzygy wrote:I use such instructions very heavily in my (multi-threaded) tablebase generator. Without them the correctness of the generated table would be dependent on luck.wgarvin wrote:I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
Nice! By "single-threaded" I assume you mean also that there was no lock prefix on the insns.syzygy wrote: I think on my Sandybridge cpu the overhead for single-threaded generation was just 1 or 2 percent.
-
- Posts: 151
- Joined: Thu Nov 12, 2009 6:31 pm
Re: How could a compiler break the lockless hashing method?
The -ftrapv option is broken on gcc for x86-64 target. Your program should abort if compiled with -m32.bob wrote:That appears to not even work on gcc 4.7.3... I wrote a small test that overflowed multiple times, and absolutely nothing happened other than the printf() results showed the effects of the expected overflows...abulmo wrote:gcc invoked the -ftrapv option is an example of such a compiler.
The -ftrapv option also works on clang.
Richard
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: How could a compiler break the lockless hashing method?
Yes, but according to docs, 8 bytes is ALL that can be "locked"... BTW there are lockless semaphores that work, so long as the architecture guarantees in-order writes from a single processor, which X86 does. Then you can safely acquire a lock without needing an atomic exchange or whatever.syzygy wrote:Any (correct) lock implementation on x86 relies on some form of "locked" instruction. It could be xcgh (with implicit lock prefix) or "lock cmpxchg" or "lock inc", but something like this is required.bob wrote:Certainly the xchg reg,men is the simplest way to implement an atomic lock so long as one does not spin on the xchg instruction (see Crafty Lock() macro for details, an approach called a "shadow lock".)wgarvin wrote:Thanks, I just learned something. Its been a while since I wrote any asm with a lock prefix on it. Back when most machines were still single core, I saw a neat trick: a multi-threaded program that detected at startup that it was running on a single-core machine and patched most of its lock instructions with NOPs. It did run noticably faster.syzygy wrote:[...] with modern CPUs that doesn't lock the bus anymore if the cache line is not contended. It used to flush the pipeline, but also that has been improved. It still has a cost, but the cost is small in the uncontended case.
I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: How could a compiler break the lockless hashing method?
As I have mentioned, there IS a way of doing locks without atomic operations, so long a in-order writes are guaranteed from a single CPU, which Intel does. Doesn't work worth a crap on alpha, and we may one day see in-order writes go on X86 since they are a bit of a pain to the hardware.wgarvin wrote:Of course, but back in the days of single-core processors, if you were't dealing with memory-mapped I/O or something, if you were just synchronizing access to a data structure shared between your threads... You could ensure it was aligned and elide the LOCK prefix, relying on the atomicity of the read/write to cacheline. So I thought a multi-threaded app that self-patched those LOCKs to NOPs in startup, was pretty neat. But yeah, with multi-core CPUs (edit: Or any machine that can run more than one stream of instructions at once, i guess) you do need the stronger guarantee that LOCK (or XCHG) provides.syzygy wrote:Any (correct) lock implementation on x86 relies on some form of "locked" instruction. It could be xcgh (with implicit lock prefix) or "lock cmpxchg" or "lock inc", but something like this is required.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: How could a compiler break the lockless hashing method?
There is definitely overhead, it just is not that significant. That "read for ownership" operation is more complex than just a plain read.wgarvin wrote:I should clarify my statement a bit. I didn't mean I won't use it where it's needed! That would be nuts. All I meant was I try to not to write code that has to do a lot of accesses to shared state that would need this kind of protection. Of course it depends on your problem domain, sometimes it just is the right solution, e.g. your tb gen case.syzygy wrote:I use such instructions very heavily in my (multi-threaded) tablebase generator. Without them the correctness of the generated table would be dependent on luck.wgarvin wrote:I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
Nice! By "single-threaded" I assume you mean also that there was no lock prefix on the insns.syzygy wrote: I think on my Sandybridge cpu the overhead for single-threaded generation was just 1 or 2 percent.
-
- Posts: 5696
- Joined: Tue Feb 28, 2012 11:56 pm
Re: How could a compiler break the lockless hashing method?
Don't worry, I did not misunderstand. My point is only that they have become quite cheap. Obviously Intel had to do something about their cost now that multicore CPUs have become prevalent.wgarvin wrote:I should clarify my statement a bit. I didn't mean I won't use it where it's needed! That would be nuts. All I meant was I try to not to write code that has to do a lot of accesses to shared state that would need this kind of protection. Of course it depends on your problem domain, sometimes it just is the right solution, e.g. your tb gen case.syzygy wrote:I use such instructions very heavily in my (multi-threaded) tablebase generator. Without them the correctness of the generated table would be dependent on luck.wgarvin wrote:I've always tried to avoid lock, and the memory forms of xchg (which are always locked, prefix or no prefix)
Yes, single-threaded with atomic updates versus single-threaded with non-atomic updates. The atomic updates are implemented using inline assembly that uses lock cmpxchg on byte values.Nice! By "single-threaded" I assume you mean also that there was no lock prefix on the insns.syzygy wrote:I think on my Sandybridge cpu the overhead for single-threaded generation was just 1 or 2 percent.
-
- Posts: 5696
- Joined: Tue Feb 28, 2012 11:56 pm
Re: How could a compiler break the lockless hashing method?
The exception is lock cmpxchg16b. It can be used to construct atomic operations on 16 bytes. See Intel Manual Vol. 2A, 3-150.bob wrote:Yes, but according to docs, 8 bytes is ALL that can be "locked"...
Hmm, maybe you are right (didn't Leslie Lamport come up with something like that)? I will look into this.BTW there are lockless semaphores that work, so long as the architecture guarantees in-order writes from a single processor, which X86 does. Then you can safely acquire a lock without needing an atomic exchange or whatever.
edit: It's Lamport's Bakery algorithm:
Lamport wrote:Unlike any previous algorithm, and almost all subsequent algorithms, the bakery algorithm works regardless of what value is obtained by a read that overlaps a write. If the write changes the value from 0 to 1, a concurrent read could obtain the value 7456 (assuming that 7456 is a value that could be in the memory location). The algorithm still works. I didn't try to devise an algorithm with this property. I discovered that the bakery algorithm had this property after writing a proof of its correctness and noticing that the proof did not depend on what value is returned by a read that overlaps a write.
I don't know how many people realize how remarkable this algorithm is. Perhaps the person who realized it better than anyone is Anatol Holt, a former colleague at Massachusetts Computer Associates. When I showed him the algorithm and its proof and pointed out its amazing property, he was shocked. He refused to believe it could be true. He could find nothing wrong with my proof, but he was certain there must be a flaw. He left that night determined to find it. I don't know when he finally reconciled himself to the algorithm's correctness.
(...)
What is significant about the bakery algorithm is that it implements mutual exclusion without relying on any lower-level mutual exclusion. Assuming that reads and writes of a memory location are atomic actions, as previous mutual exclusion algorithms had done, is tantamount to assuming mutually exclusive access to the location. So a mutual exclusion algorithm that assumes atomic reads and writes is assuming lower-level mutual exclusion. Such an algorithm cannot really be said to solve the mutual exclusion problem. Before the bakery algorithm, people believed that the mutual exclusion problem was unsolvable--that you could implement mutual exclusion only by using lower-level mutual exclusion. Brinch Hansen said exactly this in a 1972 paper. Many people apparently still believe it.
-
- Posts: 5696
- Joined: Tue Feb 28, 2012 11:56 pm
Re: How could a compiler break the lockless hashing method?
If you allow signed overflow to return ANY bit pattern, it seems difficult to imagine any program (written by a programmer well aware that signed overflow can return ANY bit pattern) that on purpose lets integers overflow. In other words, any overflow would be a bug and can easily lead to crashes, buffer overflows and any other kind of unexpected behavior. There does not seem to be much to gain here.wgarvin wrote:The least they ought to have done, was define some semantics for it with unsigned integer types, providing that it might return ANY bit pattern, but would not trap or otherwise blow up. I have a hard time imagining a C-capable platform that could not meet that requirement.
Maybe what could be done is defining in the C standard certain behaviors that compilers could optionally support, e.g. via compiler flags.
-
- Posts: 589
- Joined: Tue Jun 04, 2013 10:15 pm
Re: How could a compiler break the lockless hashing method?
I found a case of signed overflow in my program where this would still be ok.syzygy wrote: If you allow signed overflow to return ANY bit pattern, it seems difficult to imagine any program (written by a programmer well aware that signed overflow can return ANY bit pattern) that on purpose lets integers overflow. In other words, any overflow would be a bug and can easily lead to crashes, buffer overflows and any other kind of unexpected behavior. There does not seem to be much to gain here.
The functions in Rookie's evaluator, besides spitting out a total score, also provide a broken-down score. With breakdown defined something like this:
Code: Select all
struct evaluate_breakdown {
int material[2]; // separate white and black
int king_safety[2];
int pawn_structure[2];
int piece_placement[2];
int other[2];
};
Now the thing is: the breakdown is really only of interest at the end of the PV. So what my program does is zero this structure only prior to evaluations from PV nodes, and otherwise just let it run. This can cause overflows outside the PV, but that is ok because there I don't look at it...
I found this today with -ftrapv, and for now I just stripped the whole breakdown thing from the code. (There are better ways to preserve the functionality, but I'm not really looking at the information anymore).
Mind that this case has no uninitialized variables, because there is a zeroing before the first update. So it is really signed overflow for a case where it doesn't matter. I would prefer C to treat signed overflow as IDB instead of UB.
[Account deleted]