Interlock clusters

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Interlock clusters

Post by sje »

Interlock clusters

Consider a general multiprocessor multithreaded search where the threads share a single transposition table with a large number of entries. A known problem is the possibility of non-atomic updates by more than one thread on the same entry.

One approach, probably never used, is to allocate an interlock for each entry. If the interlock can be implemented so that there is a very low time overhead for a non-contested access, then the only problem here is the huge memory space overhead.

However, a slightly modified approach can bypass most of the space overhead. Instead of allocating an interlock for each entry, a program can cluster the entries so that groups of them share an interlock, with all of the interlocks residing in their own small table of, say, 1,024 locks. When a thread wants to access entry number N, it first acquires interlock number N & 0x03FF, which is likely to remain uncontested as long as the number of interlocks is large compared with the number of threads.
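
As a rough illustration of the scheme just described (the names, types, and the use of POSIX spin locks are assumptions for the example, not code from any particular engine):

#include <stdint.h>
#include <pthread.h>

#define LOCK_COUNT 1024                 /* number of shared interlocks     */
#define LOCK_MASK  (LOCK_COUNT - 1)

typedef struct {
    uint64_t signature;
    uint64_t data;
} HashEntry;

static HashEntry          *hash_table;  /* many millions of entries        */
static pthread_spinlock_t  lock_table[LOCK_COUNT];

/* Call once at startup, after allocating hash_table. */
void init_lock_table(void)
{
    for (int i = 0; i < LOCK_COUNT; i++)
        pthread_spin_init(&lock_table[i], PTHREAD_PROCESS_PRIVATE);
}

/* Entry N is protected by interlock N & 0x03FF. */
void hash_store(uint64_t n, uint64_t signature, uint64_t data)
{
    pthread_spinlock_t *lock = &lock_table[n & LOCK_MASK];

    pthread_spin_lock(lock);
    hash_table[n].signature = signature;
    hash_table[n].data      = data;
    pthread_spin_unlock(lock);
}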
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

sje wrote:Interlock clusters

Consider a general multiprocessor multithreaded search where the threads share a single transposition table with a large number of entries. A known problem is the possibility of non-atomic updates by more than one thread on the same entry.

One approach, probably never used, is to allocate an interlock for each entry. If the interlock can be implemented so that there is a very low time overhead for a non-contested access, then the only problem here is the huge memory space overhead.

However, a slightly modified approach can bypass most of the space overhead. Instead of allocating an interlock for each entry, a program can cluster the entries so that groups of them share an interlock, with all of the interlocks residing in their own small table of, say, 1,024 locks. When a thread wants to access entry number N, it first acquires interlock number N & 0x03FF, which is likely to remain uncontested as long as the number of interlocks is large compared with the number of threads.
Won't work on x86. I tried this just this past week, in fact. The exchange instruction is _horribly_ slow; it grabs the bus for a long time and crushes SMP performance, even when the lock is almost never contested. You should try it on an 8-core or 16-core box. One or two locked accesses per node will absolutely kill performance... the lock-prefixed instructions just ramp up the contention for the bus...

Even the current "shadow lock" approach used in Crafty is not workable if the lock is taken at every node.
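
For reference, a sketch of a test-and-test-and-set ("shadow") spin lock of the general kind being referred to; this is illustrative GCC/x86 code, not Crafty's actual lock macro. The bus-locked exchange is only attempted after an ordinary read has seen the lock free.

typedef volatile int spinlock_t;

static inline void spin_lock(spinlock_t *l)
{
    while (__sync_lock_test_and_set(l, 1)) {     /* compiles to XCHG on x86 */
        while (*l)                               /* spin on plain reads     */
            __asm__ __volatile__("pause");       /* be polite to the core   */
    }
}

static inline void spin_unlock(spinlock_t *l)
{
    __sync_lock_release(l);                      /* ordinary store + barrier */
}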
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Interlock clusters

Post by sje »

I'm more used to PowerPC interlocks, which seem a bit more efficient.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

sje wrote:I'm more used to PowerPC interlocks, which seem a bit more efficient.
They are not _that_ much more efficient. Locking the bus for a read/update/write cycle (which also bypasses the cache, of course) is a performance killer if you do it on every CPU at every node, which is what a hash lock will do. Add in a pawn hash lock and you now have two of those per node, which with 8 or 16 CPUs becomes a real contention point... there are also some interesting cache issues on this topic dealing with MOESI and what happens when a lock gets cleared, which invalidates the cache block for that hash entry in the other caches, adding additional overhead in simple pawn endings.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

I should add that we _did_ lock the hash on Cray systems, because the Cray had a zero-penalty lock instruction: really a 64-bit register shared among all CPUs, where each bit could be set atomically without interfering with the others. However, when we moved to the T932 with 32 CPUs, we began to run into trouble, and Harry came up with the same idea that Tim Mann did about 10 years later, that of XORing the two pieces.

I tried that approach first last week. 8-core NPS dropped from 20M to under 5M, so that went out the window. I then tried one 32-bit lock per hash entry. That was somewhat better, but I could never get past 8M nodes per second. The triple-XOR currently being used was no slower at all, which was good...

I had thought about the group/cluster idea, but it couldn't be better than individual locks and those were so bad...
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Interlock clusters

Post by wgarvin »

bob wrote:I had thought about the group/cluster idea, but it couldn't be better than individual locks and those were so bad...
Maybe clustered locks could be slightly better than individual locks, if you could somehow spread them across far fewer cache lines (otherwise you pay for the cache misses on the lock word itself). But you would have to do this without introducing significant cache-line contention, or performance will be killed, as you have seen.
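
To make the trade-off concrete, here are the two extreme layouts for, say, 1,024 spin locks on a machine with 64-byte cache lines (sizes and names are assumptions for the example, not from any engine):

/* Packed: 16 locks per cache line -- only 4 KB of locks and few misses,
 * but threads working on unrelated entries bounce the same line around
 * (false sharing). */
volatile int packed_locks[1024];

/* Padded: one lock per cache line -- no false sharing, but 64 KB of locks
 * and a likely cache miss on nearly every acquisition. */
struct padded_lock {
    volatile int lock;
    char         pad[60];
} __attribute__((aligned(64)));

struct padded_lock padded_locks[1024];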

In the end, the non-locking XOR scheme is still going to be way better. It's very fast, and it doesn't touch any cache lines except the single cache line containing the entry itself. There's no way you can possibly beat that with locks.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Interlock clusters

Post by bob »

wgarvin wrote:
bob wrote:I had thought about the group/cluster idea, but it couldn't be better than individual locks and those were so bad...
Maybe clustered locks could be slightly better than individual locks, if you could somehow spread them across far fewer cache lines (otherwise you pay for the cache misses on the lock word itself). But you would have to do this without introducing significant cache-line contention, or performance will be killed, as you have seen.

In the end, the non-locking XOR scheme is still going to be way better. It's very fast, and it doesn't touch any cache lines except the single cache line containing the entry itself. There's no way you can possibly beat that with locks.
That's my thinking. The only downside is that when a mismatched store happens, that table entry is effectively invalidated, since the signature can never match and the data will never be used. That is acceptable to me unless the frequency gets very high. But with huge hash tables the probability is still very low, and overall this is more acceptable based on my performance measurements...
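
For readers following along, this is roughly how the lockless XOR idea works for a two-word entry (a common form of the trick; Crafty's triple-XOR variant differs in detail):

#include <stdint.h>

typedef struct {
    uint64_t key_xor_data;     /* the "signature": key XOR-ed with data   */
    uint64_t data;
} TTEntry;

/* Two plain stores, no lock.  If another thread's store interleaves with
 * this one, the two words will not belong together. */
void tt_store(TTEntry *e, uint64_t key, uint64_t data)
{
    e->key_xor_data = key ^ data;
    e->data         = data;
}

/* A mismatched entry simply never passes this test, so corrupted entries
 * are ignored rather than followed. */
int tt_probe(const TTEntry *e, uint64_t key, uint64_t *data_out)
{
    uint64_t d = e->data;

    if ((e->key_xor_data ^ d) == key) {
        *data_out = d;
        return 1;
    }
    return 0;
}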

Now if Vincent can just grasp the problem and stop polluting the threads on this topic...
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Interlock clusters

Post by hgm »

bob wrote:Won't work on x86. I tried this just this past week, in fact. The exchange instruction is _horribly_ slow; it grabs the bus for a long time and crushes SMP performance, even when the lock is almost never contested. You should try it on an 8-core or 16-core box. One or two locked accesses per node will absolutely kill performance... the lock-prefixed instructions just ramp up the contention for the bus...
Is this also true if you use XCHG on a variable that would produce a cache miss anyway?

I have not really tested it, but I seem to remember that modern x86 caches are not 'write through', but have an 'allocate on write' policy. This means that a store which misses L2 triggers a READ memory cycle to fetch the part of the cache line that is not overwritten, while at the same time the other CPUs (with independent caches) are informed that this cache line is now held in MODIFIED state, so that any of them that might have it cached can INVALIDATE it. One presumes this is implemented by having the other CPUs snoop the bus cycle that had to take place anyway, without any additional overhead. The actual WRITE memory cycle takes place later, when the cache line is flushed.

This is not much different from what needs to happen when you do an XCHG. And for hash entries you have to do the write sooner or later anyway. So if it makes little difference, you might as well mark the entry dirty with an XCHG, rather than with a normal load, while bringing it into cache.
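
A sketch of what that could look like (illustrative only, with made-up entry fields): the lock word lives in the first word of the entry itself, so the XCHG that acquires it is also the access that pulls the line into cache, already in MODIFIED state.

#include <stdint.h>

typedef struct {
    volatile int lock;         /* 0 = free, 1 = held                      */
    int32_t      age;
    uint64_t     key;
    uint64_t     data;
} LockedEntry;

void entry_update(LockedEntry *e, uint64_t key, uint64_t data)
{
    while (__sync_lock_test_and_set(&e->lock, 1))  /* XCHG on the entry    */
        ;                                          /* brief spin if held   */
    e->key  = key;
    e->data = data;
    __sync_lock_release(&e->lock);                 /* ordinary store       */
}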
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Interlock clusters

Post by wgarvin »

hgm wrote:
bob wrote:Won't work on x86. I tried this just this past week, in fact. The exchange instruction is _horribly_ slow; it grabs the bus for a long time and crushes SMP performance, even when the lock is almost never contested. You should try it on an 8-core or 16-core box. One or two locked accesses per node will absolutely kill performance... the lock-prefixed instructions just ramp up the contention for the bus...
Is this also true if you use XCHG on a variable that would produce a cache miss anyway?
I think the XCHG instruction automatically locks the bus for exclusive access during its entire operation. This is a legacy thing dating back to the ancient 8086. I'm not positive, but I think on modern processors, CMPXCHG relies on cache coherency and doesn't actually lock the bus when the memory involved is cachable (even if the LOCK prefix is used). So it might be worth trying that. But it would probably still be way worse than continuing to use the XOR trick and not lock at all.
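
If someone wanted to try that, the change relative to the XCHG-based spin lock sketched earlier in the thread is small; whether LOCK CMPXCHG really behaves better here is exactly the open question, so this is just the shape of the experiment:

/* Same test-and-test-and-set loop as before, but acquiring with
 * LOCK CMPXCHG (via the GCC builtin) instead of XCHG. */
static inline void spin_lock_cas(volatile int *l)
{
    while (__sync_val_compare_and_swap(l, 0, 1) != 0) {
        while (*l)
            __asm__ __volatile__("pause");
    }
}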
hgm
Posts: 27808
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Interlock clusters

Post by hgm »

Sure, for hash storing the XOR trick is great. But I am also interested in having locks for more critical tasks.

I am still a bit puzzled by this. On a Pentium M I measured the latency of XCHG REG,MEM as 20 clocks, which is pretty bad for an L1 cache hit, but not completely lethal. It does seem to stall the pipeline, though, as I could not really execute other instructions in parallel with it.

But if I try it each time on a new memory variable that is not yet cached, it seems to take about 90 clocks more than incrementing that variable. This even happens when I do the increment first and then the XCHG, whereas I would have thought that the increment would be enough to mark the cache line as MODIFIED, and the test on the cached variable shows that in principle a 20-cycle XCHG on such a MODIFIED variable is possible.

Perhaps it is an out-of-order problem, and in practice the XCHG is executed before the INC.

There might be a huge inefficiency in the implementation here; in principle an XCHG on a MODIFIED L1-cached variable should not be any more difficult than, say, an INC (as L1 is never a shared cache). I read that newer processors would improve the efficiency of semaphores, which indeed suggests that this was a known inefficiency in older CPUs.
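
For anyone who wants to repeat the measurement, here is a minimal sketch of this kind of RDTSC micro-benchmark (GCC on x86; not the actual test code used above). It times the L1-hit case only; the uncached case would need a large array touched once per element, and since neither out-of-order effects nor the RDTSC overhead are controlled for, only the difference between the two numbers is meaningful.

#include <stdio.h>
#include <stdint.h>

static inline uint64_t rdtsc(void)
{
    uint32_t lo, hi;
    __asm__ __volatile__("rdtsc" : "=a"(lo), "=d"(hi));
    return ((uint64_t)hi << 32) | lo;
}

volatile int var;

int main(void)
{
    enum { N = 1000000 };
    uint64_t t0, t1, t2;
    int dummy = 1;
    int i;

    t0 = rdtsc();
    for (i = 0; i < N; i++)
        __asm__ __volatile__("incl %0" : "+m"(var));        /* plain RMW   */
    t1 = rdtsc();
    for (i = 0; i < N; i++)
        __asm__ __volatile__("xchgl %1, %0"                 /* implicitly  */
                             : "+m"(var), "+r"(dummy));     /* locked RMW  */
    t2 = rdtsc();

    printf("INC : %.1f clocks per operation\n", (double)(t1 - t0) / N);
    printf("XCHG: %.1f clocks per operation\n", (double)(t2 - t1) / N);
    return 0;
}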