Interlock clusters

hgm · Post by **hgm** » Mon Feb 02, 2009 9:00 pm

bob wrote:This case I am focusing on is the case that is most important, the case where there _is_ contention for a resource. If it is rarely used, there are other lockless approaches that can be used with less performance loss.

The reason I was interested in the other case was that I had a scheme in mind where every hash bucket would contain a lock that controls acces to that hash bucket only. And a thread would only use the lock when it was trying to obtain ownership of a node in that hash bucket. After it has succeeded in that, it would have altered the entry in such a way that other threads would only have to read it, which they could do without ever using the lock.

The probability that two different threads are trying to acquire ownership of the same node is pretty small. And the usual situation is that hash-table entries are not cached. The thread acquire ownership would have to load the entry from memory, but that is unavoidable for hash probes anyway. After the initial probe, it is very likely to be the only CPU that has the entry cached. But then it has to set a BUSY flag that will prevent other threads that sooner or later come along from claiming the node as their own. And it has to do it in a thread-safe way.

So if there would be procedures to do this reasonably efficiently, I would be very interested. Unfortunately, using XCHG to first access the entry and bring it in cache is a disaster: it takes 340 clocks on C2D, where doing an INC takes only 141 clocks. So it seems the bus locking causes an overhead of 199 clocks, when aplied naively. Now I have the feeling it must be possible to save a lot on this, as the XCHG with an already MODIFIED cache line seems to take only 25 clocks (as the timing of a loop containing it showed). I would like to understand what causes the difference. Because it should be possible to bring the entry in cache by doing the INC instruction, (taking 141 clocks), after which it should be possible to use the XCHG on the cached lock taking only 25 clocks. Although 25 clocks is still non-negligible, it is a lot less than 199. But somehow I have not found a way to do it that takes only 141 + 25 clocks, and I would like to understand why.

bob · Post by **bob** » Mon Feb 02, 2009 11:06 pm

hgm wrote:
bob wrote:This case I am focusing on is the case that is most important, the case where there _is_ contention for a resource. If it is rarely used, there are other lockless approaches that can be used with less performance loss.
The reason I was interested in the other case was that I had a scheme in mind where every hash bucket would contain a lock that controls acces to that hash bucket only. And a thread would only use the lock when it was trying to obtain ownership of a node in that hash bucket. After it has succeeded in that, it would have altered the entry in such a way that other threads would only have to read it, which they could do without ever using the lock.

The probability that two different threads are trying to acquire ownership of the same node is pretty small. And the usual situation is that hash-table entries are not cached. The thread acquire ownership would have to load the entry from memory, but that is unavoidable for hash probes anyway. After the initial probe, it is very likely to be the only CPU that has the entry cached. But then it

I has to set a BUSY flag that will prevent other threads that sooner or later come along from claiming the node as their own. And it has to do it in a thread-safe way.

So if there would be procedures to do this reasonably efficiently, I would be very interested. Unfortunately, using XCHG to first access the entry and bring it in cache is a disaster: it takes 340 clocks on C2D, where doing an INC takes only 141 clocks. So it seems the bus locking causes an overhead of 199 clocks, when aplied naively. Now I have the feeling it must be possible to save a lot on this, as the XCHG with an already MODIFIED cache line seems to take only 25 clocks (as the timing of a loop containing it showed). I would like to understand what causes the difference. Because it should be possible to bring the entry in cache by doing the INC instruction, (taking 141 clocks), after which it should be possible to use the XCHG on the cached lock taking only 25 clocks. Although 25 clocks is still non-negligible, it is a lot less than 199. But somehow I have not found a way to do it that takes only 141 + 25 clocks, and I would like to understand why.

I believe that my intel contact told me that if the requested value (target of xchg) is not in cache, then the lock worked as it always did, which is to lock the bus as well as the target cache block, because we still have to read the memory address into cache, and modify it, before we allow anyone else to access it. If we let two caches try to fetch at the same time, we are back to the original race condition.. According to what I read, the cache hit prevents the #LOCK from propagating beyond that cache level. But a miss propagates it right on up the line until it ultimately reaches the bus and locks the main memory read / cache write operations.

hgm · Post by **hgm** » Mon Feb 02, 2009 11:40 pm

OK, so it is probably a matter of making sure the lock is cached by doing an ordinary read or write to the cache line it is in (but not the lock itself!), and allow enough time for that to complete. And only then attempt a locked access. Then the #LOCK will no longer propagate to the bus. I will try that out.

If this is useful for my SMP design I don't know. Perhaps I would be better off by simply having a CPU write its personal code to a 'traffic light' variable in the hash bucket, when this tested as zero before, in the hope that no other CPU will attempt that at the same time. And then check if it really contains that personal code much later, say after it has finished searching the first move, to catch most of the rare cases that another CPU grabbed the node at the same time.

bob · Post by **bob** » Mon Feb 02, 2009 11:52 pm

hgm wrote:OK, so it is probably a matter of making sure the lock is cached by doing an ordinary read or write to the cache line it is in (but not the lock itself!), and allow enough time for that to complete. And only then attempt a locked access. Then the #LOCK will no longer propagate to the bus. I will try that out.

If this is useful for my SMP design I don't know. Perhaps I would be better off by simply having a CPU write its personal code to a 'traffic light' variable in the hash bucket, when this tested as zero before, in the hope that no other CPU will attempt that at the same time. And then check if it really contains that personal code much later, say after it has finished searching the first move, to catch most of the rare cases that another CPU grabbed the node at the same time.

This sounds like the ABADABA or whatever the devil it is called??? If so, it is not worth the effort...

hgm · Post by **hgm** » Tue Feb 03, 2009 11:32 am

I fully realize that this cannot give guaranteed muual exclusion, but that is not essential here. The purpose of this 'lock' would not be to prevent crashes or deadlocks, just to prevent time wastage by having two CPUs do exactly the same. If, due to some extreme coincidence of attempting to search the same node combined with an unfortunate interrupt between testing the lock and writing it, two threads would start to search the same node, well, too bad. But not the end of the world.

The 'lock' is basically only intended only to inform other threads that "this node is already being searched; go do something else in stead, as later you will be able to pick up the search result here for free".

bob · Post by **bob** » Tue Feb 03, 2009 8:17 pm

hgm wrote:I fully realize that this cannot give guaranteed muual exclusion, but that is not essential here. The purpose of this 'lock' would not be to prevent crashes or deadlocks, just to prevent time wastage by having two CPUs do exactly the same. If, due to some extreme coincidence of attempting to search the same node combined with an unfortunate interrupt between testing the lock and writing it, two threads would start to search the same node, well, too bad. But not the end of the world.

The 'lock' is basically only intended only to inform other threads that "this node is already being searched; go do something else in stead, as later you will be able to pick up the search result here for free".

There are really two kinds of issues. One is similar to my hash table (and your proposed parallel search) where an error won't cause a crash. There you can safely not lock at all, and even omit other tricks such as the lockless hashing idea, so long as you are sure an "interleaved entry" from two different stores by different threads) won't cause any significant problem.

THe other kind is similar to my pawn hash issue where bogus data could cause the program to crash, because I chose to not check every subscript for legality for efficiency issues. As a result, bad data will crash me there, and either the XOR trick, or a lock, will solve that...

There are some lockless approaches to do certain types of operations that might also be helpful...

Interlock clusters

Re: Interlock clusters

Re: Interlock clusters

Re: Interlock clusters

Re: Interlock clusters

Re: Interlock clusters

Re: Interlock clusters