The reason I was interested in the other case was that I had a scheme in mind where every hash bucket would contain a lock that controls acces to that hash bucket only. And a thread would only use the lock when it was trying to obtain ownership of a node in that hash bucket. After it has succeeded in that, it would have altered the entry in such a way that other threads would only have to read it, which they could do without ever using the lock.bob wrote:This case I am focusing on is the case that is most important, the case where there _is_ contention for a resource. If it is rarely used, there are other lockless approaches that can be used with less performance loss.
The probability that two different threads are trying to acquire ownership of the same node is pretty small. And the usual situation is that hash-table entries are not cached. The thread acquire ownership would have to load the entry from memory, but that is unavoidable for hash probes anyway. After the initial probe, it is very likely to be the only CPU that has the entry cached. But then it has to set a BUSY flag that will prevent other threads that sooner or later come along from claiming the node as their own. And it has to do it in a thread-safe way.
So if there would be procedures to do this reasonably efficiently, I would be very interested. Unfortunately, using XCHG to first access the entry and bring it in cache is a disaster: it takes 340 clocks on C2D, where doing an INC takes only 141 clocks. So it seems the bus locking causes an overhead of 199 clocks, when aplied naively. Now I have the feeling it must be possible to save a lot on this, as the XCHG with an already MODIFIED cache line seems to take only 25 clocks (as the timing of a loop containing it showed). I would like to understand what causes the difference. Because it should be possible to bring the entry in cache by doing the INC instruction, (taking 141 clocks), after which it should be possible to use the XCHG on the cached lock taking only 25 clocks. Although 25 clocks is still non-negligible, it is a lot less than 199. But somehow I have not found a way to do it that takes only 141 + 25 clocks, and I would like to understand why.