Lockless hash: Thank you Bob Hyatt!

JuLieN · Post by **JuLieN** » Sun Aug 25, 2013 9:56 pm

I am currently rewriting my engine in c, making it multithreaded in the process, and had to scratch my head when I found a bottleneck: locking my hashtable's accesses with mutexes proved extremely slow, to the point that height threads barely made for the equivalent of two threads. I was pretty despaired, so I took a look at the programming wiki and heard about Bob's lockless hash system. And it now works extremely well! So I just wanted to thank Bob for this brilliant idea!

mar · Post by **mar** » Sun Aug 25, 2013 10:28 pm

My latest creation uses the same idea so count me in

sje · Post by **sje** » Mon Aug 26, 2013 1:42 am

Bob's system is clever, no doubt about it.

But there's another, more general way which I use in Symbolic. It requires a little bit more code, but in practice I have not noticed a measurable slowdown. Of course, what works on my systems might not work as well on others. But maybe it might work better.

Symbolic has four different transposition tables used during search and a fifth used for perft() calculations. Each uses the same access control scheme.

The program has a C++ class TranBase which contains, among other instance members, an array of 256 spinlock objects. A spinlock acts like a common mutex, except that it is usually smaller and much faster in low-contention usage. Each of the five transposition table classes inherits from TranBase, so they all get a spinlock array.

When either a Probe() or Stash() operation is performed on a transposition table, an eight bit portion of the table element index is generated via a mask and is then used as an index to the spinlock vector to lock a 1/256 size region of the table. The actual element access is then performed, followed by an unlock.

This approach works well, because a spinlock can be locked and unlocked extremely quickly if there is no contention. And even with eight threads accessing a transposition table, I have found the contention to be so low that going from 256 to 512 spinlocks per table makes no measurable difference.

There is no need for concern about table element alignment, hash key size, or element size.

Constraints: 1) For speed, element and spinlock counts should be an integral power of two; 2) the platform must have a fast spinlock implementation; 3) the platform must support relatively large numbers of simultaneous spinlock instances.

bob · Post by **bob** » Mon Aug 26, 2013 2:53 am

JuLieN wrote:I am currently rewriting my engine in c, making it multithreaded in the process, and had to scratch my head when I found a bottleneck: locking my hashtable's accesses with mutexes proved extremely slow, to the point that height threads barely made for the equivalent of two threads. I was pretty despaired, so I took a look at the programming wiki and heard about Bob's lockless hash system. And it now works extremely well! So I just wanted to thank Bob for this brilliant idea!

wasn't mine. Original idea came from Harry Nelson somewhere in 1990-1992, not sure exactly when. Sounded good to me and it worked...

I'll accept your thanks on behalf of Harry.

bob · Post by **bob** » Mon Aug 26, 2013 2:57 am

sje wrote:Bob's system is clever, no doubt about it.

But there's another, more general way which I use in Symbolic. It requires a little bit more code, but in practice I have not noticed a measurable slowdown. Of course, what works on my systems might not work as well on others. But maybe it might work better.

Symbolic has four different transposition tables used during search and a fifth used for perft() calculations. Each uses the same access control scheme.

The program has a C++ class TranBase which contains, among other instance members, an array of 256 spinlock objects. A spinlock acts like a common mutex, except that it is usually smaller and much faster in low-contention usage. Each of the five transposition table classes inherits from TranBase, so they all get a spinlock array.

When either a Probe() or Stash() operation is performed on a transposition table, an eight bit portion of the table element index is generated via a mask and is then used as an index to the spinlock vector to lock a 1/256 size region of the table. The actual element access is then performed, followed by an unlock.

This approach works well, because a spinlock can be locked and unlocked extremely quickly if there is no contention. And even with eight threads accessing a transposition table, I have found the contention to be so low that going from 256 to 512 spinlocks per table makes no measurable difference.

There is no need for concern about table element alignment, hash key size, or element size.

Constraints: 1) For speed, element and spinlock counts should be an integral power of two; 2) the platform must have a fast spinlock implementation; 3) the platform must support relatively large numbers of simultaneous spinlock instances.

I would hope each "spin-lock" is a 64-byte structure in that element? Otherwise you really will have cache issues when you go to 8-16-32 cores. It is called "false-sharing", where modifying one lock blows that entire block of 64 bytes of locks out of cache on every other CPU. Not exactly what one wants for performance. If you make each lock a 4-byte thing followed by 60 bytes of padding, the problem disappears, although locks are still bad on even larger numbers of cores, and on NUMA architectures with a number of nodes...

sje · Post by **sje** » Mon Aug 26, 2013 3:55 am

bob wrote:I would hope each "spin-lock" is a 64-byte structure in that element? Otherwise you really will have cache issues when you go to 8-16-32 cores. It is called "false-sharing", where modifying one lock blows that entire block of 64 bytes of locks out of cache on every other CPU. Not exactly what one wants for performance. If you make each lock a 4-byte thing followed by 60 bytes of padding, the problem disappears, although locks are still bad on even larger numbers of cores, and on NUMA architectures with a number of nodes...

In Symbolic, the SpinLock class contains not a spinlock, but rather a pointer to a platform specific spinlock allocated on the heap by the SpinLock constructor. So the actual spinlocks are aligned, but are not adjacent because there's some padding introduced by new. How much padding? I don' t know, and it doesn't matter because testing shows that increasing the number of spinlocks in the table vector does not improve performance. And the testing was done with a multithreaded perft(), a transposition stressor tougher than a regular search.

The key is low contention usage, and that's what makes the spinlock vector regionalization of a transposition table work.

bob · Post by **bob** » Mon Aug 26, 2013 6:01 am

sje wrote:
bob wrote:I would hope each "spin-lock" is a 64-byte structure in that element? Otherwise you really will have cache issues when you go to 8-16-32 cores. It is called "false-sharing", where modifying one lock blows that entire block of 64 bytes of locks out of cache on every other CPU. Not exactly what one wants for performance. If you make each lock a 4-byte thing followed by 60 bytes of padding, the problem disappears, although locks are still bad on even larger numbers of cores, and on NUMA architectures with a number of nodes...
In Symbolic, the SpinLock class contains not a spinlock, but rather a pointer to a platform specific spinlock allocated on the heap by the SpinLock constructor. So the actual spinlocks are aligned, but are not adjacent because there's some padding introduced by new. How much padding? I don' t know, and it doesn't matter because testing shows that increasing the number of spinlocks in the table vector does not improve performance. And the testing was done with a multithreaded perft(), a transposition stressor tougher than a regular search.

The key is low contention usage, and that's what makes the spinlock vector regionalization of a transposition table work.

Here's the problem. Multiple spinlocks in a single cache line will hurt performance. When you acquire a lock in one thread, you evict it from all the other caches. Then when they try to acquire one of the locks... and there the process repeats. When threads are trying to acquire, they bounce that single block of memory back and forth like mad, and that hurts performance.

Right approach is to make a spinlock structure at least 64 bytes long, then there is no "false sharing" of data going on between different caches... and no resulting excessive cache-to-cache traffic...

ibid · Post by **ibid** » Mon Aug 26, 2013 7:26 am

I guess I do exactly that... from gperft's source code:

Code: Select all

#define NUM_SPINLOCKS 256

struct SL
&#123;
	pthread_spinlock_t lock;
	char dummy&#91;64-sizeof&#40;pthread_spinlock_t&#41;&#93;;
&#125;;

SL spinlocks&#91;NUM_SPINLOCKS&#93;;

I taught myself about "false sharing" the hard way with one of my early attempts at threaded code -- managed to write a program that ran the same speed using all 16 cpus (single core cpus) of an itanium system as it did using just one cpu -- took me a while to puzzle out what had gone wrong. :)

For what it's worth, I commented out the lock/unlock code to see what the cost was... gperft ran 1.2% faster. So the cost for me is pretty small -- but gperft does not use the hash table with only 1 ply remaining.

As to why I use locks at all: long-running perfts really need a hash key more than 64 bits since all it takes is a single mistake to screw up the result. gperft saves 80 to 96 bits of hash key for confirmation -- independent of the bits used to index the hash table. I couldn't see any way to remove the locks for such a setup.

-paul

sje · Post by **sje** » Mon Aug 26, 2013 8:21 am

Symbolic packs a full 16 byte hash signature in each transposition entry. This makes the Fetch()/Stash() calls faster than does a 10 or 12 byte signature because there's no need for packing or unpacking. And because the likelihood of a false positive is far less than the likelihood of memory corruption from cosmic rays, there's no need to execute costly sanity checking code on every move or every score retrieval.

We have long since passed the days of 640 KB desktops. Or 640 KB handhelds, for that matter.

Michel · Post by **Michel** » Mon Aug 26, 2013 8:28 am

As to why I use locks at all: long-running perfts really need a hash key more than 64 bits since all it takes is a single mistake to screw up the result. gperft saves 80 to 96 bits of hash key for confirmation -- independent of the bits used to index the hash table. I couldn't see any way to remove the locks for such a setup.

It seems to me that Hyatt hash is just some kind of checksum. The checksum is not explicitly added since there is already redundancy in the data which can be exploited. The clever idea is to distribute this redundancy equally over all the bits.

But the checksum principle works in general. Rather than a lock one can use a checksum. Of course which is better depends on the situation.

Lockless hash: Thank you Bob Hyatt!

Lockless hash: Thank you Bob Hyatt!

Re: Lockless hash: Thank you Bob Hyatt!

Another way

Re: Lockless hash: Thank you Bob Hyatt!

Re: Another way

Re: Another way

Re: Another way

Re: Another way

Re: Another way

Re: Another way