Questions on SMP search

bob · Post by **bob** » Fri Apr 22, 2011 4:51 pm

rvida wrote:My implementation (in Critter):

- Main hash is shared. No locking nor lockless xoring is done. Moves from TT are checked for legality instead.

- Pawn hash and evaluation cache are local to thread.

- History array is shared. Since writes here are atomic there is no need to care about conflicts. History tends to be very noisy anyway with higher depths. There might be some speed penalty due to cache coherency traffic / snooping.

- With killer moves it is a bit more complicated. While each thread maintains its private killer array, during split() operation a few plies worth of killer moves are copied over.

What is the reason for a local pawn hash rather than global? I've tested both and global gives better performance (as it logically should). If you believe in global transpositions (which obviously happen) then why not use a global pawn hash for the very same reason?

As far as killers go, I simply copy all killers _below_ the split point ply. I don't need the split point ply itself as I use the shared tree structure for that since I need to control who searches what anyway.

bob · Post by **bob** » Fri Apr 22, 2011 4:53 pm

bhlangonijr wrote:
rvida wrote:My implementation (in Critter):

- Main hash is shared. No locking nor lockless xoring is done. Moves from TT are checked for legality instead.

- Pawn hash and evaluation cache are local to thread.

- History array is shared. Since writes here are atomic there is no need to care about conflicts. History tends to be very noisy anyway with higher depths. There might be some speed penalty due to cache coherency traffic / snooping.

- With killer moves it is a bit more complicated. While each thread maintains its private killer array, during split() operation a few plies worth of killer moves are copied over.
Interesting. I understand Bob's position that killers are a "local issue", although in the non-smp version of the search implementation we usually don't keep many killers arrays per sub-tree, for example. I guess we end up creating a somewhat different move ordering scheme, when implementing the SMP version of the search. Is this a problem when testing new changes - not related to the SMP code?

There is no reliable way to test SMP changes for accuracy. You can't even get the same node count on two identical searches, much less after you make a change...

SMP changes can only be validated with actual game testing... Or to a lesser degree, by using a group of positions and testing time-to-depth for all of them.

uaf · Post by **uaf** » Fri Apr 22, 2011 5:58 pm

In Chiron:

- Main Hash and Pawn Hash are shared. I use for both Hyatt's lockless xoring.

- Evaluation Cache and History Heuristic are shared. They are updated/read with atomic operations.

- Killers are local to each thread. I copy a few plies of killers for each child and then I update the parent's killers with the ones of the child that returned the best value.

mcostalba · Post by **mcostalba** » Fri Apr 22, 2011 7:13 pm

bob wrote: What is the reason for a local pawn hash rather than global?

To avoid using lockless xoring on a quite big struct like is the pawn entry and the fact that hit rate is anyhow well above 95% also with per-thread versions.

bob · Post by **bob** » Sat Apr 23, 2011 12:31 am

bhlangonijr wrote:
bob wrote:
I share hash and phash. I have local killers and don't do history at all in Crafty. I don't think shared killers make much sense as they are pretty much a "local issue."

+
Thanks all you guys for the answers.

Another important question is, how you do the load balance when distributing the work among the available threads?

In my second attempt to implement SMP (the first one failed miserably), I have a SMP search method with a thread-safe access to the move stack, control variables, hash, etc. All the available threads are booked and then call simultaneously the SMP search method, passing as parameter the same move stack, control variables, etc. The master is defined as the thread which called the "split" method. If some slave finishes its work it will become available for helping in a further split call. But when the master finishes its work it will stay in a busy loop until all slaves get their work done.

With limited testing, I got in average 1.6 speed up when using 2 cores and about 2.9 when using up to 6 cores (Considering NPS). Also, it appears that the speed up gained when using 2 cores doesn't help at all, in Elo terms...

What are your numbers? Mine are too bad?

Thanks,

My 2 cpu number is around 1.7-1.8... for 4 threads, it is around 3.1-3.4x. Both of those are averaged over a bunch of test positions using time to depth measurements.

bob · Post by **bob** » Sat Apr 23, 2011 12:33 am

mcostalba wrote:
bob wrote: What is the reason for a local pawn hash rather than global?
To avoid using lockless xoring on a quite big struct like is the pawn entry and the fact that hit rate is anyhow well above 95% also with per-thread versions.

The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.

The down-side of local pawn hashes shows up on a 12-core or 24 core box. 24 separate pawn hashes can burn some memory. And for any configuration, it has a somewhat negative effect on cache.

bob · Post by **bob** » Sat Apr 23, 2011 12:34 am

uaf wrote:In Chiron:

- Main Hash and Pawn Hash are shared. I use for both Hyatt's lockless xoring.

- Evaluation Cache and History Heuristic are shared. They are updated/read with atomic operations.

- Killers are local to each thread. I copy a few plies of killers for each child and then I update the parent's killers with the ones of the child that returned the best value.

What exactly do you mean by "atomic operations"??? Using a lock to protect an entry? Not really needed. The history data is so suspect already, you can easily just update entries whenever you want and ignore the SMP interleaved updates that will occasionally make a counter slightly off...

bhlangonijr · Post by **bhlangonijr** » Sat Apr 23, 2011 4:06 am

bob wrote: What exactly do you mean by "atomic operations"??? Using a lock to protect an entry?

You don't need to use a lock to protect a history entry, since it is usually an integer type so that the cpu will use only a single instruction to update it. There is no concurrency issue as It is done atomically. I suppose is that what he means...

bob · Post by **bob** » Sat Apr 23, 2011 7:16 am

bhlangonijr wrote:
bob wrote: What exactly do you mean by "atomic operations"??? Using a lock to protect an entry?
You don't need to use a lock to protect a history entry, since it is usually an integer type so that the cpu will use only a single instruction to update it. There is no concurrency issue as It is done atomically. I suppose is that what he means...

He's wrong. You have to read a value from memory, update it, and write it back. That is _not_ done atomically unless you do something like an add with a lock prefix. This is subject to the classic "interleaved update" problem. Two cpus, A and B. A and B simultaneously read a value for X. Each adds 1 to that value, then each writes X back to memory. When you are done, X was incremented by 1, not 2.

If a single instruction, such as add [history + 8 * edx], eax is done, the CPU has to do a read and write and they most definitely are _not_ done atomically unless you request it with a lck prefix. And a compiler doesn't do that normally.

mcostalba · Post by **mcostalba** » Sat Apr 23, 2011 11:03 am

bob wrote: The XORs are essentially free, they provide some pretty easy to use extra work for a pipe. When I added the lockless hashing stuff, I didn't see any drop in NPS at all.

The down-side of local pawn hashes shows up on a 12-core or 24 core box. 24 separate pawn hashes can burn some memory. And for any configuration, it has a somewhat negative effect on cache.

Apart that bus sniffing is heavier for shared variables than for thread local ones, so that your argumentation on cache pressure is at least dubious.

But the key reason is that once we get a pointer to a pawn hash table entry we can use it much later in the evaluation (unlike of the main hash where you just quickly read value and tt move). For instance passed pawns are stored in pawn hash entry and so is some king safety related stuff that is accessed and used only much later the pawn entry pointer has been grabbed...and we really don't want to care about someone changes the pawn entry under our feet while we are still working in the evaluation code.

Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search