Questions on SMP search

bhlangonijr · Post by **bhlangonijr** » Thu Apr 21, 2011 7:27 am

Concerning the following structures of your chess engine:

- Main Hash Table
- Pawn Hash Table
- Killers array
- History array

How you deal with these structures, when implementing a SMP search? To be more specific:
- For what structures, you keep a separate instance for each thread and what structures share a single instance with all threads;
- How you handle the concurrent access - if applicable - for each of these structures;
- How to maintain the killers/history arrays.

Thanks,

Joost Buijs · Post by **Joost Buijs** » Thu Apr 21, 2011 7:52 am

I have almost everything thread global except for the killer arrays, they are maintained locally.

For the hash table(s) I use lockless (Hyatt style) hashing, locking each TT access is way to expensive.

For the history array I don't do anything special, because in my engine this is an atomic operation.

Recently I tried to change the history array from thread global to thread local. After doing al lot of tests it seems this performes worse, probably because there is less information per thread.

ldesnogu · Post by **ldesnogu** » Thu Apr 21, 2011 10:47 am

Robert Hyatt has a short article here : A lockless transposition table implementation for parallel search.

Daniel Shawul · Post by **Daniel Shawul** » Thu Apr 21, 2011 4:33 pm

You have to share the main one for SMP. Sharing that table is so important for alpha-beta , distributed hash tables are sometimes used even on clusters.
I currently do not share them there since it seems a lot of work for a 15% speed up. I do not share the others even on SMP systems. I do not "lockless hash" but test for the legality of the move instead.

marcelk · Post by **marcelk** » Thu Apr 21, 2011 8:11 pm

bhlangonijr wrote:Concerning the following structures of your chess engine:

- Main Hash Table
- Pawn Hash Table
- Killers array
- History array

How you deal with these structures, when implementing a SMP search? To be more specific:
- For what structures, you keep a separate instance for each thread and what structures share a single instance with all threads;
- How you handle the concurrent access - if applicable - for each of these structures;
- How to maintain the killers/history arrays.

Thanks,

I have everything separated by address space.
Only the transposition table is in shared memory.

I don't use locking or 'lockless' tricks but just accept the
race conditions and make sure the engine doesn't
choke on reading an invalid move from the table but
silently discards it.

I might speed up communication speed in a later version
also through shared memory.

bob · Post by **bob** » Thu Apr 21, 2011 8:45 pm

marcelk wrote:
bhlangonijr wrote:Concerning the following structures of your chess engine:

- Main Hash Table
- Pawn Hash Table
- Killers array
- History array

How you deal with these structures, when implementing a SMP search? To be more specific:
- For what structures, you keep a separate instance for each thread and what structures share a single instance with all threads;
- How you handle the concurrent access - if applicable - for each of these structures;
- How to maintain the killers/history arrays.

Thanks,
I have everything separated by address space.
Only the transposition table is in shared memory.

I don't use locking or 'lockless' tricks but just accept the
race conditions and make sure the engine doesn't
choke on reading an invalid move from the table but
silently discards it.

I might speed up communication speed in a later version
also through shared memory.

I share hash and phash. I have local killers and don't do history at all in Crafty. I don't think shared killers make much sense as they are pretty much a "local issue."

+

rvida · Post by **rvida** » Thu Apr 21, 2011 8:56 pm

My implementation (in Critter):

- Main hash is shared. No locking nor lockless xoring is done. Moves from TT are checked for legality instead.

- Pawn hash and evaluation cache are local to thread.

- History array is shared. Since writes here are atomic there is no need to care about conflicts. History tends to be very noisy anyway with higher depths. There might be some speed penalty due to cache coherency traffic / snooping.

- With killer moves it is a bit more complicated. While each thread maintains its private killer array, during split() operation a few plies worth of killer moves are copied over.

mcostalba · Post by **mcostalba** » Fri Apr 22, 2011 7:47 am

rvida wrote:My implementation (in Critter):

- Main hash is shared. No locking nor lockless xoring is done. Moves from TT are checked for legality instead.

- Pawn hash and evaluation cache are local to thread.

- History array is shared. Since writes here are atomic there is no need to care about conflicts. History tends to be very noisy anyway with higher depths. There might be some speed penalty due to cache coherency traffic / snooping.

- With killer moves it is a bit more complicated. While each thread maintains its private killer array, during split() operation a few plies worth of killer moves are copied over.

In Stockfish is 100% the same.

The only difference is that we don't have an evaluation cache but we use the main hash table to store evaluation together with the usual data.

bhlangonijr · Post by **bhlangonijr** » Fri Apr 22, 2011 3:46 pm

rvida wrote:My implementation (in Critter):

- Main hash is shared. No locking nor lockless xoring is done. Moves from TT are checked for legality instead.

- Pawn hash and evaluation cache are local to thread.

- History array is shared. Since writes here are atomic there is no need to care about conflicts. History tends to be very noisy anyway with higher depths. There might be some speed penalty due to cache coherency traffic / snooping.

- With killer moves it is a bit more complicated. While each thread maintains its private killer array, during split() operation a few plies worth of killer moves are copied over.

Interesting. I understand Bob's position that killers are a "local issue", although in the non-smp version of the search implementation we usually don't keep many killers arrays per sub-tree, for example. I guess we end up creating a somewhat different move ordering scheme, when implementing the SMP version of the search. Is this a problem when testing new changes - not related to the SMP code?

bhlangonijr · Post by **bhlangonijr** » Fri Apr 22, 2011 4:37 pm

bob wrote:
I share hash and phash. I have local killers and don't do history at all in Crafty. I don't think shared killers make much sense as they are pretty much a "local issue."

+

Thanks all you guys for the answers.

Another important question is, how you do the load balance when distributing the work among the available threads?

In my second attempt to implement SMP (the first one failed miserably), I have a SMP search method with a thread-safe access to the move stack, control variables, hash, etc. All the available threads are booked and then call simultaneously the SMP search method, passing as parameter the same move stack, control variables, etc. The master is defined as the thread which called the "split" method. If some slave finishes its work it will become available for helping in a further split call. But when the master finishes its work it will stay in a busy loop until all slaves get their work done.

With limited testing, I got in average 1.6 speed up when using 2 cores and about 2.9 when using up to 6 cores (Considering NPS). Also, it appears that the speed up gained when using 2 cores doesn't help at all, in Elo terms...

What are your numbers? Mine are too bad?

Thanks,

Questions on SMP search

Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search

Re: Questions on SMP search