SMP and Thread Pool Design pattern


Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

SMP and Thread Pool Design pattern

Post by Edsel Apostol »

I'm done converting the SMP code in Hannibal to C++11 threads instead of native Windows threads, to make it portable across platforms. Now I want to take it further and implement some kind of Thread Pool design pattern.

There is a similar Thread Pool in Stockfish, but I don't want that kind of implementation. I'm thinking of using the thread pool for more than just creating threads and parking them in their own idle loop waiting for work from a splitpoint. What I'm planning to do is keep a queue of work for potential splitpoints, and the threads get their work from this queue.

The advantage of this method is that I think it will be more efficient than the classic "threads waiting for available splitpoints". Splitpoints can be stored in the queue even when no threads are available, and when a thread becomes available, instead of waiting for splitpoints from active threads it gets work from the queue of saved splitpoints, perhaps choosing the best available splitpoint there. Of course the number of work items being queued can be tuned for the best performance.
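A minimal sketch of the kind of queue described above (the names here are made up for illustration, not actual Hannibal code):

Code: Select all

// Minimal sketch of a splitpoint work queue (illustrative names only).
#include <deque>
#include <mutex>

struct SplitPoint;                       // the engine's splitpoint type (assumed)

struct SplitPointWork {
    SplitPoint* sp;                      // the stored candidate splitpoint
    int depth;                           // remaining depth, used to pick the "best" work
};

struct WorkQueue {
    std::mutex mtx;
    std::deque<SplitPointWork> work;

    void push(const SplitPointWork& w) {
        std::lock_guard<std::mutex> g(mtx);
        work.push_back(w);
    }

    // An idle thread takes the deepest queued splitpoint, if any is left.
    bool pop_best(SplitPointWork& out) {
        std::lock_guard<std::mutex> g(mtx);
        if (work.empty()) return false;
        auto best = work.begin();
        for (auto it = work.begin(); it != work.end(); ++it)
            if (it->depth > best->depth) best = it;
        out = *best;
        work.erase(best);
        return true;
    }
};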

The only thing holding me back right now is how to efficiently remove a stored splitpoint from the queue when it has already expired (all moves have been searched or a cutoff has already occurred) but the queued work has not yet been picked up by any thread.

Right now I have a working thread pool, but it doesn't yet handle the splitpoints this way; it just does the same thing SF does.

Has anyone done this before? It may be similar to what Spike does, and to Ronald de Man's implementation in his engine, based on what I've read in previous posts.

I have no idea if this is going to be successful, or be as fast or faster than our current implementation. Any thoughts?
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and Thread Pool Design pattern

Post by syzygy »

Edsel Apostol wrote:The only thing holding me back right now is how to efficiently remove a stored splitpoint from the queue when it has already expired (all moves have been searched or a cutoff has already occurred) but the queued work has not yet been picked up by any thread.
Placing candidate splitpoints in a real queue seems inefficient to me. Splitting is much rarer than creating/destroying a candidate splitpoint, so the latter must be as cheap as possible.

What I do is keep a 64-bit integer for each thread that shows at which plies there is a candidate split point. An idle thread loops through these integers, takes the lowest bit set, and then looks at the corresponding node to find the depth at which the node is being searched (each node / Position struct includes a "split block" portion where alpha,beta,depth are stored when the node becomes a candidate splitpoint, each thread has a globally accessible array of such structs). It then splits at the node with highest depth. Helpful masters have to be careful that they join a node being searched by a slave.

Creating/destroying a candidate splitpoint is a matter of setting/resetting the corresponding bit.
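A minimal sketch of the scheme described above, using C++11 atomics and made-up field names (the real implementation differs in its details):

Code: Select all

// Sketch of per-thread candidate-splitpoint bitmasks (illustrative names only).
#include <atomic>
#include <cstdint>

struct SplitBlock { int alpha, beta, depth; };     // filled in when a node becomes a candidate

struct SearchThread {
    std::atomic<uint64_t> candidate_plies{0};      // bit p set => ply p is a candidate splitpoint
    SplitBlock split[64];                          // globally visible, indexed by ply
};

SearchThread* threads[64];
int num_threads;

// Creating/destroying a candidate splitpoint is a single atomic bit operation.
void set_candidate(SearchThread* t, int ply)   { t->candidate_plies.fetch_or(1ULL << ply); }
void clear_candidate(SearchThread* t, int ply) { t->candidate_plies.fetch_and(~(1ULL << ply)); }

// An idle thread scans all bitmasks and picks the candidate with the highest depth.
bool find_deepest_candidate(int& best_thread, int& best_ply) {
    int best_depth = -1;
    for (int i = 0; i < num_threads; i++) {
        uint64_t mask = threads[i]->candidate_plies.load();
        while (mask) {
            int ply = __builtin_ctzll(mask);       // lowest set bit
            mask &= mask - 1;
            int d = threads[i]->split[ply].depth;
            if (d > best_depth) { best_depth = d; best_thread = i; best_ply = ply; }
        }
    }
    return best_depth >= 0;                        // caller then joins that node
}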
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: SMP and Thread Pool Design pattern

Post by Daniel Shawul »

I don't think you can get anything out of it because idle time is non-existent even for the simplest YBW implementation. If you want to try out a radically different approach along the lines of 'work stealing', instead of the more conventional method, you may want to try Transposition Driven Scheduling (TDS).
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and Thread Pool Design pattern

Post by syzygy »

Apart from reduced idle time, which I do not think is insignificant at faster time controls, the advantage is that the engine splits closer to the root (so fewer splits and possibly fewer wasted nodes).

In addition, it allows you to be more selective about where you split. Splitting after captures and killers have been searched (in non-PV nodes) is again likely to reduce the number of wasted nodes.
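For illustration, the test for marking a node as a candidate splitpoint might then look roughly like this (a sketch; the names and thresholds are assumptions):

Code: Select all

// Sketch of a more selective candidate-splitpoint test (assumed names/thresholds).
bool can_be_candidate(int depth, bool pv_node, int moves_searched, bool past_captures_and_killers) {
    if (depth <= 1) return false;                      // too shallow to be worth splitting
    if (pv_node)  return moves_searched >= 1;          // plain YBW: after the first move
    return past_captures_and_killers;                  // non-PV: only after captures and killers
}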
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and Thread Pool Design pattern

Post by Edsel Apostol »

syzygy wrote:
Edsel Apostol wrote:The only thing holding me back right now is how to efficiently remove a stored splitpoint from the queue when it has already expired (all moves have been searched or a cutoff has already occurred) but the queued work has not yet been picked up by any thread.
Placing candidate splitpoints in a real queue seems inefficient to me. Splitting is much rarer than creating/destroying a candidate splitpoint, so the latter must be as cheap as possible.

What I do is keep a 64-bit integer for each thread that shows at which plies there is a candidate split point. An idle thread loops through these integers, takes the lowest bit set, and then looks at the corresponding node to find the depth at which the node is being searched (each node / Position struct includes a "split block" portion where alpha,beta,depth are stored when the node becomes a candidate splitpoint, each thread has a globally accessible array of such structs). It then splits at the node with highest depth. Helpful masters have to be careful that they join a node being searched by a slave.

Creating/destroying a candidate splitpoint is a matter of setting/resetting the corresponding bit.
That seems more efficient than the work queue I mentioned, but it should achieve the same purpose. I may need to redo some data structures in Hannibal to make it work efficiently there.

I have a few questions. Do you limit the number of candidate splitpoints relative to the number of threads? I remember from previous discussions that you have to lock the movelist in case a thread joins that splitpoint at any time. So even if no other thread is helping to search that node yet, you may need to lock the movelist, and if there are a lot of candidate splitpoints, wouldn't that incur locking overhead? Or maybe using a spinlock is not that expensive?

Also, how do you make the current thread aware that other threads have joined it, so that it can wait at the end of the search for those threads to finish? For it to also be a helpful master, I assume it enters the idle loop even if it is still the only thread searching that candidate splitpoint?
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and Thread Pool Design pattern

Post by Edsel Apostol »

Daniel Shawul wrote:I don't think you can get anything out of it because idle time is non-existent even for the simplest YBW implementation. If you want to try out a radically different approach along the lines of 'work stealing', instead of the more conventional method, you may want to try Transposition Driven Scheduling (TDS).
From a quick read about it on the wiki, it seems applicable only to cluster implementations and only to problems that search the entire tree. I doubt it would be beneficial in chess, as splitting the effort of searching a node may not be efficient when shared across more than a few processors/computers, because at any time one of the branches being searched may cause a cutoff and all the effort of the other computers searching that node goes to waste. You also have to communicate the cutoff, which incurs overhead. This is going to be most effective when effort is split at an All node or at the root, since all of the moves have to be searched anyway. Either you wait for the first move to be searched to get a bound for the zero-window search, as in YBW, or you search the first N moves with N processors/computers with a wide window, as in Multi-PV. I just don't think it will be effective on a single-computer setup.

As for the benefit of the approach I mentioned above: you could split nearer to or at the root, while at the same time minimizing the time threads sit idle. Let's say all of the active threads are searching below the minimum split depth and a thread is waiting in the idle loop; it will wait there until one of the active threads goes back up the tree and searches at a depth above the minimum split depth, so it can join the search. In the concept I described, the thread just takes work from the queue of previously stored potential splitpoints without waiting further for other threads. It can also choose a better splitpoint.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: SMP and Thread Pool Design pattern

Post by mcostalba »

Edsel Apostol wrote: There is a similar Thread Pool in Stockfish, but I don't want that kind of implementation. I'm thinking of using the thread pool for more than just creating threads and parking them in their own idle loop waiting for work from a splitpoint. What I'm planning to do is keep a queue of work for potential splitpoints, and the threads get their work from this queue.
Maybe you can find this interesting:

http://talkchess.com/forum/viewtopic.php?t=43243

It was an attempt to not send threads to the idle loop after they have finished searching, but instead to let a thread find an active split point and join it.

Pros:
- Easy to implement on top of the standard YBWC scheme
- Does what is advertised: no idle time for threads.

Cons:
- Didn't prove stronger than the standard scheme, so it was retired.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and Thread Pool Design pattern

Post by syzygy »

Edsel Apostol wrote:I have a few questions. Do you limit the number of candidate splitpoints relative to the number of threads?
No, I only limit candidate splitpoints to nodes with depth > 1.
I remember from previous discussions that you have to lock the movelist in case a thread joins that splitpoint at any time. So even if no other thread is helping to search that node yet, you may need to lock the movelist, and if there are a lot of candidate splitpoints, wouldn't that incur locking overhead? Or maybe using a spinlock is not that expensive?
Once a node has been designated as a candidate splitpoint, the search must assume that there may be other threads searching the node, so it has to acquire a spinlock on the node before fetching a new move. But acquiring an uncontested spinlock is very fast.
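For example, move fetching at a candidate splitpoint might look roughly like this (a sketch with a simple C++11 spinlock; the names are assumptions, not the actual code):

Code: Select all

// Sketch: once a node is a candidate splitpoint, every move fetch goes through its
// spinlock, since another thread may have joined at any time (names are assumptions).
#include <atomic>

struct SplitNode {
    std::atomic<bool> lock{false};       // per-node spinlock
    int current_move = 0;
    int move_count = 0;
    int moves[256];
};

// Returns the next move to search, or -1 if none are left.
int next_move(SplitNode* n) {
    while (n->lock.exchange(true, std::memory_order_acquire))
        ;                                // spin; the uncontested case is one cheap RMW
    int m = (n->current_move < n->move_count) ? n->moves[n->current_move++] : -1;
    n->lock.store(false, std::memory_order_release);
    return m;
}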
Also, how do you make the current thread aware that other threads have joined it, so that it can wait at the end of the search for those threads to finish?
The node / Position struct has a (volatile) 64-bit integer called slaves. Slave threads joining the node set the bit corresponding to their thread id.

If a node is a candidate splitpoint and no moves are left, then the master thread for that node executes:

Code: Select all

  while (node->slaves && !tree->abort)
    help_slaves(tree, node);
When slaves finish, they clear their bit in the slaves field of the node they joined.
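In other words, joining and leaving a node are single atomic bit operations on that field, roughly like this (a sketch; the field is assumed here to be a std::atomic 64-bit integer):

Code: Select all

// Sketch of the join/leave bookkeeping on node->slaves (assumed std::atomic<uint64_t>).
#include <atomic>
#include <cstdint>

void join_node(std::atomic<uint64_t>& slaves, int thread_id) {
    slaves.fetch_or(1ULL << thread_id);       // slave announces itself to the master
}

void leave_node(std::atomic<uint64_t>& slaves, int thread_id) {
    slaves.fetch_and(~(1ULL << thread_id));   // master's wait loop can now exit
}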

The tree->abort variable is a thread-specific flag that is -1 if the search should be completely aborted (e.g. time has run out) and is n > 0 if there was a fail high in a split point at ply = n. A value of 0 means that the search should go on. (I don't currently split at the root / ply = 0, but it would be easy to shift the values a bit.)

The tree->abort and tree->splitpoints (or rather tree->candidate_splitpoints) variables are updated using atomic instructions, so without locking. For tree->splitpoints this is relatively easy using "lock btsq" and "lock btrq". Updating the tree->abort variable is a bit more tricky. If the search is to be fully aborted, it should be set to -1. If the searching thread is to be aborted at ply = n, then I have to check whether it was already aborted at a lower ply. For example, if there was a fail high at ply = 10, but earlier there was a fail high at ply = 6, then tree->abort should remain 6. So the required semantics is:

Code: Select all

  if (tree->abort == 0 || tree->abort > n) tree->abort = n;
To implement this atomically I use the following macro:

Code: Select all

#define ATOMIC_SET_ABORT(x,y) \
do { __asm__( \
"xorl %%eax, %%eax\n"      /* first attempt expects x == 0                         */ \
"0:\n" \
"lock cmpxchgl %1, %0\n\t" /* if x == eax then x = y, else load x into eax          */ \
"jz 1f\n\t"                /* exchange succeeded: done                              */ \
"cmpl %1, %%eax\n\t"       /* compare the current value (eax) with y                */ \
"jns 0b\n"                 /* current >= y: retry; current < y (incl. -1): keep it  */ \
"1:" \
: "+m" (x) : "r" (y) : "eax"); \
} while (0)
and then I can simply do:

Code: Select all

  ATOMIC_SET_ABORT(tree->abort, n);
I am afraid this is getting too complicated (and this is not the whole story), but this tree->abort flag turns out to be extremely useful in getting partial aborts right in a very simple and efficient manner. It might become clear that you need something like this when you implement the scheme. Essentially, if (current_ply > tree->abort) you need to cancel any slaves (those in node->slaves) and return. If (current_ply == tree->abort), then you are in a split point in which a move failed high and (if you are the master) you have to clean up, reset the abort flag (again using atomic instructions, because it should only be reset if it still == current_ply) and return beta (or the value returned by the slave thread that failed high in case of fail soft). The inequality current_ply >= tree->abort always holds.
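For reference, the same semantics in portable C++11 atomics (assuming the abort flag were a std::atomic<int>; a sketch, not the engine's actual code) would be roughly:

Code: Select all

// Portable sketch of the ATOMIC_SET_ABORT semantics using a C++11 compare-exchange loop.
#include <atomic>

void atomic_set_abort(std::atomic<int>& abort_flag, int n) {
    int cur = abort_flag.load();
    // Only overwrite if the search is not yet aborted (0) or was aborted at a
    // deeper ply (> n); never overwrite -1 (full abort) or a shallower ply.
    while ((cur == 0 || cur > n) &&
           !abort_flag.compare_exchange_weak(cur, n))
        ;   // on failure, cur is reloaded and the condition is re-checked
}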
For it to also be a helpful master, I assume it enters the idle loop even if it is still the only thread searching that candidate splitpoint?
As you see above. The help_slaves() function also looks for a candidate splitpoint, but one that is being searched by a slave.
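A rough sketch of what such a helper might do (field names follow the earlier bitmask sketch; Tree and join_splitpoint are assumed names, not the actual implementation):

Code: Select all

// Sketch of a helpful-master loop: while waiting for its slaves to finish, the
// master looks for a candidate splitpoint owned by one of those slaves and joins it.
void help_slaves(Tree* tree, SplitNode* node) {
    uint64_t slaves = node->slaves.load();
    while (slaves) {
        int id = __builtin_ctzll(slaves);                      // one of the threads helping us
        slaves &= slaves - 1;
        uint64_t cand = threads[id]->candidate_plies.load();   // its candidate splitpoints
        if (cand) {
            int ply = __builtin_ctzll(cand);
            join_splitpoint(tree, threads[id], ply);           // help search that node
            return;
        }
    }
    // No joinable work yet; the caller's while loop simply retries.
}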
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: SMP and Thread Pool Design pattern

Post by Daniel Shawul »

Well, my suggestion is for you to investigate a different approach than what is standard here. ABDADA is, for instance, a well-performing algorithm that hasn't been explored here as much as YBW. It is driven by the TT and is a good compromise between simplicity of implementation and performance. So investigating TDS in the same manner could be a good contribution. When I hear of optimizations that aren't needed all that much, honestly I yawn, which is why I suggested something different. A thread will be picked up quite soon, as you will see later, and having a choice of split points doesn't help much either. I tried both and barely got any improvement out of them. In fact, for the latter I even tried to do more with 'speculative splitting' at the parent node before all the children are searched, similar to what DTS does, but that requires an iterative search. If we agree that there is no idle time, then your hopes rest on splitting better at shallow depths while losing time. I wouldn't bet on that. So considering all that, you will see that good old recursive YBW is a solid algorithm by itself.

You are wrong about your impressions of TDS. It can be used in an SMP environment and definitely doesn't require you to search the whole tree. I don't recall exactly, but Cilkchess also uses a work-stealing approach.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: SMP and Thread Pool Design pattern

Post by bob »

mcostalba wrote:
Edsel Apostol wrote: There is a similar Thread Pool in Stockfish, but I don't want that kind of implementation. I'm thinking of using the thread pool for more than just creating threads and parking them in their own idle loop waiting for work from a splitpoint. What I'm planning to do is keep a queue of work for potential splitpoints, and the threads get their work from this queue.
Maybe you can find this interesting:

http://talkchess.com/forum/viewtopic.php?t=43243

It was an attempt to not send threads to the idle loop after they have finished searching, but instead to let a thread find an active split point and join it.

Pros:
- Easy to implement on top of the standard YBWC scheme
- Does what is advertised: no idle time for threads.

Cons:
- Didn't prove stronger than the standard scheme, so it was retired.
I think I mentioned that was my finding as well. The idea sounds perfectly reasonable, but it doesn't seem to help at all and does add to the complexity.