SMP and Thread Pool Design pattern

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and Thread Pool Design pattern

Post by Edsel Apostol »

mcostalba wrote:
Edsel Apostol wrote: There is a similar Thread Pool in Stockfish but I don't want that kind of implementation. I'm thinking of using the thread pool not just to generate threads and park them in their own idle loop waiting for work from a splitpoint. What I'm planning to do is have a queue of work for potential splitpoints, and threads get their work from this queue.
Maybe you can find this interesting:

http://talkchess.com/forum/viewtopic.php?t=43243

It was an attempt not to send the threads to the idle loop after they have finished searching, but to allow the thread to find an active split point and join it.

Pro:
- Easy to implement above standard YBWC scheme
- Does what is advertised: no idle time for threads.

Cons:
- Didn't prove stronger than standard, so was retired.
I've implemented this and I found around a 4 to 5% NPS increase, so it helps a bit. I didn't test the improvement with games though. I think it can still be improved further by looking over all the available splitpoints from all threads and choosing the best one, but I'm not sure it's worth the complexity.
jdart
Posts: 4367
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: SMP and Thread Pool Design pattern

Post by jdart »

I think things are often different with high core counts. Many things that work or don't work on a few cores behave differently with 16+ cores, in my experience.

--Jon
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and Thread Pool Design pattern

Post by Edsel Apostol »

syzygy wrote:
Edsel Apostol wrote:I have a few questions. Do you limit the number of candidate splitpoints relative to the number of threads?
No, I only limit candidate splitpoints to nodes with depth > 1.
I remember from previous discussions that you'll have to lock the movelist in case a thread joins that splitpoint at any time. So even if there are no other threads helping to search that node yet, you may need to lock the movelist, and if there are a lot of candidate splitpoints, wouldn't the locking incur overhead? Or maybe using a spinlock is not that expensive?
Once a node has been designated as a candidate splitpoint, the search must assume that there may be other threads searching the node, so it acquires a spinlock on the node before fetching a new move. But acquiring an uncontested spinlock is very fast.
Also, how do you make the current thread aware that some threads have joined it, so that it can wait at the end of the search for those threads to finish?
The node / Position struct has a (volatile) 64-bit integer called slaves. Slave threads joining the node set the bit corresponding to their thread id.

If a node is a candidate splitpoint and no moves are left, then the master thread for that node executes:

Code: Select all

  while (node->slaves && !tree->abort)
    help_slaves(tree, node);
When slaves finish, they clear their bit in the slaves field of the node they joined.
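
Schematically, the slave bookkeeping is just atomic bit operations on that field. Something like this (simplified, not the actual code; the GCC __sync builtins here stand in for the inline assembly I really use):

Code: Select all

  /* slave joins the node: set the bit for its thread id */
  __sync_fetch_and_or(&node->slaves, 1ULL << thread_id);

  /* ... the slave fetches and searches moves from the split point ... */

  /* slave is done: clear its bit so the master's wait loop above can exit */
  __sync_fetch_and_and(&node->slaves, ~(1ULL << thread_id));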

The tree->abort variable is a thread-specific flag that is -1 if the search should be completely aborted (e.g. time has run out) and is n > 0 if there was a fail high in a split point at ply = n. Tree->abort == 0 means that the search should go on. (I don't currently split at the root / ply = 0, but it would be easy to shift values a bit.)

The tree->abort and tree->splitpoints (or rather tree->candidate_splitpoints) variables are updated using atomic instructions, so without locking. For tree->splitpoints this is relatively easy using "lock btsq" and "lock btrq". Updating the tree->abort variable is a bit more tricky. If the search is to be fully aborted, it should be set to -1. If the searching thread is to be aborted at ply = n, then I have to check whether it was already aborted at a lower ply. For example, if there was a fail high at ply = 10, but earlier there was a fail high at ply = 6, then tree->abort should remain 6. So the required semantics is:

Code: Select all

  if (tree->abort == 0 || tree->abort > n) tree->abort = n;
To implement this atomically I use the following macro:

Code: Select all

#define ATOMIC_SET_ABORT(x,y) \
do { __asm__( \
"xorl %%eax, %%eax\n" \
"0:\n" \
"lock cmpxchgl %1, %0\n\t" \
"jz 1f\n\t" \
"cmpl %1, %%eax\n\t" \
"jns 0b\n" \
"1:" \
: "+m" (x) : "r" (y) : "eax"); \
} while (0)
and then I can simply do:

Code: Select all

  ATOMIC_SET_ABORT(tree->abort, n);
I am afraid this is getting too complicated (and this is not the whole story), but this tree->abort flag turns out to be extremely useful in getting partial aborts right in a very simple and efficient manner. It might become clear that you need something like this when you implement the scheme. Essentially, if (current_ply > tree->abort) you need to cancel any slaves (those in node->slaves) and return. If (current_ply == tree->abort), then you are in a split point in which a move failed high and (if you are the master) you have to clean up, reset the abort flag (again using atomic instructions, because it should only be reset if it still == current_ply) and return beta (or the value returned by the slave thread that failed high in case of fail soft). The inequality current_ply >= tree->abort always holds.
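
In pseudo code the handling after a move returns looks roughly like this (a simplified sketch; cancel_slaves, master_cleanup and reset_abort_if_equal are placeholder names, not my actual code):

Code: Select all

  /* Simplified sketch of the partial-abort handling after a move returns. */
  if (tree->abort) {                    /* 0 means: keep searching */
    if (current_ply > tree->abort) {
      /* full abort (-1) or a fail high at a split point closer to the root:
         cancel any slaves attached to this node and unwind */
      cancel_slaves(tree, node);
      return 0;                         /* return value is ignored on unwind */
    }
    /* current_ply == tree->abort: a move failed high in this split point */
    master_cleanup(tree, node);
    reset_abort_if_equal(tree, current_ply);  /* atomic, only if still == ply */
    return beta;                        /* or the slave's fail-soft value */
  }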
In order for it to also be a helpful master, I assume it also enters the idle loop even if it is still the only thread searching that candidate splitpoint?
As you see above. The help_slaves() function also looks for a candidate splitpoint, but one that is being searched by a slave.
Thanks. Here's what I'm doing. I limit splitting to depth >= 4. My position structure is too big, I guess; the cost of splitting is huge in my engine.

I have implemented something of a hybrid of my idea and yours. What I did is let a thread split when it meets the criteria, except that it never checks for available threads. I limit this to 1 active splitpoint per thread for now. So a thread will enter the split alone, just saving the splitpoint data to be copied by joining threads later on, then go to the idle loop and search from there. If other threads join the split, it will simply act as master; if not, it will search the split node alone.

Other idle threads then check all the active split points of the other threads, choose the best one, and join that split. It works, and I guess it should be more efficient that way, but it didn't beat my old implementation. Either there is too much copying of splitpoint data going on, or my implementation still has bugs or needs to be polished, but I think it has potential.
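
The selection in the idle loop is roughly this (a simplified sketch, not my literal code; preferring the deeper split is just one possible criterion):

Code: Select all

// Simplified sketch of the idle-loop selection, not my literal code.
SplitPoint* best = NULL;
for (int t = 0; t < num_threads; ++t) {
    if (t == my_id) continue;
    SplitPoint* sp = threads[t].active_sp;    // at most 1 active per thread
    if (sp && sp->moves_left > 0 && (!best || sp->depth > best->depth))
        best = sp;                            // e.g. prefer deeper splits
}
if (best)
    join_split(best, my_id);   // copy the splitpoint data, then help search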

It's a bit slower than my old implementation on my 8-core AMD using just 4 cores, but it is faster on an Intel Core i5 using the same number of cores.

Here's some statistics from my old and the experimental one:

Old:

Code: Select all

getBestMove: thread_id:0, num_sp:0 searching:1 stop:1 started:26780 ended:26780 nodes:5320043 numsplits:12711
getBestMove: thread_id:1, num_sp:0 searching:0 stop:1 started:29806 ended:29806 nodes:5378968 numsplits:11619
getBestMove: thread_id:2, num_sp:0 searching:0 stop:1 started:29982 ended:29982 nodes:5445576 numsplits:10065
getBestMove: thread_id:3, num_sp:0 searching:0 stop:1 started:30010 ended:30010 nodes:5444595 numsplits:10090
getBestMove: thread_id:4, num_sp:0 searching:0 stop:1 started:29647 ended:29647 nodes:5578231 numsplits:10717
getBestMove: thread_id:5, num_sp:0 searching:0 stop:1 started:32408 ended:32408 nodes:5288067 numsplits:12553
Experimental:

Code: Select all

getBestMove: thread_id:0, num_sp:0 searching:1 stop:1 started:1209235 ended:1209235 nodes:6981172 numsplits:222
getBestMove: thread_id:1, num_sp:0 searching:0 stop:1 started:1148600 ended:1148600 nodes:4756707 numsplits:1926
getBestMove: thread_id:2, num_sp:0 searching:0 stop:1 started:1199367 ended:1199367 nodes:4170914 numsplits:2050
getBestMove: thread_id:3, num_sp:0 searching:0 stop:1 started:1275446 ended:1275446 nodes:4218084 numsplits:2062
getBestMove: thread_id:4, num_sp:0 searching:0 stop:1 started:1317257 ended:1317257 nodes:4125861 numsplits:2076
getBestMove: thread_id:5, num_sp:0 searching:0 stop:1 started:1131348 ended:1131348 nodes:4139415 numsplits:2017
As you can see, the old implementation does more splits but starts far fewer searches from the idle loop. Also note that the main thread (thread_id = 0) does more splits than the other threads. The node count per thread is also fairly well balanced.

In my experimental one there are fewer splits, as I limit it to 1 active per thread, but there are a lot of joining threads in the idle loop, and it has a memory bottleneck due to lots of data copying when joining the split. In this implementation the main thread performs fewer splits, as it is the first one splitting at a depth closer to the root, but it also searches more nodes than the other threads.

Do you have statistics like that for your engine? I would be curious how splits, nodes, and split joins from the idle loop are spread out across all threads.

Anyways, I may try your exact implementation one of these days. My data structure doesn't seem ready yet for it to be efficient; for now I copy the entire pos structure, and at a bit over 2 KB it costs too much on the split. I'm now doing splits at the root, by the way.

Here's what I'm using as a spinlock in C++11, using std::atomic_flag. Your assembly may be much faster.

Code: Select all

#include <atomic>

class Spinlock {
public:
    Spinlock() { m_Lock.clear(std::memory_order_release); }
    ~Spinlock() { }
    // Spin until the flag was previously clear, i.e. until we own the lock.
    void lock() { while (m_Lock.test_and_set(std::memory_order_acquire)); }
    // Release the lock so another spinning thread can acquire it.
    void unlock() { m_Lock.clear(std::memory_order_release); }
private:
    std::atomic_flag m_Lock;
};
I'm using the same function names as std::mutex's lock/unlock, so I can change the variable declaration from mutex to spinlock and vice versa without changing other parts of the code.
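
For example (just to illustrate the swap; the Lock alias is not a name from my code):

Code: Select all

#include <mutex>

// Flip this alias to switch the whole engine between the two lock types.
// using Lock = std::mutex;
using Lock = Spinlock;

Lock splitLock;

void updateSplitpoint() {
    splitLock.lock();
    // ... modify shared splitpoint data ...
    splitLock.unlock();
}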
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and Thread Pool Design pattern

Post by Edsel Apostol »

Daniel Shawul wrote:Well, my suggestion is for you to investigate a different approach than what is standard here. ABDADA is, for instance, a well-performing algorithm that wasn't explored as much as YBW here. It is driven by the TT and is a good compromise between simplicity of implementation and performance. So investigating TDS in the same manner can be a good contribution. When I hear of optimizations that aren't needed all that much, honestly I yawn, which is why I suggested something different. A thread will be picked up quite soon, as you will see later, and having a choice of split points doesn't help much either. I tried both and I barely got any improvement out of them. In fact, for the latter I even tried to do more with 'speculative splitting' at the parent node before all the children are searched, similar to what DTS does, but that requires an iterative search. If we agree that there is no idle time, then your hopes rest on splitting better at shallow depths while losing time. I wouldn't bet on that. So considering all that, you will see that good old recursive YBW is a solid algorithm by itself.

You are wrong about your impressions of TDS. It can be used in an SMP environment and definitely doesn't require you to search the whole tree. I don't recall exactly, but Cilkchess also uses a work-stealing approach.
I haven't really looked at how ABDADA works, but from a quick skim of the wiki it seems to use a shared hash table to keep track of which nodes are being searched and which are not. It seems very simple to implement. I should have tried it first when I started.

Have you had success implementing it? How does it compare with YBW? I guess you'd need a separate dedicated hash table so that it doesn't interfere with the actual hash table.

With the complicated node classification used nowadays, you also need to check that the node type being searched is the same. The node tagged in the hash table might, for example, be a cut node, while what you are about to search is an all node or a PV node; if you skip it just because other threads have flagged it as being searched, you might miss searching the node altogether.

You also need to check whether the node flagged as being searched in the hash table has the same depth as what you will be searching. So it seems you need to store more than just nproc: node type and depth, to begin with. That might already be the case, but it is not explicit in the pseudo code on the wiki.
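
Something like this entry for a dedicated "being searched" table is what I have in mind (purely hypothetical, I haven't implemented any of it; the wiki pseudo code only uses the nproc counter):

Code: Select all

#include <cstdint>

// Purely hypothetical layout, just to illustrate what I mean.
struct BusyEntry {
    uint64_t key;       // zobrist key of the position being searched
    uint8_t  nproc;     // number of threads currently searching it
    uint8_t  depth;     // remaining depth it is being searched to
    uint8_t  nodeType;  // PV / cut / all, so a mismatch isn't skipped
};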

Anyway, can you describe how to implement TDS in SMP? If I understand it right, I would implement it using processes; each process would have its own hash table, and they would communicate through MPI or something.
Edsel Apostol
Posts: 803
Joined: Mon Jul 17, 2006 5:53 am
Full name: Edsel Apostol

Re: SMP and Thread Pool Design pattern

Post by Edsel Apostol »

OK, I've managed to make this work. It's not using a centralized queue, as I couldn't find a way to efficiently cancel a splitpoint when it is no longer applicable. Instead I've used the threads' own stacks to store the splitpoints, and idle threads and helpful masters look for work there.

NPS scaling is a bit better than with classic YBW. I haven't noticed much difference in time-to-depth speedup.

The advantage of this method is that it has the option to choose a better splitpoint, and threads have less tendency to sit idle.

In conclusion, there's not much to improve upon classic YBW. It's a solid algorithm, as many have confirmed before me. I'm going to use this implementation over my old one, though. Not many engines are using it (only Syzygy and Spike as far as I know) and I think I can improve it further later on.

Anyways, thanks everyone for the feedback, especially Ronald.
syzygy
Posts: 5566
Joined: Tue Feb 28, 2012 11:56 pm

Re: SMP and Thread Pool Design pattern

Post by syzygy »

Edsel Apostol wrote:Thanks. Here's what I'm doing. I limit splitting to depth >= 4. My position structure is too big, I guess; the cost of splitting is huge in my engine.

I have implemented something of a hybrid of my idea and yours. What I did is let a thread split when it meets the criteria, except that it never checks for available threads. I limit this to 1 active splitpoint per thread for now. So a thread will enter the split alone, just saving the splitpoint data to be copied by joining threads later on, then go to the idle loop and search from there. If other threads join the split, it will simply act as master; if not, it will search the split node alone.
I am copying much less data.

The common case is creating a candidate split point, or rather marking a node being searched as splittable. This is done by setting the "ply"th bit in tree->splitpoints, saving alpha, beta, depth to the "splitpoint" portion of the position structure and setting a local flag so that the search knows it must take a lock when obtaining the next move and check a few more things. The overhead of this turns out to be very low, so that there is very little slowdown for a single threaded search.
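
In outline the marking step is nothing more than this (a simplified sketch; the field and helper names are not literal, and atomic_set_bit stands for the "lock btsq" mentioned below):

Code: Select all

  /* Simplified sketch of marking the current node as splittable. */
  if (depth > 1) {
    node->sp_alpha = alpha;                    /* save the split data        */
    node->sp_beta  = beta;
    node->sp_depth = depth;
    atomic_set_bit(&tree->splitpoints, ply);   /* publish the candidate      */
    moves_need_lock = 1;                       /* local flag: from now on take
                                                  the spinlock before fetching
                                                  the next move              */
  }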

The less common (but not uncommon) case is an idle thread joining at a candidate splitpoint of another thread. This is done by the joining thread taking the lock on the candidate splitpoint, fetching a move, copying the moves leading from the root to that splitpoint to its own tree structure, releasing the lock. This takes little time and most likely does not disturb the other threads at all. The joining thread then constructs the position structure in its own time (i.e. without blocking any other threads, which is a major advantage) by making the moves leading from the root to the splitpoint. It then searches the move it fetched and fetches more moves to search if available.
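
So the join itself is short; roughly this (again simplified, helper names are placeholders and ordering details are omitted):

Code: Select all

  /* Simplified sketch of joining a candidate split point of another thread. */
  spin_lock(&sp->lock);
  int move = fetch_next_move(sp);              /* grab a move to search      */
  copy_move_path(sp, my_tree);                 /* moves from root to the sp  */
  atomic_set_bit(&sp->node->slaves, my_id);    /* tell the master we joined  */
  spin_unlock(&sp->lock);

  rebuild_position(my_tree);        /* replay the copied moves in our own
                                       time, without blocking other threads  */
  search_move(my_tree, sp, move);   /* then keep fetching moves while there
                                       are moves left at the split point     */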

For this to work, any thread must have fast access to the position structure at ply N of any other thread. If those are held on the stack as in Stockfish, I suppose one could assign each thread a global array of 64 pointers to those position structures as a quick solution.
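
That is, something as crude as this (only illustrating the idea; the names and limits are made up):

Code: Select all

class Position;                      // Stockfish's Position, forward-declared

constexpr int MAX_THREADS = 64;      // made-up limits, just for illustration
constexpr int MAX_PLY     = 64;

// Each thread stores a pointer to its stack-allocated Position here at the
// top of search(), so other threads can reach it when joining a split point.
Position* splitPositions[MAX_THREADS][MAX_PLY];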

At some point I might try to hack my scheme into Stockfish to see if it gives a measurable gain.
Here's what I'm using as a spinlock in C++11, using std::atomic_flag. Your assembly may be much faster.
It's probably very similar to other spinlock implementations including pthread_spinlock and the inline assembly one I use myself.

The complicated stuff I use for the thread-specific (rather than node-specific) variables is there to avoid taking locks when accessing those variables. It can probably be replaced, without losing much efficiency, by a thread-specific lock (or two locks) that is only taken when thread->splitpoints and thread->abort are being written to. (I believe, but am not completely sure at the moment, that it is OK to read those variables without locking, because the locked write operations never leave an inconsistent state in the middle of the operation...) I had to figure out how to do complicated atomic memory updates efficiently when implementing my parallel tablebase generator, so I just applied similar tricks when implementing SMP.
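
Concretely, the locked variant of the abort update would be something like this (untested sketch; "abort_lock" and set_abort do not exist in my code):

Code: Select all

  /* Untested sketch of the lock-based alternative: writers take a
     thread-specific spinlock, readers don't. */
  void set_abort(Tree *tree, int n) {
    spin_lock(&tree->abort_lock);
    if (tree->abort == 0 || tree->abort > n)
      tree->abort = n;
    spin_unlock(&tree->abort_lock);
  }

  /* readers keep doing plain   if (tree->abort) ...   without the lock */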