Recursive DTS-like search algorithm

Discussion of chess software programming and technical issues.

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Recursive DTS-like search algorithm

Post by petero2 » Wed Jul 24, 2013 8:30 pm

Recursive DTS-like search algorithm

I have been working lately on a multi-threaded search algorithm for my chess program texel and came up with an algorithm that is inspired by both DTS and ABDADA. The goal is to combine the power of DTS with the simplicity of ABDADA. The algorithm is described below.

Algorithm

There is one master thread and N-1 helper threads. The master thread behaves just like in a single threaded search, except that it also creates "hints" about what it is doing during the search. The single threaded search is basically a standard recursive negascout algorithm.

The helper threads examine the "hints" and decide what sub-trees to search. Search results are stored in the shared transposition table. This is the only way that search results are transferred from the helper threads to the master thread. If the helper threads do a good job the master thread finds lots of sub-tree results in the transposition table so it does not have to search them itself.

The "hints" are objects of the class SplitPoint. A SplitPoint object is created for all nodes in the search tree that have depth >= 10 and ply <= 16. The negaScout function creates a SplitPoint object after it has generated the moves to search (texel does not use an incremental move generator) and deletes the SplitPoint object before the function returns. This means that the active SplitPoints for a thread correspond to at most the top 16 recursive calls of the negaScout function.

A SplitPoint object is quite lightweight and contains a position, alpha, beta, ply, the repetition hash list, and references to history and killer tables. Parent and child pointers are also maintained so that the SplitPoint objects form a tree. Finally a SplitPoint object contains a vector of SplitPointMove objects. Each SplitPointMove object contains move, depth, LMR reduction and flags that say if the move is currently being searched by some thread and if the move has been canceled (result no longer needed).
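As a rough illustration, the two classes might look something like this (a minimal sketch with invented field names; texel's actual definitions, which also hold the position, repetition hash list and history/killer references, are in parallel.hpp):

```cpp
#include <cassert>
#include <cstdint>
#include <memory>
#include <vector>

// Illustrative sketch only; not texel's real code.
struct SplitPointMove {
    int move = 0;           // encoded move to search
    int depth = 0;          // remaining depth for this move
    int lmrReduction = 0;   // LMR reduction applied to this move
    bool searching = false; // some thread is currently searching it
    bool canceled = false;  // result no longer needed
};

struct SplitPoint {
    std::uint64_t posHash = 0;          // stand-in for the stored position
    int alpha = 0, beta = 0, ply = 0;
    std::weak_ptr<SplitPoint> parent;   // SplitPoints form a tree
    std::vector<std::shared_ptr<SplitPoint>> children;
    std::vector<SplitPointMove> moves;  // moves helpers may pick up
};
```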

The existence of a SplitPoint object does not mean that any helper thread is actually searching a move in that SplitPoint, it only means that a helper thread can decide to search at that SplitPoint. Therefore "potential split point" may be a more accurate name, but I did not want the class name to be that long.

Each active SplitPoint object is also stored in a priority queue, ordered by the estimated probability that searching the first non-searched move in the SplitPoint would be useful. A helper thread that is idle will constantly monitor the priority queue, and as soon as it becomes non-empty, the first available move in the first SplitPoint is selected and the helper thread starts searching the corresponding sub-tree. A non-idle helper thread periodically checks if the SplitPointMove it is currently searching has been canceled, or if the estimated probability of the first item in the priority queue is significantly better than the estimated probability for the move currently being searched. If this is the case, the search is aborted and a new SplitPointMove is selected from the priority queue.
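A hedged sketch of how such a queue and the "significantly better" test could be expressed (the pair layout, the comparator and the margin value are all assumptions, not texel's code):

```cpp
#include <cassert>
#include <set>
#include <utility>

// Hypothetical queue entry: (usefulness probability, SplitPoint id).
// The id keeps entries unique inside the std::set.
using QueueEntry = std::pair<double, int>;

// Order the queue with the highest probability first.
struct HigherProbFirst {
    bool operator()(const QueueEntry& a, const QueueEntry& b) const {
        if (a.first != b.first)
            return a.first > b.first;
        return a.second < b.second;
    }
};

using SplitPointQueue = std::set<QueueEntry, HigherProbFirst>;

// A busy helper aborts its current search only when the best queued
// entry is *significantly* better; the margin value is a guess.
bool shouldSwitch(double currentProb, const SplitPointQueue& q,
                  double margin = 0.2) {
    return !q.empty() && q.begin()->first > currentProb + margin;
}
```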

A SplitPointMove is canceled when the master thread starts searching it itself, or if a beta cutoff or alpha improvement "earlier" in the search tree causes the result to be no longer needed. An alpha increase that is not a beta cutoff causes the helper thread to cancel its search, putting the move it was searching back in the priority queue. The next time it selects a move to search, it may very well select the same move, but this time it will search with the new alpha value. The idea is that the transposition table makes the abort and restart procedure efficient, since the re-search will get lots of transposition table cutoffs until it reaches the same point in the search tree as before the abort/restart.

When a helper thread starts to search a move, it calls the same negaScout function that the master thread is using for its search, which means that it can also create SplitPoint objects. What is said above about the master thread also applies to such a helper thread that is considered the master of the SplitPointMove it decided to search.

The probability that searching a move is useful is estimated by collecting fail high statistics and alpha increase statistics during the search. When a move's usefulness depends on a series of SplitPointMoves, the probability is computed as the product of the individual move probabilities. Move probabilities are recomputed for the affected moves every time the state of a SplitPoint changes.
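For example, if a candidate move sits below a chain of two SplitPointMoves that are needed with probabilities 0.9 and 0.5, its combined usefulness would be 0.45. A one-line sketch of that computation (function name invented):

```cpp
#include <cassert>
#include <cmath>
#include <functional>
#include <numeric>
#include <vector>

// Combined usefulness of a move below a chain of SplitPointMoves:
// the product of the individual move probabilities, as described above.
double chainProbability(const std::vector<double>& moveProbs) {
    return std::accumulate(moveProbs.begin(), moveProbs.end(), 1.0,
                           std::multiplies<double>());
}
```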

Statistics are collected for five different node types: PV, CUT1, CUT2, ALL and NULL. For PV nodes, it is counted how often alpha increases after searching move 0,1,...,14. For the other node types, it is counted how often a beta cutoff happens after searching move 0,1,...,14, and how often the node fails low. CUT1 nodes are nodes where the previous move was not the first move at the parent node. ALL and CUT2 nodes are nodes where the previous move was the first move at the parent node. Whether a node has type ALL or CUT2 depends on whether the number of plies up to the previous non-first move is even or odd. NULL nodes are nodes where the previous move was a null move.
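A possible classifier following these rules is sketched below. PV nodes are assumed to be recognized by their open search window elsewhere, and the post does not say which parity maps to ALL versus CUT2, so that choice here is a guess:

```cpp
#include <cassert>

// Hypothetical node-type classifier; illustrative only.
enum class NodeType { PV, CUT1, CUT2, ALL, NULLNODE };

NodeType classify(bool prevWasNull, bool prevWasFirstMove,
                  int pliesToLastNonFirstMove) {
    if (prevWasNull)
        return NodeType::NULLNODE;   // previous move was a null move
    if (!prevWasFirstMove)
        return NodeType::CUT1;       // previous move was not first at parent
    // First-move child: ALL/CUT2 alternate with the distance to the
    // last non-first move (the parity choice is an assumption).
    return (pliesToLastNonFirstMove % 2 == 1) ? NodeType::ALL
                                              : NodeType::CUT2;
}
```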

No "young brothers wait" concept is used. A helper thread will search the move with the highest estimated probability of being useful, even if this is for example the second move at an expected cut node, and the first move is currently being searched. The idea is that it is better to do something than nothing, and the mechanism that aborts a helper thread when a move with a significantly higher probability becomes available makes sure that there is no long term penalty for starting to search a move with a low probability.

Implementation

Texel is written in C++11. Multi-threading is implemented using std::thread, std::mutex and std::lock_guard. SplitPoint lifetimes are managed using std::shared_ptr and std::weak_ptr. The priority queue is implemented using std::set. Exceptions are used to abort the recursive search when either time runs out (for the master thread) or when a SplitPoint is canceled (for the helper threads). RAII is used to clean up resources (cancel SplitPoints) when the stack unwinds. Template metaprogramming is used to ensure that the code in the negaScout function that creates and deletes SplitPoint objects does not negatively affect the performance close to the leaves.
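The RAII cleanup might look roughly like this (an illustrative guard, not texel's actual class; the cancel action is passed in as a callable):

```cpp
#include <cassert>
#include <functional>
#include <utility>

// RAII sketch: the SplitPoint registered on entry to negaScout is
// removed (and any helper searching it is canceled) when the scope
// exits, whether the function returns normally or an exception
// unwinds the stack.
class SplitPointHolder {
public:
    explicit SplitPointHolder(std::function<void()> onExit)
        : onExit_(std::move(onExit)) {}
    ~SplitPointHolder() {
        if (onExit_) onExit_();
    }
    SplitPointHolder(const SplitPointHolder&) = delete;
    SplitPointHolder& operator=(const SplitPointHolder&) = delete;
private:
    std::function<void()> onExit_;
};
```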

The implementation already works but is not quite ready for a new texel release yet. The current source code is available here. The new data structures for the parallel search are in the parallel.[ch]pp files and the negaScout function is in search.cpp.

Advantages

Like in the DTS algorithm, helper threads are free to work anywhere in the search tree and no thread ever has to wait for any other thread to finish searching a sub-tree.

Unlike the DTS algorithm, the required changes to the recursive negaScout function are pretty small. An extra loop over the move list to create the SplitPoint information is basically all that is needed.

The framework for estimating move usefulness probabilities means that no ad-hoc rules for where it is allowed to split the search tree are needed.

Disadvantages

If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the helper thread fails high, the master thread still finishes searching move one before it finds the fail high result for move two in the transposition table and propagates the fail high result up in the search tree. In this case there is an opportunity for the parallel search to search fewer nodes than the serial search would search, by immediately aborting the master thread. I don't know if the DTS algorithm handles this case but I would guess that it does.

If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the master thread fails low on move one, it will start searching move two. When this happens the helper thread is "kicked out" of that branch and has to start searching somewhere else. The idea is that the master thread should quickly get to the same point where the helper thread was interrupted because of transposition table hits. However, there is still some overhead involved. This seems hard to avoid with the current design, but the hope is that the overhead is small.

If the transposition table is too small it is possible that search results computed by helper threads are overwritten before they are needed by the master thread. It would be straightforward to store the search results in the SplitPointMove objects, but I have not implemented this because I don't think it is a problem in practice.

Results

I don't have many results yet, but I have played two matches between the new texel version and the version just before I started to implement the parallel search. The old version is Texel103a3. The new version first played 400 games against the old version, where the new version used two cores, called TexelHead2c in the table below. Then the new version played 400 games against the old version, where the new version used twice the thinking time but only one core, called TexelHead2t in the table below. The time control was 5 seconds plus 1 second increment per move. (The 2t version obviously used 10s+2s time control.)

Code:

Rank Name          Elo    +    - games score oppo. draws 
   1 TexelHead2t    42   19   19   400   67%   -60   44% 
   2 TexelHead2c    18   19   19   400   63%   -60   45% 
   3 Texel103a3    -60   13   14   800   35%    30   45% 
   1 TexelHead2t    42 400.0 (266.5 : 133.5)
                       400.0 (266.5 : 133.5) Texel103a3    -60
   2 TexelHead2c    18 400.0 (250.5 : 149.5)
                       400.0 (250.5 : 149.5) Texel103a3    -60
   3 Texel103a3    -60 800.0 (283.0 : 517.0)
                       400.0 (133.5 : 266.5) TexelHead2t    42
                       400.0 (149.5 : 250.5) TexelHead2c    18
             Te Te Te
TexelHead2t     89 99
TexelHead2c  10    99
Texel103a3    0  0   
This shows that the Elo increase for the 2c version is about 75% of the Elo increase for the 2t version, which seems reasonable. Unfortunately the error bars are very large. It takes a lot of time to test SMP search. It would probably be a good investment to write a script to automate time-to-depth testing.

It is currently unknown how well the algorithm scales to a large number of threads. The design is meant to make sure that all threads have useful work to do, but at some point the locking associated with the priority queue manipulation may start to become a bottleneck.

Possible improvements

It should be possible to fix the first disadvantage above by having the master thread periodically check its SplitPoint objects to see if there has been a beta cutoff (or alpha increase). If this is the case, an exception could be thrown and caught at the correct level in the recursion. There the search result from the helper thread could be picked up and returned.
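One way to sketch this idea is an exception type tagged with the ply that should catch it, so intermediate levels re-throw and only the intended level picks up the helper's result (the names and the toy trigger below are invented, not texel code):

```cpp
#include <cassert>

// Hypothetical abort exception carrying the target recursion level
// and the score computed by a helper thread.
struct AbortToPly {
    int ply;    // recursion level that should catch this exception
    int score;  // search result computed by the helper thread
};

// Toy recursion: at maxPly we pretend a helper reported a cutoff
// whose result belongs to ply 1.
int search(int ply, int maxPly) {
    if (ply == maxPly)
        throw AbortToPly{1, 42};
    try {
        return search(ply + 1, maxPly);
    } catch (const AbortToPly& e) {
        if (e.ply != ply)
            throw;       // not our level: keep unwinding
        return e.score;  // pick up the helper's result here
    }
}
```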

The algorithm could try harder to search at moves with larger remaining search depth. Currently the move with the highest estimated usefulness probability is searched, and in case of a tie the move with the largest depth is chosen. However, it may be better to search a move with a larger depth but with a slightly lower probability, because the overhead associated with starting and stopping the search, including the priority queue manipulation and copying of history and killer tables, is reduced. If statistics were collected regarding overhead as a function of search depth, this could be weighted into the probability calculations.
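The current rule, probability first with depth as tie-breaker, amounts to a lexicographic comparison (a trivial sketch; the proposed improvement would instead fold a depth-dependent overhead estimate into the probability itself):

```cpp
#include <cassert>
#include <tuple>

// Current selection rule as described: highest usefulness probability
// wins, and the larger remaining depth breaks exact ties.
bool betterCandidate(double probA, int depthA, double probB, int depthB) {
    return std::tie(probA, depthA) > std::tie(probB, depthB);
}
```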

Another idea is to perform a "wider" search in the helper threads when the search depth is low, since that is where the relative overhead is high. The widening could be implemented by ignoring the LMR reduction in this case, and the effect would be that the helper thread would spend more time examining search tree nodes and less time on overhead. I have no idea if this would improve the program's Elo though.

Daniel Shawul
Posts: 3757
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Recursive DTS-like search algorithm

Post by Daniel Shawul » Wed Jul 24, 2013 9:08 pm

Great work Peter! I am currently too pre-occupied to try and understand your algorithm, but I just want to say kudos to you for doing the job quietly and publishing your findings. Please have it in pdf :) and maybe put it on your website. If you spent hours of your time on it, then it is definitely important for someone, and the work shouldn't be lost in the jungle of talkchess threads. I like this trend and I hope others do the same. I am sure a lot of work from many programmers has gone unnoticed just because it is not in print. I am sure their source code has it, but it is tedious to understand and figure out the contributions from it.
Daniel

bob
Posts: 20550
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Recursive DTS-like search algorithm

Post by bob » Thu Jul 25, 2013 3:54 am

petero2 wrote:Recursive DTS-like search algorithm

I have been working lately on a multi-threaded search algorithm for my chess program texel and came up with an algorithm that is inspired by both DTS and ABDADA. The goal is to combine the power of DTS with the simplicity of ABDADA. The algorithm is described below.

Algorithm

There is one master thread and N-1 helper threads. The master thread behaves just like in a single threaded search, except that it also creates "hints" about what it is doing during the search. The single threaded search is basically a standard recursive negascout algorithm.

The helper threads examine the "hints" and decide what sub-trees to search. Search results are stored in the shared transposition table. This is the only way that search results are transferred from the helper threads to the master thread. If the helper threads do a good job the master thread finds lots of sub-tree results in the transposition table so it does not have to search them itself.

The "hints" are objects of the class SplitPoint. A SplitPoint object is created for all nodes in the search tree that have depth >= 10 and ply <= 16. The negaScout function creates a SplitPoint object after it has generated the moves to search (texel does not use an incremental move generator) and deletes the SplitPoint object before the function returns. This means that the active SplitPoints for a thread correspond to the top up to 16 recursive calls of the negaScout function.

A SplitPoint object is quite lightweight and contains a position, alpha, beta, ply, the repetition hash list, and references to history and killer tables. Parent and child pointers are also maintained so that the SplitPoint objects form a tree. Finally a SplitPoint object contains a vector of SplitPointMove objects. Each SplitPointMove object contains move, depth, LMR reduction and flags that say if the move is currently being searched by some thread and if the move has been canceled (result no longer needed).

The existence of a SplitPoint object does not mean that any helper thread is actually searching a move in that SplitPoint, it only means that a helper thread can decide to search at that SplitPoint. Therefore "potential split point" may be a more accurate name, but I did not want the class name to be that long.

Each active SplitPoint object is also stored in a priority queue, ordered by the estimated probability that searching the first non-searched move in the SplitPoint would be useful. A helper thread that is idle will constantly monitor the priority queue, and as soon as it becomes non-empty, the first available move in the first SplitPoint is selected and the helper thread starts searching the corresponding sub-tree. A non-idle helper thread periodically checks if the SplitPointMove it is currently searching has been canceled, or if the the estimated probability of the first item in the priority queue is significantly better than the estimated probability for the move currently being searched. If this is the case, the search is aborted and a new SplitPointMove is selected from the priority queue.

A SplitPointMove is canceled when the master thread starts searching it itself, or if a beta cutoff or alpha improvement "earlier" in the search tree causes the result to be no longer needed. An alpha increase that is not a beta cutoff causes the helper thread to cancel its search, putting the move it was searching back in the priority queue. The next time it selects a move to search, it may very well select the same move, but this time it will search with the new alpha value. The idea is that the transposition table makes the abort and restart procedure efficient, since the re-search will get lots of transposition table cutoffs until it reaches the same point in the search tree as before the abort/restart.

When a helper thread starts to search a move, it calls the same negaScout function that the master thread is using for its search, which means that it can also create SplitPoint objects. What is said above about the master thread also applies to such a helper thread that is considered the master of the SplitPointMove it decided to search.

The probability that searching a move is useful is estimated by collecting fail high statistics and alpha increase statistics during the search. When a move usefulness depends an a series of SplitPointMoves, the probability is computed as the product of the individual move probabilities. Move probabilities are recomputed for the affected moves every time the state of a SplitPoint changes.

Statistics is collected for 5 different types of nodes, which are PV, CUT1, CUT2, ALL and NULL. For PV nodes, it is counted how often alpha increases after searching move 0,1,...,14. For the other node types, it is counted how often a beta cutoff happens after searching move 0,1,...,14, and how often the node fails low. CUT1 nodes are nodes where the previous move was not the first move at the parent node. ALL and CUT2 nodes are nodes where the previous move was the first move at the parent node. Whether a node has type ALL or CUT2 depends on if the number of plies up to the previous non-first move is even or odd. NULL nodes are nodes where the previous move was a null move.

No young brothers wait concept is used. A helper thread will search the move with the highest estimated probability of being useful, even if this is for example the second move at an expected cut node, and the first move is currently being searched. The idea is that it is better to do something than nothing, and the mechanism that aborts a helper thread when a move with a significantly higher probability becomes available makes sure that there is no long term penalty for starting to search a move with a low probability.

Implementation

Texel is written in C++11. Multi-threading is implemented using std::thread, std::mutex and std::lock_guard. SplitPoint lifetimes are managed using std::shared_ptr and std::weak_ptr. The priority queue is implemented using std::set. Exceptions are used to abort the recursive search when either time runs out (for the master thread) or when a SplitPoint is canceled (for the helper threads). RAII is used to clean up resources (cancel SplitPoints) when the stack unwinds. Template metaprogramming is used to ensure that the code in the negaScout function that creates and deletes SplitPoint objects does not negatively affect the performance close to the leaves.

The implementation already works but is not quite ready for a new texel release yet. The current source code is available here. The new data structures for the parallel search are in the parallel.[ch]pp files and the negaScout function is in search.cpp.

Advantages

Like in the DTS algorithm, helper threads are free to work anywhere in the search tree and no thread ever has to wait for any other thread to finish searching a sub-tree.

Unlike the DTS algorithm, the required changes to the recursive negaScout function are pretty small. An extra loop over the move list to create the SplitPoint information is basically all that is needed.

The framework for estimating move usefulness probabilities means that no ad-hoc rules for where it is allowed to split the search tree are needed.

Disadvantages

If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the helper thread fails high, the master thread still finishes searching move one before it finds the fail high result for move two in the transposition table and propagates the fail high result up in the search tree. In this case there is an opportunity for the parallel search to search fewer nodes than the serial search would search, by immediately aborting the master thread. I don't know if the DTS algorithm handles this case but I would guess that it does.

If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the master thread fails low on move one, it will start searching move two. When this happens the helper thread is "kicked out" of that branch and has to start searching somewhere else. The idea is that the master thread should quickly get to the same point where the helper thread was interrupted because of transposition table hits. However, there is still some overhead involved. This seems hard to avoid with the current design, but the hope is that the overhead is small.

If the transposition table is too small it is possible that search results computed by helper threads are overwritten before they are needed by the master thread. It would be straightforward to store the search results in the SplitPointMove objects, but I have not implemented this because I don't think it is a problem in practice.

Results

I don't have many results yet, but I have played two matches between the new texel version and the version just before I started to implement the parallel search. The old version is Texel103a3. The new version first played 400 games against the old version, where the new version used two cores, called TexelHead2c in the table below. Then the new version played 400 games against the old version, where the new version used twice the thinking time but only one core, called TexelHead2t in the table below. The time control was 5 seconds plus 1 second increment per move. (The 2t version obviously used 10s+2s time control.)

Code: Select all

Rank Name          Elo    +    - games score oppo. draws 
   1 TexelHead2t    42   19   19   400   67%   -60   44% 
   2 TexelHead2c    18   19   19   400   63%   -60   45% 
   3 Texel103a3    -60   13   14   800   35%    30   45% 
   1 TexelHead2t    42 400.0 &#40;266.5 &#58; 133.5&#41;
                       400.0 &#40;266.5 &#58; 133.5&#41; Texel103a3    -60
   2 TexelHead2c    18 400.0 &#40;250.5 &#58; 149.5&#41;
                       400.0 &#40;250.5 &#58; 149.5&#41; Texel103a3    -60
   3 Texel103a3    -60 800.0 &#40;283.0 &#58; 517.0&#41;
                       400.0 &#40;133.5 &#58; 266.5&#41; TexelHead2t    42
                       400.0 &#40;149.5 &#58; 250.5&#41; TexelHead2c    18
             Te Te Te
TexelHead2t     89 99
TexelHead2c  10    99
Texel103a3    0  0   
This shows that the elo increase for the 2c version is about 75% of the elo increase for the 2t version, which seems reasonable. Unfortunately the error bars are very large. It takes a lot of time to test SMP search. It would probably be a good investment to write a script to automate time-to-depth testing.

It is currently unknown how well the algorithm scales to large number of threads. The design is meant to make sure that all threads have useful work to do, but at some point the locking associated with the priority queue manipulation may start to become a bottleneck.

Possible improvements

It should be possible to fix the first disadvantage above by having the master thread periodically check its SplitPoint objects to see if there has been a beta cutoff (or alpha increase). If this is the case, an exception could be thrown and caught at the correct level in the recursion. There the search result from the helper thread could be picked up and returned.

The algorithm could try harder to search at moves with larger remaining search depth. Currently the move with the highest estimated usefulness probability is searched, and in case of a tie the move with the largest depth is chosen. However, it may be better to search a move with a larger depth but with a slightly lower probability, because the overhead associated with starting and stopping the search, including the priority queue manipulation and copying of history and killer tables, is reduced. If statistics were collected regarding overhead as a function of search depth, this could be weighted into the probability calculations.

Another idea is to perform a "wider" search in the helper threads if the search depth is low, so that the overhead is high. The widening could be implemented by ignoring the LMR reduction in this case, and the effect would be that the helper thread would spend more time examining search tree nodes and less time on overhead. I have no idea if this would improve the program elo though.
The main problem with this is that DTS depended on being able to choose a split point in ANY subtree, of ANY thread, when a thread becomes idle. You can't afford to supply enough "hints" because the hints are provided at many different points, and are time-skewed from the point where a thread becomes idle.

In DTS you use the information for ALL active searches, all nodes in each of those searches, to choose the node most likely to be an ALL node, in the most useful place in the tree.

Recursion makes that a pain. In theory, anything one can do iteratively one can do recursively. In practice, the book-keeping details are simply not worth the effort, and are likely non-portable as well since you have to deal with the stack in an ugly way...

BTW, in current Crafty, no thread ever waits for others to finish anywhere. A thread only waits until some busy thread reaches any node where the YBW criterion has been satisfied, which results in almost zero idle time spent waiting.

sje
Posts: 4675
Joined: Mon Mar 13, 2006 6:43 pm

Re: Recursive DTS-like search algorithm

Post by sje » Thu Jul 25, 2013 7:28 am

bob wrote:In theory, anything one can do iteratively one can do recursively. In practice, the book-keeping details are simply not worth the effort, and are likely non-portable as well since you have to deal with the stack in an ugly way...
But it doesn't have to be this way. True, converting a large and complex piece of recursive code can be a royal pain. However, if the code is written from the start to run iteratively, then that's not so hard and maybe the code will be faster at run time.

The drawbacks of iterative vs recursive tree search code:

1) What do you mean, I can't use local variables for all my stuff?

2) The language gives me a perfectly good stack, why should I have to make one myself?

The benefits:

1) Having no local variables on the stack means no worry about running out of stack space, or having the stack smash into the heap, or having to reserve too much stack space when creating a thread. Insufficient stack space always means a crash and is hard to predict.

2) It's not hard to make your own stack for local variables. Symbolic has a PlyRec class for this; each instance contains all the "local" variables for a given depth of a given tree branch. Using templates, the data for a tree branch is contained in a two-way linked list class PlyRecList, extended as needed. Each element in the list is an instance of class PlyRecNode, and each of those inherits from class PlyRec as a base class. Going up or down is a matter of a simple pointer assignment. So there is one true local variable: the pointer to the current explicit stack record.

3) With an explicit do-it-yourself stack, nothing is hidden. When the program crashes, the carcass of the explicit stack can be preserved and probed with a debugger much more easily than using the default, mostly hidden stack.

4) At a given depth, it's easy to take a peek at variable values of a parent or other ancestor. Admittedly, this is not always a good thing to do.

5) If a search branch is aborted for some reason, it's much faster and simpler to make an emergency exit compared to the same in a recursive approach.

6) A search branch may be partially aborted, making an emergency return several levels up. That's not so simple with a recursive approach.

7) A search branch can be paused with its state saved. This would allow perhaps hundreds of near simultaneous search branches to exist with some master supervisor selecting which of them to reward with a thread and a core.
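The explicit-stack idea in point 2 can be sketched as follows (field names loosely follow the PlyRec description above; the fields and the allocation strategy are illustrative, and the search body itself is omitted):

```cpp
#include <cassert>

// Sketch of an explicit ply-record stack: each record holds what would
// otherwise be the search function's local variables, and moving up or
// down the "stack" is a single pointer assignment.
struct PlyRec {
    int alpha = 0, beta = 0, depth = 0;
    int moveIndex = 0;       // which child move is being explored
    PlyRec* prev = nullptr;  // toward the root: ancestors stay visible
    PlyRec* next = nullptr;  // toward the leaves: extended as needed
};

// Descend one ply, allocating a new record only the first time this
// depth is reached; afterwards the record is reused.
PlyRec* descend(PlyRec* cur) {
    if (!cur->next) {
        cur->next = new PlyRec;
        cur->next->prev = cur;
    }
    return cur->next;
}
```

An emergency return several levels up is then just a walk along the prev pointers, with no stack unwinding involved.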

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: Recursive DTS-like search algorithm

Post by petero2 » Thu Jul 25, 2013 7:06 pm

bob wrote: The main problem with this is that DTS depended on being able to choose a split point in ANY subtree, of ANY thread, when a thread becomes idle. You can't afford to supply enough "hints" because the hints are provided at many different points, and are time-skewed from the point where a thread becomes idle.
In my algorithm an idle thread can choose a split point in any subtree with depth>=10 and ply<16 from ANY thread. So it is not as general as in DTS, but the hope is that depth<10 represents subtrees that are too small to be worth splitting, and that at depth>=10 the overhead of maintaining the potential split points is low enough. The ply<16 condition is added in order to avoid excessive overhead in late endgames where a depth 10 subtree may be very small. So if you are searching at iteration 50 in "fine #70", the search can be split anywhere where depth >= 50-16 = 34. (This assumes that depth+ply=constant in an iteration, which is not true when using extensions/reductions, so the 34 value is only approximately correct.)

In DTS terminology, you could say that the HELP command is not needed in my algorithm because all threads copy their tree state to shared memory all the time (subject to the depth>=10, ply<16 condition) just in case it may later be needed. I am not sure what your "time-skewed" comment means, but I guess it does not apply to my algorithm, because my algorithm does not care if other threads are idle or busy when it decides to create SplitPoint objects.

The depth>=10 and ply<16 conditions can probably be improved, possibly they could be computed dynamically based on some sampled properties of the search tree.

It may also be more efficient to not insert every created SplitPoint immediately in the shared priority queue. Even though each operation on the queue is O(log n), I have to acquire a mutex before manipulating the queue, so the mutex could potentially become highly contended. It may be more efficient to keep a separate list of SplitPoints per thread and let an idle thread examine the per thread SplitPoint lists when it wants to pick a subtree to search.

jdart
Posts: 3824
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Recursive DTS-like search algorithm

Post by jdart » Thu Jul 25, 2013 9:19 pm

BTW, in current Crafty, no thread ever waits for others to finish anywhere. A thread only waits until some busy thread reaches any node where the YBW criterion has been satisfied, which results in almost zero idle time spent waiting.
Is this still true with large numbers of cores (>20)?

--Jon

syzygy
Posts: 4455
Joined: Tue Feb 28, 2012 10:56 pm

Re: Recursive DTS-like search algorithm

Post by syzygy » Fri Jul 26, 2013 12:33 am

bob wrote:In DTS you use the information for ALL active searches, all nodes in each of those searches, to choose the node most likely to be an ALL node, in the most useful place in the tree.

Recursion makes that a pain. In theory, anything one can do iteratively one can do recursively. In practice, the book-keeping details are simply not worth the effort, and are likely non-portable as well since you have to deal with the stack in an ugly way...
If we define DTS as "idle threads look for a good split point themselves anywhere in the trees searched by the active threads", then DTS and recursive search do not bite each other. No need to do ugly things with the stack. The only limitation on DTS imposed by a recursive search is that a master thread after running out of work is limited to the subtrees being searched by its slave threads.

See here for more information on my implementation.

The answer to what is likely your main objection: no, I do not copy position structures when doing a split. I let the idle thread copy the moves from root to split node and let it make those moves on its private board. Those moves have to be accessible globally, but that is not difficult to achieve.
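That replay scheme might look roughly like this (all types and names here are hypothetical; a real engine's board and move representations differ, and the actual mechanism for sharing the root-to-split path is engine specific):

```cpp
#include <vector>

struct Move { int from, to; };

// The helper thread's private board. Only a move counter is tracked
// here; a real engine would update pieces, hash keys, etc.
struct Board {
    int movesMade = 0;
    void setToRootPosition()    { movesMade = 0; }
    void makeMove(const Move&)  { ++movesMade; }
};

// Join a split point without copying any position structure: read the
// globally visible move path that the splitting thread published and
// replay it on the idle thread's private board.
void joinSplitPoint(Board& privateBoard, const std::vector<Move>& pathFromRoot) {
    privateBoard.setToRootPosition();
    for (const Move& m : pathFromRoot)
        privateBoard.makeMove(m);
    // privateBoard now stands at the split node; pick an unsearched
    // move there and start searching.
}
```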

bob
Posts: 20550
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Recursive DTS-like search algorithm

Post by bob » Fri Jul 26, 2013 1:21 am

sje wrote:
bob wrote:In theory, anything one can do iteratively one can do recursively. In practice, the book-keeping details are simply not worth the effort, and are likely non-portable as well since you have to deal with the stack in an ugly way...
But it doesn't have to be this way. True, converting a large and complex piece of recursive code can be a royal pain. However, if the code is written from the start to run iteratively, then that's not so hard and maybe the code will be faster at run time.

The drawbacks of iterative vs recursive tree search code:

1) What do you mean, I can't use local variables for all my stuff?

2) The language gives me a perfectly good stack, why should I have to make one myself?

The benefits:

1) Having no local variables on the stack means no worry about running out of stack space, or having the stack smash into the heap, or having to reserve too much stack space when creating a thread. Insufficient stack space always means a crash and is hard to predict.

2) It's not hard to make your own stack for local variables. Symbolic has a PlyRec class for this; each instance contains all the "local" variables for a given depth on a given tree branch. Using templates, the data for a tree branch is contained in a two-way linked list class, PlyRecList, extended as needed. Each element in the list is an instance of class PlyRecNode, and each of those inherits from class PlyRec as a base class. Going up or down is a matter of a simple pointer assignment, so there is one true local variable: the pointer to the current explicit stack record.

3) With an explicit do-it-yourself stack, nothing is hidden. When the program crashes, the carcass of the explicit stack can be preserved and probed with a debugger much more easily than the default, mostly hidden stack.

4) At a given depth, it's easy to take a peek at variable values of a parent or other ancestor. Admittedly, this is not always a good thing to do.

5) If a search branch is aborted for some reason, it's much faster and simpler to make an emergency exit compared to the same in a recursive approach.

6) A search branch may be partially aborted, making an emergency return several levels up. That's not so simple with a recursive approach.

7) A search branch can be paused with its state saved. This would allow perhaps hundreds of near simultaneous search branches to exist with some master supervisor selecting which of them to reward with a thread and a core.
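The explicit-stack scheme in points 1-7 above can be sketched as follows. This is a minimal flat-array variant under stated assumptions: PlyRec, PlyRecList and PlyRecNode are Symbolic's names, but the code below is only an assumed analogue, using a preallocated vector instead of the linked list:

```cpp
#include <vector>

// One record of would-be "local" variables per ply, as in point 2).
struct PlyRec {
    int alpha = 0, beta = 0;
    int moveIndex = 0;   // which move of this node is being searched
    int bestScore = 0;
    // ... killer moves, PV, etc.
};

class IterativeSearch {
public:
    explicit IterativeSearch(int maxPly) : stack(maxPly), ply(0) {}
    PlyRec& cur()    { return stack[ply]; }
    PlyRec& parent() { return stack[ply - 1]; }  // peek at an ancestor (point 4)
    void descend()   { ++ply; }                  // the "recursive call"
    void ascend()    { --ply; }                  // the "return"
    // Emergency multi-level exit (points 5 and 6): just reset the index.
    void abortToPly(int p) { ply = p; }
private:
    std::vector<PlyRec> stack;   // no risk of smashing the C stack (point 1)
    int ply;                     // the one true "local" variable (point 2)
};
```

Pausing a branch (point 7) falls out of the same design: the `stack`/`ply` pair is the whole branch state, so a supervisor can simply stop touching one `IterativeSearch` object and resume it later.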
I have done both. A simple question: which is simpler to implement and modify? It is nice to be able to recursively re-search something in the middle of the tree, something that becomes more complex in an iterated search.

But the case I responded to was not a true iterated search if you read his description, while DTS was. There were two main benefits of DTS over what had been done previously.

(1) processors can search anywhere in the tree in an unsynchronized/uncoordinated way, as opposed to the previous approach of having all threads search at the same single split point.

(2) processors could see all of the subtrees in progress, all of the nodes within those trees, to choose the most likely to be an ALL node from everything visible, where current YBW implementations do not do this.

He had mentioned DTS, but what he was describing (processors writing out some sort of split point hints here and there) was a LONG way from what DTS was about, and really doesn't sound like it would be as good as current YBW implementations. Hence my comments.

bob
Posts: 20550
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Recursive DTS-like search algorithm

Post by bob » Fri Jul 26, 2013 1:24 am

jdart wrote:
BTW, in current Crafty, no thread ever waits for others to finish anywhere. A thread only waits until some busy thread reaches any node where the YBW criterion has been satisfied, which results in almost zero idle time spent waiting.
Is this still true with large numbers of cores (>20)?

--Jon
I am not certain there. It is certainly true through 16; that's been measured repeatedly. I've only run a bit on a 24-core box, and a couple of years back on 32- and 64-core boxes. The 64-core box was an Itanium machine that was not very good for anything. But the wait time is minimal, and with the min thread group setting normally set to 4 or so, one doesn't wait very long at all to get busy.

bob
Posts: 20550
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Recursive DTS-like search algorithm

Post by bob » Fri Jul 26, 2013 1:29 am

syzygy wrote:
bob wrote:In DTS you use the information for ALL active searches, all nodes in each of those searches, to choose the node most likely to be an ALL node, in the most useful place in the tree.

Recursion makes that a pain. In theory, anything one can do iteratively one can do recursively. In practice, the book-keeping details are simply not worth the effort, and are likely non-portable as well since you have to deal with the stack in an ugly way...
If we define DTS as "idle threads look for a good split point themselves anywhere in the trees searched by the active threads", then DTS and recursive search do not bite each other. No need to do ugly things with the stack. The only limitation on DTS imposed by a recursive search is that a master thread after running out of work is limited to the subtrees being searched by its slave threads.

See here for more information on my implementation.

The answer to what is likely your main objection: no, I do not copy position structures when doing a split. I let the idle thread copy the moves from root to split node and let it make those moves on its private board. Those moves have to be accessible globally, but that is not difficult to achieve.
Not quite. I am searching at ply=20. I want to split back at ply=10. It is not easy to do. I already have the search state for this path in a split block, but it needs to be moved to another split block. Making all the previous plies use a different split block when they have pointers to the old one in their local stack space is not easy to solve.

In your case, what you put in your split block is irrelevant. You have to have some sort of shared data structure that contains the moves that are to be searched in parallel. Yet the ply back up in the recursive call stack doesn't know about it. Or are you not really using recursive programming for anything other than calling the procedure, and have EVERYTHING else in a global array indexed by ply??

The approach I am using seems to be what everyone else has settled on doing as well. It works effectively and allows a clean recursive implementation with a single search function.
