I am seeing some falloff starting even at > 8 cores but getting worse at higher core counts, where threads are blocked for significant amounts of time waiting for work. In my implementation, each node that can be split checks for idle threads (using a fast check without locking) and, if so, does a more detailed check for the YBWC condition. What I see is that fairly often the first check succeeds and the second one fails. Could be a bug of course, but it could also be something mis-tuned.
--Jon
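The two-stage test Jon describes (a cheap lockless check that may spuriously pass, re-verified under a lock) might be sketched roughly as follows. All names, and the exact form of the YBW condition, are invented for illustration; this is not Arasan's actual code:

```cpp
#include <atomic>
#include <mutex>

// Hypothetical split-eligibility state. The fast path reads only an
// atomic counter; the slow path re-checks everything under a lock.
struct SplitState {
    std::atomic<int> idle_threads{0};   // maintained by the thread pool
    std::mutex lock;                    // protects the detailed state below
    int moves_searched = 0;             // moves completed at this node
    int depth = 0;                      // remaining depth at this node
};

constexpr int kMinSplitDepth = 4;       // illustrative tuning value

// Fast path: no lock taken. May return a stale answer, which is fine,
// because a positive result is always re-verified under the lock.
bool fast_check(const SplitState& s) {
    return s.idle_threads.load(std::memory_order_relaxed) > 0;
}

// Slow path: re-check an illustrative YBW condition (at least one move
// searched, enough depth left, and still an idle thread) under the lock.
bool detailed_check(SplitState& s) {
    std::lock_guard<std::mutex> g(s.lock);
    return s.idle_threads.load(std::memory_order_relaxed) > 0
        && s.moves_searched >= 1
        && s.depth >= kMinSplitDepth;
}

bool try_split(SplitState& s) {
    return fast_check(s) && detailed_check(s);  // second check can fail
}
```

The point of the pattern is that the hot search path pays only one relaxed atomic load; the mismatch Jon observes corresponds to fast_check passing while detailed_check fails.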
Recursive DTS-like search algorithm
Moderators: hgm, Rebel, chrisw
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org

- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Recursive DTS-like search algorithm
bob wrote: In DTS you use the information for ALL active searches, all nodes in each of those searches, to choose the node most likely to be an ALL node, in the most useful place in the tree. Recursion makes that a pain. In theory, anything one can do iteratively one can do recursively. In practice, the book-keeping details are simply not worth the effort, and are likely non-portable as well since you have to deal with the stack in an ugly way...

syzygy wrote: If we define DTS as "idle threads look for a good split point themselves anywhere in the trees searched by the active threads", then DTS and recursive search do not bite each other. No need to do ugly things with the stack. The only limitation on DTS imposed by a recursive search is that a master thread after running out of work is limited to the subtrees being searched by its slave threads. See here for more information on my implementation. The answer to what is likely your main objection: no, I do not copy position structures when doing a split. I let the idle thread copy the moves from root to split node and let it make those moves on its private board. Those moves have to be accessible globally, but that is not difficult to achieve.

bob wrote: Not quite. I am searching at ply=20. I want to split back at ply=10. It is not easy to do. I already have the search state for this path in a split block. But it needs to be moved to another split block. Making all the previous plies use a different split block, yet they have pointers in their local stack space. That's not easy to solve.

Once a node becomes a candidate split point, I copy the values of alpha, beta and depth into the position struct that is globally accessible. I don't have any separate split block allocation issues. When an idle thread joins the split point, it copies those values and keeps a pointer to the original position struct.

bob wrote: In your case, what you put in your split block is irrelevant. You have to have some sort of shared data structure that contains the moves that are to be searched in parallel.

Each thread has its own move stack. A node becomes a candidate split point only after all moves have already been generated (i.e. after captures and killers have been searched, except for PV nodes where I make sure to generate everything after the first move was searched).

bob wrote: Yet the ply back up in the recursive call stack doesn't know about it. Or are you not really using recursive programming for anything other than calling the procedure, and have EVERYTHING else in a global array indexed by ply??

I don't store the position struct on the stack like e.g. Stockfish does (but Crafty doesn't either). There is absolutely no reason to call my search anything else than a recursive search, unless you consider Crafty's to be iterative.
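A minimal sketch of the scheme syzygy describes (a candidate split point publishes alpha, beta, depth and the root-to-split move path, and a joining thread replays that path on its own private board) could look like this. The types and names are invented stand-ins; a real engine's position struct is far richer:

```cpp
#include <cstdint>
#include <vector>

// Illustrative move and board types, not a real engine's.
using Move = std::uint16_t;

struct Board {
    std::uint64_t hash = 0;
    // Stand-in for a real make_move: just folds the move into a hash.
    void make_move(Move m) { hash = hash * 6364136223846793005ULL + m; }
};

// Globally accessible state published when a node becomes a candidate.
struct SplitCandidate {
    int alpha, beta, depth;            // bounds copied in at candidacy time
    std::vector<Move> path_from_root;  // moves from root to the split node
    std::vector<Move> remaining;       // moves still to be searched here
};

// A joining thread copies the bounds and rebuilds the position privately,
// avoiding any copy of the owner's position structure.
Board join_split_point(const SplitCandidate& sp, Board private_board,
                       int& alpha, int& beta, int& depth) {
    alpha = sp.alpha; beta = sp.beta; depth = sp.depth;
    for (Move m : sp.path_from_root)
        private_board.make_move(m);    // replay root-to-split moves
    return private_board;
}
```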
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Recursive DTS-like search algorithm
syzygy wrote: Each thread has its own move stack. A node becomes a candidate split point only after all moves have already been generated (i.e. after captures and killers have been searched, except for PV nodes where I make sure to generate everything after the first move was searched).

That seems like a terribly tight restriction (after all moves have been generated). And it would seem that would add up to a lot of idle time, where the normal YBW criterion only requires one move be searched before splitting. In my code, that one move can even be the hash best-move...
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Recursive DTS-like search algorithm
bob wrote: That seems like a terribly tight restriction (after all moves have been generated). And it would seem that would add up to a lot of idle time, where the normal YBW criterion only requires one move be searched before splitting. In my code, that one move can even be the hash best-move...

I have practically no idle time, since idle threads look for candidate split points themselves. In your approach, idle threads can only join "new" candidate split points at the point they are found, which means you have to be more liberal in designating a node as a candidate split point.
Splitting only after captures and killers decreases the chance of splitting at a node that fails high early in the move list. Most likely the node does not fail high at all, which means it was perfect for splitting, or the node fails high basically at a random move which we could not predict anyway (I think Crafty does not even attempt to sort those moves).
For PV nodes I do split immediately after the first move, because in these nodes you have a good expectation of having searched the best move first.
So far I haven't done many measurements though, and I will admit that intuition can be deceiving.
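The candidacy rule described above might be expressed roughly as the following predicate. The stage names and the exact condition are illustrative, not the actual implementation:

```cpp
// Typical move-picker stages in a chess engine, simplified. A non-PV node
// becomes a split candidate only once captures and killers are done and
// the full move list has been generated (i.e. the quiet-move stage); a PV
// node becomes a candidate as soon as its first move has been searched.
enum class Stage { HashMove, Captures, Killers, Quiets };

bool is_split_candidate(bool pv_node, Stage stage, int moves_searched) {
    if (pv_node)
        return moves_searched >= 1;                     // YBW on PV nodes
    return stage == Stage::Quiets && moves_searched >= 1;
}
```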
In my present approach I do see some inefficiency between the point where the PV node at ply = n is almost finished and the point where the search starts on the PV node at ply = n - 1. All threads join the last move being searched and probably end up doing lots of splits at shallow depths until no split points are available anymore, and they are idle until the PV node at ply = n is completely finished and the main thread starts on the PV node at ply = n - 1. Of course this is not specific to my implementation but is an inherent property of YBW.
Your DTS paper gives the interesting suggestion to start on the PV node at ply = n - 1 before the PV node at ply = n is completely finished. This is probably tricky to get right, but it seems to ensure that candidate split points at relatively large depths are always available. So I am tempted to try this out at some point.
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Recursive DTS-like search algorithm
syzygy wrote: Your DTS paper gives the interesting suggestion to start on the PV node at ply = n - 1 before the PV node at ply = n is completely finished. This is probably tricky to get right, but it seems to ensure that candidate split points at relatively large depths are always available. So I am tempted to try this out at some point.

It is easy to implement that last feature, but the problem is that you might well split before you have an alpha bound at N. Suppose you have 32 CPUs, as we had on the last Cray I used. If ply = N has < 32 moves, DTS would split at N - 1 for the remainder of those CPUs. With no alpha bound in sight. But once alpha was established, it was very easy to inform the searches at N - 1 about it.
I saw two choices: (1) wait until alpha is known, which might mean CPUs sit idle; (2) start somewhere else speculatively and then update bounds as they become known... At worst, the two are equal, but with reasonable luck, (2) can save time.
Current YBW's don't do that, of course, they will split when searching any node where at least one branch has been completed...
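Choice (2), splitting speculatively and informing the helpers once alpha is established, could be sketched with a shared bound that the owner tightens and the helpers poll at convenient points. The names here are hypothetical; real DTS bookkeeping is far more involved:

```cpp
#include <algorithm>
#include <atomic>

// Shared bound for a speculative split started before alpha was known.
struct SharedBound {
    std::atomic<int> alpha{-32000};  // provisional "minus infinity"
};

// Owner thread: alpha at ply N becomes known; tighten the shared bound
// monotonically (a CAS loop so concurrent publishers can only raise it).
void publish_alpha(SharedBound& b, int alpha) {
    int cur = b.alpha.load(std::memory_order_relaxed);
    while (cur < alpha &&
           !b.alpha.compare_exchange_weak(cur, alpha,
                                          std::memory_order_relaxed)) {
        // cur was reloaded by the failed CAS; retry while still smaller
    }
}

// Helper thread: adopt the tighter of its local bound and the shared one.
int refresh_alpha(const SharedBound& b, int local_alpha) {
    return std::max(local_alpha, b.alpha.load(std::memory_order_relaxed));
}
```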
For up to 16 cpus, I don't see enough idle time to measure (I do place restrictions on splitting, such as no more than N cpus split at any single point (N is tunable)).
I had decided that once 32 core (and beyond) boxes are readily available, I'd probably start over and write an iterated search..
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Re: Recursive DTS-like search algorithm
bob wrote: I had decided that once 32 core (and beyond) boxes are readily available, I'd probably start over and write an iterated search..

As I've written in another thread, I believe that an iterated search is a good idea even if there is only a single core available.
It will likely be many years before we see 32 core machines; they would be very costly and would not be needed by a typical home user with no heavy tasks other than rendering and video ripping. If such machines do appear relatively soon, they would likely have four octocore CPUs.
I think a more likely development, already in progress, is the deployment of near super-computer video cards. These would be used by gamers who drive much of the market and who will soon be demanding 4K (3840x2160) video. Perhaps the upcoming Apple Mac Pro, which has two very high end video cards as standard, will lead the way for applications which rely upon OpenGL and the like more than upon the CPU.
- Posts: 690
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Recursive DTS-like search algorithm
My current implementation contains two algorithm improvements and a number of implementation bug fixes.
The first algorithm improvement is that the lock contention on the priority work queue is measured in real-time and if it goes above 10% the minimum split depth is incremented. If it goes below 5% and the current minimum split depth is >10, the minimum split depth is decremented. This means that the algorithm adapts automatically to endgame positions where the branching factor is much smaller than in middlegame positions.
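The adaptive rule above can be sketched directly as a small controller applying the stated 10%/5% thresholds. The starting depth of 10 and the idea of a per-window contention ratio are assumptions inferred from the description, not the actual implementation:

```cpp
// Hysteresis controller for the minimum split depth, driven by the
// measured lock contention on the priority work queue.
struct SplitDepthController {
    int min_split_depth = 10;  // assumed starting/floor value

    // contention = contended acquisitions / total acquisitions over the
    // most recent measurement window, in [0, 1].
    void update(double contention) {
        if (contention > 0.10)
            ++min_split_depth;  // queue too hot: split less often
        else if (contention < 0.05 && min_split_depth > 10)
            --min_split_depth;  // queue quiet: allow cheaper splits again
        // between 5% and 10%: leave the depth alone (hysteresis band)
    }
};
```

The hysteresis band keeps the depth from oscillating when contention hovers near a single threshold, and the floor of 10 matches the "> 10" condition quoted above.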
The second algorithm improvement addresses disadvantage #2 in the original algorithm:
petero2 wrote: If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the master thread fails low on move one, it will start searching move two. When this happens the helper thread is "kicked out" of that branch and has to start searching somewhere else. The idea is that the master thread should quickly get to the same point where the helper thread was interrupted because of transposition table hits. However, there is still some overhead involved. This seems hard to avoid with the current design, but the hope is that the overhead is small.

To reduce the number of "kick-outs", when a helper thread is about to select a move to search from a potential SplitPoint, it checks whether the next available move has an estimated usefulness probability above 98%. If so, the helper thread selects the move after the next available move. This has the effect that the SplitPoint owner has to search the next available move itself, and hopefully by the time that search is finished the helper thread has also finished its own search, so that no kick-out is necessary.
An example might make it more clear. Assume that at a SplitPoint the available moves are:
e4, d4, Nf3, Nc3, c4, f4
Let's say the SplitPoint owner has finished searching e4 and is now searching d4. A thread that wants to help at this SplitPoint could choose to search Nf3. However, if it is almost certain that Nf3 needs to be searched (because the confidence that this is an ALL node is very high), it is assumed that Nc3 also has to be searched. In this case the helper thread will start searching Nc3. When the SplitPoint owner thread finishes the d4 search, it will continue with Nf3, but since no helper thread has been searching that move, it is quite likely that the Nc3 search finishes before the Nf3 search. When this happens, the helper thread will again conclude that it could start searching c4, but it will instead search f4, because the confidence that c4 needs to be searched is high (which implies f4 very likely needs to be searched as well).
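The selection rule in this example might look like the following sketch, where the usefulness estimate is taken as an input (the names and move encoding are invented; the probability model is whatever the engine's ALL-node estimate provides):

```cpp
#include <vector>

struct PendingMove {
    int move;           // encoded move, placeholder representation
    double usefulness;  // estimated probability the move must be searched
};

// Returns the index into `pending` that a helper thread should take, or
// -1 if nothing is left. If the next unsearched move is almost certainly
// needed (> 98%), skip it and take the one after, leaving the
// near-certain move for the SplitPoint owner to avoid a later kick-out.
int pick_helper_move(const std::vector<PendingMove>& pending) {
    if (pending.empty()) return -1;
    if (pending[0].usefulness > 0.98 && pending.size() > 1)
        return 1;       // leave the near-certain move for the owner
    return 0;
}
```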
With these two improvements I get pretty good speedup results if I disable all forward pruning except null move in my program. Using my time to depth measuring tool I get:
pos d time spd2 spd3 spd4 nodes n2 n3 n4 nps nps2 nps3 nps4 score ds2 ds3 ds4
1 13 158.001 1.741 2.789 3.477 390481446 1.122 1.054 1.118 2471390 1.954 2.940 3.888 21 0 0 0
2 13 74.117 1.945 3.006 3.761 181807120 1.009 0.983 1.045 2452980 1.962 2.956 3.929 17 0 0 0
3 12 105.165 1.969 2.874 3.649 258324459 1.002 1.025 1.069 2456380 1.972 2.945 3.901 17 0 0 0
4 13 92.929 1.873 3.062 3.725 235221371 1.024 0.972 1.057 2531190 1.918 2.976 3.937 23 0 0 0
5 13 88.262 2.021 2.934 3.728 224639423 0.986 1.005 1.055 2545130 1.992 2.949 3.932 28 0 0 0
6 14 111.331 1.940 2.314 2.896 289098572 1.023 1.263 1.340 2596740 1.984 2.923 3.879 15 0 0 0
7 13 86.060 1.851 2.883 3.828 220508128 1.040 1.030 1.030 2562260 1.924 2.971 3.944 -24 0 0 0
8 13 459.788 3.519 3.549 6.718 1166951782 0.557 0.836 0.589 2538020 1.959 2.967 3.957 -26 0 0 0
9 13 295.005 2.026 3.026 4.595 761747959 0.974 0.980 0.859 2582150 1.973 2.966 3.945 -29 0 0 0
10 13 390.187 2.288 3.199 4.188 984488779 0.850 0.932 0.946 2523120 1.945 2.981 3.964 -35 0 0 0
11 13 161.333 1.865 3.219 4.213 420113648 0.995 0.918 0.929 2604010 1.856 2.954 3.915 22 0 0 0
12 15 269.960 1.869 2.853 3.892 712708990 1.024 1.036 1.012 2640060 1.914 2.955 3.938 27 0 0 0
13 14 304.837 1.922 2.884 3.790 795950521 1.028 1.024 1.032 2611070 1.975 2.952 3.912 23 0 0 0
14 14 192.420 2.054 3.121 3.818 505919224 0.936 0.937 1.014 2629240 1.923 2.924 3.873 29 0 0 0
15 14 95.780 1.976 2.473 3.364 258563711 0.953 1.167 1.125 2699550 1.883 2.886 3.783 26 -2 -2 -2
16 15 299.518 1.726 2.489 3.512 791776805 1.143 1.184 1.117 2643500 1.973 2.946 3.922 19 0 0 0
17 16 525.107 1.680 2.890 3.460 1372305676 1.176 1.018 1.127 2613380 1.976 2.942 3.899 0 0 0 0
18 15 95.574 2.055 2.643 3.114 254539882 0.956 1.081 1.222 2663280 1.965 2.858 3.805 0 0 0 0
19 14 23.994 1.721 2.250 2.892 64430788 1.125 1.293 1.329 2685240 1.937 2.909 3.843 0 0 0 0
20 16 138.037 2.025 2.765 4.324 393727539 0.961 1.044 0.889 2852330 1.946 2.887 3.843 54 -16 -8 -7
21 14 59.830 1.464 2.558 3.317 174209005 1.252 1.135 1.156 2911710 1.832 2.904 3.834 46 1 1 1
22 13 11.797 2.007 2.047 5.391 34931060 0.879 1.389 0.702 2961040 1.765 2.843 3.784 18 4 0 4
23 13 11.601 1.744 2.623 3.403 35027326 1.126 1.091 1.122 3019480 1.965 2.862 3.818 31 0 0 0
24 14 25.757 1.129 2.604 1.907 76064258 1.706 1.103 1.973 2953100 1.926 2.872 3.762 55 0 0 -13
avg 169.850 1.934 2.794 3.790 441814061 1.035 1.063 1.077 2656097 1.934 2.928 3.884
So the NPS speedup with 4 threads is 3.88 and the time to depth speedup is 3.79.

Disabling all forward pruning (LMR, LMP, futility, reverse futility, razoring) is clearly not a good thing to do, except possibly when trying to understand how the parallel search algorithm behaves. Unfortunately, when all forward pruning methods are enabled, the speedup results are much worse:
pos d time spd2 spd3 spd4 nodes n2 n3 n4 nps nps2 nps3 nps4 score ds2 ds3 ds4
1 18 76.198 2.308 2.271 3.036 140705368 0.851 1.286 1.262 1846580 1.964 2.919 3.833 0 0 -3 -4
2 18 48.578 1.892 2.195 2.818 90289115 1.017 1.337 1.373 1858660 1.925 2.934 3.870 0 -9 0 8
3 17 54.028 1.812 2.437 2.413 100038381 1.078 1.191 1.610 1851600 1.953 2.904 3.884 16 -12 -29 7
4 18 41.333 1.928 2.357 3.484 79155826 1.026 1.238 1.112 1915080 1.978 2.919 3.874 3 -3 -19 -3
5 18 34.182 2.379 1.983 2.120 64808802 0.844 1.488 1.825 1895990 2.009 2.951 3.869 0 0 0 9
6 19 50.132 1.977 1.653 1.742 95170456 1.022 1.759 2.188 1898390 2.020 2.908 3.812 2 4 12 -2
7 18 32.034 0.763 1.448 1.605 61156020 2.562 2.037 2.436 1909110 1.953 2.951 3.909 -57 25 -16 -4
8 18 71.050 1.575 1.695 2.331 136117702 1.233 1.720 1.653 1915810 1.943 2.916 3.853 -66 23 13 12
9 18 61.141 1.685 2.540 3.037 117226780 1.121 1.151 1.266 1917310 1.888 2.922 3.845 -62 -3 7 10
10 18 102.272 1.789 3.262 4.286 199391848 1.100 0.915 0.913 1949630 1.968 2.984 3.915 -75 -2 -66 -42
11 18 43.191 1.832 2.500 2.247 85370535 1.063 1.147 1.676 1976560 1.947 2.868 3.766 28 -18 -15 -13
12 20 40.677 1.494 1.245 1.271 82464862 1.342 2.344 3.044 2027320 2.005 2.920 3.871 0 20 31 25
13 19 29.984 0.799 0.865 1.015 60412013 2.459 3.346 3.757 2014820 1.965 2.895 3.813 24 -12 -18 -1
14 19 49.076 0.852 2.619 1.651 96911003 2.287 1.088 2.329 1974710 1.949 2.848 3.845 20 -5 2 5
15 19 60.583 1.661 1.798 1.869 126451201 1.169 1.586 2.009 2087240 1.942 2.853 3.756 0 0 0 0
16 20 26.787 1.200 1.876 1.721 53529373 1.600 1.508 2.196 1998310 1.921 2.830 3.780 4 -4 -4 -4
17 21 71.311 1.489 1.981 3.441 151231576 1.320 1.452 1.085 2120750 1.965 2.875 3.735 0 0 0 0
18 20 29.540 1.718 1.870 2.340 65524777 1.120 1.521 1.601 2218160 1.925 2.845 3.745 0 0 0 0
19 19 47.341 1.458 1.849 2.340 110390929 1.343 1.586 1.646 2331800 1.958 2.932 3.851 0 0 0 0
20 21 80.663 1.119 0.975 1.281 199634703 1.761 3.022 3.046 2474910 1.971 2.945 3.901 0 70 34 40
21 19 81.436 1.602 2.076 3.131 203068873 1.228 1.408 1.234 2493610 1.967 2.923 3.863 75 4 1 -58
22 18 28.240 1.808 1.397 4.831 74345393 1.080 2.079 0.801 2632650 1.953 2.904 3.869 56 -23 18 -6
23 18 28.886 1.522 1.475 1.317 75867252 1.286 1.979 2.956 2626460 1.958 2.919 3.892 64 0 13 21
24 19 40.983 1.527 1.486 1.707 105705955 1.261 1.998 2.282 2579290 1.924 2.969 3.895 106 0 17 10
avg 51.235 1.591 1.911 2.376 107290364 1.341 1.674 1.888 2104781 1.956 2.910 3.844
The NPS speedup with 4 threads is still good, 3.84 instead of 3.88, but the time to depth speedup is now only 2.38 instead of 3.79. I don't do any intentional "widening" of the search, so it is likely that the extra nodes searched are mostly wasted.

With forward pruning enabled we can see that it is much more common that the multi-threaded search returns a different score than the single-threaded search (see columns ds2, ds3, ds4). I am not sure what causes this, but my guess is that the way I propagate history and killer information between threads has some problems.
When a thread starts to search at a SplitPoint, it copies the current killer and history tables from the SplitPoint owner thread. This copy is made without any locking. The assumption is that it is better to avoid the locking overhead and accept the occasional corrupted entries caused by the data races. When a helper thread starts searching a new move at the same SplitPoint it was previously searching at, it keeps its killer and history tables, so no killer or history tables are copied from the SplitPoint owner. When a helper thread has finished searching at a SplitPoint, the killer and history information in the helper thread is thrown away. There is no mechanism to propagate the information back to the SplitPoint owner.
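The lockless hand-off described above is a benign data race; one way to make it well-defined in C++ terms is to hold the counters in relaxed atomics, as in this sketch (table size and names are invented, not the actual layout):

```cpp
#include <array>
#include <atomic>
#include <cstdint>

// Illustrative history table indexed by from-square x to-square.
constexpr int kHistorySize = 64 * 64;

struct HistoryTable {
    std::array<std::atomic<std::int32_t>, kHistorySize> score{};
};

// Helper joining a split point: snapshot the owner's table without any
// lock. Entries may be concurrently mid-update; a stale or odd value only
// perturbs move ordering, never correctness of the search result.
void copy_history(const HistoryTable& owner, HistoryTable& helper) {
    for (int i = 0; i < kHistorySize; ++i)
        helper.score[i].store(
            owner.score[i].load(std::memory_order_relaxed),
            std::memory_order_relaxed);
}
```

With plain non-atomic arrays the same copy would work in practice on common hardware, but would formally be undefined behavior under the C++ memory model; relaxed atomics cost essentially nothing here and make the race well-defined.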
Is this a sensible way to handle killer and history information? How are other programs doing this?