Lazy SMP and "lazy cluster" experiments

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jstanback
Posts: 130
Joined: Fri Jun 17, 2016 4:14 pm
Location: Colorado, USA
Full name: John Stanback

Re: Lazy SMP and lazy cluster algorithm

Post by jstanback »

lucasart wrote:
petero2 wrote:Lazy SMP algorithm
[...]
Half of the helper threads search at the same depth as the master thread and the other half search at a one ply deeper depth.

When a helper thread finishes its search it will send the search result to the master thread. The helper thread will then restart its current search at a one ply deeper depth. It is expected that this search will be aborted rather quickly. As soon as the master thread receives the helper thread search result, it will continue its algorithm, which causes it to call negaScoutRoot() again with different parameters, and negaScoutRoot() tells all helper threads to start searching using the new parameters.

The master thread is responsible for all UCI communication and when reporting search depth it uses its own search depth. This means that even if a "depth+1" helper thread finished first, the reported depth is "depth", even though in that case the "quality" of the search is "depth+1".
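
A minimal sketch of the helper depth assignment described above (the naming is mine, not texel's actual code):

Code: Select all

// Half of the helpers search at the master's depth and the other half
// one ply deeper; a helper that finishes restarts one ply deeper still.
int helperDepth(int masterDepth, int helperIndex) {
    return masterDepth + (helperIndex % 2);
}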

The master thread does not directly talk to all helper threads. Instead all search threads are arranged in a tree, where the master thread is the tree root node. Each tree node has at most 4 children. The tree is created to have minimal height and children are assigned from left to right. For example if there are 16 search threads they are arranged as follows:

Code: Select all

0
|
+--------+-----------+---------+
|        |           |         |
1        2           3         4
|        |           |
+-+-+-+  +-+--+--+   +--+--+
| | | |  | |  |  |   |  |  |
5 6 7 8  9 10 11 12  13 14 15
A search thread is only allowed to talk to its neighbors in the tree graph. When a search thread receives a command from a neighbor it acts on the command and also forwards it to its parent or children as appropriate.
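
For a tree laid out breadth-first like this, neighbors can be found by index arithmetic alone. A minimal sketch (helper names are mine, not texel's):

Code: Select all

#include <vector>

// In a complete 4-ary tree stored by breadth-first index, the children
// of node i are 4*i+1 .. 4*i+4 and the parent of node i > 0 is (i-1)/4.
int parentOf(int i) { return (i - 1) / 4; }

std::vector<int> childrenOf(int i, int nThreads) {
    std::vector<int> children;
    for (int c = 4 * i + 1; c <= 4 * i + 4 && c < nThreads; c++)
        children.push_back(c);
    return children;
}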

This thread tree structure is not strictly necessary for SMP search, since SMP hardware does not have an excessive number of hardware threads, so the master thread could talk directly to all helper threads. However for clusters the number of possible hardware threads could be very large (the largest super computers have more than one million cores), so in that case it would be very bad if the master thread tried to talk directly to all helper threads.

The algorithm is NUMA aware. This means that each search thread is locked to a specific NUMA node, and care is taken to allocate thread local memory on the NUMA node where the thread runs. The shared transposition table is allocated on NUMA node 0 if there is enough free memory on that node.
Interesting. My Lazy SMP is rather different. Here's how it works:
* main thread (the one that starts the main() function) does UCI I/O, and is always responsive.
* upon receiving a go command, it spawns a timer thread.
* the timer thread spawns N worker threads.

Timer and Worker(s) only live while the engine is searching. It turns out that spawning a few threads is so fast that the overhead is totally negligible, so you don't need to keep threads alive all the time or deal with condition variables. Simple thread_create() and thread_join() operations are all I ever need.
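
In C++ terms, the lifecycle described above might look like this (a minimal sketch; the names are mine):

Code: Select all

#include <thread>
#include <vector>

// Hypothetical per-worker entry point: the real engine runs its
// iterative-deepening loop here.
static void searchWorker(int id) { (void)id; }

// Spawned on "go", gone when the search ends: plain create and join,
// no persistent pool, no condition variables.
void timerThread(int numWorkers) {
    std::vector<std::thread> workers;
    for (int i = 0; i < numWorkers; i++)
        workers.emplace_back(searchWorker, i);
    // ... time management runs here, raising the stop signal on timeout ...
    for (auto& w : workers)
        w.join();
}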

Already the philosophy is quite different. All workers are equal. There is no concept of main thread among the workers. This simplifies the code, as there is no special case for the main thread.

UCI output (printing completed iterations with depth, score, pv etc.) is centralised in a global object that contains the last iteration completed (highest depth completed so far; it doesn't matter which of the equal workers completed it). The object embeds a mutex, obviously. When a worker completes an iteration, it just tries to update the PV: the update only happens if the new iteration is deeper, the result is printed to screen when an update occurs, and the mutex guarantees no concurrency problem that would cause scrambled output.
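
A minimal sketch of such a centralised result object (hypothetical names):

Code: Select all

#include <cstdio>
#include <mutex>
#include <string>

// Workers race to publish completed iterations; only a strictly
// deeper iteration wins.
struct SearchResult {
    std::mutex mtx;
    int depth = 0;

    void update(int newDepth, int score, const std::string& pv) {
        std::lock_guard<std::mutex> lock(mtx);
        if (newDepth <= depth)
            return;  // an equal worker already finished this depth
        depth = newDepth;
        std::printf("info depth %d score cp %d pv %s\n",
                    newDepth, score, pv.c_str());
    }
};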

In the ID loop, before a worker starts an iteration, it checks how many workers are searching this iteration or above. If at least half the workers are searching >= depth, then there is no point in searching this iteration, try the next one (and so on, until we find a useful iteration to work on).
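
A minimal sketch of this depth-selection check (names are mine):

Code: Select all

#include <atomic>

constexpr int MAX_THREADS = 64;
constexpr int MAX_PLY = 128;

// Each worker publishes the iteration depth it is currently searching.
std::atomic<int> workerDepth[MAX_THREADS];

// Skip any iteration that at least half the workers are already on.
int pickDepth(int myDepth, int numWorkers) {
    for (;; myDepth++) {
        int busy = 0;
        for (int i = 0; i < numWorkers; i++)
            if (workerDepth[i].load(std::memory_order_relaxed) >= myDepth)
                busy++;
        if (2 * busy < numWorkers || myDepth >= MAX_PLY)
            return myDepth;
    }
}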

Search abortion is twofold:
* Timer can signal all workers to stop immediately (using longjmp() in C, or an exception in C++; see the sketch below).
* When a worker completes an iteration, it signals all workers working <= depth to stop immediately. When the affected workers stop, they return to the ID loop and look for a useful job to do, as previously described.
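
A minimal C++ sketch of the timer-driven abort, using an exception to unwind out of the search (the names are mine):

Code: Select all

#include <atomic>

std::atomic<bool> stopAll{false};  // set by the Timer thread
struct StopSearch {};              // thrown to unwind back to the ID loop

// Called frequently inside the recursive search.
inline void check_abort() {
    if (stopAll.load(std::memory_order_relaxed))
        throw StopSearch();
}

// In each worker's ID loop:
//     try { searchRoot(depth); }
//     catch (const StopSearch&) { /* look for the next useful depth */ }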

All this ensures that you have half the workers at each depth d and d+1. But it does so in a rather martial way. Workers are not allowed to just go their own way in the ID loop. We constantly want to keep them in rank and correct towards 1/2 at d, and 1/2 at d+1.

Next steps:
1/ The next thing to try is to use more complicated patterns than 1/2 at d and 1/2 at d+1. This has been shown to be very useful with many threads in Stockfish testing. But I cannot see any gain in varying this for 4 threads, and I do not have big machines at my disposal, so I stay with the simple scheme for now.

2/ The other thing to try is obviously ABDADA. Again, I really doubt I will gain anything on 4 threads above what I have. It probably needs 8 or 16 threads to see anything.
This is very similar to what I do in Wasp, except that I have a "main" thread which handles I/O and a timer, and I create a search thread pool when the program is started. Before a search, the main thread copies the root position to all the search threads and sets a flag for them to start searching. When the time limit is reached, the main thread sets a global timeout flag. Each search thread checks this flag frequently (in the move loop at every depth) and returns if it is set, so no longjmp is used.

At every iteration (beyond depth 5) each thread checks whether half or more of the other threads are already searching at that depth and, if so, bumps the depth. Also, if a thread is not the first to search at a given depth I apply a little randomness to the scores used to order the root moves. I don't know if this really helps at all.

I have a global structure that contains the PV, depth, and score for the best move, and each thread can update it if it finds a new best move (higher depth, or same depth and higher score). A search thread will discontinue an iteration if it is about to search a new move at the root with a depth lower than the depth of the current best move (but not if the depth is the same). This is very simple and works reasonably well. I haven't yet tried ABDADA.
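
A minimal sketch of the flag-based abort described above (the names are mine, not Wasp's):

Code: Select all

#include <atomic>

std::atomic<bool> timeUp{false};  // set once by the main thread's timer

// Checked in the move loop at every depth; the search unwinds by plain
// returns, so no longjmp is needed. The returned value is discarded.
int search(int depth, int alpha, int beta) {
    if (timeUp.load(std::memory_order_relaxed))
        return alpha;
    // ... generate moves, recurse with -search(depth - 1, -beta, -alpha) ...
    return alpha;
}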

John
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Lazy SMP and lazy cluster algorithm

Post by TomKerrigan »

jstanback wrote: ...
This is very similar to what I do in Wasp. [...] This is very simple and works reasonably well. I haven't yet tried ABDADA. ...
Hi John, hope you've been doing well. :) If you're interested in ABDADA, I recently posted this very easy way of implementing it:

http://www.tckerrigan.com/Chess/Paralle ... ed_ABDADA/
jstanback
Posts: 130
Joined: Fri Jun 17, 2016 4:14 pm
Location: Colorado, USA
Full name: John Stanback

Re: Lazy SMP and lazy cluster algorithm

Post by jstanback »

TomKerrigan wrote:
jstanback wrote: ...
This is very similar to what I do in Wasp. [...] This is very simple and works reasonably well. I haven't yet tried ABDADA. ...
Hi John, hope you've been doing well. :) If you're interested in ABDADA, I recently posted this very easy way of implementing it:

http://www.tckerrigan.com/Chess/Paralle ... ed_ABDADA/
Hi Tom! I hope you're also doing well. Yes, I saw your post and that of Peter O., which piques my interest in ABDADA or some variant of it. I've been hoping to improve the strength with a single thread first, but haven't made much progress with that lately despite trying just about everything I can think of. But results in the last few days with a singular extension variant are looking a bit promising...

John
petero2
Posts: 686
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: ELO measurements

Post by petero2 »

Laskos wrote:Your results in this post with Lazy Texel are in fact quite remarkable and, after looking more carefully, weird to me.
I agree, the results look weird to me too. I measured the 2 core and 4 core elo gains again with the latest texel, which uses the approximate ABDADA algorithm (~+10 elo) and removes bad NUMA code (0 elo for this test).

I used the same base time control 24+0.24 as before:

Code: Select all

config        dElo  sDev  win loss draw
2*T vs 1*T  +109.0   4.2  826  159 1211
2*C vs 2*T   -32.5   4.4  269  458 1299

4*T vs 1*T  +214.8   3.9 1152   39  833
4*C vs 4*T   -50.1   2.9  473 1106 2843
This gives effective speedups, treating the elo deficit of N cores versus N times the time as a fraction of the elo gained from that time increase, i.e. speedup = N^(1 - elo_deficit/elo_gain):

Code: Select all

2^(1-32.5/109.0) = 1.63
4^(1-50.1/214.8) = 2.89
I then ran the same tests using stockfish 8:

Code: Select all

config        dElo  sDev  win loss draw
2*T vs 1*T   +81.3   3.5  580   79 1521
2*C vs 2*T    -9.5   2.7  266  348 2394

4*T vs 1*T  +146.1   2.4 2086   47 2997
4*C vs 4*T   -29.0   2.7  177  419 2306
Effective speedups:

Code: Select all

2^(1- 9.5/ 81.3) = 1.84
4^(1-29.0/146.1) = 3.04
The trend is the same as before. Texel gains more going from 2C to 4C than going from 1C to 2C. For stockfish it is the opposite.

The turboboost effect is around 5% for the 4*C vs 4*T matches, averaged over all test computers I used. For the 2*C vs 2*T matches the effect is almost zero. In my previous tests the turboboost effect was also almost zero since the maximum core imbalance was 2*N*C vs 1*N*C matches. Also the EC2 instances do not seem to have any turboboost effect. Anyway, the turboboost effect would make the texel scaling more weird, not less, so some other explanation is needed.

The time control is 12.5 times shorter than in Andreas Strangmuller's test. I don't see why that would penalize 2 cores more than 4 cores for texel, though.
Laskos wrote:Now the speedups: 1.56, 1.62, 1.69, 1.72 and for NUMA node, 1.49. NUMA node, from your results too behaves with number of cores similarly to non-NUMA, but at smaller values of effective speedup. Small with few cores, larger with many cores. The results are frankly weird to me. This is not Vas Rajlich's empirical diminishing numbers, this is not Robert Houdart's rule of thumb 1.75**doublings, this is not Amdahl's law, this is... most closely related to Robert Hyatt's seemingly ridiculous rule of thumb:

effective speedup = 1 + 0.7 * (n_cores-1).

for his YBW.
I don't see any theoretical reason to believe Amdahl's law would apply, at least not with an efficiency below 99.99%, since the amount of computation that runs serially is very small and decreases as the search tree grows. Instead there is a more abstract overhead caused by doing unnecessary work, but quantifying how that abstract overhead behaves as the number of cores goes to infinity seems very hard.

I don't see any theoretical reason to believe any other formula would apply either though, since I don't really understand what makes lazy SMP work.
petero2
Posts: 686
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Lazy SMP and lazy cluster algorithm

Post by petero2 »

lucasart wrote:Interesting. My Lazy SMP is rather different.
Aside from the fact that texel's design is more complicated in order to support clusters, the main algorithmic difference seems to be that in texel the root search is done serially. This simplifies some things but might not be best for elo:

* My time management uses information about fail lows at the root (ie score below aspiration window). It is not obvious how to handle that if you have N threads searching at the root at the same time, using possibly different aspiration windows and searching root moves in different orders.

* If I run out of time before an iteration is complete, I use the best move found so far in the current iteration. How should that be handled if each thread has its own opinion about which move is best?

I suppose the above could be handled by ignoring all partial iteration results for all threads except the first, but there might be better ways to use the possibly conflicting information.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Approximate ABDADA

Post by TomKerrigan »

petero2 wrote:Since my experiments to improve on lazy SMP in texel by using ideas from the ABDADA algorithm were quite successful, I decided to write a summary of the algorithm I ended up with.
...
Hi Peter, I was wondering if you saw my parallel work, it sounds similar in a number of ways:

http://www.tckerrigan.com/Chess/Parallel_Search/

I've come up with a few easy-to-implement improvements over ABDADA that yield small but measurable performance gains:

1) Only set the "exclusive" bit if you're doing a null-window search of a move. As soon as a null-window search fails, zero out the exclusive bit. This increases the odds that other threads will help search the move that's failing.

2) Cutoff Checks: at higher depths, if you've deferred any moves, probe the hash table in your search function's inner loop to see if the position has failed high.

3) Synchronize searches at the root. If ANY thread does something to move the search along, have a mechanism to stop and re-start ALL the threads so they're helping with the latest search:
a) If a thread finishes a ply
b) If a thread finishes searching a move such that it becomes the PV move and changes alpha (including the first move)
c) If a thread finishes a null-window search of a move and it fails, all threads should stop immediately and help search that move (at that point searching any other move becomes a waste of time)

On average, these improvements combine to improve my speedups by about 10%-15% total; mostly they can provide big wins when the PV is changing, which IMO is when you really want your engine to be sped up anyway. What's the point of searching another ply deeper if it's just going to tell you the same thing the last ply did? :)
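
A minimal sketch of improvement 1, in the style of a simplified-ABDADA "currently searching" table (the table and helper names are illustrative assumptions, not the actual code):

Code: Select all

#include <atomic>
#include <cstdint>

// A thread marks the position it is null-window searching and clears
// the mark the moment that search returns.
constexpr int SLOTS = 1 << 15;
std::atomic<uint64_t> searching[SLOTS];

void set_exclusive(uint64_t posKey, bool on) {
    searching[posKey & (SLOTS - 1)].store(on ? posKey : 0,
                                          std::memory_order_relaxed);
}

bool is_exclusive(uint64_t posKey) {
    return searching[posKey & (SLOTS - 1)].load(std::memory_order_relaxed)
           == posKey;
}

// In the move loop, only the null-window part is marked exclusive, and
// the mark is dropped as soon as the search fails, so that other
// threads pile onto the move that is failing:
//
//     set_exclusive(childKey, true);
//     score = -search(child, -alpha - 1, -alpha, depth - 1);
//     set_exclusive(childKey, false);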

I would be interested to know the results if you decide to implement all or some of my improvements!
petero2
Posts: 686
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Approximate ABDADA

Post by petero2 »

TomKerrigan wrote:Hi Peter, I was wondering if you saw my parallel work, it sounds similar in a number of ways:

http://www.tckerrigan.com/Chess/Parallel_Search/
Yes, I saw that. Very interesting to read.
TomKerrigan wrote:1) Only set the "exclusive" bit if you're doing a null-window search of a move. As soon as a null-window search fails, zero out the exclusive bit. This increases the odds that other threads will help search the move that's failing.
I already do this. This seemed like the logical thing to do so I have not tried it any other way.
TomKerrigan wrote:2) Cutoff Checks: at higher depths, if you've deferred any moves, probe the hash table in your search function's inner loop to see if the position has failed high.
After reading your description I tried a variant of it. If the depth is large enough always do cutoff checks even if there are no deferred moves. A cutoff is possible even if there are no deferred moves because I use thread local killer and history tables, so move ordering is not the same in all threads.

However, this was not a win for me. I will try again with your unmodified idea, as it should lead to a lower overhead (fewer cutoff checks).
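
For reference, a minimal sketch of the cutoff check under discussion (tt_probe here is a hypothetical stub, not texel's actual interface):

Code: Select all

#include <cstdint>

// Hypothetical TT probe: returns true and fills 'score' when a stored
// entry of sufficient depth gives a usable bound (stubbed here).
static bool tt_probe(uint64_t posKey, int depth, int& score) {
    (void)posKey; (void)depth; (void)score;
    return false;
}

// Once at least one move at this node has been deferred, another thread
// may have finished the node in the meantime, so a fresh probe inside
// the move loop can yield an immediate beta cutoff.
static bool cutoff_check(uint64_t posKey, int depth, int beta, int& score) {
    return tt_probe(posKey, depth, score) && score >= beta;
}
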
TomKerrigan wrote:3) Synchronize searches at the root. If ANY thread does something to move the search along, have a mechanism to stop and re-start ALL the threads so they're helping with the latest search:
a) If a thread finishes a ply
b) If a thread finishes searching a move such that it becomes the PV move and changes alpha (including the first move)
c) If a thread finishes a null-window search of a move and it fails, all threads should stop immediately and help search that move (at that point searching any other move becomes a waste of time)
My implementation is different, since only the master thread is performing the root search. The helper threads only start helping after the first half move has been played. Therefore each helper thread will be searching the same root move using the same alpha/beta window. When any thread finishes its search, the master thread will use the result and then tell the helper threads to start searching a different root move, or to research the same root move with a larger window.

So I get kind of the same effect even though the details are quite different.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Approximate ABDADA

Post by TomKerrigan »

petero2 wrote:
TomKerrigan wrote:2) Cutoff Checks: at higher depths, if you've deferred any moves, probe the hash table in your search function's inner loop to see if the position has failed high.
After reading your description I tried a variant of it. If the depth is large enough always do cutoff checks even if there are no deferred moves. A cutoff is possible even if there are no deferred moves because I use thread local killer and history tables, so move ordering is not the same in all threads.

However, this was not a win for me. I will try again with your unmodified idea, as it should lead to a lower overhead (fewer cutoff checks).
Interesting thought re: move ordering but I would expect that doing a cutoff check before every move (even at pretty high depths) would cause a ton of unnecessary memory traffic since it's probably pretty uncommon that multiple threads are searching the same node.

Cutoff checks on average don't provide much of a benefit to my engine (maybe around 3% total). On most positions they don't make a difference but for positions where the PV is in flux they can help a lot. Not sure how you're testing but maybe it's obscuring this fact.

Also it's pretty easy to tell how much work cutoff checks are saving... if you were going to return early from a cutoff check, instead keep track of the node count at that point, and then add that to a "saved nodes" count before returning from the search function. I find that to be an interesting number.
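
A minimal sketch of that accounting (the names are mine):

Code: Select all

#include <cstdint>

uint64_t nodes = 0;       // incremented once per visited node, as usual
uint64_t savedNodes = 0;  // work a real early return would have skipped

// Call where the cutoff check fires, instead of returning early;
// only the first firing in a node is remembered.
uint64_t mark_cutoff(uint64_t markedAt) {
    return markedAt == 0 ? nodes : markedAt;
}

// Call just before search() returns.
void credit_saved_nodes(uint64_t markedAt) {
    if (markedAt != 0)
        savedNodes += nodes - markedAt;
}
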
petero2 wrote:
TomKerrigan wrote:3) Synchronize searches at the root. If ANY thread does something to move the search along, have a mechanism to stop and re-start ALL the threads so they're helping with the latest search:
a) If a thread finishes a ply
b) If a thread finishes searching a move such that it becomes the PV move and changes alpha (including the first move)
c) If a thread finishes a null-window search of a move and it fails, all threads should stop immediately and help search that move (at that point searching any other move becomes a waste of time)
My implementation is different, since only the master thread is performing the root search. The helper threads only start helping after the first half move has been played. Therefore each helper thread will be searching the same root move using the same alpha/beta window. When any thread finishes its search, the master thread will use the result and then tell the helper threads to start searching a different root move, or to research the same root move with a larger window.

So I get kind of the same effect even though the details are quite different.
That's probably good too since you'll be searching the best moves first, i.e., the moves that are most likely to change the PV.

I've tried this sort of thread synchronization in the past and found that a non-trivial amount of time was spent waiting for threads to stop and then re-starting them. If you don't have a system set up to do that asynchronously, I would recommend it!
petero2
Posts: 686
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Approximate ABDADA

Post by petero2 »

TomKerrigan wrote:
petero2 wrote:
TomKerrigan wrote:2) Cutoff Checks: at higher depths, if you've deferred any moves, probe the hash table in your search function's inner loop to see if the position has failed high.
After reading your description I tried a variant of it. If the depth is large enough always do cutoff checks even if there are no deferred moves. A cutoff is possible even if there are no deferred moves because I use thread local killer and history tables, so move ordering is not the same in all threads.

However, this was not a win for me. I will try again with your unmodified idea, as it should lead to a lower overhead (fewer cutoff checks).
Interesting thought re: move ordering but I would expect that doing a cutoff check before every move (even at pretty high depths) would cause a ton of unnecessary memory traffic since it's probably pretty uncommon that multiple threads are searching the same node.

Cutoff checks on average don't provide much of a benefit to my engine (maybe around 3% total). On most positions they don't make a difference but for positions where the PV is in flux they can help a lot. Not sure how you're testing but maybe it's obscuring this fact.
I was testing using self-play games at time control 24+0.24. I tested both with 2 threads and with 4 threads. The result after around 10000 games was well below error margins. I have switched to using your original idea (only do cutoff checks if there are deferred moves) but that did not make a noticeable difference.
TomKerrigan wrote:Also it's pretty easy to tell how much work cutoff checks are saving... if you were going to return early from a cutoff check, instead keep track of the node count at that point, and then add that to a "saved nodes" count before returning from the search function. I find that to be an interesting number.
I added code to keep track of the number of saved nodes. On average the saving is pretty small, usually between 0.1% and 1%, often closer to 0.1% than 1%. I tested this both with 4 threads and 24 threads. In some cases though the savings are much bigger. The largest saving I saw was around 14% but that has only happened once. These results are not repeatable because of SMP non-determinism. I did not see any significant difference between tactical and quiet positions.
TomKerrigan wrote:
petero2 wrote:My implementation is different, since only the master thread is performing the root search. The helper threads only start helping after the first half move has been played. Therefore each helper thread will be searching the same root move using the same alpha/beta window. When any thread finishes its search, the master thread will use the result and then tell the helper threads to start searching a different root move, or to research the same root move with a larger window.

So I get kind of the same effect even though the details are quite different.
That's probably good too since you'll be searching the best moves first, i.e., the moves that are most likely to change the PV.

I've tried this sort of thread synchronization in the past and found that a non-trivial amount of time was spent waiting for threads to stop and then re-starting them. If you don't have a system set up to do that asynchronously, I would recommend it!
I see no reason to wait for threads to start/stop. Just tell the threads what to do and assume they will obey the order. No need to wait for confirmation. This is what I already do:
petero2 wrote:Advantages

* The algorithm is fully asynchronous. No part of the algorithm requires waiting for results from other threads or other cluster nodes.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Approximate ABDADA

Post by D Sceviour »

TomKerrigan wrote:I've tried this sort of thread synchronization in the past and found that a non-trivial amount of time was spent waiting for threads to stop and then re-starting them. If you don't have a system set up to do that asynchronously, I would recommend it!

Code: Select all

typedef struct {
    int    alpha[255];       // lower bound
    int    beta[255];        // upper bound
    int8_t active[255];      // additive thread count, 127 MAX_THREADS
    bool   searched[255];    // abort move signal
} threadController;

threadController ml;
int best_thread_score;       // global variable

// thread controller and synchronization

int i;                       // local move number index

    MutexLock hmutexready;
    if (ml.active[i]) {
        // another thread is already on this move: inherit its window
        alpha = ml.alpha[i];
        beta = ml.beta[i];
        ml.active[i] += 1;
    } else {
        // first thread on this move: publish a null window above the
        // best score found so far (except for the first move)
        if (best_thread_score > alpha && i > 1) {
            alpha = best_thread_score;
            beta = alpha + 1;
        }
        ml.alpha[i] = alpha;
        ml.beta[i] = beta;
        ml.active[i] += 1;
    }
    MutexUnlock hmutexready;
I tried something like this yesterday to synchronize the root thread values. It is not clear how the mutex can be avoided, but it seems the threads have to stop at the root somewhere and wait. If they do not wait, the threads fall over each other and race, especially during a PVS re-search. A PVS re-search requires accurate hash values.