lazy smp questions

lucasart · Post by **lucasart** » Wed Sep 09, 2015 2:28 am

Let's say we start with the most basic implementation:
* we have 4 threads
* run an id loop starting at depth 1 with 2 threads T1, T2
* run an id loop starting at depth 2 with 2 threads T3, T4

Ok, now let's say T2 finishes depth 1:
* in this totally dumb implementation, T2 will start searching depth 2, while T1 is still searching depth 1.
* instead we want to have always 2 threads searching some depth N and the other 2 threads searching depth N+1.
* here we have T1 continuing a useless search, 3 threads at depth 2 (instead of 2), and no one searching depth 3.

Looks like we need a global array (to RW under lock protection obviously), that knows which thread is doing what, and a signaling process to abort search, so that:
* when T2 finishes, it knows that there are already 2 threads searching depth 2, so the next iteration for T2 should be depth 3
* also, T2 knows that T1 is now doing useless work, so T2 raises a signal that T1 can observe to abort the search and go back to the id loop, where the above logic will kick in as well so that T1 will start depth 3.

Another thing is the main thread. What to do with it? It seems natural to park it in a loop that reads stdin to process UCI commands (and raise a "all searching should stop" signal when stop command is received for example).

And then there's stopping the search, because we're out of time, have finished the max depth, or reached the node limit. Looks like the cleanest way would be to have another thread that manages that like:
* sleep for a few msec
* check all these conditions (nodes will be a pain as node counters will be per thread, and locking could slow things down too much), and raise the appropriate signal.

And what of aspiration? If 2 threads sear the same window same depth, one finishes with a score outside the window, then the other is now doing useless work on a refuted window and should be signald to abort and move to the next window, plus windows should be global and for each depth (with lock protection).

Seems not so lazy after all, if I want to do this right. Am I missing something?

bob · Post by **bob** » Wed Sep 09, 2015 7:46 am

lucasart wrote:Let's say we start with the most basic implementation:
* we have 4 threads
* run an id loop starting at depth 1 with 2 threads T1, T2
* run an id loop starting at depth 2 with 2 threads T3, T4

Ok, now let's say T2 finishes depth 1:
* in this totally dumb implementation, T2 will start searching depth 2, while T1 is still searching depth 1.
* instead we want to have always 2 threads searching some depth N and the other 2 threads searching depth N+1.
* here we have T1 continuing a useless search, 3 threads at depth 2 (instead of 2), and no one searching depth 3.

Looks like we need a global array (to RW under lock protection obviously), that knows which thread is doing what, and a signaling process to abort search, so that:
* when T2 finishes, it knows that there are already 2 threads searching depth 2, so the next iteration for T2 should be depth 3
* also, T2 knows that T1 is now doing useless work, so T2 raises a signal that T1 can observe to abort the search and go back to the id loop, where the above logic will kick in as well so that T1 will start depth 3.

Another thing is the main thread. What to do with it? It seems natural to park it in a loop that reads stdin to process UCI commands (and raise a "all searching should stop" signal when stop command is received for example).

And then there's stopping the search, because we're out of time, have finished the max depth, or reached the node limit. Looks like the cleanest way would be to have another thread that manages that like:
* sleep for a few msec
* check all these conditions (nodes will be a pain as node counters will be per thread, and locking could slow things down too much), and raise the appropriate signal.

And what of aspiration? If 2 threads sear the same window same depth, one finishes with a score outside the window, then the other is now doing useless work on a refuted window and should be signald to abort and move to the next window, plus windows should be global and for each depth (with lock protection).

Seems not so lazy after all, if I want to do this right. Am I missing something?

I normally always respond to questions about this topic, but not in this case. You guys like to act childish and make clever insulting remarks, so rather than explaining why this is a poor approach, I'll leave it as an exercise for you to work out for yourself, WITHOUT telling you what was known about this approach 30+ years ago. And it WAS done back then by at least two different groups.

lucasart · Post by **lucasart** » Wed Sep 09, 2015 8:21 am

I'm asking whether Bob Hyatt thinks that lazy SMP is a good idea or not. I know already the answer to that question.

I'm asking some some specific question about how to implement it, in the lazyest, yet still effective way.

cdani · Post by **cdani** » Wed Sep 09, 2015 10:17 am

Hi.
My lazy smp implementation is the most lazy I know about

Anyway seems to be good for minimum 80 elo gain at 4 threads, probably 100 in the latest implementation, pending to be evaluated at CCRL.

The thread 0 manages the id loop. It signals other threads to start thinking at search_root for each new depth. Itself calls search_root, and controls user input every n nodes (calculated on the fly), inside alpha_beta function.

How all this stop? Simply when search_root of thread 0 finishes, it stops the other threads. Some of them maybe had ended their search before, so are not doing nothing. Thats it.

So the win comes only from hash table, as thread 0 ignores the results of the other threads. The only thing directly used from other threads is the number of nodes analyzed, to calculate the NPS.

lucasart · Post by **lucasart** » Wed Sep 09, 2015 11:20 am

Thanks Daniel. That is even lazyer than what I had in mind.

cdani · Post by **cdani** » Wed Sep 09, 2015 2:01 pm

I add some more info on hashes shared between threads, apart from the main one which also serves as a evaluation hash:

* Double move refutation hash table (like discocheck).
* Triple/quadruple move refutation hash table.
* Piece hash, like killers but associated to (hash ^ hash pawns).
* Counter moves.
* Pawn hash.

Each one is supposed to help a little to increase benefices of SMP, but I did not verified if it's true.

The rest, like killers, history and others, are exclusive for each thread.

mar · Post by **mar** » Wed Sep 09, 2015 6:13 pm

What I do:
- always finish depth 1 in main search thread no matter what (also all limits [time/nodes etc] are ignored in this case)
- main search thread (="master") starts (=notifies) helper threads (="slaves") inside id loop (let's say main search thread is thread 0 and takes part in lazy smp itself)
- whenever a slave search finishes it aborts all others (using a special flag present in main search thread)
if it's main search thread that finished first, nothing special happens, otherwise main search thread finds the "lazy helper" that finished and "borrows" its result.
After that I simply continue id loop.

("main thread" in my case in simply I/O loop in main(), because I have a separate thread that manages search and I don't actively poll for user input within search)

lucasart · Post by **lucasart** » Thu Sep 10, 2015 2:08 am

Thanks. But I'm wondering if you even need to have permanent threads, and manage them with locks and condition variables. Why not just create threads just when you need them, starting them directly in the ID loop and join them (with abort signal) or even kill them when you're done. Spawning a few threads is such a cheap operation nowadays (at least on linux where i tested it and was amazed by the results), why bother?

cetormenter · Post by **cetormenter** » Thu Sep 10, 2015 3:26 am

I am pretty sure Nirvana uses the laziest of SMP approaches. I think the chess programming wiki calls it "Shared Hash Table". I simply spawn the threads at the root position and set them free. The main thread then controls when all of the threads stop their search. Not sure how well it scales to many threads but up to 4 is it pretty competitive considering how little time it takes to implement even compared to "lazy" smp. However I don't think this approach satisfies your "yet still effective" criteria since a lot of work is wasted.

JVMerlino · Post by **JVMerlino** » Thu Sep 10, 2015 6:00 pm

cetormenter wrote:I am pretty sure Nirvana uses the laziest of SMP approaches. I think the chess programming wiki calls it "Shared Hash Table". I simply spawn the threads at the root position and set them free. The main thread then controls when all of the threads stop their search. Not sure how well it scales to many threads but up to 4 is it pretty competitive considering how little time it takes to implement even compared to "lazy" smp. However I don't think this approach satisfies your "yet still effective" criteria since a lot of work is wasted.

I did the same thing in Myrddin. Each slave process simply acts as if it is in analysis mode and fills the hash table. I got this idea from H.G Muller.

In my own testing, it got +69 elo at 4 CPU. Definitely not as good as other options, but very easy to implement and so I'm satisfied.

jm

lazy smp questions

lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions

Re: lazy smp questions