Lazy SMP in Cheng

mar · Post by **mar** » Mon Feb 02, 2015 2:14 pm

Because some may find it useful, I will attempt to describe how I do lazy smp in cheng:

(Actually, I just found a bug: the minimum qs depth is not copied to helpers before the start of a new iteration, so right now the min qs depth for helpers depends on the maximum depth reached in the previous search. I'm not sure if/what impact this may have.)

Code: Select all

IterativeDeepening:
    synchronize smp threads (copy age, board, history, repetition list, multipv => helpers)
    depth 1 with full width window on 1 thread
    loop (depth=2 .. max)
        AspirationLoop:
            (as usual)
            start helper threads(depth, alpha, beta)
            root(depth, alpha, beta)
            stop helper threads
            (rest as usual)
        end aspiration loop
    end id loop
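
A minimal Python sketch of the loop above, not Cheng's actual code. The "(as usual)" parts are filled in with assumptions: the window size, the widening rule and every name here (`iterative_deepening`, `root_search`, `toy_root_search`, `INF`) are hypothetical. Starting and stopping the helpers is only logged, so the ordering is easy to see.

```python
from typing import Callable, List, Tuple

INF = 32000  # assumed "infinite" score bound


def iterative_deepening(
    root_search: Callable[[int, int, int], int],
    max_depth: int,
    window: int = 25,  # assumed initial aspiration half-width
) -> Tuple[int, List[str]]:
    """Run the ID/aspiration loop; return the final score and a log of helper events."""
    log: List[str] = ["sync helpers"]    # copy age, board, history, ... => helpers
    score = root_search(1, -INF, INF)    # depth 1, full window, master only
    for depth in range(2, max_depth + 1):
        delta = window
        alpha, beta = max(score - delta, -INF), min(score + delta, INF)
        while True:                      # aspiration loop
            log.append(f"start helpers d={depth}")
            value = root_search(depth, alpha, beta)
            log.append(f"stop helpers d={depth}")
            if value <= alpha:           # fail low: widen downwards and retry
                delta *= 2
                alpha = max(value - delta, -INF)
            elif value >= beta:          # fail high: widen upwards and retry
                delta *= 2
                beta = min(value + delta, INF)
            else:                        # score inside the window: iteration done
                score = value
                break
    return score, log


def toy_root_search(depth: int, alpha: int, beta: int) -> int:
    """Stand-in for a real root search: a fixed score per depth, clamped to the window."""
    true_score = 10 if depth < 3 else 100
    return max(alpha, min(beta, true_score))


score, log = iterative_deepening(toy_root_search, 3)
```

Helpers are restarted on every pass through the aspiration loop, so a fail high or fail low re-launches them with the widened bounds.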

Code: Select all

starting helper threads:
    clear smp abort flag
    for each helper thread:
        copy rootmoves and minimum qs depth => helper
        signal helper to start root search at current depth (add 1 for each even helper, assuming 0-based indexing) with aspiration alpha, beta bounds, and wait until helper starts searching

note: helper threads run in infinite mode => no need to check for timeout
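
The depth rule for helpers can be written as a one-liner. A small sketch follows; the function name `helper_start_depth` is made up for illustration.

```python
def helper_start_depth(current_depth: int, helper_index: int) -> int:
    """Depth a helper is told to search: +1 for even 0-based helper indices."""
    return current_depth + 1 if helper_index % 2 == 0 else current_depth


# Four helpers while the master is at depth 10
depths = [helper_start_depth(10, i) for i in range(4)]
```

Helpers 0 and 2 search one ply deeper than the master, helpers 1 and 3 search the same depth.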

Code: Select all

aborting helper threads:
    set abort flag for each helper and wait for each to stop searching

The search (including root handling) is the same for master and helpers (no code duplication, just a couple of extra trivial conditions).
Helpers hold a pointer to the master (if it's 0, the thread is the master); actually, a pointer to the abort smp flag would do.
At the end of root search, helpers simply set a special flag (the abort smp flag) in the master.
When the master detects the abort smp flag coming from the helpers (different from the typical abort flag),
it scans through the helpers and copies score/bestmove/PV from the first helper that finished root.
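
A rough, single-threaded Python sketch of that bookkeeping, only to illustrate the flag and the copy step. The class and method names (`Searcher`, `on_root_finished`, `harvest`) are hypothetical, and `None` stands in for the 0 pointer.

```python
from typing import List, Optional


class Searcher:
    """One search thread's state; master is None for the master itself."""

    def __init__(self, master: Optional["Searcher"] = None) -> None:
        self.master = master
        self.helpers: List["Searcher"] = []
        self.abort_smp = False       # set in the master by a helper that finished root
        self.finished_root = False
        self.score = 0
        self.best_move = ""
        self.pv: List[str] = []

    def on_root_finished(self, score: int, best_move: str, pv: List[str]) -> None:
        """Called at the end of root search, by master and helpers alike."""
        self.score, self.best_move, self.pv = score, best_move, list(pv)
        self.finished_root = True
        if self.master is not None:  # the one extra condition for helpers
            self.master.abort_smp = True

    def harvest(self) -> bool:
        """Master only: copy the result of the first helper that finished root."""
        if not self.abort_smp:
            return False
        for helper in self.helpers:
            if helper.finished_root:
                self.score, self.best_move = helper.score, helper.best_move
                self.pv = list(helper.pv)
                return True
        return False


master = Searcher()
master.helpers = [Searcher(master) for _ in range(3)]
master.helpers[1].on_root_finished(35, "e2e4", ["e2e4", "e7e5"])
harvested = master.harvest()
```

In a real engine the flag would need to be an atomic or otherwise thread-safe variable; that part is left out here.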

Pondering doesn't require any special handling (handled by master search thread).

I hope I didn't forget something important, perhaps it even makes sense

zullil · Post by **zullil** » Mon Feb 02, 2015 3:39 pm

Maybe someone can implement this in Stockfish, since the current implementation seems to scale badly from 8 to 16 threads. Just a thought, for any talented folks who have a lot of free time ...

elcabesa · Post by **elcabesa** » Mon Feb 02, 2015 8:53 pm

I have never studied your smp code.
Reading your description, I can't tell whether the root code simply waits for the others to end their search and saves the result of the first one that finishes.
Am I wrong?

mar · Post by **mar** » Mon Feb 02, 2015 9:02 pm

It doesn't wait. The root position is searched by all threads at once (helpers and master). If any of the helpers finishes root before the master, it notifies the master to abort its search.
After master completes root, if it was signalled it grabs the result from the helper that finished (otherwise it keeps its own).
Then all helpers are aborted and you continue standard aspiration (and ID) loop.
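
To make the control flow concrete, here is a toy Python version of one root iteration with real threads. It is only a sketch under assumptions: the "search" just sleeps through a step budget, and the names `search_root` and `run_iteration` are invented for the example.

```python
import threading
import time
from typing import List, Optional


def search_root(budget: int, stop: threading.Event) -> bool:
    """Toy root search: burn `budget` steps; True if completed, False if aborted."""
    for _ in range(budget):
        if stop.is_set():
            return False
        time.sleep(0.001)
    return True


def run_iteration(master_budget: int, helper_budgets: List[int]) -> str:
    """Return the name of the thread whose root result is kept."""
    abort_smp = threading.Event()      # set by a helper that finished root
    abort_helpers = threading.Event()  # set by the master to stop all helpers
    results: List[Optional[str]] = [None] * len(helper_budgets)

    def helper(index: int, budget: int) -> None:
        if search_root(budget, abort_helpers):
            results[index] = f"helper{index}"  # publish the result first
            abort_smp.set()                    # then notify the master

    threads = [
        threading.Thread(target=helper, args=(i, b))
        for i, b in enumerate(helper_budgets)
    ]
    for t in threads:
        t.start()
    search_root(master_budget, abort_smp)  # master searches root too, no waiting
    abort_helpers.set()                    # then all helpers are aborted
    for t in threads:
        t.join()
    if abort_smp.is_set():                 # signalled: grab the helper's result
        return next(r for r in results if r is not None)
    return "master"                        # otherwise keep our own


winner_fast_helper = run_iteration(10**6, [5, 10**6])
winner_fast_master = run_iteration(3, [10**6, 10**6])
```

The first call has helper 0 finishing long before the master, so its result is taken; in the second call the master finishes first and keeps its own.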

elcabesa · Post by **elcabesa** » Mon Feb 02, 2015 9:11 pm

ok, now I understand

mar · Post by **mar** » Mon Feb 02, 2015 9:34 pm

I'm not sure it would scale better, but certainly it would be interesting to compare.

Joerg Oster · Post by **Joerg Oster** » Mon Feb 02, 2015 10:42 pm

mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.

At least lazy smp wouldn't suffer from two limitations of the current implementation:
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions

petero2 · Post by **petero2** » Mon Feb 02, 2015 11:56 pm

Joerg Oster wrote:
> mar wrote:
> > I'm not sure it would scale better, but certainly it would be interesting to compare.
>
> At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
> 1. takes some time to fully kick in because of the min split depth parameter
> 2. not enough workload especially for threads >= 8 due to heavy pruning and reductions

It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:

Code: Select all

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46%

bob · Post by **bob** » Tue Feb 03, 2015 12:00 am

Joerg Oster wrote:
> mar wrote:
> > I'm not sure it would scale better, but certainly it would be interesting to compare.
>
> At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
> 1. takes some time to fully kick in because of the min split depth parameter
> 2. not enough workload especially for threads >= 8 due to heavy pruning and reductions

Nothing is free. Overhead climbs quickly because you are searching many more nodes at a particular depth than a sequential search would examine. YBW is the correct way to do parallel search and minimize overhead. I've yet to see any non-YBW-based algorithm produce even a 4x speedup at 8 cores (over 1 core to same depth). There's a reason many of us have invested all of the time we have spent in doing a traditional splitting algorithm - performance.
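
For readers unfamiliar with the metric, "speedup to the same depth" is the ratio of single-thread time to parallel time for reaching a fixed depth. A tiny Python illustration follows; the timings are made up purely for the arithmetic and are not measurements of any engine, and the function names are hypothetical.

```python
def time_to_depth_speedup(t_single: float, t_parallel: float) -> float:
    """How many times faster the parallel search reaches the same depth."""
    return t_single / t_parallel


def parallel_efficiency(speedup: float, threads: int) -> float:
    """Fraction of the ideal linear speedup that is actually achieved."""
    return speedup / threads


# Hypothetical numbers: 80 s on 1 core, 20 s on 8 cores, same depth
speedup = time_to_depth_speedup(80.0, 20.0)
efficiency = parallel_efficiency(speedup, 8)
```

With these invented numbers the speedup is 4x and the efficiency is 50%, i.e. half of the 8 cores' work is effectively lost to overhead.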

I've not had any problems at all playing bullet games with 12 cores. And the key point is that whatever the 12 cores do is generally important work. It is easy to keep all 12 cores busy, but if they are doing redundant work, what's the point?

Joerg Oster · Post by **Joerg Oster** » Tue Feb 03, 2015 12:15 am

petero2 wrote:
> Joerg Oster wrote:
> > mar wrote:
> > > I'm not sure it would scale better, but certainly it would be interesting to compare.
> >
> > At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
> > 1. takes some time to fully kick in because of the min split depth parameter
> > 2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
>
> It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:
>
> Code: Select all
>
> Rank Name     Elo    +    - games score oppo. draws
>    1 sf6_4c    70    7    7  2282   73%   -70   46%
>    2 sf6      -70    7    7  2282   27%    70   46%

Thank you, Peter.
Yes, 4-core performance is very good, but how does it look with 8 and 16 cores?
