Lazy SMP in Cheng

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Lazy SMP in Cheng

Post by mar »

Because some may find it useful, I will attempt to describe how I do lazy smp in cheng:

(actually I just found a bug where minimum qs depth is not copied to helpers before the start of new iteration so right now min qs depth is for helpers now actually depends on maximum depth reached in previous search, I'm not sure if/what impact this may have)

Code: Select all

IterativeDeepening:
    synchronize smp threads (copy age, board, history, repetition list, multipv => helpers)
    depth 1 with full width window on 1 thread
    loop (depth=2 .. max)
        AspirationLoop:
            (as usual)
            start helper threads( depth, alpha, beta )
            root( depth, alpha, beta)
            stop helper threads
            (rest as usual)
        end aspiration loop
    end id loop

Code: Select all

starting helper threads:
    clear smp abort flag
    for each helper thread:
        copy rootmoves and minimum qs depth => helper
        signal helper to start root search at current depth (add 1 for each even helper (assuming 0-based indexing) with aspiration alpha, beta bounds and wait until helper starts searching
note: helper threads run in infinite mode => no need to check for timeout

Code: Select all

aborting helper threads:
    set abort flag for each helper and wait for each to stop searching
The search (including root handling) is the same for master and helpers (no code duplication just a couple of extra trivial conditions).
helpers hold ref pointer to master (if it's 0 it's master) - actually a pointer to abort smp flag would do.
At the end of root search helpers simply set a special flag (abort smp flag) in master.
When master detects abort (smp) flag coming from the helpers (different from typical abort flag),
it scans through helpers and copies score/bestmove/PV from the first helper that finished root.

Pondering doesn't require any special handling (handled by master search thread).

I hope I didn't forget something important, perhaps it even makes sense :)
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Lazy SMP in Cheng

Post by zullil »

Maybe someone can implement this in Stockfish, since the current implementation seems to scale badly from 8 to 16 threads. Just a thought, for any talented folks who have a lot of free time ...
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: Lazy SMP in Cheng

Post by elcabesa »

I haven't never studied your smp code.
Reading your description I can't understood if the root code simply wait for the other to end is search and save the result of the first one who end the search.
Am' i wrong?
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Lazy SMP in Cheng

Post by mar »

It doesn't wait. Root position is searched by all helper threads at once (including master), if any of the helpers finishes root before master, it notifies the master to abort search.
After master completes root, if it was signalled it grabs the result from the helper that finished (otherwise it keeps its own).
Then all helpers are aborted and you continue standard aspiration (and ID) loop.
elcabesa
Posts: 855
Joined: Sun May 23, 2010 1:32 pm

Re: Lazy SMP in Cheng

Post by elcabesa »

ok, now I understand
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Lazy SMP in Cheng

Post by mar »

I'm not sure it would scale better, but certainly it would be interesting to compare.
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: Lazy SMP in Cheng

Post by Joerg Oster »

mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
Jörg Oster
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Lazy SMP in Cheng

Post by petero2 »

Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:

Code: Select all

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP in Cheng

Post by bob »

Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
Nothing is free. Overhead climbs quickly because you are searching many more nodes at a particular depth than a sequential search would examine. YBW is the correct way to do parallel search and minimize overhead. I've yet to see any non-YBW-based algorithm produce even a 4x speedup at 8 cores (over 1 core to same depth). There's a reason many of us have invested all of the time we have spent in doing a traditional splitting algorithm - performance.

I've not had any problems at all playing bullet games with 12 cores. And the key point is that whatever the 12 cores do is generally important work. It is easy to keep all 12 cores busy, but if they are doing redundant work, what's the point?
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: Lazy SMP in Cheng

Post by Joerg Oster »

petero2 wrote:
Joerg Oster wrote:
mar wrote:I'm not sure it would scale better, but certainly it would be interesting to compare.
At least lazy smp wouldn't suffer from 2 limitations as the current implementation.
1. takes some time to fully kick in because of the min split depth parameter
2. not enough workload especially for threads >= 8 due to heavy pruning and reductions
It seems 1 is not a big problem for YBWC either. I ran some tests with stockfish 6 at time control 1s+0.08s/move, 4 cores vs 1 core:

Code: Select all

Rank Name     Elo    +    - games score oppo. draws 
   1 sf6_4c    70    7    7  2282   73%   -70   46% 
   2 sf6      -70    7    7  2282   27%    70   46% 
Thank you, Peter.
Yes, 4-core performance is very good, but how does it look like with 8 and 16 cores?
Jörg Oster