Explanation for non-expert?

Discussion of chess software programming and technical issues.

zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: Explanation for non-expert?

Post by zamar »

Rein Halbersma wrote:
I'm nowhere near an expert on parallel alpha-beta, but it seems to me that people in this thread are reinventing optimal scheduling of jobs related to potential split points. AFAICS, this is mostly solved in the literature by randomized work-stealing as pioneered by Cilk-chess and currently available as Cilk-Plus from Intel (there is a g++ branch).
I'm nowhere near an expert in this field either, but my understanding is that the standard fork-join model assumes a fixed amount of work to be done, and the question is how to divide that work optimally between threads.

With alpha-beta, by contrast, it often happens that when two threads are working under the same split point, one thread produces a cutoff which completely invalidates the work done by the other thread. We want to minimize the probability of this, and I don't think that is in any way related to optimal scheduling or Cilk and friends.
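
To make the contrast concrete, here is a minimal C++ fork-join sketch (an illustration only, nothing engine-specific): every spawned task has to be completed before the join, so the scheduler's only job is to balance a fixed amount of work. In a parallel alpha-beta search the analogous tasks can become worthless the moment a sibling move fails high.

Code: Select all

#include <future>
#include <numeric>
#include <vector>

// Plain fork-join: the total amount of work is fixed up front.
// Both halves MUST be computed; the only open question is load balancing.
long long parallelSum(const std::vector<int>& v) {
    auto mid = v.begin() + v.size() / 2;
    auto lower = std::async(std::launch::async, [&] {
        return std::accumulate(v.begin(), mid, 0LL);
    });
    long long upper = std::accumulate(mid, v.end(), 0LL);
    return lower.get() + upper;   // join: nothing computed here is ever wasted
}

// In parallel alpha-beta the situation differs: if one thread's move at a
// split point returns a score >= beta, everything the other threads are still
// searching under that split point contributes nothing to the final result.
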
Joona Kiiski
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Explanation for non-expert?

Post by Laskos »

Stockfish_15021621_x64_modern, the version with the SMP patch, widens from 1 to 16 threads. Games at fixed depth 9:

Code: Select all

H0:  0
H1:  8
alpha, beta 0.05
Score of SF 16 threads vs SF 1 thread: 1015 - 900 - 1548  [0.52] 3463
ELO difference: 12
SPRT: H1 was accepted
Finished match


Stockfish_15021602_x64_modern, the version immediately preceding the SMP patch, doesn't widen:

Code: Select all

H0:  0
H1:  8
alpha, beta 0.05
Score of SF 16 threads vs SF 1 thread: 809 - 826 - 1127  [0.50] 2762
ELO difference: -2
SPRT: H0 was accepted
Finished match
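
For non-experts reading the output above: H0 and H1 are the two Elo hypotheses (0 and 8 Elo here) and alpha/beta the allowed error rates. A rough sketch of how such a test is decided, using the usual normal approximation of the log-likelihood ratio (a simplification; the exact model used by the testing tool may differ):

Code: Select all

#include <cmath>
#include <cstdio>

// Simplified SPRT log-likelihood ratio from a wins/draws/losses record,
// testing H0: elo = elo0 against H1: elo = elo1 (normal approximation).
double sprtLLR(int wins, int draws, int losses, double elo0, double elo1) {
    double n    = wins + draws + losses;
    double mean = (wins + 0.5 * draws) / n;                     // observed score per game
    double var  = (wins + 0.25 * draws) / n - mean * mean;      // observed variance per game
    double s0   = 1.0 / (1.0 + std::pow(10.0, -elo0 / 400.0));  // expected score under H0
    double s1   = 1.0 / (1.0 + std::pow(10.0, -elo1 / 400.0));  // expected score under H1
    return (s1 - s0) * (2.0 * mean - s0 - s1) * n / (2.0 * var);
}

int main() {
    double alpha = 0.05, beta = 0.05;
    double upper = std::log((1.0 - beta) / alpha);   // accept H1 when LLR >= upper
    double lower = std::log(beta / (1.0 - alpha));   // accept H0 when LLR <= lower
    // Numbers from the first run above: 1015 wins, 1548 draws, 900 losses.
    double llr = sprtLLR(1015, 1548, 900, 0.0, 8.0);
    std::printf("LLR = %.2f, bounds (%.2f, %.2f)\n", llr, lower, upper);
}

The match stops as soon as the LLR crosses either bound. With the numbers from the first run this approximation comes out above the upper bound, in line with the "H1 was accepted" line; the tool that produced the output may use a more refined model, so the exact LLR values will differ.
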
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Explanation for non-expert?

Post by petero2 »

Rein Halbersma wrote:
zamar wrote:
petero2 wrote: For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
I've thought along similar lines, so that makes perfect sense to me.
About your algorithm: Why are you not just using YBW?
I'm nowhere near an expert on parallel alpha-beta, but it seems to me that people in this thread are reinventing optimal scheduling of jobs related to potential split points. AFAICS, this is mostly solved in the literature by randomized work-stealing as pioneered by Cilk-chess and currently available as Cilk-Plus from Intel (there is a g++ branch).

The way they implement work-stealing is by using a cactus-stack, essentially one deque of split points for each hardware thread. During execution, each thread keeps popping the bottom of a randomly chosen deque, and also pushes each potential split point to its own deque. There are tons of papers and PhD theses from the Cilk project in which they prove that this kind of scheduling is optimal for fork-join parallelism. Because the randomization automatically load-balances, there is no need for heuristics on the depth in the tree where a split point is being added.
I did study work-stealing algorithms and fork-join parallelism before designing my algorithm. However, as far as I know those algorithms make the assumption that all work units that are stored in the work queues have to be computed by some thread at some point in time. They don't handle the case where work units have an associated probability of being unnecessary.

In the alpha beta case, there is no guarantee that stealing a work unit from the bottom of a deque is best, because that might correspond to a PV node where the first move has not yet been searched, while a work unit higher up in the deque may correspond to a strongly expected ALL node.

That said, a drawback of the priority queue I use is that it is almost guaranteed to become a bottleneck if the number of threads is large enough. I have optimized the algorithm so that this is seldom a problem on my 16 core computer, but I have not been able to test on larger machines.

At the time I designed the algorithm I did not find any scalable priority queue algorithms I could use, but I did a new search today and found this SprayList paper. Maybe that can be used to improve scalability.
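
As an illustration of the efficiency criterion quoted at the top of this post (a sketch with invented names, not taken from the actual engine code): the probability that a candidate move really needs to be searched is the product of the "needed" probabilities along its chain of parent split points, and candidates are kept in a priority queue ordered by that number.

Code: Select all

#include <memory>
#include <queue>
#include <vector>

// Hypothetical structures, for illustration only.
struct SplitPoint {
    std::shared_ptr<SplitPoint> parent; // enclosing split point, or null at the root
    double pNeeded;                     // estimated probability that this split point's
                                        // remaining moves have to be searched at all
};

struct Candidate {
    std::shared_ptr<SplitPoint> sp;     // split point the move belongs to
    double overhead;                    // estimated relative parallelism overhead
    double efficiency;                  // computed below
};

// efficiency = p1 * p2 * ... * pN * (1 - overhead), the p_i being the "needed"
// probabilities of the current and all parent split points.
double computeEfficiency(const Candidate& c) {
    double p = 1.0;
    for (auto sp = c.sp; sp; sp = sp->parent)
        p *= sp->pNeeded;
    return p * (1.0 - c.overhead);
}

struct ByEfficiency {
    bool operator()(const Candidate& a, const Candidate& b) const {
        return a.efficiency < b.efficiency;   // max-heap: most efficient candidate on top
    }
};

// Idle threads pop the candidate with the highest expected efficiency instead of
// stealing blindly. A single shared queue like this is exactly the serialization
// point mentioned above; a relaxed structure such as a SprayList trades strict
// ordering for scalability.
using WorkQueue = std::priority_queue<Candidate, std::vector<Candidate>, ByEfficiency>;
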
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: Explanation for non-expert?

Post by Rein Halbersma »

petero2 wrote: I did study work-stealing algorithms and fork-join parallelism before designing my algorithm. However, as far as I know those algorithms make the assumption that all work units that are stored in the work queues have to be computed by some thread at some point in time. They don't handle the case where work units have an associated probability of being unnecessary.

In the alpha beta case, there is no guarantee that stealing a work unit from the bottom of a deque is best, because that might correspond to a PV node where the first move has not yet been searched, while a work unit higher up in the deque may correspond to a strongly expected ALL node.
I don't know if that is true. IIRC, the old Cilk project had an "abort" keyword that would cancel work from the various queues, so I don't think that would invalidate the optimality of the work-stealing scheduler.
That said, a drawback of the priority queue I use is that it is almost guaranteed to become a bottleneck if the number of threads is large enough. I have optimized the algorithm so that this is seldom a problem on my 16 core computer, but I have not been able to test on larger machines.
I think the Cilk scheduler has a bounded O(P) space overhead for P processors. There should be no blow-up. The problem is that the old Cilk-5 language is only compatible with C and old gcc compilers (I haven't tested which versions of the libc can link to it).

The new Cilk-plus works with C++ but does not (yet) have an "abort" keyword. I don't fancy writing a work-stealing scheduler myself, but whenever Cilk-plus acquires the "abort" keyword, you will be able to parallelize any chess program with just a few "sync" and "spawn" keywords, just as described in http://supertech.csail.mit.edu/papers/icca99.pdf
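
For concreteness, the "few spawn/sync keywords" version would have roughly the following shape: a toy sketch over an explicit game tree, not engine code, requiring a Cilk-enabled compiler; and, as noted, there is no abort, so spawned siblings run to completion even after a fail-high.

Code: Select all

#include <cilk/cilk.h>   // cilk_spawn / cilk_sync
#include <algorithm>
#include <cstddef>
#include <vector>

// Toy game-tree node: a leaf value plus children.
struct Node {
    int value = 0;                // static evaluation used at leaves
    std::vector<Node> children;
};

// Young-Brothers-Wait shape with Cilk keywords: search the first child serially,
// then spawn the rest and let the randomized work-stealing scheduler balance them.
int search(const Node& node, int alpha, int beta) {
    if (node.children.empty())
        return node.value;

    int best = -search(node.children[0], -beta, -alpha);
    if (best >= beta)
        return best;                              // serial cutoff, nothing was spawned
    alpha = std::max(alpha, best);

    std::vector<int> scores(node.children.size(), 0);
    for (std::size_t i = 1; i < node.children.size(); ++i)
        scores[i] = cilk_spawn search(node.children[i], -beta, -alpha);
    cilk_sync;   // join; without "abort", every spawned subtree finishes even if one fails high

    for (std::size_t i = 1; i < node.children.size(); ++i)
        best = std::max(best, -scores[i]);
    return best;
}
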
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Explanation for non-expert?

Post by zullil »

Laskos wrote: Stockfish_15021621_x64_modern, the version with the SMP patch, widens from 1 to 16 threads. Games at fixed depth 9:

Code: Select all

H0:  0
H1:  8
alpha, beta 0.05
Score of SF 16 threads vs SF 1 thread: 1015 - 900 - 1548  [0.52] 3463
ELO difference: 12
SPRT: H1 was accepted
Finished match


Stockfish_15021602_x64_modern, the version immediately preceding the SMP patch, doesn't widen:

Code: Select all

H0:  0
H1:  8
alpha, beta 0.05
Score of SF 16 threads vs SF 1 thread: 809 - 826 - 1127  [0.50] 2762
ELO difference: -2
SPRT: H0 was accepted
Finished match
I don't see that this means anything (for standard time controls). Depth 9? With 16 threads, the default min split depth is 7. Not sure what you are testing, or what (if anything) your results mean. But thanks for the data in any case.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Explanation for non-expert?

Post by Laskos »

zullil wrote:
Laskos wrote: Stockfish_15021621_x64_modern, the version with the SMP patch, widens from 1 to 16 threads. Games at fixed depth 9:

Code: Select all

H0:  0
H1:  8
alpha, beta 0.05
Score of SF 16 threads vs SF 1 thread: 1015 - 900 - 1548  [0.52] 3463
ELO difference: 12
SPRT: H1 was accepted
Finished match


Stockfish_15021602_x64_modern, the version immediately preceding the SMP patch, doesn't widen:

Code: Select all

H0:  0
H1:  8
alpha, beta 0.05
Score of SF 16 threads vs SF 1 thread: 809 - 826 - 1127  [0.50] 2762
ELO difference: -2
SPRT: H0 was accepted
Finished match
I don't see that this means anything (for standard time controls). Depth 9? With 16 threads, the default min split depth is 7. Not sure what you are testing, or what (if anything) your results mean. But thanks for the data in any case.
It means that on 16 threads the tree is a bit wider than on 1 thread searched to the same depth. It also means that time-to-depth is an incorrect measure of SMP efficiency for SF.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

Laskos wrote:
zullil wrote:
Laskos wrote: Stockfish_15021621_x64_modern, the version with the SMP patch, widens from 1 to 16 threads. Games at fixed depth 9:

Code: Select all

H0:  0
H1:  8
alpha, beta 0.05
Score of SF 16 threads vs SF 1 thread: 1015 - 900 - 1548  [0.52] 3463
ELO difference: 12
SPRT: H1 was accepted
Finished match


Stockfish_15021602_x64_modern, the version immediately preceding the SMP patch, doesn't widen:

Code: Select all

H0:  0
H1:  8
alpha, beta 0.05
Score of SF 16 threads vs SF 1 thread: 809 - 826 - 1127  [0.50] 2762
ELO difference: -2
SPRT: H0 was accepted
Finished match
I don't see that this means anything (for standard time controls). Depth 9? With 16 threads, the default min split depth is 7. Not sure what you are testing, or what (if anything) your results mean. But thanks for the data in any case.
It means that on 16 threads the tree is a bit wider than on 1 thread searched to the same depth. It also means that time-to-depth is an incorrect measure of SMP efficiency for SF.
For the record, EVERY parallel search is wider than the serial search. It is absolutely impossible to avoid this, since we don't have perfect move ordering and must always tolerate search overhead caused by splitting at CUT nodes that cannot be recognized as CUT nodes until after a search has started.

Time-to-depth is the correct measure 99% of the time at least...
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Explanation for non-expert?

Post by lucasart »

bob wrote:
lucasart wrote:
bob wrote: I didn't understand it because spLevel was not in the Stockfish6 code I looked at. That change is not going to be worth 50 Elo or so however. But I will let YOU figure out why. And yes, if you look at my old DTS paper this kind of thing was factored in. There's very little new stuff in normal PVS/YBW type parallel searching.
If you've understood all this since the 80's, how come Crafty scales so badly from 8 to 16 cores?
It doesn't on my machine. It depends on the box. The PC is not exactly the perfect platform. 16 cores typically means two chips, although it could be a single one now. 16 cores on a single chip, with a single path to memory, has a problem.

Additionally, I chose to write Crafty in a simpler way than the DTS algorithm used in Cray Blitz. I may well go back and rewrite that one day, because it is significantly better. It just hasn't been a high priority. But the issues for SMP scaling are not new. SMP search has been around for almost 40 years now... Alpha/Beta hasn't changed.

On the 12-core box I use occasionally (dual 6 core chips) the scaling is almost perfect. 5.5M NPS on one core, 60M NPS on 12 cores. You do have to know what you are doing when testing, and make certain turbo is disabled, or the answers are useless. Most miss that.
Nobody cares about NPS scaling. It's the wrong measure. Searching more nodes is not a goal in itself. Winning more games is. It's ELO we care about. Here's the evidence that shows that there's a big problem with Crafty's scaling from 8 to 16 cores:
http://www.talkchess.com/forum/viewtopi ... ght=crafty
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Explanation for non-expert?

Post by syzygy »

lucasart wrote: Nobody cares about NPS scaling. It's the wrong measure. Searching more nodes is not a goal in itself. Winning more games is. It's ELO we care about.
Lots of people certainly do care about NPS scaling and for good reason. If 16 cores only give an 8-fold speedup of nps, say, then that means there is a lot of room for improvement.

Of course if improvements are made to the efficiency of the parallel search, then that's great as well. But that does not mean at all that one should not care about threads being idle half the time.
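
The two views can be reconciled with back-of-the-envelope arithmetic (illustrative numbers, not measurements): time-to-depth speedup is roughly the NPS speedup divided by the search overhead, i.e. by how many extra nodes the parallel tree needs to reach the same depth.

Code: Select all

#include <cstdio>

int main() {
    // Rough decomposition: effective speedup ~= NPS speedup / search overhead,
    // where search overhead = parallel nodes / serial nodes to the same depth.
    double npsSpeedup     = 8.0;  // e.g. 16 cores delivering only 8x the nodes per second
    double searchOverhead = 1.3;  // parallel tree searches ~30% more nodes to the same depth
    std::printf("effective (time-to-depth) speedup ~= %.1f\n", npsSpeedup / searchOverhead);
    // Poor NPS scaling caps the result no matter how cleverly the work is split,
    // which is why it is worth caring about even though Elo is the final arbiter.
}
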
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

lucasart wrote:
bob wrote:
lucasart wrote:
bob wrote: I didn't understand it because spLevel was not in the Stockfish6 code I looked at. That change is not going to be worth 50 Elo or so however. But I will let YOU figure out why. And yes, if you look at my old DTS paper this kind of thing was factored in. There's very little new stuff in normal PVS/YBW type parallel searching.
If you've understood all this since the 80's, how come Crafty scales so badly from 8 to 16 cores?
It doesn't on my machine. It depends on the box. The PC is not exactly the perfect platform. 16 cores typically means two chips, although it could be a single one now. 16 cores on a single chip, with a single path to memory, has a problem.

Additionally, I chose to write Crafty in a simpler way than the DTS algorithm used in Cray Blitz. I may well go back and rewrite that one day, because it is significantly better. It just hasn't been a high priority. But the issues for SMP scaling are not new. SMP search has been around for almost 40 years now... Alpha/Beta hasn't changed.

On the 12-core box I use occasionally (dual 6 core chips) the scaling is almost perfect. 5.5M NPS on one core, 60M NPS on 12 cores. You do have to know what you are doing when testing, and make certain turbo is disabled, or the answers are useless. Most miss that.
Nobody cares about NPS scaling. It's the wrong measure. Searching more nodes is not a goal in itself. Winning more games is. It's ELO we care about. Here's the evidence that shows that there's a big problem with Crafty's scaling from 8 to 16 cores:
http://www.talkchess.com/forum/viewtopi ... ght=crafty
Stick to topics you understand.

If your NPS scaling is bad, EVERYTHING parallel is bad. If your NPS scaling is good, then you have a chance to do decently on the usual SMP speedup measurements. But if your NPS is 1/2 of what it should be, your speedup is guaranteed to be 1/2 of what it could be as well.

Here's one 16-core run I made about a month ago that shows that for this specific 2x8-core box, scaling was within reason and can be improved a bit:

Code: Select all

        1-core  =  nps= 6.1M     ---
        4-cores =  nps=23.7M    4.0x
        8-cores =  nps=45.0M    7.4x
       16-cores =  nps=86.2M    14.1x
Those are not that bad. Different machines behave differently and require some tuning. If that represents a "big problem" I will keep it. Every time I move to something new it takes some tweaking... ho hum.