Lazy SMP Better than YBWC?

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP Better than YBWC?

Post by bob »

syzygy wrote:
bob wrote:
syzygy wrote:
bob wrote:That is simply wrong. You are accepting a lousy serial search implementation and using extra processors to offset the existing weakness.
Serial???
syzygy wrote:If the lazy approach outperforms YBW, then lazy is better than YBW. It does not mean that the lazy idea is of any use for a serial search. We're simply not talking about 1-core machines here.
You don't want to follow the discussion, fine by me...
I'm not willing to follow your fallacies, indeed.
More like "not able to follow the discussion."

There was NO fallacy in what _I_ wrote.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP (slight shift of topic)

Post by bob »

A parallel search improves over a serial search in two ways. One major, and one minor, normally.

1. Speed. If the parallel search is faster to the same depth, that is the same thing as making the single CPU that much faster. This is traditional parallel search.

2. Widening. This is almost 100% unintentional, but it does have a possible benefit. If the serial search is overly aggressive on pruning or reductions, searching additional moves at each ply reduces the likelihood that you reduce or prune a critical move.

An example:

Suppose you measure time to depth and you discover that with 16 cores you run 8x faster, which means you have effectively doubled the single-CPU clock speed three times. Now suppose you play a bunch of games and discover that the Elo gain maps to an effective speedup a bit higher than 8x. Where did that come from? Obviously the search overhead (extra nodes) helped a bit, luckily.

Now let's plug in numbers. You do the above test and you get an effective speedup of 10x. That is, the Elo of the parallel search is equal to a time-odds of 10x in favor of the parallel search. If you assume this is linear improvement, to make the math easy, 80% of the improvement is due to searching faster (parallel search) and 20% of the improvement is due to searching wider.
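Spelling that arithmetic out (same made-up numbers, just to make the 80/20 split explicit; this is illustrative, not measured data):

Code: Select all

#include <cstdio>

int main() {
    double ttd_speedup = 8.0;        // measured time-to-depth speedup, 16 cores
    double effective_speedup = 10.0; // time-odds equivalent from game results
    // Linear assumption from the text: credit depth for 8 of the 10 "x",
    // and widening for the remaining 2.
    double from_depth = ttd_speedup / effective_speedup;
    printf("deeper: %.0f%%  wider: %.0f%%\n",
           100.0 * from_depth, 100.0 * (1.0 - from_depth));
    return 0;
}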

My claim is that the 20% from widening can be transferred back to the serial search, although only a fraction of it (assuming 16 cores, roughly 1/16th of that gain) can be moved back if the excessive pruning or reducing is dialed back.

The ONLY way this widening argument holds water is if (a) you have maxed out your parallel search speedup, so that (b) the search overhead actually improves things, and you can't reduce this overhead to gain more search speed (and depth).

A simple explanation is that given 16 cores, the BEST possible result will be to search 16x faster than the serial program. No exceptions. If the serial program has pruning or reduction issues, they can be fixed and that will help both the serial and the parallel version. The fact that widening adds to the elo only helps because the search is searching things it really wants to avoid (search overhead). Reduce this overhead, increase the depth.

At Kai's suggestion, I ran a ton of games with Crafty vs the gauntlet at fixed depth, then Crafty vs the same gauntlet, same depth, but using all cores in a parallel search. This way the search overhead does not penalize the search times, since both reach exactly the same depths. And for Crafty, the Elo gain from this "widening" was zero. Exactly zero. The last test was 30K 5m+5s games. So it is safe to say that "widening" has no impact on my search, and any Elo gain I see comes purely from search speedup... which is expected.

My take is "widening is interesting but NOT intentional."
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Lazy SMP Better than YBWC?

Post by Laskos »

bob wrote:
Laskos wrote:
bob wrote:
Joerg Oster wrote:
bob wrote:
Steve Maughan wrote:FYI I see the Stockfish patch of Oct 20th says that Lazy SMP is actually better than YBWC for large numbers of cores.

This would be amazing if it's true (and I see no reason to doubt it).

Anyone else found it to be true?

- Steve
I believe what they said is "it is better than THEIR YBW implementation." Which is not exactly what you wrote. Its NPS scales particularly well since there is little synchronization. As to actual speedup, that's a different matter.
True, but obviously Stockfish gains a lot from going wider.
OTOH, the nominal depth shown doesn't necessarily tell you the whole truth, since some of the helper threads
are already searching deeper.
So it's possible that the main thread gets useful information from an already deeper search, compensating for the supposed slower speedup.

Comparing SF's YBW time-to-depth with Lazy ttd is most likely like comparing apples with oranges. Not to mention misleading.
This "widening" comes up over and over, but it is a broken argument. If widening helps, widening will help the serial search just as much. Yes it can help some, no doubt. But the poorer time-to-depth has to hurt more, making this a net loss compared to a normal YBW (or even better, DTS-like) parallel search.

I won't go so far as to say it can't possibly work, but logic and theory are pretty clear. Hoping that it works better won't make it work better. It is simply horribly counter-intuitive to accept it. And no, this is not because "you can't teach an old dog new tricks" or any of the other nonsense that will likely follow this. Parallel search has been examined for 40 years now and the theory behind it is well known...
Bob, this argument was discussed at some length in the past. The argument with serial search or multiplexing was shown to be misleading. Parallel search need not follow the same search tree with the serial search. The parallel problem is optimizing performance on 20 cores for fixed time T. The serial problem is optimizing on 1 core fixed time T. Not on 1 core fixed time 20*T. Nobody is disputing that serial one core with fixed time 20*T performs better than parallel on 20 cores in fixed time T. This is given, and you continue to insist on this non-argument.
Here's the point NOBODY addresses. We make poor pruning and reduction decisions, yes. So tell me EXACTLY how the parallel search (lazy SMP) is going to improve on that? Only answer is "serendipitous discovery". That's not an algorithmic improvement. The worse your pruning decisions are, the more likely "widening" will help. But suppose your program searches the OPTIMAL tree already? There's nothing to discover, and now lazy SMP gives nothing, where a traditional YBW or DTS algorithm will STILL improve the speed.

So depending on serendipity to improve things is not a rational way to go here, until you reach a point with a traditional SMP search that you can't improve it further given more processors. At the point where more processors offers nothing, then searching wider and depending on luck to discover things the normal search misses will probably be useful.

But the argument "YBW Stockfish scales worse than lazy-SMP SF" is simply NOT an argument to make. We already know the YBW version scaled poorly, NPS-wise and TTD-wise. Lacking an interest in trying to fix that, using lazy SMP is a reasonable alternative, assuming it actually helps, which as Louis pointed out has NOT been shown yet. Very fast games are a poor way to test SMP anything, anyway. So we have no data to show anything other than that for extremely fast games it is not worse, so I remain unconvinced in light of a lack of evidence showing anything at all.

Meanwhile I have a mountain of evidence for my parallel search performance, and I consider it absolutely impossible for lazy SMP (added to Crafty) to even remotely approach the benefit of a 13x speedup (and it will likely be higher now, with this better NPS scaling)...

I wrote a paper back in the late 80's that took the Knuth/Moore analysis into the world of parallel search. There's some math that shows this lazy stuff adds a HUGE amount of tree growth to reach the same depth. And the growth is unconstrained and undirected, so whenever it helps, it is luck. You aren't going to get lucky more than 1/2 of the time at best.
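For reference, the baseline that analysis starts from is the Knuth/Moore minimal tree. A quick sketch of the numbers (b = 35 is just a typical chess branching factor, an assumption, not a figure from the paper):

Code: Select all

#include <cmath>
#include <cstdio>

// Knuth & Moore (1975): a perfectly ordered alpha-beta tree of branching
// factor b and depth d examines b^ceil(d/2) + b^floor(d/2) - 1 leaves,
// versus b^d for full minimax. Nodes a lazy search adds beyond the
// minimal tree are overhead against reaching the same depth.
int main() {
    double b = 35.0;                      // assumed branching factor
    int d = 12;
    double minimal = pow(b, (d + 1) / 2)  // integer division: ceil(d/2)
                   + pow(b, d / 2) - 1.0; // floor(d/2)
    printf("minimal: %.3g leaves vs full: %.3g\n", minimal, pow(b, (double)d));
    return 0;
}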

Maybe one day we will see some actual data. Since TTD is not comparable between lazy SMP and a good YBW, about all that can be done is to play lots of games at reasonable time limits. The games are painful to run, but I can play quite a few 24-core games at once. I could play a lot of games with 1 cpu vs 24 cpus, crafty vs crafty, and then stockfish vs stockfish, to see what kind of improvement 24 cores will produce. I'm just not sure I want to burn that much time for an answer I am 99.9% certain I already know.
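(If someone does run those matches, converting a match score to an Elo difference is the standard logistic formula; a quick sketch, with a purely illustrative score:)

Code: Select all

#include <cmath>
#include <cstdio>

// Standard logistic model: a score fraction s maps to an Elo
// difference of -400 * log10(1/s - 1).
double elo_from_score(double s) { return -400.0 * std::log10(1.0 / s - 1.0); }

int main() {
    printf("75%% score ~ %+.0f Elo\n", elo_from_score(0.75));  // ~ +191
    return 0;
}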

This is a classic "if it seems too good to be true..." case.

BTW you are not correctly understanding my point. Never mentioned the 20x longer is better than 1x longer issue. It is a case of NX longer is better than 1, where N is the actual SMP performance improvement. For MOST programs this is nothing more than TTD. For YBW TTD is less meaningful. But we have no data to connect/compare the two. Until then, this is all idle nonsensical speculation. We certainly have a ton of Crafty TTD data that everyone could look at. We need some sort of measure of the SF SMP performance, which we don't have and which is painful to produce in light of needing thousands of games vs hundreds of test positions.

When/if they stop changing this code, I might decide to give it a run.
I agree with the part about the superficiality of Fishtest on this SMP issue, testing only at bullet and hyper-fast time controls (and they had all the necessary resources to test comprehensively). The only relevant tests were performed by Louis on one machine at much longer time controls, and they came out fairly well for Lazy-SMP. Also, I suspect that their YBW is not exactly state of the art.

The optimality of the search is an empirical quantity depending on time control and hardware: more Elo at fixed time on fixed hardware. The search that is optimal at 1s/move on 1 core is fairly different from the one that is optimal at 1 minute/move on 20 similar cores. With LMR and similarly "elastic" pruning techniques, one will get huge variations in TTD while varying the optimality (Elo) of the search only a little. It is possible that you are right, and your argument could add some 2-3 Elo points to 1-core SF at time controls longer than those tested on Fishtest.

Again, the main point you seem to make is that for the same amount of spent electricity, 20*Time serial on 1*core is better than 1*Time parallel on 20*cores. But we all agree on that.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP Better than YBWC?

Post by bob »

Laskos wrote: [...] Again, the main point you seem to make is that for the same amount of spent electricity, 20*Time serial on 1*core is better than 1*Time parallel on 20*cores. But we all agree on that.
The point I am trying to make is that the ideal behavior is to search as deeply as possible. In a perfect SMP implementation you would be exactly correct, i.e. SMP speedup = 20, which would give the same result as a 20x faster single core.

But the inverse is not so clear in terms of reasoning: that of _intentionally_ searching less deeply in order to widen the search and extract that improvement. I do and always will consider that to be 100% unintentional behavior, unless you are ready to admit you simply can't get a reasonable search speedup, so you invest the extra CPUs in widening the search instead. That widening happens is observable. That it is intentional is debatable. That it is as useful as going deeper is clearly incorrect...

The only thing we have zero empirical data for is how much the widening helps, specifically for the case of stockfish. I might be tempted to produce this data if this lazy stuff ever stabilizes... I'm not going to waste that much time only to hear "but there is a better version."

I'd rather continue to improve my traditional smp search results, which are getting better and better over time.

An interesting performance metric for current code. There is ZERO lock contention anywhere except at split points where multiple threads are trying to acquire the next move. IE zero lock contention for splitting, for joining, or for completing sub-searches. That's on 20 cores also...

I am now looking at reducing the NextMove() lock as it ought to be possible to select a move quicker... and reduce that. And that is not a very big contention point, but it is THE contention point that prevents perfect NPS scaling...
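For the curious, the kind of instrumentation I mean is simple; a minimal sketch (illustrative only, not the actual Crafty code):

Code: Select all

#include <atomic>
#include <cstdint>

// Test-and-test-and-set spinlock that counts contended acquires; a high
// collisions/acquires ratio flags a hot lock like the NextMove() one.
struct InstrumentedSpinlock {
    std::atomic<bool> locked{false};
    std::atomic<uint64_t> acquires{0};
    std::atomic<uint64_t> collisions{0};

    void acquire() {
        acquires.fetch_add(1, std::memory_order_relaxed);
        while (locked.exchange(true, std::memory_order_acquire)) {
            collisions.fetch_add(1, std::memory_order_relaxed);
            while (locked.load(std::memory_order_relaxed))
                ;  // spin on plain loads to limit cache-line traffic
        }
    }

    void release() { locked.store(false, std::memory_order_release); }
};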
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Lazy SMP Better than YBWC?

Post by mcostalba »

bob wrote: Your YBWC is NOT "state of the art". Talk to some of your guys that are doing actual measurements; Louis Z. comes to mind. For example, my current code is showing a 19.5x NPS speedup on 20 cores. How are you doing? When my NPS speedup was stuck at 16x or so a couple of months ago, the SMP speedup was hanging at 13x on 20 cores. How are you doing? Current code is faster. I am running the tests now, but it takes a week or so to run them.
NPS is really a poor metric for measuring the quality of a YBWC implementation: you just raise the minimum split depth and you get all the NPS that you want.

Indeed, a way to measure the efficiency and low overhead of a YBWC implementation is to see at what minimum split depth you reach the optimum. In SF our sweet spot is 4, but recently with some schemes we even went down to a minimum split depth of just 2 (!) plies without losing ELO (please note that I have written ELO, not NPS, TTD or other indirect stuff). IMO a near-optimal ELO at minimum split depth 2 is a testament to the very efficient copying and locking and the minimal overhead of the YBWC implementation. How are you doing?
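In schematic form, the knob in question is the guard in front of a split; something like this (simplified, with illustrative names, not the real SF code):

Code: Select all

// Schematic YBW split guard: a split point is opened only when the
// remaining depth justifies the copying/locking cost. Raising
// MIN_SPLIT_DEPTH trades parallelism (and TTD) for cheaper, rarer
// splits, which inflates NPS without necessarily helping strength.
const int MIN_SPLIT_DEPTH = 4;  // plies; the sweet spot mentioned above

bool can_split(int depth, int movesSearched, int idleThreads) {
    return depth >= MIN_SPLIT_DEPTH
        && movesSearched >= 1      // "young brothers wait": first move done
        && idleThreads > 0;
}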

Regarding lazy SMP, I will only add that it is far from trivial to get something that works well: many SF developers (me included) tried and failed before one of us, known by the alias mbootsector, came up with the good one. Then many people improved on it, as is common in our development model. So the fact that you didn't get success with lazy SMP does not yield a lot of info per se.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Lazy SMP Better than YBWC?

Post by mcostalba »

bob wrote: An interesting performance metric for current code. There is ZERO lock contention anywhere except at split points where multiple threads are trying to acquire the next move.
Well, this is exactly where the biggest part of lock contention is, especially with a staged move generator (there are tools that can help you verify this; Intel's VTune is one, but there are also many others).
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP Better than YBWC?

Post by bob »

mcostalba wrote:
bob wrote: An interesting performance metric for current code. There is ZERO lock contention anywhere except at split points where multiple threads are trying to acquire the next move.
Well, this is exactly where the biggest part of lock contention is, especially with a staged move generator (there are tools that can help you verify this; Intel's VTune is one, but there are also many others).
Late to the party. I already use VTune. It is _very_ good. I've not seen anything else that can give that much data about everything. And it even intelligently figures out what to measure with the hardware performance monitors as it observes... And note I don't have MUCH contention there, as shown by NPS scaling...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP Better than YBWC?

Post by bob »

mcostalba wrote: NPS is really a poor metric for measuring the quality of a YBWC implementation: you just raise the minimum split depth and you get all the NPS that you want. [...]
This is a simple point. NPS scaling is a critical measure because it provides an absolute upper bound on parallel search speedup. If you run on 20 cores and your NPS speedup is only 15x, your SMP speedup will never exceed 15x, even if everything else is perfect. So you can't ignore it.

But it is not the ultimate measure. Speedup is the critical part.

If you have a 10x speedup, what do you do for improvement? If your NPS speedup is only 13x, you start there, not with the splitting and such: find out why it is so slow. If your speedup is 10x but NPS is 19x, then there is an issue in choosing split points that is causing excessive search overhead.
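To make that diagnosis concrete, the accounting is just this (a sketch with invented numbers):

Code: Select all

#include <cstdio>

// For parallel alpha-beta: ttd_speedup = nps_speedup / node_overhead,
// where node_overhead is parallel nodes / serial nodes to equal depth.
int main() {
    double nps_speedup = 19.0;  // NPS scaling on 20 cores
    double ttd_speedup = 10.0;  // measured time-to-depth speedup
    double node_overhead = nps_speedup / ttd_speedup;
    // 1.9x here: the parallel search visits ~90% extra nodes, so the
    // thing to fix is split-point selection, not raw NPS.
    printf("search overhead: %.2fx nodes\n", node_overhead);
    return 0;
}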

My main criticism of lazy SMP is purely on theoretical grounds. If you do fixed depth searches and you don't see much speedup, that represents a problem. While lazy smp might work pretty well, a traditional parallel search should always have a significant edge, particularly as cores climb.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Lazy SMP Better than YBWC?

Post by mcostalba »

bob wrote: I am now looking at reducing the NextMove() lock as it ought to be possible to select a move quicker... and reduce that. And that is not a very big contention point, but it is THE contention point that prevents perfect NPS scaling...
You can let each thread have a local move generator and use a move count to sync them:

Code: Select all

  while (move = NextMove(), move != MOVE_NONE)
  {
    if (++moveCount <= sp->moveCount)
        continue; // No contention on fast path

    sp->spinlock.acquire();  // sp is a pointer to our split point

    if (moveCount <= sp->moveCount)
    {
        sp->spinlock.release();
        continue; // Race
    }

    assert(moveCount == sp->moveCount + 1);

    sp->moveCount = moveCount;

    sp->spinlock.release();

    ...

In this way the contention is reduced to the bare minimum: a very fast spinlock around a single moveCount check.
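One note on the design: the spinlock above only arbitrates which thread claims the next move index. If every thread generates the moves in the same order, the same claim can be made with a single atomic counter and no lock at all; a sketch (mine, not SF code):

Code: Select all

#include <atomic>

struct SplitPoint {
    std::atomic<int> moveCount{0};  // shared ticket counter
};

// Lock-free claim: each thread runs its own generator and skips moves
// until its local index matches the ticket it drew here.
int claim_next_move(SplitPoint* sp) {
    return sp->moveCount.fetch_add(1, std::memory_order_relaxed);
}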
Joost Buijs
Posts: 1563
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Lazy SMP Better than YBWC?

Post by Joost Buijs »

mcostalba wrote: You can let each thread have a local move generator and use a move count to sync them: [...] In this way the contention is reduced to the bare minimum: a very fast spinlock around a single moveCount check.
This is not as easy as it seems when your move order depends on information, e.g. history, which might have been changed in the meantime by another thread.
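The usual way around that is to freeze the order once, at the split point, so that later history updates by other threads cannot change who searches which move. Roughly (a sketch, not code from any engine):

Code: Select all

#include <atomic>
#include <vector>

using Move = unsigned;      // illustrative move type
const Move MOVE_NONE = 0;

struct SplitPoint {
    std::vector<Move> moves;      // order frozen when the split is made
    std::atomic<size_t> next{0};  // ticket counter as in the earlier sketch
};

// The master sorts the moves once, under the then-current history, before
// splitting; helpers then index into the frozen list.
Move claim(SplitPoint* sp) {
    size_t i = sp->next.fetch_add(1, std::memory_order_relaxed);
    return i < sp->moves.size() ? sp->moves[i] : MOVE_NONE;
}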