Lazy SMP Better than YBWC?

Steve Maughan · Post by **Steve Maughan** » Sat Oct 24, 2015 12:41 am

FYI I see the StockFish patch of Oct 20th says that Lazy SMP is actually better than YBWC for large numbers of cores.

This would be amazing if it's true (and I see no reason to doubt it).

Anyone else found it to be true?

- Steve

bob · Post by **bob** » Sat Oct 24, 2015 2:48 am

Steve Maughan wrote:FYI I see the StockFish patch of Oct 20th says that Lazy SMP is actually better than YBWC for large numbers of cores.

This would be amazing if it's true (and I see no reason to doubt it).

Anyone else found it to be true?

- Steve

I believe what they said is "it is better than THEIR YBW implementation." Which is not exactly what you wrote. Its NPS scales particularly well since there is little synchronization. As to actual speedup, that's a different matter.

Joerg Oster · Post by **Joerg Oster** » Sat Oct 24, 2015 10:39 am

bob wrote:
Steve Maughan wrote:FYI I see the StockFish patch of Oct 20th says that Lazy SMP is actually better than YBWC for large numbers of cores.

This would be amazing if it's true (and I see no reason to doubt it).

Anyone else found it to be true?

- Steve
I believe what they said is "it is better than THEIR YBW implementation." Which is not exactly what you wrote. Its NPS scales particularly well since there is little synchronization. As to actual speedup, that's a different matter.

True, but obviously Stockfish gains a lot from going wider.
OTOH, the nominal depth shown doesn't necessarily tell you the whole truth, since some of the helper threads
are already searching deeper.
So it's possible that the main thread gets useful information from an already deeper search, compensating for the supposed slower speedup.

Comparing SF's YBW time-to-depth with Lazy ttd is most likely like comparing apples with oranges. Not to say misleading.

bob · Post by **bob** » Sat Oct 24, 2015 5:55 pm

Joerg Oster wrote:
bob wrote:
Steve Maughan wrote:FYI I see the StockFish patch of Oct 20th says that Lazy SMP is actually better than YBWC for large numbers of cores.

This would be amazing if it's true (and I see no reason to doubt it).

Anyone else found it to be true?

- Steve
I believe what they said is "it is better than THEIR YBW implementation." Which is not exactly what you wrote. Its NPS scales particularly well since there is little synchronization. As to actual speedup, that's a different matter.
True, but obviously Stockfish gains a lot from going wider.
OTOH, the nominal depth shown doesn't necessarily tell you the whole truth, since some of the helper threads
are already searching deeper.
So it's possible that the main thread gets useful information from an already deeper search, compensating for the supposed slower speedup.

Comparing SF's YBW time-to-depth with Lazy ttd is most likely like comparing apples with oranges. Not to say misleading.

This "widening" comes up over and over, but it is a broken argument. If widening helps, widening will help the serial search just as much. Yes it can help some, no doubt. But the poorer time-to-depth has to hurt more, making this a net loss compared to a normal YBW (or even better, DTS-like) parallel search.

I won't go so far as to say it can't possibly work, but logic and theory are pretty clear. Hoping that it works better won't make it work better. It is simply horribly counter-intuitive to accept it. And no, this is not because "you can't teach an old dog new tricks" or any of the other nonsense that will likely follow this. Parallel search has been examined for 40 years now and the theory behind it is well known...

mcostalba · Post by **mcostalba** » Sat Oct 24, 2015 6:48 pm

Lazy SMP convincely outperformed YBWC with 20 threads and was at least on par with 3 threads and 7 threads (possibly stronger, but we didn't test for it). This is a fact.

Our YBWC implementation is state-of-the-art. This is another fact.

bob · Post by **bob** » Sat Oct 24, 2015 7:11 pm

mcostalba wrote:Lazy SMP convincely outperformed YBWC with 20 threads and was at least on par with 3 threads and 7 threads (possibly stronger, but we didn't test for it). This is a fact.

Our YBWC implementation is state-of-the-art. This is another fact.

Your YBWC is NOT "state of the art". Talk to some of your guys that are doing actual measurements. Louis Z. comes to mind For example, my current code is showing a 19.5x NPS speedup on 20 cores. How are you doing? When my NPS speedup was stuck at 16x or so a couple of months ago the SMP speedup was hanging at 13x on 20 cores. How are you doing? Current code is faster. I am running the tests now, but it takes a week or so to run them.

There are tools that can help you. Intel's "vtune" is one. But you have to realize you have a problem FIRST... Actually, FIRST you have to realize that there is ALWAYS someone that is better than you are. I figured that out 30 years ago. So I don't keep putting people down with snide comments...

I'm not an expert on lazy smp. I certainly tried it when it first surfaced over 20 years ago. It has some inherent flaws. However, I AM an expert on parallel search. There's nothing I haven't tried that has ever been talked about, nothing I haven't analyzed, and I decided 30 years ago this was something I was going to study intently because it is an interesting problem. yes, lazy smp improves performance. Just nowhere near what a good YBW (or even better, DTS) algorithm will produce. Yes, lazy smp searches undesirable nodes that can help a bit when a program is too aggressive with pruning. So some of the overhead washes out. But that issue can be fixed in a better way. If you prune too much, it can be fixed. If the goal for an SMP search is to not search any deeper, just search more thoroughly, yes, that is an approach. But ONLY when you can't use the extra processors for their intended purpose, to search the serial tree more deeply in the same amount of time...

So if lazy smh is doing acceptably for you, that's good. But it does NOT mean it is doing BEST for you. Perhaps I will have time to run some games on our 24 core boxes at a reasonable time control. Very fast games are NOT the way to make these tests.

zullil · Post by **zullil** » Sat Oct 24, 2015 7:19 pm

mcostalba wrote:Lazy SMP convincely outperformed YBWC with 20 threads and was at least on par with 3 threads and 7 threads (possibly stronger, but we didn't test for it). This is a fact.

True, but so far only at short time controls, where YBW is expected to under-perform.

At longer time controls, to me nothing is clear yet. After 40 games on 20 cores at 1800"+10", lazy leads ybw 3-2, with 35 draws. My guess is that lazy is weaker than ybw at time controls like we will see in the TCEC Superfinal.

zullil · Post by **zullil** » Sat Oct 24, 2015 7:28 pm

bob wrote:
For example, my current code is showing a 19.5x NPS speedup on 20 cores.

That seems amazing. Looking forward to the arrival of Crafty-25.

syzygy · Post by **syzygy** » Sat Oct 24, 2015 7:32 pm

bob wrote:This "widening" comes up over and over, but it is a broken argument. If widening helps, widening will help the serial search just as much. Yes it can help some, no doubt. But the poorer time-to-depth has to hurt more, making this a net loss compared to a normal YBW (or even better, DTS-like) parallel search.

We've been through this many times. It is nonsense to compare this with a serial search.

If you have a system with a single core, then a serial search is just fine.

But now take a system with 20 cores. A serial search will not make use of 19 of those cores. So you need to do something different.

You can try YBW, you can try the lazy approach (that happens to result in widening). If the lazy approach outperforms YBW, then lazy is better than YBW. It does not mean that the lazy idea is of any use for a serial search. We're simply not talking about 1-core machines here.

bob · Post by **bob** » Sat Oct 24, 2015 7:52 pm

syzygy wrote:
bob wrote:This "widening" comes up over and over, but it is a broken argument. If widening helps, widening will help the serial search just as much. Yes it can help some, no doubt. But the poorer time-to-depth has to hurt more, making this a net loss compared to a normal YBW (or even better, DTS-like) parallel search.
We've been through this many times. It is nonsense to compare this with a serial search.

If you have a system with a single core, then a serial search is just fine.

But now take a system with 20 cores. A serial search will not make use of 19 of those cores. So you need to do something different.

You can try YBW, you can try the lazy approach (that happens to result in widening). If the lazy approach outperforms YBW, then lazy is better than YBW. It does not mean that the lazy idea is of any use for a serial search. We're simply not talking about 1-core machines here.

That is simply wrong. You are accepting a lousy serial search implementation and using extra processors to offset for the existing weakness. If you want to argue from that perspective, that's fine. But in parallel processing research, the general idea is to compare the *best* serial algorithm against the *best* parallel algorithm.

Your statement is based on "I accept that I can't get the actual parallel implementation of alpha/beta to work well. So I will at least get *something* with the extra processors by widening the tree so that I make fewer pruning mistakes. A *good* implementation doesn't accept that built in flaw.

And of course we are not talking about 1 core machines. We are talking about the *best* possible use of N cores. That is *not* lazy smh. It is only best when the basic parallel search has significant issues. Issues that COULD be repaired. Or else they can be covered up by a sub-optimal parallel search that accepts search overhead (extra nodes) as a method of serendipitous repair.

Better can be done.

I *know* that lazy in my code doesn't outperform the normal parallel search. It doesn't even get close. The "widening" is not something I want. It is wasted effort that could drive the search deeper, or make it use less time, which is *the* goal for parallel search. If you hire 20 people to build you a house, what is your expectation? That they will finish the work 20 times faster, or that they will finish the work only 10 times faster because every pair of workers does the same thing twice? I know what *I* want here. The fact that you might get a somewhat better house since everything is done twice and you can keep the best of the two is *not* a justification. Make the workers improve their technique so that advantage goes away.

Lazy SMP Better than YBWC?

Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?