Laskos wrote:bob wrote:Joerg Oster wrote:bob wrote:Steve Maughan wrote:FYI I see the StockFish patch of Oct 20th says that Lazy SMP is actually better than YBWC for large numbers of cores.
This would be amazing if it's true (and I see no reason to doubt it).
Anyone else found it to be true?
- Steve
I believe what they said is "it is better than THEIR YBW implementation." Which is not exactly what you wrote. Its NPS scales particularly well since there is little synchronization. As to actual speedup, that's a different matter.
True, but obviously Stockfish gains a lot from going wider.
OTOH, the nominal depth shown doesn't necessarily tell you the whole truth, since some of the helper threads
are already searching deeper.
So it's possible that the main thread gets useful information from an already deeper search, compensating for the supposed slower speedup.
Comparing SF's YBW time-to-depth with Lazy ttd is most likely like comparing apples with oranges. Not to say misleading.
This "widening" comes up over and over, but it is a broken argument. If widening helps, widening will help the serial search just as much. Yes it can help some, no doubt. But the poorer time-to-depth has to hurt more, making this a net loss compared to a normal YBW (or even better, DTS-like) parallel search.
I won't go so far as to say it can't possibly work, but logic and theory are pretty clear. Hoping that it works better won't make it work better. It is simply horribly counter-intuitive to accept it. And no, this is not because "you can't teach an old dog new tricks" or any of the other nonsense that will likely follow this. Parallel search has been examined for 40 years now and the theory behind it is well known...
Bob, this argument was discussed at some length in the past. The argument with serial search or multiplexing was shown to be misleading. Parallel search need not follow the same search tree with the serial search. The parallel problem is optimizing performance on 20 cores for fixed time T. The serial problem is optimizing on 1 core fixed time T. Not on 1 core fixed time 20*T. Nobody is disputing that serial one core with fixed time 20*T performs better than parallel on 20 cores in fixed time T. This is given, and you continue to insist on this non-argument.
Here's the point NOBODY addresses. We make poor pruning and reduction decisions, yes. So tell me EXACTLY how the parallel search (lazy SMP) is going to improve on that? Only answer is "serendipitous discovery". That's not an algorithmic improvement. The worse your pruning decisions are, the more likely "widening" will help. But suppose your program searches the OPTIMAL tree already? There's nothing to discover, and now lazy SMP gives nothing, where a traditional YBW or DTS algorithm will STILL improve the speed.
So depending on serendipity to improve things is not a rational way to go here, until you reach a point with a traditional SMP search that you can't improve it further given more processors. At the point where more processors offers nothing, then searching wider and depending on luck to discover things the normal search misses will probably be useful.
But the argument of "YBW stockfish scales worse than lazy-amp SF" is simply NOT an argument to make. We already know the YBW version scaled poorly. NPS-wise and TTD wise. Lacking an interest in trying to fix that, using lazy smp is a reasonable alternative assuming it actually helps, which as Louis pointed out has NOT been shown yet. Very fast games are a poor way to test SMP anything, anyway. So we have no data to show anything other than for extremely fast games it is not worse, so I remain unconvinced in the light of a lack of evidence showing anything at all.
Meanwhile I have a mountain of evidence for my parallel search performance, and I consider it absolutely impossible for lazy SMP (added to Crafty) to even remotely approach the benefit of a 13x speedup (and it will likely be higher now, with this better NPS scaling)..
I wrote a paper back in the late 80's that took the Knuth/Moore analysis into the world of parallel search. There's some math that shows this lazy stuff adds a HUGE amount of tree growth to reach the same depth. And the growth is unconstrained and undirected, so whenever it helps, it is luck. You aren't going to get lucky more than 1/2 of the time at best.
Maybe one day we will see some actual data. Since TTD is not comparable between lazy smp and a good yaw, about all that can be done is to play lots of games at reasonable time limits. The games are painful to run, but I can play quite a few 24 core games at once. I could play a lot of games with 1 cpu vs 24 cpus crafty vs crafty, and then stockfish vs stockfish, so see what kind of improvement 24 cores will produce. I'm just not sure I want to burn that much time for an answer I am 99.9% certain I already know.
This is a classic "if it seems too good to be true..." case..
BTW you are not correctly understanding my point. Never mentioned the 20x longer is better than 1x longer issue. It is a case of NX longer is better than 1, where N is the actual SMP performance improvement. For MOST programs this is nothing more than TTD. For YBW TTD is less meaningful. But we have no data to connect/compare the two. Until then, this is all idle nonsensical speculation. We certainly have a ton of Crafty TTD data that everyone could look at. We need some sort of measure of the SF SMP performance, which we don't have and which is painful to produce in light of needing thousands of games vs hundreds of test positions.
When/if they stop changing this code, I might decide to give it a run.