Joerg Oster wrote:
bob wrote:
mcostalba wrote:
bob wrote: Your YBWC is NOT "state of the art". Talk to some of your guys that are doing actual measurements. Louis Z. comes to mind. For example, my current code is showing a 19.5x NPS speedup on 20 cores. How are you doing? When my NPS speedup was stuck at 16x or so a couple of months ago, the SMP speedup was hanging at 13x on 20 cores. How are you doing? Current code is faster. I am running the tests now, but it takes a week or so to run them.
NPS is really a poor metric for measuring the quality of a YBWC implementation: just raise the minimum split depth and you get all the NPS you want.
Indeed, a better way to measure the efficiency and low overhead of a YBWC implementation is to see at what minimum split depth you reach the optimum. In SF our sweet spot is 4, but recently with some schemes we even went down to a minimum split depth of just 2 (!) plies without losing Elo (please note that I have written Elo, not NPS, TTD or other indirect stuff). IMO near-optimal Elo at minimum split depth 2 is a testament to the very efficient copying and locking and minimal overhead of the YBWC implementation. How are you doing?
Regarding lazy SMP, I only add that it is far from trivial to get something that works well: many SF developers (me included) tried and failed before one of us, known by the alias mbootsector, came up with the good one. Then many people improved on it, as is common in our development model. So the fact that you didn't get success with lazy SMP does not yield a lot of info per se.
This is a simple point. NPS scaling is a critical measure because it provides an absolute upper bound on parallel search speedup. If you run on 20 cores and your NPS scaling is only 15x, your SMP speedup will never exceed 15x, even if the search is otherwise perfect. So you can't ignore it.
But it is not the ultimate measure. Speedup is the critical part.
If you have a 10x speedup, what do you do to improve it? If your NPS speedup is only 13x, you start there, not with the splitting and such: find out why raw node throughput is so low. If your speedup is 10x but NPS is 19x, then there is an issue in choosing split points that is causing excessive search overhead.
My main criticism of lazy SMP is purely on theoretical grounds. If you do fixed-depth searches and you don't see much speedup, that represents a problem. While lazy SMP might work pretty well, a traditional parallel search should always have a significant edge, particularly as core counts climb.
Methinks this is only a valid/fair comparison if all helper threads are searching to the same depth as the main thread.
Alas, nobody is doing lazy SMP in this stupid way ...
Actually, you are: you have a ton of search consistency issues. How do you compare two different moves searched to different depths? And do you control which moves are searched to which depth, or is this controlled by serendipity?
I.e., there is a reason for the "L" in LMR: you have control over what gets reduced and by how much, as opposed to using RMR (Random Move Reductions).
If you like the results, fine. But if you think this is better than a directed search that tries to do what a single-threaded search does, only faster, that's a bit of a stretch.
This is not a new idea. Monty Newborn did a flavor of this in the 70s. I don't remember who first proposed ABDADA, but that was also over 20 years ago. The intent was "easy" as opposed to "optimal", and that's exactly what this has produced so far, at least in all the published data over the years.
If you want to claim lazy SMP is better, you have to provide data, not hand-waving. And it has to be data run at reasonable time controls on a reasonable number of cores...
When lazy SF stops changing, I'll see if I can produce some real data by running some longer games on a cluster of 24-core machines. Having actual data would be nice.