rbarreira wrote:In your
DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.
So which one is it... the formula is wrong, or your current parallel algorithm is actually about as good or even better than DTS for 16 CPUs and possibly above?
To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
How many times does it take for me to say this before it sinks in.
The formula is not exact. It is not intended to be accurate to decimal places. It is simply a linear approximation to suggest where a particular data point will fall, within a reasonable margin of error. To answer your question, it is _neither_. DTS is better, it has less overhead due to being able to choose better split points. But once you get a lot of processors involved, it actually could get worse than the current algorithm. I've since spent quite a bit of time (things like "min-thread-group" that everyone seems to have copied. That addressed something with DTS that we did not have time to test and tune. With that idea the DTS paper in the JICCA might (emphasis on might) have been a little better at 16 or 32. We ran on the 32 cpu box for one tournament, and the data we got there was not nearly as good as we had hoped. But time marches on, new ideas were discovered to address specific weaknesses that were difficult to anticipate until the actual hardware became available.
So please stop trying to make this formula about accuracy. It was intended to answer the question that always came up in the days when nobody else was doing parallel search, namely "Bob, what kind of speedup should I see if I buy a dual-cpu box? A quad-cpu box? I'm looking at a 4-socked AMD with dual cores, how much faster will that run?"
That was all it was intended. Now a couple of idiots want to take a linear approximation to a very much non-linear (and small) set of data points (nothing about chess is linear) and quibble about it's accuracy.
Feel free to run a few thousand positions on 1, 2, 4, 8, and 16/32/64 if you can get access, and produce a more accurate formula. Would not bother me at all. When that was written, there was no T932 available, so we could not go to 32. If we had had access, I'd bet we could have pushed that flattening up a bit. The simplest example is that we always split somewhere with all processors that were available. Why split at a node where you have 8 cpus and 8 moves (or even fewer, and sometimes we did not know how many moves we would have since we didn't generate them all at once at every position. In Crafty, I limit the number of threads working at any single split point to address this problem, plus it is better to not put all your eggs in one basket (or all your cpus on one node) in those cases where this ALL node turns into a CUT node. With only 4 processors working there, you get less overhead.
So there are ideas to improve things, but for smallish numbers of processors, Cray Blitz could simply choose better split points than the simple YBW algorithm I use today.