Above 16, DTS is an unknown for me. Some have used it on more recent hardware, but I am not aware of anyone doing everything I did in CB. One problem is memory bandwidth: the C90 had tons, a PC has very little in comparison, and that limits what you are willing to access and use to make split decisions. The fall-off from 4 to 16 with DTS was significant. And we really had little access to a dedicated C90 to test and tune and tweak. You can't use a "shared machine", since parallel performance is meaningless when others run at the same time.
rbarreira wrote:
What part of my post made you think that I was complaining about 11.1 not being equal to 11.5?
Dann Corbit wrote:
11.1 seems to answer very neatly for the 11.5 predicted by the formula.
rbarreira wrote:
In your DTS paper you give a measured speedup of 11.1 for 16 CPUs, which is less than your formula says (11.5), despite the fact that DTS is supposed to be a better parallel algorithm than what you're using today.
So which one is it... the formula is wrong, or your current parallel algorithm is actually about as good or even better than DTS for 16 CPUs and possibly above?
To me the results in the DTS paper make more sense, i.e. efficiency keeps going down when adding CPUs instead of hovering around a minimum. Unfortunately the paper doesn't have results for more than 16 CPUs.
You were expecting exact agreement? Of course, that would be an utterly ridiculous expectation (in fact we rightly view such data with mathematical suspicion).
If, for instance, a student dropped a penny and measured the force of gravity using the mass of the penny and the mass of the earth and came up with 6.673 x 10^(-11) m^3/(kg*s^2) it would seem a little far-fetched.
Did you actually read/understand my post and the fact that the formula is predicting something different from DTS? According to Robert, the formula is supposed to predict Crafty's scaling, which is supposed to be worse than DTS scaling.
The fact that the formula predicts a scaling which is better than DTS (or about equal, if you want to ignore the small difference) is what I find strange given that Robert always says that DTS is quite a lot better than the other parallel algorithms.
Does DTS only pay off with more than 16 CPUs? Does the formula wrongly predict Crafty's scaling above 8 CPUs? Or maybe Crafty's parallel algorithm is already as good as DTS? All are possibilities, I'm just trying to find out the reason for the discrepancy.
Crafty has been better tuned and tested, because 8-CPU machines are everywhere. We have a cluster of 70 dual-quadcore boxes right now, and are in the process of buying either a bunch of dual-hexacore boxes or dual-octacore boxes, depending on delivery times. That will provide a ton of accurate 16-core data. It won't likely change the formula, since that is just a rough approximation to keep it simple. And if things get a bit worse than the formula predicts at 32-64, lowering the slope to fit those counts would make it less accurate at the more common core counts most will be using. It is just a rough estimate. When you get to 16, it could be off by 2 or 3 and still be useful.
I've never seen people lock on to a very rough approximation and complain that "Hey, this is a very rough approximation, it is off here, it is off here, it is off here..." Like I didn't already know it was "rough". But it does get you into the right ballpark, quickly and easily, which is why I reference it.
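For anyone who wants to see the numbers being argued about, here is a quick sketch. It assumes the formula in question is the linear rule speedup = 1 + 0.7 * (N - 1); the thread only quotes its result for 16 CPUs (11.5), so the 0.7 slope and the function names are assumptions chosen to match that figure. The 11.1 value is the DTS measurement quoted above. It also shows why a linear rule has efficiency that levels off near the slope rather than continuing to fall.

```python
def predicted_speedup(cpus, slope=0.7):
    """Rough linear estimate: 1 + slope * (cpus - 1).

    The 0.7 slope is an assumption; it reproduces the 11.5
    quoted in the thread for 16 CPUs.
    """
    return 1.0 + slope * (cpus - 1)


def efficiency(speedup, cpus):
    """Parallel efficiency: speedup divided by the number of CPUs."""
    return speedup / cpus


# Measured DTS speedup at 16 CPUs, as quoted in the thread.
DTS_MEASURED_16 = 11.1

if __name__ == "__main__":
    for n in (1, 2, 4, 8, 16, 32, 64):
        s = predicted_speedup(n)
        print(f"{n:3d} CPUs: speedup {s:5.1f}, efficiency {efficiency(s, n):.3f}")

    gap = predicted_speedup(16) - DTS_MEASURED_16
    print(f"formula minus DTS at 16 CPUs: {gap:.1f}")
```

With a linear rule the efficiency drifts down toward the slope as N grows and never drops below it, which is the "hovering around a minimum" behavior questioned earlier in the thread. The gap to the DTS number at 16 is well inside the "off by 2 or 3" tolerance stated for the estimate.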