bob wrote:Laskos wrote:bob wrote:
1. Where is there ANY mention about Crafty's speedup or Elo gain in the original post?
Assuming 90 Elo points per doubling of time at 60''+0.05'', I plotted, from Andreas's results, the effective speed-up for Crafty (your linear 1 + (Ncpus - 1) * 0.7), Komodo 8 and Stockfish 5.
It seems all top engines are very weak in their SMP implementation. They are buggy. If you, Bob, go to 32 threads or so, Crafty, with such unbelievable (linear) SMP performance, will be the strongest engine around. Good job, Bob.
Congratulations on improving Komodo by 70 Elo points (single or SMP or both).
Always "I assume" or "I think" or "I believe" or "I heard" or "I saw somewhere" and such.
Notice my formula is right with Komodo through 8. Highly inaccurate. That it diverges a bit at 16 is cause for a major panic? When I have specifically stated that it is really well tested only through 16, and that it is also architecture dependent, since multi-core chips have a bottleneck that single-core multi-chip machines do not? NO formula will predict SMP performance with two-decimal-place accuracy across all existing platforms using Intel/AMD processors. Nobody in their right mind would expect one to.
Those who actually know how to measure speedup would do the following, something you have not done and probably were not aware you needed to do:
1. Measure NPS with one thread and nothing else running.
2. If you have N cores, run N instances of a single-thread search at the same time and measure the NPS of each. If you add the N NPS values together, the sum might or might not equal N times the NPS from step 1.
Now you know the hardware scaling limit. Assume a REALLY good hardware implementation, so that in step 2 you really do get N times the step 1 number, which means raw speed scales perfectly.
3. Run a set of positions to fixed depth using 1 and N threads. For the N-thread version, run them several times and average. Divide the 1-thread total time by the N-thread total time. You now have an actual SMP speedup.
The JICCA DTS article gives this data for Cray Blitz for 1, 2, 4, 8 and 16 processors on a good architecture. You can find the numbers, or I can post them later. There was a thread, started by Vincent when he was not happy with the speedup he got on a supercomputer he was going to use for one of the WCCC events, in which he complained that my numbers were grossly exaggerated. I ran a bunch of positions at the AMD development lab, using 1, 2, 4 and 8 processors (no multi-core), and gave the data to Fierz. He discovered that my formula was a low estimate and that the actual numbers were a bit better.
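Steps 1 through 3 above come down to a little arithmetic, which can be sketched roughly as follows. This is only an illustration: the function names and every number in it are made up for the example, and it assumes you have already collected the NPS figures and fixed-depth times yourself.

```python
from statistics import mean


def hardware_scaling(nps_one_thread, nps_per_instance):
    """Steps 1 and 2: fraction of perfect raw-speed scaling.

    nps_one_thread   -- NPS of one thread with nothing else running
    nps_per_instance -- NPS of each of the N simultaneous one-thread searches
    """
    n = len(nps_per_instance)
    return sum(nps_per_instance) / (n * nps_one_thread)


def smp_speedup(one_thread_times, n_thread_runs):
    """Step 3: total 1-thread time divided by the averaged total N-thread time.

    one_thread_times -- time per position to fixed depth with 1 thread
    n_thread_runs    -- several repeated runs, each a list of per-position times
    """
    t1 = sum(one_thread_times)
    tn = mean([sum(run) for run in n_thread_runs])
    return t1 / tn


# Hypothetical measurements, for illustration only.
scaling = hardware_scaling(1_000_000, [950_000, 950_000, 950_000, 950_000])
speedup = smp_speedup(
    [100.0, 200.0, 300.0],
    [[30.0, 60.0, 90.0], [35.0, 55.0, 95.0], [25.0, 65.0, 85.0]],
)
print(f"hardware scaling: {scaling:.2f}, SMP speedup: {speedup:.2f}")
```

The N-thread runs are repeated and averaged because a parallel search does not take the same time on every run; a single run per position tells you very little.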
I don't waste all of my time trying to measure parallel speedup unless I make significant changes to the parallel search. Could it be better or worse today? Quite possibly, since my search (not the parallel part) has changed a lot, with more aggressive LMR, more aggressive null-move pruning, more aggressive forward pruning, singular extensions, and on and on. One day I will take the time to test again and see if the formula needs an update. Until then, it is an ESTIMATE.
However, I consider your plot to be bogus, because you are mixing apples and oranges. My formula is purely SMP speedup. You are using Elo, plus an estimate of Elo per doubling, to compute an estimate of an estimate of what Komodo's parallel speedup would be. There is a real, scientific, accurate, no-guesswork method to compute parallel speedup. You are not using it.
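To make the "estimate of an estimate" point concrete, here is a rough sketch of the kind of Elo-to-speedup conversion being used in that plot. The 90 Elo per doubling is the assumption quoted above; the function name, the 180 Elo input and the alternative 70 and 110 values are purely hypothetical and are there only to show how far the answer moves when the assumed Elo per doubling moves.

```python
def effective_speedup_from_elo(elo_gain, elo_per_doubling=90.0):
    """Convert an Elo gain into an 'effective speedup'.

    Assumes each doubling of time is worth a fixed number of Elo points,
    so the speedup is 2 raised to the number of doublings gained.
    """
    if elo_per_doubling <= 0:
        raise ValueError("elo_per_doubling must be positive")
    return 2.0 ** (elo_gain / elo_per_doubling)


# Same hypothetical Elo gain, three different guesses for Elo per doubling.
for per_doubling in (70.0, 90.0, 110.0):
    s = effective_speedup_from_elo(180.0, per_doubling)
    print(f"{per_doubling:5.0f} Elo/doubling -> effective speedup {s:.2f}")
```

The result depends entirely on the assumed Elo per doubling, which is itself an estimate. A timed fixed-depth measurement has no such dependency.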
Again, I don't see the point in (a) your broken calculations or (b) your concern with an estimate that actually matches your broken data almost perfectly through 8 processors and diverges later. Does the speedup grow linearly? No, and I have never claimed it did. The estimate is a linear function because one can compute it in one's head to get a quick number. Want an accurate number? Spend the time and run the tests. That is how _I_ measure and report speedup. ALWAYS.
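For anyone who wants the quick estimate written out, this is a minimal sketch of the linear formula quoted earlier, 1 + (Ncpus - 1) * 0.7. The function name is just for illustration, and the output is a rough estimate, not a measured speedup.

```python
def estimated_speedup(ncpus, per_extra_cpu=0.7):
    """Quick linear ESTIMATE of SMP speedup: 1 + (ncpus - 1) * 0.7."""
    if ncpus < 1:
        raise ValueError("ncpus must be at least 1")
    return 1.0 + (ncpus - 1) * per_extra_cpu


for n in (1, 2, 4, 8, 16):
    print(f"{n:2d} cpus -> estimated speedup {estimated_speedup(n):.1f}")
```

That gives roughly 1.7 for 2 processors, 3.1 for 4, 5.9 for 8 and 11.5 for 16, all of which can be done in your head.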