I've posted my numbers many times. On 4, the observed number was really 3.4x for my test set. On 8, it was something over 6, but I will have to scrounge up the data to see exactly. 6.1 comes to mind...
diep wrote:
Hi Bob,
bob wrote:
That assumes that if you have 4 real cores, and you test with 4 threads, and then use 8 logical cores (HT on) and run with 8 threads, then the tree will grow about 30% in size due to the parallel search overhead.
Werewolf wrote:
Your answer is too interesting to let it slip by!
bob wrote:
Problem with HT on is that if you have 4 physical cores, and search X NPS, when you go to 8 cores (HT on) the tree will grow by 30%. If your NPS doesn't grow by MORE than 30%, you see a net loss.
NPS is NOT the way to measure parallel search performance. It provides completely bogus comparisons...
a) Why does the tree grow by 30% with HT on?
b) Is this also true if we move from 4 physical cores to 8?
c) Why does the NPS have to increase by 30% or more to maintain performance? (because surely the HT tree isn't the same tree as the non-HT tree and therefore time to depth is misleading)
(a) Alpha/beta is a purely sequential algorithm as defined. You need to establish a bound at each node, by searching the best move first, then you use that bound to search the remaining nodes more efficiently. When you don't do this (and you can't in a parallel search) you search a larger tree to reach the same depth.
For (b) yes. It is not a "core" issue but a "number of threads" issue.
(c) think about it. Going from 4 to 8 threads makes the tree 30% larger. If you don't speed up enough with the extra 4 threads to offset that loss, you see a net decrease in performance. If the NPS increases by more than that amount, you see a (small) net gain.
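The break-even argument in (c) can be put as a small calculation. This is only a sketch: it assumes time to a fixed depth is simply nodes searched divided by NPS, and the function name and the sample NPS gains are made up for illustration.

```python
import math


def relative_time_to_depth(tree_growth: float, nps_gain: float) -> float:
    """Time to reach a fixed depth with 8 threads, relative to 4 threads.

    tree_growth: fractional increase in nodes searched (0.30 = 30% larger tree)
    nps_gain:    fractional increase in nodes per second (0.30 = 30% faster)
    Assumes time = nodes / NPS; a result below 1.0 means a net gain.
    """
    return (1.0 + tree_growth) / (1.0 + nps_gain)


# Illustrative NPS gains only; the 30% tree growth is the figure from the thread.
for gain in (0.20, 0.30, 0.40):
    ratio = relative_time_to_depth(0.30, gain)
    if math.isclose(ratio, 1.0):
        verdict = "break even"
    elif ratio < 1.0:
        verdict = "net gain"
    else:
        verdict = "net loss"
    print(f"NPS +{gain:.0%}: relative time {ratio:.3f} ({verdict})")
```

Under that assumption, any NPS gain below the tree growth gives a ratio above 1.0, which is the net loss described above.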
If your speedup is 3.1 out of 4 and 30% is the break-even point moving to 8 cores for hyperthreading, then that would mean that Crafty's speedup deteriorates a lot, namely that it breaks even at:
Assuming 100% scaling now: 4 * 1.3 = 5.2
So you get less than 3.1 out of 5.2 with 5.2 being what you get at 8 cores.
8 * 3.1 / 5.2 = 4.76 out of 8
So if even a 30% increase in NPS from hyperthreading doesn't benefit Crafty, then that means that, assuming you get 3.1 out of 4 as a speedup, at 8 cores you get 4.76 out of 8.
For Young Brothers Wait that seems like a rather small speedup out of 8 cores to me.
Vincent
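The arithmetic in the post above can be reproduced in a few lines. This is a sketch that follows the post's own reasoning step by step without endorsing it; the function and variable names are made up for illustration.

```python
def ht_break_even_estimate(physical_cores: int, ht_nps_gain: float,
                           speedup_on_physical: float, logical_cores: int):
    """Reproduce the post's two steps.

    Returns (effective_cores, speedup_scaled_to_logical_cores).
    """
    # Step 1: NPS with HT on, expressed in "cores", assuming 100% scaling.
    effective_cores = physical_cores * (1.0 + ht_nps_gain)
    # Step 2: if HT only breaks even, the speedup is still speedup_on_physical
    # out of effective_cores; rescale that ratio to the logical core count.
    scaled = logical_cores * speedup_on_physical / effective_cores
    return effective_cores, scaled


effective, scaled = ht_break_even_estimate(4, 0.30, 3.1, 8)
print(f"{effective:.1f} effective cores, {scaled:.2f} out of 8")
```

The second value comes out just under 4.77; the post quotes it truncated as 4.76.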
That 30% overhead has been in Crafty for 10+ years now, and the only thing that will reduce it is more accurate move ordering, which seems unlikely to happen... Note that 30% is just a statistical approximation used to fit a straight line (estimated speedup) to data that is not exactly linear. So lower core speedups are generally understated, as the 30% was the number I got when running on a 16 cpu Cray a long while back... Going beyond 16 the speedup is overstated using that formula. But it is a ballpark...
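For illustration, here is one way to write such a straight-line estimate. The post does not state the formula, so the form used here is an assumption: each processor after the first contributes 1 minus the overhead. It does reproduce the 3.1 out of 4 figure used earlier in the thread. The observed values are the ones quoted at the top of this post, and 6.1 is a from-memory number.

```python
def estimated_speedup(n_cpus: int, overhead: float = 0.30) -> float:
    """Straight-line speedup estimate.

    ASSUMPTION: the line has the form 1 + (n - 1) * (1 - overhead);
    this form is not spelled out in the post.
    """
    return 1.0 + (n_cpus - 1) * (1.0 - overhead)


# Observed speedups quoted in the thread; 6.1 is a from-memory figure.
observed = {4: 3.4, 8: 6.1}

for n in (1, 2, 4, 8, 16):
    est = estimated_speedup(n)
    note = f", observed about {observed[n]}" if n in observed else ""
    print(f"{n:2d} cpus: estimated {est:.1f}{note}")
```

With these numbers the estimate sits below the observed speedup at both 4 and 8, which matches the remark that lower core counts are understated.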
I have several 12 core boxes now so perhaps I ought to run the test on those, as well as on our 8-core boxes which I would expect to perform a bit better on a per-cpu basis, since there is less cache contention and bandwidth required...