michiguel wrote: Rein Halbersma wrote:
For DTS (S = speedup) I suggest
S = 35 * n / (34 + n)
which is fairly simple.
This can also be expressed by saying that the reciprocal of the speedup is linear in the reciprocal of the number of CPUs:
1/S = a * 1/CPUs + b
This is not an approximation; it is what it is supposed to be if Amdahl's law applies. The good thing is that it also has "some" predictive power over numbers you did not test.
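The connection between the fitted formula and Amdahl's law can be made explicit. A minimal sketch (the algebra here is exact; the function names are just illustrative):

```c
#include <assert.h>

/* Amdahl's law with serial fraction f: S(n) = 1 / (f + (1 - f) / n).
   With f = 1/35 this reduces algebraically to S = 35*n / (34 + n),
   the fit suggested above.  Taking reciprocals,
   1/S = f + (1 - f) * (1/n),
   i.e. 1/S is linear in 1/n with slope a = 1 - f and intercept b = f. */
double amdahl_speedup(double f, double n) {
    return 1.0 / (f + (1.0 - f) / n);
}

/* The fitted DTS curve quoted above. */
double dts_fit(double n) {
    return 35.0 * n / (34.0 + n);
}
```

Plugging f = 1/35 into the first function reproduces the second for every n, which is why the fit can be read as "a 1/35 serial fraction" whenever Amdahl's model is the right one.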
Parallel programming is not my expertise, but the basic math is identical to enzyme kinetics or simple electricity problems (resistors connected in parallel).
Could it be that your example of Amdahl's law reflects the YBW paradigm for the average chess position's branching factor of 35? If this relationship for up to N = 16 can be extrapolated, the upper bound on the speedup is S = 35 for N = infinity. That would be very poor scaling behavior for the DTS algorithm at large N, considering that chess search trees have ample parallelism. Didn't Cilkchess show linear speedup for a much larger range of processors?
If Amdahl's law applies and the formula is correct, 35 means that 1/35 is the fraction of the program that cannot be made parallel; in this case, 1/35 ~ 0.028, or 2.8%. In my humble experience of making Gaviota parallel (which is not efficient, BTW), I observed that the non-parallel fraction of the program is bigger at low depths. This is because a big chunk of the non-parallelism comes from the fact that the PV and its last x plies never split. The deeper you search, the less significant this becomes. I measured it, and it could go from 15% at the beginning to 1% or less, but it never goes to zero. So how well these algorithms scale depends on the depth reached. If I am not mistaken, the DTS experiments may have been done at reasonably "low" depths by today's standards. Maybe it would scale even better today.
To make things more complicated, and maybe the cause of non-Amdahl behavior, is that, as Bob pointed out, each processor added makes the tree bigger (this may end up looking like a non-parallel fraction of the tree, but the concept is different). OTOH, there is a positive effect, because there is a chance to find the solution faster by peeking at successful late moves earlier, like a "fast-forward" of the process.
That is why the problem may look like Amdahl's, but the k = 35 may not mean what we think it means; it may just be an average of all the issues at stake.
Since we are on a different tangent now, some sources of inefficiency...
(1) Split cost. Non-trivial: a lot of data gets copied to the various split blocks. Totally unavoidable; one just has to try to limit the number of splits.
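A sketch of what that copy looks like; the struct fields here are hypothetical, not from any particular engine, but they illustrate the volume of state that has to be duplicated for every helper thread:

```c
#include <assert.h>
#include <string.h>

/* Hypothetical split block -- field names are illustrative only.
   Everything a helper thread needs to search independently must be
   duplicated at the split point. */
typedef struct {
    unsigned char board[64];      /* piece placement */
    unsigned long long hash_key;  /* position signature */
    int ply, alpha, beta;         /* search state at the split */
    int moves[256];               /* remaining moves to hand out */
    int move_count;
} SplitBlock;

/* The copy itself cannot be avoided; the usual lever is to split
   less often, e.g. by requiring a minimum remaining depth. */
void copy_to_split_block(SplitBlock *dst, const SplitBlock *src) {
    memcpy(dst, src, sizeof *dst);
}
```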
(2) Search overhead. Every time you do a split and one of the threads gets a fail high, some (or all) of what the other threads have done is wasted. How much overhead this causes is difficult to project. If you generate 40 moves and, for some reason, the fail-high move is at slot 20, you search move one serially to satisfy the YBWC, then you split and start the search. If any moves past slot 20 are searched, partially or completely, by the time move 20 fails high, that is wasted effort.
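A toy accounting of that example (indices 0-based, the node counts are made up): anything spent on moves past the fail-high slot would never have been searched serially under alpha-beta, so it is pure parallel overhead.

```c
#include <assert.h>

/* Sum the nodes spent on moves after the fail-high slot.  In a
   serial search these moves would have been cut off, so this total
   is the search overhead caused by the split. */
long long wasted_nodes(const long long nodes_per_move[], int nmoves,
                       int fail_high_slot) {
    long long wasted = 0;
    for (int i = fail_high_slot + 1; i < nmoves; i++)
        wasted += nodes_per_move[i];
    return wasted;
}
```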
(3) Synchronization. Locks are generally used. It is possible to write lockless code, and various algorithms are given in parallel programming books, assuming you can guarantee the order of memory reads and writes. That is not true for all processors today, and there is no guarantee it will remain true even for Intel in the future, so it is dangerous. But in any case, locks are sequential by definition and factor directly into Amdahl's law, where the first two items do not necessarily.
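One well-known lockless scheme that sidesteps the ordering problem for transposition tables specifically is the XOR trick (Hyatt/Mann): store key ^ data instead of the raw key, so a torn entry fails the probe check and reads need no lock. A minimal sketch:

```c
#include <assert.h>
#include <stdint.h>

/* Lockless transposition-table entry: key_xor_data holds key ^ data.
   If a concurrent writer tears the entry (key half from one write,
   data half from another), the XOR check below fails and the probe
   is treated as a miss -- no lock required for safe reads. */
typedef struct {
    uint64_t key_xor_data;
    uint64_t data;
} TTEntry;

void tt_store(TTEntry *e, uint64_t key, uint64_t data) {
    e->key_xor_data = key ^ data;
    e->data = data;
}

/* Returns 1 and fills *out on a clean hit; 0 on a miss or torn entry. */
int tt_probe(const TTEntry *e, uint64_t key, uint64_t *out) {
    uint64_t d = e->data;
    if ((e->key_xor_data ^ d) == key) {
        *out = d;
        return 1;
    }
    return 0;
}
```

Note this only protects the table itself; locks (and their serial cost) remain for other shared structures such as split blocks.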
(4) Hardware. If you share things and they end up close together in memory, they can share a cache block, and you can generate a ton of cache-coherency traffic keeping up with who has the most current copy of something. This is difficult to measure unless you have some software tools; AMD has some slick stuff, but it is non-public, or at least was the last time they ran it for me. There are other issues as well: memory conflicts, bus conflicts, and whether each processor has its own cache (L1 and L2), a local L1 and a shared L2, or a local L1 and an L2 shared between 2 or 4 cores. Each changes performance by adding some unknown serialization.
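The usual mitigation for the shared-cache-block problem is to pad hot per-thread data out to a full cache line; a sketch, assuming the common 64-byte line size (verify for your part):

```c
#include <assert.h>

/* 64 bytes is typical for current x86 hardware; check your CPU. */
#define CACHE_LINE 64

/* Per-thread node counter padded to its own cache line, so two cores
   updating adjacent counters never ping-pong the same line between
   their caches ("false sharing"). */
typedef struct {
    unsigned long long nodes;
    char pad[CACHE_LINE - sizeof(unsigned long long)];
} PaddedCounter;
```

The cost is wasted memory per counter, which is almost always a good trade against coherency traffic on a hot update path.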
There's a lot to this, as can be seen.
Then there is non-deterministic behaviour, where one time a thread just has time to store something in the hash before another thread needs it, and the next time it is just too late, so the tree changes if you repeat the search. One thread "gets ahead" since moves don't have to be searched one at a time, and you occasionally get the infamous "super-linear speedup". And of course different threads update things in different orders, which further alters the speedup data.