Evert wrote:BBauer wrote:
NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.
Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.
A parallel search can give you new and different information about the shape of the tree, because you can direct the search as a function of information that you gather. One might argue that even a turing machine might gather the same information. But perhaps the amount of parallel information we gather can shape the tree more and more quickly so that having hundreds or thousands or millions of threads can allow us to prune in ways that we cannot prune when we have a single thread of execution.
Consider also the approach of having some threads reading forward nodes. It is a simple innovation but it seems to offer measurable benefit. Similarly, after making a move, most chess moves do not ponder on the opponent's move but guess what the opponent will do and ponder on the response to the move it is assumed that the opponent will make. Parallel threads allow this sort of thing on a much bigger scale. If there is a real effect there, then perhaps it can be exploited.
I ask also to consider the enormous difference between:
"This is a hard limit on how much a search can speed up."
and
"This might be a hard limit on how much a search can speed up."
If there is some way to exploit the information gained in parallel search to reduce the branching factor, then parallel search could conceivably offer even more benefit than the SMP loss (e.g. run superlinear speedup). Before alpha-beta was invented, searching only the square root of the nodes of the tree probably sounded absurd. But by exploiting knowledge gathered in the search, it was possible to reduce the branching factor.
Now, let's look at the other side of the coin. If the search is implemented improperly or inefficiently, the NPS might be unaffected but the search speed is not increased as expected. In this case the NPS told us nothing about the search speed.
Example:
We start 12 threads on a 12 core machine. Each thread has its own hash table and searches the same root position. In some sense, the search is "correct" because eventually it will produce the right answer. But it is not appreciably faster than a sequential search (sometimes due to randomness, the right answer might be found faster). The NPS is 12x the 1 core search and the Elo is approximately the same.
Now, I will admit the following:
If we do not exploit special knowledge gathered from the parallel search in some way that is not equivalent in the original search, then the parallel search cannot be faster on average than the sequential search but must be slower {or equal at best with perfect search} due to SMP loss.
It might be true that some simple modification like searching the right advanced nodes (though a serial version could do this as well) might experience speedup from the right sort of "speculation".
After all LMR and other pruning (and extensions) are all different forms of speculation [based on knowledge collected while searching]. Some forms of speculation might be easier in parallel.
It is even possible that some forms of speculation possible in parallel search are not possible in serial search or are so inefficient in serial search that they do not work as well.
So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?
Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.
Monte-Carlo seems to work well in GO.