Stockfish still scales poorly?

Evert · Post by **Evert** » Wed Feb 25, 2015 10:44 am

BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.

I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.

Dann Corbit · Post by **Dann Corbit** » Wed Feb 25, 2015 11:55 am

Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

A parallel search can give you new and different information about the shape of the tree, because you can direct the search as a function of information that you gather. One might argue that even a turing machine might gather the same information. But perhaps the amount of parallel information we gather can shape the tree more and more quickly so that having hundreds or thousands or millions of threads can allow us to prune in ways that we cannot prune when we have a single thread of execution.

Consider also the approach of having some threads reading forward nodes. It is a simple innovation but it seems to offer measurable benefit. Similarly, after making a move, most chess moves do not ponder on the opponent's move but guess what the opponent will do and ponder on the response to the move it is assumed that the opponent will make. Parallel threads allow this sort of thing on a much bigger scale. If there is a real effect there, then perhaps it can be exploited.

I ask also to consider the enormous difference between:
"This is a hard limit on how much a search can speed up."
and
"This might be a hard limit on how much a search can speed up."

If there is some way to exploit the information gained in parallel search to reduce the branching factor, then parallel search could conceivably offer even more benefit than the SMP loss (e.g. run superlinear speedup). Before alpha-beta was invented, searching only the square root of the nodes of the tree probably sounded absurd. But by exploiting knowledge gathered in the search, it was possible to reduce the branching factor.

Now, let's look at the other side of the coin. If the search is implemented improperly or inefficiently, the NPS might be unaffected but the search speed is not increased as expected. In this case the NPS told us nothing about the search speed.

Example:
We start 12 threads on a 12 core machine. Each thread has its own hash table and searches the same root position. In some sense, the search is "correct" because eventually it will produce the right answer. But it is not appreciably faster than a sequential search (sometimes due to randomness, the right answer might be found faster). The NPS is 12x the 1 core search and the Elo is approximately the same.

Now, I will admit the following:
If we do not exploit special knowledge gathered from the parallel search in some way that is not equivalent in the original search, then the parallel search cannot be faster on average than the sequential search but must be slower {or equal at best with perfect search} due to SMP loss.

It might be true that some simple modification like searching the right advanced nodes (though a serial version could do this as well) might experience speedup from the right sort of "speculation".

After all LMR and other pruning (and extensions) are all different forms of speculation [based on knowledge collected while searching]. Some forms of speculation might be easier in parallel.
It is even possible that some forms of speculation possible in parallel search are not possible in serial search or are so inefficient in serial search that they do not work as well.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.

Monte-Carlo seems to work well in GO.

Michel · Post by **Michel** » Wed Feb 25, 2015 1:35 pm

Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.

I think this is essentially Bob Hyatt's argument. He has another argument which I actually like more. A SMP search with N threads can be emulated using virtual threads with a single threaded search which runs on a clock which is N times faster.

So this shows that any (hypothetical) super speedup with an SMP search can be realized using a single threaded search as well.

bob · Post by **bob** » Wed Feb 25, 2015 7:01 pm

Dann Corbit wrote:
bob wrote:But no matter, NPS speedup is an upper bound on parallel search speedup, which is an upper bound on Elo improvement.
I don't think this is proven. Now, I *do* agree that 32 threads going full tilt cannot exceed the speed of one thread going full tilt multiplied by 32. That is simple mathematics.

But I do not agree that the Elo gain has the same bound. It seems conceivable to me that the information gained by some of the searching threads could guide the search of some other threads in some intelligent way to increase the effectiveness of the search super-linearly.

We already know that we do not have to examine all the nodes of the search tree to get a perfect answer. The analysis of the alpha-beta search algorithm offers a formal proof of that.

If it were true, for instance, that every node had to be examined, then SMP scaling would have a hard limit even with 100% utilization and no two theads searching the same node. But we already know that a perfect search can search as few as the square root of nodes remaining times some constant.

If there is a formal proof that multiple threads of execution cannot search better than a single thread (less SMP loss) times the thread count then I am wrong. But I have never seen such a proof. And a room full of experts that all believe such an algorithm cannot exist is not a proof to me.

Not that I am claiming to have such an algorithm either.

I do know that there are many SMP implementations that scale badly not in NPS but in Elo as a function of processor count. So if they relied on NPS as a measure of improvement, then they would be measuring the wrong thing. And when cores scaled up, they would get a nasty surprise.

Simply stated, "super-linear" performance is an absolute myth. Any good parallel programming book will explain this clearly. So let's stay out of the mythical/mystical and stick to reality. No super-linear speedup is possible, except for the occasional case, but not on average. If it happens consistently, the serial algorithm is broken and can be fixed to perform better itself, eliminating the super-liner behavior.

The "formal proof" you want is trivial. If it is better to search things in parallel, then you can do that in a serial search by multiplexing. Just divide the tree into pieces, just like the parallel search, then search one node on each sub-tree and repeat.

This is parallel programming 101...

bob · Post by **bob** » Wed Feb 25, 2015 7:02 pm

BBauer wrote:bob wrote:
But no matter, NPS speedup is an upper bound on parallel search speedup, which is an upper bound on Elo improvement.

NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.

Kind regards
Bernhard

It certainly is. If you search no faster in terms of NPS, you will get zero Elo gain...

I don't see where all these mystical properties are coming from...

bob · Post by **bob** » Wed Feb 25, 2015 7:05 pm

Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.

You are exactly correct. But it is even simpler. If you could consistently get a super-linear speedup because your move ordering is broken, either (a) fix the move ordering or (b) multiplex the search so that you search the same tree in serial mode that you do in the parallel search.

bob · Post by **bob** » Wed Feb 25, 2015 7:07 pm

Dann Corbit wrote:
Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.
A parallel search can give you new and different information about the shape of the tree, because you can direct the search as a function of information that you gather. One might argue that even a turing machine might gather the same information. But perhaps the amount of parallel information we gather can shape the tree more and more quickly so that having hundreds or thousands or millions of threads can allow us to prune in ways that we cannot prune when we have a single thread of execution.

Consider also the approach of having some threads reading forward nodes. It is a simple innovation but it seems to offer measurable benefit. Similarly, after making a move, most chess moves do not ponder on the opponent's move but guess what the opponent will do and ponder on the response to the move it is assumed that the opponent will make. Parallel threads allow this sort of thing on a much bigger scale. If there is a real effect there, then perhaps it can be exploited.

I ask also to consider the enormous difference between:
"This is a hard limit on how much a search can speed up."
and
"This might be a hard limit on how much a search can speed up."

If there is some way to exploit the information gained in parallel search to reduce the branching factor, then parallel search could conceivably offer even more benefit than the SMP loss (e.g. run superlinear speedup). Before alpha-beta was invented, searching only the square root of the nodes of the tree probably sounded absurd. But by exploiting knowledge gathered in the search, it was possible to reduce the branching factor.

Now, let's look at the other side of the coin. If the search is implemented improperly or inefficiently, the NPS might be unaffected but the search speed is not increased as expected. In this case the NPS told us nothing about the search speed.

Example:
We start 12 threads on a 12 core machine. Each thread has its own hash table and searches the same root position. In some sense, the search is "correct" because eventually it will produce the right answer. But it is not appreciably faster than a sequential search (sometimes due to randomness, the right answer might be found faster). The NPS is 12x the 1 core search and the Elo is approximately the same.

Now, I will admit the following:
If we do not exploit special knowledge gathered from the parallel search in some way that is not equivalent in the original search, then the parallel search cannot be faster on average than the sequential search but must be slower {or equal at best with perfect search} due to SMP loss.

It might be true that some simple modification like searching the right advanced nodes (though a serial version could do this as well) might experience speedup from the right sort of "speculation".

After all LMR and other pruning (and extensions) are all different forms of speculation [based on knowledge collected while searching]. Some forms of speculation might be easier in parallel.
It is even possible that some forms of speculation possible in parallel search are not possible in serial search or are so inefficient in serial search that they do not work as well.
So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.
Monte-Carlo seems to work well in GO.

If you can discover this "special knowledge" in the parallel search, you can ALSO discover the same thing in the serial search. Just multiplex the search so that you divide the tree into pieces just like the parallel search, but then serially search one node on each sub-tree and then repeat. Now you have the SAME information you had from the parallel search. Down-side is you just broke alpha/beta and the tree gets bigger.

bob · Post by **bob** » Wed Feb 25, 2015 7:11 pm

Michel wrote:
Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.
I think this is essentially Bob Hyatt's argument. He has another argument which I actually like more. A SMP search with N threads can be emulated using virtual threads with a single threaded search which runs on a clock which is N times faster.

So this shows that any (hypothetical) super speedup with an SMP search can be realized using a single threaded search as well.

Glad someone actually reads. That is parallel programming 101. Super-linear speedup shows an underlying flaw in the basic algorithm that can be exploited by the serial multiplexing just as well as the parallel search. Now move this program to the parallel machine, and rather than multiplexing, you do those sub-trees in parallel. The only gain you have is the gain produced by searching at a higher NPS. Again, NPS is an absolute upper bound, unless you do something stupid and intentionally break the serial search to make it perform worse. Speedup is ALWAYS defined as "best sequential algorithm time on one processor divided by best parallel algorithm on multiple processors." Not "broken sequential algorithm divided by parallel algorithm".

Yet it seems we can't get off of that pot here...

Dann Corbit · Post by **Dann Corbit** » Wed Feb 25, 2015 8:31 pm

Michel wrote:
Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.
I think this is essentially Bob Hyatt's argument. He has another argument which I actually like more. A SMP search with N threads can be emulated using virtual threads with a single threaded search which runs on a clock which is N times faster.

So this shows that any (hypothetical) super speedup with an SMP search can be realized using a single threaded search as well.

The fastest parallel sorting algorithms are different than the fastest serial sorting algorithms because you can do things in parallel that are not efficiently done serially.

Yes, you could do things serially like look ahead with your one thread and this is even done with pondering. But it is a risky strategy for a single threaded app and I doubt if anyone is doing it.

bob · Post by **bob** » Wed Feb 25, 2015 8:46 pm

Dann Corbit wrote:
Michel wrote:
Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.
I think this is essentially Bob Hyatt's argument. He has another argument which I actually like more. A SMP search with N threads can be emulated using virtual threads with a single threaded search which runs on a clock which is N times faster.

So this shows that any (hypothetical) super speedup with an SMP search can be realized using a single threaded search as well.
The fastest parallel sorting algorithms are different than the fastest serial sorting algorithms because you can do things in parallel that are not efficiently done serially.

Yes, you could do things serially like look ahead with your one thread and this is even done with pondering. But it is a risky strategy for a single threaded app and I doubt if anyone is doing it.

IF it would work for a parallel algorithm, EVERYBODY would be doing it in the serial algorithm.

Your sort statement is not exactly correct. The fastest sort algorithms are always going to be N log N. But those are hard to do in parallel due to the synchronization while sub-splitting the lists. But going to something that works better in parallel takes a big performance hit (for example, a simple bubble sort scales really well in terms of processors, but it starts off at N*N vs N log N. It therefore starts off significantly worse than the N log N sorts.

Alpha/Beta is sqrt(W^D) basically. Even 1024 processors doesn't make minimax overtake alpha/beta, and minimax is the most efficient parallel search one can do since everything has to be searched.

For example, given W^D of 10 billion (pretty easy to search today even on one CPU)

alpha/beta = 100K

minimax = 10B. How many processors needed to reduce 10B to 100K (the break-even point)? 100,000. Not doable in any affordable machine today or in 5 years. So alpha/beta rules the roost, and traditional parallel search rules alpha/beta. And note that 100K processors just breaks even with alpha/beta, it would be no faster or better. To even 2x faster would require 200K processors which is even more "out there". A good parallel search today can certainly produce 8x faster with 16 cores. Minimax would be utterly hopeless.

Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?