Stockfish still scales poorly?

Dann Corbit · Post by **Dann Corbit** » Wed Feb 25, 2015 9:24 pm

bob wrote:
Dann Corbit wrote:
Michel wrote:
Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.
I think this is essentially Bob Hyatt's argument. He has another argument which I actually like more. A SMP search with N threads can be emulated using virtual threads with a single threaded search which runs on a clock which is N times faster.

So this shows that any (hypothetical) super speedup with an SMP search can be realized using a single threaded search as well.
The fastest parallel sorting algorithms are different than the fastest serial sorting algorithms because you can do things in parallel that are not efficiently done serially.

Yes, you could do things serially like look ahead with your one thread and this is even done with pondering. But it is a risky strategy for a single threaded app and I doubt if anyone is doing it.
IF it would work for a parallel algorithm, EVERYBODY would be doing it in the serial algorithm.

Crafty does not ponder the opponent's move after making a move. Instead, crafty ponders on the expected response. This is an example of speculation. You told me that you got a speedup from that. Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.

Your sort statement is not exactly correct. The fastest sort algorithms are always going to be N log N. But those are hard to do in parallel due to the synchronization while sub-splitting the lists. But going to something that works better in parallel takes a big performance hit (for example, a simple bubble sort scales really well in terms of processors, but it starts off at N*N vs N log N. It therefore starts off significantly worse than the N log N sorts.

The fastest serial comparison sorts are O(N*log(N)).
Bitonic Sort is a so-so sort on serial machines, but one of the fastest on parallel machines because of the difference in architecture.
Sample Sort is generally considered the fastest sort on SMP machines, but it is not the fastest sort on serial machines.

Alpha/Beta is sqrt(W^D) basically. Even 1024 processors doesn't make minimax overtake alpha/beta, and minimax is the most efficient parallel search one can do since everything has to be searched.

For example, given W^D of 10 billion (pretty easy to search today even on one CPU)

alpha/beta = 100K

minimax = 10B. How many processors needed to reduce 10B to 100K (the break-even point)? 100,000. Not doable in any affordable machine today or in 5 years. So alpha/beta rules the roost, and traditional parallel search rules alpha/beta. And note that 100K processors just breaks even with alpha/beta, it would be no faster or better. To even 2x faster would require 200K processors which is even more "out there". A good parallel search today can certainly produce 8x faster with 16 cores. Minimax would be utterly hopeless.

I agree that minimax is the wrong algorithm for parallel search in chess.

michiguel · Post by **michiguel** » Wed Feb 25, 2015 9:48 pm

Michel wrote:
Evert wrote:
BBauer wrote: NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.
I've seen this argument before, but no one has ever given an explanation I can understand for this claim.

Suppose that we have a super-speedup Elo improvement from a parallel search. Where does this improvement come from? The only possibility I can see is that it comes from the parallel search examining nodes that the single thread search did not examine, but should have (counter argument: if it didn't have to examine them, then the parallel search would not have needed to examine them either). In other words, the parallel search has a super-speedup Elo improvement because it fixes a defect in the normal linear search. If you would fix that defect, the super-parallel improvement would go away.

So as far as I can see, yes, you can have super-speedup Elo gain from a parallel search, but only if the non-parallel search is sub-optimal. Can someone explain why that wouldn't be true?

Note that saying that the non-parallel search is sub-optimal and can be improved doesn't imply that it would be in any way easy to do. Relying on the non-determinism of a parallel search to repair defects in the normal search sounds unreliable though, and I would also qualify it as bad engineering.
I think this is essentially Bob Hyatt's argument. He has another argument which I actually like more. A SMP search with N threads can be emulated using virtual threads with a single threaded search which runs on a clock which is N times faster.

So this shows that any (hypothetical) super speedup with an SMP search can be realized using a single threaded search as well.

1) This is true in an ideal Universe with no friction. But we know that emulating this could bring more problems than solutions.

2) Only true if you double the threads without doubling the resources (memory). If you double the memory, then there are algorithms that finish in less than half the time. Probably Dan, which is an expert on it, can give you sorting examples. Each CPU comes with a cache, so... it is theoretically possible to have this behavior for a given algorithm if the cache memory is intrinsically critical.

Miguel

syzygy · Post by **syzygy** » Wed Feb 25, 2015 11:10 pm

Dann Corbit wrote:
bob wrote:But no matter, NPS speedup is an upper bound on parallel search speedup, which is an upper bound on Elo improvement.
I don't think this is proven. Now, I *do* agree that 32 threads going full tilt cannot exceed the speed of one thread going full tilt multiplied by 32. That is simple mathematics.

But I do not agree that the Elo gain has the same bound. It seems conceivable to me that the information gained by some of the searching threads could guide the search of some other threads in some intelligent way to increase the effectiveness of the search super-linearly.

Anything can happen in a particular instance, but it's the average that counts and on the average it won't.

It's similar to entropy: it's a law of nature that it always increases, but in a particular instance that does not need to happen at all. It might just be your lucky day where one billion atoms happen to line up neatly.

But we already know that a perfect search can search as few as the square root of nodes remaining times some constant.

We know that you can't do better than the minimal tree.

If there is a formal proof that multiple threads of execution cannot search better than a single thread (less SMP loss) times the thread count then I am wrong. But I have never seen such a proof. And a room full of experts that all believe such an algorithm cannot exist is not a proof to me.

With many threads you still can't do better than the minimal tree. (And it will be extremely hard to approach that minimal tree.)

I do know that there are many SMP implementations that scale badly not in NPS but in Elo as a function of processor count. So if they relied on NPS as a measure of improvement, then they would be measuring the wrong thing. And when cores scaled up, they would get a nasty surprise.

Nobody is saying that NPS is all that counts. The point is that NPS speedup bounds what you can hope for.

syzygy · Post by **syzygy** » Wed Feb 25, 2015 11:17 pm

Dann Corbit wrote:Now, I ask the kind audience to engage in a mind experiment with me. We are going to visualize something. Consider a CPU collection with as many real CPU cores as we see in today's high end video cards. We have literally tens of thousands of high energy threads at our disposal. (...)

You actually believe that there might just be an algorithm (we only have to find it!) that would turn that machine into a better chess-playing entity than a single-threaded engine on a single CPU core that is running at tens of thousands times the speed of each of your threads?

Not very realistic.

syzygy · Post by **syzygy** » Wed Feb 25, 2015 11:32 pm

Dann Corbit wrote:Don Homan does the same thing in his parallel search and he saw a speedup. But people do not do that in the normal alpha-beta serial search.

He certainly did not see a superlinear speedup.

I have no problem accepting that there might be good approaches to multi-threaded searching that to some extent leave the "try to mimick the serial search as best as we can"-approach. If a particular smp implementation happens to reach max Elo at 55 cores, because with 56 cores the extra synchronisation overhead outweighs the benefit of that extra core, then doing anything else with that core that makes remotely sense can only be an improvement. But you're just not going to get the performance of a single core search at 56x the speed.

Dann Corbit · Post by **Dann Corbit** » Wed Feb 25, 2015 11:56 pm

syzygy wrote:
Dann Corbit wrote:Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?
No, but a parallel search that leaves a lot of potential untapped leaves a lot of potential untapped.

It seems people are thinking that Joona's patch is being criticised. It is not.

Before Joona's patch, most people were aware that SF scales poorly when going from 8 to 16 cores and that this in particular shows in the poor scaling of NPS. This means there is a treasure waiting to be found: one that fixes NPS scaling.

Joona comes up with a patch that seems to improve SMP significantly. But it does not fix NPS scaling. It means a treasure was found but that it is not the treasure we knew existed. That one is still hidden and waiting to be found.

I am neither criticizing nor defending Joona's patch.

I attempted to say that examination of NPS is not enough to know that an SMP implementation scales well.

petero2 · Post by **petero2** » Wed Feb 25, 2015 11:57 pm

syzygy wrote:
Dann Corbit wrote:But we already know that a perfect search can search as few as the square root of nodes remaining times some constant.
We know that you can't do better than the minimal tree.

How is the minimal tree defined in the presence of LMR guided by the history heuristic?

Dann Corbit · Post by **Dann Corbit** » Thu Feb 26, 2015 12:04 am

syzygy wrote:
Dann Corbit wrote:Now, I ask the kind audience to engage in a mind experiment with me. We are going to visualize something. Consider a CPU collection with as many real CPU cores as we see in today's high end video cards. We have literally tens of thousands of high energy threads at our disposal. (...)
You actually believe that there might just be an algorithm (we only have to find it!) that would turn that machine into a better chess-playing entity than a single-threaded engine on a single CPU core that is running at tens of thousands times the speed of each of your threads?

Not very realistic.

The assumption (which implies speedup is not possible) is that no special information can be gathered from parallel search. We already know that that assumption is not true for sorting. Perhaps it is true or not true for tree search. Your hand waving is not convincing to me, despite the chorus of agreement from others.

syzygy · Post by **syzygy** » Thu Feb 26, 2015 12:24 am

Dann Corbit wrote:I attempted to say that examination of NPS is not enough to know that an SMP implementation scales well.

But knowing that NPS scales badly is certainly enough to conclude that the SMP implementation does not scale well Elo-wise.

bob · Post by **bob** » Thu Feb 26, 2015 1:15 am

petero2 wrote:
syzygy wrote:
Dann Corbit wrote:But we already know that a perfect search can search as few as the square root of nodes remaining times some constant.
We know that you can't do better than the minimal tree.
How is the minimal tree defined in the presence of LMR guided by the history heuristic?

The tree searched by one thread (a serial search). In reality you can search a smaller tree, by improving move ordering. But once you get move ordering right, that's as far as you can go. And in the context of a chess engine, perfect ordering is simply a concept, not a reality.

so

perfect-tree <= minimal-tree <= parallel-tree

In terms of size (number of nodes)

Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?

Re: Stockfish still scales poorly?