Stockfish still scales poorly?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Stockfish still scales poorly?

Post by lucasart »

jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.
Thanks. I'm glad at least one person understands the obvious.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.

One would hope this is rather obvious...
1. NPS gives an absolute upper bound on speedup. If you search 10M NPS with one cpu, and only 10M with 16 cpus, you will NEVER get any speedup, because you are doing no "extra work" during that search time. NPS is an important number because it provides this upper bound on performance.
This statement is false. NPS does not provide an upper bound on performance. Performance can only sanely be measured one way (via results), typically in the form of Elo. Using NPS as a metric to judge the effectiveness of code (which alters the search) is nonsense.
Please. This is going into never-never land. Parallel search is ONLY about speed. Nothing more, nothing less. "Smartness" is not related to parallel search. You can make a program smarter (or dumber) without even thinking about parallel search. If your NPS is only 1/2 of what it COULD be, then the SMP Elo gain is only 1/2 of what it COULD be. And there is absolutely nothing that will refute that statement; it is simply a cold, hard fact that ANYBODY doing actual parallel programming will agree with.
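Bob's "half the NPS, half the Elo gain" point can be put in rough numbers. The sketch below assumes the common rule of thumb that an effective doubling of search speed is worth somewhere around 50-70 Elo at long time controls; the 60 used here is an illustrative assumption, not a measured constant.

```python
import math

# Rough illustration: Elo value of an effective speedup S, assuming a
# (hypothetical) constant of ~60 Elo per doubling of effective speed.
def elo_from_speedup(speedup, elo_per_doubling=60.0):
    return elo_per_doubling * math.log2(speedup)

full = elo_from_speedup(8.0)   # ideal 8x effective speedup
half = elo_from_speedup(4.0)   # same hardware, but NPS at half its potential
print(round(full - half, 1))   # prints 60.0 -- one lost doubling of speed
```

Whatever the true Elo-per-doubling constant is, halving the achievable speedup costs exactly one doubling off the ceiling.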
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

lucasart wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.
Thanks. I'm glad at least one person understands the obvious.
I am glad to see only two people "don't have a clue" here. That would be the two of you.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

zullil wrote:
bob wrote: Turbo-boost disabled? If not that wrecks the numbers right off the bat.
Yes.
OK, with TB off, which is the only way you can measure pure SMP performance, those numbers look pretty bad. I've had similar results (hence the major changes in the current 25.0 version). I have this one main lock for the basic split operation, and I have been looking at getting rid of it, as the more threads you have, the more contention for that lock you have. I have already addressed one major issue with splitting, avoiding "log-jams" where there are lots of idle processors and lots of splits are attempted. This only gets worse as the number of cores climbs.

However, that said, I have no doubt there are platforms that produce good NPS scaling, and others that will do poorly. It is what makes this such a pain.
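The measurement being discussed (turbo boost off, identical fixed search, varying thread counts) boils down to tabulating NPS relative to the 1-thread baseline. A minimal sketch, with made-up sample numbers rather than any real engine's output:

```python
# Tabulate NPS scaling relative to the 1-thread baseline. The (threads, nps)
# samples below are placeholders, not measurements from any real engine.
def nps_scaling(samples):
    """samples: list of (threads, nps) pairs; must include the 1-thread entry."""
    base = dict(samples)[1]
    return [(t, nps, nps / base) for t, nps in sorted(samples)]

for threads, nps, ratio in nps_scaling([(1, 1.5e6), (4, 5.6e6), (8, 10.3e6), (16, 18.0e6)]):
    print(f"{threads:2d} threads: {nps / 1e6:5.1f} Mnps ({ratio:.2f}x)")
```

With turbo boost enabled, the 1-thread baseline would be inflated by a higher clock, which is exactly why it wrecks the ratios.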
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

bob wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.

One would hope this is rather obvious...
1. NPS gives an absolute upper bound on speedup. If you search 10M NPS with one cpu, and only 10M with 16 cpus, you will NEVER get any speedup, because you are doing no "extra work" during that search time. NPS is an important number because it provides this upper bound on performance.
This statement is false. NPS does not provide an upper bound on performance. Performance can only sanely be measured one way (via results), typically in the form of Elo. Using NPS as a metric to judge the effectiveness of code (which alters the search) is nonsense.
Please. This is going into never-never land. Parallel search is ONLY about speed. Nothing more, nothing less. "Smartness" is not related to parallel search. You can make a program smarter (or dumber) without even thinking about parallel search. If your NPS is only 1/2 of what it COULD be, then the SMP Elo gain is only 1/2 of what it COULD be. And there is absolutely nothing that will refute that statement; it is simply a cold, hard fact that ANYBODY doing actual parallel programming will agree with.
Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?

More nodes is one way to make a program stronger.
Better eval is another way.
The recent parallel search idea with having some threads examine forward nodes is a clever riff on the standard methodologies.

I would say that a parallel search is only as good as the results it achieves.

The better the search is implemented, and the better it scales as the thread count rises, the more valuable it is to the goal of Elo (which is, after all, the only sensible goal for improving a chess program's playing strength).

I can write a chess program that has 32 threads. 31 of those threads will search this position for eternity:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
It might have stupendous NPS, but it will rapidly lose any Elo gain as we move away from the root node.
The distribution of work for the threads is clearly very important for the efficiency of the search.
So new ideas like those put forth by Dan Homan and the Stockfish team are clearly worthy of careful scrutiny.

And if the new ideas bring more Elo, then they are better ideas than the old ones.

IMO-YMMV
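Dann's 31-idle-threads example is easy to caricature in code: a toy "engine" whose extra threads generate an enormous node count without contributing anything to the move choice. Everything below is illustrative, not a real search.

```python
import threading

# Toy version of the degenerate searcher: 31 threads "search" the start
# position over and over, inflating the node counter without ever helping
# the move decision. NPS scales beautifully; Elo would not move at all.
node_count = 0
lock = threading.Lock()

def busy_search(iterations):
    global node_count
    local = 0
    for _ in range(iterations):
        local += 1  # pretend to visit a node at the root position
    with lock:
        node_count += local

workers = [threading.Thread(target=busy_search, args=(100_000,)) for _ in range(31)]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(node_count)  # prints 3100000 -- none of these nodes improved anything
```

This is the sense in which raw NPS can be gamed: the counter measures work done, not work that mattered.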
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
bob wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.

One would hope this is rather obvious...
1. NPS gives an absolute upper bound on speedup. If you search 10M NPS with one cpu, and only 10M with 16 cpus, you will NEVER get any speedup, because you are doing no "extra work" during that search time. NPS is an important number because it provides this upper bound on performance.
This statement is false. NPS does not provide an upper bound on performance. Performance can only sanely be measured one way (via results), typically in the form of Elo. Using NPS as a metric to judge the effectiveness of code (which alters the search) is nonsense.
Please. This is going into never-never land. Parallel search is ONLY about speed. Nothing more, nothing less. "Smartness" is not related to parallel search. You can make a program smarter (or dumber) without even thinking about parallel search. If your NPS is only 1/2 of what it COULD be, then the SMP Elo gain is only 1/2 of what it COULD be. And there is absolutely nothing that will refute that statement; it is simply a cold, hard fact that ANYBODY doing actual parallel programming will agree with.
Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?

More nodes is one way to make a program stronger.
Better eval is another way.
The recent parallel search idea with having some threads examine forward nodes is a clever riff on the standard methodologies.

I would say that a parallel search is only as good as the results it achieves.

The better the search is implemented, and the better it scales as the thread count rises, the more valuable it is to the goal of Elo (which is, after all, the only sensible goal for improving a chess program's playing strength).

I can write a chess program that has 32 threads. 31 of those threads will search this position for eternity:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
It might have stupendous NPS, but it will rapidly lose any Elo gain as we move away from the root node.
The distribution of work for the threads is clearly very important for the efficiency of the search.
So new ideas like those put forth by Dan Homan and the Stockfish team are clearly worthy of careful scrutiny.

And if the new ideas bring more Elo, then they are better ideas than the old ones.

IMO-YMMV
I've explained this a dozen times. There are two parts to this.

Part 1, doing more work in a fixed interval of time. If you double the NPS, you are doing twice as much work and you MIGHT search twice as fast, or you might search no faster. You will NOT search more than twice as fast however. Getting the NPS up is one issue in parallel search. Lock contention is a big problem. Sharing too much data in a single cache block is a big problem. There are many others. These things prevent decent NPS scaling. So for starters, until you get a decent NPS scaling, there is no point in worrying much about the parallel search part of the issue. The more work you do per unit of time, the more potential performance benefit you have.

Part 2, using that potential work speed is a different issue. If your NPS is 16x faster with N cores (N >= 16 of course) then you can potentially search 16x faster with the parallel search. But that NPS number is an absolute upper bound on SMP improvement. So once you get the NPS up to a decent level, then you dive into better ways of splitting, better ways to reduce synchronization conflicts, better ways to avoid memory / cache interference. And now you see a true SMP speedup. But it is guaranteed to conform to this relationship:

SMP speedup <= NPS speedup

No exceptions. Both are important. If your NPS speedup is bad, your SMP speedup will be worse (since SMP speedup will never be perfect due to move ordering issues). Both are equally important, and they are affected by different parts of the program and hardware components.

So please stop trying to turn the discussion to nonsensical directions. Again, if NPS speedup is bad, SMP speedup is guaranteed to be worse. Period. As I said, NPS is an upper bound. In your rather ridiculous example, the above is still true. NPS = 32X, SMP < 32X.

But I believe most will see that example is nonsensical. Feel free to find ANY example that violates the NPS limit mentioned. NPS speedup is a hard upper bound, just like C in physics. Only thing is, unlike C, we can change the NPS speedup with clever (or not-so-clever) programming tricks. Having the largest possible upper bound is a clear advantage.
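The two quantities in Bob's inequality are measured differently: NPS speedup is a throughput ratio, while SMP speedup is a time-to-depth ratio. A sketch with invented numbers (no real engine measured here):

```python
# Illustrative numbers only. NPS speedup caps SMP speedup because reaching a
# depth sooner requires extra throughput, and the parallel search always burns
# some of that throughput on redundant work (search overhead).
def nps_speedup(nps_1, nps_n):
    """Raw throughput ratio: nodes/second with N threads vs. one thread."""
    return nps_n / nps_1

def smp_speedup(ttd_1, ttd_n):
    """Time-to-depth ratio: how much sooner the same depth is reached."""
    return ttd_1 / ttd_n

nps = nps_speedup(2.0e6, 24.0e6)  # 16 cores: 12x the nodes per second...
smp = smp_speedup(120.0, 14.1)    # ...but the same depth arrives ~8.5x sooner
assert smp <= nps                 # the bound under discussion
overhead = nps / smp              # ~1.41x extra nodes spent by the SMP search
```

The ratio `nps / smp` is the search overhead: how many extra nodes the parallel search had to visit per useful node, which is why the SMP figure can never exceed the NPS figure.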
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

bob wrote:
Dann Corbit wrote:
bob wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.

One would hope this is rather obvious...
1. NPS gives an absolute upper bound on speedup. If you search 10M NPS with one cpu, and only 10M with 16 cpus, you will NEVER get any speedup, because you are doing no "extra work" during that search time. NPS is an important number because it provides this upper bound on performance.
This statement is false. NPS does not provide an upper bound on performance. Performance can only sanely be measured one way (via results), typically in the form of Elo. Using NPS as a metric to judge the effectiveness of code (which alters the search) is nonsense.
Please. This is going into never-never land. Parallel search is ONLY about speed. Nothing more, nothing less. "Smartness" is not related to parallel search. You can make a program smarter (or dumber) without even thinking about parallel search. If your NPS is only 1/2 of what it COULD be, then the SMP Elo gain is only 1/2 of what it COULD be. And there is absolutely nothing that will refute that statement; it is simply a cold, hard fact that ANYBODY doing actual parallel programming will agree with.
Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?

More nodes is one way to make a program stronger.
Better eval is another way.
The recent parallel search idea with having some threads examine forward nodes is a clever riff on the standard methodologies.

I would say that a parallel search is only as good as the results it achieves.

The better the search is implemented, and the better it scales as the thread count rises, the more valuable it is to the goal of Elo (which is, after all, the only sensible goal for improving a chess program's playing strength).

I can write a chess program that has 32 threads. 31 of those threads will search this position for eternity:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
It might have stupendous NPS, but it will rapidly lose any Elo gain as we move away from the root node.
The distribution of work for the threads is clearly very important for the efficiency of the search.
So new ideas like those put forth by Dan Homan and the Stockfish team are clearly worthy of careful scrutiny.

And if the new ideas bring more Elo, then they are better ideas than the old ones.

IMO-YMMV
I've explained this a dozen times. There are two parts to this.

Part 1, doing more work in a fixed interval of time. If you double the NPS, you are doing twice as much work and you MIGHT search twice as fast, or you might search no faster. You will NOT search more than twice as fast however. Getting the NPS up is one issue in parallel search. Lock contention is a big problem. Sharing too much data in a single cache block is a big problem. There are many others. These things prevent decent NPS scaling. So for starters, until you get a decent NPS scaling, there is no point in worrying much about the parallel search part of the issue. The more work you do per unit of time, the more potential performance benefit you have.

Part 2, using that potential work speed is a different issue. If your NPS is 16x faster with N cores (N >= 16 of course) then you can potentially search 16x faster with the parallel search. But that NPS number is an absolute upper bound on SMP improvement. So once you get the NPS up to a decent level, then you dive into better ways of splitting, better ways to reduce synchronization conflicts, better ways to avoid memory / cache interference. And now you see a true SMP speedup. But it is guaranteed to conform to this relationship:

SMP speedup <= NPS speedup

No exceptions. Both are important. If your NPS speedup is bad, your SMP speedup will be worse (since SMP speedup will never be perfect due to move ordering issues). Both are equally important, and they are affected by different parts of the program and hardware components.

So please stop trying to turn the discussion to nonsensical directions. Again, if NPS speedup is bad, SMP speedup is guaranteed to be worse. Period. As I said, NPS is an upper bound. In your rather ridiculous example, the above is still true. NPS = 32X, SMP < 32X.

But I believe most will see that example is nonsensical. Feel free to find ANY example that violates the NPS limit mentioned. NPS speedup is a hard upper bound, just like C in physics. Only thing is, unlike C, we can change the NPS speedup with clever (or not-so-clever) programming tricks. Having the largest possible upper bound is a clear advantage.
Re:"SMP speedup <= NPS speedup"
I see no guarantee of this, as far as search effectiveness.
There are known cases where SMP search happens to discover a result much more quickly than it "ought to" due to one of the many parallel threads stumbling upon the right node. There is no reason to assume that this would not happen more and more as we scale to thousands or millions of threads.

I think we can all agree that the number of nodes for SMP search will be less than or equal to (if it were somehow performed perfectly) the number of nodes for one thread times the number of threads. But searching the wrong nodes or even searching sub-optimal nodes reduces the effectiveness (and hence the *correctness*) of the parallel search.

In my example, I was searching the wrong nodes (the root node for the game of chess). Now, even that stupid search will get some benefit from the additional threads of work. After all, given infinite time, they would eventually arrive at the perfect answer to the question "What is the best move, if any exists?" and as long as they are making plies of progress they can conceivably add information to the search. The problem with the bonehead search I described is that the threads are working sub-optimally. In that particular case it is obvious that the search is extremely sub-optimal because of redundant search of nodes that become less and less salient.

In a similar way, however, all parallel search methods must aim to search "the right" nodes, and the better they are at achieving that goal, the quicker their "time to ply" will be. In a perfect world, each thread would search nodes that none of the other threads has even bothered with, and those would be the most important of all the unsearched nodes to analyze.

Clearly, the way that we dole out work is crucial to optimal searching.

My example is a proof to any thinking person that the NPS is not nearly so important as the way the effort is applied (since it is clear that very high NPS can be reached by my trivial search, though it is not effective). Hence, some measure other than NPS is obviously more important.

Gain in Elo is (I think) the most logical measure, though it is much harder to collect than NPS or time to ply or other extremely simple measures. On the other hand, time to ply is irrelevant if we choose the wrong node when we are done. And a trillion NPS search that chooses bad moves is a bad search.
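Measuring the Elo gain Dann favors comes down to converting a match score into a rating difference with the standard logistic model. A minimal sketch; the game counts are invented:

```python
import math

# Elo difference implied by a match score (standard logistic rating model).
def elo_diff(wins, losses, draws):
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return -400.0 * math.log10(1.0 / score - 1.0)

# A 55% score over 1000 games, e.g. after an SMP patch:
print(round(elo_diff(400, 300, 300), 1))  # prints 34.9
```

In practice one also needs error bars or a sequential test (e.g. SPRT, as the Stockfish framework uses), since a point estimate from a modest game sample is noisy; that is exactly why Elo is "much harder to collect" than NPS.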
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
bob wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.

One would hope this is rather obvious...
1. NPS gives an absolute upper bound on speedup. If you search 10M NPS with one cpu, and only 10M with 16 cpus, you will NEVER get any speedup, because you are doing no "extra work" during that search time. NPS is an important number because it provides this upper bound on performance.
This statement is false. NPS does not provide an upper bound on performance. Performance can only sanely be measured one way (via results), typically in the form of Elo. Using NPS as a metric to judge the effectiveness of code (which alters the search) is nonsense.
Please. This is going into never-never land. Parallel search is ONLY about speed. Nothing more, nothing less. "Smartness" is not related to parallel search. You can make a program smarter (or dumber) without even thinking about parallel search. If your NPS is only 1/2 of what it COULD be, then the SMP Elo gain is only 1/2 of what it COULD be. And there is absolutely nothing that will refute that statement; it is simply a cold, hard fact that ANYBODY doing actual parallel programming will agree with.
Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?

More nodes is one way to make a program stronger.
Better eval is another way.
The recent parallel search idea with having some threads examine forward nodes is a clever riff on the standard methodologies.

I would say that a parallel search is only as good as the results it achieves.

The better the search is implemented, and the better it scales as the thread count rises, the more valuable it is to the goal of Elo (which is, after all, the only sensible goal for improving a chess program's playing strength).

I can write a chess program that has 32 threads. 31 of those threads will search this position for eternity:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
It might have stupendous NPS, but it will rapidly lose any Elo gain as we move away from the root node.
The distribution of work for the threads is clearly very important for the efficiency of the search.
So new ideas like those put forth by Dan Homan and the Stockfish team are clearly worthy of careful scrutiny.

And if the new ideas bring more Elo, then they are better ideas than the old ones.

IMO-YMMV
I've explained this a dozen times. There are two parts to this.

Part 1, doing more work in a fixed interval of time. If you double the NPS, you are doing twice as much work and you MIGHT search twice as fast, or you might search no faster. You will NOT search more than twice as fast however. Getting the NPS up is one issue in parallel search. Lock contention is a big problem. Sharing too much data in a single cache block is a big problem. There are many others. These things prevent decent NPS scaling. So for starters, until you get a decent NPS scaling, there is no point in worrying much about the parallel search part of the issue. The more work you do per unit of time, the more potential performance benefit you have.

Part 2, using that potential work speed is a different issue. If your NPS is 16x faster with N cores (N >= 16 of course) then you can potentially search 16x faster with the parallel search. But that NPS number is an absolute upper bound on SMP improvement. So once you get the NPS up to a decent level, then you dive into better ways of splitting, better ways to reduce synchronization conflicts, better ways to avoid memory / cache interference. And now you see a true SMP speedup. But it is guaranteed to conform to this relationship:

SMP speedup <= NPS speedup

No exceptions. Both are important. If your NPS speedup is bad, your SMP speedup will be worse (since SMP speedup will never be perfect due to move ordering issues). Both are equally important, and they are affected by different parts of the program and hardware components.

So please stop trying to turn the discussion to nonsensical directions. Again, if NPS speedup is bad, SMP speedup is guaranteed to be worse. Period. As I said, NPS is an upper bound. In your rather ridiculous example, the above is still true. NPS = 32X, SMP < 32X.

But I believe most will see that example is nonsensical. Feel free to find ANY example that violates the NPS limit mentioned. NPS speedup is a hard upper bound, just like C in physics. Only thing is, unlike C, we can change the NPS speedup with clever (or not-so-clever) programming tricks. Having the largest possible upper bound is a clear advantage.
Re:"SMP speedup <= NPS speedup"
I see no guarantee of this, as far as search effectiveness.
There are known cases where SMP search happens to discover a result much more quickly than it "ought to" due to one of the many parallel threads stumbling upon the right node. There is no reason to assume that this would not happen more and more as we scale to thousands or millions of threads.

I think we can all agree that the number of nodes for SMP search will be less than or equal to (if it were somehow performed perfectly) the number of nodes times the number of threads. But searching the wrong nodes or even searching sub-optimal nodes reduces the effectiveness (and hence the *correctness*) of the parallel search.

In my example, I was searching the wrong nodes (the root node for the game of chess). Now, even that stupid search will get some benefit from the additional threads of work. After all, given infinite time, they would eventually arrive at the perfect answer to the question "What is the best move, if any exists?" and as long as they are making plies of progress they can conceivably add information to the search. The problem with the bonehead search I described is that the threads are working sub-optimally. In that particular case it is obvious that the search is extremely sub-optimal because of redundant search of nodes that become less and less salient.

In a similar way, however, all parallel search methods must aim to search "the right" nodes, and the better they are at achieving that goal, the quicker their "time to ply" will be. In a perfect world, each thread would search nodes that none of the other threads has even bothered with, and those would be the most important of all the unsearched nodes to analyze.

Clearly, the way that we dole out work is crucial to optimal searching.

My example is a proof to any thinking person that the NPS is not nearly so important as the way the effort is applied (since it is clear that very high NPS can be reached by my trivial search, though it is not effective). Hence, some measure other than NPS is obviously more important.

Gain in Elo is (I think) the most logical measure, though it is much harder to collect than NPS or time to ply or other extremely simple measures. On the other hand, time to ply is irrelevant if we choose the wrong node when we are done. And a trillion NPS search that chooses bad moves is a bad search.
SMP speedup < NPS speedup is an absolute. If you get ZERO NPS speedup, exactly HOW will you get any sort of SMP speedup? If you search the same number of nodes in a fixed amount of time, how can you search any faster? If you eliminate nodes somehow, you can do that in the non-SMP search exactly the same way.

Second, "super-linear speedup" is an urban myth. It will happen on the occasional position, but you will also see MORE positions with a sub-optimal speedup. Overall, super-linear speedup is impossible. There is a ton of theoretical explanation of why this is true, and I have repeated it too many times over the past 20 years or so to want to rehash it again.

Searching MORE nodes than optimal alpha/beta is a bad idea. You are doing MORE than the optimal amount of search. You might win here, but you will lose many more times over there. You are not seriously arguing this, I hope?

Again, I did NOT say "NPS is that important". But it is IMPORTANT. If you don't search faster, parallel search offers exactly zero. The faster you search, the more POTENTIAL (not guaranteed) gain there is. Parallel search, once again, is NOT about smarts or anything else. It is about raw speed and searching the tree faster so that you can go deeper. This is not just true in chess; it is what parallel programming is all about. You can always argue the ridiculous and use multiple cores to search or play different games. But that is not the way to make the program play a single game any stronger. Feel free to describe any way you can get a faster parallel search than the NPS bound already given. And do NOT cite super-linear speedup, which is the exception rather than the rule. Overall, across a reasonable set of positions, super-linear speedup is not possible.

If you prefer, I can send you email addresses for some people that are actually involved, or have been involved in parallel search. Ken Thompson, Murray Campbell, Monty Newborn, Jonathan Schaeffer, to name 4 well-known people. All are going to tell you exactly the same thing.

Yes, you can have high NPS and a poor speedup. But NO, you can NOT have low NPS and a good speedup. The NPS speedup is the upper bound on the search speedup. This is really intuitive to anyone that writes this stuff...
Dann Corbit
Posts: 12541
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

bob wrote:
Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
bob wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.

One would hope this is rather obvious...
1. NPS gives an absolute upper bound on speedup. If you search 10M NPS with one cpu, and only 10M with 16 cpus, you will NEVER get any speedup, because you are doing no "extra work" during that search time. NPS is an important number because it provides this upper bound on performance.
This statement is false. NPS does not provide an upper bound on performance. Performance can only sanely be measured one way (via results), typically in the form of Elo. Using NPS as a metric to judge the effectiveness of code (which alters the search) is nonsense.
Please. This is going into never-never land. Parallel search is ONLY about speed. Nothing more, nothing less. "Smartness" is not related to parallel search. You can make a program smarter (or dumber) without even thinking about parallel search. If your NPS is only 1/2 of what it COULD be, then the SMP Elo gain is only 1/2 of what it COULD be. And there is absolutely nothing that will refute that statement; it is simply a cold, hard fact that ANYBODY doing actual parallel programming will agree with.
Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?

More nodes is one way to make a program stronger.
Better eval is another way.
The recent parallel search idea with having some threads examine forward nodes is a clever riff on the standard methodologies.

I would say that a parallel search is only as good as the results it achieves.

The better the search is implemented, and the better it scales as the thread count rises, the more valuable it is to the goal of Elo (which is, after all, the only sensible goal for improving a chess program's playing strength).

I can write a chess program that has 32 threads. 31 of those threads will search this position for eternity:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
It might have stupendous NPS, but it will rapidly lose any Elo gain as we move away from the root node.
The distribution of work for the threads is clearly very important for the efficiency of the search.
So new ideas like those put forth by Dan Homan and the Stockfish team are clearly worthy of careful scrutiny.

And if the new ideas bring more Elo, then they are better ideas than the old ones.

IMO-YMMV
I've explained this a dozen times. There are two parts to this.

Part 1, doing more work in a fixed interval of time. If you double the NPS, you are doing twice as much work and you MIGHT search twice as fast, or you might search no faster. You will NOT search more than twice as fast however. Getting the NPS up is one issue in parallel search. Lock contention is a big problem. Sharing too much data in a single cache block is a big problem. There are many others. These things prevent decent NPS scaling. So for starters, until you get a decent NPS scaling, there is no point in worrying much about the parallel search part of the issue. The more work you do per unit of time, the more potential performance benefit you have.

Part 2, using that potential work speed is a different issue. If your NPS is 16x faster with N cores (N >= 16 of course) then you can potentially search 16x faster with the parallel search. But that NPS number is an absolute upper bound on SMP improvement. So once you get the NPS up to a decent level, then you dive into better ways of splitting, better ways to reduce synchronization conflicts, better ways to avoid memory / cache interference. And now you see a true SMP speedup. But it is guaranteed to conform to this relationship:

SMP speedup <= NPS speedup

No exceptions. Both are important. If your NPS speedup is bad, your SMP speedup will be worse (since SMP speedup will never be perfect due to move ordering issues). Both are equally important, and they are affected by different parts of the program and hardware components.

So please stop trying to turn the discussion to nonsensical directions. Again, if NPS speedup is bad, SMP speedup is guaranteed to be worse. Period. As I said, NPS is an upper bound. In your rather ridiculous example, the above is still true. NPS = 32X, SMP < 32X.

But I believe most will see that example is nonsensical. Feel free to find ANY example that violates the NPS limit mentioned. NPS speedup is a hard upper bound, just like c (the speed of light) in physics. Only, unlike c, we can change the NPS speedup with clever (or not-so-clever) programming tricks. Having the largest possible upper bound is a clear advantage.
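The bound above can be checked with simple arithmetic. A minimal sketch, with all engine numbers invented for illustration: time-to-depth is nodes searched divided by NPS, and any search overhead (extra nodes the parallel search burns) pushes the SMP speedup below the NPS speedup.

```python
# Hypothetical measurements (not real engine data): a 16-core run vs.
# a 1-core run of the same engine reaching the same depth.
single = {"nps": 2_000_000, "nodes_to_depth": 600_000_000}
parallel = {"nps": 24_000_000, "nodes_to_depth": 900_000_000}  # overhead: extra nodes

nps_speedup = parallel["nps"] / single["nps"]            # 12.0
time_single = single["nodes_to_depth"] / single["nps"]   # 300 s to reach the depth
time_parallel = parallel["nodes_to_depth"] / parallel["nps"]  # 37.5 s
smp_speedup = time_single / time_parallel                # 8.0

# The parallel search reaches the same depth faster only by searching
# more nodes per second; search overhead eats into that, so the SMP
# speedup can never exceed the NPS speedup.
assert smp_speedup <= nps_speedup
print(f"NPS speedup: {nps_speedup:.1f}x, SMP speedup: {smp_speedup:.1f}x")
```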
Re:"SMP speedup <= NPS speedup"
I see no guarantee of this, as far as search effectiveness.
There are known cases where an SMP search happens to discover a result much more quickly than it "ought to" because one of the many parallel threads stumbles upon the right node. There is no reason to assume that this would not happen more and more as we scale to thousands or millions of threads.

I think we can all agree that the number of nodes an SMP search examines will be less than or equal to (if it were somehow performed perfectly) the single-threaded node count times the number of threads. But searching the wrong nodes, or even searching sub-optimal nodes, reduces the effectiveness (and hence the *correctness*) of the parallel search.

In my example, I was searching the wrong nodes (the root node for the game of chess). Now, even that stupid search will get some benefit from the additional threads of work. After all, given infinite time, they would eventually arrive at the perfect answer to the question "What is the best move, if any exists?" and as long as they are making plies of progress they can conceivably add information to the search. The problem with the bonehead search I described is that the threads are working sub-optimally. In that particular case it is obvious that the search is extremely sub-optimal because of redundant search of nodes that become less and less salient.

In a similar way, however, all parallel search methods must aim to search "the right" nodes, and the better they are at achieving that goal, the quicker their "time to ply" will be. In a perfect world, each thread would search nodes that none of the other threads has even touched, and those nodes would be the most important of all the nodes left unsearched by the others.

Clearly, the way that we dole out work is crucial to optimal searching.

My example is a proof to any thinking person that the NPS is not nearly so important as the way the effort is applied (since it is clear that very high NPS can be reached by my trivial search, though it is not effective). Hence, some measure other than NPS is obviously more important.

Gain in Elo is (I think) the most logical measure, though it is much harder to collect than NPS or time to ply or other extremely simple measures. On the other hand, time to ply is irrelevant if we choose the wrong node when we are done. And a trillion NPS search that chooses bad moves is a bad search.
SMP speedup < NPS speedup is an absolute. If you get ZERO NPS speedup, exactly HOW will you get any sort of SMP speedup? If you search the same number of nodes in a fixed amount of time, how can you search any faster? If you eliminate nodes somehow, you can do that in the non-SMP search exactly the same way.

Second, "super-linear speedup" is an urban myth. It will happen on the occasional position, but you will also see MORE positions with a sub-optimal speedup. Overall, super-linear speedup is impossible. There is a ton of theoretical explanations on why this is true, and I have repeated 'em too many times over the past 20 years or so to want to rehash that again.

Searching MORE nodes than optimal alpha/beta is a bad idea. You are doing MORE than the optimal amount of search. You might win here, but you will lose many more times over there. You are not seriously arguing this I hope?

Again, I did NOT say "NPS is that important". But it is IMPORTANT. If you don't search faster, parallel search offers exactly zero. The faster you search, the more POTENTIAL (not guaranteed) gain there is. Parallel search, once again, is NOT about smarts or anything else. It is about raw speed and searching the tree faster so that you can go deeper. This is not just true in chess; it is what parallel programming is all about. You can always argue the ridiculous and use multiple cores to search or play different games. But that is not the way to make the program play a single game any stronger. Feel free to describe any way you can get a faster parallel search than the NPS bound already given. And do NOT cite super-linear speedup, which is the exception rather than the rule. Overall, across a reasonable set of positions, super-linear speedup is not possible.

If you prefer, I can send you email addresses for some people that are actually involved, or have been involved in parallel search. Ken Thompson, Murray Campbell, Monty Newborn, Jonathan Schaeffer, to name 4 well-known people. All are going to tell you exactly the same thing.

Yes, you can have high NPS and a poor speedup. But NO, you can NOT have low NPS and a good speedup. The NPS speedup is the upper bound on the search speedup. This is really intuitive to anyone that writes this stuff...
Re:
>>
Yes, you can have high NPS and a poor speedup. But NO, you can NOT have low NPS and a good speedup. The NPS speedup is the upper bound on the search speedup. This is really intuitive to anyone that writes this stuff...
<<

There is no reason I can imagine to assume that you cannot lower NPS and improve the search. LMR is an obvious DISPROOF of that statement.
It spends a lot of work examining the nodes to be searched and prunes some nodes from the tree that we can ignore with a decent margin of safety. This will lower the NPS (compared to a simpler search algorithm), but the nodes that do get examined are more important, and the search solves the problem of finding the best move faster.
Another simple example of lower NPS with a better search is a database lookup using EGTB. The NPS might drop enormously due to probing (for instance) 7-man EGTB files, and yet the perfect answer returned ends up being faster than the search process would have been.

I can imagine a search where an enormous effort goes into deciding which nodes to spend time on, using fancy pruning algorithms, database lookups and the like, which lowers the NPS by a factor of ten. And yet this search runs better than a dumb search which is ten times faster.

A clear example of that is a pure mini-max search for an engine that only counts wood. It will have incredible NPS because the eval is so simple and the algorithm is so simple. But it is wasting time on nodes that add no value to the search.

Hence, NPS is an utterly useless measure in that instance. The 11-ply mini-max search evaluated 2,097,651,003,696,806 nodes (many of them multiple times), giving an incredible NPS figure, but the 35-ply intelligent search returned a vastly superior answer in 1% of the time.
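The time-to-answer versus raw-NPS point can be put in the same kind of toy arithmetic (all figures invented for illustration, not taken from any real engine):

```python
# Toy comparison: a "dumb" wood-counting minimax with huge NPS vs. a
# "smart" selective search that does heavy per-node work. All numbers
# are made up to illustrate the trade-off.
dumb = {"nps": 100_000_000, "nodes_to_answer": 2_000_000_000_000}
smart = {"nps": 10_000_000, "nodes_to_answer": 500_000_000}

t_dumb = dumb["nodes_to_answer"] / dumb["nps"]    # 20,000 s to the answer
t_smart = smart["nodes_to_answer"] / smart["nps"]  # 50 s to the answer

# Ten times fewer nodes per second, but 400x faster to the answer:
# NPS by itself says nothing about the quality of a search.
assert t_smart < t_dumb
```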
Pio
Posts: 334
Joined: Sat Feb 25, 2012 10:42 pm
Location: Stockholm

Re: Stockfish still scales poorly?

Post by Pio »

Hi Dann!

I do not think you understand what Bob is trying to say to you. If you make a parallel implementation that is faster on average (thanks to some randomness), you have really invented a new, better search algorithm, and any single-threaded version can imitate it by time-slicing between the same logical threads on one core. We thus have a proof that if parallel >= single, then parallel = single per unit of work. (With A >= B, I mean A is better than or equal to B.)
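The time-slicing argument can be illustrated with a toy Monte Carlo (an assumed model, not real engine behavior): if each thread probes independent random nodes, then a single core interleaving the same probe streams visits nodes with exactly the same distribution, so randomness cannot reduce the total expected work of the parallel version.

```python
import random

def nodes_until_hit(n_streams, p_hit, rng):
    """Count total node visits until any probe stream finds the answer.

    Probes are i.i.d., so round-robin interleaving of n_streams streams
    has exactly the same visit-count distribution as a single stream.
    """
    nodes = 0
    while True:
        for _ in range(n_streams):  # one probe per stream per round
            nodes += 1
            if rng.random() < p_hit:
                return nodes

rng = random.Random(42)
trials = 20_000
avg_parallel = sum(nodes_until_hit(8, 0.01, rng) for _ in range(trials)) / trials
avg_single = sum(nodes_until_hit(1, 0.01, rng) for _ in range(trials)) / trials

# Both averages come out near 100 visits (1 / p_hit): the same total
# work either way, so 8 threads can cut wall-clock time by at most 8x,
# never super-linearly on average.
```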

Good luck!!!