Stockfish still scales poorly?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: Stockfish still scales poorly?

Post by Joerg Oster »

lucasart wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.
Thanks. I'm glad at least one person understands the obvious.
There is one big mistake in this line of argument, though.
You can measure the Elo gain and be happy with it, but with that alone you have no clue whether your SMP implementation is performing at a high, medium or low level.
To know this, you have to take other measurements, for instance time to depth, nodes per second, idle time of the threads, etc.

Did you know before TCEC that SF's NPS is (maybe) somewhat low compared to Komodo's in the endgame on 16 cores? No? Elo doesn't tell you that? What a pity ...

Of course, gaining playing strength is the main goal. Nobody denies this.
Jörg Oster
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish still scales poorly?

Post by zullil »

Joerg Oster wrote:
lucasart wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.
Thanks. I'm glad at least one person understands the obvious.
There is one big mistake in this line of argument, though.
You can measure the Elo gain and be happy with it, but with that alone you have no clue whether your SMP implementation is performing at a high, medium or low level.
To know this, you have to take other measurements, for instance time to depth, nodes per second, idle time of the threads, etc.

Did you know before TCEC that SF's NPS is (maybe) somewhat low compared to Komodo's in the endgame on 16 cores? No? Elo doesn't tell you that? What a pity ...

Of course, gaining playing strength is the main goal. Nobody denies this.
Perhaps it would be interesting to have an SF that reports the idleness of the search threads, say as a percentage of total search time, or some similar core measurement of the parallel implementation.
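
Just to make that concrete, accounting along these lines would suffice. This is only a minimal sketch, not Stockfish code; IdleStats, IdleTimer and the hard-coded thread/time numbers are made up for illustration.

Code: Select all

#include <atomic>
#include <chrono>
#include <cstdio>

// Minimal per-thread idle-time accounting (illustration only, not SF code).
// Each search thread wraps its "waiting for work" periods in an IdleTimer;
// report() then prints idleness as a percentage of total thread time.
struct IdleStats {
    std::atomic<long long> idle_ns{0};

    void add(long long ns) { idle_ns += ns; }

    void report(int threads, long long search_ms) const {
        double total_ms = double(search_ms) * threads;   // wall time x threads
        double idle_ms  = idle_ns.load() / 1e6;
        std::printf("idle: %.1f%% of total thread time\n",
                    100.0 * idle_ms / total_ms);
    }
};

struct IdleTimer {   // RAII helper: measures one waiting period
    IdleStats& stats;
    std::chrono::steady_clock::time_point start;
    explicit IdleTimer(IdleStats& s)
        : stats(s), start(std::chrono::steady_clock::now()) {}
    ~IdleTimer() {
        auto ns = std::chrono::duration_cast<std::chrono::nanoseconds>(
                      std::chrono::steady_clock::now() - start).count();
        stats.add(ns);
    }
};

int main() {
    IdleStats stats;
    { IdleTimer t(stats); /* a thread would be idle inside this scope */ }
    stats.report(/*threads=*/16, /*search_ms=*/300000);   // 5-minute search
}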
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish still scales poorly?

Post by zullil »

bob wrote:
zullil wrote:Using the latest development version of SF on a 2x8-core Linux workstation with 32 GB RAM. Turbo boost disabled.

Standard SF bench of 37 positions, each searched for 5 minutes:

Code: Select all

./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196
 

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45...
Seems like this still suggests poor scaling from 8 to 16 cores. I realize this is a small amount of testing :wink:. Suggestions, criticisms, explanations welcome.
What hardware?

One issue with AMD is that most of the AMD BIOS chips give you two choices for memory setup: (a) NUMA, (b) SMP.

NUMA is the traditional approach: if you have two chips, as you do, with say 16 GB of DRAM, chip 0 will have addresses 0-8 GB and chip 1 will have addresses 8-16 GB.

SMP interleaves pages between the two chips, so that chip 0 gets addresses 0-4K, chip 1 gets 4K-8K, chip 0 gets 8K-12K, etc. If a program understands NUMA, using the SMP setting will break it badly. If it doesn't understand NUMA, the SMP setting will help avoid memory hotspots but does introduce delays.

If it is Intel, I don't believe they have done this, at least not on any machines I have run on, and I have used a bunch of 'em over time.
I experimented by disabling NUMA (i.e., choosing SMP) in the BIOS. I believe SF is not NUMA-aware, but it certainly performed worse with SMP than with NUMA on my system.
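
To make the two memory modes Bob describes concrete, here is a toy sketch of the address-to-node mapping. The 2-node, 16 GB and 4K-page figures are just the ones from his example, not anything queried from real firmware; it also shows why a NUMA-aware program breaks under interleaving, since half of its "local" pages end up on the remote chip.

Code: Select all

#include <cstdio>

// Toy illustration of the two BIOS memory modes described above.
const unsigned long long PAGE  = 4096;          // 4K pages
const unsigned long long TOTAL = 16ULL << 30;   // 16 GB of DRAM
const int NODES = 2;

// "NUMA" mode: each chip owns one contiguous half of the address space.
int node_numa(unsigned long long addr) { return addr < TOTAL / NODES ? 0 : 1; }

// "SMP" mode: 4K pages are interleaved round-robin between the chips.
int node_interleaved(unsigned long long addr) { return int((addr / PAGE) % NODES); }

int main() {
    unsigned long long samples[] = {0, PAGE, 2 * PAGE, TOTAL / 2, TOTAL - PAGE};
    for (unsigned long long a : samples)
        std::printf("addr %14llu  NUMA node %d  interleaved node %d\n",
                    a, node_numa(a), node_interleaved(a));
    return 0;
}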
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

Pio wrote:Hi Dann!

I do not think you understand what Bob is trying to say to you. If you make a parallel implementation that is faster (by some randomness) on average, you have invented a new better search algorithm that all the single threaded versions will imitate by creating new threads. We thus have a proof that if parallel >= single -> parallel = single. (With A >= B, I mean A is better or equal to B)

Good luck!!!
I clearly stated up-thread that nodes per second for one thread multiplied by the number of threads will be more than what is measured for any multi-threaded search. I do not disagree with that (speed) assumption {stated above} and I also said as much. However, I do not think it negates any of my points.

I think it is a mistake to assume that NPS is a direct indicator of Elo, which is the goal after all. We have many potential ways to measure the quality of a search. I gave counter-examples that demonstrate using NPS is not a sufficient measure to know if a search is efficient.

An ideal parallel search will solve more problems faster than a less than ideal search. To measure that, NPS and time to ply are not the ideal measures. While there is probably a correlation, why not measure what you are trying to gain directly?

I don't think Dr. Hyatt is stupid or anything like that. I just think that the notion that NPS is the measure of the effectiveness of SMP search is barking up the wrong tree.

I also think that people need to think outside of the box. That is where alpha-beta, null-move, and LMR came from. Eventually, the branching factor will drop below 1.5 and searches will reach absurd depths in a short time. Because of the current expansion in core counts and what is foreseen on the horizon, SMP search excellence is one of the most important topics in computer chess. "We always did it that way and it can't be improved" is the wrong approach.

IMO-YMMV.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

Dann Corbit wrote:Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?
No, but a parallel search that leaves a lot of potential untapped leaves a lot of potential untapped.

It seems people are thinking that Joona's patch is being criticised. It is not.

Before Joona's patch, most people were aware that SF scales poorly when going from 8 to 16 cores and that this in particular shows in the poor scaling of NPS. This means there is a treasure waiting to be found: one that fixes NPS scaling.

Joona comes up with a patch that seems to improve SMP significantly. But it does not fix NPS scaling. It means a treasure was found but that it is not the treasure we knew existed. That one is still hidden and waiting to be found.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish still scales poorly?

Post by Laskos »

syzygy wrote:
Laskos wrote:
jhellis3 wrote:
You are wrong, as several have already told you. NPS scaling tells you what percentage of the hardware you are able to use. Your SMP speedup is bound by the NPS speedup. If you search 1M nodes per second on one CPU, and only 8M on 16 cores, you are wasting 1/2 of the hardware and you will NEVER get an SMP speedup > 8x. And in reality it will be less.
Well, if one wants to be pedantic.... This statement is wrong. If you search 16x as many nodes in a given unit of time T with terrible efficiency (1/2 are redundant), then the effective speedup is only 8x. On the other hand, searching 12x as many nodes in T with say 90% efficiency is an effective speedup of 10.8x. I would take the 10.8x over the 8x, but that's just me.

At the end of the day though, Lucas is right in that it doesn't really matter what one does as long as the end result is higher Elo.
Good correction, I was thinking of posting the same myself. The effective speed-up of SF on 16 threads must be in the 6x-8x range, so any NPS speed-up above 9x could turn out to be the better one, be it 9x or 15x.
Bob specifically said that NPS speedup bounds SMP speedup. This is the point. He also said NPS is not everything. What is quoted here is simply 100% correct if one reads carefully.

SF has poor NPS speedup when going from 8 to 16 cores. If that can be addressed (in a way that is not stupid, so not e.g. by letting all cores search their own tree without any form of communication), the potential gain is very significant.
In the SF context, the issue is only vaguely related to NPS. On 16 threads, the effective speed-up is about 6x-8x, so it would be misleading to say that the SMP implementation sucks because the NPS speed-up is "only" 11x and not 15x, as Bob seems to show for Crafty. Then there are the anecdotal reports of the Rybka cluster, with some 50x NPS speed-up on 64 cores and only a 10x effective speed-up, or Jonny with astronomic NPS on a 2,000+ core machine losing the WCCC to Junior.
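
For what it's worth, the arithmetic behind that correction is simply effective speedup = NPS speedup x search efficiency. A tiny sketch using the two hypothetical cases from the quote above (the 50% and 90% efficiencies are that example's assumptions, not measurements):

Code: Select all

#include <cstdio>

// Effective (SMP) speedup = NPS speedup x search efficiency,
// where efficiency is the fraction of the extra nodes that are not wasted.
double effective(double nps_speedup, double efficiency) {
    return nps_speedup * efficiency;
}

int main() {
    std::printf("16x NPS at 50%% efficiency -> %.1fx effective\n", effective(16, 0.5));  // 8.0x
    std::printf("12x NPS at 90%% efficiency -> %.1fx effective\n", effective(12, 0.9));  // 10.8x
    return 0;
}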
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

Laskos wrote:
syzygy wrote:SF has poor NPS speedup when going from 8 to 16 cores. If that can be addressed (in a way that is not stupid, so not e.g. by letting all cores search their own tree without any form of communication), the potential gain is very significant.
In the SF context, the issue is only vaguely related to NPS. On 16 threads, the effective speed-up is about 6x-8x, so it would be misleading to say that the SMP implementation sucks because the NPS speed-up is "only" 11x and not 15x
OK, so let's say NPS speedup is 11x and effective speedup is 8x on 16 threads.

Then the SMP efficiency is 8/11 x 100% ≈ 73% on 16 threads. That is very good! It is unlikely that this percentage can be improved by much.

Now how can total performance be improved further? Only by improving NPS speedup.

Or maybe 8/11 can still be improved to 9/11. Or it is only 6/11 and it can be improved to 8/11. That would be great as well. But that does not make the 11/16 ratio any less important. Both are important.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

Laskos wrote:NPS speed-up is "only" 11x and not 15x
Btw, Louis' post suggests that the NPS speedup going from 8 to 16 cores is 1.45x.

In the unlikely case that NPS speedup on 8 cores is 8x, this would give 11.6x NPS speedup on 16 cores.

It seems more likely that it is 6x or so on 8 cores, meaning 8.7x NPS speedup on 16 cores.

And this was on 5-minute searches. On shorter searches the numbers most likely look much worse. Now I don't think it is important at all to have good 16-core performance at very short time controls (but note that this means that the reliability of testing SMP changes at such time controls is extremely doubtful... I would not be surprised if 8-core SF outperforms 16-core SF in the testing framework), but doing well at 30-second searches or so does seem to be worth something.
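
The extrapolation here is just multiplication; a small sketch with Louis' measured bench numbers and the two assumed 8-core NPS speedups (the 8x and 6x figures are assumptions, as stated above):

Code: Select all

#include <cstdio>

int main() {
    // Louis' bench result: NPS on 16 cores divided by NPS on 8 cores.
    const double ratio_8_to_16 = 21030196.0 / 14474279.0;   // about 1.45x

    // Assumed 8-core NPS speedups relative to a single core.
    const double assumed_8core[] = {8.0, 6.0};

    for (double s8 : assumed_8core)
        std::printf("8-core NPS speedup %.1fx -> 16-core NPS speedup about %.1fx\n",
                    s8, s8 * ratio_8_to_16);   // prints ~11.6x and ~8.7x
    return 0;
}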
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
bob wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.

One would hope this is rather obvious...
1. NPS gives an absolute upper bound on speedup. If you search 10M NPS with one CPU, and only 10M with 16 CPUs, you will NEVER get any speedup, because you are doing no "extra work" during that search time. NPS is an important number because it provides this upper bound on performance.
This statement is false. NPS does not provide an upper bound on performance. Performance can only sanely be measured one way (via results), typically in the form of Elo. Using NPS as a metric to judge the effectiveness of code (which alters the search) is nonsense.
Please. This is going into never-never land. Parallel search is ONLY about speed. Nothing more, nothing less. "Smartness" is not related to parallel search. You can make a program smarter (or dumber) without even thinking about parallel search. If your NPS is only 1/2 of what it COULD be, then the SMP Elo gain is only 1/2 of what it COULD be. And there is absolutely nothing that will refute that statement; it is simply a cold, hard fact that ANYBODY actually doing parallel programming will agree with.
Would a parallel search that achieved 100x the NPS but a zero increase in Elo be implemented correctly?

More nodes is one way to make a program stronger.
Better eval is another way.
The recent parallel search idea of having some threads examine forward nodes is a clever riff on the standard methodologies.

I would say that a parallel search is only as good as the results it achieves.

The better the search is implemented, and the better it scales as the thread count rises, the more valuable it is to the goal of Elo (which is, after all, the only sensible goal for improving a chess program's playing strength).

I can write a chess program that has 32 threads. 31 of those threads will search this position for eternity:
rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq -
It might have stupendous NPS, but it will rapidly lose any Elo gain as we move away from the root node.
The distribution of work for the threads is clearly very important for the efficiency of the search.
So new ideas like those put forth by Dan Homan and the Stockfish team are clearly worthy of careful scrutiny.

And if the new ideas bring more Elo, then they are better ideas than the old ones.

IMO-YMMV
I've explained this a dozen times. There are two parts to this.

Part 1: doing more work in a fixed interval of time. If you double the NPS, you are doing twice as much work and you MIGHT search twice as fast, or you might search no faster. You will NOT search more than twice as fast, however. Getting the NPS up is one issue in parallel search. Lock contention is a big problem. Sharing too much data in a single cache block is a big problem. There are many others. These things prevent decent NPS scaling. So for starters, until you get decent NPS scaling, there is no point in worrying much about the parallel search part of the issue. The more work you do per unit of time, the more potential performance benefit you have.
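
As an aside on the "sharing too much data in a single cache block" point: the usual remedy is to pad per-thread data out to its own cache line. A minimal sketch, assuming a 64-byte line; the PerThreadCounters name and the counters themselves are illustrative, not Stockfish's actual layout.

Code: Select all

#include <atomic>
#include <cstddef>

// If several threads bump counters that live in the same 64-byte cache line,
// every increment bounces that line between cores and NPS drops (false sharing).
// Aligning each thread's counters to their own line avoids the ping-pong.
constexpr std::size_t CACHE_LINE = 64;   // assumed line size

struct alignas(CACHE_LINE) PerThreadCounters {
    std::atomic<unsigned long long> nodes{0};
    std::atomic<unsigned long long> tbHits{0};
    // alignas() forces sizeof to a multiple of 64, so no two slots share a line.
};

PerThreadCounters counters[16];          // one slot per search thread

void on_node_searched(int thread_id) {
    counters[thread_id].nodes.fetch_add(1, std::memory_order_relaxed);
}

int main() {
    on_node_searched(0);
    on_node_searched(1);                 // different threads touch different lines
    return 0;
}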

Part 2: using that potential work speed is a different issue. If your NPS is 16x faster with N cores (N >= 16 of course), then you can potentially search 16x faster with the parallel search. But that NPS number is an absolute upper bound on SMP improvement. So once you get the NPS up to a decent level, you dive into better ways of splitting, better ways to reduce synchronization conflicts, better ways to avoid memory / cache interference. And then you see a true SMP speedup. But it is guaranteed to conform to this relationship:

SMP speedup <= NPS speedup

No exceptions. If your NPS speedup is bad, your SMP speedup will be worse (since the SMP speedup will never be perfect due to move ordering issues). Both are equally important, and they are affected by different parts of the program and hardware components.

So please stop trying to turn the discussion in nonsensical directions. Again, if NPS speedup is bad, SMP speedup is guaranteed to be worse. Period. As I said, NPS is an upper bound. In your rather ridiculous example, the above is still true: NPS = 32x, SMP < 32x.

But I believe most will see that the example is nonsensical. Feel free to find ANY example that violates the NPS limit mentioned. NPS speedup is a hard upper bound, just like c in physics. The only thing is, unlike c, we can change the NPS speedup with clever (or not-so-clever) programming tricks. Having the largest possible upper bound is a clear advantage.
Re:"SMP speedup <= NPS speedup"
I see no guarantee of this, as far as search effectiveness.
There are known cases where an SMP search happens to discover a result much more quickly than it "ought to" due to one of the many parallel threads stumbling upon the right node. There is no reason to assume that this would not happen more and more as we scale to thousands or millions of threads.

I think we can all agree that the number of nodes searched by an SMP search will be less than or equal to (if it were somehow performed perfectly) the single-threaded node count times the number of threads. But searching the wrong nodes, or even searching sub-optimal nodes, reduces the effectiveness (and hence the *correctness*) of the parallel search.

In my example, I was searching the wrong nodes (the root node for the game of chess). Now, even that stupid search will get some benefit from the additional threads of work. After all, given infinite time, they would eventually arrive at the perfect answer to the question "What is the best move, if any exists?" and as long as they are making plies of progress they can conceivably add information to the search. The problem with the bonehead search I described is that the threads are working sub-optimally. In that particular case it is obvious that the search is extremely sub-optimal because of redundant search of nodes that become less and less salient.

In a similar way, however, all parallel search methods must aim to search "the right" nodes, and the better they achieve that goal, the shorter their "time to ply" will be. In a perfect world, we would like each thread to search nodes that none of the other threads have even bothered with, and for those to be the most important of all the nodes left unsearched by the others.

Clearly, the way that we dole out work is crucial to optimal searching.

My example is a proof to any thinking person that the NPS is not nearly so important as the way the effort is applied (since it is clear that very high NPS can be reached by my trivial search, though it is not effective). Hence, some measure other than NPS is obviously more important.

Gain in Elo is (I think) the most logical measure, though it is much harder to collect than NPS or time to ply or other extremely simple measures. On the other hand, time to ply is irrelevant if we choose the wrong node when we are done. And a trillion NPS search that chooses bad moves is a bad search.
SMP speedup < NPS speedup is an absolute. If you get ZERO NPS speedup, exactly HOW will you get any sort of SMP speedup? If you search the same number of nodes in a fixed amount of time, how can you search any faster? If you eliminate nodes somehow, you can do that in the non-SMP search exactly the same way.

Second, "super-linear speedup" is an urban myth. It will happen on the occasional position, but you will also see MORE positions with a sub-optimal speedup. Overall, super-linear speedup is impossible. There is a ton of theoretical explanations on why this is true, and I have repeated 'em too many times over the past 20 years or so to want to rehash that again.

Searching MORE nodes than optimal alpha/beta is a bad idea. You are doing MORE than the optimal amount of search. You might win here, but you will lose many more times over there. You are not seriously arguing this I hope?

Again, I did NOT say "NPS is that important". But it is IMPORTANT. If you don't search faster, parallel search offers exactly zero. The faster you search, the more POTENTIAL (not guaranteed) gain there is. Parallel search, once again, is NOT about smarts or anything else. It is about raw speed and searching the tree faster so that you can go deeper. This is not just true in chess. It is what parallel programming is all about. You can always argue the ridiculous and use multiple cores to search or play different games. But that is not the way to make the program play a single game any stronger. Feel free to describe any way you can get a faster parallel search than the NPS bound already given. And do NOT cite super-linear speedup, which is the exception rather than the rule. Overall, across a reasonable set of positions, super-linear speedup is not possible.

If you prefer, I can send you email addresses for some people that are actually involved, or have been involved in parallel search. Ken Thompson, Murray Campbell, Monty Newborn, Jonathan Schaeffer, to name 4 well-known people. All are going to tell you exactly the same thing.

Yes, you can have high NPS and a poor speedup. But NO, you can NOT have low NPS and a good speedup. The NPS speedup is the upper bound on the search speedup. This is really intuitive to anyone that writes this stuff...
Re:
>>
Yes, you can have high NPS and a poor speedup. But NO, you can NOT have low NPS and a good speedup. The NPS speedup is the upper bound on the search speedup. This is really intuitive to anyone that writes this stuff...
<<

There is no reason I can imagine to assume that you cannot lower NPS and improve the search. LMR is an obvious DISPROOF of that statement.
It spends a lot of work examining the nodes to be searched and prunes from the tree some nodes that we can ignore with a decent margin of safety. This will lower the NPS (compared to a simpler search algorithm), but the nodes that do get examined are more important, and the search solves the problem of finding the best node faster.
Another simple example of lower NPS with a better search is a database lookup using EGTBs. The NPS might drop enormously due to probing (for instance) 7-man EGTB files. And yet the perfect answer returned by the probe arrives faster than the search would have produced one.

I can imagine a search where an enormous effort goes into deciding which nodes to spend time on, using fancy pruning algorithms, database lookups and the like, and which lowers the NPS by a factor of ten. And yet this search performs better than a dumb search which is ten times faster.

A clear example of that is a pure mini-max search for an engine that only counts wood. It will have incredible NPS because the eval is so simple and the algorithm is so simple. But it is wasting time on nodes that add no value to the search.

Hence, NPS is an utterly useless measure in that instance. The 11-ply mini-max search evaluated 2,097,651,003,696,806 nodes (many positions visited multiple times), giving an incredible NPS figure, but the 35-ply intelligent search returned a vastly superior answer in 1% of the time.
I will repeat, since you keep going off on tangents. This is about SMP performance, NOT just the performance of a serial-search chess engine. SMP is about speed. It is NOT about quality, knowledge, or anything else. Just searching more nodes in the same amount of time. NPS speedup gives an upper bound on performance. Again, I'll be happy to provide you with the email addresses of people that have actually worked on this stuff in chess. You will find complete agreement here.

Nobody is suggesting that you ARTIFICIALLY inflate the NPS with nonsensical changes. We are talking about the best serial search you can do, and then speeding that up with SMP. Your examples remain completely nonsensical. Pure minimax searches nodes faster than alpha/beta, because none get thrown out (alpha/beta wastes some move-generation effort on nodes it then discards). But that is NOT what is being discussed. The basic alpha/beta search, move ordering, pruning and such do NOT get changed in a parallel search algorithm. It is supposedly already optimal; otherwise one would first address that before moving on to the complexities of SMP.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
Pio wrote:Hi Dann!

I do not think you understand what Bob is trying to say to you. If you make a parallel implementation that is faster (by some randomness) on average, you have invented a new better search algorithm that all the single threaded versions will imitate by creating new threads. We thus have a proof that if parallel >= single -> parallel = single. (With A >= B, I mean A is better or equal to B)

Good luck!!!
I clearly stated up-thread that nodes per second for one thread multiplied by the number of threads will be more than what is measured for any multi-threaded search. I do not disagree with that (speed) assumption {stated above} and I also said as much. However, I do not think it negates any of my points.

I think it is a mistake to assume that NPS is a direct indicator of Elo, which is the goal after all. We have many potential ways to measure the quality of a search. I gave counter-examples that demonstrate using NPS is not a sufficient measure to know if a search is efficient.

An ideal parallel search will solve more problems faster than a less than ideal search. To measure that, NPS and time to ply are not the ideal measures. While there is probably a correlation, why not measure what you are trying to gain directly?

I don't think Dr. Hyatt is stupid or anything like that. I just think that the notion that NPS is the measure of the effectiveness of SMP search is barking up the wrong tree.

I also think that people need to think outside of the box. That is where alpha-beta, null-move, and LMR came from. Eventually, the branching factor will drop below 1.5 and searches will reach absurd depths in a short time. Because of the current expansion in core counts and what is foreseen on the horizon, SMP search excellence is one of the most important topics in computer chess. "We always did it that way and it can't be improved" is the wrong approach.

IMO-YMMV.
You won't find ANY place where I said "NPS is an indicator of Elo". I said "NPS is certainly an upper bound on possible Elo gain, because it is an upper bound on possible parallel search speed gains." And there is no "we always did it that way so ..." in this discussion. But SMP is about speed, and nothing else. Which means NPS gives an upper bound on the potential gain. It does NOT necessarily give "the absolute gain", which will always be less than the NPS speed bound.