Stockfish still scales poorly?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

zullil wrote:
Joerg Oster wrote:
lucasart wrote:
jhellis3 wrote:No it isn't. The reason it isn't can be evidenced by what he was replying to. Lucas's statement was not incorrect because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.
Thanks. I'm glad at least one person understands the obvious.
There is one big mistake in this argument, though.
You can measure the Elo gain and be happy with it, but from that alone you have no clue whether your SMP implementation is performing at a top, medium, or low level.
To know this, you have to take some other measurements: for instance, time to depth, nodes per second, idle time of the threads, etc.

Did you know before TCEC that SF's NPS is (maybe) somewhat low compared to Komodo's in the endgame on 16 cores? No? Elo doesn't tell you that? What a pity ...

Of course, gaining playing strength is the main goal. Nobody denies this.
Perhaps it would be interesting to have a SF that reports the idleness of the search threads, say as a percentage of the total search time. Or some similar core measurement of the parallel implementation.
That's the first step in solving any of these kinds of problems: proper instrumentation. If it is expensive to compute, make it something that can be compiled out for real games and compiled in for testing and improving.

I added the "cpu %" number in Crafty so I could see exactly that and determine whether the performance issues might be the actual parallel search waiting time (the % value) or the more problematic cache/memory issues that can't easily be measured.
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish still scales poorly?

Post by Laskos »

syzygy wrote:
And this was on 5 minute searches. On shorter searches the numbers most likely look much worse. Now I don't think it is important at all to have good 16-core performance at very short time controls (but note that this means that the reliability of testing SMP changes at such time controls is extremely doubtful; I would not be surprised if 8-core SF outperforms 16-core SF in the testing framework), but doing well at 30 second searches or so does seem to be worth something.
Short TCs are surely doubtful for measuring both NPS and effective speed-up scaling with threads. I plotted 1->4 core Houdini 3 results more than a year ago:
[Image: NPS and effective speed-up (1->4 cores, Houdini 3) plotted against time control]

Both NPS and effective speed-up (time-to-depth, or TTD, which works for Houdini but not for Komodo) are better treated as meaningful measures only at reasonable-to-long TCs.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Laskos wrote:
syzygy wrote:
And this was on 5 minute searches. On shorter searches the numbers most likely look much worse. Now I don't think it is important at all to have good 16-core performance at very short time controls (but note that this means that the reliability of testing SMP changes at such time controls is extremely doubtful; I would not be surprised if 8-core SF outperforms 16-core SF in the testing framework), but doing well at 30 second searches or so does seem to be worth something.
Short TCs are surely doubtful for measuring both NPS and effective speed-up scaling with threads. I plotted 1->4 core Houdini 3 results more than a year ago:
[Image: NPS and effective speed-up (1->4 cores, Houdini 3) plotted against time control]

Both NPS and effective speed-up (time-to-depth, or TTD, which works for Houdini but not for Komodo) are better treated as meaningful measures only at reasonable-to-long TCs.
I like the plot because it makes the point I have been after all along here recently: NPS is an upper bound on SMP speedup. The closer the SMP and NPS speedups get, the better the SMP search is working. The bigger the NPS speedup, the better the program works with the existing hardware being used.

There is still a HUGE variability in NPS speedup with all the different hardware platforms available today. Too much trickiness in the cache stuff, shared memory accessing, memory interleaving, cache forwarding, you name it.

BTW I am aware of one example where depth didn't influence SMP speedup much at all: Cray Blitz. We didn't have any "minimum split depth" there because the split overhead was almost non-existent. The only drawback to shallow depths back then was that it was harder to measure times, since every operating system has a fairly coarse time resolution, and at very fast search times you get lots of quantization error unless you stretch the search time to well beyond whatever might be added or subtracted by that quantization.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

bob wrote:
Dann Corbit wrote:
Pio wrote:Hi Dann!

I do not think you understand what Bob is trying to say to you. If you make a parallel implementation that is faster (by some randomness) on average, you have invented a new better search algorithm that all the single threaded versions will imitate by creating new threads. We thus have a proof that if parallel >= single -> parallel = single. (With A >= B, I mean A is better or equal to B)

Good luck!!!
I clearly stated up-thread that nodes per second for one thread multiplied by the number of threads will be more than what is measured for any multi-threaded search. I do not disagree with that (speed) assumption {stated above} and I also said as much. However, I do not think it negates any of my points.

I think it is a mistake to assume that NPS is a direct indicator of Elo, which is the goal after all. We have many potential ways to measure the quality of a search. I gave counter-examples that demonstrate using NPS is not a sufficient measure to know if a search is efficient.

An ideal parallel search will solve more problems faster than a less than ideal search. To measure that, NPS and time to ply are not the ideal measures. While there is probably a correlation, why not measure what you are trying to gain directly?

I don't think Dr. Hyatt is stupid or anything like that. I just think that the notion that NPS is the measure of the effectiveness of SMP search is barking up the wrong tree.

I also think that people need to think outside of the box. That is where alpha-beta, null-move, and LMR came from. Eventually, the branching factor will drop below 1.5 and searches will reach absurd depths in a short time. Because of the expansion of cores currently and what is foreseen on the horizon, SMP search excellence is one of the most important topics in computer chess. "We always did it that way and it can't be improved." is the wrong approach.

IMO-YMMV.
You won't find ANY place where I said "NPS is an indicator of Elo". I said "NPS is certainly an upper bound on possible Elo gain, because it is an upper bound on possible parallel search speed gains." And there is no "we always did it that way so ..." in the discussion. But SMP is about speed, and nothing else. Which means NPS gives an upper bound on the potential gain. It does NOT necessarily give "the absolute gain", which will always be less than the NPS speed bound.
Well then, I guess that we are in complete agreement.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
Pio wrote:Hi Dann!

I do not think you understand what Bob is trying to say to you. If you make a parallel implementation that is faster (by some randomness) on average, you have invented a new better search algorithm that all the single threaded versions will imitate by creating new threads. We thus have a proof that if parallel >= single -> parallel = single. (With A >= B, I mean A is better or equal to B)

Good luck!!!
I clearly stated up-thread that nodes per second for one thread multiplied by the number of threads will be more than what is measured for any multi-threaded search. I do not disagree with that (speed) assumption {stated above} and I also said as much. However, I do not think it negates any of my points.

I think it is a mistake to assume that NPS is a direct indicator of Elo, which is the goal after all. We have many potential ways to measure the quality of a search. I gave counter-examples that demonstrate using NPS is not a sufficient measure to know if a search is efficient.

An ideal parallel search will solve more problems faster than a less than ideal search. To measure that, NPS and time to ply are not the ideal measures. While there is probably a correlation, why not measure what you are trying to gain directly?

I don't think Dr. Hyatt is stupid or anything like that. I just think that the notion that NPS is the measure of the effectiveness of SMP search is barking up the wrong tree.

I also think that people need to think outside of the box. That is where alpha-beta, null-move, and LMR came from. Eventually, the branching factor will drop below 1.5 and searches will reach absurd depths in a short time. Because of the expansion of cores currently and what is foreseen on the horizon, SMP search excellence is one of the most important topics in computer chess. "We always did it that way and it can't be improved." is the wrong approach.

IMO-YMMV.
You won't find ANY place where I said "NPS is an indicator of Elo". I said "NPS is certainly an upper bound on possible Elo gain, because it is an upper bound on possible parallel search speed gains." And there is no "we always did it that way so ..." in the discussion. But SMP is about speed, and nothing else. Which means NPS gives an upper bound on the potential gain. It does NOT necessarily give "the absolute gain", which will always be less than the NPS speed bound.
Well then, I guess that we are in complete agreement.
I don't think so. This statement is a problem:
My example is a proof to any thinking person that the NPS is not nearly so important as the way the effort is applied (since it is clear that very high NPS can be reached by my trivial search, though it is not effective). Hence, some measure other than NPS is obviously more important.
If your parallel search does NOT speed up the NPS at all, it is going to produce zero Elo gain. Hence NPS is important. The bigger the increase, the bigger the potential gain, assuming a good parallel search implementation. But conversely, the smaller the NPS gain, the smaller the potential gain.

So you need BOTH. One is not more important than the other. In fact, they are pretty much unrelated for the most part. High NPS requires lots of tweaking and tuning to minimize cache traffic and the like. Good speedup is a different issue entirely.

But no matter, NPS speedup is an upper bound on parallel search speedup, which is an upper bound on Elo improvement.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

bob wrote:But no matter, NPS speedup is an upper bound on parallel search speedup, which is an upper bound on Elo improvement.
I don't think this is proven. Now, I *do* agree that 32 threads going full tilt cannot exceed the speed of one thread going full tilt multiplied by 32. That is simple mathematics.

But I do not agree that the Elo gain has the same bound. It seems conceivable to me that the information gained by some of the searching threads could guide the search of some other threads in some intelligent way to increase the effectiveness of the search super-linearly.

We already know that we do not have to examine all the nodes of the search tree to get a perfect answer. The analysis of the alpha-beta search algorithm offers a formal proof of that.

If it were true, for instance, that every node had to be examined, then SMP scaling would have a hard limit even with 100% utilization and no two threads searching the same node. But we already know that a perfect search can visit as few as roughly the square root of the minimax node count, times some constant.

If there is a formal proof that multiple threads of execution cannot search better than a single thread (less SMP loss) times the thread count then I am wrong. But I have never seen such a proof. And a room full of experts that all believe such an algorithm cannot exist is not a proof to me.

Not that I am claiming to have such an algorithm either.

I do know that there are many SMP implementations that scale badly not in NPS but in Elo as a function of processor count. So if they relied on NPS as a measure of improvement, then they would be measuring the wrong thing. And when cores scaled up, they would get a nasty surprise.
BBauer
Posts: 658
Joined: Wed Mar 08, 2006 8:58 pm

Re: Stockfish still scales poorly?

Post by BBauer »

bob wrote:
But no matter, NPS speedup is an upper bound on parallel search speedup, which is an upper bound on Elo improvement.

NPS speedup may be an upper bound on parallel search speedup
but
it is *not* an upper bound on Elo improvement.

Kind regards
Bernhard
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

BBauer wrote:bob wrote:
{snip}NPS speedup may be an upper bound on parallel search speedup
I do not think that we have to accept even this.
Consider alpha-beta search. By examining information collected from the search we can provably search fewer nodes and get the same result.
Consider LMR. By examining search results, our position in the tree, inherent dangers, and a few other things, we can search fewer nodes and get a superior, DEEPER answer in the same amount of time. We did that by reducing the branching factor.

Now, I ask the kind audience to engage in a thought experiment with me. Consider a CPU collection with as many real cores as we see in today's high end video cards: literally tens of thousands of threads at our disposal. Let's enhance our search by coloring the tree image we collect: hot areas of the search (good for us) get hot colors like red and orange, bad areas get blues and greens. We've got some neurons that know how to process this information to guide the search so that we waste less time in the cold areas and spend more time in the hot areas. This is a kind of information that is not nearly so feasible to collect with a single-threaded search. Perhaps there are pruning algorithms that can utilize parallel information to prune better. If so, and we can further reduce the branching factor because of data collected in parallel (after all, we are already doing that in many ways with data collected serially), then perhaps the only limit that really exists is a branching factor of 1.0000000.

My weak mind cannot conceive that there is a way to improve upon that.
Doesn't mean that it cannot happen, just that I am not able to imagine it.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

Dann Corbit wrote:
BBauer wrote:bob wrote:
{snip}NPS speedup may be an upper bound on parallel search speedup
I do not think that we have to accept even this.
Consider alpha-beta search. By examining information collected from the search we can provably search fewer nodes and get the same result.
Consider LMR. By examining search results, our position in the tree, inherent dangers, and a few other things, we can search fewer nodes and get a superior, DEEPER answer in the same amount of time. We did that by reducing the branching factor.

Now, I ask the kind audience to engage in a thought experiment with me. Consider a CPU collection with as many real cores as we see in today's high end video cards: literally tens of thousands of threads at our disposal. Let's enhance our search by coloring the tree image we collect: hot areas of the search (good for us) get hot colors like red and orange, bad areas get blues and greens. We've got some neurons that know how to process this information to guide the search so that we waste less time in the cold areas and spend more time in the hot areas. This is a kind of information that is not nearly so feasible to collect with a single-threaded search. Perhaps there are pruning algorithms that can utilize parallel information to prune better. If so, and we can further reduce the branching factor because of data collected in parallel (after all, we are already doing that in many ways with data collected serially), then perhaps the only limit that really exists is a branching factor of 1.0000000.

My weak mind cannot conceive that there is a way to improve upon that.
Doesn't mean that it cannot happen, just that I am not able to imagine it.
I take it back. If chess is proven, I know the answer and I do not have to make any moves. So the branching factor is 0.

I am still having trouble trying to visualize a branching factor between 0 and 1 though, or fractal branching factors.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish still scales poorly?

Post by Dann Corbit »

Dann Corbit wrote:
Dann Corbit wrote:
BBauer wrote:bob wrote:
{snip}NPS speedup may be an upper bound on parallel search speedup
I do not think that we have to accept even this.
Consider alpha-beta search. By examining information collected from the search we can provably search fewer nodes and get the same result.
Consider LMR. By examining search results, our position in the tree, inherent dangers, and a few other things, we can search fewer nodes and get a superior, DEEPER answer in the same amount of time. We did that by reducing the branching factor.

Now, I ask the kind audience to engage in a thought experiment with me. Consider a CPU collection with as many real cores as we see in today's high end video cards: literally tens of thousands of threads at our disposal. Let's enhance our search by coloring the tree image we collect: hot areas of the search (good for us) get hot colors like red and orange, bad areas get blues and greens. We've got some neurons that know how to process this information to guide the search so that we waste less time in the cold areas and spend more time in the hot areas. This is a kind of information that is not nearly so feasible to collect with a single-threaded search. Perhaps there are pruning algorithms that can utilize parallel information to prune better. If so, and we can further reduce the branching factor because of data collected in parallel (after all, we are already doing that in many ways with data collected serially), then perhaps the only limit that really exists is a branching factor of 1.0000000.

My weak mind cannot conceive that there is a way to improve upon that.
Doesn't mean that it cannot happen, just that I am not able to imagine it.
I take it back. If chess is proven, I know the answer and I do not have to make any moves. So the branching factor is 0.

I am still having trouble trying to visualize a branching factor between 0 and 1 though, or fractal branching factors.
Perhaps a sub-one branching factor can exist if chess were "nearly" proven, like checkers was a few years ago.