Best Stockfish NPS scaling yet

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

Laskos wrote:
bob wrote:
zullil wrote:
Laskos wrote:
zullil wrote:
zullil wrote:
Laskos wrote: These numbers are time (or depth) dependent. They increase with time. Could you run again with twice the time, or, if it's too lengthy, with half the time? To see the trend.
Yes, NPS increases as a function of search time, though the rate of increase slows and the NPS value seems to approach a fairly stable asymptotic value.

I think the numbers I reported (from 300 seconds per position for 37 positions) are quite close to the limits I'd get on my hardware. So testing again with 150 seconds per position might be more revealing than testing with 600 seconds per position. Will do this as soon as I am able.
Here are numbers for 150 sec. per position, with 8 threads and with 16 threads.

Code: Select all

Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

./stockfish bench 16384 16 150000 default time
===========================
Total time (ms) : 5550063
Nodes searched  : 129857933271
Nodes/second    : 23397560

./stockfish bench 16384 8 150000 default time
===========================
Total time (ms) : 5550010
Nodes searched  : 79227345027
Nodes/second    : 14275171

23397560/14275171 = 1.64
That's compared to 1.70 at 300 sec, right? I guess there is room for improvement beyond 300 sec. Thanks for the test.
You're welcome. Yes, 1.70 at 300 sec. My guess is that at, say, 600 sec per position, the ratio would change very little from 1.70. I don't think the extra 300 secs would change the NPS numbers, not even for 16 threads. Even if there were a small change, it would be indistinguishable from the normal variation caused by parallel searching.
I don't see a lot of NPS variation myself. I see a lot of time-to-depth variation, however. That's been true since I ran on a Univac dual-CPU machine at the 1978 ACM tournament. With Cray Blitz, the NPS was rock-solid, but it had a better splitting algorithm (DTS). With Crafty, as I posted in another part of this thread last night, the NPS varies some, but it pretty well settles in at the max; 240 secs to 480 secs changes nothing at all.
NPS is much more faithfully measured on 37 positions; it has a small variance, following a narrow normal distribution. TTD (time-to-depth) has much higher variance and follows a wide beta distribution. 37 positions are not enough for TTD: flukes where one position takes forever while another reaches the same depth quickly happen all the time, or the same position behaves differently on two runs with several cores. In fact, you explained that in an earlier post.
I agree about TTD. I use a couple of hundred positions, and I run them all multiple times at each thread level.

But for NPS, Crafty doesn't vary much. In some positions the NPS might be 4M with 1 thread and close to 32M with 8. In others it might be 3M with 1 thread, but then it will be close to 24M with 8. I can think of no reason for NPS scaling to break so long as the 1-cpu and N-cpu tests use the same position. I suppose if you toss EGTBs into the mix and pick a position close to the threshold where EGTB hits occur, you might see this. But I don't do NPS tests with EGTBs, since they are sensitive to the disk drive type being used (from a typical 7200 RPM drive, to 15K RPM SCSI drives, to really fast SSD devices).

I can run a bunch of 8-core tests. Our 12-core boxes are currently busy for a while, but if you'd like to see opening, MG and EG positions with 1 and 8 threads, I can do that easily.
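To make the variance point concrete, here is a small sketch of the comparison: per-position NPS ratios cluster tightly while per-position TTD ratios spread widely, so a TTD average converges far more slowly. The sample values below are invented placeholders, not measurements from this thread.

Code: Select all

#include <cmath>
#include <cstdio>
#include <vector>

// stddev / mean: a scale-free way to compare the spread of the two metrics
double coeff_of_variation(const std::vector<double>& v) {
    double mean = 0, var = 0;
    for (double x : v) mean += x;
    mean /= v.size();
    for (double x : v) var += (x - mean) * (x - mean);
    return std::sqrt(var / v.size()) / mean;
}

int main() {
    // hypothetical 16-vs-8-thread ratios over a handful of positions
    std::vector<double> nps_ratio = {1.63, 1.65, 1.64, 1.62, 1.66};
    std::vector<double> ttd_ratio = {1.10, 1.90, 1.35, 2.40, 1.05};
    printf("NPS cv = %.3f   TTD cv = %.3f\n",
           coeff_of_variation(nps_ratio), coeff_of_variation(ttd_ratio));
}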
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

zullil wrote:
bob wrote:
zullil wrote:
zullil wrote:
Laskos wrote: These numbers are time (or depth) dependent. They increase with time. Could you run again with twice the time, or, if it's too lengthy, with half the time? To see the trend.
Yes, NPS increases as a function of search time, though the rate of increase slows and the NPS value seems to approach a fairly stable asymptotic value.

I think the numbers I reported (from 300 seconds per position for 37 positions) are quite close to the limits I'd get on my hardware. So testing again with 150 seconds per position might be more revealing than testing with 600 seconds per position. Will do this as soon as I am able.
Here are numbers for 150 sec. per position, with 8 threads and with 16 threads.

Code: Select all

Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

./stockfish bench 16384 16 150000 default time
===========================
Total time (ms) : 5550063
Nodes searched  : 129857933271
Nodes/second    : 23397560

./stockfish bench 16384 8 150000 default time
===========================
Total time (ms) : 5550010
Nodes searched  : 79227345027
Nodes/second    : 14275171

23397560/14275171 = 1.64
I dislike these numbers. It is FAR clearer to always compare N-core speed to 1-core speed. A 1.64x ratio applied to a bad baseline loses its significance when only the 1.64x is reported. In normal parallel search papers, nobody cares about the 8-core vs 16-core speedup. Everyone measures and compares 8 cores vs 1 core and 16 cores vs 1 core.

This way, if you see 12.0x on 16 cores, you know that 1/4 of the hardware is wasted. With the 1.64 you have to work your way back to 1 cpu carefully to see whether that is good, bad or terrible.
My post was to report 8-to-16-core scaling data for Stockfish. It is generally accepted that Stockfish scales reasonably well to 8 cores, but underperforms upon transitioning from 8 to 16. Given the context, I posted (and tested) no more than needed.
Unfortunately "scales reasonably well to 8" is not a number one can use to understand the 1.64x result. What is reasonable? 7x? Then 1.64x of that would be 11.5, which is not so good. 7.5X? 12.3 which is also not so good.

Not a big deal, just not as clear as it could be.

If you use the 8 cpu speedup over 1, why not use the 16 cpu speedup over one rather than over the 8 cpu speedup?
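To make this concrete, here is a tiny sketch that converts the reported 8-to-16 NPS ratio into a 16-core speedup over 1 core. The 8-core speedup of 7.0 is an assumed placeholder, not a measured value.

Code: Select all

#include <cstdio>

int main() {
    double nps8  = 14275171.0;            // from the bench output above
    double nps16 = 23397560.0;
    double ratio = nps16 / nps8;          // ~1.64
    double speedup8  = 7.0;               // hypothetical 8-core speedup over 1
    double speedup16 = speedup8 * ratio;  // ~11.5 out of an ideal 16
    printf("ratio %.2f, 16-core speedup %.1f, efficiency %.0f%%\n",
           ratio, speedup16, 100.0 * speedup16 / 16.0);
}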
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Best Stockfish NPS scaling yet

Post by zullil »

bob wrote:
Unfortunately "scales reasonably well to 8" is not a number one can use to understand the 1.64x result. What is reasonable? 7x? Then 1.64x of that would be 11.5, which is not so good. 7.5X? 12.3 which is also not so good.

Not a big deal, just not as clear as it could be.

If you use the 8 cpu speedup over 1, why not use the 16 cpu speedup over one rather than over the 8 cpu speedup?
Because at the moment I'm only interested in what happens from 8 to 16, as the SF team tries various SMP tweaks. Someone else is free to convert "scales reasonably well to 8" into a number. If one of the tweaks hurts scaling from 1 to 8, I'm sure others will report that.

Just trying to save some time by not testing 1 thread performance too. :wink:
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Best Stockfish NPS scaling yet

Post by michiguel »

syzygy wrote:
Dann Corbit wrote:If a program is only 2% serial then 42 is approximately the maximum speedup you can see, with infinite processors.
There is no fixed serial part in a parallel alpha-beta implementation (well, it might depend on the implementation of course). The longer / deeper you search, the smaller the serial part.

Gustafson's law might be a better approximation. (Of course we are only considering NPS speedup here).
Right, and strictly speaking, probably neither. A true Amdahl behavior may apply if you search the _exact_ same nodes and see how much faster you do it (fixed task, variable time). But in chess, you have the same amount of time and you see if you can do even more (Gustafson: fixed time, variable length of task). But not only do you do more, you do it differently. That is, with the extra time you not only add more nodes; the whole tree changes. Some "Amdahl-like" equation (which is not really very innovative; it is a simple saturation scheme observed in the natural sciences) may describe some of the variables, but conceptually the problem is different.
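For reference, the two laws being contrasted are easy to state in code. A minimal sketch follows; the serial fraction s = 0.02 is just the illustrative figure from Dann's post, not a measured one.

Code: Select all

#include <cstdio>

int main() {
    double s = 0.02;  // assumed serial fraction
    for (int n = 1; n <= 64; n *= 2) {
        double amdahl    = 1.0 / (s + (1.0 - s) / n);  // fixed work, less time
        double gustafson = s + (1.0 - s) * n;          // fixed time, more work
        printf("%2d cores: Amdahl %6.2fx   Gustafson %6.2fx\n",
               n, amdahl, gustafson);
    }
    // Amdahl's speedup is capped at 1/s as n grows; Gustafson's scaled
    // speedup keeps growing linearly.
}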

Miguel
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

zullil wrote:
bob wrote:
Unfortunately "scales reasonably well to 8" is not a number one can use to understand the 1.64x result. What is reasonable? 7x? Then 1.64x of that would be 11.5, which is not so good. 7.5X? 12.3 which is also not so good.

Not a big deal, just not as clear as it could be.

If you use the 8 cpu speedup over 1, why not use the 16 cpu speedup over one rather than over the 8 cpu speedup?
Because at the moment I'm only interested in what happens from 8 to 16, as the SF team tries various SMP tweaks. Someone else is free to convert "scales reasonably well to 8" into a number. If one of the tweaks hurts scaling from 1 to 8, I'm sure others will report that.

Just trying to save some time by not testing 1 thread performance too. :wink:
You only have to test 1-cpu NPS once. Then you can run as many 8- or 16-core tests as you want as changes are made. Parallel stuff such as lock changes won't influence the single-thread NPS.

My question would be "what is acceptable for 8 to 16?" If 1-8 is bad, 2x for 8-16 would still be bad.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Best Stockfish NPS scaling yet

Post by zullil »

bob wrote: You only have to test 1-cpu NPS once. Then you can run as many 8- or 16-core tests as you want as changes are made. Parallel stuff such as lock changes won't influence the single-thread NPS.

My question would be "what is acceptable for 8 to 16?" If 1-8 is bad, 2x for 8-16 would still be bad.
Yes, and I did test 1 thread, and I posted the result---somewhere in this thread. :D
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

michiguel wrote:
syzygy wrote:
Dann Corbit wrote:If a program is only 2% serial then 42 is approximately the maximum speedup you can see, with infinite processors.
There is no fixed serial part in a parallel alpha-beta implementation (well, it might depend on the implementation of course). The longer / deeper you search, the smaller the serial part.

Gustafson's law might be a better approximation. (Of course we are only considering NPS speedup here).
Right, and strictly speaking, probably neither. A true Amdahl behavior may apply if you search the _exact_ same nodes and see how much faster you do it (fixed task, variable time). But in chess, you have the same amount of time and you see if you can do even more (Gustafson: fixed time, variable length of task). But not only do you do more, you do it differently. That is, with the extra time you not only add more nodes; the whole tree changes. Some "Amdahl-like" equation (which is not really very innovative; it is a simple saturation scheme observed in the natural sciences) may describe some of the variables, but conceptually the problem is different.

Miguel
Logic tells me that Amdahl's law will apply here, as the goal is not about searching a larger tree; it is about searching any tree faster. Whether there is any serial code or not depends on the design. In DTS there was essentially none. Our scaling back then, in terms of NPS, was almost perfect from 1 to 32 cpus, the max we ever had on a Cray (the T932). But then we didn't have any minimum split depth to deal with either, which introduces serial code within N plies of the tips, where you can't do a split.

But neither of these is the cause of the NPS losses being discussed. There are lots of places for NPS loss in a parallel program: too much cache traffic caused by sharing frequently used data, more memory traffic than the memory bus can keep up with, data on the wrong node in a NUMA architecture, excessive lock contention (which turns into excessive cache traffic, since locks really smoke cache controllers) and so forth. Most of that can be driven to zero with effort and study. Perhaps all of it, if you are willing to do as we did and not depend on YBW, but instead predict whether a node is an ALL node and split if it appears to be, even before anything has been searched. Of course, getting that wrong will keep the NPS up, but the size of the tree will grow and TTD will suffer.
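As one concrete instance of the cache-traffic problem: per-thread counters that land on the same cache line get invalidated on every update by any thread. Below is a minimal sketch of the usual fix, assuming a 64-byte line; it is illustrative, not any engine's actual code.

Code: Select all

#include <atomic>

// Bad: sixteen counters packed into a few cache lines; every increment by
// one thread invalidates those lines in every other core's cache.
struct SharedCounters {
    std::atomic<long> nodes[16];
};

// Better: pad each counter out to its own (assumed 64-byte) cache line so
// a thread bumping its own counter generates no coherence traffic.
struct alignas(64) PaddedCounter {
    std::atomic<long> nodes{0};
};

struct PerThreadCounters {
    PaddedCounter t[16];  // each thread touches only its own line
};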
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Best Stockfish NPS scaling yet

Post by michiguel »

bob wrote:
michiguel wrote:
syzygy wrote:
Dann Corbit wrote:If a program is only 2% serial then 42 is approximately the maximum speedup you can see, with infinite processors.
There is no fixed serial part in a parallel alpha-beta implementation (well, it might depend on the implementation of course). The longer / deeper you search, the smaller the serial part.

Gustafson's law might be a better approximation. (Of course we are only considering NPS speedup here).
Right, and strictly speaking, probably neither. A true Amdahl behavior may apply if you search the _exact_ same nodes and see how much faster you do it (fixed task, variable time). But in chess, you have the same amount of time and you see if you can do even more (Gustafson: fixed time, variable length of task). But not only do you do more, you do it differently. That is, with the extra time you not only add more nodes; the whole tree changes. Some "Amdahl-like" equation (which is not really very innovative; it is a simple saturation scheme observed in the natural sciences) may describe some of the variables, but conceptually the problem is different.

Miguel
Logic tells me that Amdahl's law will apply here, as the goal is not about searching a larger tree; it is about searching any tree faster.
The goal in chess is not to find moves faster; it is to search more within the amount of time you are given. Subtle, but a big difference. For that reason alone, Amdahl does not apply (by definition) and it becomes Gustafson-ish. You may have issues with serial parts here and there causing problems, but that is a different point.

Miguel

Whether there is any serial code or not depends on the design. In DTS there was essentially none. Our scaling back then, in terms of NPS, was almost perfect from 1 to 32 cpus, the max we ever had on a Cray (the T932). But then we didn't have any minimum split depth to deal with either, which introduces serial code within N plies of the tips, where you can't do a split.

But neither of these is the cause of the NPS losses being discussed. There are lots of places for NPS loss in a parallel program: too much cache traffic caused by sharing frequently used data, more memory traffic than the memory bus can keep up with, data on the wrong node in a NUMA architecture, excessive lock contention (which turns into excessive cache traffic, since locks really smoke cache controllers) and so forth. Most of that can be driven to zero with effort and study. Perhaps all of it, if you are willing to do as we did and not depend on YBW, but instead predict whether a node is an ALL node and split if it appears to be, even before anything has been searched. Of course, getting that wrong will keep the NPS up, but the size of the tree will grow and TTD will suffer.
syzygy
Posts: 5563
Joined: Tue Feb 28, 2012 11:56 pm

Re: Best Stockfish NPS scaling yet

Post by syzygy »

I think the point is that the longer / deeper you search, the smaller the effect of inherent synchronisation / serialisation within the search algorithm becomes.

For example, with YBWC there is a clear synchronisation point as you get close to completing the search of the PV node at depth d. For a brief moment all but one thread will have run out of work and must wait for that last thread to finish. But the amount of "lost" search time due to this serialisation does not continue to grow for a single depth as the remaining depth of the tree gets bigger and bigger. In other words, the total time lost (at the PV nodes at all depths) is more or less linear with the total depth of the search tree. At the same time, the total size of the search tree grows exponentially. The ratio between time lost and time spent crunching nodes therefore goes to 0.
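A quick numeric sketch of that argument: if the time lost to synchronisation grows linearly with depth while the tree grows exponentially, the lost fraction goes to zero. The branching factor of 2 is an assumed illustrative value, not a measured one.

Code: Select all

#include <cmath>
#include <cstdio>

int main() {
    double b = 2.0;                     // assumed effective branching factor
    for (int d = 10; d <= 40; d += 10) {
        double lost  = d;               // sync loss, roughly linear in depth
        double total = std::pow(b, d);  // tree size, roughly b^d
        printf("depth %2d: lost fraction %.2e\n", d, lost / total);
    }
}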
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Best Stockfish NPS scaling yet

Post by bob »

michiguel wrote:
bob wrote:
michiguel wrote:
syzygy wrote:
Dann Corbit wrote:If a program is only 2% serial then 42 is approximately the maximum speedup you can see, with infinite processors.
There is no fixed serial part in a parallel alpha-beta implementation (well, it might depend on the implementation of course). The longer / deeper you search, the smaller the serial part.

Gustafson's law might be a better approximation. (Of course we are only considering NPS speedup here).
Right, and strictly speaking, probably neither. A true Amdahl behavior may apply if you search the _exact_ same nodes and see how much faster you do it (fixed task, variable time). But in chess, you have the same amount of time and you see if you can do even more (Gustafson: fixed time, variable length of task). But not only do you do more, you do it differently. That is, with the extra time you not only add more nodes; the whole tree changes. Some "Amdahl-like" equation (which is not really very innovative; it is a simple saturation scheme observed in the natural sciences) may describe some of the variables, but conceptually the problem is different.

Miguel
Logic tells me that Amdahl's law will apply here, as the goal is not about searching a larger tree; it is about searching any tree faster.
The goal in chess is not to find moves faster; it is to search more within the amount of time you are given. Subtle, but a big difference. For that reason alone, Amdahl does not apply (by definition) and it becomes Gustafson-ish. You may have issues with serial parts here and there causing problems, but that is a different point.

Miguel

Whether there is any serial code or not depends on the design. In DTS there was essentially none. Our scaling back then, in terms of NPS, was almost perfect from 1 to 32 cpus, the max we ever had on a Cray (the T932). But then we didn't have any minimum split depth to deal with either, which introduces serial code within N plies of the tips, where you can't do a split.

But neither of these is the cause of the NPS losses being discussed. There are lots of places for NPS loss in a parallel program: too much cache traffic caused by sharing frequently used data, more memory traffic than the memory bus can keep up with, data on the wrong node in a NUMA architecture, excessive lock contention (which turns into excessive cache traffic, since locks really smoke cache controllers) and so forth. Most of that can be driven to zero with effort and study. Perhaps all of it, if you are willing to do as we did and not depend on YBW, but instead predict whether a node is an ALL node and split if it appears to be, even before anything has been searched. Of course, getting that wrong will keep the NPS up, but the size of the tree will grow and TTD will suffer.
With pure alpha-beta and parallel search, there is no gain from searching a larger tree. The search doesn't get more efficient as it goes deeper; that is a characteristic of a particular implementation rather than of parallel search in general. I don't think the Gustafson idea applies here, since going deeper has no inherent additional efficiency, which is what his original paper talked about.

The only real issue in a parallel search is the time spent with processors idle (hence serial bottlenecks or critical-section processing). This can REALLY be driven down into the noise with enough effort (maybe not with a recursive search, but it can be done).

In parallel search, I think the "serial part" of the search can be driven almost to zero. Not quite zero, probably, but close enough. YBW increases this serial part significantly, in an attempt to trade a few serial nodes for a reduction in the total nodes searched.
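To make that trade-off concrete, here is a minimal YBW-style skeleton; all types and helpers are stubs for illustration, not Crafty's or Stockfish's actual code. The serial part is the eldest-brother search that must finish before any split.

Code: Select all

#include <algorithm>
#include <vector>

struct Position { /* stub */ };
struct Move { /* stub */ };

constexpr int MIN_SPLIT_DEPTH = 4;  // assumed threshold, engine-specific

std::vector<Move> generate(const Position&);                           // stub
Position make_move(const Position&, Move);                             // stub
bool idle_threads_available();                                         // stub
int parallel_search(const Position&, int alpha, int beta, int depth);  // stub

int search(const Position& pos, int alpha, int beta, int depth) {
    if (depth == 0) return 0;  // stand-in for a quiescence/eval call
    bool first = true;
    for (Move m : generate(pos)) {
        Position child = make_move(pos, m);
        int score;
        if (!first && depth >= MIN_SPLIT_DEPTH && idle_threads_available())
            // YBW: split only after the eldest brother has been searched
            // serially. The DTS-era alternative predicts an ALL node and
            // splits immediately, shrinking the serial part at the risk
            // of searching a larger tree.
            score = -parallel_search(child, -beta, -alpha, depth - 1);
        else
            score = -search(child, -beta, -alpha, depth - 1);
        first = false;
        if (score >= beta) return score;  // fail-high cutoff
        alpha = std::max(alpha, score);
    }
    return alpha;
}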