Best Stockfish NPS scaling yet

zullil · Post by **zullil** » Thu Mar 05, 2015 6:48 pm

Laskos wrote:
zullil wrote:
Laskos wrote: BTW, Amdahl's law is applied differently to SMP compared to what Dann does. See:
http://talkchess.com/forum/viewtopic.ph ... t&start=10
Miguel's Eq. 1 is equivalent to Dann's formula; f = 1 / (1 + R). Both make sense to me---simple math.

And in either case, the limit of speedup as cores --> ∞ is 1 + R ( = 1/f ).
For fixed n cores, R --> ∞ as depth --> ∞, and the speedup at infinite depth is simply n, the number of cores.

Yes (based on above).
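
The two limits agreed on above can be checked numerically. The following is a minimal sketch that assumes the textbook form of Amdahl's law, speedup(n) = 1 / (f + (1 - f) / n), with the serial fraction f = 1 / (1 + R) as stated in the post. The function name and the sample values of R and n are illustrative, not taken from the thread.

```python
def amdahl_speedup(n, R):
    """Amdahl speedup on n cores, with serial fraction f = 1 / (1 + R)."""
    f = 1.0 / (1.0 + R)
    return 1.0 / (f + (1.0 - f) / n)


if __name__ == "__main__":
    # Fixed R, growing core count: the speedup approaches 1 + R.
    for n in (8, 16, 1024, 10**6):
        print(f"R=9, n={n}: speedup={amdahl_speedup(n, 9):.3f}")

    # Fixed core count, growing R: the speedup approaches n.
    for R in (10, 1000, 10**6):
        print(f"n=8, R={R}: speedup={amdahl_speedup(8, R):.3f}")
```

Algebraically the same expression is n(1 + R) / (n + R), which makes both limits visible at a glance.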

syzygy · Post by **syzygy** » Thu Mar 05, 2015 9:36 pm

Dann Corbit wrote:If a program is only 2% serial then 42 is approximately the maximum speedup you can see, with infinite processors.

There is no fixed serial part in a parallel alpha-beta implementation (well, it might depend on the implementation of course). The longer / deeper you search, the smaller the serial part.

Gustafson's law might be a better approximation. (Of course we are only considering NPS speedup here).
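
For comparison, here is a minimal sketch of the two laws side by side, assuming their textbook forms. Amdahl's law holds the total work fixed, while Gustafson's law lets the parallel work grow with the core count, which is closer to a search that simply goes deeper when given more time. The serial fraction 0.02 is borrowed from the quote above purely as an illustration; it is not a measured value for any engine.

```python
def amdahl(n, s):
    """Fixed-size speedup: s is the serial fraction of the single-core run."""
    return 1.0 / (s + (1.0 - s) / n)


def gustafson(n, s):
    """Scaled speedup: s is the serial fraction of the time on the n-core run."""
    return s + (1.0 - s) * n


if __name__ == "__main__":
    s = 0.02  # illustrative serial fraction only
    for n in (8, 16):
        print(f"n={n}: Amdahl={amdahl(n, s):.2f}  Gustafson={gustafson(n, s):.2f}")
```

For the same s and n, the Gustafson figure is never below the Amdahl figure, and the gap widens as n grows.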

bob · Post by **bob** » Fri Mar 06, 2015 4:59 am

syzygy wrote:
Dann Corbit wrote:If a program is only 2% serial then 42 is approximately the maximum speedup you can see, with infinite processors.
There is no fixed serial part in a parallel alpha-beta implementation (well, it might depend on the implementation of course). The longer / deeper you search, the smaller the serial part.

Gustafson's law might be a better approximation. (Of course we are only considering NPS speedup here).

Actually there is a measurable serial part. YBW requires that at each PV node, the first move has to be searched first, before the rest are searched in parallel. Since everyone uses some sort of minimum split depth, the first PV node at the minimum split depth will be completely serial. Then there is the probabilistic part of this where a processor goes idle and has to wait until some processor that is busy reaches a node where YBW has been satisfied. I tend to agree that for every ply deeper, that minimum split depth becomes a smaller fraction of the total search space.
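
The serial and parallel parts described above can be sketched on a toy tree. This is only a structural illustration under stated assumptions, not Crafty's or Stockfish's implementation: the tree is a nested Python list with integer leaves, `MIN_SPLIT_HEIGHT` is a hypothetical stand-in for an engine's minimum split depth, younger siblings are not aborted when a cutoff occurs, and Python threads will not give a real speedup because of the interpreter lock.

```python
from concurrent.futures import ThreadPoolExecutor

INF = float("inf")
MIN_SPLIT_HEIGHT = 2  # hypothetical stand-in for a minimum split depth


def height(node):
    """Height of a toy tree: leaves are ints, inner nodes are non-empty lists."""
    return 0 if isinstance(node, int) else 1 + max(height(c) for c in node)


def alphabeta(node, alpha, beta):
    """Serial fail-soft negamax; leaf values are from the side to move."""
    if isinstance(node, int):
        return node
    best = -INF
    for child in node:
        best = max(best, -alphabeta(child, -beta, -alpha))
        alpha = max(alpha, best)
        if alpha >= beta:
            break  # beta cutoff
    return best


def ybw(node, alpha, beta, pool):
    """Young Brothers Wait: eldest child serially, then the rest in parallel."""
    if isinstance(node, int) or height(node) < MIN_SPLIT_HEIGHT:
        return alphabeta(node, alpha, beta)  # below the split threshold
    # Serial part: the first move must finish before any sibling starts.
    best = -ybw(node[0], -beta, -alpha, pool)
    alpha = max(alpha, best)
    if alpha >= beta:
        return best
    # Parallel part: the younger brothers share the window set by the eldest.
    futures = [pool.submit(alphabeta, c, -beta, -alpha) for c in node[1:]]
    for fut in futures:
        best = max(best, -fut.result())
    return best


if __name__ == "__main__":
    tree = [[3, [5, 2]], [[6, 1], 9], [[1, 4], [2, 7]]]
    with ThreadPoolExecutor(max_workers=4) as pool:
        print(ybw(tree, -INF, INF, pool), alphabeta(tree, -INF, INF))
```

Only the main thread submits work here, so worker threads never wait on each other. The time spent inside the first `ybw` call at each split node is the serial part the post refers to.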

I don't think this reduces forever, however. I ran a quick test on a pretty old dual quad-core Intel box, searching the same position on 8 CPUs for 15 secs, 30 secs, 1 min, 2 mins, 4 mins, 8 mins and 16 mins. Here are the NPS numbers:

15 secs   23.5M
30 secs   24.0M
1 min     24.3M
2 mins    24.9M
4 mins    24.8M
8 mins    25.2M
16 mins   25.1M

I started a 32 minute run but after 25 minutes it was at 25.2M so it would seem that is about the peak for Crafty on this box. Crafty reported a total of 2% of the machine was lost to idle time, or 98% of the available compute power was spent doing everything but waiting on work.

BTW for the NPS scaling group, here's a 2 minute search, one cpu, same position:

time=2:00(100%) nodes=381257772(381.3M) fh1=93% pred=0 nps=3.2M

25.2 / 3.2 = 7.9x. There is not a lot left to gain. The fastest possible NPS with 0% wait time would be 8 * 3.2M = 25.6M, so it is getting pretty close to optimal for this box.
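
The arithmetic above can be written out as a small sketch. It uses only the rounded figures quoted in the post (25.2M peak on 8 CPUs, 3.2M on one CPU); the function name is illustrative.

```python
def speedup_and_efficiency(nps_parallel, nps_single, cores):
    """Return (NPS speedup over one core, fraction of the ideal N-core NPS)."""
    speedup = nps_parallel / nps_single
    return speedup, speedup / cores


if __name__ == "__main__":
    cores = 8
    nps_single = 3.2e6  # 2 minute search on one CPU, as reported
    nps_peak = 25.2e6   # best 8-CPU figure, as reported

    speedup, efficiency = speedup_and_efficiency(nps_peak, nps_single, cores)
    print(f"speedup    = {speedup:.3f}x")
    print(f"ideal NPS  = {cores * nps_single / 1e6:.1f}M")
    print(f"efficiency = {efficiency:.1%}")
```

The resulting efficiency of about 98% lines up with the 2% idle time Crafty reported.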

This is a dual Intel E5345 @ 2.3 GHz, part of a cluster we have had for at least 6-7 years.

This is also Crafty 25.0 which certainly has some SMP improvements included...

zullil · Post by **zullil** » Fri Mar 06, 2015 1:47 pm

zullil wrote:
Laskos wrote: These numbers are time (or depth) dependent. They increase with time. Could you run again with twice the time, or, if it's too lengthy, with half the time? To see the trend.
Yes, NPS increases as a function of search time, though the rate of increase slows and the NPS value seems to approach a fairly stable asymptotic value.

I think the numbers I reported (from 300 seconds per position for 37 positions) are quite close to the limits I'd get on my hardware. So testing again with 150 seconds per position might be more revealing than testing with 600 seconds per position. Will do this as soon as I am able.

Here are numbers for 150 sec. per position, with 8 threads and with 16 threads.

Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

./stockfish bench 16384 16 150000 default time
===========================
Total time (ms) : 5550063
Nodes searched  : 129857933271
Nodes/second    : 23397560

./stockfish bench 16384 8 150000 default time
===========================
Total time (ms) : 5550010
Nodes searched  : 79227345027
Nodes/second    : 14275171

23397560/14275171 = 1.64
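
As a sanity check, the reported figures can be rederived from the raw totals. This is a small sketch using only the numbers printed above; integer division of nodes by elapsed time is an assumption about how the figure was rounded, chosen because it reproduces the reported values.

```python
def nodes_per_second(nodes, time_ms):
    """Nodes per second from a node count and a time in milliseconds."""
    return nodes * 1000 // time_ms


# threads -> (nodes searched, total time in ms), copied from the bench output
runs = {
    16: (129857933271, 5550063),
    8: (79227345027, 5550010),
}

if __name__ == "__main__":
    nps = {threads: nodes_per_second(*totals) for threads, totals in runs.items()}
    for threads, value in sorted(nps.items()):
        print(f"{threads:2d} threads: {value} nps")
    print(f"16 vs 8 threads: {nps[16] / nps[8]:.2f}x")
```

The total time of roughly 5550 seconds is also consistent with 37 positions at 150 seconds each.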

Laskos · Post by **Laskos** » Fri Mar 06, 2015 2:55 pm

zullil wrote:
23397560/14275171 = 1.64

That's compared to 1.70 at 300 sec, right? I guess there is room for improvement beyond 300 sec. Thanks for the test.

zullil · Post by **zullil** » Fri Mar 06, 2015 3:51 pm

Laskos wrote:
That's compared to 1.70 at 300 sec, right? I guess there is room for improvement beyond 300 sec. Thanks for the test.

You're welcome. Yes, 1.70 at 300 sec. My guess is that at, say, 600 sec per position the ratio would change very little from 1.70. I don't think the extra 300 secs would change the NPS numbers, not even for 16 threads. Even if there were a small change, it would be indistinguishable from the normal variations caused by parallel searching.

bob · Post by **bob** » Fri Mar 06, 2015 7:08 pm

zullil wrote:
23397560/14275171 = 1.64

I dislike these numbers. It is FAR clearer to always compare N core speed to 1 core speed. 1.64x times a bad number loses its significance when just reporting 1.64x. In normal parallel search papers, nobody cares about 8 core vs 16 core speedup. Everyone measures and compares 8 core vs 1 core and 16 core vs 1 core.

This way, if you see 12.0x on 16 cores, you know that 1/4 of the hardware is wasted. With the 1.64 you have to work your way back to 1 CPU carefully to see whether that is good, bad or terrible.
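
The point can be made concrete with a short sketch. The 1.64 ratio comes from the posted bench runs; the 8-core speedups of 7.0x and 7.8x are hypothetical placeholders, since no 1-core Stockfish figure appears in this part of the thread.

```python
def wasted_fraction(speedup_vs_1core, cores):
    """Share of the hardware not converted into NPS."""
    return 1.0 - speedup_vs_1core / cores


def chain_to_one_core(speedup_8_vs_1, ratio_16_vs_8):
    """16-core speedup over 1 core, from an 8-core speedup and a 16/8 ratio."""
    return speedup_8_vs_1 * ratio_16_vs_8


if __name__ == "__main__":
    print(f"12.0x on 16 cores wastes {wasted_fraction(12.0, 16):.0%}")

    ratio_16_vs_8 = 1.64  # from the posted bench runs
    for s8 in (7.0, 7.8):  # hypothetical 8-core speedups, not measured
        s16 = chain_to_one_core(s8, ratio_16_vs_8)
        print(f"8-core {s8:.1f}x -> 16-core {s16:.2f}x, "
              f"wasted {wasted_fraction(s16, 16):.0%}")
```

The same 1.64 ratio maps to noticeably different 16-core results depending on the 8-core baseline, which is why a 1-core reference makes the figure easier to judge.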

bob · Post by **bob** » Fri Mar 06, 2015 7:11 pm

zullil wrote:
Laskos wrote:
That's compared to 1.70 at 300 sec, right? I guess there is room for improvement beyond 300 sec. Thanks for the test.
You're welcome. Yes, 1.70 at 300 sec. My guess is that at, say, 600 sec per position the ratio would change very little from 1.70. I don't think the extra 300 secs would change the NPS numbers, not even for 16 threads. Even if there were a small change, it would be indistinguishable from the normal variations caused by parallel searching.

I don't see a lot of NPS variation myself. I see a lot of time-to-depth variation, however. That's been true since I ran on a dual-CPU Univac machine at the 1978 ACM tournament. With Cray Blitz, the NPS was rock-solid. But it had a better splitting algorithm (DTS). With Crafty, as I posted in another part of this thread last night, the NPS varies some, but pretty well settles in at the max. Going from 240 secs to 480 secs changes nothing at all.

Laskos · Post by **Laskos** » Fri Mar 06, 2015 7:19 pm

bob wrote:
I don't see a lot of NPS variation myself. I see a lot of time-to-depth variation, however. That's been true since I ran on a dual-CPU Univac machine at the 1978 ACM tournament. With Cray Blitz, the NPS was rock-solid. But it had a better splitting algorithm (DTS). With Crafty, as I posted in another part of this thread last night, the NPS varies some, but pretty well settles in at the max. Going from 240 secs to 480 secs changes nothing at all.

NPS can be measured much more faithfully on 37 positions: it has a small variance and follows a narrow normal distribution. TTD (time-to-depth) has a much higher variance and follows a wide beta distribution. 37 positions are not enough for TTD; flukes happen all the time, where one position takes forever and another reaches the same depth quickly, or where the same position behaves differently on 2 different runs on several cores. In fact, you explained that in an earlier post.

zullil · Post by **zullil** » Fri Mar 06, 2015 7:24 pm

bob wrote:
I dislike these numbers. It is FAR clearer to always compare N core speed to 1 core speed. 1.64x times a bad number loses its significance when just reporting 1.64x. In normal parallel search papers, nobody cares about 8 core vs 16 core speedup. Everyone measures and compares 8 core vs 1 core and 16 core vs 1 core.

This way, if you see 12.0x on 16 cores, you know that 1/4 of the hardware is wasted. With the 1.64 you have to work your way back to 1 CPU carefully to see whether that is good, bad or terrible.

My post was to report 8-to-16-core scaling data for Stockfish. It is generally accepted that Stockfish scales reasonably well to 8 cores, but underperforms upon transitioning from 8 to 16. Given the context, I posted (and tested) no more than needed.
