Yes (based on above).
Laskos wrote: For fixed n cores, R -> infinity as depth -> infinity, and the speed-up to infinite depth is simply n, the number of cores.
zullil wrote: Miguel's Eq. 1 is equivalent to Dann's formula; f = 1 / (1 + R). Both make sense to me---simple math.
Laskos wrote: BTW Amdahl's law is applied differently to SMP compared to what Dann does. See:
http://talkchess.com/forum/viewtopic.ph ... t&start=10
And in either case, the limit of speedup as cores --> ∞ is 1 + R ( = 1/f ).
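The two limits mentioned in this exchange can be written out explicitly. This is only a sketch: it assumes R is the ratio of parallelizable to serial work, so that f = 1 / (1 + R) is the serial fraction, and that n is the number of cores. Those readings are inferred from the formulas quoted here, not spelled out in the thread.

```latex
S(n) = \frac{1}{f + \dfrac{1 - f}{n}} = \frac{1 + R}{1 + R/n},
\qquad f = \frac{1}{1 + R}

\lim_{n \to \infty} S(n) = 1 + R = \frac{1}{f},
\qquad
\lim_{R \to \infty} S(n) = n
```

The first limit is the usual Amdahl ceiling for a fixed serial fraction; the second is the fixed-core case where the serial share vanishes as depth grows.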
Best Stockfish NPS scaling yet
- zullil
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Best Stockfish NPS scaling yet
- syzygy
- Posts: 5566
- Joined: Tue Feb 28, 2012 11:56 pm
Re: Best Stockfish NPS scaling yet
Dann Corbit wrote: If a program is only 2% serial then 42 is approximately the maximum speedup you can see, with infinite processors.
There is no fixed serial part in a parallel alpha-beta implementation (well, it might depend on the implementation of course). The longer / deeper you search, the smaller the serial part.
Gustafson's law might be a better approximation. (Of course we are only considering NPS speedup here).
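To make the contrast concrete, here is a minimal sketch of both laws in their textbook forms. The function names are made up for illustration, and the Gustafson version assumes the serial fraction is measured on the parallel run, which is the usual convention but is not stated in the thread. Under the Amdahl form, a 2% serial fraction caps the speedup at 1 / 0.02 = 50, while the Gustafson form keeps growing with the core count.

```python
def amdahl_speedup(serial_fraction: float, cores: int) -> float:
    """Fixed problem size: the serial share stays constant as cores are added."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / cores)


def gustafson_speedup(serial_fraction: float, cores: int) -> float:
    """Scaled problem size: serial_fraction is taken from the parallel run."""
    return cores - serial_fraction * (cores - 1)


f = 0.02  # the 2% serial share used in the quoted post
for n in (8, 16, 64, 1024):
    print(f"{n:5d} cores: Amdahl {amdahl_speedup(f, n):7.2f}x, "
          f"Gustafson {gustafson_speedup(f, n):8.2f}x")
```

Which law fits better depends on whether the work grows with the resources, which is roughly the point being made about deeper searches.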
- bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Best Stockfish NPS scaling yet
syzygy wrote: There is no fixed serial part in a parallel alpha-beta implementation (well, it might depend on the implementation of course). The longer / deeper you search, the smaller the serial part. Gustafson's law might be a better approximation. (Of course we are only considering NPS speedup here).
Actually there is a measurable serial part. YBW requires that at each PV node, the first move has to be searched first, before the rest are searched in parallel. Since everyone uses some sort of minimum split depth, the first PV node at the minimum split depth will be completely serial. Then there is the probabilistic part of this, where a processor goes idle and has to wait until some processor that is busy reaches a node where YBW has been satisfied. I tend to agree that for every ply deeper, that minimum split depth becomes a smaller fraction of the total search space.
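The serial step that YBW (young brothers wait) imposes can be shown on a toy game tree. This is a rough sketch, not Crafty's or any engine's actual code: the tree format (nested lists with integer leaf scores), the `MIN_SPLIT_DEPTH` value, and all function names are assumptions made for illustration. Python threads will not give a real speedup here; the point is only the ordering, with the eldest move searched serially and the remaining moves handed out afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

INF = float("inf")
MIN_SPLIT_DEPTH = 2  # hypothetical threshold; real engines tune this


def height(node):
    """Remaining depth below a node; leaves are plain ints."""
    return 0 if isinstance(node, int) else 1 + max(height(c) for c in node)


def reference(node):
    """Plain negamax without pruning, used only to check the result."""
    return node if isinstance(node, int) else max(-reference(c) for c in node)


def search(node, alpha, beta, pool=None):
    """Fail-soft negamax alpha-beta that splits YBW-style when given a pool."""
    if isinstance(node, int):
        return node  # leaf score from the side to move's point of view
    # Eldest brother: always searched serially, before any split is allowed.
    best = -search(node[0], -beta, -alpha, pool)
    alpha = max(alpha, best)
    younger = node[1:]
    if alpha >= beta or not younger:
        return best
    if pool is not None and height(node) >= MIN_SPLIT_DEPTH:
        # Younger brothers: searched in parallel with the window set by the
        # eldest. Workers are called without a pool, so they never split again.
        futures = [pool.submit(search, c, -beta, -alpha) for c in younger]
        return max([best] + [-f.result() for f in futures])
    for child in younger:
        best = max(best, -search(child, -beta, -alpha, pool))
        alpha = max(alpha, best)
        if alpha >= beta:
            break
    return best


tree = [[3, 5], [6, [9, 1]], [1, 2]]  # made-up toy position
with ThreadPoolExecutor(max_workers=4) as pool:
    value = search(tree, -INF, INF, pool)
print(value, reference(tree))
```

Only the thread that owns the pool ever submits work, which avoids workers blocking on each other; a real engine would let idle processors join any node where the first move has already been searched.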
I don't think this reduces forever, however. I ran a quick test on a pretty old dual quad-core intel box. I ran the same position using 8 cpus, with 15 secs, 30 secs, 1 min, 2 mins, 4 mins, 8 mins and 16 mins. Here are the NPS numbers:
Time     NPS
15 secs  23.5M
30 secs  24.0M
1 min    24.3M
2 mins   24.9M
4 mins   24.8M
8 mins   25.2M
16 mins  25.1M
I started a 32 minute run but after 25 minutes it was at 25.2M so it would seem that is about the peak for Crafty on this box. Crafty reported a total of 2% of the machine was lost to idle time, or 98% of the available compute power was spent doing everything but waiting on work.
BTW for the NPS scaling group, here's a 2 minute search, one cpu, same position:
time=2:00(100%) nodes=381257772(381.3M) fh1=93% pred=0 nps=3.2M
25.2 / 3.2 = 7.9x. There is not a lot left to gain. The fastest possible NPS with 0% wait time would be 8 * 3.2M = 25.6M, so it is getting pretty close to optimal for this box.
This is a dual Intel E5345 @ 2.3GHz, part of a cluster we have had for 6-7 years at least.
This is also Crafty 25.0 which certainly has some SMP improvements included...
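The arithmetic in this post can be reproduced with a few lines. A minimal sketch, using the rounded figures quoted above (3.2M NPS on one cpu, 25.2M at the peak on 8); the function and variable names are made up for illustration.

```python
def nps_speedup(nps_parallel: float, nps_single: float) -> float:
    """NPS speedup of the parallel run relative to the one-cpu run."""
    return nps_parallel / nps_single


def nps_efficiency(nps_parallel: float, nps_single: float, cores: int) -> float:
    """Share of the ideal cores * single-cpu NPS that was actually reached."""
    return nps_speedup(nps_parallel, nps_single) / cores


SINGLE_NPS = 3.2e6   # one-cpu run from the post
PEAK_NPS = 25.2e6    # peak 8-cpu figure from the post
CORES = 8

speedup = nps_speedup(PEAK_NPS, SINGLE_NPS)
efficiency = nps_efficiency(PEAK_NPS, SINGLE_NPS, CORES)
ceiling = CORES * SINGLE_NPS
print(f"speedup {speedup:.1f}x, efficiency {efficiency:.1%}, "
      f"ceiling {ceiling / 1e6:.1f}M NPS")
```

The efficiency comes out a little above 98%, which is roughly in line with the 2% idle time reported above, given that the inputs are rounded.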
- zullil
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Best Stockfish NPS scaling yet
Laskos wrote: These numbers are time (or depth) dependent. They increase with time. Could you run again with twice the time, or, if it's too lengthy, with half the time? To see the trend.
zullil wrote: Yes, NPS increases as a function of search time, though the rate of increase slows and the NPS value seems to approach a fairly stable asymptotic value. I think the numbers I reported (from 300 seconds per position for 37 positions) are quite close to the limits I'd get on my hardware. So testing again with 150 seconds per position might be more revealing than testing with 600 seconds per position. Will do this as soon as I am able.
Here are numbers for 150 sec. per position, with 8 threads and with 16 threads.
Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64
./stockfish bench 16384 16 150000 default time
===========================
Total time (ms) : 5550063
Nodes searched : 129857933271
Nodes/second : 23397560
./stockfish bench 16384 8 150000 default time
===========================
Total time (ms) : 5550010
Nodes searched : 79227345027
Nodes/second : 14275171
23397560/14275171 = 1.64
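For anyone who wants to check the figures, here is a small sketch that parses the three summary lines of each run and recomputes the NPS and the 16-vs-8 ratio. The parsing helper and its field names are assumptions made for illustration; only the numbers come from the output above.

```python
import re

BENCH_16 = """Total time (ms) : 5550063
Nodes searched : 129857933271
Nodes/second : 23397560"""

BENCH_8 = """Total time (ms) : 5550010
Nodes searched : 79227345027
Nodes/second : 14275171"""


def parse_bench(text: str) -> dict:
    """Pull the summary fields out of the bench output shown above."""
    fields = dict(re.findall(r"^(.+?)\s*:\s*(\d+)\s*$", text, flags=re.MULTILINE))
    return {
        "time_ms": int(fields["Total time (ms)"]),
        "nodes": int(fields["Nodes searched"]),
        "nps": int(fields["Nodes/second"]),
    }


def recomputed_nps(run: dict) -> int:
    """Nodes per second derived from nodes and elapsed milliseconds."""
    return run["nodes"] * 1000 // run["time_ms"]


run16 = parse_bench(BENCH_16)
run8 = parse_bench(BENCH_8)
ratio = run16["nps"] / run8["nps"]
print(recomputed_nps(run16), recomputed_nps(run8), round(ratio, 2))
```

The recomputed values agree with the reported Nodes/second lines, and the total time of about 5550 seconds matches 37 positions at 150 seconds each.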
- Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Best Stockfish NPS scaling yet
zullil wrote: Here are numbers for 150 sec. per position, with 8 threads and with 16 threads.
That's compared to 1.70 at 300 sec, right? I guess there is room for improvement beyond 300 sec. Thanks for the test.
- zullil
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Best Stockfish NPS scaling yet
Laskos wrote: That's compared to 1.70 at 300 sec, right? I guess there is room for improvement beyond 300 sec. Thanks for the test.
You're welcome. Yes, 1.70 at 300 sec. My guess is that at, say, 600 sec per position, the ratio would change very little from 1.70. I don't think the extra 300 secs would change the NPS numbers, not even for 16 threads. Or even if there was a small change, it would be indistinguishable from the normal variations caused by parallel searching.
- bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Best Stockfish NPS scaling yet
zullil wrote: Here are numbers for 150 sec. per position, with 8 threads and with 16 threads.
I dislike these numbers. It is FAR clearer to always compare N core speed to 1 core speed. A 1.64x gain on top of a bad 8-core number loses its significance when only the 1.64x is reported. In normal parallel search papers, nobody cares about 8 core vs 16 core speedup. Everyone measures and compares 8 core vs 1 core and 16 core vs 1 core.
This way, if you see 12.0x on 16 cores, you know that 1/4 of the hardware is wasted. With the 1.64, you have to work your way back to 1 cpu carefully to see whether that is good, bad or terrible.
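The point about working back to one cpu can be illustrated with a short sketch. The 8-core speedups tried below are hypothetical placeholders, since no 1-core Stockfish figure is given in this thread; only the 1.64 ratio and the 12.0x example come from the posts. The function names are made up for illustration.

```python
def chained_speedup(speedup_8_vs_1: float, ratio_16_vs_8: float) -> float:
    """16-core speedup over 1 core, given the 8-core speedup and the 16/8 ratio."""
    return speedup_8_vs_1 * ratio_16_vs_8


def wasted_fraction(speedup: float, cores: int) -> float:
    """Share of the hardware not converted into NPS."""
    return 1.0 - speedup / cores


print(f"12.0x on 16 cores wastes {wasted_fraction(12.0, 16):.0%}")

RATIO_16_VS_8 = 1.64  # measured ratio from the 150 sec run
for s8 in (7.5, 6.0, 4.0):  # hypothetical 8-core speedups, not measurements
    s16 = chained_speedup(s8, RATIO_16_VS_8)
    print(f"8-core {s8:.1f}x -> 16-core {s16:.2f}x, "
          f"wasted {wasted_fraction(s16, 16):.0%}")
```

The same 1.64 ratio corresponds to very different 16-core results depending on the 8-core baseline, which is why the ratio alone is hard to judge.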
- bob
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Best Stockfish NPS scaling yet
zullil wrote: Don't think the extra 300 secs would change the nps numbers, not even for 16 threads. Or even if there was a small change, it would be indistinguishable from normal variations caused by parallel searching.
I don't see a lot of NPS variation myself. I see a lot of time-to-depth variation, however. That's been true since I ran on a Univac dual CPU machine at the 1978 ACM tournament. With Cray Blitz, the NPS was rock-solid. But it had a better splitting algorithm (DTS). With Crafty, as I posted in another part of this thread last night, the NPS varies some, but pretty well settles in at the max. Going from 240 secs to 480 secs changes nothing at all.
- Laskos
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Best Stockfish NPS scaling yet
bob wrote: I don't see a lot of NPS variation myself. I see a lot of time-to-depth variation however.
NPS is measured much more faithfully on 37 positions: it has a small variance and follows a narrow normal distribution. TTD (time-to-depth) has a much higher variance and follows a wide beta distribution. 37 positions are not enough for TTD. Flukes, where one position takes forever and another reaches the same depth quickly, happen all the time. The same goes for one position on 2 different runs on several cores. In fact, you explained that in an earlier post.
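One simple way to compare the spread of the two metrics is the coefficient of variation (standard deviation divided by the mean) across positions. This is only a sketch: the sample values below are invented placeholders to show the calculation, not measurements from this thread, and the function name is made up for illustration.

```python
import statistics


def coefficient_of_variation(samples) -> float:
    """Sample standard deviation relative to the mean (unitless spread)."""
    return statistics.stdev(samples) / statistics.mean(samples)


# Invented placeholder values, purely to show the calculation.
nps_ratios = [1.62, 1.66, 1.63, 1.65, 1.64]  # hypothetical per-position NPS ratios
ttd_ratios = [0.9, 2.4, 1.3, 3.1, 1.1]       # hypothetical per-position TTD ratios

print(f"NPS spread: {coefficient_of_variation(nps_ratios):.3f}")
print(f"TTD spread: {coefficient_of_variation(ttd_ratios):.3f}")
```

With a wide spread, many more positions (or repeated runs) are needed before the average settles, which is the argument for preferring NPS on a 37-position set.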
- zullil
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Best Stockfish NPS scaling yet
bob wrote: I dislike these numbers. It is FAR clearer to always compare N core speed to 1 core speed.
My post was to report 8-to-16-core scaling data for Stockfish. It is generally accepted that Stockfish scales reasonably well to 8 cores, but underperforms upon transitioning from 8 to 16. Given the context, I posted (and tested) no more than needed.