What speedup does Crafty get using 16 threads? Better than 11.9?bob wrote: For 16 I get 12.3. This matches my books in parallel computing.
Best Stockfish NPS scaling yet
Moderators: hgm, Rebel, chrisw
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Best Stockfish NPS scaling yet
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: Best Stockfish NPS scaling yet
I used the formula from here:bob wrote:My math doesn't quite match yours above. With .02 of a program in parallel,Dann Corbit wrote:Don't forget Amdahl's Law:zullil wrote:Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!zullil wrote:Just tested Joona's nolocks (Retire global lock) patch (see http://tests.stockfishchess.org/tests/v ... 02160ebee8 )
Best NPS scaling from 8 to 16 threads I've seen yet:
Code: Select all
Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz Turbo Boost and Hyper-Threading disabled GNU/Linux 3.18.2-031802-generic x86_64 ./stockfish bench 16384 16 300000 default time =========================== Total time (ms) : 11100082 Nodes searched : 278067643640 Nodes/second : 25050954 ./stockfish bench 16384 8 300000 default time =========================== Total time (ms) : 11100004 Nodes searched : 163772870234 Nodes/second : 14754307 25050954/14754307 = 1.70
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9Code: Select all
./stockfish bench 16384 1 300000 default time =========================== Total time (ms) : 11100000 Nodes searched : 23435580318 Nodes/second : 2111313
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0
11.9 seems to suggest room for further improvement.
Code: Select all
#include <stdio.h> #include <stdlib.h> #include <string.h> char string[32767]; char *getsafe(char *buffer, int count) { char *result = buffer, *np; if ((buffer == NULL) || (count < 1)) result = NULL; else if (count == 1) *result = '\0'; else if ((result = fgets(buffer, count, stdin)) != NULL) if (np = strchr(buffer, '\n')) *np = '\0'; return result; } int main(void) { char *p; double f; unsigned n; puts("For fraction, 0 means 100% parallel. 1 means 100% serial."); puts("Fraction of program that is serial (0.0 - 1.00):"); p = getsafe(string, sizeof string); f = atof(p); for (n = 1; n <= 256; n++) printf("Speedup for %u threads is %14.12g\n", n, 1.0 / (f + (1.0 / n) * (1.0-f))); return 0; }
speedup = 1 / (0.02 + 0.98 / 8) = 7.01
Math is pretty simple. One cpu takes 1.0 time. N cpu takes 0.02 + 0.98 / 8 (assuming 8 cpus). 0.98 is done in parallel, 0.02 is done in serial.
Limit of this equation as # cpus goes to infinity is simply 1 . 0.02 which is 50x. I didn't try to figure out where your math is breaking down. Maybe it is just rounding or truncation error? But 7 looks right for 8 cpus, For 16 I get 12.3. This matches my books in parallel computing.
http://en.wikipedia.org/wiki/Amdahl%27s_law
My guess as to the limiting factor was wrong.
If I set N = UINT_MAX, I get this:
Speedup for 4294967295 threads is 49.9999994296
So it looks like 50 is the right answer.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Best Stockfish NPS scaling yet
On the 16 core box I run on occasionally, around 15x, if we are still talking just raw NPS. But I had another 16 core box where it started at around 11-12x and it took a lot of fiddling to get it up to 14x. Different machines act differently.zullil wrote:What speedup does Crafty get using 16 threads? Better than 11.9?bob wrote: For 16 I get 12.3. This matches my books in parallel computing.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Best Stockfish NPS scaling yet
Yes, but the other numbers didn't match either. i.e. 7 for 8 cpus, 12.3x for 16.Dann Corbit wrote:I used the formula from here:bob wrote:My math doesn't quite match yours above. With .02 of a program in parallel,Dann Corbit wrote:Don't forget Amdahl's Law:zullil wrote:Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!zullil wrote:Just tested Joona's nolocks (Retire global lock) patch (see http://tests.stockfishchess.org/tests/v ... 02160ebee8 )
Best NPS scaling from 8 to 16 threads I've seen yet:
Code: Select all
Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz Turbo Boost and Hyper-Threading disabled GNU/Linux 3.18.2-031802-generic x86_64 ./stockfish bench 16384 16 300000 default time =========================== Total time (ms) : 11100082 Nodes searched : 278067643640 Nodes/second : 25050954 ./stockfish bench 16384 8 300000 default time =========================== Total time (ms) : 11100004 Nodes searched : 163772870234 Nodes/second : 14754307 25050954/14754307 = 1.70
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9Code: Select all
./stockfish bench 16384 1 300000 default time =========================== Total time (ms) : 11100000 Nodes searched : 23435580318 Nodes/second : 2111313
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0
11.9 seems to suggest room for further improvement.
Code: Select all
#include <stdio.h> #include <stdlib.h> #include <string.h> char string[32767]; char *getsafe(char *buffer, int count) { char *result = buffer, *np; if ((buffer == NULL) || (count < 1)) result = NULL; else if (count == 1) *result = '\0'; else if ((result = fgets(buffer, count, stdin)) != NULL) if (np = strchr(buffer, '\n')) *np = '\0'; return result; } int main(void) { char *p; double f; unsigned n; puts("For fraction, 0 means 100% parallel. 1 means 100% serial."); puts("Fraction of program that is serial (0.0 - 1.00):"); p = getsafe(string, sizeof string); f = atof(p); for (n = 1; n <= 256; n++) printf("Speedup for %u threads is %14.12g\n", n, 1.0 / (f + (1.0 / n) * (1.0-f))); return 0; }
speedup = 1 / (0.02 + 0.98 / 8) = 7.01
Math is pretty simple. One cpu takes 1.0 time. N cpu takes 0.02 + 0.98 / 8 (assuming 8 cpus). 0.98 is done in parallel, 0.02 is done in serial.
Limit of this equation as # cpus goes to infinity is simply 1 . 0.02 which is 50x. I didn't try to figure out where your math is breaking down. Maybe it is just rounding or truncation error? But 7 looks right for 8 cpus, For 16 I get 12.3. This matches my books in parallel computing.
http://en.wikipedia.org/wiki/Amdahl%27s_law
My guess as to the limiting factor was wrong.
If I set N = UINT_MAX, I get this:
Speedup for 4294967295 threads is 49.9999994296
So it looks like 50 is the right answer.
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Best Stockfish NPS scaling yet
These numbers are time (or depth) dependent. They increase with time. Could you run again with twice the time, or, if it's too lengthy, with half the time? To see the trend.zullil wrote:Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!zullil wrote:Just tested Joona's nolocks (Retire global lock) patch (see http://tests.stockfishchess.org/tests/v ... 02160ebee8 )
Best NPS scaling from 8 to 16 threads I've seen yet:
Code: Select all
Dual Intel(R) Xeon(R) CPU E5-2650 v2 @ 2.60GHz Turbo Boost and Hyper-Threading disabled GNU/Linux 3.18.2-031802-generic x86_64 ./stockfish bench 16384 16 300000 default time =========================== Total time (ms) : 11100082 Nodes searched : 278067643640 Nodes/second : 25050954 ./stockfish bench 16384 8 300000 default time =========================== Total time (ms) : 11100004 Nodes searched : 163772870234 Nodes/second : 14754307 25050954/14754307 = 1.70
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9Code: Select all
./stockfish bench 16384 1 300000 default time =========================== Total time (ms) : 11100000 Nodes searched : 23435580318 Nodes/second : 2111313
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0
11.9 seems to suggest room for further improvement.
BTW The Amdahl's law is applied differently to SMP compared to what Dan does. See:
http://talkchess.com/forum/viewtopic.ph ... t&start=10
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Best Stockfish NPS scaling yet
Yes, NPS increases as a function of search time, though the rate of increase slows and the NPS value seems to approach a fairly stable asymptotic value.Laskos wrote: These numbers are time (or depth) dependent. They increase with time. Could you run again with twice the time, or, if it's too lengthy, with half the time? To see the trend.
I think the numbers I reported (from 300 seconds per position for 37 positions) are quite close to the limits I'd get on my hardware. So testing again with 150 seconds per position might be more revealing than testing with 600 seconds per position). Will do this as soon as I am able.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Best Stockfish NPS scaling yet
I don't think there will be much improvement beyond 300 seconds. Crafty continuously displays the NPS and by the 30 second mark it is right close to its peak. It might climb from 50M to 52M by the time it gets to 3 - 4 minutes, but it usually doesn't climb much beyond that point. Mainly needs to get out of the shallow searches where the split overhead dominates everything except for the idle time waiting on YBW on real shallow tree searches.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: Best Stockfish NPS scaling yet
There is some trickiness to measuring this stuff. I think TTD for each position is the right answer, but using a fixed "D" can be misleading. You might get one position that takes 300 seconds and gets a good speedup, and the next takes 10 seconds and does poorly.
I've generally burned a lot of CPU time to make things as accurate as I can. For example, with N processors (N = max available) I run everything to a fixed time limit of 5 minutes. For each position I take the time to give the PV for the last iteration as the time for N cpus. I then run with N/2 processors, using 4*N time, then again N/2 and 4*N, in an attempt to make sure that the N/2 run gets to the same depth. And even then weighting might be worthwhile. You don't want one large time position with a good (or bad) speedup dominating several small time positions.
Bottom line is that this is a hurry up and wait situation. My dissertation on the Cray was PAINFUL, because I used a real game (ICCA / ACM events were 40/2hr back then) as my 16 cpu base time. Doing 1 cpu searches to the same depth, when the actual speedup was in the 11-12x range (don't have paper handy with actual results) meant that some of those 1 cpu searches took forever.
I've generally burned a lot of CPU time to make things as accurate as I can. For example, with N processors (N = max available) I run everything to a fixed time limit of 5 minutes. For each position I take the time to give the PV for the last iteration as the time for N cpus. I then run with N/2 processors, using 4*N time, then again N/2 and 4*N, in an attempt to make sure that the N/2 run gets to the same depth. And even then weighting might be worthwhile. You don't want one large time position with a good (or bad) speedup dominating several small time positions.
Bottom line is that this is a hurry up and wait situation. My dissertation on the Cray was PAINFUL, because I used a real game (ICCA / ACM events were 40/2hr back then) as my 16 cpu base time. Doing 1 cpu searches to the same depth, when the actual speedup was in the 11-12x range (don't have paper handy with actual results) meant that some of those 1 cpu searches took forever.
-
- Posts: 6442
- Joined: Tue Jan 09, 2007 12:31 am
- Location: PA USA
- Full name: Louis Zulli
Re: Best Stockfish NPS scaling yet
Miguel's Eq. 1 is equivalent to Dann's formula; f = 1 / (1 + R). Both make sense to me---simple math.Laskos wrote: BTW The Amdahl's law is applied differently to SMP compared to what Dan does. See:
http://talkchess.com/forum/viewtopic.ph ... t&start=10
And in either case, the limit of speedup as cores --> ∞ is 1 + R ( = 1/f ).
-
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: Best Stockfish NPS scaling yet
For fixed n cores, R -> infinity as depth -> to infinity, and the speed-up to infinite depth is simply n, the number of cores.zullil wrote:Miguel's Eq. 1 is equivalent to Dann's formula; f = 1 / (1 + R). Both make sense to me---simple math.Laskos wrote: BTW The Amdahl's law is applied differently to SMP compared to what Dan does. See:
http://talkchess.com/forum/viewtopic.ph ... t&start=10
And in either case, the limit of speedup as cores --> ∞ is 1 + R ( = 1/f ).