Best Stockfish NPS scaling yet

zullil · Post by **zullil** » Thu Mar 05, 2015 2:22 am

bob wrote: For 16 I get 12.3. This matches my books in parallel computing.

What speedup does Crafty get using 16 threads? Better than 11.9?

Dann Corbit · Post by **Dann Corbit** » Thu Mar 05, 2015 3:25 am

bob wrote:
Dann Corbit wrote:
zullil wrote:
zullil wrote:Just tested Joona's nolocks (Retire global lock) patch (see http://tests.stockfishchess.org/tests/v ... 02160ebee8 )

Best NPS scaling from 8 to 16 threads I've seen yet:
Code: Select all
Dual Intel&#40;R&#41; Xeon&#40;R&#41; CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

./stockfish bench 16384 16 300000 default time 
===========================
Total time &#40;ms&#41; &#58; 11100082
Nodes searched  &#58; 278067643640
Nodes/second    &#58; 25050954

./stockfish bench 16384 8 300000 default time 
===========================
Total time &#40;ms&#41; &#58; 11100004
Nodes searched  &#58; 163772870234
Nodes/second    &#58; 14754307

25050954/14754307 = 1.70
Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!
Code: Select all
./stockfish bench 16384 1 300000 default time
===========================
Total time &#40;ms&#41; &#58; 11100000
Nodes searched  &#58; 23435580318
Nodes/second    &#58; 2111313
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0

11.9 seems to suggest room for further improvement.
Don't forget Amdahl's Law:
Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char string&#91;32767&#93;;
char *getsafe&#40;char *buffer, int count&#41;
&#123;
    char *result = buffer, *np;
    if (&#40;buffer == NULL&#41; || &#40;count < 1&#41;)
        result = NULL;
    else if &#40;count == 1&#41;
        *result = '\0';
    else if (&#40;result = fgets&#40;buffer, count, stdin&#41;) != NULL&#41;
        if &#40;np = strchr&#40;buffer, '\n'))
            *np = '\0';
    return result;
&#125;


int main&#40;void&#41;
&#123;
    char *p;
    double f;
    unsigned n;
    puts&#40;"For fraction, 0 means 100% parallel. 1 means 100% serial.");
    puts&#40;"Fraction of program that is serial &#40;0.0 - 1.00&#41;&#58;");
    
    p = getsafe&#40;string, sizeof string&#41;;
    f = atof&#40;p&#41;;
    for &#40;n = 1; n <= 256; n++)
    	printf&#40;"Speedup for %u threads is %14.12g\n", n, 1.0 / &#40;f + &#40;1.0 / n&#41; * &#40;1.0-f&#41;));
    
    return 0;
&#125;
My math doesn't quite match yours above. With .02 of a program in parallel,

speedup = 1 / (0.02 + 0.98 / 8) = 7.01

Math is pretty simple. One cpu takes 1.0 time. N cpu takes 0.02 + 0.98 / 8 (assuming 8 cpus). 0.98 is done in parallel, 0.02 is done in serial.

Limit of this equation as # cpus goes to infinity is simply 1 . 0.02 which is 50x. I didn't try to figure out where your math is breaking down. Maybe it is just rounding or truncation error? But 7 looks right for 8 cpus, For 16 I get 12.3. This matches my books in parallel computing.

I used the formula from here:
http://en.wikipedia.org/wiki/Amdahl%27s_law

My guess as to the limiting factor was wrong.

If I set N = UINT_MAX, I get this:
Speedup for 4294967295 threads is 49.9999994296

So it looks like 50 is the right answer.

bob · Post by **bob** » Thu Mar 05, 2015 6:41 am

zullil wrote:
bob wrote: For 16 I get 12.3. This matches my books in parallel computing.
What speedup does Crafty get using 16 threads? Better than 11.9?

On the 16 core box I run on occasionally, around 15x, if we are still talking just raw NPS. But I had another 16 core box where it started at around 11-12x and it took a lot of fiddling to get it up to 14x. Different machines act differently.

bob · Post by **bob** » Thu Mar 05, 2015 6:42 am

Dann Corbit wrote:
bob wrote:
Dann Corbit wrote:
zullil wrote:
zullil wrote:Just tested Joona's nolocks (Retire global lock) patch (see http://tests.stockfishchess.org/tests/v ... 02160ebee8 )

Best NPS scaling from 8 to 16 threads I've seen yet:
Code: Select all
Dual Intel&#40;R&#41; Xeon&#40;R&#41; CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

./stockfish bench 16384 16 300000 default time 
===========================
Total time &#40;ms&#41; &#58; 11100082
Nodes searched  &#58; 278067643640
Nodes/second    &#58; 25050954

./stockfish bench 16384 8 300000 default time 
===========================
Total time &#40;ms&#41; &#58; 11100004
Nodes searched  &#58; 163772870234
Nodes/second    &#58; 14754307

25050954/14754307 = 1.70
Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!
Code: Select all
./stockfish bench 16384 1 300000 default time
===========================
Total time &#40;ms&#41; &#58; 11100000
Nodes searched  &#58; 23435580318
Nodes/second    &#58; 2111313
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0

11.9 seems to suggest room for further improvement.
Don't forget Amdahl's Law:
Code: Select all
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

char string&#91;32767&#93;;
char *getsafe&#40;char *buffer, int count&#41;
&#123;
    char *result = buffer, *np;
    if (&#40;buffer == NULL&#41; || &#40;count < 1&#41;)
        result = NULL;
    else if &#40;count == 1&#41;
        *result = '\0';
    else if (&#40;result = fgets&#40;buffer, count, stdin&#41;) != NULL&#41;
        if &#40;np = strchr&#40;buffer, '\n'))
            *np = '\0';
    return result;
&#125;


int main&#40;void&#41;
&#123;
    char *p;
    double f;
    unsigned n;
    puts&#40;"For fraction, 0 means 100% parallel. 1 means 100% serial.");
    puts&#40;"Fraction of program that is serial &#40;0.0 - 1.00&#41;&#58;");
    
    p = getsafe&#40;string, sizeof string&#41;;
    f = atof&#40;p&#41;;
    for &#40;n = 1; n <= 256; n++)
    	printf&#40;"Speedup for %u threads is %14.12g\n", n, 1.0 / &#40;f + &#40;1.0 / n&#41; * &#40;1.0-f&#41;));
    
    return 0;
&#125;
My math doesn't quite match yours above. With .02 of a program in parallel,

speedup = 1 / (0.02 + 0.98 / 8) = 7.01

Math is pretty simple. One cpu takes 1.0 time. N cpu takes 0.02 + 0.98 / 8 (assuming 8 cpus). 0.98 is done in parallel, 0.02 is done in serial.

Limit of this equation as # cpus goes to infinity is simply 1 . 0.02 which is 50x. I didn't try to figure out where your math is breaking down. Maybe it is just rounding or truncation error? But 7 looks right for 8 cpus, For 16 I get 12.3. This matches my books in parallel computing.
I used the formula from here:
http://en.wikipedia.org/wiki/Amdahl%27s_law

My guess as to the limiting factor was wrong.

If I set N = UINT_MAX, I get this:
Speedup for 4294967295 threads is 49.9999994296

So it looks like 50 is the right answer.

Yes, but the other numbers didn't match either. i.e. 7 for 8 cpus, 12.3x for 16.

Laskos · Post by **Laskos** » Thu Mar 05, 2015 4:20 pm

zullil wrote:
zullil wrote:Just tested Joona's nolocks (Retire global lock) patch (see http://tests.stockfishchess.org/tests/v ... 02160ebee8 )

Best NPS scaling from 8 to 16 threads I've seen yet:
Code: Select all
Dual Intel&#40;R&#41; Xeon&#40;R&#41; CPU E5-2650 v2 @ 2.60GHz
Turbo Boost and Hyper-Threading disabled
GNU/Linux 3.18.2-031802-generic x86_64

./stockfish bench 16384 16 300000 default time 
===========================
Total time &#40;ms&#41; &#58; 11100082
Nodes searched  &#58; 278067643640
Nodes/second    &#58; 25050954

./stockfish bench 16384 8 300000 default time 
===========================
Total time &#40;ms&#41; &#58; 11100004
Nodes searched  &#58; 163772870234
Nodes/second    &#58; 14754307

25050954/14754307 = 1.70
Here's a "baseline" 1-thread number for Joona's nolocks version, which resides in his repository and has not been committed yet!
Code: Select all
./stockfish bench 16384 1 300000 default time
===========================
Total time &#40;ms&#41; &#58; 11100000
Nodes searched  &#58; 23435580318
Nodes/second    &#58; 2111313
NPS scaling 1 to 16 threads is 25050954/2111313 = 11.9
NPS scaling 1 to 8 threads is 14754307/2111313 = 7.0

11.9 seems to suggest room for further improvement.

These numbers are time (or depth) dependent. They increase with time. Could you run again with twice the time, or, if it's too lengthy, with half the time? To see the trend.

BTW The Amdahl's law is applied differently to SMP compared to what Dan does. See:
http://talkchess.com/forum/viewtopic.ph ... t&start=10

zullil · Post by **zullil** » Thu Mar 05, 2015 4:43 pm

Laskos wrote: These numbers are time (or depth) dependent. They increase with time. Could you run again with twice the time, or, if it's too lengthy, with half the time? To see the trend.

Yes, NPS increases as a function of search time, though the rate of increase slows and the NPS value seems to approach a fairly stable asymptotic value.

I think the numbers I reported (from 300 seconds per position for 37 positions) are quite close to the limits I'd get on my hardware. So testing again with 150 seconds per position might be more revealing than testing with 600 seconds per position). Will do this as soon as I am able.

bob · Post by **bob** » Thu Mar 05, 2015 4:50 pm

I don't think there will be much improvement beyond 300 seconds. Crafty continuously displays the NPS and by the 30 second mark it is right close to its peak. It might climb from 50M to 52M by the time it gets to 3 - 4 minutes, but it usually doesn't climb much beyond that point. Mainly needs to get out of the shallow searches where the split overhead dominates everything except for the idle time waiting on YBW on real shallow tree searches.

bob · Post by **bob** » Thu Mar 05, 2015 5:13 pm

There is some trickiness to measuring this stuff. I think TTD for each position is the right answer, but using a fixed "D" can be misleading. You might get one position that takes 300 seconds and gets a good speedup, and the next takes 10 seconds and does poorly.

I've generally burned a lot of CPU time to make things as accurate as I can. For example, with N processors (N = max available) I run everything to a fixed time limit of 5 minutes. For each position I take the time to give the PV for the last iteration as the time for N cpus. I then run with N/2 processors, using 4*N time, then again N/2 and 4*N, in an attempt to make sure that the N/2 run gets to the same depth. And even then weighting might be worthwhile. You don't want one large time position with a good (or bad) speedup dominating several small time positions.

Bottom line is that this is a hurry up and wait situation. My dissertation on the Cray was PAINFUL, because I used a real game (ICCA / ACM events were 40/2hr back then) as my 16 cpu base time. Doing 1 cpu searches to the same depth, when the actual speedup was in the 11-12x range (don't have paper handy with actual results) meant that some of those 1 cpu searches took forever.

zullil · Post by **zullil** » Thu Mar 05, 2015 6:07 pm

Laskos wrote: BTW The Amdahl's law is applied differently to SMP compared to what Dan does. See:
http://talkchess.com/forum/viewtopic.ph ... t&start=10

Miguel's Eq. 1 is equivalent to Dann's formula; f = 1 / (1 + R). Both make sense to me---simple math.

And in either case, the limit of speedup as cores --> ∞ is 1 + R ( = 1/f ).

Laskos · Post by **Laskos** » Thu Mar 05, 2015 6:28 pm

zullil wrote:
Laskos wrote: BTW The Amdahl's law is applied differently to SMP compared to what Dan does. See:
http://talkchess.com/forum/viewtopic.ph ... t&start=10
Miguel's Eq. 1 is equivalent to Dann's formula; f = 1 / (1 + R). Both make sense to me---simple math.

And in either case, the limit of speedup as cores --> ∞ is 1 + R ( = 1/f ).

For fixed n cores, R -> infinity as depth -> to infinity, and the speed-up to infinite depth is simply n, the number of cores.

Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet

Re: Best Stockfish NPS scaling yet