Stockfish still scales poorly?

zullil · Post by **zullil** » Fri Feb 20, 2015 11:38 am

Using the latest development version of SF on a 2x8 core linux workstation with 32 GB RAM. Turbo boost disabled.

Standard SF bench of 37 positions, each searched for 5 minutes:

Code:

./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45...

Seems like this still suggests poor scaling from 8 to 16 cores. I realize this is a small amount of testing. Suggestions, criticisms, explanations welcome.
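For reference, the ratio can also be read as a parallel-efficiency figure: the measured NPS speedup divided by the ideal speedup for the added threads. A minimal Python sketch using the two Nodes/second values from the 5-minute bench runs; the helper name `nps_scaling` is made up for illustration and is not part of Stockfish:

```python
def nps_scaling(nps_many, nps_few, threads_many, threads_few):
    """Return (speedup, efficiency) when going from threads_few to threads_many."""
    speedup = nps_many / nps_few          # measured NPS ratio
    ideal = threads_many / threads_few    # ratio if NPS scaled linearly with threads
    return speedup, speedup / ideal

# Nodes/second values copied from the 16-thread and 8-thread runs.
speedup, efficiency = nps_scaling(21030196, 14474279, 16, 8)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
# prints: speedup 1.45x, efficiency 73%
```

So doubling the thread count from 8 to 16 yields roughly 73% of the ideal 2x gain in NPS for this pair of runs.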

lucasart · Post by **lucasart** » Fri Feb 20, 2015 12:19 pm

As I explained already, NPS is the wrong measure. Searching more nodes is not a goal in itself. Winning more games is. It's Elo we care about. Nothing else.

In all likelihood, what happens is that SF does not search faster (in NPS), but the nodes it calculates are less often wasted.

Besides, SMP is non-deterministic, so you can't even conclude from a single bench run anything about NPS scaling.
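One way to account for that run-to-run variation is to repeat the bench several times and look at the spread rather than a single number. A rough Python sketch, assuming the "Nodes/second" line format shown in this thread; `parse_nps` and `summarize` are hypothetical helper names, and the sample strings are placeholders, not measurements:

```python
import re
import statistics

def parse_nps(bench_output):
    """Extract the integer after 'Nodes/second' from one bench run's output."""
    match = re.search(r"Nodes/second\s*:\s*(\d+)", bench_output)
    if match is None:
        raise ValueError("no Nodes/second line found")
    return int(match.group(1))

def summarize(nps_values):
    """Mean and sample standard deviation over repeated runs."""
    mean = statistics.mean(nps_values)
    stdev = statistics.stdev(nps_values) if len(nps_values) > 1 else 0.0
    return mean, stdev

# Placeholder outputs standing in for repeated runs (NOT real measurements).
runs = [
    "Nodes/second    : 1000000",
    "Nodes/second    : 1100000",
    "Nodes/second    : 900000",
]
mean, stdev = summarize([parse_nps(r) for r in runs])
print(f"mean {mean:.0f} nps, stdev {stdev:.0f} nps")
```

If the difference between two configurations is small compared with the standard deviation, a single pair of runs says little about scaling.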

Joerg Oster · Post by **Joerg Oster** » Fri Feb 20, 2015 12:23 pm

zullil wrote: Using the latest development version of SF on a 2x8 core linux workstation with 32 GB RAM. Turbo boost disabled.

Standard SF bench of 37 positions, each searched for 5 minutes:
Code:
./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45...
Seems like this still suggests poor scaling from 8 to 16 cores. Realize this is a small amount of testing. Suggestions, criticisms, explanations welcome.

Interesting. But how does this compare to the version before Joona's patch?
Is this an improvement, even worse or similar to before?

Btw, I think searching 1 min per position would be sufficient, too.
I don't think nps will significantly change after that time.

zullil · Post by **zullil** » Fri Feb 20, 2015 3:10 pm

Joerg Oster wrote:
Interesting. But how does this compare to the version before Joona's patch?
Is this an improvement, even worse or similar to before?

Btw, I think searching 1 min per position would be sufficient, too.
I don't think nps will significantly change after that time.

Here are data for Stockfish-f8f5dcbb682830a66a37f68f3c192bbbfc84a33a, from just before Joona's patch. This time with 1 minute per position. Will now retest the latest SF, at 1 minute per position.

Code:

Stockfish-f8f5dcbb682830a66a37f68f3c192bbbfc84a33a

./stockfish bench 16384 16 60000 default time
===========================
Total time (ms) : 2220002
Nodes searched  : 35534000017
Nodes/second    : 16006291

./stockfish bench 16384 8 60000 default time
===========================
Total time (ms) : 2220002
Nodes searched  : 29143756112
Nodes/second    : 13127806

16006291/13127806 = 1.22

zullil · Post by **zullil** » Fri Feb 20, 2015 4:52 pm

Here are data for a post-Joona-patch SF:

Code:

./stockfish bench 16384 16 60000 default time
===========================
Total time (ms) : 2220015
Nodes searched  : 35535049715
Nodes/second    : 16006670

./stockfish bench 16384 8 60000 default time
===========================
Total time (ms) : 2220001
Nodes searched  : 29611602154
Nodes/second    : 13338553

16006670/13338553 = 1.20

PS---Seems that 1 min per position is a bit too short to allow the 16 thread NPS to reach its potential.

Joerg Oster · Post by **Joerg Oster** » Fri Feb 20, 2015 5:19 pm

zullil wrote:
PS---Seems that 1 min per position is a bit too short to allow the 16 thread NPS to reach its potential.

Really? 1 minute is not enough to keep all threads busy? Wow.

fastgm · Post by **fastgm** » Fri Feb 20, 2015 5:40 pm

AMD Opteron 6376 @ 2.3 GHz

Code:

stockfish_15012721_x64_modern = Stockfish 6
stockfish_15021923_x64_modern = Current master (Clarify we don't late join with only 2 threads)

./stockfish_15021923_x64_modern bench 4096 8 60000 default time

===========================
Total time (ms) : 2220075
Nodes searched  : 15170515954
Nodes/second    : 6833334

./stockfish_15021923_x64_modern bench 4096 16 60000 default time

===========================
Total time (ms) : 2220090
Nodes searched  : 17551656395
Nodes/second    : 7905831

./stockfish_15012721_x64_modern bench 4096 8 60000 default time

===========================
Total time (ms) : 2220063
Nodes searched  : 15344047411
Nodes/second    : 6911536

./stockfish_15012721_x64_modern bench 4096 16 60000 default time

===========================
Total time (ms) : 2220085
Nodes searched  : 18711073535
Nodes/second    : 8428088

./stockfish_15021923_x64_modern bench 4096 16 25 default depth

===========================
Total time (ms) : 247253
Nodes searched  : 1168697899
Nodes/second    : 4726728

./stockfish_15012721_x64_modern bench 4096 16 25 default depth

===========================
Total time (ms) : 273054
Nodes searched  : 1311022615
Nodes/second    : 4801330

./stockfish_15021923_x64_modern bench 4096 16 30 default depth

===========================
Total time (ms) : 1114996
Nodes searched  : 8208679590
Nodes/second    : 7362070

./stockfish_15012721_x64_modern bench 4096 16 30 default depth

===========================
Total time (ms) : 1348255
Nodes searched  : 10800407574
Nodes/second    : 8010656

bob · Post by **bob** » Fri Feb 20, 2015 5:44 pm

zullil wrote: Using the latest development version of SF on a 2x8 core linux workstation with 32 GB RAM. Turbo boost disabled.

Standard SF bench of 37 positions, each searched for 5 minutes:
Code:
./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45...
Seems like this still suggests poor scaling from 8 to 16 cores. Realize this is a small amount of testing. Suggestions, criticisms, explanations welcome.

What hardware?

One issue with AMD is that most AMD BIOSes give you two choices on memory setup: (a) NUMA, (b) SMP.

NUMA is the traditional approach: if you have two chips, as you do, with say 16 GB of DRAM, chip 0 will have addresses 0-8 GB and chip 1 will have addresses 8-16 GB.

SMP interleaves pages between the two chips, so that chip 0 gets addresses 0-4K, chip 1 gets 4K-8K, chip 0 gets 8K-12K, etc. If a program understands NUMA, using the SMP setting will break it badly. If it doesn't understand NUMA, the SMP setting will help avoid memory hotspots but does introduce delays.

If it is Intel, I don't believe they have done this, at least not on any machines I have run on, and I have used a bunch of 'em over time.
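To make the two layouts concrete, here is a simplified Python model of which chip owns a given physical address under each setting. It assumes exactly the numbers used above (two chips, 16 GB total, 4K pages); the function names are invented for illustration, and real firmware mappings can be more involved:

```python
PAGE_SIZE = 4 * 1024   # 4K pages, as in the description above
GiB = 1024 ** 3

def node_numa(addr, total_bytes, nodes=2):
    """Contiguous split: each node owns one equal slice of the address space."""
    return addr // (total_bytes // nodes)

def node_interleaved(addr, nodes=2):
    """Page interleaving: consecutive pages alternate between nodes."""
    return (addr // PAGE_SIZE) % nodes

total = 16 * GiB
for addr in (0, PAGE_SIZE, 2 * PAGE_SIZE, 8 * GiB):
    # Columns: address, owning chip in NUMA mode, owning chip in interleaved mode
    print(hex(addr), node_numa(addr, total), node_interleaved(addr))
```

In the contiguous layout a program can keep a thread's data on the chip that runs it; in the interleaved layout roughly every other page of any large buffer sits on the remote chip regardless of what the program does.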

bob · Post by **bob** » Fri Feb 20, 2015 5:46 pm

lucasart wrote: As I explained already, NPS is the wrong measure. Searching more nodes is not a goal in itself. Winning more games is. It's Elo we care about. Nothing else.

In all likelihood, what happens is that SF does not search faster (in NPS), but the nodes it calculates are less often wasted.

Besides, SMP is non-deterministic, so you can't even conclude from a single bench run anything about NPS scaling.

You are wrong, as several have already told you. NPS scaling tells you what percentage of the hardware you are able to use. Your SMP speedup is bounded by the NPS speedup. If you search 1M nodes per second on one CPU, and only 8M on 16 cores, you are wasting 1/2 of the hardware and you will NEVER get an SMP speedup > 8x. And in reality it will be less.

NPS is just an upper bound, but it is an important number, because it gives you that upper bound that you can never exceed. But just because a program gets a 15x-16x NPS speedup does not mean it searches 15x-16x faster. SMP overhead is still there.
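The arithmetic behind that bound, as a small Python sketch using the 1M and 8M figures from the example above; the function name is made up for illustration:

```python
def smp_speedup_bound(nps_one_thread, nps_n_threads, threads):
    """Return (upper bound on SMP speedup, fraction of the hardware in use)."""
    bound = nps_n_threads / nps_one_thread   # NPS speedup caps the search speedup
    utilization = bound / threads            # share of the ideal linear scaling
    return bound, utilization

bound, utilization = smp_speedup_bound(1_000_000, 8_000_000, 16)
print(f"speedup can never exceed {bound:.0f}x; hardware utilization {utilization:.0%}")
# prints: speedup can never exceed 8x; hardware utilization 50%
```

The real time-to-depth speedup sits below this bound, since search overhead is not captured by NPS at all.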

Try to understand a topic before making absolute statements.

zullil · Post by **zullil** » Fri Feb 20, 2015 6:18 pm

bob wrote:
What hardware?

One issue with AMD is that most AMD BIOSes give you two choices on memory setup: (a) NUMA, (b) SMP.

NUMA is the traditional approach: if you have two chips, as you do, with say 16 GB of DRAM, chip 0 will have addresses 0-8 GB and chip 1 will have addresses 8-16 GB.

SMP interleaves pages between the two chips, so that chip 0 gets addresses 0-4K, chip 1 gets 4K-8K, chip 0 gets 8K-12K, etc. If a program understands NUMA, using the SMP setting will break it badly. If it doesn't understand NUMA, the SMP setting will help avoid memory hotspots but does introduce delays.

If it is Intel, I don't believe they have done this, at least not on any machines I have run on, and I have used a bunch of 'em over time.

Dell T5610 dual Xeon with NUMA set in BIOS

Code:

louis@LZsT5610:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15991 MB
node 0 free: 5994 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16125 MB
node 1 free: 8474 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
louis@LZsT5610:~$ numastat stockfish
                           node0           node1
numa_hit               368078713       350129592
numa_miss                     74          287032
numa_foreign              287032              74
interleave_hit             17534           18437
local_node             365337202       350110658
other_node               2741585          305966
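For a quick read of those counters, the share of allocations on each node that were originally intended for the other node can be computed as numa_miss / (numa_hit + numa_miss). A small Python sketch with the values copied from the numastat output; the function name is made up, and this ratio is just one rough way to look at the data:

```python
def miss_fraction(numa_hit, numa_miss):
    """Fraction of a node's allocations that were intended for a different node."""
    return numa_miss / (numa_hit + numa_miss)

# (numa_hit, numa_miss) per node, copied from the numastat output.
counters = {"node0": (368078713, 74), "node1": (350129592, 287032)}
for node, (hit, miss) in counters.items():
    print(f"{node}: miss fraction {miss_fraction(hit, miss):.6%}")
```

Both fractions come out well under 0.1%, so by these counters almost all allocations landed on the node they were meant for.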
