Stockfish still scales poorly?


zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Stockfish still scales poorly?

Post by zullil »

Using the latest development version of SF on a 2x8 core Linux workstation with 32 GB RAM. Turbo boost disabled.

Standard SF bench of 37 positions, each searched for 5 minutes:

Code: Select all

./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196
 

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45...
This still seems to suggest poor scaling from 8 to 16 cores. I realize this is a small amount of testing ;). Suggestions, criticisms, and explanations are welcome.
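The ratio above can be reproduced straight from the bench output. The following is a minimal sketch, not part of the original post: the helper name `parse_nps` is made up for illustration, and the "ideal" factor of 2 simply assumes that doubling the threads could at best double NPS.

```python
import re

def parse_nps(bench_output: str) -> int:
    """Extract the 'Nodes/second' figure from a Stockfish bench summary."""
    match = re.search(r"Nodes/second\s*:\s*(\d+)", bench_output)
    if match is None:
        raise ValueError("no 'Nodes/second' line found")
    return int(match.group(1))

# Summaries copied from the two runs above (16 threads, then 8 threads).
run_16 = """Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196"""

run_8 = """Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279"""

nps_16 = parse_nps(run_16)
nps_8 = parse_nps(run_8)

ratio = nps_16 / nps_8   # measured NPS scaling from 8 to 16 threads
ideal = 16 / 8           # assumption: perfect scaling doubles NPS
print(f"NPS ratio 16/8 threads: {ratio:.3f}")
print(f"fraction of ideal scaling: {ratio / ideal:.1%}")
```

This is a single pair of runs, so the figures say nothing about run-to-run variance.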
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Stockfish still scales poorly?

Post by lucasart »

As I explained already, NPS is the wrong measure. Searching more nodes is not a goal in itself. Winning more games is. It's Elo we care about. Nothing else.

In all likelihood, what happens is that SF does not search faster (in NPS), but the nodes it calculates are less often wasted.

Besides, SMP search is nondeterministic, so you can't conclude anything about NPS scaling from a single bench run.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: Stockfish still scales poorly?

Post by Joerg Oster »

zullil wrote:
Using the latest development version of SF on a 2x8 core Linux workstation with 32 GB RAM. Turbo boost disabled.

Standard SF bench of 37 positions, each searched for 5 minutes:

Code: Select all

./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196
 

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45...
Seems like this still suggests poor scaling from 8 to 16 cores. Realize this is a small amount of testing :wink:. Suggestions, criticisms, explanations welcome.
Interesting. But how does this compare to the version before Joona's patch?
Is this an improvement, worse, or about the same as before?

Btw, I think searching 1 min per position would be sufficient, too.
I don't think NPS will change significantly after that time.
Jörg Oster
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish still scales poorly?

Post by zullil »

Joerg Oster wrote:
Interesting. But how does this compare to the version before Joona's patch?
Is this an improvement, even worse or similar to before?

Btw, I think searching 1 min per position would be sufficient, too.
I don't think nps will significantly change after that time.
Here are data for Stockfish-f8f5dcbb682830a66a37f68f3c192bbbfc84a33a, from just before Joona's patch. This time with 1 minute per position. Will now retest the latest SF, at 1 minute per position.

Code: Select all

Stockfish-f8f5dcbb682830a66a37f68f3c192bbbfc84a33a

./stockfish bench 16384 16 60000 default time 
===========================
Total time (ms) : 2220002
Nodes searched  : 35534000017
Nodes/second    : 16006291

./stockfish bench 16384 8 60000 default time 
===========================
Total time (ms) : 2220002
Nodes searched  : 29143756112
Nodes/second    : 13127806

16006291/13127806 = 1.22
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish still scales poorly?

Post by zullil »

Here are data for a post-Joona-patch SF:

Code: Select all

./stockfish bench 16384 16 60000 default time
===========================
Total time (ms) : 2220015
Nodes searched  : 35535049715
Nodes/second    : 16006670

./stockfish bench 16384 8 60000 default time
===========================
Total time (ms) : 2220001
Nodes searched  : 29611602154
Nodes/second    : 13338553

16006670/13338553 = 1.20
PS---Seems that 1 min per position is a bit too short to allow the 16 thread NPS to reach its potential.
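To put a number on that observation, the sketch below compares the 1-minute and 5-minute NPS figures reported in this thread. It assumes the 5-minute run ("latest development version") and the 1-minute post-patch run used comparable builds, which the posts do not confirm.

```python
# NPS figures copied from the posts above, keyed by (threads, seconds per position).
nps = {
    (16, 300): 21030196,
    (8, 300): 14474279,
    (16, 60): 16006670,
    (8, 60): 13338553,
}

# How much higher is NPS at 5 minutes per position than at 1 minute?
gain = {t: nps[(t, 300)] / nps[(t, 60)] for t in (8, 16)}

for threads, g in gain.items():
    print(f"{threads:2d} threads: 5 min NPS is {g:.3f}x the 1 min NPS")
```

On these numbers the 16-thread run gains noticeably more from the longer search than the 8-thread run does, which is consistent with the PS above, though single runs cannot rule out noise.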
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: Stockfish still scales poorly?

Post by Joerg Oster »

zullil wrote:
PS---Seems that 1 min per position is a bit too short to allow the 16 thread NPS to reach its potential.
Really? 1 minute is not enough to keep all threads busy? Wow.
Jörg Oster
fastgm
Posts: 818
Joined: Mon Aug 19, 2013 6:57 pm

Re: Stockfish still scales poorly?

Post by fastgm »

AMD Opteron 6376 @ 2.3 GHz

Code: Select all

stockfish_15012721_x64_modern = Stockfish 6
stockfish_15021923_x64_modern = Current master (Clarify we don't late join with only 2 threads)

./stockfish_15021923_x64_modern bench 4096 8 60000 default time

===========================
Total time (ms) : 2220075
Nodes searched  : 15170515954
Nodes/second    : 6833334

./stockfish_15021923_x64_modern bench 4096 16 60000 default time
 
===========================
Total time (ms) : 2220090
Nodes searched  : 17551656395
Nodes/second    : 7905831

./stockfish_15012721_x64_modern bench 4096 8 60000 default time

===========================
Total time (ms) : 2220063
Nodes searched  : 15344047411
Nodes/second    : 6911536

./stockfish_15012721_x64_modern bench 4096 16 60000 default time

===========================
Total time (ms) : 2220085
Nodes searched  : 18711073535
Nodes/second    : 8428088

./stockfish_15021923_x64_modern bench 4096 16 25 default depth

===========================
Total time (ms) : 247253
Nodes searched  : 1168697899
Nodes/second    : 4726728

./stockfish_15012721_x64_modern bench 4096 16 25 default depth

===========================
Total time (ms) : 273054
Nodes searched  : 1311022615
Nodes/second    : 4801330

./stockfish_15021923_x64_modern bench 4096 16 30 default depth

===========================
Total time (ms) : 1114996
Nodes searched  : 8208679590
Nodes/second    : 7362070

./stockfish_15012721_x64_modern bench 4096 16 30 default depth

===========================
Total time (ms) : 1348255
Nodes searched  : 10800407574
Nodes/second    : 8010656
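The Opteron figures above were posted without commentary. As a rough aid to reading them, this sketch computes the 16-vs-8 thread NPS ratio for each binary and the time-to-depth ratio between the two binaries. The keys `master` and `sf6` are shorthand introduced here for stockfish_15021923 and stockfish_15012721.

```python
# Nodes/second from the fixed-time runs above, keyed by (binary, threads).
nps = {
    ("master", 8): 6833334,
    ("master", 16): 7905831,
    ("sf6", 8): 6911536,
    ("sf6", 16): 8428088,
}

# NPS scaling from 8 to 16 threads for each binary.
nps_ratio = {b: nps[(b, 16)] / nps[(b, 8)] for b in ("master", "sf6")}

# Total time (ms) from the fixed-depth runs with 16 threads.
time_to_depth_ms = {
    25: {"master": 247253, "sf6": 273054},
    30: {"master": 1114996, "sf6": 1348255},
}

# Values above 1.0 mean master finished the fixed-depth bench sooner than SF 6.
ttd_ratio = {d: t["sf6"] / t["master"] for d, t in time_to_depth_ms.items()}

for binary, r in nps_ratio.items():
    print(f"{binary}: NPS ratio 16/8 threads = {r:.3f}")
for depth, r in ttd_ratio.items():
    print(f"depth {depth}: SF 6 time / master time = {r:.3f}")
```

In these single runs master shows the lower NPS but reaches both depths sooner. Since SMP search is nondeterministic, repeated runs would be needed before drawing conclusions.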
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

zullil wrote:
Using the latest development version of SF on a 2x8 core Linux workstation with 32 GB RAM. Turbo boost disabled.

Standard SF bench of 37 positions, each searched for 5 minutes:

Code: Select all

./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196
 

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45...
Seems like this still suggests poor scaling from 8 to 16 cores. Realize this is a small amount of testing :wink:. Suggestions, criticisms, explanations welcome.
What hardware?

One issue with AMD is that most AMD BIOSes give you two choices for memory setup: (a) NUMA or (b) SMP.

NUMA is the traditional NUMA approach. If you have two chips, as you do, with say 16 GB of DRAM, chip 0 will have addresses 0-8 GB and chip 1 will have addresses 8-16 GB.

SMP interleaves pages between the two chips, so that chip 0 gets addresses 0-4K, chip 1 gets 4K-8K, chip 0 gets 8K-12K, etc. If a program understands NUMA, using the SMP setting will break it badly. If it doesn't understand NUMA, the SMP setting will help avoid memory hotspots but does introduce delays.

If it is Intel, I don't believe they have done this, at least not on any machines I have run on, and I have used a bunch of 'em over time.
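The two layouts described above can be modelled in a few lines. This is a simplified sketch of the description in this post, not of any real BIOS or memory controller; the function names and the two-node, 4 KiB-page defaults are assumptions taken from the example figures.

```python
PAGE = 4 * 1024   # 4 KiB interleave granularity, as in the example above
GIB = 1024 ** 3

def node_numa(addr: int, total_bytes: int, nodes: int = 2) -> int:
    """NUMA setting: each node owns one contiguous slice of the address space."""
    return addr // (total_bytes // nodes)

def node_interleaved(addr: int, nodes: int = 2, page: int = PAGE) -> int:
    """SMP setting: consecutive pages alternate between the nodes."""
    return (addr // page) % nodes

total = 16 * GIB
for addr in (0, PAGE, 2 * PAGE, 8 * GIB):
    print(f"addr {addr:>12}: NUMA -> chip {node_numa(addr, total)}, "
          f"interleaved -> chip {node_interleaved(addr)}")
```

Under the contiguous layout a NUMA-aware program can keep a thread's data on its own chip; under interleaving, roughly every other page of any buffer is remote regardless of where the thread runs.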
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

lucasart wrote:
As I explained already, NPS is the wrong measure. Searching more nodes is not a goal in itself. Winning more games is. It's Elo we care about. Nothing else.

In all likelihood, what happens is that SF does not search faster (in NPS), but the nodes it calculates are less often wasted.

Besides, SMP search is nondeterministic, so you can't conclude anything about NPS scaling from a single bench run.
You are wrong, as several have already told you. NPS scaling tells you what percentage of the hardware you are able to use. Your SMP speedup is bounded by the NPS speedup. If you search 1M nodes per second on one CPU, and only 8M on 16 cores, you are wasting 1/2 of the hardware and you will NEVER get an SMP speedup > 8x. And in reality it will be less.

NPS is just an upper bound, but it is an important number, because it gives you that upper bound that you can never exceed. But just because a program gets a 15x-16x NPS speedup does not mean it searches 15x-16x faster. SMP overhead is still there.

Try to understand a topic before making absolute statements.
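The arithmetic behind that upper bound, using the example figures from this post (1M NPS on one core, 8M NPS on 16 cores), can be sketched as follows. The function name is illustrative only.

```python
def nps_speedup_bound(nps_one_core: float, nps_n_cores: float) -> float:
    """Upper bound on SMP speedup implied by raw NPS scaling."""
    return nps_n_cores / nps_one_core

cores = 16
bound = nps_speedup_bound(1_000_000, 8_000_000)
utilization = bound / cores  # share of the hardware doing node work

print(f"speedup can never exceed {bound:.1f}x on {cores} cores")
print(f"hardware utilization: {utilization:.0%}")
```

The real time-to-depth speedup sits below this bound because of search overhead, which this sketch does not attempt to model.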
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish still scales poorly?

Post by zullil »

bob wrote:
What hardware?

One issue with AMD is that most AMD BIOSes give you two choices for memory setup: (a) NUMA or (b) SMP.

NUMA is the traditional NUMA approach. If you have two chips, as you do, with say 16 GB of DRAM, chip 0 will have addresses 0-8 GB and chip 1 will have addresses 8-16 GB.

SMP interleaves pages between the two chips, so that chip 0 gets addresses 0-4K, chip 1 gets 4K-8K, chip 0 gets 8K-12K, etc. If a program understands NUMA, using the SMP setting will break it badly. If it doesn't understand NUMA, the SMP setting will help avoid memory hotspots but does introduce delays.

If it is Intel, I don't believe they have done this, at least not on any machines I have run on, and I have used a bunch of 'em over time.
Dell T5610 dual Xeon with NUMA set in BIOS

Code: Select all

louis@LZsT5610:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15991 MB
node 0 free: 5994 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16125 MB
node 1 free: 8474 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
louis@LZsT5610:~$ numastat stockfish
                           node0           node1
numa_hit               368078713       350129592
numa_miss                     74          287032
numa_foreign              287032              74
interleave_hit             17534           18437
local_node             365337202       350110658
other_node               2741585          305966
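For a quick read of the counters above, the sketch below computes the share of page allocations on each node that were misses. It relies on my reading of the numastat man page (numa_hit: allocated on the intended node; numa_miss: allocated on this node although another node was preferred), so treat the interpretation as an assumption.

```python
# Counters copied from the numastat output above.
numastat = {
    "node0": {"numa_hit": 368078713, "numa_miss": 74},
    "node1": {"numa_hit": 350129592, "numa_miss": 287032},
}

# Share of allocations on each node that landed there despite
# another node being preferred.
miss_fraction = {
    node: c["numa_miss"] / (c["numa_hit"] + c["numa_miss"])
    for node, c in numastat.items()
}

for node, frac in miss_fraction.items():
    print(f"{node}: {frac:.4%} of allocations were NUMA misses")
```

Both fractions are well under 0.1%, so on these counters off-node allocation does not look like a large factor.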