Stockfish still scales poorly?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jhellis3
Posts: 546
Joined: Sat Aug 17, 2013 12:36 am

Re: Stockfish still scales poorly?

Post by jhellis3 »

You are wrong, as several have already told you. NPS scaling tells you what percentage of the hardware you are able to use. Your SMP speedup is bounded by the NPS speedup. If you search 1M nodes per second on one CPU, and only 8M on 16 cores, you are wasting 1/2 of the hardware and you will NEVER get an SMP speedup > 8x. And in reality it will be less.
Well, if one wants to be pedantic.... This statement is wrong. If you search 16x as many nodes in a given unit of time T with terrible efficiency (1/2 are redundant), then the effective speedup is only 8x. On the other hand, searching 12x as many nodes in T with, say, 90% efficiency is an effective speedup of 10.8x. I would take the 10.8x over the 8x, but that's just me.

At the end of the day though, Lucas is right in that it doesn't really matter what one does as long as the end result is higher Elo.
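As a quick sanity check of the arithmetic above, here is a minimal C++ sketch; the numbers are just the two scenarios from the post, nothing more:

Code: Select all

#include <cstdio>

// Effective speedup = (nodes searched relative to 1 thread) x (fraction of
// those nodes that are useful, i.e. not redundant). Purely illustrative.
static double effective_speedup(double node_ratio, double efficiency) {
    return node_ratio * efficiency;
}

int main() {
    std::printf("16x nodes at 50%% efficiency: %.1fx\n", effective_speedup(16.0, 0.5)); // 8.0x
    std::printf("12x nodes at 90%% efficiency: %.1fx\n", effective_speedup(12.0, 0.9)); // 10.8x
}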
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Stockfish still scales poorly?

Post by Laskos »

jhellis3 wrote:
You are wrong, as several have already told you. NPS scaling tells you what percentage of the hardware you are able to use. Your SMP speedup is bounded by the NPS speedup. If you search 1M nodes per second on one CPU, and only 8M on 16 cores, you are wasting 1/2 of the hardware and you will NEVER get an SMP speedup > 8x. And in reality it will be less.
Well, if one wants to be pedantic.... This statement is wrong. If you search 16x as many nodes in a given unit of time T with terrible efficiency (1/2 are redundant), then the effective speedup is only 8x. On the other hand, searching 12x as many nodes in T with, say, 90% efficiency is an effective speedup of 10.8x. I would take the 10.8x over the 8x, but that's just me.

At the end of the day though, Lucas is right in that it doesn't really matter what one does as long as the end result is higher Elo.
Good correction; I was thinking of posting the same myself. The effective speed-up of SF on 16 threads must be in the 6x-8x range, so any NPS speedup above 9x could suffice, be it 9x or 15x.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish still scales poorly?

Post by zullil »

zullil wrote:
bob wrote:
zullil wrote:Using the latest development version of SF on a 2x8-core Linux workstation with 32 GB RAM. Turbo boost disabled.

Standard SF bench of 37 positions, each searched for 5 minutes:

Code: Select all

./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196
 

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45...
Seems like this still suggests poor scaling from 8 to 16 cores. I realize this is a small amount of testing :wink:. Suggestions, criticisms, explanations welcome.
What hardware?

One issue with AMD is that most AMD BIOSes give you two choices for memory setup: (a) NUMA, (b) SMP.

NUMA is the traditional approach: if you have two chips, as you do, with say 16 GB of DRAM, chip 0 will have addresses 0-8 GB and chip 1 will have addresses 8-16 GB.

SMP interleaves pages between the two chips, so that chip 0 gets addresses 0-4K, chip 1 gets 4K-8K, chip 0 gets 8K-12K, and so on. If a program understands NUMA, using the SMP setting will break it badly. If it doesn't understand NUMA, the SMP setting will help avoid memory hotspots, but it does introduce delays.

If it is Intel, I don't believe they have done this, at least not on any machines I have run on, and I have used a bunch of 'em over time.
Dell T5610 dual Xeon with NUMA set in BIOS

Code: Select all

louis@LZsT5610:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15991 MB
node 0 free: 5994 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16125 MB
node 1 free: 8474 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
louis@LZsT5610:~$ numastat stockfish
                           node0           node1
numa_hit               368078713       350129592
numa_miss                     74          287032
numa_foreign              287032              74
interleave_hit             17534           18437
local_node             365337202       350110658
other_node               2741585          305966
Ignore the output from numastat, at least until I have a chance to process the man page.
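For anyone who wants to experiment with the NUMA-versus-interleaved distinction Bob describes, here is a minimal sketch using libnuma (Linux only, link with -lnuma; illustrative, not Stockfish's actual allocation code):

Code: Select all

#include <numa.h>    // libnuma; install libnuma-dev and link with -lnuma
#include <cstdio>
#include <cstring>

int main() {
    if (numa_available() < 0) {
        std::puts("NUMA not available on this system");
        return 1;
    }
    const size_t sz = 64UL * 1024 * 1024;  // 64 MB test buffer

    // Node-local allocation: all pages on node 0. This is what the BIOS
    // "NUMA" setting preserves: fast for threads on node 0, remote for node 1.
    void* local = numa_alloc_onnode(sz, 0);

    // Interleaved allocation: pages spread round-robin across all nodes,
    // like the BIOS "SMP" setting does globally: no hotspots, extra latency.
    void* spread = numa_alloc_interleaved(sz);

    if (local)  { std::memset(local, 1, sz);  numa_free(local, sz); }
    if (spread) { std::memset(spread, 1, sz); numa_free(spread, sz); }
    return 0;
}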
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Stockfish still scales poorly?

Post by AlvaroBegue »

jhellis3 wrote:
You are wrong, as several have already told you. NPS scaling tells you what percentage of the hardware you are able to use. Your SMP speedup is bounded by the NPS speedup. If you search 1M nodes per second on one CPU, and only 8M on 16 cores, you are wasting 1/2 of the hardware and you will NEVER get an SMP speedup > 8x. And in reality it will be less.
Well, if one wants to be pedantic.... This statement is wrong. If you search 16x as many nodes in a given unit of time T with terrible efficiency (1/2 are redundant), then the effective speedup is only 8x. On the other hand, searching 12x as many nodes in T with, say, 90% efficiency is an effective speedup of 10.8x. I would take the 10.8x over the 8x, but that's just me.
How does that show that the statement is wrong?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

jhellis3 wrote:
You are wrong, as several have already told you. NPS scaling tells you what percentage of the hardware you are able to use. Your SMP speedup is bounded by the NPS speedup. If you search 1M nodes per second on one CPU, and only 8M on 16 cores, you are wasting 1/2 of the hardware and you will NEVER get an SMP speedup > 8x. And in reality it will be less.
Well, if one wants to be pedantic.... This statement is wrong. If you search 16x as many nodes in a given unit of time T with terrible efficiency (1/2 are redundant), then the effective speedup is only 8x. On the other hand, searching 12x as many nodes in T with, say, 90% efficiency is an effective speedup of 10.8x. I would take the 10.8x over the 8x, but that's just me.

At the end of the day though, Lucas is right in that it doesn't really matter what one does as long as the end result is higher Elo.
Please read what I wrote.

1. NPS gives an absolute upper bound on speedup. If you search 10M NPS with one CPU, and still only 10M with 16 CPUs, you will NEVER get any speedup, because you are doing no "extra work" during that search time. NPS is an important number because it provides this upper bound on performance.

2. Once you run a real SMP test, measure time to depth, and compute the speedup, you can ask yourself, "OK, how well am I doing here?"

Here's a sample:

Your NPS speedup is only 10x on 16 CPUs. Your parallel search speedup is 6x. 6x out of 16 sounds bad. But it is not 6x out of 16; it is 6x out of a max of 10x, which is not as bad as it originally sounded. What to do here? You can try to improve the parallel search, which will NEVER reach 10x due to search overhead, and struggle to recover part of the 4x you are losing there; or you can improve the underlying code to recover the 6x of NPS scaling that is missing. If you get half of that back, you will get half of the lost parallel-search performance as well.

NPS doesn't measure speedup, but it absolutely measures the upper bound on the speedup that is possible. Assuming a perfect parallel search with zero overhead, you cannot exceed the NPS speedup, ever.

So both numbers are important. NPS gives information about cache traffic, memory traffic, potential lock interference, and such. Parallel speedup gives information about things like thread waiting times, extra nodes searched due to poor splitting, etc. They are related, but not the same thing. BOTH are critical numbers.
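Put as a formula, Bob's decomposition is: SMP speedup = NPS speedup x search efficiency, with efficiency <= 1, so NPS speedup is the ceiling. A tiny sketch using the example numbers from the post (the values are hypothetical):

Code: Select all

#include <cstdio>

int main() {
    const double nps_speedup = 10.0;  // example: NPS on 16 CPUs / NPS on 1 CPU
    const double smp_speedup = 6.0;   // example: measured time-to-depth speedup

    // How efficient the parallel search itself is, once raw NPS
    // scaling is factored out.
    const double search_efficiency = smp_speedup / nps_speedup;  // 0.6

    std::printf("Search efficiency: %.0f%% of an attainable %.0fx\n",
                100.0 * search_efficiency, nps_speedup);
    // Even a perfect parallel search (efficiency = 1.0) cannot exceed
    // the NPS speedup.
    std::printf("Upper bound on SMP speedup: %.0fx\n", nps_speedup);
}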
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Stockfish still scales poorly?

Post by bob »

zullil wrote:
bob wrote:
zullil wrote:Using the latest development version of SF on a 2x8-core Linux workstation with 32 GB RAM. Turbo boost disabled.

Standard SF bench of 37 positions, each searched for 5 minutes:

Code: Select all

./stockfish bench 16384 16 300000 default time
===========================
Total time (ms) : 11100075
Nodes searched  : 233436761771
Nodes/second    : 21030196
 

./stockfish bench 16384 8 300000 default time
===========================
Total time (ms) : 11100001
Nodes searched  : 160664514528
Nodes/second    : 14474279

21030196/14474279 = 1.45...
Seems like this still suggests poor scaling from 8 to 16 cores. I realize this is a small amount of testing :wink:. Suggestions, criticisms, explanations welcome.
What hardware?

One issue with AMD is that most AMD BIOSes give you two choices for memory setup: (a) NUMA, (b) SMP.

NUMA is the traditional approach: if you have two chips, as you do, with say 16 GB of DRAM, chip 0 will have addresses 0-8 GB and chip 1 will have addresses 8-16 GB.

SMP interleaves pages between the two chips, so that chip 0 gets addresses 0-4K, chip 1 gets 4K-8K, chip 0 gets 8K-12K, and so on. If a program understands NUMA, using the SMP setting will break it badly. If it doesn't understand NUMA, the SMP setting will help avoid memory hotspots, but it does introduce delays.

If it is Intel, I don't believe they have done this, at least not on any machines I have run on, and I have used a bunch of 'em over time.
Dell T5610 dual Xeon with NUMA set in BIOS

Code: Select all

louis@LZsT5610:~$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15991 MB
node 0 free: 5994 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16125 MB
node 1 free: 8474 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
louis@LZsT5610:~$ numastat stockfish
                           node0           node1
numa_hit               368078713       350129592
numa_miss                     74          287032
numa_foreign              287032              74
interleave_hit             17534           18437
local_node             365337202       350110658
other_node               2741585          305966
Turbo-boost disabled? If not, that wrecks the numbers right off the bat.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Stockfish still scales poorly?

Post by zullil »

bob wrote: Turbo-boost disabled? If not, that wrecks the numbers right off the bat.
Yes.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

Joerg Oster wrote:Really? 1 minute is not enough to keep all threads busy? Wow.
1 minute is not enough to reach a depth at which the threads can be kept busy at least most of the time.

If you search very deep, each thread can get a huge subtree for itself and all threads should be busy most of the time, at least if there are a sufficient number of split points relatively close to the root.
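To make that concrete, here is a rough sketch of the kind of split condition a YBW-style searcher applies (names and thresholds are hypothetical, not Stockfish's actual code): only nodes with enough remaining depth are worth splitting, and short searches simply don't produce enough of them near the root to keep helpers fed.

Code: Select all

// Hypothetical YBW-style split check, for illustration only. A node is
// shared with idle threads only when the remaining subtree is big enough
// to amortize the cost of splitting; at short time controls few such
// nodes exist, so helper threads sit idle much of the time.
bool can_split(int remaining_depth, int moves_searched, bool idle_thread_available) {
    const int MIN_SPLIT_DEPTH = 4;   // hypothetical threshold
    return remaining_depth >= MIN_SPLIT_DEPTH
        && moves_searched >= 1       // YBW: search the first move serially
        && idle_thread_available;
}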
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Stockfish still scales poorly?

Post by syzygy »

Laskos wrote:
jhellis3 wrote:
You are wrong, as several have already told you. NPS scaling tells you what percentage of the hardware you are able to use. Your SMP speedup is bounded by the NPS speedup. If you search 1M nodes per second on one CPU, and only 8M on 16 cores, you are wasting 1/2 of the hardware and you will NEVER get an SMP speedup > 8x. And in reality it will be less.
Well, if one wants to be pedantic.... This statement is wrong. If you search 16x as many nodes in a given unit of time T with terrible efficiency (1/2 are redundant), then the effective speedup is only 8x. On the other hand, searching 12x as many nodes in T with, say, 90% efficiency is an effective speedup of 10.8x. I would take the 10.8x over the 8x, but that's just me.

At the end of the day though, Lucas is right in that it doesn't really matter what one does as long as the end result is higher Elo.
Good correction; I was thinking of posting the same myself. The effective speed-up of SF on 16 threads must be in the 6x-8x range, so any NPS speedup above 9x could suffice, be it 9x or 15x.
Bob specifically said that NPS speedup bounds SMP speedup. This is the point. He also said NPS is not everything. What is quoted here is simply 100% correct if one reads carefully.

SF has poor NPS speedup when going from 8 to 16 cores. If that can be addressed (in a way that is not stupid, so not e.g. by letting all cores search their own tree without any form of communication), the potential gain is very significant.
jhellis3
Posts: 546
Joined: Sat Aug 17, 2013 12:36 am

Re: Stockfish still scales poorly?

Post by jhellis3 »

No, it isn't. The reason it isn't is evident from what he was replying to. Lucas's statement was not incorrect, because Elo > all. So using any reasoning whatsoever (no matter how independently sound) to claim that one should worship at a different altar than Elo is wrong.

One would hope this is rather obvious...
1. NPS gives an absolute upper bound on speedup. If you search 10M NPS with one CPU, and still only 10M with 16 CPUs, you will NEVER get any speedup, because you are doing no "extra work" during that search time. NPS is an important number because it provides this upper bound on performance.
This statement is false. NPS does not provide an upper bound on performance. Performance can only sanely be measured one way (via results), typically in the form of Elo. Using NPS as a metric to judge the effectiveness of code (which alters the search) is nonsense.
Last edited by jhellis3 on Sat Feb 21, 2015 1:11 am, edited 2 times in total.