Baffling multithreading scaling behavior

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Idea: Try a parallel perft()

Post by Dann Corbit »

I know you are aware that you lose a lot of performance once all the cores are busy and another task is launched.

I did find with the C++11 threading stuff that the base priority was too low. A small priority boost gave a big measurable speedup for me.

I am atypical, though, because I have lots of services running on my machine (most of which will be idle), but I will always have 8% or so of my compute power consumed even at rest. So (for instance) I only run 10 threads out of a possible 12 cores so that the 8% overhead won't clash, and so that I can break in when I need the machine for other purposes. Otherwise, when the screen saver turns on, it can take 15 minutes to get control back.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Idea: Try a parallel perft()

Post by matthewlai »

Dann Corbit wrote:I know you are aware that you lose a lot of performance once all the cores are busy and another task is launched.

I did find with the C++11 threading stuff that the base priority was too low. A small priority boost gave a big measurable speedup for me.

I am atypical, though, because I have lots of services running on my machine (most of which will be idle), but I will always have 8% or so of my compute power consumed even at rest. So (for instance) I only run 10 threads out of a possible 12 cores so that the 8% overhead won't clash, and so that I can break in when I need the machine for other purposes. Otherwise, when the screen saver turns on, it can take 15 minutes to get control back.
That only matters if all the cores are fully loaded, though. My server is mostly idle, and I run all my training at nice -n 19, and it still runs just as fast.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
ymatioun
Posts: 64
Joined: Fri Oct 18, 2013 11:40 pm
Location: New York

Re: Idea: Try a parallel perft()

Post by ymatioun »

Why don't you profile your slow run, using the "Very Sleepy" profiler, for example? When I had a similar problem, the profiler showed that a lot of time was spent writing to memory shared by all threads (to update the total node count and other search stats). Once I removed that code, NPS scaling became almost perfect.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Idea: Try a parallel perft()

Post by cdani »

ymatioun wrote:...to update total node count...
I get the total node count by having the main thread iterate over the counters of the other threads, and only when I have to show a new PV (so mostly never), so this does not affect performance.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Baffling multithreading scaling behavior

Post by jdart »

I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

jdart wrote:I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
Threads should already be allocated evenly to physical cores. That was a problem when hyper-threading first came out more than 10 years ago, but all modern schedulers are very aware of it now, and don't treat all the virtual CPUs the same.

NUMA auto-balancing (moving threads closer to memory they use, or memory closer to threads, at run time) is also in Linux, but was only introduced 2 years ago - https://www.linux-kvm.org/images/7/75/0 ... ancing.pdf
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

I would suggest a couple of things.

(1) make CERTAIN hyper threading is turned off. There has been some inconsistency with Intel and how they report this stuff and how they number the processors. You do not want to get two threads on one physical core, so disabling hyper threading is a first test. If it makes no difference, you can turn it back on.

(2) memory is the next issue. If you share anything, such as hash, you have a potential cache thrasher. For example, if you have broken your hashing so that you repeatedly access the same entry, that will kill performance.

(3) I use "detached mode" (pthread_attr_setdetachstate() with PTHREAD_CREATE_DETACHED) since I NEVER use a pthread_join(). This eliminates some hidden overhead where threads can check on each other without your knowing.

(4) if you have access to the intel compiler, vTune is the best there is to find out what is causing such a bottleneck. icc is free for personal use on a linux box. I don't know about vTune.

(5) I presume you are certain that you are not excessively using locks. Even if there is no contention, a lock (even a simple spin lock as used in Crafty) is STILL a high-overhead activity that causes all sorts of cache overhead.

The usual culprits are (a) shared memory updating that was not intended, i.e. this:

int node_counts[#threads]

is an absolute no-no. Nothing is really shared, but since the counters for 8 threads fit into one cache line, you will absolutely smoke cache with inter-cache forwarding and invalidating.

(b) hyper threading; (c) excessive locking; (d) excessive system calls that have "hidden" locks or shared data that you don't directly see.

I doubt NUMA will have any major impact here. It is important, but it won't cause anything like this at all. You might have trouble going 8x faster on 16 cores if NUMA is an issue, but not to the level you are seeing.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Idea: Try a parallel perft()

Post by bob »

cdani wrote:
ymatioun wrote:...to update total node count...
I get the total node count by having the main thread iterate over the counters of the other threads, and only when I have to show a new PV (so mostly never), so this does not affect performance.
That was not quite what he was talking about. More like this:

int nodes[#threads]; is a no-no. Even if each thread updates a different element of that array, it is smoking cache.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

matthewlai wrote:
jdart wrote:I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
Threads should already be allocated evenly to physical cores. That was a problem when hyper-threading first came out more than 10 years ago, but all modern schedulers are very aware of it now, and don't treat all the virtual CPUs the same.

NUMA auto-balancing (moving threads closer to memory they use, or memory closer to threads, at run time) is also in Linux, but was only introduced 2 years ago - https://www.linux-kvm.org/images/7/75/0 ... ancing.pdf
This HAS been an issue within the last 3-4 years also. Intel has not always been consistent on how the logical processors are numbered. On some Intel chips, logical processors 0 and 1 share a physical core. On others, 0 and 8 share a physical core (assuming 8 physical cores). I ran afoul of this even though I use processor affinity in Crafty to lock each thread to a specific core. I have an "affinity" command that lets me specify how the physical cores are numbered to avoid this. And even worse, when Intel went to non-power-of-2 core counts, that broke some of the linux internal processor affinity stuff as well because it is more convenient to do an AND rather than a DIV instruction.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

bob wrote:
matthewlai wrote:
jdart wrote:I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
Threads should already be allocated evenly to physical cores. That was a problem when hyper-threading first came out more than 10 years ago, but all modern schedulers are very aware of it now, and don't treat all the virtual CPUs the same.

NUMA auto-balancing (moving threads closer to memory they use, or memory closer to threads, at run time) is also in Linux, but was only introduced 2 years ago - https://www.linux-kvm.org/images/7/75/0 ... ancing.pdf
This HAS been an issue within the last 3-4 years also. Intel has not always been consistent on how the logical processors are numbered. On some Intel chips, logical processors 0 and 1 share a physical core. On others, 0 and 8 share a physical core (assuming 8 physical cores). I ran afoul of this even though I use processor affinity in Crafty to lock each thread to a specific core. I have an "affinity" command that lets me specify how the physical cores are numbered to avoid this. And even worse, when Intel went to non-power-of-2 core counts, that broke some of the linux internal processor affinity stuff as well because it is more convenient to do an AND rather than a DIV instruction.
Wouldn't that be _because_ you use affinity, instead of _even though_ you use affinity? If you don't pin your threads, and the scheduler is aware of the new numbering (since presumably it looks at the core ID of each core instead of relying on a specific numbering), it should do the right thing?
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.