Baffling multithreading scaling behavior

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Idea: Try a parallel perft()

Post by Dann Corbit »

I know you are aware that you lose a lot of performance once all the cores are busy and another task is launched.

I did find with the C++11 threading stuff that the base priority was too low. A small priority boost gave a big measurable speedup for me.

I am atypical, though, because I have lots of services running on my machine (most of which will be idle), but I will always have 8% or so of my compute power consumed even at rest. So (for instance) I only run 10 threads out of a possible 12 cores so that the 8% overhead won't clash, and so that I can break in when I need the machine for other purposes. Otherwise, when the screen saver turns on, it can take 15 minutes to get control back.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Idea: Try a parallel perft()

Post by matthewlai »

Dann Corbit wrote:I know you are aware that you lose a lot of performance once all the cores are busy and another task is launched.

I did find with the C++11 threading stuff that the base priority was too low. A small priority boost gave a big measurable speedup for me.

I am atypical, though, because I have lots of services running on my machine (most of which will be idle), but I will always have 8% or so of my compute power consumed even at rest. So (for instance) I only run 10 threads out of a possible 12 cores so that the 8% overhead won't clash, and so that I can break in when I need the machine for other purposes. Otherwise, when the screen saver turns on, it can take 15 minutes to get control back.
That only matters if all the cores are fully loaded, though. My server is mostly idle, and I run all my training at nice -n 19, and it still runs just as fast.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
ymatioun
Posts: 64
Joined: Fri Oct 18, 2013 11:40 pm
Location: New York

Re: Idea: Try a parallel perft()

Post by ymatioun »

Why don't you profile your slow run, using the "Very Sleepy" profiler, for example? When I had a similar problem, the profiler showed that a lot of time was spent writing to memory shared by all threads (to update the total node count and other search stats). Once I removed that code, NPS scaling became almost perfect.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Idea: Try a parallel perft()

Post by cdani »

ymatioun wrote:...to update total node count...
I get the total node count by having the main thread iterate over the counters of the other threads, and only when I have to show a new PV (so mostly never), so this does not affect performance.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Baffling multithreading scaling behavior

Post by jdart »

I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

jdart wrote:I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
Threads should already be allocated evenly to physical cores. That was a problem when hyper-threading first came out more than 10 years ago, but all modern schedulers are very aware of it now, and don't treat all the virtual CPUs the same.

NUMA auto-balancing (moving threads closer to memory they use, or memory closer to threads, at run time) is also in Linux, but was only introduced 2 years ago - https://www.linux-kvm.org/images/7/75/0 ... ancing.pdf
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

I would suggest a couple of things.

(1) make CERTAIN hyper threading is turned off. There has been some inconsistency with Intel and how they report this stuff and how they number the processors. You do not want to get two threads on one physical core, so disabling hyper threading is a first test. If it makes no difference, you can turn it back on.

(2) memory is the next issue. If you share anything, such as hash, you have a potential cache thrasher. For example, if you have broken your hashing so that you repeatedly access the same entry, that will kill performance.

(3) I use "detached mode" (pthread_attr_setdetachstate() with PTHREAD_CREATE_DETACHED) since I NEVER use a pthread_join(). This eliminates some hidden overhead where threads can check on each other without your knowing.

(4) if you have access to the intel compiler, vTune is the best there is to find out what is causing such a bottleneck. icc is free for personal use on a linux box. I don't know about vTune.

(5) I presume you are certain that you are not excessively using locks. Even if there is no contention, a lock (even a simple spin lock as used in Crafty) is STILL a high-overhead activity that causes all sorts of cache overhead.

The usual culprits are (a) shared memory updating that was not intended, i.e. this:

int node_counts[#threads]

is an absolute no-no. Nothing is really shared, but since the counters for 8 threads fit into one cache line, you will absolutely smoke cache with inter-cache forwarding and invalidating.

(b) hyper threading; (c) excessive locking; (d) excessive system calls that have "hidden" locks or shared data that you don't directly see.

I doubt NUMA will have any major impact here. It is important, but it won't cause anything like this at all. You might have trouble going 8x faster on 16 cores if NUMA is an issue, but not to the level you are seeing.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Idea: Try a parallel perft()

Post by bob »

cdani wrote:
ymatioun wrote:...to update total node count...
I get the total node count by having the main thread iterate over the counters of the other threads, and only when I have to show a new PV (so mostly never), so this does not affect performance.
That was not quite what he was talking about. More like this:

int nodes[#threads]; is a no-no. Even if each thread updates a different element of that array, it is smoking cache.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

matthewlai wrote:
jdart wrote:I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
Threads should already be allocated evenly to physical cores. That was a problem when hyper-threading first came out more than 10 years ago, but all modern schedulers are very aware of it now, and don't treat all the virtual CPUs the same.

NUMA auto-balancing (moving threads closer to memory they use, or memory closer to threads, at run time) is also in Linux, but was only introduced 2 years ago - https://www.linux-kvm.org/images/7/75/0 ... ancing.pdf
This HAS been an issue within the last 3-4 years also. Intel has not always been consistent on how the logical processors are numbered. On some Intel chips, logical processors 0 and 1 share a physical core. On others, 0 and 8 share a physical core (assuming 8 physical cores). I ran afoul of this even though I use processor affinity in Crafty to lock each thread to a specific core. I have an "affinity" command that lets me specify how the physical cores are numbered to avoid this. And even worse, when Intel went to non-power-of-2 core counts, that broke some of the linux internal processor affinity stuff as well because it is more convenient to do an AND rather than a DIV instruction.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

bob wrote:
matthewlai wrote:
jdart wrote:I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
Threads should already be allocated evenly to physical cores. That was a problem when hyper-threading first came out more than 10 years ago, but all modern schedulers are very aware of it now, and don't treat all the virtual CPUs the same.

NUMA auto-balancing (moving threads closer to memory they use, or memory closer to threads, at run time) is also in Linux, but was only introduced 2 years ago - https://www.linux-kvm.org/images/7/75/0 ... ancing.pdf
This HAS been an issue within the last 3-4 years also. Intel has not always been consistent on how the logical processors are numbered. On some Intel chips, logical processors 0 and 1 share a physical core. On others, 0 and 8 share a physical core (assuming 8 physical cores). I ran afoul of this even though I use processor affinity in Crafty to lock each thread to a specific core. I have an "affinity" command that lets me specify how the physical cores are numbered to avoid this. And even worse, when Intel went to non-power-of-2 core counts, that broke some of the linux internal processor affinity stuff as well because it is more convenient to do an AND rather than a DIV instruction.
Wouldn't that be _because_ you use affinity, instead of _even though_ you use affinity? If you don't pin your threads, and the scheduler is aware of the new numbering (since presumably it looks at the core ID of each core instead of relying on a specific numbering), it should do the right thing?
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.