Baffling multithreading scaling behavior

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

I would suggest a couple of things.

(1) make CERTAIN hyper threading is turned off. There has been some inconsistency with Intel and how they report this stuff and how they number the processors. You do not want to get two threads on one physical core, so disabling hyper threading is a first test. If it makes no difference, you can turn it back on.

(2) memory is the next issue. If you share anything, such as hash, you have a potential cache thrasher. For example, if you have broken your hashing so that you repeatedly access the same entry, that will kill performance.

(3) I create my threads in "detached mode" (pthread_attr_setdetachstate() or similar) since I NEVER use a pthread_join(). This eliminates some hidden overhead where threads can check on each other without your knowing. (A short sketch follows this list.)

(4) if you have access to the intel compiler, vTune is the best there is to find out what is causing such a bottleneck. icc is free for personal use on a linux box. I don't know about vTune.

(5) I presume you are certain that you are not excessively using locks. Even if there is no contention, a lock (even a simple spin lock as used in Crafty) is STILL a high-overhead activity that causes all sorts of cache overhead.
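
For (3), a minimal sketch of what the detached-mode setup looks like with the pthread attribute API (the thread function here is only a placeholder, not Crafty's actual code):

#include <pthread.h>

/* placeholder search thread body */
void *search_thread(void *arg) {
    /* ... do the work ... */
    return NULL;
}

/* start one detached search thread; nothing will ever pthread_join() it */
int start_detached(void *arg) {
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    pthread_attr_setdetachstate(&attr, PTHREAD_CREATE_DETACHED);
    int rc = pthread_create(&tid, &attr, search_thread, arg);
    pthread_attr_destroy(&attr);
    return rc;   /* 0 on success */
}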

The usual culprits are (a) shared-memory updating that was not intended, i.e., something like this:

int node_counts[NUM_THREADS];

is an absolute no-no. Nothing is logically shared, but since the counters for 8 threads fit into one cache line, you will absolutely smoke the cache with inter-cache forwarding and invalidating. (A padded version is sketched after this list.)

(b) hyper threading; (c) excessive locking; (d) excessive system calls that have "hidden" locks or shared data that you don't directly see.
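
A minimal sketch of the usual fix for (a): give each thread's counter its own cache line so one thread's updates never invalidate another thread's line (64-byte lines assumed; the names are illustrative):

#include <cstdint>

constexpr int MAX_THREADS = 16;
constexpr int CACHE_LINE  = 64;                 // assumed line size

// each counter occupies a full cache line by itself
struct alignas(CACHE_LINE) PaddedCounter {
    uint64_t nodes = 0;
    char pad[CACHE_LINE - sizeof(uint64_t)];
};

PaddedCounter node_counts[MAX_THREADS];         // thread t does node_counts[t].nodes++

A simpler alternative is to keep each thread's counter inside that thread's own, well-separated data block and only sum them up when a total is actually needed.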

I doubt NUMA is having any major impact here. It matters, but it won't cause anything like this at all. You might have trouble going 8x faster on 16 cores if NUMA is an issue, but not to the degree you are seeing.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Idea: Try a parallel perft()

Post by bob »

cdani wrote:
ymatioun wrote:...to update total node count...
I get the total node count by having the main thread iterate over the counters of the other threads, and only when I have to show a new PV, so almost never, so this does not affect performance.
That was not quite what he was talking about. More like this:

int nodes[NUM_THREADS]; is a no-no. Even if each thread updates a different element of that array, it is smoking the cache.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Baffling multithreading scaling behavior

Post by bob »

matthewlai wrote:
jdart wrote:I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
Threads should already be allocated evenly to physical cores. That was a problem when hyper-threading first came out more than 10 years ago, but all modern schedulers are very aware of it now, and don't treat all the virtual CPUs the same.

NUMA auto-balancing (moving threads closer to memory they use, or memory closer to threads, at run time) is also in Linux, but was only introduced 2 years ago - https://www.linux-kvm.org/images/7/75/0 ... ancing.pdf
This HAS been an issue within the last 3-4 years as well. Intel has not always been consistent about how the logical processors are numbered. On some Intel chips, logical processors 0 and 1 share a physical core. On others, 0 and 8 share a physical core (assuming 8 physical cores). I ran afoul of this even though I use processor affinity in Crafty to lock each thread to a specific core. I have an "affinity" command that lets me specify how the physical cores are numbered to avoid this. And even worse, when Intel went to non-power-of-2 core counts, that broke some of the Linux-internal processor affinity code as well, because it is more convenient to do an AND rather than a DIV instruction.
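
For illustration only (this is not Crafty's actual code), a minimal Linux sketch of the kind of pinning an "affinity"-style command ends up doing, with the logical-CPU numbering supplied as a table precisely because it differs from machine to machine:

#define _GNU_SOURCE          // for CPU_SET/pthread_setaffinity_np (g++ usually defines it already)
#include <pthread.h>
#include <sched.h>

// core_map[i] = logical CPU that search thread i should be locked to.
// On one box the right map might be 0,1,2,...; on another 0,2,4,..., etc.
static const int core_map[8] = {0, 1, 2, 3, 4, 5, 6, 7};   // placeholder mapping

// pin the calling thread (thread index id) to its mapped CPU; returns 0 on success
static int pin_to_core(int id) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_map[id], &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}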
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

bob wrote:
matthewlai wrote:
jdart wrote:I have a box with the same CPU.

One thing to try is to disable hyperthreading in the BIOS. Then at least you can be sure threads are allocated to physical cores.

This is a NUMA machine so there is a penalty for memory access that is not local to a CPU. Without explicitly requesting it I don't think you have any guarantee that your four cores are allocated on the same CPU. So that is a factor although not typically a huge one.

--Jon
Threads should already be allocated evenly to physical cores. That was a problem when hyper-threading first came out more than 10 years ago, but all modern schedulers are very aware of it now, and don't treat all the virtual CPUs the same.

NUMA auto-balancing (moving threads closer to memory they use, or memory closer to threads, at run time) is also in Linux, but was only introduced 2 years ago - https://www.linux-kvm.org/images/7/75/0 ... ancing.pdf
This HAS been an issue within the last 3-4 years as well. Intel has not always been consistent about how the logical processors are numbered. On some Intel chips, logical processors 0 and 1 share a physical core. On others, 0 and 8 share a physical core (assuming 8 physical cores). I ran afoul of this even though I use processor affinity in Crafty to lock each thread to a specific core. I have an "affinity" command that lets me specify how the physical cores are numbered to avoid this. And even worse, when Intel went to non-power-of-2 core counts, that broke some of the Linux-internal processor affinity code as well, because it is more convenient to do an AND rather than a DIV instruction.
Wouldn't that be _because_ you use affinity, instead of _even though_ you use affinity? If you don't pin your threads, and the scheduler is aware of the new numbering (since presumably it looks at the core ID of each core instead of relying on a specific numbering), it should do the right thing?
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Idea: Try a parallel perft()

Post by cdani »

bob wrote:
cdani wrote:
ymatioun wrote:...to update total node count...
I get the total node count by having the main thread iterate over the counters of the other threads, and only when I have to show a new PV, so almost never, so this does not affect performance.
That was not quite what he was talking about. More like this:

int nodes[NUM_THREADS]; is a no-no. Even if each thread updates a different element of that array, it is smoking the cache.
Yes. I didn't mention it, but each thread's counters are many KB apart from one another. In fact, in Andscacs all of a thread's memory is kept separate from the other threads' memory. As you know, that is a must to make sure cache effects don't interfere with performance.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

I finally figured out what the problem was.

Turns out I had a condition backwards in my code that checks to see if the computer has run out of time. This was only supposed to happen on the main thread but a typo (?) caused it to happen on the spawned threads too.

Checking the time must be some kind of horrible locking operation on desktop Ubuntu 16 since it was slowing the spawned threads down to 10%-40% of the speed they should have been! Glad this happened though because it did prompt me to find this bug.

Now that my threads are all running at the same speed, I have to figure out why running 4 threads is only 84% as fast as running 4 unique instances of the program, even though there are no locks or (I think) shared memory use.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

Thanks for the reply; turning off hyperthreading was one of the first things I did with this machine.

I found the bug that was causing the child threads to run slower than the main thread:
http://www.talkchess.com/forum/viewtopi ... 23&t=61349

I'm still not seeing the speedup that I want so I will have to work on that.

Right now I have my engine in a class and am just instantiating multiple instances of the class and running them in parallel. I think this should be as fast as running separate processes.

But when I run 4 separate instances of the engine (as different processes), each runs at 98.5% of the single-core maximum, while the threads only run at 84% of it, and that percentage drops fairly quickly as I add more threads.

So it seems I must have some kind of memory contention somewhere, even though all my data is separated into different instances of this engine class. :(
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

TomKerrigan wrote:I finally figured out what the problem was.

Turns out I had a condition backwards in my code that checks to see if the computer has run out of time. This was only supposed to happen on the main thread but a typo (?) caused it to happen on the spawned threads too.

Checking the time must be some kind of horrible locking operation on desktop Ubuntu 16 since it was slowing the spawned threads down to 10%-40% of the speed they should have been! Glad this happened though because it did prompt me to find this bug.

Now that my threads are all running at the same speed, I have to figure out why running 4 threads is only 84% as fast as running 4 unique instances of the program, even though there are no locks or (I think) shared memory use.
Wow, that's a nice bug to catch! It may also help to only check the time every 1000 nodes or so, instead of at every node. That may even speed up the single-threaded case.

There is a lock in malloc, if you are using dynamic memory. It can be quite significant if you are doing a lot of allocations. A thread-local memory allocator like TCMalloc would solve that.
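
A minimal sketch of that kind of gated time check (the names and the 1000-node interval are only illustrative, not anyone's actual engine code):

#include <chrono>
#include <cstdint>

constexpr uint64_t CHECK_INTERVAL = 1000;

// called from the search loop; only the main thread (id 0) ever touches the
// clock, and even then only once every CHECK_INTERVAL nodes
bool out_of_time(int thread_id, uint64_t nodes_searched,
                 std::chrono::steady_clock::time_point deadline) {
    if (thread_id != 0)                        // helper threads never ask for the time
        return false;
    if (nodes_searched % CHECK_INTERVAL != 0)  // skip the system call almost always
        return false;
    return std::chrono::steady_clock::now() >= deadline;
}

In practice the main thread would then set a shared stop flag that the helper threads poll, so they still stop promptly when time runs out.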
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

Yeah, I only check every 1024 nodes (easy to test) ... still, I've been meaning to increase that number. Computers have gotten so fast since I wrote that code that I'm now checking the time thousands of times per second, which seems a little unnecessary. :)

I don't do any memory allocation while searching.

I think I must have some memory contention that I'm not aware of. The threads do access some global tables for move generation etc. but nothing ever WRITES to those tables after the boot code. Maybe I'm misunderstanding cache coherency and this is still causing a problem somehow?
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

TomKerrigan wrote:Thanks for the reply; turning off hyperthreading was one of the first things I did with this machine.

I found the bug that was causing the child threads to run slower than the main thread:
http://www.talkchess.com/forum/viewtopi ... 23&t=61349

I'm still not seeing the speedup that I want so I will have to work on that.

Right now I have my engine in a class and am just instantiating multiple instances of the class and running them in parallel. I think this should be as fast as running separate processes.

But when I run 4 separate instances of the engine (as different processes), each runs at 98.5% of the single-core maximum, while the threads only run at 84% of it, and that percentage drops fairly quickly as I add more threads.

So it seems I must have some kind of memory contention somewhere, even though all my data is separated into different instances of this engine class. :(
It's possible that the threads are put into one NUMA group (since they share a PID), and Linux will try to schedule threads in the same group on the same NUMA node (CPU).

That's good for memory access (until you run out of memory bandwidth at least), but bad for turbo boost.

This hypothesis is easy to test, though - just run 16 threads vs. 16 processes. That way all cores will be busy in both cases.
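
One quick way to see where the threads actually end up (an illustrative Linux-only sketch; sched_getcpu() reports the CPU the calling thread is currently running on, and numactl --hardware shows which NUMA node each CPU belongs to):

#include <sched.h>     // sched_getcpu()
#include <cstdio>
#include <thread>
#include <vector>

int main() {
    const int n = 16;
    std::vector<std::thread> threads;
    for (int i = 0; i < n; ++i)
        threads.emplace_back([i] {
            // burn a little CPU so the scheduler settles, then report placement
            volatile unsigned long spin = 0;
            for (unsigned long j = 0; j < 100000000UL; ++j) spin += j;
            std::printf("thread %d is on CPU %d\n", i, sched_getcpu());
        });
    for (auto &t : threads) t.join();
}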
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.