Baffling multithreading scaling behavior

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Idea: Try a parallel perft()

Post by cdani »

bob wrote:
cdani wrote:
ymatioun wrote:...to update total node count...
I get the total node count by having the main thread iterate over the counters of the other threads, and only when I have to show a new PV, so almost never, so this does not affect performance.
That was not quite what he was talking about. More like this:

int nodes[#threads]; is a no-no. Even if each thread updates a different element of that array, it is smoking cache.
Yes. I didn't mention it, but each thread's counters are many KB apart from one another. In fact, all of a thread's memory in Andscacs is kept separate from the memory of the other threads. As you know, this is a must to make sure cache effects don't interfere with performance.
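A minimal sketch of what that kind of separation can look like in C++ (hypothetical names, not the actual Andscacs code):

Code:

#include <atomic>
#include <cstdint>

// Keep each thread's hot counters in their own cache line so an increment by
// one thread never invalidates a line another thread is also writing to
// (i.e. no false sharing). alignas(64) rounds sizeof up to a full line.
constexpr int MAX_THREADS = 64;

struct alignas(64) ThreadData {
    std::atomic<uint64_t> nodes{0};   // written only by the owning thread
    // ...the rest of the per-thread search state would live here too...
};

ThreadData thread_data[MAX_THREADS];  // element i is touched only by thread i

// The main thread sums the counters, e.g. just before printing a new PV,
// so the cross-thread reads happen rarely.
uint64_t total_nodes(int num_threads) {
    uint64_t sum = 0;
    for (int i = 0; i < num_threads; ++i)
        sum += thread_data[i].nodes.load(std::memory_order_relaxed);
    return sum;
}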
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

I finally figured out what the problem was.

Turns out I had a condition backwards in my code that checks to see if the computer has run out of time. This was only supposed to happen on the main thread but a typo (?) caused it to happen on the spawned threads too.

Checking the time must be some kind of horrible locking operation on desktop Ubuntu 16 since it was slowing the spawned threads down to 10%-40% of the speed they should have been! Glad this happened though because it did prompt me to find this bug.

Now that my threads are all running the same speed, I have to figure out why running 4 threads is only 84% as fast as running 4 unique instances of the program, even though there are no locks or (I think) shared memory use.
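The intended behavior, in sketch form (illustrative names only, not the engine's actual code):

Code:

#include <atomic>
#include <chrono>
#include <cstdint>

// Only the main thread polls the clock, and only every few thousand nodes;
// the spawned threads just watch the stop flag.
std::atomic<bool> stop_search{false};

void check_time(bool is_main_thread, uint64_t nodes_searched,
                std::chrono::steady_clock::time_point deadline)
{
    if (!is_main_thread) return;              // helpers never touch the clock
    if (nodes_searched % 4096 != 0) return;   // poll the clock rarely
    if (std::chrono::steady_clock::now() >= deadline)
        stop_search.store(true, std::memory_order_relaxed);  // everyone unwinds
}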
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

Thanks for the reply; turning off hyperthreading was one of the first things I did with this machine.

I found the bug that was causing the child threads to run slower than the main thread:
http://www.talkchess.com/forum/viewtopi ... 23&t=61349

I'm still not seeing the speedup that I want so I will have to work on that.

Right now I have my engine in a class and am just creating multiple instances of that class and running them in parallel. I think this should be as fast as running separate processes.

But when I run 4 separate instances of the engine (different processes), they each run at 98.5% of the single-core maximum, while the threads run at only 84%, and that number drops fairly quickly as more threads are added.

So it seems I must have some kind of memory contention somewhere, even though all my data is separated into different instances of this engine class. :(
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

TomKerrigan wrote:I finally figured out what the problem was.

Turns out I had a condition backwards in my code that checks to see if the computer has run out of time. This was only supposed to happen on the main thread but a typo (?) caused it to happen on the spawned threads too.

Checking the time must be some kind of horrible locking operation on desktop Ubuntu 16 since it was slowing the spawned threads down to 10%-40% of the speed they should have been! Glad this happened though because it did prompt me to find this bug.

Now that my threads are all running the same speed, I have to figure out why running 4 threads is only 84% as fast as running 4 unique instances of the program, even though there are no locks or (I think) shared memory use.
Wow, that's a nice bug to catch! It may also help to only check the time every, say, 1000 nodes instead of at every node. That may even speed up the single-threaded case.

There is a lock in malloc, if you are using dynamic memory. It can be quite significant if you are doing a lot of allocations. A thread-local memory allocator like TCMalloc would solve that.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

Yeah, I only check every 1024 nodes (easy to test) ... still, I've been meaning to increase that number. Computers have gotten so fast since I wrote that code that I'm now checking the time thousands of times per second, which seems a little unnecessary. :)

I don't do any memory allocation while searching.

I think I must have some memory contention that I'm not aware of. The threads do access some global tables for move generation, etc., but nothing ever WRITES to those tables after the boot code. Maybe I'm misunderstanding cache coherency and this is still causing a problem somehow?
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

TomKerrigan wrote:Thanks for the reply; turning off hyperthreading was one of the first things I did with this machine.

I found the bug that was causing the child threads to run slower than the main thread:
http://www.talkchess.com/forum/viewtopi ... 23&t=61349

I'm still not seeing the speedup that I want so I will have to work on that.

Right now I have my engine in a class and am just creating multiple instances of that class and running them in parallel. I think this should be as fast as running separate processes.

But when I run 4 separate instances of the engine (different processes), they each run at 98.5% of the single-core maximum, while the threads run at only 84%, and that number drops fairly quickly as more threads are added.

So it seems I must have some kind of memory contention somewhere, even though all my data is separated into different instances of this engine class. :(
It's possible that the threads are put into one NUMA group (since they share a PID), and Linux will try to schedule threads in the same group on the same NUMA node (CPU).

That's good for memory access (until you run out of memory bandwidth at least), but bad for turbo boost.

This hypothesis is easy to test, though - just run 16 threads vs. 16 processes, so that all cores are busy in both cases.
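If you want to take the scheduler (and any NUMA placement) out of the picture while testing, you can also pin each search thread to its own core - a rough Linux-only sketch (the helper name is made up):

Code:

#include <pthread.h>   // pthread_setaffinity_np (GNU extension)
#include <sched.h>
#include <thread>

// Pin a std::thread to one core so the OS can't migrate it during the test.
// Compile with -pthread; CPU_ZERO/CPU_SET need _GNU_SOURCE (g++ defines it).
void pin_to_core(std::thread& t, int core) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core, &set);
    pthread_setaffinity_np(t.native_handle(), sizeof(set), &set);
}

// Usage: after spawning the i-th search thread, call pin_to_core(threads[i], i).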
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Baffling multithreading scaling behavior

Post by TomKerrigan »

Good thought; could also be tested by turning off turbo boost, although I don't know how to do this in Linux yet. I do have a good program to toggle turbo boost on my Mac.

I see the same slowdown on my 4-core Mac when running 4 threads vs. 4 processes. :(

My next idea is to move ALL data into my engine class, even if the data is read-only tables.

Do you guys think VTune would be useful in sorting out this slowdown?
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Baffling multithreading scaling behavior

Post by cdani »

TomKerrigan wrote:Yeah, I only check every 1024 nodes (easy to test) ... still, I've been meaning to increase that number.
I calculate on the fly how many nodes I should wait between clock checks, so that the clock is checked, say, every 0.05 seconds, or more often if the time control is really short. Of course this number varies a lot across different CPUs.
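Roughly like this (a simplified sketch, not the exact Andscacs code; the 0.05 second target is just the example above):

Code:

#include <chrono>
#include <cstdint>

// Pick a node interval between clock checks from the measured search speed,
// aiming for a check roughly every 0.05 s, more often when little time is left.
uint64_t nodes_between_checks(uint64_t nodes_searched,
                              std::chrono::steady_clock::time_point search_start,
                              double remaining_time_sec)
{
    using namespace std::chrono;
    double elapsed = duration<double>(steady_clock::now() - search_start).count();
    double nps = (elapsed > 0.001) ? nodes_searched / elapsed : 1000000.0;

    double target = 0.05;                        // ~20 clock checks per second
    if (remaining_time_sec < 1.0)
        target = remaining_time_sec / 20.0;      // check more often near the flag

    uint64_t interval = static_cast<uint64_t>(nps * target);
    return interval < 256 ? 256 : interval;      // keep a sane lower bound
}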
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

cdani wrote:
TomKerrigan wrote:Yeah, I only check every 1024 nodes (easy to test) ... still, I've been meaning to increase that number.
I calculate on the fly how many nodes I should wait between clock checks, so that the clock is checked, say, every 0.05 seconds, or more often if the time control is really short. Of course this number varies a lot across different CPUs.
I don't check the time anymore. At the beginning of each search, I start a thread that sleeps for the allocated amount of time, and sets a flag when it wakes up. The search thread only checks the flag.

Well, ok, it's slightly more complicated than that, because searches can be interrupted. It actually waits on a condition variable with a timeout equal to the duration, so it can wake up early if we want to terminate the search early (by notify()ing on the condition variable).

I have 3 threads for a single-threaded search - the main I/O processing thread, a search thread, and a timer thread.
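In sketch form (simplified, not my real code) the timer side looks roughly like this:

Code:

#include <atomic>
#include <chrono>
#include <condition_variable>
#include <mutex>

// Simplified sketch of the timer-thread idea described above.
std::atomic<bool> time_up{false};
std::mutex timer_mutex;
std::condition_variable timer_cv;
bool abort_timer = false;   // set under the mutex to cancel the timer early

void timer_thread(std::chrono::milliseconds allocated) {
    std::unique_lock<std::mutex> lock(timer_mutex);
    // Wake up when the allocated time elapses OR when someone notifies us
    // because the search is being terminated early.
    timer_cv.wait_for(lock, allocated, [] { return abort_timer; });
    time_up.store(true, std::memory_order_relaxed);
}

// The search thread only polls the flag:
//     if (time_up.load(std::memory_order_relaxed)) { /* unwind the search */ }
// To end the search early from the I/O thread:
//     { std::lock_guard<std::mutex> g(timer_mutex); abort_timer = true; }
//     timer_cv.notify_one();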
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Baffling multithreading scaling behavior

Post by matthewlai »

TomKerrigan wrote:Good thought; could also be tested by turning off turbo boost, although I don't know how to do this in Linux yet. I do have a good program to toggle turbo boost on my Mac.

I see the same slowdown on my 4-core Mac when running 4 threads vs. 4 processes. :(

My next idea is to move ALL data into my engine class, even if the data is read-only tables.

Do you guys think VTune would be useful in sorting out this slowdown?
VTune will probably help, but maybe try free options like gprof first? If you look at the profile of a single-threaded search vs. multiple searches in one process, that may give you some hints (e.g. if some function takes a higher % of time in the multi-threaded case).
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.