Baffling multithreading scaling behavior

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Baffling multithreading scaling behavior

Post by TomKerrigan »

Hi guys, I was curious about your thoughts on an issue.

I just bought a nice (used) dual processor E5-2670 workstation, with 16 cores of Sandy Bridge goodness, and installed Ubuntu.

Unfortunately my parallel search performance is worse than awful on this machine. The best it does is a 1.5x speedup with 4 threads... any more threads and it slows down.

The same exact algorithm goes 3x faster on my 4-core Mac Mini.

As a sanity check, I set up my engine to do _exactly_ the same searches on multiple threads. Each thread even has its own hash table. There are no locks and use of shared memory is absolutely minimal.

With this setup, I would expect NPS to scale perfectly. Memory bandwidth is not an issue since I can run two instances of the engine and they both run as fast as if I was just running one instance.

Instead I am seeing some extremely odd behavior, where the main thread searches at maximum speed (same as if I was running without multithreading) whereas the spawned threads are somehow running at a fraction of that.

A second thread runs at ~30% of a core's maximum speed. If I spawn 15 threads, they each run at ~8% of a core's maximum speed.

At this point I'm at a loss to understand how this could be happening or even how to go about debugging it. If there's some kind of memory contention or cache coherency problem or whatever, then I would expect the main thread to be affected by it just as much as the other threads, but the main thread always runs at "100%" NPS.

I was wondering if any of you guys had any ideas??? Thanks in advance!
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Idea: Try a parallel perft()

Post by sje »

Idea: Try a parallel perft()

It's fairly simple to write a parallel perft() which splits only at the root and where the threads have minimal interaction and where the results are known beforehand and the order of calculation is irrelevant. This should produce a near linear speed-up; if not, at least it will be easy to diagnose any problems. You might want to use the perft() code which was added to TSCP a while ago.

After that, various thread interaction code nuggets (transposition table sharing, deep splits, rendezvous points, etc.) can be added to see which new feature is causing a slowdown.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Idea: Try a parallel perft()

Post by TomKerrigan »

Thanks for the suggestion.

I went even simpler and just wrote a function that generates a stream of random numbers. When run on parallel threads simultaneously, this function scales perfectly, as expected. So it doesn't seem to be any kind of thread priority issue.

Then I changed my code so it ran the same search on multiple threads serially. In other words, one thread did a search, then another thread did the same exact search, etc., and finally the main thread did the search.

Somehow the main thread was STILL massively faster at doing the same search than the spawned threads, even when operating in serial. (?!?!)
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Idea: Try a parallel perft()

Post by matthewlai »

TomKerrigan wrote:Thanks for the suggestion.

I went even simpler and just wrote a function that generates a stream of random numbers. When run on parallel threads simultaneously, this function scales perfectly, as expected. So it doesn't seem to be any kind of thread priority issue.

Then I changed my code so it ran the same search on multiple threads serially. In other words, one thread did a search, then another thread did the same exact search, etc., and finally the main thread did the search.

Somehow the main thread was STILL massively faster at doing the same search than the spawned threads, even when operating in serial. (?!?!)
What does "top" say when your program is searching? Are you doing anything with affinity?

Also, I just got a used 2xE5-2670 system, too! It's amazing how cheap they go for these days.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Idea: Try a parallel perft()

Post by TomKerrigan »

top says my engine is using 100% CPU when running these searches serially, as one would hope.

Somehow the searches on the spawned threads are still going less than half as fast as the same exact search on the main thread?!
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Idea: Try a parallel perft()

Post by matthewlai »

TomKerrigan wrote:top says my engine is using 100% CPU when running these searches serially, as one would hope.

Somehow the searches on the spawned threads are still going less than half as fast as the same exact search on the main thread?!
You mean 1600% right? "top" reports 100% per core/hyperthread.

Are you accessing EGTB? I found that to be my bottleneck when I was doing massively parallel searches (on independent positions for training) when I had my EGTB on a hard drive. But in my case I saw a drop in reported CPU usage, since threads were waiting for IO.
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Idea: Try a parallel perft()

Post by TomKerrigan »

No, 100% because I'm doing the same search on different threads serially.

So this eliminates any possibility of resource contention.

There is NO reason why a search should take longer when run on a spawned thread vs. the main thread if they are the only things running on the computer. And yet... :(
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Idea: Try a parallel perft()

Post by matthewlai »

TomKerrigan wrote:No, 100% because I'm doing the same search on different threads serially.

So this eliminates any possibility of resource contention.

There is NO reason why a search should take longer when run on a spawned thread vs. the main thread if they are the only things running on the computer. And yet... :(
OH I see what you mean. Now that's strange.

The only difference I can think of is the stack. Linux handles the main thread's stack differently from the stacks of spawned threads, if I remember correctly. But that shouldn't cause any problems unless you are overflowing a stack.

Maybe try profiling?
TomKerrigan
Posts: 64
Joined: Fri Jul 03, 2015 9:22 pm

Re: Idea: Try a parallel perft()

Post by TomKerrigan »

Indeed. Very frustrating.

Also congrats on your new purchase. I lucked out and got a barebones Dell T5610 for $399 on eBay. It wasn't even that barebones, it came with 32GB of RAM and a (slowish) graphics card. With processors and a small SSD the total came to $600... crazy amount of compute power per dollar!
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Idea: Try a parallel perft()

Post by matthewlai »

TomKerrigan wrote:Indeed. Very frustrating.

Also congrats on your new purchase. I lucked out and got a barebones Dell T5610 for $399 on eBay. It wasn't even that barebones, it came with 32GB of RAM and a (slowish) graphics card. With processors and a small SSD the total came to $600... crazy amount of compute power per dollar!
A workstation like that for $600 is absolutely amazing!

I got a 1U server with a SuperMicro motherboard, which makes installing a graphics card a bit of a challenge... it will probably take a Dremel to resolve!

Couldn't say no to a 2xE5-2670 server (32GB ECC RAM) for $500! Shipping to the UK cost about half that much, though...

I'm waiting for the day Ivy Bridge 12-cores become affordable, so I can upgrade it to 24 cores!