
thread affinity

Posted: Fri Jul 03, 2015 3:17 pm
by mar
I've noticed that my program (which does something other than play chess) has a relatively lousy speedup despite being "embarrassingly parallel",
so I profiled a bit and noticed that the OS was preempting relatively often, shuffling threads around cores a lot (Win7x64). I haven't tried other OSes yet.
So I experimented a bit and forced a thread affinity mask for the workers (assuming nothing else was running at the moment).
This gave me a marginal but nice speedup of ~7% on 4 cores (a quad with 8 logical cores, so I set the affinity mask to 1 << 2*thread_id).
Any thoughts? Has anyone done something similar? I guess this is nothing new.
In theory, setting an affinity mask is dangerous, but if you know that nothing else is running it seems like a small win (in my case, the speedup went from 3.26x to 3.5x).
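
For reference, a minimal sketch of that setup (Windows; worker_id and the spread onto even-numbered logical processors are just how I described it above, not code lifted from the actual program):

#include <windows.h>

// Pin the calling worker to one logical processor per physical core
// (logical processors 0, 2, 4, 6 on a quad core with SMT).
void pin_worker(int worker_id)
{
    DWORD_PTR mask = DWORD_PTR(1) << (2 * worker_id);
    SetThreadAffinityMask(GetCurrentThread(), mask);
}

Each worker calls this once at startup, before doing any real work.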

Re: thread affinity

Posted: Fri Jul 03, 2015 3:54 pm
by matthewlai
mar wrote:I've noticed that my program (which does something other than play chess) has a relatively lousy speedup despite being "embarrassingly parallel",
so I profiled a bit and noticed that the OS was preempting relatively often, shuffling threads around cores a lot (Win7x64). I haven't tried other OSes yet.
So I experimented a bit and forced a thread affinity mask for the workers (assuming nothing else was running at the moment).
This gave me a marginal but nice speedup of ~7% on 4 cores (a quad with 8 logical cores, so I set the affinity mask to 1 << 2*thread_id).
Any thoughts? Has anyone done something similar? I guess this is nothing new.
In theory, setting an affinity mask is dangerous, but if you know that nothing else is running it seems like a small win (in my case, the speedup went from 3.26x to 3.5x).
I found that (on Linux at least), if you have no more compute-intensive threads than physical cores, affinity doesn't usually help all that much.

What I've often found to be the bottleneck in embarrassingly parallel problems is heap allocation overhead.

By default, heap allocations are always synchronized, and if your threads do a lot of heap allocations and deallocations (for example, by using standard containers in C++), it can become a bottleneck.
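
For instance, something like this (purely illustrative, not from any particular program) does a heap allocation and deallocation on every iteration, and each one goes through the global, synchronized allocator:

#include <vector>

// Illustrative: every iteration constructs and destroys a std::vector,
// so each thread running this loop hammers the shared heap.
double process_items(int n)
{
    double sum = 0.0;
    for (int i = 0; i < n; ++i) {
        std::vector<double> scratch(256);  // heap allocation (synchronized)
        scratch[0] = i;
        sum += scratch[0];
    }                                      // heap deallocation (synchronized)
    return sum;
}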

The solution is for each thread to allocate a large chunk of memory and do its own allocations/deallocations from that pool. No synchronization is needed in that case.
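
A minimal sketch of that idea, assuming a simple bump-pointer arena per thread (ThreadArena and the sizes are illustrative, not a specific library):

#include <cstddef>
#include <cstdlib>
#include <new>

// Each thread owns one of these, so allocate() needs no locking.
class ThreadArena {
public:
    explicit ThreadArena(std::size_t bytes)
        : base_(static_cast<char*>(std::malloc(bytes))), size_(bytes), used_(0) {}
    ~ThreadArena() { std::free(base_); }

    void* allocate(std::size_t bytes) {
        // Round up so every returned pointer stays suitably aligned.
        const std::size_t a = alignof(std::max_align_t);
        bytes = (bytes + a - 1) & ~(a - 1);
        if (used_ + bytes > size_) throw std::bad_alloc();
        void* p = base_ + used_;
        used_ += bytes;
        return p;
    }

    // Free everything at once between work items.
    void reset() { used_ = 0; }

private:
    char* base_;
    std::size_t size_;
    std::size_t used_;
};

A worker would typically keep one of these as a thread_local object and call reset() after each batch instead of freeing individual allocations.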

You can implement that yourself or use an existing library. I use tcmalloc, which is a drop-in replacement for malloc/free (which new and delete also call under the hood). No code changes are required; you just have to link in the library, and it overrides all the memory allocation functions.

I've recently started working with 20-core machines, and found that at 20 threads even moderate use of the heap quickly becomes a bottleneck. tcmalloc solves the problem completely, and allows me to get 19x+ scaling from 20 threads on embarrassingly parallel problems.

Obviously you can also rewrite your code to not do dynamic allocation (much), but that usually requires extensive changes to the code.

Re: thread affinity

Posted: Fri Jul 03, 2015 4:47 pm
by mar
matthewlai wrote:What I've often found to be the bottleneck in embarrassingly parallel problems is heap allocation overhead.
In my case allocation overhead is exactly zero.

Re: thread affinity

Posted: Fri Jul 03, 2015 5:23 pm
by bob
mar wrote:I've noticed that my program (which does something other than play chess) has a relatively lousy speedup despite being "embarrassingly parallel",
so I profiled a bit and noticed that the OS was preempting relatively often, shuffling threads around cores a lot (Win7x64). I haven't tried other OSes yet.
So I experimented a bit and forced a thread affinity mask for the workers (assuming nothing else was running at the moment).
This gave me a marginal but nice speedup of ~7% on 4 cores (a quad with 8 logical cores, so I set the affinity mask to 1 << 2*thread_id).
Any thoughts? Has anyone done something similar? I guess this is nothing new.
In theory, setting an affinity mask is dangerous, but if you know that nothing else is running it seems like a small win (in my case, the speedup went from 3.26x to 3.5x).
I've used CPU affinity for quite a while, although only on Linux. The Linux kernel is very good about doing this correctly (automatically), but it is not perfect. I pin a specific thread to a specific core, although you can make an argument for pinning a thread to a specific physical CPU chip instead (the threads still share the big L3 cache, but each core has its own L1/L2 caches).

I did this because of NUMA memory rather than cache, however. When I start Crafty, I malloc() all the memory for everything, then fire up each thread, which first pins itself to the correct physical core and then initializes its own data so that the data will "fault in" on the correct NUMA node (local memory). From then on the thread is always on the right physical CPU chip with the correct local memory bank, and doesn't suffer the remote-memory-access penalty NUMA produces.
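
A minimal sketch of that pin-then-first-touch pattern (Linux/glibc; core_id and the buffer are stand-ins for the real per-thread setup, not Crafty's actual code):

#ifndef _GNU_SOURCE
#define _GNU_SOURCE
#endif
#include <pthread.h>
#include <sched.h>
#include <cstddef>
#include <cstring>

void pin_and_fault_in(int core_id, char* buf, std::size_t bytes)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(core_id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    // First touch after pinning: the pages fault in on this core's local NUMA node.
    std::memset(buf, 0, bytes);
}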

However, I did go down to the specific core to extract as much as possible by always using the same L1, L2 and L3, rather than just the same L3, which is all you keep if threads bounce between cores within the same chip.

Re: thread affinity

Posted: Fri Jul 03, 2015 5:23 pm
by matthewlai
mar wrote:
matthewlai wrote:What I've found to be the bottleneck a lot of times in embarassingly parallel problems is heap allocation overhead.
In my case allocation overhead is exactly zero.
Well then that's clearly not your problem :).

Another common one is threads writing into adjacent memory locations, causing cache-line invalidations on the other cores (false sharing).

I'm guessing that's not your problem as well :).
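
If it ever does become one, a minimal sketch of the usual fix is to pad each thread's slot to a full cache line so two threads never write to the same line (64-byte lines assumed; PerThread/results are illustrative names):

#include <cstddef>
#include <cstdint>

constexpr std::size_t kCacheLine = 64;

// alignas pads each slot to a full cache line, so results[i] and
// results[i + 1] end up on different lines and don't false-share.
struct alignas(kCacheLine) PerThread {
    std::uint64_t counter = 0;
};

PerThread results[8];  // one slot per worker thread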

Re: thread affinity

Posted: Fri Jul 03, 2015 8:59 pm
by Joost Buijs
mar wrote:I've noticed that my program (that does something else than play chess), has relatively lousy speedup while being "embarassingly parallel"
so I profiled a bit and noticed that the OS was preempting relatively often, shuffling threads around cores a lot (Win7x64). I haven't tried other OSes yet.
So I tried to experiment a bit and forced thread affinity mask for workers (assuming nothing else was running at the moment).
This gave me a marginal but nice speedup of ~7% on 4 cores (quad with 8 logical cores so I set affinity mask to 1 << 2*thread_id)
Any thoughts? Anyone did something similar I guess this is nothing new.
In theory, setting affinity mask is dangerous but if you know that nothing else is running, it seems like a small win (in my case, speedup went from 3.26x to 3.5x)
I've tried several times in the past to set the affinity mask on Win7x64 and it didn't give me any speedup whatsoever. Actually I have the impression that it hurts a little.
It is not very easy to detect small differences because SMP has a tendency to be somewhat random in nature.
Since you are seeing a 7% speedup with affinity set, it must have something to do with differences in the architecture of the programs themselves.

Re: thread affinity

Posted: Fri Jul 03, 2015 9:37 pm
by mar
matthewlai wrote:Well then that's clearly not your problem :).

Another common one is threads writing into adjacent memory locations, causing cache-line invalidations on the other cores (false sharing).

I'm guessing that's not your problem as well :).
I was thinking along the same lines. The threads read the same memory (this should be no problem) and the only writes occur to a buffer which is different for each thread.
Only the final result might, in theory, be written to adjacent locations, but the work is divided so that the probability of this happening is very close to zero (I tried various other batching schemes, but it made no difference).
I also tried writing the final result to per-thread buffers as well, but with no gain at all.

It's just that I was surprised that shuffling work around cores can cause a measurable slowdown (in fact, I was surprised that the OS does something like that at all and relatively often...)
Maybe Linux doesn't suffer from this. (btw, 19x+ speedup on 20 cores is excellent! :)
But honestly, I was hoping to get an average speedup of 3.x closer to 4; instead I get 3.x closer to 3.

Re: thread affinity

Posted: Fri Jul 03, 2015 9:42 pm
by mar
Joost Buijs wrote:I've tried several times in the past to set the affinity mask on Win7x64 and it didn't give me any speedup whatsoever. Actually I have the impression that it hurts a little.
It is not very easy to detect small differences because SMP has a tendency to be somewhat random in nature.
Since you are seeing a 7% speedup with affinity set, it must have something to do with differences in the architecture of the programs themselves.
That's interesting, I'd guess that it should be no worse unless other threads are running.
As I thought, it's a common idea to try, but what I measured seems like too big a difference for my taste.
Yes, as I stated, it's not a chess program (in fact it's a simple kd-tree-accelerated trimesh raytracer, so it should be trivial to parallelize).

Re: thread affinity

Posted: Fri Jul 03, 2015 9:46 pm
by mar
bob wrote:However, I did go down to the specific core to extract as much as possible by always using the same L1, L2 and L3, rather than just the same L3, which is all you keep if threads bounce between cores within the same chip.
I knew that L1 is per-core, but I was never sure about L2/L3. I thought L2 was shared by a pair of cores, but as you say this is probably not the case.
I was surprised that the OS decided to shuffle n active threads running on n cores that much.

Re: thread affinity

Posted: Fri Jul 03, 2015 9:51 pm
by matthewlai
mar wrote:
matthewlai wrote:Well then that's clearly not your problem :).

Another common one is threads writing into adjacent memory locations, causing cache-line invalidations on the other cores (false sharing).

I'm guessing that's not your problem as well :).
I was thinking along the same lines. The threads read the same memory (this should be no problem) and the only writes occur to a buffer which is different for each thread.
Only the final result might, in theory, be written to adjacent locations, but the work is divided so that the probability of this happening is very close to zero (I tried various other batching schemes, but it made no difference).
I also tried writing the final result to per-thread buffers as well, but with no gain at all.

It's just that I was surprised that shuffling work around cores can cause a measurable slowdown (in fact, I was surprised that the OS does something like that at all and relatively often...)
Maybe Linux doesn't suffer from this. (btw, 19x+ speedup on 20 cores is excellent! :)
But honestly, I was hoping to get an average speedup of 3.x closer to 4; instead I get 3.x closer to 3.
Yeah reading should be no problem.

I find it strange that moving threads around would cause a significant slowdown on a non-NUMA system. For example, even if every thread gets moved 10 times a second, that's 100 ms of running time between moves. A modern machine has DRAM bandwidth of about 30 GB/s, so refilling even a few megabytes of cache would still take well under 1 ms of those 100 ms.

One would also think that the guy/girl who wrote the scheduler would have thought of this, and made it not happen that often.

Maybe your program is now bottlenecked by memory bandwidth? Depending on your CPU, it's also possible that it's bottlenecked by shared cache bandwidth (on recent Intel CPUs the L3 is shared by all cores; on recent AMD CPUs the L3 is shared as well, and each pair of cores shares an L2).