mar wrote:
I've noticed that my program (which does something other than play chess) gets a relatively lousy speedup despite being "embarrassingly parallel", so I profiled a bit and noticed that the OS was preempting relatively often, shuffling threads around the cores a lot (Win7 x64). I haven't tried other OSes yet.
So I experimented a bit and forced a thread affinity mask for the workers (assuming nothing else was running at the time).
This gave me a marginal but nice speedup of ~7% on 4 cores (a quad with 8 logical cores, so I set the affinity mask to 1 << 2*thread_id).
Any thoughts? Has anyone done something similar? I guess this is nothing new.
In theory, setting an affinity mask is dangerous, but if you know that nothing else is running, it seems like a small win (in my case, the speedup went from 3.26x to 3.5x).
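For reference, the per-worker mask mar describes maps onto the Win32 affinity call roughly like this (a minimal sketch: SetThreadAffinityMask and GetCurrentThread are the real API calls, the helper function around them is illustrative):

    #include <windows.h>

    /* Pin the calling worker to one logical CPU per physical core, as
       mar describes: on a quad with 8 logical cores, workers land on
       logical CPUs 0, 2, 4, 6 (this assumes hyperthreaded siblings are
       adjacent in the enumeration). */
    static void pin_worker(int thread_id)
    {
        DWORD_PTR mask = (DWORD_PTR)1 << (2 * thread_id);
        if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0) {
            /* 0 means the call failed; the thread just stays free to migrate. */
        }
    }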
I've used CPU affinity for quite a while, although only on Linux. The Linux kernel is very good about placing threads correctly on its own, but it is not perfect. I pin a specific thread to a specific core, although you can make an argument for pinning a thread to a specific physical CPU chip instead (you still get the big shared L3 cache, but each core has its own L1/L2 caches).

I did this because of NUMA memory rather than cache, however. When I start Crafty, I malloc() all the memory for everything, then fire up each thread, which first pins itself to the correct physical core and then initializes its own data, so that the data will "fault in" on the correct NUMA node (local memory). From then on the thread is always on the right physical CPU chip with the correct local memory bank, and does not suffer the remote-memory-access penalty NUMA produces.
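On Linux that startup sequence looks roughly like this (a minimal sketch, not Crafty's actual code; it assumes glibc's pthread_setaffinity_np, the kernel's default first-touch page placement, and a simple one-thread-per-core numbering):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTHREADS   8
    #define PER_THREAD (64UL * 1024 * 1024)   /* illustrative per-thread size */

    static char *pool;   /* all memory malloc()ed up front, before threads start */

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        cpu_set_t set;

        /* Step 1: pin to this worker's core before touching any memory. */
        CPU_ZERO(&set);
        CPU_SET(id, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Step 2: first touch.  A big malloc() typically only reserves the
           pages; writing them from the pinned thread faults them in on this
           core's own NUMA node, so later accesses are all local. */
        memset(pool + (size_t)id * PER_THREAD, 0, PER_THREAD);

        /* ... searching with this slice now hits local memory only ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        int i;

        pool = malloc((size_t)NTHREADS * PER_THREAD);
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)(long)i);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

The order matters: pin first, touch second. If the thread touched its data before being pinned, the pages could land on whatever node it happened to be scheduled on.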
However, I did go all the way down to a specific core to extract as much as possible by always using the same L1, L2, and L3, rather than just the same L3, which is all you keep if threads stay on the same chip but bounce between its cores.
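The chip-level alternative is the same call with a wider CPU set: the thread always sees the same L3 but can still lose warm L1/L2 state by hopping cores. A sketch, assuming cores 0-3 sit on chip 0 and cores 4-7 on chip 1 (real topologies vary, so a production version would query them):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define CORES_PER_CHIP 4   /* assumed layout: chip 0 = cores 0-3, chip 1 = cores 4-7 */

    /* Allow the calling thread to run on any core of one physical chip:
       it keeps the chip's shared L3 and local NUMA node, but may migrate
       between that chip's cores and so lose its L1/L2 contents. */
    static void pin_to_chip(int chip)
    {
        cpu_set_t set;
        int c;

        CPU_ZERO(&set);
        for (c = 0; c < CORES_PER_CHIP; c++)
            CPU_SET(chip * CORES_PER_CHIP + c, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }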