mar wrote:
I've noticed that my program (which does something other than play chess) gets a relatively lousy speedup despite being "embarrassingly parallel", so I profiled a bit and noticed that the OS was preempting relatively often, shuffling threads around the cores a lot (Win7 x64). I haven't tried other OSes yet.
So I experimented a bit and forced a thread affinity mask for the workers (assuming nothing else was running at the time).
This gave me a marginal but nice speedup of ~7% on 4 cores (a quad with 8 logical cores, so I set the affinity mask to 1 << 2*thread_id).
Any thoughts? Has anyone done something similar? I guess this is nothing new.
In theory, setting an affinity mask is dangerous, but if you know that nothing else is running, it seems like a small win (in my case, the speedup went from 3.26x to 3.5x).
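For reference, the per-worker mask mar describes maps onto the Win32 affinity call roughly like this (a minimal sketch: SetThreadAffinityMask and GetCurrentThread are the real API calls, the helper function around them is illustrative):

    #include <windows.h>

    /* Pin the calling worker to one logical CPU per physical core, as
       mar describes: on a quad with 8 logical cores, workers land on
       logical CPUs 0, 2, 4, 6 (this assumes hyperthreaded siblings are
       adjacent in the enumeration). */
    static void pin_worker(int thread_id)
    {
        DWORD_PTR mask = (DWORD_PTR)1 << (2 * thread_id);
        if (SetThreadAffinityMask(GetCurrentThread(), mask) == 0) {
            /* 0 means the call failed; the thread just stays free to migrate. */
        }
    }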
I've used CPU affinity for quite a while, although only on Linux. The Linux kernel is very good about placing threads correctly on its own, but it is not perfect. I pin a specific thread to a specific core, although you can make an argument for pinning a thread to a specific physical CPU chip instead (you still get the big shared L3 cache, but each core has its own L1/L2 caches).

I did this because of NUMA memory rather than cache, however. When I start Crafty, I malloc() all the memory for everything, then fire up each thread, which first pins itself to the correct physical core and then initializes its own data, so that the data will "fault in" on the correct NUMA node (local memory). From then on the thread is always on the right physical CPU chip with the correct local memory bank, and does not suffer the remote-memory-access penalty NUMA produces.
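On Linux that startup sequence looks roughly like this (a minimal sketch, not Crafty's actual code; it assumes glibc's pthread_setaffinity_np, the kernel's default first-touch page placement, and a simple one-thread-per-core numbering):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>
    #include <stdlib.h>
    #include <string.h>

    #define NTHREADS   8
    #define PER_THREAD (64UL * 1024 * 1024)   /* illustrative per-thread size */

    static char *pool;   /* all memory malloc()ed up front, before threads start */

    static void *worker(void *arg)
    {
        int id = (int)(long)arg;
        cpu_set_t set;

        /* Step 1: pin to this worker's core before touching any memory. */
        CPU_ZERO(&set);
        CPU_SET(id, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

        /* Step 2: first touch.  A big malloc() typically only reserves the
           pages; writing them from the pinned thread faults them in on this
           core's own NUMA node, so later accesses are all local. */
        memset(pool + (size_t)id * PER_THREAD, 0, PER_THREAD);

        /* ... searching with this slice now hits local memory only ... */
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];
        int i;

        pool = malloc((size_t)NTHREADS * PER_THREAD);
        for (i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, (void *)(long)i);
        for (i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }

The order matters: pin first, touch second. If the thread touched its data before being pinned, the pages could land on whatever node it happened to be scheduled on.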
However, I did go all the way down to a specific core to extract as much as possible by always using the same L1, L2, and L3, rather than just the same L3, which is all you keep if threads stay on the same chip but bounce between its cores.
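The chip-level alternative is the same call with a wider CPU set: the thread always sees the same L3 but can still lose warm L1/L2 state by hopping cores. A sketch, assuming cores 0-3 sit on chip 0 and cores 4-7 on chip 1 (real topologies vary, so a production version would query them):

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    #define CORES_PER_CHIP 4   /* assumed layout: chip 0 = cores 0-3, chip 1 = cores 4-7 */

    /* Allow the calling thread to run on any core of one physical chip:
       it keeps the chip's shared L3 and local NUMA node, but may migrate
       between that chip's cores and so lose its L1/L2 contents. */
    static void pin_to_chip(int chip)
    {
        cpu_set_t set;
        int c;

        CPU_ZERO(&set);
        for (c = 0; c < CORES_PER_CHIP; c++)
            CPU_SET(chip * CORES_PER_CHIP + c, &set);
        pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }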