optimizing performance on dual Xeon systems (NUMA)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

optimizing performance on dual Xeon systems (NUMA)

Post by jdart »

I've had a report from Martin Thoresen that Arasan scales well up to 8 cores but actually starts dropping in NPS once it is using more than that (he has a dual Xeon setup with 16 cores). My understanding is that modern Xeons (5500 series and up) have a NUMA architecture, so each processor has its own memory and non-local memory access is more expensive than local. The only thing I can think of that might cause this slowdown is that the thread usage is causing a lot of cross-processor traffic. The way I do YBWC, the master thread has a small data structure that holds the current best move and score and some other information, and all cooperating threads update this structure as they search. So that is an example of shared memory that may be local to one processor but accessed by others. Of course the hash table is also a shared resource.

I am speculating here because I don't currently have a dual processor system to test on.

Does this seem plausible, and could it cause the observed slowdown? And what would the fix be? I see Crafty is doing some stuff with thread affinity in the case of a NUMA system. Would it make sense to restrict splits to threads that are assigned to the same processor as the master thread?

--Jon
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: optimizing performance on dual Xeon systems (NUMA)

Post by Daniel Shawul »

Perhaps this thread is useful. When I tested Scorpio on an 8-way AMD quad-core Opteron, I was not able to get scaling beyond 8 cores. Same story for Crafty, so it was not only your engine. You can optimize a bit with a small effort by using the 'first touch' approach: all split blocks that a thread may use, local hash tables (eval & pawn), and other per-thread data can be allocated on a first-touch basis. That helped a bit, but controlling thread migration by setting affinity did not help me, and in fact may have slowed it down. I think it is better to let the OS handle it.
YMMV
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: optimizing performance on dual Xeon systems (NUMA)

Post by bob »

jdart wrote:I've had a report from Martin Thoresen that Arasan scales well up to 8 cores but actually starts dropping in NPS once it is using more than that (he has a dual Xeon setup with 16 cores). My understanding is that modern Xeons (5500 series and up) have a NUMA architecture, so each processor has its own memory and non-local memory access is more expensive than local. The only thing I can think of that might cause this slowdown is that the thread usage is causing a lot of cross-processor traffic. The way I do YBWC, the master thread has a small data structure that holds the current best move and score and some other information, and all cooperating threads update this structure as they search. So that is an example of shared memory that may be local to one processor but accessed by others. Of course the hash table is also a shared resource.

I am speculating here because I don't currently have a dual processor system to test on.

Does this seem plausible, and could it cause the observed slowdown? And what would the fix be? I see Crafty is doing some stuff with thread affinity in the case of a NUMA system. Would it make sense to restrict splits to threads that are assigned to the same processor as the master thread?

--Jon
Two things to look out for:

(1) If you have two variables in a 64-byte block of memory, where one is frequently changed by one thread and the other by a different thread, that will thrash the cache badly. For example, per-thread node counters are a potential killer. The simple solution is to make sure that any such values are separated by 64 bytes to force them into separate cache blocks.

(2) The first memory access to a specific page faults that page into physical RAM local to the core that does the access. For split blocks in Crafty, I initialize them when I start the program, but I make each thread (core) initialize its subset of those blocks so that each core has its most-referenced blocks in local memory. This can be an issue for the hash as well: you'd like the hash evenly distributed across all NUMA nodes to avoid a hot spot where one node holds the entire thing...
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: optimizing performance on dual Xeon systems (NUMA)

Post by jdart »

The first issue is not specific to NUMA systems, and in fact I have already optimized for this.

The second one may be significant. However, I am a little surprised that it is apparently so severe that adding cores not only fails to increase search speed, but actually decreases it.

--Jon
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: optimizing performance on dual Xeon systems (NUMA)

Post by bob »

jdart wrote:The first issue is not specific to NUMA systems, and in fact I have already optimized for this.

The second one may be significant. However, I am a little surprised that it is apparently so severe that adding cores not only fails to increase search speed, but actually decreases it.

--Jon
As the number of cores per chip rises, it becomes a REAL bottleneck issue, since there is just one memory path per chip, not one per core. Add a lot of cache-coherency traffic and you hit a bottleneck that won't go away.

Note that the first case is more severe on NUMA boxes, because cache-to-cache traffic is not nearly as fast as on a single chip with multiple caches talking to each other. Since caches share lines and forward lines back and forth, the traffic can ramp right on up there. I'd rather have my local cache provide 99.99% of the data I need, with only occasional inter-cache traffic...

AMD has some internal tools that will measure this stuff quite accurately, but they don't distribute them. They've run them on my code once or twice when I raised a question...
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: optimizing performance on dual Xeon systems (NUMA)

Post by jdart »

Intel CPUs also have internal performance counters, and VTune can read them. As far as I know, though, cross-processor traffic is not a measurement scenario that is configured out of the box.

I dug around a bit and found this very detailed presentation online about Xeon 5500/5600:

http://openlab-mu-internal.web.cern.ch/ ... _intro.pdf

See pp. 75-80 for a discussion of memory use on multi-processor systems.

--Jon
Martin Thoresen
Posts: 1833
Joined: Thu Jun 22, 2006 12:07 am

Re: optimizing performance on dual Xeon systems (NUMA)

Post by Martin Thoresen »

Hi Jon,

Just for general information, the processors used are Intel Xeon E5-2689.

Some cache information (per processor):
Level 1 instruction cache: 8 x 32 KB
Level 1 data cache: 8 x 32 KB
Level 2 cache: 8 x 256 KB
Level 3 cache: 20 MB

For each processor there are 4 sticks of 4 GB RAM, so 16 GB RAM per processor.

I don't know how this works: if I set a hash of, say, 8 GB, will that hash use memory connected to just one processor, or will it draw from both? And can this be controlled somewhere?
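On Linux, one way to control this from outside the engine is the numactl tool, which sets the memory-placement policy for a process before it starts. A sketch, assuming a hypothetical engine binary called ./engine (not a tested recipe for any particular engine):

```shell
# Interleave all of the engine's memory (including a large hash)
# across the NUMA nodes, page by page, instead of letting it all
# land on the node of the thread that allocates it:
numactl --interleave=all ./engine

# Or pin the process and its memory to node 0 only:
numactl --cpunodebind=0 --membind=0 ./engine
```

Interleaving trades a little average latency for avoiding the hot spot Bob describes, where one node holds the entire hash.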

If you would like me to run a few test builds for you I can do that of course.

Best,
Martin