optimizing performance on dual Xeon systems (NUMA)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

optimizing performance on dual Xeon systems (NUMA)

Post by jdart »

I've had a report from Martin Thoresen that Arasan scales well up to 8 cores but actually starts dropping in NPS once it is using more than that (he has a dual Xeon setup with 16 cores). My understanding is that modern Xeons (5500 series and up) have a NUMA architecture, so each processor has its own memory and non-local memory access is more expensive than local. The only thing I can think of that might cause this slowdown is that the thread usage is causing a lot of cross-processor traffic. The way I do YBWC, the master thread has a small data structure that holds the current best move and score and some other information, and all cooperating threads update this structure as they search. So that is an example of shared memory that may be local to one processor but accessed by others. Of course the hash table is also a shared resource.

I am speculating here because I don't currently have a dual processor system to test on.

Does this seem plausible, and could it cause the observed slowdown? And what would the fix be? I see Crafty is doing some stuff with thread affinity in the case of a NUMA system. Would it make sense to restrict splits to threads that are assigned to the same processor as the master thread?

--Jon
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: optimizing performance on dual Xeon systems (NUMA)

Post by Daniel Shawul »

Perhaps this thread is useful. When I tested Scorpio on an 8-way AMD quad-core Opteron, I was not able to get scaling beyond 8 cores. Same story for Crafty, so it was not only your engine. You can optimize a bit with a small effort by using the 'first touch' approach: all split blocks that a thread may use, local hash tables (eval & pawn), and other per-thread data can be allocated on a first-touch basis. That helped a bit, but controlling thread migration by setting affinity did not help me, and in fact may have slowed it down. I think it is better to let the OS handle it.
YMMV
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: optimizing performance on dual Xeon systems (NUMA)

Post by bob »

jdart wrote:I've had a report from Martin Thoresen that Arasan scales well up to 8 cores but actually starts dropping in NPS once it is using more than that (he has a dual Xeon setup with 16 cores). My understanding is that modern Xeons (5500 series and up) have a NUMA architecture, so each processor has its own memory and non-local memory access is more expensive than local. The only thing I can think of that might cause this slowdown is that the thread usage is causing a lot of cross-processor traffic. The way I do YBWC, the master thread has a small data structure that holds the current best move and score and some other information, and all cooperating threads update this structure as they search. So that is an example of shared memory that may be local to one processor but accessed by others. Of course the hash table is also a shared resource.

I am speculating here because I don't currently have a dual processor system to test on.

Does this seem plausible, and could it cause the observed slowdown? And what would the fix be? I see Crafty is doing some stuff with thread affinity in the case of a NUMA system. Would it make sense to restrict splits to threads that are assigned to the same processor as the master thread?

--Jon
Two things to look out for:

(1) If you have two variables in a 64-byte block of memory, where one is frequently changed by one thread and the other by a different thread, that will thrash the cache badly. For example, per-thread node counters are a potential killer. The simple solution is to make sure that any such values are separated by 64 bytes to force them into separate cache blocks.

(2) The first memory access to a specific page faults that page into physical RAM local to the core that does the access. For split blocks in Crafty, I initialize them when I start the program, but I make each thread (core) initialize its subset of those blocks so that each core has its most-referenced blocks in local memory. This can be an issue for the hash as well: you'd like the hash evenly distributed across all NUMA nodes to avoid a hot spot where one node holds the entire thing...
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: optimizing performance on dual Xeon systems (NUMA)

Post by jdart »

The first issue is not specific to NUMA systems, and in fact I have already optimized for this.

The second one may be significant. However, I am a little surprised that it is apparently so severe that adding cores not only fails to increase search speed, but actually decreases it.

--Jon
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: optimizing performance on dual Xeon systems (NUMA)

Post by bob »

jdart wrote:The first issue is not specific to NUMA systems, and in fact I have already optimized for this.

The second one may be significant. However, I am a little surprised that it is apparently so severe that adding cores not only fails to increase search speed, but actually decreases it.

--Jon
As the number of cores per chip rises, it becomes a REAL bottleneck issue, since there is just one memory path per chip, not one per core. Add a lot of cache-coherency traffic and you hit a bottleneck that won't go away.

Note that the first case is more severe on NUMA boxes, because cache-to-cache traffic is not nearly as fast as on a single chip with multiple caches talking to each other. Since caches share lines and forward lines back and forth, the traffic can ramp right on up there. I'd rather have my local cache provide 99.99% of the data I need, with only occasional inter-cache traffic...

AMD has some internal tools that will measure this stuff quite accurately, but they don't distribute them. They've run them on my code once or twice when I raised a question...
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: optimizing performance on dual Xeon systems (NUMA)

Post by jdart »

Intel CPUs also have internal performance counters, and VTune can read them. As far as I know, though, cross-processor traffic is not a measurement scenario that is configured out of the box.

I dug around a bit and found this very detailed presentation online about Xeon 5500/5600:

http://openlab-mu-internal.web.cern.ch/ ... _intro.pdf

See pp. 75-80 for a discussion of memory use on multi-processor systems.

--Jon
Martin Thoresen
Posts: 1833
Joined: Thu Jun 22, 2006 12:07 am

Re: optimizing performance on dual Xeon systems (NUMA)

Post by Martin Thoresen »

Hi Jon,

Just for general information, the processors used are Intel Xeon E5-2689.

Some cache information (per processor):
Level 1 instruction cache: 8 x 32 KB
Level 1 data cache: 8 x 32 KB
Level 2 cache: 8 x 256 KB
Level 3 cache: 20 MB

For each processor there are 4 sticks of 4 GB RAM, so 16 GB RAM per processor.

I don't know how this works: if I set a hash of, say, 8 GB, will that hash use memory connected to just one processor, or will it draw from both? And can this be controlled somewhere?
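On Linux, one way to control this from outside the engine is the numactl tool, which sets the memory-placement policy for a process before it starts. A sketch, assuming a hypothetical engine binary called ./engine (not a tested recipe for any particular engine):

```shell
# Interleave all of the engine's memory (including a large hash)
# across the NUMA nodes, page by page, instead of letting it all
# land on the node of the thread that allocates it:
numactl --interleave=all ./engine

# Or pin the process and its memory to node 0 only:
numactl --cpunodebind=0 --membind=0 ./engine
```

Interleaving trades a little average latency for avoiding the hot spot Bob describes, where one node holds the entire hash.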

If you would like me to run a few test builds for you I can do that of course.

Best,
Martin