NUMA-awareness

zullil · Post by **zullil** » Wed Feb 25, 2015 3:38 pm

In general terms, how would an engine be coded to take advantage of the following NUMA set-up (assuming the engine will run 16 search threads)? What methods might be used to minimize threads running on one node needing to access memory on the other node? Thanks.

$ numactl --hardware
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7
node 0 size: 15992 MB
node 0 free: 14202 MB
node 1 cpus: 8 9 10 11 12 13 14 15
node 1 size: 16126 MB
node 1 free: 14711 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10

jdart · Post by **jdart** » Wed Feb 25, 2015 4:11 pm

Most modern multi-CPU systems are NUMA.

The general rule is to keep memory access local to the NUMA node whenever possible. There are tools that can measure this for you (Intel VTune, TAO, etc.).

That means among other things generally avoiding access to globals that are shared across all threads. In the case of the main hash table you can't avoid that. But for anything else, don't do it, not even one byte. If you need a small amount of global data like option settings and it does not change, consider caching it per-thread. If you need to update a global counter or something, do it less often if possible.
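To make the last two suggestions concrete, here is a minimal C++ sketch of a per-thread copy of read-only options plus a node counter that is only flushed to the shared global in batches. Every name in it (`Options`, `ThreadState`, `g_nodes`, `kFlushInterval`) is made up for illustration, and the 64-byte alignment and the batch size of 4096 are assumptions rather than tuned values.

```cpp
#include <atomic>
#include <cstdint>

// Hypothetical engine options; assumed read-only once the search starts.
struct Options {
    int multi_pv;
    int contempt;
};

// Shared counter that every search thread contributes to.
std::atomic<std::uint64_t> g_nodes{0};

// Assumed batch size: nodes a thread counts privately before it
// touches the shared counter.
constexpr std::uint64_t kFlushInterval = 4096;

// Per-thread state, aligned to an assumed 64-byte cache line so that
// two threads' states never share a line.
struct alignas(64) ThreadState {
    explicit ThreadState(const Options& global) : opts(global) {}
    Options opts;                     // private copy of the global options
    std::uint64_t pending_nodes = 0;  // nodes not yet added to g_nodes
};

// Add the privately counted nodes to the shared counter.
void flush_nodes(ThreadState& ts) {
    g_nodes.fetch_add(ts.pending_nodes, std::memory_order_relaxed);
    ts.pending_nodes = 0;
}

// Called once per searched node; writes to shared memory only once
// every kFlushInterval calls.
void count_node(ThreadState& ts) {
    if (++ts.pending_nodes >= kFlushInterval) flush_nodes(ts);
}

// Body of one search thread. The state lives on the thread's own stack.
void search_worker(const Options& global, std::uint64_t nodes_to_visit) {
    ThreadState ts(global);
    for (std::uint64_t i = 0; i < nodes_to_visit; ++i) {
        // ... search one node, reading ts.opts instead of the global ...
        count_node(ts);
    }
    flush_nodes(ts);  // hand over whatever is left at the end
}
```

The trade-off is that `g_nodes` can lag behind the true count by up to `kFlushInterval - 1` nodes per thread, which is usually fine for a nodes-per-second display.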

Apply the "first touch" rule: make sure memory that is used exclusively by a thread is initialized by that thread. (This means moving the allocation and initialization of that memory into the thread procedure.)
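A rough C++ sketch of first touch, assuming the operating system places a page on the node of the thread that first writes to it (the default local allocation policy on Linux). `ThreadTables`, its fields, and the table sizes are hypothetical stand-ins for whatever per-thread data an engine keeps.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <thread>
#include <vector>

// Hypothetical per-thread tables; used by exactly one search thread.
struct ThreadTables {
    std::vector<std::int32_t> history;
    std::vector<std::uint64_t> pawn_hash;
};

// Allocate and zero the tables. The writes done by assign() are the
// "first touch", so this must run on the thread that will use them.
std::unique_ptr<ThreadTables> init_thread_tables(std::size_t history_entries,
                                                 std::size_t pawn_entries) {
    auto tables = std::make_unique<ThreadTables>();
    tables->history.assign(history_entries, 0);
    tables->pawn_hash.assign(pawn_entries, 0);
    return tables;
}

// Thread procedure: the tables are created here, not in the main thread.
void search_thread(std::size_t history_entries, std::size_t pawn_entries) {
    auto tables = init_thread_tables(history_entries, pawn_entries);
    // ... run the search using *tables ...
    (void)tables;
}

// The main thread only launches the workers; it never touches their tables.
void start_search(unsigned thread_count) {
    std::vector<std::thread> threads;
    for (unsigned i = 0; i < thread_count; ++i) {
        threads.emplace_back(search_thread,
                             std::size_t{1} << 14,   // assumed history size
                             std::size_t{1} << 16);  // assumed pawn hash size
    }
    for (auto& t : threads) t.join();
}
```

This only helps if the thread stays on the node where it did the initialization, so it goes together with pinning.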

Specifically for NUMA systems: you can use NUMA APIs to pin threads so that they will not migrate to another node. (The Linux and Windows schedulers will both migrate threads, and when that happens the thread's memory is not migrated with it.) Most engines that pin threads also let you set a starting node index, so you can run one engine on nodes 1..2 and another on nodes 3..4 if you want.
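As a Linux-only sketch of pinning, the following C++ uses `pthread_setaffinity_np` with the CPU numbering from the `numactl --hardware` output in the question (8 consecutive CPUs per node, 2 nodes). The CPU layout is hard-coded here purely as an assumption; a real engine would query it at startup, for example through libnuma. The helper names and the `node_offset` parameter (standing in for the "index" setting) are hypothetical.

```cpp
#include <pthread.h>
#include <sched.h>
#include <vector>

// Assumed layout, copied from the numactl output in the question.
constexpr int kCpusPerNode = 8;
constexpr int kNodeCount = 2;

// CPUs that belong to a node under the assumed layout.
std::vector<int> cpus_for_node(int node) {
    std::vector<int> cpus;
    for (int i = 0; i < kCpusPerNode; ++i) {
        cpus.push_back(node * kCpusPerNode + i);
    }
    return cpus;
}

// Fill one node with search threads before moving on to the next.
// node_offset lets a second engine instance start on a different node.
int node_for_thread(int thread_index, int node_offset) {
    return (node_offset + thread_index / kCpusPerNode) % kNodeCount;
}

// Restrict the calling thread to the CPUs of one node. Call this at the
// top of the thread procedure, before any per-thread memory is touched.
// Returns false if the kernel rejects the mask (e.g. CPUs not present).
bool pin_current_thread_to_node(int node) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus_for_node(node)) {
        CPU_SET(cpu, &set);
    }
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set) == 0;
}
```

With 16 search threads and `node_offset` 0, threads 0 to 7 end up on node 0 and threads 8 to 15 on node 1. The thread is pinned to the whole node rather than to a single CPU, so the scheduler can still balance within the node.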

--Jon
