What do you do with NUMA?

matthewlai · Post by **matthewlai** » Mon Sep 19, 2016 2:57 pm

For engines that are "NUMA-aware", what are you doing differently on NUMA systems, and how much speedup do you get compared to Linux NUMA auto-balancing?

I imagine doing interleaved allocation for the main transposition table would be a good idea to increase total bandwidth. Is there anything else?

syzygy · Post by **syzygy** » Mon Sep 19, 2016 9:17 pm

matthewlai wrote:For engines that are "NUMA-aware", what are you doing differently on NUMA systems, and how much speedup do you get compared to Linux NUMA auto-balancing?

I imagine doing interleaved allocation for the main transposition table would be a good idea to increase total bandwidth. Is there anything else?

If the engine is using "lazy smp", each search thread is basically fully independent from the other threads apart from the shared TT table (which must be shared for the lazy approach to make sense at all) and possibly some other tables (e.g. pawn hash table, history move table, etc.).

On a NUMA system, I imagine it should help to:
- bind threads to nodes so that they are not migrated between nodes;
- allocate thread-specific memory on the node where the thread runs;
- indeed interleave the allocation of the TT across all nodes;
- with the exception of the TT, avoid as far as possible sharing tables that are written to between threads running on different nodes, by making them either thread-specific or node-specific (i.e. shared by the threads running on a node and allocated on that node).
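
The first item in the list could look roughly like the sketch below on Linux. It is only an illustration, not what any particular engine does: the function names are made up, and the list of CPUs belonging to a node is assumed to be supplied by the caller (on Linux it can be read from /sys/devices/system/node/nodeN/cpulist). It uses the glibc sched_setaffinity/sched_getaffinity calls, which act on the calling thread when the pid argument is 0.

```cpp
#include <sched.h>   // sched_setaffinity, sched_getaffinity, CPU_* macros (Linux/glibc)
#include <vector>

// Restrict the calling thread to the given logical CPUs.
// `cpus` is a hypothetical, caller-supplied list of the CPUs of one NUMA node.
// Returns true on success.
bool bind_current_thread_to_cpus(const std::vector<int>& cpus) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int cpu : cpus) {
        if (cpu >= 0 && cpu < CPU_SETSIZE) CPU_SET(cpu, &set);
    }
    // pid 0 means "the calling thread" on Linux.
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}

// Return the logical CPUs the calling thread is currently allowed to run on.
// Returns an empty vector if the query fails.
std::vector<int> allowed_cpus() {
    cpu_set_t set;
    CPU_ZERO(&set);
    std::vector<int> result;
    if (sched_getaffinity(0, sizeof(set), &set) != 0) return result;
    for (int cpu = 0; cpu < CPU_SETSIZE; ++cpu) {
        if (CPU_ISSET(cpu, &set)) result.push_back(cpu);
    }
    return result;
}
```

Each search thread would call bind_current_thread_to_cpus() with its node's CPU list before allocating or touching its own tables, so that under the default first-touch policy that memory ends up on the same node.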

If the engine uses an smp implementation with synchronisation at split points etc, I imagine that one should try to assign threads to split points on the same node and use higher minimal split depth for split points shared by threads on more than one node.

I don't really know what Linux can and will do (and whether it has to be specially configured), but if a chess engine basically has the machine for itself there seems certainly no better way than to let the engine take care of NUMA itself. The OS has to make guesses on where to best place threads and allocate memory and won't always make the right choice. And migrating threads to other nodes, and perhaps also their memory (if that is being done at all), comes with a cost.

mcostalba · Post by **mcostalba** » Mon Sep 19, 2016 9:29 pm

The only documented real need of NUMA is to fully use more than 64 logical processors under Windows:

https://msdn.microsoft.com/en-us/librar ... s.85).aspx

All the other uses are at best hand-waving at the moment; AFAIK there are no serious tests that clearly show a sensible advantage of NUMA compared to per-thread tables in all the other cases.

mcostalba · Post by **mcostalba** » Mon Sep 19, 2016 10:05 pm

Here is an interesting article for Linux:

http://www.glennklockwood.com/hpc-howto ... inity.html

People here want to out-smart the thread scheduler but, as is written in the article:

If your application is running at full bore 100% of the time, Linux will probably keep it on its own dedicated CPU core.

In the case of chess engines this is true, even more so with lazy SMP, where threads never block on a mutex and so never sleep. The OS can be fooled into making wrong decisions by threads that are idle most of the time, but this is not our typical case.

syzygy · Post by **syzygy** » Mon Sep 19, 2016 10:05 pm

Using more than 64 logical processors on Windows has nothing to do with NUMA.

syzygy · Post by **syzygy** » Tue Sep 20, 2016 1:04 am

In this thread Peter Österlund reports some interesting results for Texel on a 2-node PC with 2 x 8 = 16 threads:

Auto-balancing  NUMA-aware  Mn/s mean  Mn/s std
no              no          16.44      1.55
yes             no          15.22      1.67
no              yes         18.16      0.37
yes             yes         17.88      0.39

So Linux's automatic NUMA balancing feature hurts rather than helps, and Texel's own NUMA awareness increases speed by over 10%. (He wrote 14%, but if I take 18.16 vs 16.44 it is 10.46%.)

Later in the thread Bob mentions an idea that occurred to me recently:

bob wrote:There is a lot to think about. For example, when you use magics, you can end up with ALL of the magic lookup tables on a single node? Good, bad or indifferent? I haven't addressed this yet, but one idea is that such data can easily be duplicated. IE generate the original, and then let each thread (on each NUMA node) copy it to a private copy. I have not done this because (...)

For the moment I don't think there is terribly much to gain by replicating the magic lookup tables on each node, but still it might be interesting to try, if only to prepare for machines with larger numbers of nodes.

How to implement this efficiently? Allocating separate tables per node will cost you a register to hold the pointer to the data.
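
A minimal sketch of the "separate tables per node" variant, to make the pointer cost concrete. Everything here is hypothetical: AttackTable stands in for whatever the magic lookup data looks like, and NodeReplicated is not taken from any engine. The copy for a node is made by a thread that is assumed to be already bound to that node, so that with the default first-touch policy the pages of a large, freshly allocated copy should end up there. Whether they really do depends on the allocator and on the size of the table.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Hypothetical stand-in for a read-only magic attack table.
struct AttackTable {
    std::vector<std::uint64_t> entries;
};

// Holds one private copy of a read-only table per NUMA node.
// `original` must outlive this object.
class NodeReplicated {
public:
    NodeReplicated(const AttackTable& original, std::size_t node_count)
        : original_(original), copies_(node_count) {}

    // Call once per node, from a thread bound to that node, before the
    // search starts. Different nodes write different slots, so no lock
    // is needed as long as each node is populated by one thread only.
    void populate(std::size_t node) {
        copies_.at(node) = std::make_unique<AttackTable>(original_);
    }

    // Pointer to the copy for `node` (null until populate(node) has run).
    // Each search thread keeps this pointer; that is the extra register.
    const AttackTable* local(std::size_t node) const {
        return copies_.at(node).get();
    }

private:
    const AttackTable& original_;
    std::vector<std::unique_ptr<AttackTable>> copies_;
};
```

Every lookup then goes through the per-thread pointer instead of a fixed address, which is exactly the cost mentioned above.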

Ideally, such data should be replicated to the same virtual address on each node. But is that even possible with threads? I'm not sure and I kind of doubt it. But I might be wrong.

Perhaps the best thing to do is to use one process per node and then one thread per core. Each process will have its own address space and each thread within the process will find the magic lookup tables in a fixed (rip-relative) place. However, the immediate disadvantage is that the TT will have to be in shared memory, which means (as far as I'm aware) that it cannot benefit from the transparent hugepage feature of Linux.

Another piece of shared memory is the binary image: is it possible to replicate that across all nodes? When using processes it could at least be done by having one physical copy of the program on disk per node. (With a single copy, I suspect the binary image will be loaded into RAM only once and then mmap()ed into each process address space.)

As Bob wrote, there is a lot to think about.

bob · Post by **bob** » Tue Sep 20, 2016 1:23 am

There is a list of things...

(1) interleaved allocation for large tables (hash, pawn hash, etc.). If you run on Linux, you want each thread to touch its part of these large tables before any other thread touches them, so that the pages are faulted in on the right NUMA node.

(2) continue that thinking for any data that should be pretty much node-local. Make sure that the threads touch a part of the table. And it can actually help to use thread affinity so that the thread won't run on one node, fault its data in, then get flipped to a new node. Linux has some cute stuff that helps here as it will actually migrate data to the right place. Unless it moves again...

(3) You could probably carry this right into things like magic tables and such to try to balance everything across all nodes.

(4) locks are evil, minimize their use as much as possible.

(5) make absolutely certain that two different threads don't frequently write to the same page of memory, as this will smoke cache. It is even worse when NUMA is in the picture (which is basically any multiple-CPU-chip machine being made now).
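
Points (1) and (2) can be sketched as follows. This is only an illustration under a few assumptions: the function name is made up, each worker is assumed to be pinned to its node before it writes (the pinning itself is left out), and the buffer is assumed to be large enough that malloc hands back fresh, untouched pages. Here every thread touches one contiguous slice, so each node ends up with a contiguous chunk of the table rather than page-by-page interleaving.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <cstring>
#include <thread>
#include <vector>

// Allocate `bytes` without touching them, then let `thread_count` threads
// each zero one contiguous slice. Under Linux's default first-touch policy
// a page is placed on the node of the thread that first writes to it.
// The caller releases the buffer with std::free().
std::uint8_t* allocate_first_touch(std::size_t bytes, unsigned thread_count) {
    auto* base = static_cast<std::uint8_t*>(std::malloc(bytes));
    if (base == nullptr || thread_count == 0) return base;

    // Slice size rounded up so that the slices cover the whole buffer.
    const std::size_t slice = (bytes + thread_count - 1) / thread_count;
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < thread_count; ++t) {
        const std::size_t begin = std::min(bytes, static_cast<std::size_t>(t) * slice);
        const std::size_t end = std::min(bytes, begin + slice);
        workers.emplace_back([base, begin, end] {
            // A real engine would bind this thread to its node here,
            // before the first write.
            std::memset(base + begin, 0, end - begin);
        });
    }
    for (auto& w : workers) w.join();
    return base;
}
```

The same pattern applies to node-local data: the thread that will use a table is the one that clears it first.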

bob · Post by **bob** » Tue Sep 20, 2016 1:26 am

syzygy wrote:
matthewlai wrote:For engines that are "NUMA-aware", what are you doing differently on NUMA systems, and how much speedup do you get compared to Linux NUMA auto-balancing?

I imagine doing interleaved allocation for the main transposition table would be a good idea to increase total bandwidth. Is there anything else?

If the engine is using "lazy smp", each search thread is basically fully independent from the other threads apart from the shared TT table (which must be shared for the lazy approach to make sense at all) and possibly some other tables (e.g. pawn hash table, history move table, etc.).

On a NUMA system, I imagine it should help to:
- bind threads to nodes so that they are not migrated between nodes;
- allocate thread-specific memory on the node where the thread runs;
- indeed interleave the allocation of the TT across all nodes;
- with the exception of the TT, avoid as far as possible sharing tables that are written to between threads running on different nodes, by making them either thread-specific or node-specific (i.e. shared by the threads running on a node and allocated on that node).

If the engine uses an smp implementation with synchronisation at split points etc, I imagine that one should try to assign threads to split points on the same node and use higher minimal split depth for split points shared by threads on more than one node.

I don't really know what Linux can and will do (and whether it has to be specially configured), but if a chess engine basically has the machine for itself there seems certainly no better way than to let the engine take care of NUMA itself. The OS has to make guesses on where to best place threads and allocate memory and won't always make the right choice. And migrating threads to other nodes, and perhaps also their memory (if that is being done at all), comes with a cost.

Linux is quite clever, in that if a thread uses data often, and no other threads use it as frequently, Linux will migrate it to the right node no matter what. But it takes time. There was a paper written on this; I don't remember who did it (perhaps Rik V. W.??). In any case, affinity is a good idea as you don't want threads to bounce and then drag their data along with them... Linux won't do this until it has all cores busy, which causes other things to go wrong.

bob · Post by **bob** » Tue Sep 20, 2016 1:27 am

mcostalba wrote:The only documented real need of NUMA is to fully use more than 64 logical processors under Windows:

https://msdn.microsoft.com/en-us/librar ... s.85).aspx

All the other uses are at best hand-waving at the moment; AFAIK there are no serious tests that clearly show a sensible advantage of NUMA compared to per-thread tables in all the other cases.

It is only "hand waving" when you don't understand the issues.

BTW NUMA has NOTHING to do with "number of cores". It has EVERYTHING to do with "number of nodes". And also nothing to do with "logical processors" which can only mean hyper-threading.

Adam Hair · Post by **Adam Hair** » Tue Sep 20, 2016 3:48 am

syzygy wrote:Using more than 64 logical processors on Windows has nothing to do with NUMA.

Marco's link states that Windows processor groups consist of sets of up to 64 logical processors, and a NUMA node is assigned to each processor group.
