NUMA 101

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

NUMA 101

Post by bob »

I have been asked several questions about NUMA over the past few months, and have seen a few misconceptions about it as well, so I thought it would be worthwhile to start a discussion.

First, assume a simple architecture, two CPU sockets, one core per chip. The minute you see a machine with more than one CPU socket, think "NUMA" because that is the standard and has been for years.

The key here is that each CPU will generally have 1/2 of total RAM attached to it, and it can access that RAM at typical memory speeds. The gotcha is that the other CPU has the other 1/2 of RAM, and it is remote to the first CPU and will generally double the access time (or worse, sometimes MUCH worse).

The basic idea is that you would like the things being used by CPU 0 to live in that CPU's local memory, and ditto for CPU 1. But that is not so straightforward. So let's take a few cases to see what happens and how we can take advantage of it.

Let's use that simple two-socket machine and run two threads in a parallel search. Current operating systems use a virtual memory model, and when a thread begins execution, its physical memory usage is zero. Whenever a process on a CPU first accesses a page of memory, that page has to be brought into physical memory (a page fault), and it is naturally brought into the local memory attached to the CPU handling the page fault, which is the same CPU that is running the process that produced the fault. This is sometimes called "faulting in" a page, where a page is brought into memory on demand, part of a concept known as demand paging.

When you first start a chess engine, it uses only one CPU until the search starts. So let's specifically discuss thread 0, and let's assume that it happens to be running on CPU 0 when it begins execution. Each page of memory that thread accesses will be brought into local memory for CPU 0, as mentioned: the pages containing instructions, the pages containing any data used, and, most importantly, any memory allocated via something like malloc() and then accessed.

Since it is good programming practice to initialize memory obtained via malloc(), you run into the first potential problem. Suppose you use, as I do, a thing called a "split block", which is local memory containing all the tree state data for a single thread to do a parallel search. I create 64 of these per thread, or in this case, a total of 128. If I carelessly initialize them now, at start-up, ALL of them are faulted into CPU 0's memory. And when CPU 1 / thread 1 begins a search, all of its frequently-used data will be on the wrong NUMA node and things will run at 1/2 the normal memory speed, not good. The solution is pretty simple: DO NOT initialize all of this local data from thread 0. It doesn't matter which thread/CPU malloc()s the memory; physical pages are not allocated until the data is first touched. Therefore, in Crafty, thread 0 allocates ALL of the split blocks, but it only initializes the split blocks it will use. When thread 1 is created (whether via fork(), clone(), pthread_create(), or whatever Windows uses), the first thing it does is initialize its own memory, which faults those pages in on the correct node.
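
To make the first-touch idea concrete, here is a minimal sketch (not Crafty's actual code; the names SplitBlock, NTHREADS and BLOCKS_PER_THREAD are made up for illustration). Thread 0 allocates every split block, but each thread clears only its own, so the physical pages land on the node that thread runs on:

Code:

#include <pthread.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define NTHREADS           2
#define BLOCKS_PER_THREAD  64

typedef struct { char tree_state[4096]; } SplitBlock;   /* stand-in for real tree state */

SplitBlock *split_blocks[NTHREADS][BLOCKS_PER_THREAD];

/* Thread 0 allocates everything but does NOT touch the other threads' blocks.
   malloc() only reserves virtual addresses; physical pages are assigned on first touch. */
void allocate_all(void) {
  for (int t = 0; t < NTHREADS; t++)
    for (int b = 0; b < BLOCKS_PER_THREAD; b++)
      split_blocks[t][b] = malloc(sizeof(SplitBlock));
}

/* Each search thread runs this first: initializing its own blocks faults
   those pages into the memory of the node the thread is running on. */
void *thread_entry(void *arg) {
  int t = (int)(intptr_t)arg;
  for (int b = 0; b < BLOCKS_PER_THREAD; b++)
    memset(split_blocks[t][b], 0, sizeof(SplitBlock));
  /* ... search ... */
  return NULL;
}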

That is true for any thread-specific data. It adds a bit to the complexity, but it is easy enough. There is an alternative available for Linux, namely using processor affinity. The trick here is to "pin" thread 0 to CPU 0 and initialize thread 0's memory, then "pin" thread 0 to CPU 1 and initialize thread 1's memory, and so on. For Linux, pthread_setaffinity_np() is the name of the procedure; I do not know anything about Windows, so you are on your own there. Also note that this does not work on OS X, which doesn't support it, for reasons known only to Apple.

Another useful trick is to use processor affinity to pin thread 0 to CPU 0 as soon as it begins execution. If you are not aware of this, when you set the affinity, the thread migrates to that specific CPU automatically and instantly, which is what makes this work. If you pin thread 0 to CPU 0 at the start, its data will fault in to the correct place. Then you can re-pin to CPU 1, and whatever you initialize there faults in there, where we want it. The thread stack is handled differently, and automatically: on Linux, pthread_create() uses the clone() system call, which malloc()s the stack and then starts the thread. It is fairly probable that the first page of the stack will end up on the wrong node, and there is little you can do about that. I can give some things to try, but there is no guarantee. But once you pin, the remainder of the stack will show up in the right CPU's memory.
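
For reference, a minimal Linux-only sketch of the pinning described above, using pthread_setaffinity_np(); the assumption that thread i belongs on CPU i is just for illustration (real code has to map threads to cores/nodes however the box is wired):

Code:

#define _GNU_SOURCE            /* for CPU_ZERO/CPU_SET and pthread_setaffinity_np() */
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

/* Pin the calling thread to one CPU. After this the scheduler will not migrate
   it, so everything it faults in lands on that CPU's NUMA node. */
int pin_to_cpu(int cpu) {
  cpu_set_t set;
  CPU_ZERO(&set);
  CPU_SET(cpu, &set);
  return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

/* Each thread pins itself first, then initializes its thread-local data. */
void *thread_entry(void *arg) {
  int id = (int)(intptr_t)arg;
  pin_to_cpu(id);              /* assumes thread i should live on cpu i */
  /* initialize_thread_local_data(id);   first touch now happens on the right node */
  return NULL;
}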

Before I go on, there is a potential issue with testing here. If you have a 20-core machine and you want to play two games at once, say Crafty vs A and Crafty vs B, with each engine using 10 cores (no pondering), you can end up with the two copies of Crafty pinned to the same CPUs, and when they are running at the same time, performance goes haywire. I have two ways to solve this. First, since not all machines support NUMA, I have a -DNUMA Makefile option that turns this on; leave -DNUMA off and it doesn't do any NUMA-related tricks at all. Second, I have an internal command "affinity=N". affinity=-1 disables affinity completely, even when it is compiled in. affinity=10 says pin thread 0 to cpu 10, thread 1 to cpu 11, etc. In the above example, the first copy would run with the default affinity=0, and I would run the second with affinity=10, both using 10 threads. If you forget about this, you get bogus results. Been there, done that, got the T-shirt.

OK, next step. What about large shared things like hash tables, eval cache, whatever? For Linux, the idea should be pretty obvious. If you use two threads, then when you malloc() the memory, pin the initializing thread to CPU 0 (it should already be pinned there, of course) and initialize half of the memory, then pin it to CPU 1 and initialize the other half. Now the large memory object is equally distributed across both nodes, so there will be no "hot spot" that degrades performance.
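
A sketch of that interleaving, reusing the pin_to_cpu() helper from the affinity sketch above (again illustrative, not Crafty's code). The already-pinned thread walks the CPUs, re-pinning itself and clearing one equal slice per CPU before the search starts:

Code:

#include <stddef.h>
#include <string.h>

/* pin_to_cpu() is the helper from the affinity sketch above. */
int pin_to_cpu(int cpu);

void interleave_hash(char *hash, size_t bytes, int ncpus) {
  size_t slice = bytes / ncpus;
  for (int cpu = 0; cpu < ncpus; cpu++) {
    pin_to_cpu(cpu);                               /* move to this CPU...            */
    memset(hash + (size_t)cpu * slice, 0, slice);  /* ...and fault its slice in here */
  }
  pin_to_cpu(0);                                   /* go back home when done         */
}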

For Windows, there is an easier way for this specific problem, that is, allocating a large memory object that you want uniformly distributed. The call is WinMallocInterleaved() and it does it all for you, the only proviso being that you have to tell it how much memory you want AND how many threads you will run. If you change the number of threads, you should use WinMallocFree() to release the memory, then re-allocate it with the correct number of threads.

You might not like allocating half of the hash to one CPU and half to the other, but doing it any other way is dangerous. IE you could malloc 16 gigs of RAM, then on thread zero touch address 0, then address 8192, cleverly faulting every other 4K page into CPU 0's memory, then repeat on CPU 1 starting at address 4096 and incrementing by 8192. But then you use large or huge pages. On Linux, 2mb pages are now automatic (transparent huge pages). You are initially given small pages (4K), but khugepaged notices that there are lots of contiguous 4K pages and collects them into one large 2mb page. And a single page will ALWAYS be on one NUMA node/CPU, which breaks your attempt to interleave on 4K boundaries. You can also use 1gb huge pages, and that would make it even worse, since a 1gb hash table would sit on one node, period, regardless of how many threads/CPUs you use. I found it simpler to just remember that the "trick" is in play, and if I am using N cores, I always set the hash size to a multiple of N times 2mb on Linux. I probably should add code to handle 1gb pages, but since they are not automatic, I have not made using them a part of the standard Crafty that is released.
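
A tiny sketch of that sizing rule, under the assumption that it means a multiple of N times 2mb: then each core's contiguous slice is made of whole huge pages and khugepaged cannot merge pages across slices and undo the first-touch interleaving:

Code:

#include <stddef.h>

#define HUGE_PAGE (2ULL * 1024 * 1024)   /* 2mb transparent huge page */

/* Round the requested hash size down to a multiple of ncpus * 2mb. */
size_t hash_size_for(size_t requested, int ncpus) {
  size_t unit = (size_t)ncpus * HUGE_PAGE;
  return (requested / unit) * unit;
}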

About the only other warning I can think of is not really NUMA related, but is concerned with memory. Do not ever put two variables in adjacent memory addresses if they are being modified by different threads. This causes heavy cache-line bouncing and significantly impairs performance, and it gets worse as processors/threads are added, and in a non-linear way. Hash tables are not a particular problem since they are so large, and two different threads are not very likely to access the same 64 bytes of hash memory at the same time. But node counters, etc., have to be separated by 64 bytes to place them in separate cache lines.
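
A minimal sketch of keeping per-thread counters in separate cache lines (the aligned attribute is gcc/clang syntax; C11 _Alignas works too):

Code:

#include <stdint.h>

#define CACHE_LINE   64
#define MAX_THREADS  32

/* One counter per thread, padded and aligned to a full 64-byte cache line so
   updates from different threads never fight over the same line. */
typedef struct {
  uint64_t nodes;
  char     pad[CACHE_LINE - sizeof(uint64_t)];
} __attribute__((aligned(CACHE_LINE))) NodeCounter;

NodeCounter node_counters[MAX_THREADS];

/* The bad version: adjacent counters share cache lines and bounce between CPUs:
   uint64_t node_counters[MAX_THREADS];                                          */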

I think that covers everything, if you have questions, feel free...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: NUMA 101

Post by bob »

BTW, for the record, the operating system will NOT "correct" this stuff. If you get data on the wrong node, the O/S will never realize it, because it will have no clue about what is being accessed internally by the CPU. So you can't expect it to say, "Oops, this was faulted in over THERE, when we need it over HERE." It won't migrate. The only way this can happen is if you over-subscribe memory and cause paging to occur. Then a page can be replaced/removed from memory, and it will fault in wherever it is next accessed. Generally a chess program will not produce page faults that result in replacements, except due to operator error or system overloads.

This is one of those issues you have to handle preemptively, up front, or it will haunt you throughout the execution of your code.

Another note: if you do not use processor affinity, you only exacerbate this problem. IE with two physical sockets/processors, one core per processor, you start up and do the clever "touch me" trick to get memory onto the correct nodes. Then an interrupt, a system daemon, whatever, has to run, and now you have one of your threads running and one waiting for CPU access. When the running thread hits the end of its quantum, the waiting thread gets scheduled on that CPU, and now it is on the wrong node, remote from its local data. Then the other thread gets scheduled on whichever CPU becomes available, and now it too is on the wrong node, with all of its local data on the remote node. Not so good.

If you pin, one of your threads may still get preempted, but it will then return to the same CPU/node, which avoids that. Since Linux, at least, uses the APIC to balance interrupts across processors, you should see even usage and no flip-flopping between nodes, which is harmful on a NUMA platform. Using affinity completely eliminates this.

Another note: the NUMA trick really only needs to be done across nodes. I do it across threads because I find it very difficult to decode the hardware configuration. For example, are the odd cores on one node and the even cores on the other? Or the first N on one node and the second N on the other? For those who insist on using SMT (hyper-threading) it is even worse. Doing it CPU by CPU works perfectly, since it balances just fine assuming you use all available cores. If you have a 20-core machine and use only 15 cores, then 2/3 of the data will be faulted in on one node and 1/3 on the other, assuming two NUMA nodes with 10 cores each (my ES 2660 box as the example).

Simple is easier, and is no worse than trying to decipher the node configuration and fault across nodes.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: NUMA 101

Post by petero2 »

bob wrote:BTW, for the record, the operating system will NOT "correct" this stuff. If you get data on the wrong node, the O/S will never realize it, because it will have no clue about what is being accessed internally by the CPU.
Some linux distributions include support for automatic NUMA balancing, see for example:

https://access.redhat.com/documentation ... ncing.html

http://events.linuxfoundation.org/sites ... cing_0.pdf
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: NUMA 101

Post by bob »

petero2 wrote:
bob wrote:BTW, for the record, the operating system will NOT "correct" this stuff. If you get data on the wrong node, the O/S will never realize it, because it will have no clue about what is being accessed internally by the CPU.
Some linux distributions include support for automatic NUMA balancing, see for example:

https://access.redhat.com/documentation ... ncing.html

http://events.linuxfoundation.org/sites ... cing_0.pdf
Doesn't work well at all for a chess engine. By definition most everything is shared and is accessed by all threads at some point. Trying to automagically migrate does nothing more than generate thrashing.

This from my 20 core box:

iprogress% cat /proc/sys/kernel/numa_balancing
1

Already turned on. It is better than nothing. It is not nearly as good as doing it manually and intentionally. To me, it appears that that stuff is intended to work with virtualization, which we are not using on our high-performance stuff. We have a vmware cluster, to be sure, but it is used for our data center stuff such as firewall, DNS, DHCP, LDAP, etc...

If you read some of the discussions about this, they seem to reach the same conclusion that I reached a good while back, namely that doing this yourself is significantly more efficient than using the automatic scheme. Of course, if you do it wrong, then automatic is better. Rik's been working on this for a good while, and has done some excellent stuff. Fortunately, he took the approach that if one sets affinity, all of this is disabled for that process group.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: NUMA 101

Post by wgarvin »

If someone wants to try and handle NUMA properly on Windows, the MSDN page about NUMA Support is probably a good starting point.

[Sorry, MS uses dumb characters in their URLs and I couldn't figure out how to get bbcode to handle this link correctly!]
https://msdn.microsoft.com/en-us/librar ... s.85).aspx

It has links to functions to find out how many NUMA nodes there are and how much RAM is connected to each one, and also to get affinity masks (i.e., which processors are connected to each node). To retrieve the most detailed info, GetLogicalProcessorInformationEx can be used to enumerate physical CPU packages, processor cores and NUMA nodes.

Incidentally, GetLogicalProcessorInformationEx is also a good way to distinguish between hyperthreaded and non-hyperthreaded systems.
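
For example, a minimal sketch (error handling trimmed) that uses GetLogicalProcessorInformationEx to count NUMA nodes and physical cores and to detect SMT:

Code:

#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
  DWORD len = 0;
  char *buf;
  /* First call with a NULL buffer just reports the required buffer size. */
  GetLogicalProcessorInformationEx(RelationAll, NULL, &len);
  buf = malloc(len);
  if (!GetLogicalProcessorInformationEx(RelationAll,
        (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)buf, &len))
    return 1;
  int nodes = 0, cores = 0, smt = 0;
  for (char *p = buf; p < buf + len; ) {
    PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX info =
        (PSYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX)p;
    if (info->Relationship == RelationNumaNode)
      nodes++;
    else if (info->Relationship == RelationProcessorCore) {
      cores++;
      if (info->Processor.Flags & LTP_PC_SMT)
        smt = 1;                 /* this core exposes more than one logical CPU */
    }
    p += info->Size;             /* entries are variable-sized */
  }
  printf("NUMA nodes: %d, physical cores: %d, SMT: %s\n",
         nodes, cores, smt ? "yes" : "no");
  free(buf);
  return 0;
}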
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: NUMA 101

Post by petero2 »

bob wrote:
petero2 wrote:
bob wrote:BTW, for the record, the operating system will NOT "correct" this stuff. If you get data on the wrong node, the O/S will never realize it, because it will have no clue about what is being accessed internally by the CPU.
Some linux distributions include support for automatic NUMA balancing, see for example:

https://access.redhat.com/documentation ... ncing.html

http://events.linuxfoundation.org/sites ... cing_0.pdf
Doesn't work well at all for a chess engine. By definition most everything is shared and is accessed by all threads at some point. Trying to automagically migrate does nothing more than generate thrashing.
...
It is better than nothing. It is not nearly as good as doing it manually and intentionally.
Interesting, I had not tested this because my engine was NUMA aware before I knew that this feature existed. I ran some quick tests now, where I let my engine (texel) search to depth 20 in the starting position using 16 threads on an 8x2 core computer. Each test was repeated 10 times and four different configurations were tested, auto balancing on/off and texel NUMA awareness on/off. The result was:

Code:

Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
I used 1GB for the main hash table, which is shared by all threads. Texel also has a number of smaller hash tables that are thread-local. These are pawn hash, king safety hash and material hash. The history table is also thread local. I would have thought that automatic NUMA balancing would make access to the thread local hash tables faster, but even if it does, it seems this gain is eaten by the thrashing of the main hash table.

Based on this data, manual NUMA awareness is worth 14% extra speed for texel on this hardware, but automatic balancing is not worth anything. It might even hurt, but there is too little data to be sure.

It is also interesting to note that the NPS variation is much larger when engine NUMA awareness is disabled. I guess this happens because sometimes you are lucky and most threads stay on "good" NUMA nodes, but when you are unlucky performance suffers a lot. With engine NUMA awareness this luck factor goes away.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: NUMA 101

Post by bob »

petero2 wrote:
bob wrote:
petero2 wrote:
bob wrote:BTW, for the record, the operating system will NOT "correct" this stuff. If you get data on the wrong node, the O/S will never realize it, because it will have no clue about what is being accessed internally by the CPU.
Some linux distributions include support for automatic NUMA balancing, see for example:

https://access.redhat.com/documentation ... ncing.html

http://events.linuxfoundation.org/sites ... cing_0.pdf
Doesn't work well at all for a chess engine. By definition most everything is shared and is accessed by all threads at some point. Trying to automagically migrate does nothing more than generate thrashing.
...
It is better than nothing. It is not nearly as good as doing it manually and intentionally.
Interesting, I had not tested this because my engine was NUMA aware before I knew that this feature existed. I ran some quick tests now, where I let my engine (texel) search to depth 20 in the starting position using 16 threads on an 8x2 core computer. Each test was repeated 10 times and four different configurations were tested, auto balancing on/off and texel NUMA awareness on/off. The result was:

Code:

Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
I used 1GB for the main hash table, which is shared by all threads. Texel also has a number of smaller hash tables that are thread-local. These are pawn hash, king safety hash and material hash. The history table is also thread local. I would have thought that automatic NUMA balancing would make access to the thread local hash tables faster, but even if it does, it seems this gain is eaten by the thrashing of the main hash table.

Based on this data, manual NUMA awareness is worth 14% extra speed for texel on this hardware, but automatic balancing is not worth anything. It might even hurt, but there is too little data to be sure.

It is also interesting to note that the NPS variation is much larger when engine NUMA awareness is disabled. I guess this happens because sometimes you are lucky and most threads stay on "good" NUMA nodes, but when you are unlucky performance suffers a lot. With engine NUMA awareness this luck factor goes away.
My results were in line with yours. I think a parallel search is a horrible candidate for the approach Rik used, IE marking 256mb of memory as invalid, then observing the faults to see who uses it and drawing conclusions from that.

That's why I posted the original description in this thread. NUMA awareness is not critical, but it is also not something one would want to ignore if speed is important... And you are most likely correct with your conclusion. The thing does random sampling, which means it can get good and bad results, which leads to fluctuating results.

Are you using the 1GB pages or just the automatic 2mb pages???

1GB does make a difference but it is a pain in the ass to set up and it locks part of memory into 1gb pages only. The machine I am using has 128gb, so I played with 1gb pages for a bit, and it definitely speeds things up due to less TLB thrashing. But the mmap() stuff is a bit of a kludge. It would be nice to have another daemon that would do 1gb pages, but only when told to and only for a specific process.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: NUMA 101

Post by petero2 »

bob wrote:
petero2 wrote:
bob wrote:
petero2 wrote:
bob wrote:BTW, for the record, the operating system will NOT "correct" this stuff. If you get data on the wrong node, the O/S will never realize it, because it will have no clue about what is being accessed internally by the CPU.
Some linux distributions include support for automatic NUMA balancing, see for example:

https://access.redhat.com/documentation ... ncing.html

http://events.linuxfoundation.org/sites ... cing_0.pdf
Doesn't work well at all for a chess engine. By definition most everything is shared and is accessed by all threads at some point. Trying to automagically migrate does nothing more than generate thrashing.
...
It is better than nothing. It is not nearly as good as doing it manually and intentionally.
Interesting, I had not tested this because my engine was NUMA aware before I knew that this feature existed. I ran some quick tests now, where I let my engine (texel) search to depth 20 in the starting position using 16 threads on an 8x2 core computer. Each test was repeated 10 times and four different configurations were tested, auto balancing on/off and texel NUMA awareness on/off. The result was:

Code:

Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
I used 1GB for the main hash table, which is shared by all threads. Texel also has a number of smaller hash tables that are thread-local. These are pawn hash, king safety hash and material hash. The history table is also thread local. I would have thought that automatic NUMA balancing would make access to the thread local hash tables faster, but even if it does, it seems this gain is eaten by the thrashing of the main hash table.

Based on this data, manual NUMA awareness is worth 14% extra speed for texel on this hardware, but automatic balancing is not worth anything. It might even hurt, but there is too little data to be sure.

It is also interesting to note that the NPS variation is much larger when engine NUMA awareness is disabled. I guess this happens because sometimes you are lucky and most threads stay on "good" NUMA nodes, but when you are unlucky performance suffers a lot. With engine NUMA awareness this luck factor goes away.
My results were in line with yours. I think a parallel search is a horrible candidate for the approach Rik used, IE marking 256mb of memory as invalid, then observing the faults to see who uses it and drawing conclusions from that.

That's why I posted the original description in this thread. NUMA awareness is not critical, but it is also not something one would want to ignore if speed is important... And you are most likely correct with your conclusion. The thing does random sampling, which means it can get good and bad results, which leads to fluctuating results.

Are you using the 1GB pages or just the automatic 2mb pages???

1GB does make a difference but it is a pain in the ass to set up and it locks part of memory into 1gb pages only. The machine I am using has 128gb, so I played with 1gb pages for a bit, and it definitely speeds things up due to less TLB thrashing. But the mmap() stuff is a bit of a kludge. It would be nice to have another daemon that would do 1gb pages, but only when told to and only for a specific process.
I only use the automatic 2MB pages. I also think the 1GB pages are too kludgy to use.
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: NUMA 101

Post by jdart »

Very useful, thanks.

When I have looked at this before, a couple complications have come up:

1. It seems that Windows and POSIX-type OSs have completely different APIs for this stuff, with different capabilities.

2. Determining the exact CPU architecture you have is apparently quite tricky. So when you pin to a CPU, what exactly does that mean? I always disable hyperthreading on my machines, so I only want to consider physical cores, not virtual ones. But many people run with hyperthreading on. Also, recent AMD architectures have 2 NUMA nodes within a single physical CPU.

Any advice on handling this stuff across varying OSs and architectures?

--Jon
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: NUMA 101

Post by bob »

petero2 wrote:
bob wrote:
petero2 wrote:
bob wrote:
petero2 wrote:
bob wrote:BTW, for the record, the operating system will NOT "correct" this stuff. If you get data on the wrong node, the O/S will never realize it, because it will have no clue about what is being accessed internally by the CPU.
Some linux distributions include support for automatic NUMA balancing, see for example:

https://access.redhat.com/documentation ... ncing.html

http://events.linuxfoundation.org/sites ... cing_0.pdf
Doesn't work well at all for a chess engine. By definition most everything is shared and is accessed by all threads at some point. Trying to automagically migrate does nothing more than generate thrashing.
...
It is better than nothing. It is not nearly as good as doing it manually and intentionally.
Interesting, I had not tested this because my engine was NUMA aware before I knew that this feature existed. I ran some quick tests now, where I let my engine (texel) search to depth 20 in the starting position using 16 threads on an 8x2 core computer. Each test was repeated 10 times and four different configurations were tested, auto balancing on/off and texel NUMA awareness on/off. The result was:

Code:

Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
I used 1GB for the main hash table, which is shared by all threads. Texel also has a number of smaller hash tables that are thread-local. These are pawn hash, king safety hash and material hash. The history table is also thread local. I would have thought that automatic NUMA balancing would make access to the thread local hash tables faster, but even if it does, it seems this gain is eaten by the thrashing of the main hash table.

Based on this data, manual NUMA awareness is worth 14% extra speed for texel on this hardware, but automatic balancing is not worth anything. It might even hurt, but there is too little data to be sure.

It is also interesting to note that the NPS variation is much larger when engine NUMA awareness is disabled. I guess this happens because sometimes you are lucky and most threads stay on "good" NUMA nodes, but when you are unlucky performance suffers a lot. With engine NUMA awareness this luck factor goes away.
My results were in line with yours. I think a parallel search is a horrible candidate for the approach Rik used, IE marking 256mb of memory as invalid, then observing the faults to see who uses it and drawing conclusions from that.

That's why I posted the original description in this thread. NUMA awareness is not critical, but it is also not something one would want to ignore if speed is important... And you are most likely correct with your conclusion. The thing does random sampling, which means it can get good and bad results, which leads to fluctuating results.

Are you using the 1GB pages or just the automatic 2mb pages???

1GB does make a difference but it is a pain in the ass to set up and it locks part of memory into 1gb pages only. The machine I am using has 128gb, so I played with 1gb pages for a bit, and it definitely speeds things up due to less TLB thrashing. But the mmap() stuff is a bit of a kludge. It would be nice to have another daemon that would do 1gb pages, but only when told to and only for a specific process.
I only use the automatic 2MB pages. I also think the 1GB pages are too kludgy to use.
I don't like committing a bunch of memory to 1gb pages. And while I have used mmap() many times, it is still a bit of a kludge simply to access huge pages. But I did try it, and when you reach beyond 4gb of hash memory, you can hear the TLB crying. :) Even with 2mb pages, the TLB has huge problems: 4gb requires 2,048 of those 2mb pages, and no TLB handles more than a couple of dozen or so large-page translations. With 1gb pages, you can easily go to 16gb and have no TLB misses for the large pages, which is a significant win.
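
For anyone who wants to experiment, a minimal sketch of grabbing hash memory from 1gb huge pages with mmap(), falling back to a normal allocation if none are available; it assumes the pages were reserved beforehand (for example with hugepagesz=1G hugepages=16 on the kernel command line):

Code:

#include <stdio.h>
#include <stdlib.h>
#include <sys/mman.h>

#ifndef MAP_HUGETLB
#define MAP_HUGETLB    0x40000
#endif
#ifndef MAP_HUGE_SHIFT
#define MAP_HUGE_SHIFT 26
#endif
#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB   (30 << MAP_HUGE_SHIFT)
#endif

/* Try to back the hash table with 1gb huge pages; fall back to malloc()
   (which transparent huge pages may still promote to 2mb pages). */
void *alloc_hash(size_t bytes) {
  void *mem = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                   -1, 0);
  if (mem == MAP_FAILED) {
    fprintf(stderr, "no 1gb pages available, falling back to malloc()\n");
    mem = malloc(bytes);
  }
  return mem;
}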