NUMA 101
Posted: Thu Jan 07, 2016 5:48 pm
I have been asked several questions about NUMA over the past few months, and have seen a few misconceptions about it as well, so I thought it would be worthwhile to start a discussion.
First, assume a simple architecture, two CPU sockets, one core per chip. The minute you see a machine with more than one CPU socket, think "NUMA" because that is the standard and has been for years.
The key here is that each CPU will generally have 1/2 of total RAM attached to it, and it can access that RAM at typical memory speeds. The gotcha is that the other CPU has the other 1/2 of RAM, and it is remote to the first CPU and will generally double the access time (or worse, sometimes MUCH worse).
The basic idea is that you would like the data CPU 0 uses to live in CPU 0's local memory, and likewise for CPU 1. But that is not so straightforward, so let's take a few cases to see what happens and how we can take advantage of it.
Let's use that simple two-socket machine, and we are going to run two threads in a parallel search. Current operating systems use a virtual memory model, and when a thread begins execution, its physical memory usage is zero. Whenever a process first accesses a page of memory, that page has to be brought into physical memory (a page fault), and it is naturally placed in the local memory attached to the CPU handling the page fault, which is the same CPU running the process that caused it. This is sometimes called "faulting in," where a page is brought into memory on demand, part of a concept known as demand paging.
When you first start a chess engine, it uses only one CPU until the search starts. So let's specifically discuss thread 0, and let's assume that it happens to be running on CPU 0 when it begins execution. Each page of memory that thread accesses will be brought into local memory for CPU 0, as mentioned. That includes the pages containing instructions, the pages containing any data used, and, most importantly, any memory allocated via something like malloc() and then accessed.
Since it is good programming practice to initialize memory obtained via malloc(), you run into the first potential problem. Suppose you use, as I do, a thing called a "split block," which is local memory containing all the tree state data for a single thread doing a parallel search. I create 64 of these per thread, or in this case, a total of 128. If I carelessly initialize them now, at start-up, ALL of them are faulted into CPU 0's memory. And when CPU 1 / thread 1 begins a search, all of its frequently-used data will be on the wrong NUMA node and things will run at 1/2 the normal memory speed, not good. The solution is pretty simple: DO NOT initialize all of this local data on thread 0. It doesn't matter which thread/CPU malloc()s the memory; physical pages are not allocated until the data is first touched. Therefore, in Crafty, thread 0 allocates ALL of the split blocks, but it only initializes the split blocks it will use. When thread 1 is initially created (whether via fork(), clone(), pthread_create(), or whatever on Windows), the first thing it does is initialize its own memory, which faults those pages in on the correct node.
That is true for any thread-specific data. It adds a bit to the complexity, but it is easy enough. There is an alternative available for Linux, namely using processor affinity. The trick here is to "pin" thread 0 to CPU 0, then initialize thread 0's memory, then "pin" thread 0 to CPU 1, then initialize thread 1's memory, and so on. On Linux, pthread_setaffinity_np() is the procedure to call; I do not know anything about Windows, so you are on your own there. Also note that this does not work on OS X, which doesn't support it for reasons known only to Apple.
Another useful trick is to use processor affinity to pin thread 0 to CPU 0 as soon as it begins execution. If you are not aware of this, when you set the affinity, the thread migrates to that specific CPU automatically and instantly, which is what makes this work. If you pin thread 0 to CPU 0 at the start, then its data will fault in to the correct place. Then you can repin to CPU 1, and whatever you initialize there faults in there, where we want it. The thread's stack is handled differently AND automatically: on Linux, pthread_create() uses the clone() system call, which allocates the stack and then starts the thread. It is fairly probable that the first page of the stack will end up on the wrong node, and there is little you can do about that. I can give some things to try, but there is no guarantee. But once you pin, the remainder of the stack will show up in the right CPU's memory.
Before I go on, there is a potential issue with testing here. If you have a 20-core machine, and you want to play two games at once, say Crafty vs A and Crafty vs B, with all of them using 10 cores (no pondering), you can end up with the two copies of Crafty pinned to the same CPUs, and when they are running at the same time, performance goes haywire. I have two ways to solve this. First, since not all machines support NUMA, I have a -DNUMA Makefile option that turns this on; leave -DNUMA off, and it doesn't do any NUMA-related tricks at all. Second, I have an internal command "affinity=N". affinity=-1 disables affinity completely, even when it is compiled in; affinity=10 says pin thread 0 to CPU 10, thread 1 to CPU 11, etc. In the above example, the first copy would run with the default affinity=0, and I would run the second with affinity=10, both using 10 threads. If you forget about this, you get bogus results. Been there, done that, got the T-shirt.
OK, next step. What about large shared structures like the hash table, eval cache, whatever? For Linux, the idea should be pretty obvious. If you use two threads, then when you malloc() the memory, pin thread 0 to CPU 0 (you should already be pinned there, of course) and initialize 1/2 of the memory, then pin to CPU 1 and initialize the other half. Now the large memory object is equally distributed across all nodes, so there will be no "hot spot" that degrades performance.
For Windows, there is an easier way for this specific problem, that is, allocating a large memory object that you want uniformly distributed. The call is WinMallocInterleaved(), and it does it all for you; the only proviso is that you have to tell it how much memory, AND how many threads you will run. If you change the number of threads, you should use WinMallocFree() to release the memory, then re-malloc it with the correct number of threads.
You might not like allocating 1/2 of the hash to one CPU and 1/2 to the other, but doing it any other way is dangerous. I.e., you could malloc 16GB of RAM, then on thread 0 touch address 0, then address 8192, and so on, cleverly faulting every other page into CPU 0's memory, then repeat on CPU 1 starting at address 4096 and incrementing by 8192. But then you use large or huge pages. On Linux, 2MB pages are now automatic (transparent huge pages): you are initially given small (4K) pages, but khugepaged notices that there are lots of contiguous 4K pages and collects them into one large 2MB page. And a single page will ALWAYS be on one NUMA node/CPU, which breaks your attempt to interleave on 4K boundaries. You can also use 1GB huge pages, and that would make it even worse, since a 1GB hash table would sit on one node, period, regardless of how many threads/CPUs you use. I found it simpler to just remember that the "trick" is in play, and if I am using N cores, I always set hash sizes to N times 2MB on Linux. I probably should add code to handle 1GB pages, but since they are not automatic, I have not made using them a part of the standard Crafty that is released.
About the only other warning I can think of is not really NUMA-related, but is concerned with memory. Do not ever put two variables in adjacent memory addresses if they are being modified by different threads. This causes heavy cache-line bouncing (often called "false sharing") and significantly impairs performance, and it gets worse, non-linearly, as processors/threads are added. Hash tables are not a particular problem since they are so large, and two different threads are unlikely to access the same 64 bytes of hash memory at the same time. But node counters, etc., have to be separated by 64 bytes to place them in separate cache lines.
I think that covers everything. If you have questions, feel free...