NUMA


NUMA

Post by lucasart »

I don't have access to a NUMA machine for testing. But I wonder if there is a cheap way to make my code perform better on NUMA.

Here's a NUMA strategy:

Allocate thread-local data (history table, pawn hash, zobrist keys for 3-fold repetition detection, etc.) in each thread, rather than as one global array of such data allocated in the main thread (before spawning worker threads).

But does that make any difference without setting CPU affinity before spawning worker threads?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Re: NUMA

Post by lucasart »

lucasart wrote: Tue Dec 31, 2019 12:42 am I don't have access to a NUMA machine for testing. But I wonder if there is a cheap way to make my code perform better on NUMA.

Here's a NUMA strategy:

Allocate thread-local data (history table, pawn hash, zobrist keys for 3-fold repetition detection, etc.) in each thread, rather than as one global array of such data allocated in the main thread (before spawning worker threads).

But does that make any difference without setting CPU affinity before spawning worker threads?
This document from AMD (though quite old) recommends "keeping data local by virtue of first touch"
https://doc.xdevs.com/doc/AMD/_Performa ... -06%5D.pdf
So it's about first touch rather than allocation, it seems. In other words, if I allocate in the main thread like so:

Code: Select all

typedef struct {
    // Thread-local data for each worker (history tables, pawn hash, zobrist stack for 2-/3-fold repetition, etc.)
} Worker;

Worker *Workers;

// main thread allocates, but must not touch the memory (so no calloc or realloc here: no first touch allowed!)
Workers = malloc(WorkersCount * sizeof(Worker));

// worker #i, when spawned, does the first touch, signalling to the OS that Workers[i] should be
// placed on the same NUMA node as the CPU running worker thread #i
memset(&Workers[i], 0, sizeof(Worker));
Am I getting this right? Is it that simple?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Re: NUMA

Post by Daniel Shawul »

Yes, NUMA SMP is that easy. If you keep everything that is local to a thread in a struct, you can take advantage of the first-touch policy quite easily.

For shared tables like the global transposition table, you are better off allocating them in thread 0's local memory. However, taking the NUMA concept one step further to distributed computing, it is better to distribute the TT equally among the workers to reduce network traffic congestion. I do that for my MPI Scorpio code.

Edit: I actually do a distributed TT on NUMA machines too, although the default is to use thread 0's memory -- it must not have brought me any benefit.
This is how I locate the TT entry for a position for global, distributed, and local transposition tables.

Code: Select all

/*
 Locating the TT entry: global, distributed, or per-processor hashtable
 on NUMA machines, selected by NUMA_TT_TYPE.
 */
#if NUMA_TT_TYPE == 0
/* Global TT: all probes go to processor 0's table. */
#define TT_KEY \
    static const PPROCESSOR proc = processors[0];\
    UBMP32 key = UBMP32(hash_key & PROCESSOR::hash_tab_mask);
#elif NUMA_TT_TYPE == 1
/* Distributed TT: the top 16 bits of the hash key pick the owning processor. */
#define TT_KEY \
    int proc_id = ((hash_key >> 48) * PROCESSOR::n_processors) >> 16;\
    PPROCESSOR proc = processors[proc_id];\
    UBMP32 key = UBMP32(hash_key & PROCESSOR::hash_tab_mask);
#else
/* Local TT: each processor probes its own table. */
#define TT_KEY \
    PPROCESSOR proc = processors[processor_id];\
    UBMP32 key = UBMP32(hash_key & PROCESSOR::hash_tab_mask);
#endif

Re: NUMA

Post by bob »

Couple of points here.

First, the concept is "fault in". When you malloc/shmget/whatever to allocate a block of memory, that does nothing regarding which node the data ends up on. The exception is the Windows "WinMallocInterleaved()" function, which handles this for you; I assume it does so by touching each block while swapping from one core to the next. So the idea for unix-based systems is to malloc() the TT, but not touch it. Then you spawn the new threads and immediately pin them to a specific core; CPU_SET (for systems that support it, like Linux) can be used to do the pinning. Once you do that, each thread should touch every PAGE of its chunk of the TT. The easiest way is to zero the thing, since this only gets done once.

The only question is how you decide what goes in each core's local memory. The obvious answer is something that is a multiple of the page size, which can be 4K or something MUCH larger if huge pages are being used. So you could place page N on core 0, page N+1 on core 1, and continue wrapping around until all pages have been touched by the correct core. Or you could let each core touch N consecutive pages, so that you have large chunks interleaved over all the cores, rather than many more small chunks. If you believe you potentially have hot spots (I have not tested, so do not know if this happens) then using 4K chunks makes the most sense.
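
Here is a rough Linux/pthreads sketch of that fault-in scheme (just an illustration, not Crafty's actual code; tt, ttBytes, numThreads and the simple core == worker-id mapping are placeholder assumptions):

Code: Select all

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical globals; a real engine gets these from its own TT setup. */
static unsigned char *tt;       /* malloc()'ed by the main thread, but NOT touched there */
static size_t ttBytes;
static int numThreads;

/* Worker startup: pin first, then fault in this worker's share of the TT pages. */
static void *WorkerInit(void *arg)
{
    int id = (int)(intptr_t)arg;

    /* Pin this thread to core 'id'. */
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(id, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);

    /* Touch every page of this worker's slice so those pages land on the local
       node: worker i takes pages i, i+N, i+2N, ... (page-interleaved layout). */
    size_t page = (size_t)sysconf(_SC_PAGESIZE);
    for (size_t off = (size_t)id * page; off < ttBytes; off += (size_t)numThreads * page) {
        size_t n = ttBytes - off < page ? ttBytes - off : page;
        memset(tt + off, 0, n);
    }
    return NULL;   /* a real worker would fall through into its search loop here */
}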

Second, so far as I know, ANY machine with a reasonable number of processors is NUMA, even on a single chip. I have not really looked at this since I retired, but I doubt it has changed. So most likely you have NUMA issues no matter what system you are running on. A fully shared memory is a pain to design/build, and putting it on a single chip is a tough (and expensive) proposition, limiting the number of cores due to the data paths required on the chip. For most applications, you can likely ignore this stuff. If you try to fault in a subset of pages on each core, but fail to pin the thread to that core, it does nothing to improve efficiency regarding the TT. There is definitely a gain for thread-local data (in the case of Crafty, "split blocks"). Those blocks that contain purely thread-local data really should be on the node with that core.

Finally, there is one confounding issue that is harder to handle. MOST NUMA systems are M x N, that is, M cores per local memory block. You really want to get all the threads that frequently share a large block of memory (split blocks and such) onto the "connected" cores. Otherwise you can have M*N threads on M*N different cores, scattered over the entire system in a really awkward configuration. IE cores I and I+1 assume they are using the same large block of memory, but they are on different nodes. This is pretty processor-specific and is messy to handle. Another issue is global memory containing globally used data, such as magic move indices, etc. Do you keep one copy, or do you create one for each node?
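
If you do not want to hard-code the topology, the core-to-node mapping can be discovered at run time. A minimal libnuma sketch (link with -lnuma; coreNode[] and BuildCoreMap() are made-up names, not from Crafty or Scorpio):

Code: Select all

#include <numa.h>

#define MAX_CORES 256
static int coreNode[MAX_CORES];   /* NUMA node that each core belongs to */

/* Returns the number of NUMA nodes, or -1 if libnuma reports no NUMA support. */
int BuildCoreMap(int numCores)
{
    if (numa_available() < 0)
        return -1;
    for (int c = 0; c < numCores && c < MAX_CORES; c++)
        coreNode[c] = numa_node_of_cpu(c);
    return numa_max_node() + 1;
}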

It is not a really easy issue to deal with, but it is also not that important: the only catch is that all this effort might not make a lot of difference. Most NUMA machines I have used had a BIOS configuration option to use an "SMP" memory model, which simply does interleaved memory automatically, with block 0 going on node 0, block 1 going on node 1, etc. That is better than doing nothing, since otherwise you can end up with the key data on one node, making it a hot spot. This is common when you have to initialize large arrays and do it in the original thread before creating new threads: ALL of that early-initialized data ends up on one node, which is not optimal.

I am sure there is more I forgot when writing the above.

BTW Daniel: Allocating the TT on node 0 is a bad idea. You can completely fill node 0's memory with the TT and it is going to spill over to other nodes anyway. But now, ANY malloc() calls will also have to spill over to other nodes, meaning thread-local stuff will be "over there" rather than "over here". Better to do the fault-in logic above to evenly distribute the TT over all nodes, leaving local memory on each node for any thread-local data you might want. IE some use small eval TTs and try to make those thread-local. Having thread-local memory helps. :)

Lucasart: You can set thread affinity before you spawn the threads, if your operating system supports this. Or you can set thread affinity as the VERY FIRST thing the new thread does, before it executes anything else; the O/S will instantly migrate that thread to the selected core. That being said, before is better, as it is hard to say what gets executed before control reaches your first line of code in whatever procedure you specify in your thread creation call. You would prefer to keep that unknown code from faulting in stuff on the current node and then setting affinity to a different node.
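
For the "set affinity before you spawn" option on Linux, a minimal glibc/pthreads sketch (WorkerEntry and the core == worker-id mapping are placeholders, not anyone's actual engine code):

Code: Select all

#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdint.h>

extern void *WorkerEntry(void *);   /* hypothetical worker thread function */

/* Pin worker i to core i *before* it starts executing (glibc extension), so
   not even the thread startup code runs on the wrong node. */
pthread_t SpawnPinnedWorker(int i)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(i, &set);

    pthread_attr_t attr;
    pthread_attr_init(&attr);
    pthread_attr_setaffinity_np(&attr, sizeof(set), &set);

    pthread_t tid;
    pthread_create(&tid, &attr, WorkerEntry, (void *)(intptr_t)i);
    pthread_attr_destroy(&attr);
    return tid;
}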

Re: NUMA

Post by dragontamer5788 »

Nice post bob.
bob wrote: Tue Dec 31, 2019 7:52 pm Second, so far as I know, ANY machine with a reasonable number of processors is NUMA, even on a single chip. I have not really looked at this since I retired, but I doubt it has changed.
I believe AMD "Zen2" processors are 4x NUMA on EPYC, 2x NUMA on Threadripper, and 1x NUMA on Ryzen, which means you get 16 cores on one NUMA node with the Ryzen 3950X.

The I/O die is split into quadrants on EPYC and into halves on Threadripper. You need to set some BIOS settings to get the OS to see the NUMA blocks, but Netflix gained a modest amount of performance (less than 10% memory bandwidth) by optimizing around this EPYC NUMA issue.

By default, EPYC and Threadripper ship in UMA mode, where all 4 quadrants (or 2 halves) stripe data in a RAID0-like distribution. But you can unlock a wee bit more performance by messing with the BIOS settings and coding NUMA-aware applications.
bob wrote: Tue Dec 31, 2019 7:52 pm Finally, there is one confounding issue that is harder to handle. MOST NUMA systems are M x N, that is, M cores per local memory block. You really want to get all the threads that frequently share a large block of memory (split blocks and such) onto the "connected" cores. Otherwise you can have M*N threads on M*N different cores, scattered over the entire system in a really awkward configuration.
Hmmm, this is more than just "NUMA" in practice. For anyone getting AMD chips this generation, you'll run into the 4-core CCX issue before NUMA. As stated earlier, Ryzen is 1x NUMA regardless of BIOS settings, but the Ryzen 3950X is 4x CCX (a 4x4-core configuration) where each CCX has a local 16MB L3 cache.

These 4-core/8-thread CCX blocks can communicate extremely efficiently thanks to the shared L3. Communicating outside of your CCX incurs a penalty, especially if another CCX has claimed ownership of a particular memory location (a MESI-like invalidation message must reach the remote CCX's L3 cache before the local CCX can start writing to that cache line).

So the concept I'd like to emphasize is that L3 cache blocks are independent of NUMA. Both NUMA and L3 caching issues can lead to performance bottlenecks, but they are very chip-specific micro-architectural issues.

Re: NUMA

Post by bob »

There were some things I omitted to keep it simpler. Another thing I do is to try to split work between threads that are on the same NUMA node. IE if thread X becomes idle, it will try to join and help other busy threads. The first place it looks is in its own group of cores on the same NUMA node. If none are available, it gives up and tries all nodes, as doing anything less efficiently is better than doing nothing, which is zero efficiency.

The primary driving force for this was the shared L3 cache (and the shared L2, back when there was no L3 on older architectures).
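
A toy sketch of that two-pass "same node first" search for work (threadNode[], CanJoin() and Join() are hypothetical names, not Crafty's actual routines):

Code: Select all

#include <stdbool.h>

#define MAX_THREADS 64

extern int  numThreads;
extern int  threadNode[MAX_THREADS];   /* NUMA node each thread is pinned to */
extern bool CanJoin(int idleThread, int busyThread);
extern void Join(int idleThread, int busyThread);

void FindWork(int idle)
{
    int myNode = threadNode[idle];

    /* Pass 1: only consider busy threads on the same NUMA node (shared L3). */
    for (int t = 0; t < numThreads; t++)
        if (t != idle && threadNode[t] == myNode && CanJoin(idle, t)) {
            Join(idle, t);
            return;
        }

    /* Pass 2: helping anywhere beats sitting idle (zero efficiency). */
    for (int t = 0; t < numThreads; t++)
        if (t != idle && CanJoin(idle, t)) {
            Join(idle, t);
            return;
        }
}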

Fun topic.

The best thing I found to help me was Intel's VTune software. It can monitor amazing things and point you to issues. You can then monitor those in more detail to get a real close-up view of where performance is degraded.

Re: NUMA

Post by dragontamer5788 »

bob wrote: Tue Dec 31, 2019 10:00 pmBest thing I found to help me was Intel's vtune software. It can monitor amazing things and point you to issues. You can then monitor those in more detail to get a real close-up view of where performance is degraded.
VTune is allegedly pretty good, but I've never used it myself. Benchmarks for time are king, but benchmarks for L1 cache misses do help the programmer understand their program much better.

I will say that looking at hardware performance monitors (especially L1, L2, and L3 cache hits/misses, branch predictions, and other micro-benchmark stats) can help you understand "why" an algorithm performs the way it does.
bob wrote: Tue Dec 31, 2019 10:00 pm There were some things I omitted to keep it simpler. But another thing I do is to try to split work between threads that are on the same NUMA node. IE if thread X becomes idle, it will try to join and help other busy threads. First place it looks is in its own group of cores on the same NUMA node. If none are available, I give up and try all nodes, as doing anything less efficiently is better than doing nothing which is zero efficiency.

The primary driving force for this was that shared L3 cache (and L2 when there was no L3 on older architectures...

Fun topic.
Although I'm trying to get a GPU algorithm working, I've considered how to apply my ideas to a CPU. Assume cooperative multithreading, or perhaps a "cofunction"-style scheduler, on the CPU. How many tasks can actually work in parallel with each other? Take, say, an AMD 3950X (4x CCX with 4 cores / 8 threads / 16MB L3 cache per CCX; 16 cores / 32 hardware threads total). L1 is a 32kB data cache, L2 is 512kB (inclusive of L1). See here for details.

The alpha-beta stack requires roughly 128 bytes per ply (36-byte position, 4-byte depth remaining, 2-byte alpha, 2-byte beta, and an 84-byte move list of ~42 moves), where the "position" is a quad-bitboard (4x 8 bytes plus misc state). This means an AB-stack of roughly 2kB can hold a ~20-ply search.
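
For concreteness, here is one way that ~128-byte per-ply record could be laid out (field names and exact packing are placeholder assumptions, not an existing engine's format):

Code: Select all

#include <stdint.h>

typedef struct {
    uint64_t occupancy[4];    /* quad-bitboard: 4 x 8 bytes                          */
    uint32_t misc;            /* side to move, castling, ep (36-byte position total) */
    int32_t  depthRemaining;  /* 4 bytes                                             */
    int16_t  alpha, beta;     /* 2 + 2 bytes                                         */
    uint16_t moves[42];       /* ~42 packed moves = 84 bytes                         */
} PlyRecord;

_Static_assert(sizeof(PlyRecord) <= 128, "one ply should fit in roughly 128 bytes");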

A 4c/8t 16MB CCX can therefore hold ~8192 work units (represented as C++ packaged tasks, or via some other kind of async / thread-pooling / task-based multithreading system). Well, we wouldn't want to spend the entire L3 cache just on storing tasks: L3 cache is also useful for magic bitboards, lookup tables, and other sources of information. Let's say half of it is used for storing work tasks.

In effect, the CPU has such ample resources available to it that a 4-core / 8-thread CCX could present itself as a ~4096-way parallel search cluster.

-------------

So the question is: is there an easy algorithm to select the "most work-efficient" AB-searches to work on? Given the pile of 4096 AB-searches, surely some of them are more efficient than others. A 4c/8t CCX only has to select the top 8 most efficient searches. A GPU has many more hardware threads per memory cluster and has to select the ~top 256 instead. But otherwise, the general concept remains the same.

In particular, is there a simple heuristic that can estimate the probability of a beta cutoff? And is running this heuristic cheap enough / fast enough to be worth the hassle?

Re: NUMA

Post by bob »

There are some things you can do. First, if you want to join at node X, how many moves have ALREADY been searched there? Only one? Then you can't tell whether this is a CUT or an ALL node. Look for one where more moves have been searched, suggesting that all of them will have to be searched.

You can also (in a DTS framework, anyway) look at the entire tree state for a busy thread. You can qualify each ply as ALL or CUT by the above strategy. If a node has just one move searched, you have to guess, but if the previous ply was labeled CUT (because the ply previous to that was labeled ALL), then you can reasonably assume that the current ply is ALL, as will be every second ply into the future. Obviously this will fail at some nodes, say when the previous ply SHOULD have been a CUT node but the first couple of moves searched did not cause a cutoff due to poor move ordering.
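
Something like this toy predicate captures that guessing rule (names and the "more than one move searched" threshold are assumptions, not Cray Blitz's or Crafty's actual code):

Code: Select all

typedef enum { NODE_UNKNOWN, NODE_ALL, NODE_CUT } NodeGuess;

typedef struct {
    int       movesSearched;   /* moves already completed at this ply */
    NodeGuess parentGuess;     /* ALL/CUT guess for the previous ply  */
} SplitCandidate;

/* Several moves searched without a cutoff suggests an ALL node; with only
   one move searched, fall back on ALL/CUT alternation from the previous ply. */
NodeGuess GuessNodeType(const SplitCandidate *c)
{
    if (c->movesSearched > 1)
        return NODE_ALL;
    if (c->parentGuess == NODE_CUT)
        return NODE_ALL;            /* CUT parent -> ALL child, usually */
    if (c->parentGuess == NODE_ALL)
        return NODE_CUT;
    return NODE_UNKNOWN;
}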

I did this in the Cray Blitz DTS search, and it worked surprisingly well 99% of the time. The one problem that is a pain is that when you get into a wild tactical position, where you need max depth to work through it, move ordering is worse because the "usual" orderings fail at some nodes due to the tactics. And if you choose to split at an ALL node, but the ply below that node turns into an ALL node as well due to move ordering issues farther back up the tree, then what looked like an ALL node turns into a CUT node, wasting computational effort.

So there are things to try, but you have to accept a certain level of inaccuracy. Crafty seems to fail high on the first move searched about 92% of the time, if it is going to fail high at that ply at all (a CUT node). That leaves 8% where it fails high on a later move, which will naturally tend to be the second move (higher probability than a later move, since there are several levels of move ordering). So you are playing with the potential of doing a parallel split where, 8% of the time, the fail high happens on the second or later move, which means the ALL node that follows this one becomes a CUT node until the right move is searched.

This applies whether you try to get cute and predict ALL/CUT nodes as I did in CB, or whether you do nothing. But it IS more efficient to try, no matter how many times you are wrong. You will at least get something out of trying, and to me, something is always better than nothing. I do a form of this in Crafty, which seems to work well combined with everything else...