Re: What do you do with NUMA?
Posted: Tue Sep 20, 2016 4:37 am
jdart wrote:
Another thing is thread affinity: pin threads to cores, and ideally, if using fewer threads than cores, allocate to cores that are on the same NUMA node.
--Jon

That was in point (2).
syzygy wrote:
Using more than 64 logical processors on Windows has nothing to do with NUMA.
Adam Hair wrote:
Marco's link states that Windows processor groups consist of sets of up to 64 logical processors, and a NUMA node is assigned to each processor group.

And that makes absolutely no sense to me, as an operating system person. Leave it to Micro$oft to completely mangle terminology and concepts to fit their operating system's shortcomings.
syzygy wrote:
In this thread Peter Österlund reports some interesting results for Texel on a 2-node PC with 2 x 8 = 16 threads:

Code: Select all
Auto  Awareness  Mn/s mean  Mn/s std
no    no         16.44      1.55
yes   no         15.22      1.67
no    yes        18.16      0.37
yes   yes        17.88      0.39

So Linux's automatic NUMA balancing feature hurts rather than helps, and Texel's own NUMA awareness increases speed by over 10%. (He wrote 14%, but if I take 18.16 vs 16.44 it is 10.46%.)

I am not sure how those tests were run. If you read Rik's paper, it definitely takes time to migrate things once things are running: at least a minute, and that is probably not enough.
Later in the thread Bob mentions an idea that occurred to me recently:
bob wrote:
There is a lot to think about. For example, when you use magics, you can end up with ALL of the magic lookup tables on a single node. Good, bad or indifferent? I haven't addressed this yet, but one idea is that such data can easily be duplicated. I.e. generate the original, and then let each thread (on each NUMA node) copy it to a private copy. I have not done this because (...)

For the moment I don't think there is terribly much to gain by replicating the magic lookup tables on each node, but still it might be interesting to try, if only to prepare for machines with larger numbers of nodes.
How to implement this efficiently? Allocating separate tables per node will cost you a register to hold the pointer to the data.
Ideally, such data should be replicated to the same virtual address on each node. But is that even possible with threads? I'm not sure and I kind of doubt it. But I might be wrong.
Perhaps the best thing to do is to use one process per node and then one thread per core. Each process will have its own address space and each thread within the process will find the magic lookup tables in a fixed (rip-relative) place. However, the immediate disadvantage is that the TT will have to be in shared memory, which means (as far as I'm aware) that it cannot benefit from the transparent hugepage feature of Linux.
Another piece of shared memory is the binary image: is it possible to replicate that across all nodes? When using processes it could at least be done by having one physical copy of the program on disk per node. (With a single copy, I suspect the binary image will be loaded into RAM only once and then mmap()ed into each process address space.)
As Bob wrote, there is a lot to think about.
bob wrote:
It is only "hand waving" when you don't understand the issues.

In the (unlikely) case that you start to read posts before quoting them, we would all witness a huge jump in the quality of the answers.
BTW, NUMA has NOTHING to do with the "number of cores". It has EVERYTHING to do with the number of nodes. And it also has nothing to do with "logical processors", which can only mean hyper-threading.
syzygy wrote:
Using more than 64 logical processors on Windows has nothing to do with NUMA.
Adam Hair wrote:
Marco's link states that Windows processor groups consist of sets of up to 64 logical processors, and a NUMA node is assigned to each processor group.

It does not say that. It does say that the cores of one NUMA node will, if possible (what if you have more than 64 logical cores on one node?), correspond to one processor group. But a processor group can contain multiple NUMA nodes.
syzygy wrote:
What Marco is really saying is that NUMA is a non-issue.

I say that NUMA awareness (in the case of fewer than 64 logical processors on Windows) has still to be proved sensibly stronger than ignoring NUMA.
syzygy wrote:
That ignores the numbers reported by multiple chess engine programmers.

Can you please post a link to these multiple test reports?
mcostalba wrote:
In the case of Peter, we are talking about a 10% speed-up for his engine Texel. I don't think this can be blindly generalized to all engines; for instance, one of the latest patches in SF (that you know well) removed a scalability issue that alone is able to gain that 10% in the case of many cores (32 cores were used for testing).

Moreover, a recent NUMA SF vs non-NUMA SF test failed to show any sensible difference (it was +2 Elo after 1000 games, well below the noise level) when tested on 32 cores; 2 machines were involved in the test.
mcostalba wrote:
That's what I know (I mean numbers and tests, not words); if you have something else, I would be happy to read them.

Here are some (minimal) data comparing non-NUMA-aware Cfish to NUMA-aware Cfish. Testing was done on my two-node Linux system. The file benchpos8 is simply 8 copies of the standard Stockfish bench positions, so 8 x 37 = 296 positions in total. I assume that these runs were long enough to give meaningful nps numbers. I have not made any attempt to test the two versions of Cfish in actual games.
Code: Select all
./cfish bench 16384 20 30 benchpos8 depth
not NUMA-aware
===========================
Total time (ms) : 6496911
Nodes searched : 209051997295
Nodes/second : 32177137
NUMA-aware
===========================
Total time (ms) : 5456006
Nodes searched : 184273309022
Nodes/second : 33774396
The nps increase from NUMA awareness is about 5%.