What do you do with NUMA?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: What do you do with NUMA?

Post by jdart »

Another thing is thread affinity: pin threads to cores, and ideally if using fewer threads than cores, allocate to cores that are on the same NUMA node.

--Jon
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What do you do with NUMA?

Post by bob »

jdart wrote:Another thing is thread affinity: pin threads to cores, and ideally if using fewer threads than cores, allocate to cores that are on the same NUMA node.

--Jon
That was in point (2). :)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What do you do with NUMA?

Post by bob »

Adam Hair wrote:
syzygy wrote:Using more than 64 logical processors on Windows has nothing to do with NUMA.
Marco's link states that Windows processor groups consist of sets of up to 64 logical processors, and a NUMA node is assigned to each processor group.
And that makes absolutely no sense to me, as an operating system person. Leave it to Micro$oft to completely mangle terminology and concepts to fit their operating system shortcomings.

A NUMA node is dictated by hardware, not the operating system. It doesn't matter WHAT Windows thinks about this stuff; NUMA is a hardware issue dealing with local memory (per node) and trying to minimize access to remote memory on other nodes. No idea what the 64 logical processors per node means. 99% of today's machines have exactly one NUMA node, which means NUMA is irrelevant. MOST of the rest have just two NUMA nodes. Interesting things happen as node count goes up. The first NUMA box I used was an 8-core AMD Opteron, but it had 8 chips (8 NUMA nodes), so NUMA-awareness was quite important, as we were pretty slow when we started on that box.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What do you do with NUMA?

Post by bob »

syzygy wrote:In this thread Peter Österlund reports some interesting results for Texel on a 2-node PC with 2 x 8 = 16 threads:

Code: Select all

Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
So Linux' automatic NUMA balancing feature hurts rather than helps, and Texel's own NUMA awareness increases speed by over 10%. (He wrote 14%, but if I take 18.16 vs 16.44 it is 10.46%.)

Later in the thread Bob mentions an idea that occurred to me recently:
bob wrote:There is a lot to think about. For example, when you use magics, you can end up with ALL of the magic lookup tables on a single node? Good, bad or indifferent? I haven't addressed this yet, but one idea is that such data can easily be duplicated. IE generate the original, and then let each thread (on each NUMA node) copy it to a private copy. I have not done this because (...)
For the moment I don't think there is terribly much to gain by replicating the magic lookup tables on each node, but still it might be interesting to try, if only to prepare for machines with larger numbers of nodes.

How to implement this efficiently? Allocating separate tables per node will cost you a register to hold the pointer to the data.

Ideally, such data should be replicated to the same virtual address on each node. But is that even possible with threads? I'm not sure and I kind of doubt it. But I might be wrong.

Perhaps the best thing to do is to use one process per node and then one thread per core. Each process will have its own address space and each thread within the process will find the magic lookup tables in a fixed (rip-relative) place. However, the immediate disadvantage is that the TT will have to be in shared memory, which means (as far as I'm aware) that it cannot benefit from the transparent hugepage feature of Linux.

Another piece of shared memory is the binary image: is it possible to replicate that across all nodes? When using processes it could at least be done by having one physical copy of the program on disk per node. (With a single copy, I suspect the binary image will be loaded into RAM only once and then mmap()ed into each process address space.)

As Bob wrote, there is a lot to think about.
I am not sure how those tests were run. If you read Rik's paper, it definitely takes time to migrate things once a program is running: at least a minute, and even that is probably not enough.

The virtual address question you asked, you answered correctly yourself. Since threads share everything, they can't have duplicate (but private) chunks of the virtual address space. But an operating system could pull this off by copying shared read-only virtual pages to local memory so that cache misses won't be so expensive.

I need to re-read Rik's paper; I just skimmed it the last time to see what he was doing (mainly fiddling with the valid bit and permissions to see what the application was doing, and then copying as needed). Perhaps he is doing this, although as I think about it, it would be a royal PITA to try to keep up with memory maps that are supposed to be the same, yet where each thread would now have different physical page numbers for some of the data. Seems doable, but it might be a kludge that Linus would red-flag as too complex.

If my head won't explode, I might open that discussion with the kernel memory management guys...
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: What do you do with NUMA?

Post by mcostalba »

bob wrote:
It is only "hand waving" when you don't understand the issues.

BTW NUMA has NOTHING to do with "number of cores". It has EVERYTHING to do with "number of nodes". And also nothing to do with "logical processors" which can only mean hyper-threading.
In the (unlikely) case that you started to read posts before quoting them, we would all witness a huge jump in the quality of the answers.

To give you a short summary, aimed at the reading-challenged: a multi-threaded process under Windows cannot use more than 64 logical processors, no matter what. To overcome this limitation you have to play with what they call NUMA functions.


For people willing to go into more details, here is an interesting technical document from Microsoft:
https://msdn.microsoft.com/en-us/librar ... s.85).aspx
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: What do you do with NUMA?

Post by syzygy »

Adam Hair wrote:
syzygy wrote:Using more than 64 logical processors on Windows has nothing to do with NUMA.
Marco's link states that Windows processor groups consist of sets of up to 64 logical processors, and a NUMA node is assigned to each processor group.
It does not say that. It does say that the cores of one NUMA node will, if possible (what if you have more than 64 logical cores on one node?), correspond to one processor group. But a processor group can contain multiple NUMA nodes.

The processor groups are what you cannot avoid if you want to run threads on more than 64 logical cores / 32 physical cores (with HT on). If you do that, you can still ignore any and all NUMA issues.

What Marco is really saying is that NUMA is a non-issue. That ignores the numbers reported by multiple chess engine programmers. And I have to agree with what Bob said: if one understands the issues, it is very clear that there are various things one can do that are bound to improve performance.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: What do you do with NUMA?

Post by mcostalba »

syzygy wrote: What Marco is really saying is that NUMA is a non-issue.
I say that NUMA awareness (in the case of fewer than 64 logical processors on Windows) has yet to be shown measurably stronger than ignoring NUMA.
syzygy wrote:
That ignores the numbers reported by multiple chess engine programmers.
Can you please post links to these multiple test reports?

I know the one from Peter Österlund that you posted above, and IIRC another case, but that one was with 96 cores on Windows (so above the 64-logical-processor limit) and thus not valid.

In Peter's case we are talking about a 10% speed-up for his engine Texel. I don't think this can be blindly generalized to all engines; for instance, one of the latest patches in SF (that you know well) removed a scalability issue that, by itself, is able to gain that 10% with many cores (32 cores were used for testing).


That's what I know (I mean numbers and tests, not words); if you have something else, I would be happy to read it.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: What do you do with NUMA?

Post by mcostalba »

mcostalba wrote:In Peter's case we are talking about a 10% speed-up for his engine Texel. I don't think this can be blindly generalized to all engines; for instance, one of the latest patches in SF (that you know well) removed a scalability issue that, by itself, is able to gain that 10% with many cores (32 cores were used for testing).
Moreover, a recent NUMA SF vs non-NUMA SF test failed to show any measurable difference (it was +2 Elo after 1000 games, well below noise level) when tested on 32 cores; two machines were involved in the test:

http://tests.stockfishchess.org/tests/v ... 030fbe521f

If the difference had been important (more than 10-20 Elo), the test would have shown it.
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: What do you do with NUMA?

Post by zullil »

mcostalba wrote: That's what I know (I mean numbers and tests, not words), if you have something else I would be happy to read them.
Here are some (minimal) data comparing non-NUMA-aware Cfish to NUMA-aware Cfish. Testing was done on my two-node Linux system. The file benchpos8 is simply 8 copies of the standard Stockfish bench positions, so 8 x 37 = 296 positions in total. I assume that these runs were long enough to give meaningful nps numbers. I have not made any attempt to test the two versions of Cfish in actual games.

Code: Select all

./cfish bench 16384 20 30 benchpos8 depth

not NUMA-aware
===========================
Total time (ms) : 6496911
Nodes searched  : 209051997295
Nodes/second    : 32177137

NUMA-aware
===========================
Total time (ms) : 5456006
Nodes searched  : 184273309022
Nodes/second    : 33774396

nps increase from NUMA is about 5%.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: What do you do with NUMA?

Post by mcostalba »

Thanks Louis. Indeed, I don't know if the non-NUMA Cfish version already has the per-thread countermove table (it is the scalability patch I was referring to earlier).

So, to compare oranges to oranges, I have rebased the numa branch onto current master; here are the links to the corresponding sources:

Numa-aware
https://github.com/mcostalba/Stockfish/ ... e016d5.zip


Master (non-numa aware)
https://github.com/mcostalba/Stockfish/ ... 0891ff.zip


In case you are willing to test, please use them.

For interested people, the numa patch is this one:
https://github.com/mcostalba/Stockfish/commit/numa