What do you do with NUMA?

petero2 · Post by **petero2** » Tue Sep 20, 2016 9:29 pm

petero2 wrote:
syzygy wrote: In this thread Peter Österlund reports some interesting results for Texel on a 2-node PC with 2 x 8 = 16 threads:
Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
So Linux' automatic NUMA balancing feature hurts rather than helps, and Texel's own NUMA awareness increases speed by over 10%. (He wrote 14%, but if I take 18.16 vs 16.44 it is 10.46%.)
I took the average improvement for auto on and auto off, that is:

(18.16/16.44 + 17.88/15.22) / 2 = 1.1397

My assumption was that there is no real difference between auto on and auto off for a chess program, so taking the average should give a better estimate. I don't know if that assumption is correct though. More measurements would be required to find out.
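The arithmetic behind both figures (the 10.46% for auto off alone and the roughly 14% average) can be checked with a short sketch. The speeds are copied from the table above; the dictionary layout and the `speedup` helper are only illustrative names, not anything from Texel.

```python
# Mean speeds in Mn/s from the table, keyed by (auto balancing, NUMA awareness).
speeds = {
    ("no", "no"): 16.44,
    ("yes", "no"): 15.22,
    ("no", "yes"): 18.16,
    ("yes", "yes"): 17.88,
}


def speedup(auto):
    """Ratio of NUMA-aware to non-aware speed for one auto-balancing setting."""
    return speeds[(auto, "yes")] / speeds[(auto, "no")]


auto_off = speedup("no")   # the comparison syzygy used
auto_on = speedup("yes")
average = (auto_off + auto_on) / 2  # the figure petero2 reported

print(f"auto off: {auto_off:.4f}")
print(f"auto on:  {auto_on:.4f}")
print(f"average:  {average:.4f}")
```

Averaging the two ratios only makes sense under the stated assumption that the auto-balancing setting has no real effect on a chess program; with just four means and fairly large standard deviations in the non-aware rows, that remains untested here.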

I started to set up a longer test, but something strange is happening on my current Linux system. The NUMA version of Texel is something like 70% faster than the non-NUMA version. The only thing I have changed since the last measurement is the Fedora version: I am now running Fedora 24, whereas before I was running Fedora 19.

The NUMA version runs at the expected speed, but the non-NUMA version runs a lot slower than expected (based on the speeds I got before I upgraded to Fedora 24). Running "numatop" shows that RMA/LMA (number of remote memory accesses divided by the number of local memory accesses) varies a lot when the non-NUMA version is running.

It seems like something in the kernel scheduler is broken with respect to NUMA. However, another weird thing is that when I run Cfish and compare NUMA vs non-NUMA speeds, the difference is only around 10%.

Automatic NUMA balancing is disabled. Transparent huge pages is enabled. The kernel version is: Linux version 4.7.2-201.fc24.x86_64 (mockbuild@bkernel01.phx2.fedoraproject.org) (gcc version 6.1.1 20160621 (Red Hat 6.1.1-3) (GCC) ) #1 SMP Fri Aug 26 15:58:40 UTC 2016

matthewlai · Post by **matthewlai** » Tue Sep 20, 2016 10:09 pm

petero2 wrote: Automatic NUMA balancing is disabled.

With balancing disabled, if you allocate everything in one thread and touch (fault-in) all the pages, all the memory will be on that node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
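As a rough illustration of the first-touch behaviour described here, the sketch below maps one buffer in the main thread but lets each worker thread fault in its own slice, so that under Linux's default policy the pages would tend to land on whichever node each worker is running on. It is a minimal sketch, not how Texel or any engine does it: `PAGES_PER_WORKER`, `touch_slice` and `allocate_first_touch` are hypothetical names, and the CPU choice ignores node topology (a real program would pick CPUs per node).

```python
import mmap
import os
import threading

PAGE = mmap.PAGESIZE
PAGES_PER_WORKER = 64  # hypothetical slice size, chosen only for illustration


def touch_slice(buf, start, length, cpu):
    """Pin the calling thread to `cpu` (if given), then write one byte per page."""
    if cpu is not None:
        try:
            # On Linux, pid 0 refers to the calling thread, not the whole process.
            os.sched_setaffinity(0, {cpu})
        except OSError:
            pass  # pinning is best effort in this sketch
    for offset in range(start, start + length, PAGE):
        buf[offset] = 0  # the first write is what faults the page in


def allocate_first_touch(n_workers):
    """Map one anonymous buffer and let each worker fault in its own slice."""
    slice_len = PAGES_PER_WORKER * PAGE
    buf = mmap.mmap(-1, n_workers * slice_len)  # mapped; pages are faulted in lazily on Linux
    if hasattr(os, "sched_getaffinity"):
        cpus = sorted(os.sched_getaffinity(0))
    else:
        cpus = []  # no affinity API on this platform: skip pinning
    workers = []
    for i in range(n_workers):
        cpu = cpus[i % len(cpus)] if cpus else None
        t = threading.Thread(
            target=touch_slice, args=(buf, i * slice_len, slice_len, cpu)
        )
        t.start()
        workers.append(t)
    for t in workers:
        t.join()
    return buf


buf = allocate_first_touch(4)
print(len(buf) // PAGE, "pages touched by 4 workers")
```

If the loop in `touch_slice` were instead run once by the main thread over the whole buffer, every page would be faulted in by that one thread, which is the single-node situation the post describes.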

petero2 · Post by **petero2** » Tue Sep 20, 2016 10:21 pm

matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault-in) all the pages, all the memory will be on that node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.

Yes, but this would be equally true when running Fedora 19 as when running Fedora 24.

Using Fedora 19, the NUMA version of Texel was 10-15% faster than the non-NUMA version. Using the same Texel version and the same computer, but with Fedora 24 instead of Fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem. The NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs Fedora 24 and has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran Fedora 19 on the new computer.

Dann Corbit · Post by **Dann Corbit** » Tue Sep 20, 2016 10:36 pm

petero2 wrote:
matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault-in) all the pages, all the memory will be on that node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
Yes, but this would be equally true when running Fedora 19 as when running Fedora 24.

Using Fedora 19, the NUMA version of Texel was 10-15% faster than the non-NUMA version. Using the same Texel version and the same computer, but with Fedora 24 instead of Fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem. The NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs Fedora 24 and has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran Fedora 19 on the new computer.

Is there a speed difference between the Fedora 19 non-numa version and the Fedora 24 non-numa version?

IOW, I am asking did the non-numa version slow down or did the numa version speed up?

Or was it both?

syzygy · Post by **syzygy** » Tue Sep 20, 2016 10:39 pm

bob wrote: The virtual address question you asked, was answered by you correctly. Since threads share everything, they can't have duplicate (but private) chunks of the virtual address space. But an operating system could pull this off by copying shared read-only virtual pages to local memory so that cache misses won't be so expensive.

I think so too. It is certainly not standard thread functionality and the Linux clone() system call does not seem to support some sort of partial sharing of the address space. But it might be possible at the OS level. And perhaps this sort of thing is what people are working on already.

bob wrote: I need to re-read Rik's paper again, I just skimmed it the last time to see what he was doing (mainly fiddling with the valid bit and permissions to see what the application was doing, and then copying as needed). Perhaps he is doing this although as I think about it, it would be a royal PITA to try to keep up with the memory maps that are supposed to be the same, yet each thread would now have different physical page numbers for some of the data. Seems doable, but it might be a kludge that Linus would red-flag as too complex.

I guess you're talking about Rik van Riel. I could only find this presentation, but it already has a lot of information.

NUMA load balancing is of course interesting, but in the case of a NUMA system dedicated to running a chess engine the situation is rather simple and the programmer is in a much better position than the OS to decide on memory allocation and thread placement.

However, it would be nice if the OS could somehow detect that certain memory pages are shared by threads on many nodes which only do read (or execution) accesses and would transparently replicate those pages on all nodes. Or the OS could offer some interface to the application to give it corresponding hints.

This seems to come close:
http://htor.inf.ethz.ch/ross2012/slides ... lankes.pdf

petero2 · Post by **petero2** » Tue Sep 20, 2016 10:39 pm

Dann Corbit wrote:
petero2 wrote:
matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault-in) all the pages, all the memory will be on that node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
Yes, but this would be equally true when running Fedora 19 as when running Fedora 24.

Using Fedora 19, the NUMA version of Texel was 10-15% faster than the non-NUMA version. Using the same Texel version and the same computer, but with Fedora 24 instead of Fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem. The NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs Fedora 24 and has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran Fedora 19 on the new computer.
Is there a speed difference between the Fedora 19 non-numa version and the Fedora 24 non-numa version?

IOW, I am asking did the non-numa version slow down or did the numa version speed up?

Or was it both?

The non-NUMA version slowed down a lot when going to Fedora 24. The NUMA version runs at about the same speed.

Dann Corbit · Post by **Dann Corbit** » Tue Sep 20, 2016 10:43 pm

Does the new OS start any new CPU consuming services/daemons?

I wonder if the default thread priority has changed for non-NUMA apps.

bob · Post by **bob** » Tue Sep 20, 2016 11:20 pm

syzygy wrote:
bob wrote: The virtual address question you asked, was answered by you correctly. Since threads share everything, they can't have duplicate (but private) chunks of the virtual address space. But an operating system could pull this off by copying shared read-only virtual pages to local memory so that cache misses won't be so expensive.
I think so too. It is certainly not standard thread functionality and the Linux clone() system call does not seem to support some sort of partial sharing of the address space. But it might be possible at the OS level. And perhaps this sort of thing is what people are working on already.

bob wrote: I need to re-read Rik's paper again, I just skimmed it the last time to see what he was doing (mainly fiddling with the valid bit and permissions to see what the application was doing, and then copying as needed). Perhaps he is doing this although as I think about it, it would be a royal PITA to try to keep up with the memory maps that are supposed to be the same, yet each thread would now have different physical page numbers for some of the data. Seems doable, but it might be a kludge that Linus would red-flag as too complex.
I guess you're talking about Rik van Riel. I could only find this presentation, but it already has a lot of information.

NUMA load balancing is of course interesting, but in the case of a NUMA system dedicated to running a chess engine the situation is rather simple and the programmer is in a much better position than the OS to decide on memory allocation and thread placement.

However, it would be nice if the OS could somehow detect that certain memory pages are shared by threads on many nodes which only do read (or execution) accesses and would transparently replicate those pages on all nodes. Or the OS could offer some interface to the application to give it corresponding hints.

This seems to come close:
http://htor.inf.ethz.ch/ross2012/slides ... lankes.pdf

That actually looks pretty interesting and is the solution I was thinking of, namely a DIFFERENT page table map for each node, so that they don't have to be identical and frequently accessed data (due to cache misses) can be replicated. This would be useful for instruction pages, plus all the write-once-read-many pages of data such as the magic stuff, zobrist random numbers, etc...

As for the Rik question, yes, that was who I was referencing. The powerpoint stuff looks like an outline of the paper I ran across somewhere. Perhaps something I was asked to review for a journal or something, I don't remember. I think it was a year or two back however, maybe a bit longer. Time flies as you get older.

zullil · Post by **zullil** » Thu Sep 22, 2016 11:32 am

mcostalba wrote: Thanks Louis, indeed I don't know if the non-numa CFish version already has the per-thread countermove table (it is the scalability patch I was referring to earlier).

So to compare oranges vs oranges I have rebased the numa branch to current master and here are the links to the corresponding sources:

Numa-aware
https://github.com/mcostalba/Stockfish/ ... e016d5.zip

Master (non-numa aware)
https://github.com/mcostalba/Stockfish/ ... 0891ff.zip

In case you are willing to test, please use them.

For interested people, the numa patch is this one:
https://github.com/mcostalba/Stockfish/commit/numa

I tested the two versions of Stockfish offered above in the same manner that I tested Cfish. Here are the results:


./stockfish bench 16384 20 30 benchpos8 depth

Stockfish-0354e1... NUMA-aware
===========================
Total time (ms) : 5987409
Nodes searched  : 179943385246
Nodes/second    : 30053631

Stockfish-4b0043... not NUMA-aware
===========================
Total time (ms) : 4975656
Nodes searched  : 151105733822
Nodes/second    : 30369007

So the non-NUMA-aware master is about 1% faster in nps.
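Since the two runs searched different numbers of nodes and took different amounts of time, nodes per second is the comparable figure. A small sketch, using only the totals printed above (the `runs` dictionary and `nodes_per_second` helper are illustrative names), recomputes both rates and the relative difference:

```python
# Totals copied from the two bench runs above.
runs = {
    "NUMA-aware": {"time_ms": 5987409, "nodes": 179943385246},
    "not NUMA-aware": {"time_ms": 4975656, "nodes": 151105733822},
}


def nodes_per_second(run):
    """Whole nodes per second from total nodes and elapsed milliseconds."""
    return run["nodes"] * 1000 // run["time_ms"]


nps = {name: nodes_per_second(run) for name, run in runs.items()}
ratio = nps["not NUMA-aware"] / nps["NUMA-aware"]

for name, value in nps.items():
    print(f"{name}: {value} nodes/second")
print(f"non-NUMA-aware master is {(ratio - 1) * 100:.1f}% faster")
```

This only confirms the arithmetic of a single pair of runs; it says nothing about run-to-run variance, so a 1% gap may well be within noise.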

matthewlai · Post by **matthewlai** » Thu Sep 22, 2016 3:53 pm

petero2 wrote:
matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault-in) all the pages, all the memory will be on that node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
Yes, but this would be equally true when running Fedora 19 as when running Fedora 24.

Using Fedora 19, the NUMA version of Texel was 10-15% faster than the non-NUMA version. Using the same Texel version and the same computer, but with Fedora 24 instead of Fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem. The NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs Fedora 24 and has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran Fedora 19 on the new computer.

I can try running it on my machine if you want? Ubuntu 16.04 Server, 2x E5-2670. It sounds like an interesting thing to figure out.
