What do you do with NUMA?

petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

petero2 wrote:
syzygy wrote:In this thread Peter Österlund reports some interesting results for Texel on a 2-node PC with 2 x 8 = 16 threads:

Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
So Linux's automatic NUMA balancing feature hurts rather than helps, and Texel's own NUMA awareness increases speed by over 10%. (He wrote 14%, but if I take 18.16 vs 16.44 it is 10.46%.)
I took the average improvement for auto on and auto off, that is:

(18.16/16.44 + 17.88/15.22) / 2 = 1.1397

My assumption was that there is no real difference between auto on and auto off for a chess program, so taking the average should give a better estimate. I don't know if that assumption is correct though. More measurements would be required to find out.
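The two percentages being compared can be reproduced from the table above. This is only a small sketch of the arithmetic; the dictionary layout and the function name are made up for illustration.

```python
# Mean throughput in Mn/s from the table above,
# keyed by (automatic NUMA balancing, Texel NUMA awareness).
mnps = {
    ("no", "no"): 16.44,
    ("yes", "no"): 15.22,
    ("no", "yes"): 18.16,
    ("yes", "yes"): 17.88,
}


def awareness_speedup(auto: str) -> float:
    """Speedup from Texel's NUMA awareness for one setting of auto balancing."""
    return mnps[(auto, "yes")] / mnps[(auto, "no")]


speedup_auto_off = awareness_speedup("no")   # the 10.46% figure
speedup_auto_on = awareness_speedup("yes")
mean_speedup = (speedup_auto_off + speedup_auto_on) / 2  # the ~14% figure

print(f"auto off: {speedup_auto_off:.4f}")
print(f"auto on:  {speedup_auto_on:.4f}")
print(f"mean:     {mean_speedup:.4f}")
```

Averaging the two ratios only makes sense under the stated assumption that automatic balancing itself makes no real difference; otherwise the two rows measure different things.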
I started to set up a longer test, but something strange is happening on my current Linux system. The NUMA version of Texel is something like 70% faster than the non-NUMA version. The only thing I have changed since the last measurement is the Fedora version. I am now running Fedora 24; before, I was running Fedora 19.

The NUMA version runs at the expected speed, but the non-NUMA version runs a lot slower than expected (based on the speeds I got before I upgraded to Fedora 24). Running "numatop" shows that RMA/LMA (number of remote memory accesses divided by the number of local memory accesses) varies a lot when the non-NUMA version is running.

It seems like something in the kernel scheduler is broken with respect to NUMA. However, another weird thing is that when I run Cfish and compare NUMA vs non-NUMA speeds, the difference is only around 10%.

Automatic NUMA balancing is disabled. Transparent huge pages are enabled. The kernel version is: Linux version 4.7.2-201.fc24.x86_64 (mockbuild@bkernel01.phx2.fedoraproject.org) (gcc version 6.1.1 20160621 (Red Hat 6.1.1-3) (GCC) ) #1 SMP Fri Aug 26 15:58:40 UTC 2016
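For anyone wanting to record the same two settings on their own machine, here is a minimal sketch. It assumes the usual Linux locations for these knobs (`/proc/sys/kernel/numa_balancing` and `/sys/kernel/mm/transparent_hugepage/enabled`); the helper names are hypothetical, and missing files are reported as `None` rather than treated as errors.

```python
from pathlib import Path
from typing import Optional

# Assumed standard Linux locations; they may be absent on some kernels.
NUMA_BALANCING = Path("/proc/sys/kernel/numa_balancing")
THP_ENABLED = Path("/sys/kernel/mm/transparent_hugepage/enabled")


def read_text(path: Path) -> Optional[str]:
    """Return the stripped file contents, or None if the file cannot be read."""
    try:
        return path.read_text().strip()
    except OSError:
        return None


def selected_option(raw: str) -> str:
    """Return the bracketed choice from a string such as 'always [madvise] never'."""
    for token in raw.split():
        if token.startswith("[") and token.endswith("]"):
            return token[1:-1]
    return raw


def report() -> dict:
    """Collect the raw balancing value and the active THP mode."""
    thp = read_text(THP_ENABLED)
    return {
        "numa_balancing": read_text(NUMA_BALANCING),  # "0" means disabled
        "transparent_hugepage": selected_option(thp) if thp is not None else None,
    }


if __name__ == "__main__":
    print(report())
```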
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: What do you do with NUMA?

Post by matthewlai »

petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault in) all the pages, all the memory will be on that thread's node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
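The first-touch behaviour described here can be sketched roughly as follows: pin the allocating thread to one node's CPUs, then write to every page so the kernel places the pages while the thread is running there. This is an illustration under assumptions, not how any particular engine does it. It assumes Linux, the default local allocation policy, and the sysfs file `/sys/devices/system/node/nodeN/cpulist`; all function names are hypothetical.

```python
import mmap
import os


def parse_cpulist(text: str) -> set:
    """Parse a sysfs CPU list such as '0-7,16-23' into a set of CPU numbers."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def node_cpus(node: int) -> set:
    """CPUs belonging to a NUMA node, read from sysfs (assumed path)."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        return parse_cpulist(f.read())


def alloc_and_touch(size: int) -> mmap.mmap:
    """Create an anonymous mapping and write one byte per page to fault it in."""
    buf = mmap.mmap(-1, size)  # no physical pages are assigned yet
    for offset in range(0, size, mmap.PAGESIZE):
        buf[offset] = 0  # the first write makes the kernel allocate the page
    return buf


def alloc_on_node(node: int, size: int) -> mmap.mmap:
    """Pin the caller to one node's CPUs, then fault the pages in from there."""
    os.sched_setaffinity(0, node_cpus(node))  # 0 = the caller
    return alloc_and_touch(size)
```

Nothing in the sketch verifies where the pages actually ended up; a tool such as numatop would still be needed to confirm placement.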
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault in) all the pages, all the memory will be on that thread's node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
Yes, but this would be equally true when running Fedora 19 as when running Fedora 24.

Using Fedora 19, the NUMA version of Texel was 10-15% faster than the non-NUMA version. Using the same Texel version and the same computer, but with Fedora 24 instead of Fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem. The NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs Fedora 24 and has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran Fedora 19 on the new computer.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: What do you do with NUMA?

Post by Dann Corbit »

petero2 wrote:
matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault in) all the pages, all the memory will be on that thread's node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
Yes, but this would be equally true when running Fedora 19 as when running Fedora 24.

Using Fedora 19, the NUMA version of Texel was 10-15% faster than the non-NUMA version. Using the same Texel version and the same computer, but with Fedora 24 instead of Fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem. The NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs Fedora 24 and has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran Fedora 19 on the new computer.
Is there a speed difference between the Fedora 19 non-numa version and the Fedora 24 non-numa version?

IOW, I am asking did the non-numa version slow down or did the numa version speed up?

Or was it both?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: What do you do with NUMA?

Post by syzygy »

bob wrote:The virtual address question you asked, was answered by you correctly. Since threads share everything, they can't have duplicate (but private) chunks of the virtual address space. But an operating system could pull this off by copying shared read-only virtual pages to local memory so that cache misses won't be so expensive.
I think so too. It is certainly not standard thread functionality and the Linux clone() system call does not seem to support some sort of partial sharing of the address space. But it might be possible at the OS level. And perhaps this sort of thing is what people are working on already.
I need to re-read Rik's paper again, I just skimmed it the last time to see what he was doing (mainly fiddling with the valid bit and permissions to see what the application was doing, and then copying as needed). Perhaps he is doing this although as I think about it, it would be a royal PITA to try to keep up with the memory maps that are supposed to be the same, yet each thread would now have different physical page numbers for some of the data. Seems doable, but it might be a kludge that Linus would red-flag as too complex.
I guess you're talking about Rik van Riel. I could only find this presentation, but it already has a lot of information.

NUMA load balancing is of course interesting, but in the case of a NUMA system dedicated to running a chess engine the situation is rather simple and the programmer is in a much better position than the OS to decide on memory allocation and thread placement.

However, it would be nice if the OS could somehow detect that certain memory pages are shared by threads on many nodes which only do read (or execution) accesses and would transparently replicate those pages on all nodes. Or the OS could offer some interface to the application to give it corresponding hints.

This seems to come close:
http://htor.inf.ethz.ch/ross2012/slides ... lankes.pdf
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

Dann Corbit wrote:
petero2 wrote:
matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault in) all the pages, all the memory will be on that thread's node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
Yes, but this would be equally true when running Fedora 19 as when running Fedora 24.

Using Fedora 19, the NUMA version of Texel was 10-15% faster than the non-NUMA version. Using the same Texel version and the same computer, but with Fedora 24 instead of Fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem. The NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs Fedora 24 and has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran Fedora 19 on the new computer.
Is there a speed difference between the Fedora 19 non-numa version and the Fedora 24 non-numa version?

IOW, I am asking did the non-numa version slow down or did the numa version speed up?

Or was it both?
The non-NUMA version slowed down a lot when going to fedora 24. The NUMA version runs at about the same speed.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: What do you do with NUMA?

Post by Dann Corbit »

Does the new OS start any new CPU-consuming services/daemons?

I wonder if the default thread priority has changed for non-NUMA apps.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What do you do with NUMA?

Post by bob »

syzygy wrote:
bob wrote:The virtual address question you asked, was answered by you correctly. Since threads share everything, they can't have duplicate (but private) chunks of the virtual address space. But an operating system could pull this off by copying shared read-only virtual pages to local memory so that cache misses won't be so expensive.
I think so too. It is certainly not standard thread functionality and the Linux clone() system call does not seem to support some sort of partial sharing of the address space. But it might be possible at the OS level. And perhaps this sort of thing is what people are working on already.
I need to re-read Rik's paper again, I just skimmed it the last time to see what he was doing (mainly fiddling with the valid bit and permissions to see what the application was doing, and then copying as needed). Perhaps he is doing this although as I think about it, it would be a royal PITA to try to keep up with the memory maps that are supposed to be the same, yet each thread would now have different physical page numbers for some of the data. Seems doable, but it might be a kludge that Linus would red-flag as too complex.
I guess you're talking about Rik van Riel. I could only find this presentation, but it already has a lot of information.

NUMA load balancing is of course interesting, but in the case of a NUMA system dedicated to running a chess engine the situation is rather simple and the programmer is in a much better position than the OS to decide on memory allocation and thread placement.

However, it would be nice if the OS could somehow detect that certain memory pages are shared by threads on many nodes which only do read (or execution) accesses and would transparently replicate those pages on all nodes. Or the OS could offer some interface to the application to give it corresponding hints.

This seems to come close:
http://htor.inf.ethz.ch/ross2012/slides ... lankes.pdf
That actually looks pretty interesting and is the solution I was thinking of, namely a DIFFERENT page table map for each node, so that they don't have to be identical and frequently accessed data (due to cache misses) can be replicated. This would be useful for instruction pages, plus all the write-once-read-many pages of data such as the magic stuff, zobrist random numbers, etc...
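Until the OS does this transparently, an application can approximate it by keeping one copy of each write-once-read-many table per node. The following is a rough sketch of that idea only: the seed, table shape and function names are invented for illustration, and in a real engine each copy would be built by a thread already pinned to its node so that first-touch places it locally (which this sketch does not do or verify).

```python
import random


def make_zobrist_table(seed: int = 2016, pieces: int = 12, squares: int = 64) -> list:
    """Build a table of 64-bit random keys; a fixed seed makes every copy identical."""
    rng = random.Random(seed)
    return [[rng.getrandbits(64) for _ in range(squares)] for _ in range(pieces)]


def build_replicas(num_nodes: int) -> dict:
    """One independent copy of the read-only table per NUMA node."""
    return {node: make_zobrist_table() for node in range(num_nodes)}


replicas = build_replicas(2)


def table_for(node: int) -> list:
    """A search thread running on `node` reads only its own node's copy."""
    return replicas[node]
```

The cost is memory proportional to the number of nodes, which is acceptable for small tables such as Zobrist keys but not for something the size of a hash table.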

As for the Rik question, yes, that was who I was referencing. The powerpoint stuff looks like an outline of the paper I ran across somewhere. Perhaps something I was asked to review for a journal or something, I don't remember. I think it was a year or two back however, maybe a bit longer. Time flies as you get older. :)
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: What do you do with NUMA?

Post by zullil »

mcostalba wrote:Thanks Louis, indeed I don't know if the non-NUMA Cfish version already has the per-thread countermove table (it is the scalability patch I was referring to earlier).

So to compare oranges vs oranges I have rebased numa branch to current master and here are the links to the corresponding sources:

Numa-aware
https://github.com/mcostalba/Stockfish/ ... e016d5.zip


Master (non-numa aware)
https://github.com/mcostalba/Stockfish/ ... 0891ff.zip


In case you are willing to test, please use them.

For interested people, the numa patch is this one:
https://github.com/mcostalba/Stockfish/commit/numa
I tested the two versions of Stockfish offered above in the same manner that I tested Cfish. Here are the results:

./stockfish bench 16384 20 30 benchpos8 depth

Stockfish-0354e1... NUMA-aware
===========================
Total time (ms) : 5987409
Nodes searched  : 179943385246
Nodes/second    : 30053631

Stockfish-4b0043... not NUMA-aware
===========================
Total time (ms) : 4975656
Nodes searched  : 151105733822
Nodes/second    : 30369007

So the non-NUMA-aware master is about 1% faster in nps.
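The 1% figure follows directly from the bench output above. Here is a small sketch of the arithmetic; the function and variable names are just for illustration.

```python
def nps(nodes: int, time_ms: int) -> float:
    """Nodes per second from a node count and a total time in milliseconds."""
    return nodes * 1000 / time_ms


# Numbers copied from the two bench runs above.
numa_aware = nps(179943385246, 5987409)
master = nps(151105733822, 4975656)

ratio = master / numa_aware
print(f"NUMA-aware: {numa_aware:,.0f} nps")
print(f"master:     {master:,.0f} nps")
print(f"master is {100 * (ratio - 1):.2f}% faster")
```

Note that the two runs searched different numbers of nodes in total, so nps is the comparable figure here, not total time.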
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: What do you do with NUMA?

Post by matthewlai »

petero2 wrote:
matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault in) all the pages, all the memory will be on that thread's node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
Yes, but this would be equally true when running Fedora 19 as when running Fedora 24.

Using Fedora 19, the NUMA version of Texel was 10-15% faster than the non-NUMA version. Using the same Texel version and the same computer, but with Fedora 24 instead of Fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem. The NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs Fedora 24 and has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran Fedora 19 on the new computer.
I can try running it on my machine if you want? Ubuntu 16.04 Server, 2x E5-2670. It sounds like an interesting thing to figure out.