What do you do with NUMA?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: What do you do with NUMA?

Post by zullil »

bob wrote:
mar wrote:
mcostalba wrote: It's not so simple. Let me explain better: our aim is to set the thread affinity to different physical cores, so I need to know whether CPU X is on a different core than CPU Y.
I see, I thought you only needed to detect HT.

Aren't the logical cores interleaved, so that every two logical cores map to one physical core? (At least affinity on Windows seems to work this way; it's also mentioned in one of the links you sent.)
This assumption might not hold in the future though...
I have seen the following:

10 physical cores. With HT enabled, you get 20 logical cores, numbered 0-19. Which share a core?

(1) 0 and 1, 2 and 3, etc.

(2) 0 and 10, 1 and 11, etc.

BOTH happen. And of course, then there is IBM with 20 logical cores per physical core.

And once upon a time, when there was a non-power-of-two number of physical cores, all bets were off in both Linux and Windows.
For example, with two Xeons each having eight physical cores and hyperthreading enabled in the BIOS (so 16 logical cores per node):

Code: Select all

~$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 16 17 18 19 20 21 22 23
node 0 size: 15992 MB
node 0 free: 8455 MB
node 1 cpus: 8 9 10 11 12 13 14 15 24 25 26 27 28 29 30 31
node 1 size: 16125 MB
node 1 free: 8229 MB
node distances:
node   0   1 
  0:  10  20 
  1:  20  10 
Here the pairings are 0-16, 1-17, ..., 7-23 and 8-24, 9-25, ..., 15-31.

So, for example, "cpu" 0 and "cpu" 16 share one physical core on node 0 (i.e., on one of the two Xeons).
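For what it's worth, here is a minimal Linux sketch of the affinity side of this (an illustration, not any engine's actual code): once you know which logical CPU sits on which physical core, you can pin each search thread to one logical CPU so that its hyperthreading sibling stays free.

Code: Select all

/* Minimal sketch, assuming a topology like the one above where
 * logical cpus 0-15 sit on distinct physical cores. Pins the
 * calling thread to a single logical cpu. */
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>

int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* Binds only the calling thread, not the whole process. */
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}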
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: What do you do with NUMA?

Post by syzygy »

bob wrote:
mar wrote:
mcostalba wrote: It's not so simple. Let me explain better: our aim is to set the thread affinity to different physical cores, so I need to know whether CPU X is on a different core than CPU Y.
I see, I thought you only needed to detect HT.

Aren't the logical cores interleaved, so that every two logical cores map to one physical core? (At least affinity on Windows seems to work this way; it's also mentioned in one of the links you sent.)
This assumption might not hold in the future though...
I have seen the following:

10 physical cores. With HT enabled, you get 20 logical cores, numbered 0-19. Which share a core?

(1) 0 and 1, 2 and 3, etc.

(2) 0 and 10, 1 and 11, etc.

BOTH happen. And of course, then there is IBM with 20 logical cores per physical core.

And once upon a time, when there was a non-power-of-two number of physical cores, all bets were off in both Linux and Windows.
Fortunately, all these details are dealt with by the OS. Unfortunately, the interface differs between Windows and Linux, so to cover both you have to do the work twice.

For Windows:
https://github.com/mcostalba/Stockfish/ ... p#L26-L132
This first checks whether the processor-groups API is available (necessary to support more than 64 logical CPUs). It then obtains the CPU topology and determines the number of NUMA nodes and, for each node, the number of physical cores and the number of logical CPUs. You can also find out exactly which logical CPUs are on which nodes and exactly which logical CPUs correspond to the same physical core.
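For illustration, a minimal sketch of the same topology walk (not the Stockfish code itself; assumes Windows 7 or later, where GetLogicalProcessorInformationEx is always available):

Code: Select all

/* Count physical cores and logical CPUs with the processor-groups API. */
#include <windows.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
    DWORD len = 0;
    /* First call fails with ERROR_INSUFFICIENT_BUFFER and reports the size. */
    GetLogicalProcessorInformationEx(RelationProcessorCore, NULL, &len);
    SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *buf = malloc(len);
    if (!buf || !GetLogicalProcessorInformationEx(RelationProcessorCore, buf, &len))
        return 1;

    int cores = 0, logical = 0;
    for (char *p = (char *)buf; p < (char *)buf + len;) {
        SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *info =
            (SYSTEM_LOGICAL_PROCESSOR_INFORMATION_EX *)p;
        cores++;
        /* Each set bit in the group mask is one logical CPU on this core. */
        for (KAFFINITY m = info->Processor.GroupMask[0].Mask; m; m >>= 1)
            logical += (int)(m & 1);
        p += info->Size;
    }
    printf("%d physical cores, %d logical CPUs\n", cores, logical);
    free(buf);
    return 0;
}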

For Linux:
https://github.com/mcostalba/Stockfish/ ... #L151-L216
This uses libnuma to get similar information. Unfortunately libnuma does not know about hyperthreading, so you need to parse /sys/devices/system/cpu/cpu%d/topology/thread_siblings_list yourself.
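For illustration, a minimal sketch of that parsing (not Cfish's actual code; it just prints each CPU's sibling list, e.g. "0,16" or "0-1" depending on the kernel's formatting):

Code: Select all

#include <stdio.h>

int main(void)
{
    char path[128], line[256];
    for (int cpu = 0; cpu < 1024; cpu++) {
        snprintf(path, sizeof(path),
                 "/sys/devices/system/cpu/cpu%d/topology/thread_siblings_list",
                 cpu);
        FILE *f = fopen(path, "r");
        if (!f)
            break;                       /* no more cpus */
        if (fgets(line, sizeof(line), f))
            printf("cpu %d siblings: %s", cpu, line);
        fclose(f);
    }
    return 0;
}

Logical CPUs whose files contain the same list share a physical core.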

My own implementation:
https://github.com/syzygy1/Cfish/blob/m ... .c#L51-L92
for Linux and
https://github.com/syzygy1/Cfish/blob/m ... #L202-L325
for Windows.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

matthewlai wrote:
petero2 wrote:What you write makes perfect sense, but you still get a 30% difference between NUMA and non-NUMA even when 4 cores are left idle (except for OS background activity). In my earlier test in Fedora 19 I only saw a 14% difference, even when the number of search threads was equal to the number of cores.

So while I agree the scheduling problem is hard, there has still been a significant regression for the texel workload in Fedora 24 compared to Fedora 19. Possibly the texel workload is unusual and the kernel developers have concentrated their efforts on different workloads.
It would be interesting to bisect the kernel to see exactly which change caused this behaviour. That would take a while, though.
I ran some tests on different Fedora versions using my 24-core computer. In all cases I used the same statically linked texel binary so that it would work on all Fedora versions. The binary was compiled without NUMA support. I ran texel like this:

Code: Select all

./texel
uci
setoption name hash value 1024
setoption name threads value 24
go infinite
I stopped the search after 30 seconds and noted the reported NPS value. I got the following results:

Code: Select all

Version             : MNPS
Fedora 19 live      : 27.7 28.3
Fedora 20 installer : 23.0 20.8 21.9
Fedora 20 live      : 21.8 21.1
Fedora 21 live      : 20.8 21.0
Fedora 22 live      : 20.5 20.3
Fedora 23 live      : 20.0 
Fedora 24 live      : 21.2 21.3 22.1
Fedora 24 + updates : 17.5 17.9 18.5 17.7
Fedora 25-a2 live   : 19.5 19.6
So there was a big regression in Fedora 20, and it seems there was another regression in a Fedora 24 update.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: What do you do with NUMA?

Post by matthewlai »

petero2 wrote:
matthewlai wrote:
petero2 wrote:What you write makes perfect sense, but you still get a 30% difference between NUMA and non-NUMA even when 4 cores are left idle (except for OS background activity). In my earlier test in Fedora 19 I only saw a 14% difference, even when the number of search threads was equal to the number of cores.

So while I agree the scheduling problem is hard, there has still been a significant regression for the texel workload in Fedora 24 compared to Fedora 19. Possibly the texel workload is unusual and the kernel developers have concentrated their efforts on different workloads.
It would be interesting to bisect the kernel to see exactly which change caused this behaviour. That would take a while, though.
I ran some tests on different Fedora versions using my 24-core computer. In all cases I used the same statically linked texel binary so that it would work on all Fedora versions. The binary was compiled without NUMA support. I ran texel like this:

Code: Select all

./texel
uci
setoption name hash value 1024
setoption name threads value 24
go infinite
I stopped the search after 30 seconds and noted the reported NPS value. I got the following results:

Code: Select all

Version             : MNPS
Fedora 19 live      : 27.7 28.3
Fedora 20 installer : 23.0 20.8 21.9
Fedora 20 live      : 21.8 21.1
Fedora 21 live      : 20.8 21.0
Fedora 22 live      : 20.5 20.3
Fedora 23 live      : 20.0 
Fedora 24 live      : 21.2 21.3 22.1
Fedora 24 + updates : 17.5 17.9 18.5 17.7
Fedora 25-a2 live   : 19.5 19.6
So there was a big regression in Fedora 20, and it seems there was another regression in a Fedora 24 update.
Fedora 19 uses kernel 3.9, and Fedora 20 uses kernel 3.11.

Release summary of 3.10: https://kernelnewbies.org/Linux_3.10
And 3.11: https://kernelnewbies.org/Linux_3.11

The timerless change sounds most probable.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What do you do with NUMA?

Post by bob »

petero2 wrote:
matthewlai wrote:
petero2 wrote:What you write makes perfect sense, but you still get a 30% difference between NUMA and non-NUMA even when 4 cores are left idle (except for OS background activity). In my earlier test in Fedora 19 I only saw a 14% difference, even when the number of search threads was equal to the number of cores.

So while I agree the scheduling problem is hard, there has still been a significant regression for the texel workload in Fedora 24 compared to Fedora 19. Possibly the texel workload is unusual and the kernel developers have concentrated their efforts on different workloads.
It would be interesting to bisect the kernel to see exactly which change caused this behaviour. That would take a while, though.
I ran some tests on different Fedora versions using my 24-core computer. In all cases I used the same statically linked texel binary so that it would work on all Fedora versions. The binary was compiled without NUMA support. I ran texel like this:

Code: Select all

./texel
uci
setoption name hash value 1024
setoption name threads value 24
go infinite
I stopped the search after 30 seconds and noted the reported NPS value. I got the following results:

Code: Select all

Version             : MNPS
Fedora 19 live      : 27.7 28.3
Fedora 20 installer : 23.0 20.8 21.9
Fedora 20 live      : 21.8 21.1
Fedora 21 live      : 20.8 21.0
Fedora 22 live      : 20.5 20.3
Fedora 23 live      : 20.0 
Fedora 24 live      : 21.2 21.3 22.1
Fedora 24 + updates : 17.5 17.9 18.5 17.7
Fedora 25-a2 live   : 19.5 19.6
So there was a big regression in Fedora 20, and it seems there was another regression in a Fedora 24 update.
That's a bad test. If you read Rik's description of how the built-in NUMA support works, it takes MUCH more time than that to properly migrate data. He tries to do it in an unobtrusive manner, rather than swamping the system with overhead.

A better scheme would be a 3-5 minute search to "warm" things up, then, without terminating the executable, running the kind of test you ran...
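In terms of the UCI session used above, such a protocol might look like this (a sketch; the timings are illustrative):

Code: Select all

./texel
uci
setoption name hash value 1024
setoption name threads value 24
go infinite
   (let this run 3-5 minutes so pages can migrate, then)
stop
go infinite
   (read the NPS off this second search after ~30 seconds)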
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

matthewlai wrote:
petero2 wrote:
matthewlai wrote:
petero2 wrote:What you write makes perfect sense, but you still get a 30% difference between NUMA and non-NUMA even when 4 cores are left idle (except for OS background activity). In my earlier test in Fedora 19 I only saw a 14% difference, even when the number of search threads was equal to the number of cores.

So while I agree the scheduling problem is hard, there has still been a significant regression for the texel workload in Fedora 24 compared to Fedora 19. Possibly the texel workload is unusual and the kernel developers have concentrated their efforts on different workloads.
It would be interesting to bisect the kernel to see exactly which change caused this behaviour. That would take a while, though.
I ran some tests on different Fedora versions using my 24-core computer. In all cases I used the same statically linked texel binary so that it would work on all Fedora versions. The binary was compiled without NUMA support. I ran texel like this:

Code: Select all

./texel
uci
setoption name hash value 1024
setoption name threads value 24
go infinite
I stopped the search after 30 seconds and noted the reported NPS value. I got the following results:

Code: Select all

Version             : MNPS
Fedora 19 live      : 27.7 28.3
Fedora 20 installer : 23.0 20.8 21.9
Fedora 20 live      : 21.8 21.1
Fedora 21 live      : 20.8 21.0
Fedora 22 live      : 20.5 20.3
Fedora 23 live      : 20.0 
Fedora 24 live      : 21.2 21.3 22.1
Fedora 24 + updates : 17.5 17.9 18.5 17.7
Fedora 25-a2 live   : 19.5 19.6
So there was a big regression in Fedora 20, and it seems there was another regression in a Fedora 24 update.
Fedora 19 uses kernel 3.9, and Fedora 20 uses kernel 3.11.

Release summary of 3.10: https://kernelnewbies.org/Linux_3.10
And 3.11: https://kernelnewbies.org/Linux_3.11

The timerless change sounds most probable.
I agree. I tried booting the Fedora 20 kernel with "nohz=off", but that did not affect the performance of the non-NUMA-aware texel. Possibly the timer changes affect scheduling even when nohz=off is used.

Rebuilding the live image with a kernel configured with CONFIG_HZ_PERIODIC would be interesting, but unfortunately I don't know how to do that.
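(For reference, one way to see which tick mode a given kernel was built with, assuming the distribution installs its config under /boot:)

Code: Select all

~$ grep -E 'CONFIG_HZ_PERIODIC|CONFIG_NO_HZ' /boot/config-$(uname -r)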
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

petero2 wrote:Rebuilding the live image with a kernel configured with CONFIG_HZ_PERIODIC would be interesting, but unfortunately I don't know how to do that.
I built a vanilla kernel (4.8.0-rc7) instead, with CONFIG_HZ_PERIODIC=y, and booted the Fedora 24 OS using this kernel. It did not affect the non-NUMA-aware texel, though. NUMA awareness still gives around 70% higher NPS for texel.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

bob wrote:
petero2 wrote:
matthewlai wrote:
petero2 wrote:What you write makes perfect sense, but you still get a 30% difference between NUMA and non-NUMA even when 4 cores are left idle (except for OS background activity). In my earlier test in Fedora 19 I only saw a 14% difference, even when the number of search threads was equal to the number of cores.

So while I agree the scheduling problem is hard, there has still been a significant regression for the texel workload in Fedora 24 compared to Fedora 19. Possibly the texel workload is unusual and the kernel developers have concentrated their efforts on different workloads.
It would be interesting to bisect the kernel to see exactly which change caused this behaviour. That would take a while, though.
I ran some tests on different Fedora versions using my 24-core computer. In all cases I used the same statically linked texel binary so that it would work on all Fedora versions. The binary was compiled without NUMA support. I ran texel like this:

Code: Select all

./texel
uci
setoption name hash value 1024
setoption name threads value 24
go infinite
I stopped the search after 30 seconds and noted the reported NPS value. I got the following results:

Code: Select all

Version             : MNPS
Fedora 19 live      : 27.7 28.3
Fedora 20 installer : 23.0 20.8 21.9
Fedora 20 live      : 21.8 21.1
Fedora 21 live      : 20.8 21.0
Fedora 22 live      : 20.5 20.3
Fedora 23 live      : 20.0 
Fedora 24 live      : 21.2 21.3 22.1
Fedora 24 + updates : 17.5 17.9 18.5 17.7
Fedora 25-a2 live   : 19.5 19.6
So there was a big regression in Fedora 20, and it seems there was another regression in a Fedora 24 update.
That's a bad test.
No, the test was fine; you just did not understand its purpose. As I wrote here, automatic NUMA balancing was disabled for this test. The purpose of the test was to find out in which Linux version the performance regression started for texel when texel's NUMA awareness is disabled.
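(For reference, on kernels that support automatic NUMA balancing, it can be inspected and toggled through procfs:)

Code: Select all

~$ cat /proc/sys/kernel/numa_balancing
1
~$ echo 0 | sudo tee /proc/sys/kernel/numa_balancing
0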

Anyway, I have now tested again after implementing "lazy SMP" in texel. In this algorithm the search threads never sleep while the engine is thinking. When testing this version on my 16-core computer, the NUMA-aware texel was only about 4% faster than the non-NUMA-aware texel.

Possibly this result is too good for the non-NUMA version, because texel allocates its thread-local memory when a search thread starts its first search. Since the thread does not sleep during thinking, and my test restarted the engine between each search, it is quite likely that each search thread stayed on the same node where its local memory was allocated for the whole duration of the test. In a real game, when the engine is sleeping while the opponent is thinking, there is a risk that threads get assigned to different nodes/cores each time the engine starts thinking.
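For illustration, the first-touch placement this relies on can be made explicit: allocate and write the thread-local memory from the search thread itself, so that (under the default Linux first-touch policy) the pages land on the node that thread is running on. A minimal sketch:

Code: Select all

#include <stdlib.h>
#include <string.h>

/* Call from the search thread that will use the buffer. */
void *alloc_node_local(size_t bytes)
{
    void *p = malloc(bytes);
    if (p)
        memset(p, 0, bytes);  /* first write places the pages locally */
    return p;
}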