What do you do with NUMA?

Discussion of chess software programming and technical issues.


bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: What do you do with NUMA?

Post by bob »

I would not expect much of a difference with just two nodes. I ran into significant issues on 8 nodes and beyond. I did not have a 4-node machine to test on, so I can't say whether 4 nodes presents any issues, although with AMD it would be a bit worse than with Intel (hyper-transport is double-ended, while Intel QPI has three links per chip, so a 4-node machine is always 0 or 1 hops, period).
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

matthewlai wrote:
petero2 wrote:
matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault-in) all the pages, all the memory will be on that node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
Yes, but this would be equally true when running fedora 19 as when running fedora 24.

Using fedora 19 the NUMA version of texel was 10-15% faster than the non-NUMA version. Using the same texel version and the same computer, but with fedora 24 instead of fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem; the NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs fedora 24 and also has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran fedora 19 on the new computer.
I can try running it on my machine if you want? Ubuntu 16.04 Server, 2x E5-2670. It sounds like an interesting thing to figure out.
Yes it would be interesting to see what happens on your machine.

I have performed some more tests and found the following:

* All texel versions I have tried (1.05, 1.06, 1.06a34) run very slowly when NUMA awareness is disabled.

* I ran a test in single user mode as the root user, but it still ran equally slowly.

* To monitor which cores the threads are using I used this command:

Code: Select all

watch -n 0.1 ps -mo pid,tid,%cpu,psr -p `pgrep texel`
This shows that when NUMA awareness is disabled, threads jump a lot between cores and between NUMA nodes.

When NUMA awareness is enabled, threads jump between cores very seldom and never between NUMA nodes. I bind each thread to a NUMA node but not to a specific core within the node.
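
For reference, binding a thread to a node (but not to a specific core within it) can be done with libnuma roughly like this. This is just a simplified sketch, not the exact texel code:

Code: Select all

// Simplified sketch (not the exact texel code) of binding the calling thread
// to a NUMA node, but not to a specific core within that node. Uses libnuma;
// build with: g++ -O2 bindnode.cpp -lnuma
#include <numa.h>
#include <cstdio>

// Restrict the calling thread to the cores of 'node' and prefer local memory.
static bool bindThreadToNode(int node) {
    if (numa_available() < 0)
        return false;                 // no NUMA support in the kernel/libc
    if (numa_run_on_node(node) != 0)  // affinity mask = all cores of this node
        return false;
    numa_set_preferred(node);         // new allocations preferably on this node
    return true;
}

int main() {
    int nodes = numa_num_configured_nodes();
    std::printf("NUMA nodes: %d\n", nodes);
    // Each search thread would call bindThreadToNode(threadNo % nodes)
    // before it starts working.
    return bindThreadToNode(0) ? 0 : 1;
}

numa_run_on_node() only restricts the thread to that node's cores, so the kernel is still free to move it between cores within the node.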

When I run a NUMA-unaware stockfish on the same computer, its search threads do not jump between cores. Possibly that is because stockfish search threads never sleep, thanks to the lazy SMP algorithm. Texel search threads can sleep for short amounts of time when waiting for new work. Maybe that causes the kernel to schedule a thread on a different core when it wakes up again.
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: What do you do with NUMA?

Post by Evert »

petero2 wrote: When I run a NUMA-unaware stockfish on the same computer, its search threads do not jump between cores. Possibly that is because stockfish search threads never sleep, thanks to the lazy SMP algorithm. Texel search threads can sleep for short amounts of time when waiting for new work. Maybe that causes the kernel to schedule a thread on a different core when it wakes up again.
Interesting.
When Stockfish was changed to lazy SMP, was that effect looked at? It would be interesting to know how much it contributes to the efficiency of the new algorithm.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

Evert wrote:
petero2 wrote: When I run a NUMA-unaware stockfish on the same computer, its search threads do not jump between cores. Possibly that is because stockfish search threads never sleep, thanks to the lazy SMP algorithm. Texel search threads can sleep for short amounts of time when waiting for new work. Maybe that causes the kernel to schedule a thread on a different core when it wakes up again.
Interesting.
When Stockfish was changed to lazy SMP, was that effect looked at? It would be interesting to know how much it contributes to the efficiency of the new algorithm.
I don't know yet how general this effect is. I have only seen it on my fedora 24 system. The effect did not exist on the same machine when it was previously running fedora 19. It could be a fedora-specific problem, although this article suggests that it might be a more general linux task scheduler problem.

I don't have a NUMA windows system, but I seriously doubt that the NUMA-aware texel is 70% faster on such systems.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: What do you do with NUMA?

Post by matthewlai »

petero2 wrote:
matthewlai wrote:
petero2 wrote:
matthewlai wrote:
petero2 wrote: Automatic NUMA balancing is disabled.
With balancing disabled, if you allocate everything in one thread and touch (fault-in) all the pages, all the memory will be on that node. Also, when the scheduler moves a thread to another node, memory won't follow if balancing is disabled.
Yes, but this would be equally true when running fedora 19 as when running fedora 24.

Using fedora 19 the NUMA version of texel was 10-15% faster than the non-NUMA version. Using the same texel version and the same computer, but with fedora 24 instead of fedora 19, the NUMA version is now around 70% faster.

Also, enabling NUMA balancing does not fix the problem; the NUMA version is still around 70% faster than the non-NUMA version.

I now also have a 24-core computer, which also runs fedora 24 and also has the same problem (NUMA version around 70% faster than non-NUMA version). I never ran fedora 19 on the new computer.
I can try running it on my machine if you want? Ubuntu 16.04 Server, 2x E5-2670. It sounds like an interesting thing to figure out.
Yes it would be interesting to see what happens on your machine.

I have performed some more tests and found the following:

* All texel versions I have tried (1.05, 1.06, 1.06a34) run very slowly when NUMA awareness is disabled.

* I ran a test in single user mode as the root user, but it still ran equally slowly.

* To monitor which cores the threads are using I used this command:

Code: Select all

watch -n 0.1 ps -mo pid,tid,%cpu,psr -p `pgrep texel`
This shows that when NUMA awareness is disabled, threads jump a lot between cores and between NUMA nodes.

When NUMA awareness is enabled, threads jump between cores very seldom and never between NUMA nodes. I bind each thread to a NUMA node but not to a specific core within the node.

When I run a NUMA-unaware stockfish on the same computer, its search threads do not jump between cores. Possibly that is because stockfish search threads never sleep, thanks to the lazy SMP algorithm. Texel search threads can sleep for short amounts of time when waiting for new work. Maybe that causes the kernel to schedule a thread on a different core when it wakes up again.
I first enabled -DNUMA and -lnuma, and built with the "texel" target (I had to move "$(LDFLAGS)" to the end of the command).

With NUMA:

Code: Select all

matthew@bigfoot:~/texel$ ./texel64
setoption name Threads value 16
go movetime 30000
...
info nodes 547469319 nps 18245936 time 30005
bestmove e2e4 ponder c7c5
No NUMA:

Code: Select all

matthew@bigfoot:~/texel$ ./texel64 -nonuma
setoption name Threads value 16
go movetime 30000
...
info nodes 345743254 nps 11522470 time 30006
bestmove e2e4 ponder e7e5
So it shows up here as well.

I think your hypothesis is right. Auto-balancing doesn't kick in because threads don't stay runnable for long, and lazy SMP or other transposition-table-based algorithms won't have this problem. It may help to use a spinlock in WorkQueue instead, so that threads don't go to sleep.

This is very hard to solve if threads go to sleep, because as soon as a thread goes to sleep, its core may be taken by another thread. Then when it wakes up there may not be a free core on the node it was on before, so the scheduler has to schedule it on the other node (presumably that's better than not scheduling it at all in most workloads).
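
By a spinlock I mean something along these lines; just a sketch, not based on the actual WorkQueue code:

Code: Select all

// Minimal spinlock sketch (illustration only, not the actual WorkQueue code).
// Waiting threads spin instead of blocking, so they stay runnable and the
// scheduler has no reason to move them to another core or node.
#include <atomic>

class SpinLock {
    std::atomic_flag flag = ATOMIC_FLAG_INIT;
public:
    void lock() {
        while (flag.test_and_set(std::memory_order_acquire)) {
            // busy-wait; a pause/yield instruction could go here
        }
    }
    void unlock() {
        flag.clear(std::memory_order_release);
    }
};

The obvious downside is that a spinning thread burns a core while it waits, so this only pays off when the waits are short.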
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: What do you do with NUMA?

Post by matthewlai »

At 12 threads (so there's room to move around), this doesn't happen as much.

Code: Select all

matthew@bigfoot:~/texel$ ./texel64
setoption name Threads value 12
go movetime 30000
...
info nodes 412815906 nps 13758237 time 30005
bestmove e2e4 ponder c7c5

Code: Select all

matthew@bigfoot:~/texel$ ./texel64 -nonuma
setoption name Threads value 12
go movetime 30000
...
info nodes 317504301 nps 10583476 time 30000
bestmove e2e4 ponder c7c5
If you think about it... this is a very difficult problem for the scheduler. When a thread wakes up and all the cores in the original node are busy, what do you do?

You can:
1) Not schedule it (even though another node has a free core)
2) Schedule it in a hyperthread on the original node
3) Schedule it on the other node

For most applications 3) is probably the most sensible. In chess it isn't, because we know that another thread will probably go to sleep very soon.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: What do you do with NUMA?

Post by petero2 »

matthewlai wrote:At 12 threads (so there's room to move around), this doesn't happen as much.

Code: Select all

matthew@bigfoot:~/texel$ ./texel64
setoption name Threads value 12
go movetime 30000
...
info nodes 412815906 nps 13758237 time 30005
bestmove e2e4 ponder c7c5

Code: Select all

matthew@bigfoot:~/texel$ ./texel64 -nonuma
setoption name Threads value 12
go movetime 30000
...
info nodes 317504301 nps 10583476 time 30000
bestmove e2e4 ponder c7c5
If you think about it... this is a very difficult problem for the scheduler. When a thread wakes up and all the cores in the original node are busy, what do you do?

You can:
1) Not schedule it (even though another node has a free core)
2) Schedule it in a hyperthread on the original node
3) Schedule it on the other node

For most applications 3) is probably the most sensible. In chess it isn't, because we know that another thread will probably go to sleep very soon.
What you write makes perfect sense, but you still get a 30% difference between NUMA and non-NUMA even when 4 cores are left idle (except for OS background activity). In my earlier test in fedora 19 I only saw a 14% difference even when the number of search threads was equal to the number of cores.

So while I agree the scheduling problem is hard, there has still been a significant regression for the texel workload in fedora 24 compared to fedora 19. Possibly the texel workload is unusual and the kernel developers have concentrated their effort on different workloads.
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: What do you do with NUMA?

Post by matthewlai »

petero2 wrote:
matthewlai wrote:At 12 threads (so there's room to move around), this doesn't happen as much.

Code: Select all

matthew@bigfoot:~/texel$ ./texel64
setoption name Threads value 12
go movetime 30000
...
info nodes 412815906 nps 13758237 time 30005
bestmove e2e4 ponder c7c5

Code: Select all

matthew@bigfoot:~/texel$ ./texel64 -nonuma
setoption name Threads value 12
go movetime 30000
...
info nodes 317504301 nps 10583476 time 30000
bestmove e2e4 ponder c7c5
If you think about it... this is a very difficult problem for the scheduler. When a thread wakes up and all the cores in the original node are busy, what do you do?

You can:
1) Not schedule it (even though another node has a free core)
2) Schedule it in a hyperthread on the original node
3) Schedule it on the other node

For most applications 3) is probably the most sensible. In chess it isn't, because we know that another thread will probably go to sleep very soon.
What you write makes perfect sense, but you still get a 30% difference between NUMA and non-NUMA even when 4 cores are left idle (except for OS background activity). In my earlier test in fedora 19 I only saw a 14% difference even when the number of search threads was equal to the number of cores.

So while I agree the scheduling problem is hard, there has still been a significant regression for the texel workload in fedora 24 compared to fedora 19. Possibly the texel workload is unusual and the kernel developers have concentrated their effort on different workloads.
It would be interesting to bisect the kernel to see exactly what change resulted in this change in behaviour. That would take a while, though.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: What do you do with NUMA?

Post by syzygy »

petero2 wrote:It could be a fedora specific problem, although this article suggests that it might be a more generic linux task scheduler problem.
But that paper is complaining that Linux is not moving processes around between cores aggressively enough. (And as far as I understand, the authors have basically identified a corner case that is handled badly and put some effort into formulating the paper's title.)

In your case it might be a kernel setting that has changed or it might be a change in the scheduler that is somehow playing badly with Texel.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: What do you do with NUMA?

Post by mcostalba »

zullil wrote: So the non-NUMA-aware master is about 1% faster in nps.
Thanks Louis, this was expected :-)

The non-NUMA version uses a big per-thread table, while the NUMA version uses a per-node table, so memory contention is higher in the NUMA version. The CFish result is different because very probably the non-NUMA CFish still uses a global table shared by all threads.
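
A per-node table can be allocated with libnuma roughly like this (just a sketch, not the code of the patch):

Code: Select all

// Sketch only (not the code of the SF NUMA patch): one table per NUMA node,
// allocated with libnuma so that threads running on a node use local memory.
// Assumes libnuma is available; error handling omitted; link with -lnuma.
#include <numa.h>
#include <cstring>
#include <vector>

struct PerNodeTables {
    size_t bytes;
    std::vector<void*> tables;                     // tables[n] lives on node n

    explicit PerNodeTables(size_t tableBytes) : bytes(tableBytes) {
        int nodes = numa_num_configured_nodes();
        for (int n = 0; n < nodes; n++) {
            void* p = numa_alloc_onnode(bytes, n); // pages placed on node n
            std::memset(p, 0, bytes);              // touch the pages now
            tables.push_back(p);
        }
    }
    ~PerNodeTables() {
        for (void* p : tables)
            numa_free(p, bytes);
    }
    void* tableForNode(int node) const { return tables[node]; }
};

Threads running on node n then all index tables[n], which is why contention is higher than with a private per-thread table.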

The point is that the so-called SF NUMA patch does 2 things:

- Sets affinity for each thread

- Changes table to be per-node

To really evaluate the NUMA contribution alone we should test just the first, all other things being equal. Any other tweaks are second-order improvements, but first we have to characterize NUMA alone.

By NUMA I mean:

1. Setting thread affinity, one thread per core.
2. Allocating per-thread tables once affinity has been set.

The second point is not present in the so-called NUMA patch, but it makes perfect sense if you pin the threads. Note that if you don't set affinity, i.e. in the normal case, the second point has been tested and does not change anything (possibly either because threads jump too quickly, or because threads are more or less stable and the OS has the opportunity to migrate memory to the corresponding core).
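
In code the two points look roughly like this; just a sketch assuming Linux and pthreads, not actual engine code. The table pages end up on the local node thanks to the default first-touch policy:

Code: Select all

// Sketch of the two points above (not actual engine code), assuming Linux + pthreads:
// 1) pin the thread to one core, 2) only then allocate and touch its private
// table, so the default first-touch policy places the pages on the local node.
// Build with: g++ -O2 pin.cpp -pthread
#include <pthread.h>
#include <sched.h>
#include <cstdint>
#include <cstring>
#include <vector>

static void pinCurrentThreadToCpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

static void* threadMain(void* arg) {
    int cpu = (int)(std::intptr_t)arg;
    // NOTE: naively mapping thread i -> CPU i ignores hyper-threading topology
    // (the very problem discussed below).
    pinCurrentThreadToCpu(cpu);                    // point 1: set affinity first

    const size_t tableBytes = 16 * 1024 * 1024;    // illustrative size
    uint8_t* table = new uint8_t[tableBytes];      // point 2: allocate after pinning
    std::memset(table, 0, tableBytes);             // first touch -> local pages

    // ... search using 'table' ...
    delete[] table;
    return nullptr;
}

int main() {
    const int numThreads = 4;                      // illustrative value
    std::vector<pthread_t> threads(numThreads);
    for (int i = 0; i < numThreads; i++)
        pthread_create(&threads[i], nullptr, threadMain, (void*)(std::intptr_t)i);
    for (pthread_t& t : threads)
        pthread_join(t, nullptr);
    return 0;
}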

Following the other interesting posts in this thread, I see that a notable difference between Texel and SF is that in the former case threads sleep quite often, while with SF's lazy SMP threads never sleep, and this allows the OS to place the threads optimally in a dynamic fashion.

I still prefer to let the OS place the threads for me, as long as I feed it a predictable set of long-running threads (to avoid tricking it with sleeping threads), instead of manually setting affinity, mainly because the OS knows more about the difference between physical and logical cores. For instance, on a 2-core machine with hyper-threading enabled (CPU 0, CPU 1, CPU 2, CPU 3), my manual setting would probably end up pinning 2 threads to CPU 0 and CPU 1, while the OS, realizing that CPU 0 and CPU 1 are just the same physical core, would perhaps place them on CPU 0 and CPU 3.

In general, distinguishing hyper-threads from physical cores is far from trivial, especially on Intel hardware, so to people who will cheaply start commenting "you need to detect ht ..." and other similar BS: please don't spam this thread.