What do you do with NUMA?

zullil · Post by **zullil** » Tue Sep 20, 2016 2:36 pm

mcostalba wrote:Thanks Louis, indeed I don't know if the non-numa CFish version already has the per-thread countermove table (it is the scalability patch I was referring to earlier).

So to compare oranges vs oranges I have rebased numa branch to current master and here are the links to the corresponding sources:

Numa-aware
https://github.com/mcostalba/Stockfish/ ... e016d5.zip

Master (non-numa aware)
https://github.com/mcostalba/Stockfish/ ... 0891ff.zip

In case you are willing to test, please use them.

For interested people, the numa patch is this one:
https://github.com/mcostalba/Stockfish/commit/numa

Hi Marco,

The NUMA-aware CFish uses one CMH table for each NUMA node, so threads do not have individual tables.

I'll try to run an identical nps benchmarking test on the two versions of Stockfish you've provided, as soon as I can.

Adam Hair · Post by **Adam Hair** » Tue Sep 20, 2016 5:56 pm

bob wrote:
Adam Hair wrote:
syzygy wrote:Using more than 64 logical processors on Windows has nothing to do with NUMA.
Marco's link states that Windows processor groups consist of sets of up to 64 logical processors, and a NUMA node is assigned to each processor group.
And that makes absolutely no sense to me, as an operating system person. Leave it to Micro$oft to completely mangle terminology and concepts to fit their operating system shortcomings.

A NUMA node is dictated by hardware, not the operating system. Doesn't matter WHAT Windows thinks about this stuff, NUMA is a hardware issue dealing with local memory (per node) and trying to minimize access to remote memory on other nodes. No idea what the 64 logical processors per node means. 99% of today's machines have exactly one NUMA node, which means NUMA is irrelevant. MOST of the rest have just two NUMA nodes. Interesting things happen as node count goes up. The first NUMA box I used was an 8-core AMD Opteron, but it also had 8 chips (8 NUMA nodes) so NUMA-awareness was quite important, as we were pretty slow when we started on that box.

That explains why the Microsoft description seemed different from everything else I read about NUMA (all of which matches what you wrote).

Adam Hair · Post by **Adam Hair** » Tue Sep 20, 2016 6:20 pm

syzygy wrote:
Adam Hair wrote:
syzygy wrote:Using more than 64 logical processors on Windows has nothing to do with NUMA.
Marco's link states that Windows processor groups consist of sets of up to 64 logical processors, and a NUMA node is assigned to each processor group.
It does not say that. It does say that the cores of one NUMA node will, if possible (what if you have more than 64 logical cores on one node?), correspond to one processor group. But a processor group can contain multiple NUMA nodes.

The processor groups are what you cannot avoid if you want to run threads on more than 64 logical cores / 32 physical cores (with HT on). If you do that, you can still ignore any and all NUMA issues.

Okay, this and what Bob wrote clear up some confusion I had about nodes.

petero2 · Post by **petero2** » Tue Sep 20, 2016 6:41 pm

syzygy wrote:In this thread Peter Österlund reports some interesting results for Texel on a 2-node PC with 2 x 8 = 16 threads:
Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
So Linux' automatic NUMA balancing feature hurts rather than helps, and Texel's own NUMA awareness increases speed by over 10%. (He wrote 14%, but if I take 18.16 vs 16.44 it is 10.46%.)

I took the average improvement for auto on and auto off, that is:

(18.16/16.44 + 17.88/15.22) / 2 = 1.1397

My assumption was that there is no real difference between auto on and auto off for a chess program, so taking the average should give a better estimate. I don't know if that assumption is correct though. More measurements would be required to find out.
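
For anyone who wants to check the arithmetic, here is a small Python sketch of the averaging described above. The dictionary layout and function name are only illustrative; the four speeds are the Mn/s means from the quoted table.

```python
# Mn/s means from the quoted table, keyed by (auto balancing, NUMA awareness).
speeds = {
    ("no", "no"): 16.44,
    ("yes", "no"): 15.22,
    ("no", "yes"): 18.16,
    ("yes", "yes"): 17.88,
}


def awareness_gain(auto):
    """Speed ratio of NUMA-aware vs. unaware runs for one auto-balancing setting."""
    return speeds[(auto, "yes")] / speeds[(auto, "no")]


gain_auto_off = awareness_gain("no")
gain_auto_on = awareness_gain("yes")
mean_gain = (gain_auto_off + gain_auto_on) / 2

print(f"auto off: {gain_auto_off:.4f}")  # 1.1046, the 10.46% figure
print(f"auto on:  {gain_auto_on:.4f}")   # 1.1748
print(f"mean:     {mean_gain:.4f}")      # 1.1397, the 14% figure
```

So the 10.46% and 14% figures come from the same table; they differ only in whether the auto-on row is included in the estimate.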

bob · Post by **bob** » Tue Sep 20, 2016 6:41 pm

mcostalba wrote:
mcostalba wrote:In case of Peter we are talking of about 10% speed-up in case of his engine Texel. I don't think this can be blindly generalized to all engines, for instance one of the latest patches in SF (that you know well) removed a scalability issue that alone is able to gain that 10% in case of many cores (32 cores were used for testing).
Moreover a recent numa sf vs non-numa sf test failed to show any sensible difference (it was +2 ELO after 1000 games, well below noise level) when tested on 32 cores, 2 machines were involved in the test:

http://tests.stockfishchess.org/tests/v ... 030fbe521f

If the difference had been important (more than 10-20 ELO), the test would have shown it.

And what is the error bar for 1000 games? And does YOUR performance provide some sort of bound for everyone else's performance? And finally, number of cores is irrelevant, this is all about number of nodes. We saw significant improvement after doing some NUMA-related changes when Crafty first ran on (a) the 8-node AMD Opteron box (one core per node) and (b) the 64-node (one core per node) Intel Itanium box Eugene had access to at Microsoft.

Cores is the wrong topic.
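
To put a rough number on the error-bar question, here is a minimal Python sketch using the usual normal approximation on the per-game score. The win/draw/loss split below is hypothetical: it was picked only so that the result is about +2 Elo over 1000 games, because the actual counts from the linked test are not given in this thread. The function names are illustrative too.

```python
import math


def elo_from_score(score):
    """Convert an expected score in (0, 1) to an Elo difference (logistic model)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_interval(wins, draws, losses, z=1.96):
    """Return (low, mid, high) Elo for a match, using a normal approximation."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    variance = (
        wins * (1.0 - score) ** 2
        + draws * (0.5 - score) ** 2
        + losses * (0.0 - score) ** 2
    ) / n
    stderr = math.sqrt(variance / n)
    return (
        elo_from_score(score - z * stderr),
        elo_from_score(score),
        elo_from_score(score + z * stderr),
    )


# HYPOTHETICAL split, not taken from the linked test.
low, mid, high = elo_interval(wins=203, draws=600, losses=197)
print(f"Elo {mid:+.1f}, 95% interval [{low:+.1f}, {high:+.1f}]")
```

With this assumed split the 95% interval is wider than 10 Elo on each side and straddles zero, so a +2 result after 1000 games neither shows a gain nor rules out one of around 10 Elo. A different draw ratio would change the width somewhat.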

bob · Post by **bob** » Tue Sep 20, 2016 6:44 pm

mcostalba wrote:
bob wrote:
It is only "hand waving" when you don't understand the issues.

BTW NUMA has NOTHING to do with "number of cores". It has EVERYTHING to do with "number of nodes". And also nothing to do with "logical processors" which can only mean hyper-threading.
In the (unlikely) case that you start to read posts before quoting them, we all would be witnessing a huge jump in quality of the answer.

To make you a short, read-challenged aimed, summary: a multi-thread process under Windows cannot use more than 64 logical processors, no matter what. To overcome this limitation you have to play with what they call NUMA functions.

For people willing to go into more details, here is an interesting technical document from Microsoft:
https://msdn.microsoft.com/en-us/librar ... s.85).aspx

And once YOU learn to read, I clearly said that has NOTHING to do with NUMA issues. NUMA is PURELY a hardware issue. NOT some broken software issue as is present in windows.

bob · Post by **bob** » Tue Sep 20, 2016 6:48 pm

petero2 wrote:
syzygy wrote:In this thread Peter Österlund reports some interesting results for Texel on a 2-node PC with 2 x 8 = 16 threads:
Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
So Linux' automatic NUMA balancing feature hurts rather than helps, and Texel's own NUMA awareness increases speed by over 10%. (He wrote 14%, but if I take 18.16 vs 16.44 it is 10.46%.)
I took the average improvement for auto on and auto off, that is:

(18.16/16.44 + 17.88/15.22) / 2 = 1.1397

My assumption was that there is no real difference between auto on and auto off for a chess program, so taking the average should give a better estimate. I don't know if that assumption is correct though. More measurements would be required to find out.

This might be kernel-version specific. We ran a number of tests at UAB, using chess and non-chess applications. The newer NUMA-aware kernels did provide better performance overall. But I STILL found it better to use processor affinity. On our 20-core box, threads will STILL bounce a bit due to an occasional system task running. Bouncing has two effects: (1) obvious cache issues with the private (per-core) L1 and L2 caches and the shared (per NUMA node) L3 cache, and (2) remote memory access. So even if you bounce between cores on one node, L1 and L2 suffer. If you bounce between nodes, L1, L2 and L3 suffer, plus the overhead of remote memory access until the data migrates.

However, this does take many minutes to get accurate results as the migration is not instant, because just detecting it is pretty complex and introduces memory management overhead.
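
As an illustration of the processor-affinity approach, here is a minimal Linux-only sketch in Python. It uses `os.sched_setaffinity` and `os.sched_getaffinity` from the standard library; the helper names are made up for this example, and a real engine would more likely pin each search thread from C or C++.

```python
import os


def pin_current_thread(cpu):
    """Pin the caller to a single CPU and return the previous affinity mask."""
    previous = os.sched_getaffinity(0)
    os.sched_setaffinity(0, {cpu})
    return previous


def demo():
    """Pin to the first allowed CPU, report the mask, then restore the old one."""
    if not hasattr(os, "sched_setaffinity"):
        return None  # The affinity calls are not available on this platform.
    first_cpu = min(os.sched_getaffinity(0))
    previous = pin_current_thread(first_cpu)
    pinned = os.sched_getaffinity(0)
    os.sched_setaffinity(0, previous)  # Undo the pinning for the demo.
    return first_cpu, pinned


result = demo()
print(result)
```

Once a thread is pinned this way the scheduler can no longer move it to another core, so the cache and remote-memory penalties from bouncing should not arise for that thread.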

bob · Post by **bob** » Tue Sep 20, 2016 8:47 pm

bob wrote:
petero2 wrote:
syzygy wrote:In this thread Peter Österlund reports some interesting results for Texel on a 2-node PC with 2 x 8 = 16 threads:
Auto  Awareness    Mn/s mean    Mn/s std
no    no           16.44        1.55
yes   no           15.22        1.67
no    yes          18.16        0.37
yes   yes          17.88        0.39
So Linux' automatic NUMA balancing feature hurts rather than helps, and Texel's own NUMA awareness increases speed by over 10%. (He wrote 14%, but if I take 18.16 vs 16.44 it is 10.46%.)
I took the average improvement for auto on and auto off, that is:

(18.16/16.44 + 17.88/15.22) / 2 = 1.1397

My assumption was that there is no real difference between auto on and auto off for a chess program, so taking the average should give a better estimate. I don't know if that assumption is correct though. More measurements would be required to find out.
This might be kernel-version specific. We ran a number of tests at UAB, using chess and non-chess applications. The newer NUMA-aware kernels did provide better performance overall. But I STILL found it better to use processor affinity. On our 20-core box, threads will STILL bounce a bit due to an occasional system task running. Bouncing has two effects: (1) obvious cache issues with the private (per-core) L1 and L2 caches and the shared (per NUMA node) L3 cache, and (2) remote memory access. So even if you bounce between cores on one node, L1 and L2 suffer. If you bounce between nodes, L1, L2 and L3 suffer, plus the overhead of remote memory access until the data migrates.

However, this does take many minutes to get accurate results as the migration is not instant, because just detecting it is pretty complex and introduces memory management overhead.

I should add that IIRC, you can tune this in the kernel via the /proc file system to determine how much overhead this will cost vs how rapidly it will migrate the data.

The approach was, I believe, to selectively clear the page valid bits on a block of pages, then let the program run. If it doesn't fault on any of those pages, they are obviously not needed very frequently. If it does fault on them multiple times, an attempt is made to migrate them to the local node if they are not already there.

Those are not real page faults, obviously, since no paging I/O is done and pages don't have to be moved from free/modified lists back to the resident set (a linking operation). But it is still overhead.
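
For anyone who wants to see what their own kernel exposes, here is a small Python sketch that lists the automatic NUMA balancing settings. It assumes they appear as files whose names start with `numa_balancing` under `/proc/sys/kernel`, which is where `numa_balancing` itself lives on kernels built with that feature. Other kernels may expose fewer files there or keep the scan tunables elsewhere, so the sketch does not hard-code individual names. The function name is illustrative.

```python
from pathlib import Path


def read_numa_balancing_tunables(sysctl_dir="/proc/sys/kernel"):
    """Return {file name: value} for numa_balancing* entries, or {} if none exist."""
    tunables = {}
    base = Path(sysctl_dir)
    if not base.is_dir():
        return tunables
    for path in sorted(base.glob("numa_balancing*")):
        try:
            tunables[path.name] = path.read_text().strip()
        except OSError:
            continue  # Unreadable entry; skip it rather than fail.
    return tunables


tunables = read_numa_balancing_tunables()
for name, value in tunables.items():
    print(f"{name} = {value}")
```

An empty result just means nothing matched on that machine, not necessarily that the hardware is not NUMA.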

syzygy · Post by **syzygy** » Tue Sep 20, 2016 9:05 pm

mcostalba wrote:
syzygy wrote: What Marco is really saying is that NUMA is a non-issue.
I say that NUMA (in case of less than 64 logical processors on Windows) is still to be proved sensibly stronger than ignoring NUMA.

syzygy wrote:That ignores the numbers reported by multiple chess engine programmers.
Can you please post link to these multiple test reports?

Sure.
Some Real Performance Data

Configuration               Best Split Depth   Average Node Speed   Speed Gain
Standard                    14                 13600 kN/s
With Large Pages            14                 14900 kN/s           +10%
With NUMA and Large Pages   12                 16200 kN/s           +20%

mcostalba wrote:In case of Peter we are talking of about 10% speed-up in case of his engine Texel. I don't think this can be blindly generalized to all engines,

Seems to me the one who was blindly generalising is you.

mcostalba wrote:for instance one of the latest patches in SF (that you know well) removed a scalability issue that alone is able to gain that 10% in case of many cores (32 cores were used for testing).

Which just serves to prove that hardware architecture is very much an issue that needs to be taken into account if performance is considered important.

mcostalba wrote:That's what I know (I mean numbers and tests, not words), if you have something else I would be happy to read them.

Yes, you do not believe in rational arguments. That's OK, because this is a thread for everybody who is interested in the topic.

On a NUMA system, accessing memory on the local node is faster than accessing memory on another node. This is not hand waving; it is the definition of NUMA.

So it follows directly from the technical definition of a NUMA system that on such a system the memory accessed by a search thread is ideally present on the node on which that search thread runs. Obviously this is not possible in case of the transposition table, which needs to be shared by all search threads, but it is possible for many other tables and data structures. Doing so on a machine that is dedicated to running a chess engine (i.e. does not have all kinds of other cpu- and/or memory-intensive jobs running at the same time) cannot harm and should be expected to improve performance at least to some extent. How much performance will be improved will of course depend on lots of factors.
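
The bookkeeping side of this can be sketched in a few lines of Python. The sketch reads the node-to-CPU map that Linux exposes under `/sys/devices/system/node/node*/cpulist`, spreads search threads over the nodes, and gives each node its own table. The function names, the table size and the 2 x 8 topology are hypothetical, and Python cannot control where memory is physically placed. A real engine would allocate each table from a thread already pinned to that node, or use a NUMA allocation library.

```python
from pathlib import Path


def parse_cpulist(text):
    """Expand a kernel cpulist string such as '0-3,8-11' into a sorted list."""
    cpus = []
    for part in text.strip().split(","):
        if not part:
            continue  # Memory-only nodes have an empty cpulist.
        if "-" in part:
            first, last = part.split("-")
            cpus.extend(range(int(first), int(last) + 1))
        else:
            cpus.append(int(part))
    return sorted(cpus)


def read_node_cpus(sysfs_root="/sys/devices/system/node"):
    """Return {node id: [cpu ids]} from sysfs, or {} if it is not available."""
    nodes = {}
    root = Path(sysfs_root)
    if not root.is_dir():
        return nodes
    for node_dir in root.glob("node[0-9]*"):
        cpulist = node_dir / "cpulist"
        if cpulist.is_file():
            nodes[int(node_dir.name[4:])] = parse_cpulist(cpulist.read_text())
    return nodes


def assign_threads(node_cpus, num_threads):
    """Return [(thread id, node, cpu)], filling CPUs node by node."""
    slots = [(node, cpu) for node in sorted(node_cpus) for cpu in node_cpus[node]]
    if not slots:
        return []
    return [(t, *slots[t % len(slots)]) for t in range(num_threads)]


# HYPOTHETICAL 2-node box with 8 CPUs per node, used if sysfs has no NUMA info.
node_cpus = read_node_cpus() or {0: list(range(8)), 1: list(range(8, 16))}
plan = assign_threads(node_cpus, num_threads=16)

# One table per node (size is arbitrary here); threads index it by their node.
tables = {node: [0] * 1024 for node in node_cpus}
for thread_id, node, cpu in plan:
    local_table = tables[node]
    print(f"thread {thread_id}: cpu {cpu}, node {node}, table id {id(local_table)}")
```

The transposition table stays out of this scheme because it has to be shared by all search threads, as noted above.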
