Ethereal 10.88 NUMA

kranium · Post by **kranium** » Fri Aug 24, 2018 4:12 pm

NUMA thread scaling across multiple processors

Houdini 6.03 Pro x64-pext
info string NUMA configuration with 2 nodes, offset 0
info string NUMA node 0 in group 0 processor mask 0000ffffffffffff
info string NUMA node 1 in group 1 processor mask 0000ffffffffffff
info string 48 cores with 96 logical processors detected
setoption name Threads value 46
info string 46 threads used
go movetime 10000
info multipv 1 depth 22 seldepth 43 score cp 26 time 10003 nodes 866774704 nps 86651000 tbhits 0 hashfull 1000 pv e2e4
bestmove e2e4 ponder e7e6

Houdini 6.03 Pro x64-pext
info string NUMA configuration with 2 nodes, offset 0
info string NUMA node 0 in group 0 processor mask 0000ffffffffffff
info string NUMA node 1 in group 1 processor mask 0000ffffffffffff
setoption name Threads value 92
info string 92 threads used
go movetime 10000
info multipv 1 depth 22 seldepth 42 score cp 25 time 10008 nodes 1160911796 nps 115998000 tbhits 0 hashfull 1000 pv d2d4
bestmove d2d4 ponder d7d5

+34%

Ethereal 10.88 (PEXT)
setoption name Threads value 46
info string set Threads to 46
go movetime 10000
info depth 25 seldepth 35 score cp 35 time 8781 nodes 846830267 nps 96427000 tbhits 0 hashfull 1000 pv d2d4 g8f6 c2c4 e7e6 g1f3 f8b4 c1d2 b4d2 b1d2 e8g8 e2e3 d7d6 f1d3 b7b6 d1c2 c8b7 a1d1 h7h6 e1g1 c7c5 a2a3 b8d7 h2h3 a8c8 d2e4 b7e4 d3e4
bestmove d2d4 ponder g8f6

Ethereal 10.88 (PEXT)
setoption name Threads value 92
info string set Threads to 92
go movetime 10000
info depth 24 seldepth 35 score cp 38 time 8422 nodes 1054606013 nps 125205000 tbhits 0 hashfull 1000 pv d2d4 e7e6 g1f3 c7c5 b1c3 c5d4 f3d4 g8f6 e2e4 f8b4 e4e5 f6e4 d4b5 a7a6 d1d4 b8c6 d4e4 a6b5 c1d2 d8b6 a2a3 b4c3 d2c3 e8g8 c3b4 c6b4 e4b4
bestmove d2d4 ponder e7e6

+30%

Stockfish 130818 64 BMI2
setoption name Threads value 46
go movetime 10000
info depth 27 seldepth 34 multipv 1 score cp 45 nodes 709112433 nps 70904152 hashfull 999 tbhits 0 time 10001 pv d2d4 e7e6
bestmove d2d4 ponder e7e6

Stockfish 130818 64 BMI2
setoption name Threads value 92
go movetime 10000
info depth 26 seldepth 39 multipv 1 score cp 46 nodes 913860776 nps 91358669 hashfull 999 tbhits 0 time 10003 pv d2d4 g8f6 g1f3 e7e6 c2c4 d7d5 e2e3 f8e7 f1d3 d5c4 d3c4 e8g8 e1g1 c7c5 d4c5 e7c5 c1d2 b7b6 d1c2 c8b7 f1d1 b8d7 d2c3 d8c7 c4d3 c5d6 b1d2
bestmove d2d4 ponder g8f6

+29%
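
For anyone who wants to check the percentages, here is a minimal Python sketch that pulls the `nps` field out of a UCI `info` line and computes the gain from 46 to 92 threads. The helper names (`nps_from_info`, `gain_percent`) are made up for illustration; the nps values are copied from the logs above.

```python
import re


def nps_from_info(line: str) -> int:
    """Extract the nps field from a UCI 'info' line."""
    match = re.search(r"\bnps (\d+)", line)
    if match is None:
        raise ValueError("no nps field in line")
    return int(match.group(1))


def gain_percent(nps_low: int, nps_high: int) -> float:
    """Relative nps gain, in percent, going from nps_low to nps_high."""
    return (nps_high / nps_low - 1.0) * 100.0


# (nps at 46 threads, nps at 92 threads), taken from the info lines above
RESULTS = {
    "Houdini 6.03": (86651000, 115998000),
    "Ethereal 10.88": (96427000, 125205000),
    "Stockfish 130818": (70904152, 91358669),
}

for engine, (nps46, nps92) in RESULTS.items():
    print(f"{engine}: {gain_percent(nps46, nps92):+.0f}%")
```

Rounded to whole percent, this reproduces the +34%, +30% and +29% figures. Keep in mind these are single 10-second runs, so the numbers are only a rough indication.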

CPUs: 2 x Intel Xeon Platinum 8168 @ 2.70 GHz 33 MB L3
CORES: 48 physical (96 logical)
RAM: 256GB DDR4-2666 ECC Registered RDIMM
SSD: 2x Crucial MX300 (1TB) in RAID1
OS: Windows Server 2016

Nice job Andrew-
your NUMA implementation for 10.88 looks quite competitive
and overall NPS is significantly improved

zullil · Post by **zullil** » Fri Aug 24, 2018 4:41 pm

kranium wrote: ↑Fri Aug 24, 2018 4:12 pm NUMA thread scaling across multiple processors

Houdini 6.03 Pro x64-pext: +34% (+17 Elo)
Ethereal 10.88 (PEXT): +30% (+15 Elo)
Stockfish 130818 64 BMI2: +29% (+15 Elo)

Nice job Andrew-
your NUMA implementation for 10.88 looks very good

Hyperthreading is enabled, so it seems that all we're seeing here is that running 92 threads on 48 physical cores brings only a modest gain in nps (roughly 30%) over running 46 threads on 48 physical cores. It appears all that NUMA is doing here is correcting for Windows, which would otherwise run the 92 threads on 24 physical cores, i.e., on a single "processor group". I wonder if Houdini is making some further use of NUMA awareness, which might explain its slightly greater gain than the other engines.
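
The "24 physical cores per processor group" figure can be read off the processor masks in the Houdini log. A small sketch, assuming each set bit in the mask is one logical processor and that hyperthreading gives two logical processors per physical core (`THREADS_PER_CORE` is an assumption, not something the log states per group):

```python
def logical_processors(mask_hex: str) -> int:
    """Count the set bits in a hexadecimal processor mask."""
    return bin(int(mask_hex, 16)).count("1")


# Masks as reported by Houdini for the two NUMA nodes / processor groups
MASKS = {0: "0000ffffffffffff", 1: "0000ffffffffffff"}

# Assumption: hyperthreading, two logical processors per physical core
THREADS_PER_CORE = 2

for group, mask in MASKS.items():
    logical = logical_processors(mask)
    physical = logical // THREADS_PER_CORE
    print(f"group {group}: {logical} logical, {physical} physical cores")

total_logical = sum(logical_processors(m) for m in MASKS.values())
print(f"total: {total_logical} logical processors")
```

Each mask has 48 bits set, so each group holds 48 logical processors, which is consistent with the "48 cores with 96 logical processors detected" line.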

Would be interesting to see the gain going from 46 to 92 threads on a machine with more than 92 physical cores, especially one running Linux. Need someone with extremely big hardware.

elcabesa · Post by **elcabesa** » Fri Aug 24, 2018 6:12 pm

I don't understand. My engine has no NUMA code at all. Running on a system with 2 processors of 6 cores each, it doesn't seem to need any. Does it need some NUMA-awareness code? How much do you think it would gain? Isn't it the OS's job to allocate threads evenly across the 2 CPUs? Or is it a memory issue?

zullil · Post by **zullil** » Fri Aug 24, 2018 6:24 pm

elcabesa wrote: ↑Fri Aug 24, 2018 6:12 pm I don't understand. My engine has no NUMA code at all. Running on a system with 2 processors of 6 cores each, it doesn't seem to need any. Does it need some NUMA-awareness code? How much do you think it would gain? Isn't it the OS's job to allocate threads evenly across the 2 CPUs? Or is it a memory issue?

The only major issue of concern here seems to be related to Windows running on large hardware---machines with more than 64 logical cores. On such systems, Windows by default will force all threads onto a single processor group. For such hardware, the engine needs "NUMA-awareness", essentially to compensate for Windows.

More here: https://docs.microsoft.com/en-us/window ... sor-groups
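
To make the effect concrete, here is a toy Python model of the two situations on kranium's machine (two groups of 48 logical processors). It is not the Windows scheduler and not any engine's actual code: `assign_fill_groups` is a hypothetical "fill one group, then the next" policy, and a real engine would pin each thread with the Windows `SetThreadGroupAffinity` call rather than just count them.

```python
from collections import Counter

# Logical processors per processor group, as in the Houdini log
GROUP_SIZES = [48, 48]


def assign_single_group(num_threads: int) -> Counter:
    """Default behaviour described above: every thread lands in group 0."""
    return Counter({0: num_threads})


def assign_fill_groups(num_threads: int, group_sizes: list) -> Counter:
    """Hypothetical group-aware policy: fill group 0, then group 1, and so on.

    If there are more threads than logical processors, wrap around.
    """
    counts = Counter()
    total = sum(group_sizes)
    for idx in range(num_threads):
        slot = idx % total
        for group, size in enumerate(group_sizes):
            if slot < size:
                counts[group] += 1
                break
            slot -= size
    return counts


def max_load(counts: Counter, group_sizes: list) -> float:
    """Highest threads-per-logical-processor ratio over the groups in use."""
    return max(counts[g] / group_sizes[g] for g in counts)


for name, counts in [
    ("single group", assign_single_group(92)),
    ("group-aware", assign_fill_groups(92, GROUP_SIZES)),
]:
    print(name, dict(counts), f"max load {max_load(counts, GROUP_SIZES):.2f}")
```

In this model, 92 threads confined to one group means almost two threads per logical processor, while the group-aware assignment never exceeds one thread per logical processor.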

kranium · Post by **kranium** » Fri Aug 24, 2018 6:34 pm

zullil wrote: ↑Fri Aug 24, 2018 6:24 pm
elcabesa wrote: ↑Fri Aug 24, 2018 6:12 pm I don't understand. My engine has no NUMA code at all. Running on a system with 2 processors of 6 cores each, it doesn't seem to need any. Does it need some NUMA-awareness code? How much do you think it would gain? Isn't it the OS's job to allocate threads evenly across the 2 CPUs? Or is it a memory issue?
The only major issue of concern here seems to be related to Windows running on large hardware---machines with more than 64 logical cores. On such systems, Windows by default will force all threads onto a single processor group. For such hardware, the engine needs "NUMA-awareness", essentially to compensate for Windows.

More here: https://docs.microsoft.com/en-us/window ... sor-groups

Yes, normally with anything over 64 threads (logical processors in WindowsSpeak), the application should only benefit if NUMA (support for multiple physical processors) is added. At this time I believe only Stockfish, Houdini, and now very recently Ethereal have added code for this.

https://docs.microsoft.com/en-us/window ... ma-support

AndrewGrant · Post by **AndrewGrant** » Fri Aug 24, 2018 10:33 pm

kranium wrote: ↑Fri Aug 24, 2018 4:12 pm Nice job Andrew-
your NUMA implementation for 10.88 looks quite competitive
and overall NPS is significantly improved

Thank you for testing this -- 64+ core Windows machines are certainly a rarity.

Credit of course to Peter O (Texel), since this is his invention, as used in Stockfish.

Cheers, Andrew Grant

Milos · Post by **Milos** » Fri Aug 24, 2018 11:55 pm

AndrewGrant wrote: ↑Fri Aug 24, 2018 10:33 pm Thank you for testing this -- 64+ core Windows machines are certainly a rarity.

Credit of course to Peter O (Texel), since this is his invention, as used in Stockfish.

Cheers, Andrew Grant

It is originally Peter O's code, which was improved in Brainfish and then simply merged back into SF.
And Peter O found most of the "inspiration" for his code in MSDN example code.

Dokterchen · Post by **Dokterchen** » Sat Aug 25, 2018 11:18 am

kranium wrote: ↑Fri Aug 24, 2018 4:12 pm NUMA thread scaling across multiple processors

NPS goes up but depth goes down!?

AndrewGrant · Post by **AndrewGrant** » Sat Aug 25, 2018 11:32 am

Dokterchen wrote: ↑Sat Aug 25, 2018 11:18 am NPS goes up but depth goes down!?

So, there are a couple things at play here ....

1) This is a sample of exactly one 10 second search each
2) Each thread is running slower here, but the threads are faster as a collective.
3) In general, for Lazy SMP engines, more cores do not increase depth as much; they more often increase the quality of the depth. This is due to things like less aggressive pruning when you have a TT move, extending TT moves, overall better move ordering, ...
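
Point 2 can be checked against kranium's numbers with a short Python sketch. It simply divides the reported total nps by the thread count, which assumes every thread contributes equally (a simplification); `per_thread_nps` is a made-up helper name.

```python
def per_thread_nps(total_nps: int, threads: int) -> float:
    """Average nps per thread, assuming all threads contribute equally."""
    return total_nps / threads


# (nps at 46 threads, nps at 92 threads), from the logs earlier in the thread
RESULTS = {
    "Houdini 6.03": (86651000, 115998000),
    "Ethereal 10.88": (96427000, 125205000),
    "Stockfish 130818": (70904152, 91358669),
}

for engine, (nps46, nps92) in RESULTS.items():
    slow = per_thread_nps(nps92, 92)
    fast = per_thread_nps(nps46, 46)
    print(
        f"{engine}: {fast:,.0f} nps/thread at 46 threads, "
        f"{slow:,.0f} nps/thread at 92 threads "
        f"({slow / fast:.0%} of the 46-thread speed)"
    )
```

For all three engines the per-thread figure drops at 92 threads even though the total goes up, which is what you would expect when most physical cores are shared by two hyperthreads.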

zullil · Post by **zullil** » Sat Aug 25, 2018 11:36 am

Dokterchen wrote: ↑Sat Aug 25, 2018 11:18 am

NPS goes up but depth goes down!?

Well, these are single ten-second runs, and the PV for each 46-thread search is different from the PV for the corresponding 92-thread search. So comparing the depths is probably meaningless. Also, bear in mind that all these engines are using Lazy SMP, where it seems that depth is somehow less important than it is for other methods of parallel search.

[EDIT] See above for a more informed answer that appeared while I was typing mine!
