Houdini 6.03 Pro x64-pext
info string NUMA configuration with 2 nodes, offset 0
info string NUMA node 0 in group 0 processor mask 0000ffffffffffff
info string NUMA node 1 in group 1 processor mask 0000ffffffffffff
info string 48 cores with 96 logical processors detected
setoption name Threads value 46
info string 46 threads used
go movetime 10000
info multipv 1 depth 22 seldepth 43 score cp 26 time 10003 nodes 866774704 nps 86651000 tbhits 0 hashfull 1000 pv e2e4
bestmove e2e4 ponder e7e6
Houdini 6.03 Pro x64-pext
info string NUMA configuration with 2 nodes, offset 0
info string NUMA node 0 in group 0 processor mask 0000ffffffffffff
info string NUMA node 1 in group 1 processor mask 0000ffffffffffff
setoption name Threads value 92
info string 92 threads used
go movetime 10000
info multipv 1 depth 22 seldepth 42 score cp 25 time 10008 nodes 1160911796 nps 115998000 tbhits 0 hashfull 1000 pv d2d4
bestmove d2d4 ponder d7d5
CPUs: 2 x Intel Xeon Platinum 8168 @ 2.70 GHz 33 MB L3
CORES: 48 physical (96 logical)
RAM: 256GB DDR4-2666 ECC Registered RDIMM
SSD: 2x Crucial MX300 (1TB) in RAID1
OS: Windows Server 2016
Nice job Andrew-
your NUMA implementation for 10.88 looks very good
Hyperthreading is enabled, so it seems that all we're seeing here is that running 92 threads on 48 physical cores brings a very small gain in nps over running 46 threads on 48 physical cores. It appears all that NUMA is doing here is correcting for Windows, which would otherwise run the 92 threads on 24 physical cores, i.e., on a single "processor group". I wonder if Houdini is making some further use of NUMA awareness, which might explain its slightly greater gain than the other engines.
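For anyone wanting to check the "24 physical cores per group" figure, here is a minimal sketch that decodes the processor mask printed in the Houdini log. The mask string is copied from the log; the helper name `logical_processors` and the assumption of two logical processors per physical core (Hyper-Threading) are mine.

```python
def logical_processors(mask_hex: str) -> int:
    """Count the set bits in a hexadecimal processor affinity mask."""
    return bin(int(mask_hex, 16)).count("1")


mask = "0000ffffffffffff"   # copied from the "NUMA node ... processor mask" log lines
groups = 2                  # the log reports 2 NUMA nodes, one per processor group
threads_per_core = 2        # assumption: Hyper-Threading, 2 logical processors per core

per_group = logical_processors(mask)
physical_per_group = per_group // threads_per_core

print("logical processors per group:", per_group)
print("physical cores per group:", physical_per_group)
print("logical processors in total:", per_group * groups)
```

Each set bit in the mask stands for one logical processor, so the count agrees with the "48 cores with 96 logical processors detected" line in the log.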
Would be interesting to see the gain going from 46 to 92 threads on a machine with more than 92 physical cores, especially one running Linux. Need someone with extremely big hardware.
I don't understand. My engine doesn't have any NUMA code at all. To run on a system with 2 processors of 6 cores each, it doesn't seem to need NUMA code. Does it need some NUMA-awareness code? How much do you think it would gain? Isn't it the OS's job to allocate threads evenly across the 2 CPUs? Or is it a memory issue?
elcabesa wrote: ↑Fri Aug 24, 2018 6:12 pm
I don't understand. My engine doesn't have any NUMA code at all. To run on a system with 2 processors of 6 cores each, it doesn't seem to need NUMA code. Does it need some NUMA-awareness code? How much do you think it would gain? Isn't it the OS's job to allocate threads evenly across the 2 CPUs? Or is it a memory issue?
The only major issue of concern here seems to be related to Windows running on large hardware, i.e., machines with more than 64 logical processors. On such systems, Windows by default will force all threads of a process onto a single processor group. For such hardware, the engine needs "NUMA-awareness", essentially to compensate for Windows.
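To make the processor-group point concrete, here is a small simulation of the difference between leaving every thread in one group and spreading threads across groups. This is only a sketch under my own assumptions: `spread_threads` is a hypothetical helper using plain round-robin, not the code from Texel, Stockfish, Houdini or Ethereal. A real engine on Windows would also have to bind each thread to its group, for example with the Win32 call SetThreadGroupAffinity.

```python
from collections import Counter
from typing import List

GROUP_LIMIT = 64  # a Windows processor group holds at most 64 logical processors


def spread_threads(num_threads: int, group_sizes: List[int]) -> List[int]:
    """Hypothetical helper: choose a processor group for each thread, round-robin."""
    if any(size > GROUP_LIMIT for size in group_sizes):
        raise ValueError("a processor group cannot exceed 64 logical processors")
    return [i % len(group_sizes) for i in range(num_threads)]


group_sizes = [48, 48]   # two groups of 48 logical processors, as in the Houdini log

# Without any group-aware code, all threads share a single group
# (group 0 is used here purely for illustration).
default = [0] * 92
spread = spread_threads(92, group_sizes)

print("default:", dict(Counter(default)))
print("spread: ", dict(Counter(spread)))
```

In the default case all 92 threads compete for the 48 logical processors of one group; with the round-robin spread each group gets 46 threads.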
Yes, normally with anything over 64 threads (logical processors in Windows-speak), the application will only benefit if NUMA support (support for multiple physical processors) is added. At this time I believe only Stockfish, Houdini, and now very recently Ethereal have added code for this.
kranium wrote: ↑Fri Aug 24, 2018 4:12 pm
Nice job Andrew-
your NUMA implementation for 10.88 looks quite competitive
and overall NPS is significantly improved
Thank you for testing this -- 64+ core Windows machines are certainly a rarity.
Credit of course to Peter O (Texel), since this is his invention, as used in Stockfish.
Cheers, Andrew Grant
#WeAreAllDraude #JusticeForDraude #RememberDraude #LeptirBigUltra "Those who can't do, clone instead" - Eduard ( A real life friend, not this forum's Eduard )
AndrewGrant wrote: ↑Fri Aug 24, 2018 10:33 pm
Thank you for testing this -- 64+ core Windows machines are certainly a rarity.
Credit of course to Peter O (Texel), since this is his invention, as used in Stockfish.
Cheers, Andrew Grant
It is originally Peter O's code, which was improved in Brainfish and then simply merged back into SF.
And Peter O found most of the "inspiration" for his code in MSDN example code.
Dokterchen wrote: ↑Sat Aug 25, 2018 11:18 am
NPS goes up but depth goes down!?
So, there are a couple of things at play here...
1) This is a sample of exactly one 10-second search per thread count.
2) Each individual thread is running slower here, but the threads are faster as a collective.
3) In general, for LazySMP engines, more cores do not increase depth as much; more often they increase the quality of the depth. This is due to things like less aggressive pruning when you have a TT move, extending TT moves, overall better move ordering, ...
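The "slower per thread, faster as a collective" point can be checked against the Houdini numbers posted earlier in the thread. A quick sketch, with the nps values copied from the two `info` lines (the variable names are mine):

```python
# threads -> total nps, copied from the two Houdini "info" lines
runs = {46: 86_651_000, 92: 115_998_000}

# Average nodes per second contributed by each thread
per_thread = {threads: nps / threads for threads, nps in runs.items()}

# Overall gain from doubling the thread count
speedup = runs[92] / runs[46]

for threads, nps in runs.items():
    print(f"{threads} threads: {nps:,} nps total, {per_thread[threads]:,.0f} nps per thread")
print(f"total nps gain going from 46 to 92 threads: {speedup:.2f}x")
```

Per-thread speed drops from about 1.88M to about 1.26M nps, while total nps rises by roughly a third. With only one 10-second run per thread count, these figures are indicative at best.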
Well, these are single ten-second runs, and the PV for each 46-thread search is different from the PV for the corresponding 92-thread search. So comparing the depths is probably meaningless. Also, bear in mind that all these engines are using Lazy SMP, where it seems that depth is somehow less important than it is for other methods of parallel search.
[EDIT] See above for a more informed answer that appeared while I was typing mine!