Some NUMA data for Stockfish-dev and Cfish-dev

zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Some NUMA data for Stockfish-dev and Cfish-dev

Post by zullil »

Summary: Stockfish's NUMA performance seems much worse than Cfish's, which likely results in suboptimal performance by Stockfish on high-end systems (like the one at TCEC, for example).

Typical data appear below. Stockfish's ratio of local to remote DRAM accesses was 1.0 (155 M local vs. 154 M remote), while Cfish's was 1.8 (113 M local vs. 63 M remote).

$ sudo ./pcm-numa.x -- Stockfish/src/stockfish bench 32768 20 15

 Processor Counter Monitor: NUMA monitoring utility 

IBRS and IBPB supported  : yes
STIBP supported          : yes
Spec arch caps supported : no
Number of physical cores: 20
Number of logical cores: 40
Number of online logical cores: 40
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 10
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 3100000000 Hz
IBRS enabled in the kernel   : no
STIBP enabled in the kernel  : no
Disabling NMI watchdog since it consumes one hw-PMU counter.
Package thermal spec power: 160 Watt; Package minimum power: 51 Watt; Package maximum power: 320 Watt; 
Socket 0: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected. 0 M2M (mesh to memory) blocks detected.
Socket 1: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected. 0 M2M (mesh to memory) blocks detected.
Trying to use Linux perf events...
Successfully programmed on-core PMU using Linux perf
Socket 0
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)
Socket 1
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)

Detected Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz "Intel(r) microarchitecture codename Haswell-EP/EN/EX" stepping 2 microcode level 0x43
Update every 1.0 seconds

Executing "Stockfish/src/stockfish" command:
Stockfish 140619 64 BMI2 by T. Romstad, M. Costalba, J. Kiiski, G. Linscott

[bench output snipped]

===========================
Total time (ms) : 7092
Nodes searched  : 68440190
Nodes/second    : 9650336
DEBUG: caught signal to interrupt (Child exited).
Program Stockfish/src/stockfish launched with PID: 4f8a
Program exited with status 0
Time elapsed: 9803 ms
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses 
   0   0.58         11 G       20 G      7190 K              8045 K              
   1   0.61         11 G       18 G      5861 K              7463 K              
   2   0.61         10 G       17 G      6128 K              8213 K              
   3   0.61         10 G       16 G      5554 K              7458 K              
   4   0.61       9512 M       15 G      4589 K              6401 K              
   5   0.64       8872 M       13 G      4508 K              7049 K              
   6   0.69       7852 M       11 G      3631 K              6492 K              
   7   0.67       7468 M       11 G      3747 K              5454 K              
   8   0.75       6727 M     8926 M      3103 K              5360 K              
   9   0.16        699 M     4254 M      1773 K               223 K              
  10   0.31        371 M     1179 M       337 K               258 K              
  11   0.29        872 M     2995 M       784 K               611 K              
  12   0.38       1151 M     3026 M       892 K               776 K              
  13   0.32       1605 M     5032 M      1277 K              1056 K              
  14   0.62       8970 M       14 G      5169 K              5108 K              
  15   0.28       2852 M       10 G      2402 K              2199 K              
  16   0.29       3433 M       11 G      2915 K              2347 K              
  17   0.31       3981 M       12 G      3208 K              2719 K              
  18   0.38       3798 M       10 G      2560 K              2218 K              
  19   0.42       8157 M       19 G      4689 K              4433 K              
  20   0.94        941 M     1006 M      1800 K               210 K              
  21   0.18       5459 K       29 M        39 K                29 K              
  22   0.27         15 M       57 M        74 K                70 K              
  23   0.26         26 M       99 M       148 K                70 K              
  24   0.31       8031 K       25 M        20 K              3740                
  25   0.17        685 M     4135 M      2038 K                36 K              
  26   0.25        163 M      654 M      1897 K               560 K              
  27   0.33         23 M       71 M        99 K                49 K              
  28   0.20       1387 M     6780 M      3821 K               376 K              
  29   0.46        111 M      244 M       621 K               203 K              
  30   0.57         12 G       21 G      7062 K              6910 K              
  31   0.58         12 G       21 G      6876 K              6867 K              
  32   0.58         12 G       21 G      6820 K              6717 K              
  33   0.57         12 G       21 G      6912 K              6463 K              
  34   0.35       4503 M       12 G      3546 K              2914 K              
  35   0.91         17 G       19 G        20 M                16 M              
  36   0.77         10 G       14 G      5006 K              5596 K              
  37   0.74         10 G       14 G      5357 K              5402 K              
  38   0.51         11 G       21 G      7316 K              6494 K              
  39   0.39       8709 M       22 G      5458 K              4940 K              
-------------------------------------------------------------------------------------------------------------------
   *   0.55        236 G      433 G       155 M               154 M              

Cleaning up
 Zeroed uncore PMU registers
 Freeing up all RMIDs
 Re-enabling NMI watchdog.

$ sudo ./pcm-numa.x -- Cfish/src/cfish bench 32768 20 15

 Processor Counter Monitor: NUMA monitoring utility 

IBRS and IBPB supported  : yes
STIBP supported          : yes
Spec arch caps supported : no
Number of physical cores: 20
Number of logical cores: 40
Number of online logical cores: 40
Threads (logical cores) per physical core: 2
Num sockets: 2
Physical cores per socket: 10
Core PMU (perfmon) version: 3
Number of core PMU generic (programmable) counters: 4
Width of generic (programmable) counters: 48 bits
Number of core PMU fixed counters: 3
Width of fixed counters: 48 bits
Nominal core frequency: 3100000000 Hz
IBRS enabled in the kernel   : no
STIBP enabled in the kernel  : no
Disabling NMI watchdog since it consumes one hw-PMU counter.
Package thermal spec power: 160 Watt; Package minimum power: 51 Watt; Package maximum power: 320 Watt; 
Socket 0: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected. 0 M2M (mesh to memory) blocks detected.
Socket 1: 2 memory controllers detected with total number of 4 channels. 2 QPI ports detected. 0 M2M (mesh to memory) blocks detected.
Trying to use Linux perf events...
Successfully programmed on-core PMU using Linux perf
Socket 0
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)
Socket 1
Max QPI link 0 speed: 19.2 GBytes/second (9.6 GT/second)
Max QPI link 1 speed: 19.2 GBytes/second (9.6 GT/second)

Detected Intel(R) Xeon(R) CPU E5-2687W v3 @ 3.10GHz "Intel(r) microarchitecture codename Haswell-EP/EN/EX" stepping 2 microcode level 0x43
Update every 1.0 seconds

Executing "Cfish/src/cfish" command:
Cfish 160619 64 BMI2 NUMA by Syzygy based on Stockfish
info string NUMA enabled.
info string Binding thread 0 to node 0.
info string Binding thread 1 to node 0.
info string Binding thread 2 to node 0.
info string Binding thread 3 to node 0.
info string Binding thread 4 to node 0.
info string Binding thread 5 to node 0.
info string Binding thread 6 to node 0.
info string Binding thread 7 to node 0.
info string Binding thread 8 to node 0.
info string Binding thread 9 to node 0.
info string Binding thread 10 to node 1.
info string Binding thread 11 to node 1.
info string Binding thread 12 to node 1.
info string Binding thread 13 to node 1.
info string Binding thread 14 to node 1.
info string Binding thread 15 to node 1.
info string Binding thread 16 to node 1.
info string Binding thread 17 to node 1.
info string Binding thread 18 to node 1.
info string Binding thread 19 to node 1.

[bench output snipped]

===========================
Total time (ms) : 1664
Nodes searched  : 55178288
Nodes/second    : 33160028
DEBUG: caught signal to interrupt (Child exited).
Program Cfish/src/cfish launched with PID: 530c
Program exited with status 0
Time elapsed: 7283 ms
Core | IPC  | Instructions | Cycles  |  Local DRAM accesses | Remote DRAM Accesses 
   0   0.77       6462 M     8391 M      3564 K              2585 K              
   1   0.11        913 M     8350 M      2224 K               174 K              
   2   0.46       7125 M       15 G      4889 K              2205 K              
   3   0.47       7255 M       15 G      5099 K              2382 K              
   4   0.45       7028 M       15 G      4877 K              2198 K              
   5   0.47       7203 M       15 G      4857 K              2223 K              
   6   0.46       7169 M       15 G      4823 K              2189 K              
   7   0.46       7212 M       15 G      4691 K              2036 K              
   8   0.48       7232 M       15 G      5040 K              2326 K              
   9   0.46       7226 M       15 G      5092 K              2247 K              
  10   0.14         49 M      351 M       196 K                52 K              
  11   0.13         40 M      314 M       122 K                66 K              
  12   0.46       7233 M       15 G      4214 K              2619 K              
  13   0.20         15 M       76 M       132 K                87 K              
  14   0.46       7273 M       15 G      4324 K              2803 K              
  15   0.46       7286 M       15 G      4321 K              2759 K              
  16   0.46       7292 M       15 G      4205 K              2637 K              
  17   0.46       7181 M       15 G      4110 K              2557 K              
  18   0.45       7207 M       15 G      4178 K              2616 K              
  19   0.45       7198 M       15 G      4197 K              2681 K              
  20   0.11        911 M     8044 M      2230 K               107 K              
  21   1.08         12 G       11 G        13 M                12 M              
  22   0.19        130 M      677 M      1814 K               576 K              
  23   0.13         72 M      544 M      1130 K               590 K              
  24   0.78        612 M      787 M      1583 K               290 K              
  25   0.36         79 M      223 M       269 K                92 K              
  26   0.39         75 M      195 M       198 K                55 K              
  27   0.38         34 M       88 M        95 K                24 K              
  28   0.22        270 M     1242 M      2350 K               996 K              
  29   0.18         72 M      412 M      1023 K               312 K              
  30   0.46       7255 M       15 G      4196 K              2615 K              
  31   0.46       7257 M       15 G      4181 K              2607 K              
  32   0.21         63 M      300 M       604 K               468 K              
  33   0.46       7262 M       15 G      4163 K              2603 K              
  34   0.20         16 M       80 M       141 K               104 K              
  35   0.21         14 M       70 M       122 K                80 K              
  36   0.18         18 M      102 M       214 K               107 K              
  37   0.15         27 M      182 M       348 K               253 K              
  38   0.19         15 M       79 M       156 K                62 K              
  39   0.35         56 M      162 M       370 K               189 K              
-------------------------------------------------------------------------------------------------------------------
   *   0.47        152 G      324 G       113 M                63 M              

Cleaning up
 Zeroed uncore PMU registers
 Freeing up all RMIDs
 Re-enabling NMI watchdog.
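
For anyone who wants to check the local/remote ratios against the tables, here is a minimal Python sketch of the arithmetic. It only uses the totals from the "*" rows of the two pcm-numa tables (in millions of accesses, as printed); the names `totals` and `local_remote_ratio` are made up for illustration.

```python
# Totals copied from the "*" rows of the two pcm-numa tables,
# in millions of DRAM accesses as printed by the tool.
totals = {
    "Stockfish": {"local": 155, "remote": 154},
    "Cfish": {"local": 113, "remote": 63},
}


def local_remote_ratio(local, remote):
    """Return the ratio of local to remote DRAM accesses."""
    return local / remote


ratios = {
    name: local_remote_ratio(t["local"], t["remote"])
    for name, t in totals.items()
}

for name, ratio in ratios.items():
    # One decimal place, matching the precision quoted in the post.
    print(f"{name}: local/remote = {ratio:.1f}")
```

Since the totals are rounded to whole millions in the tool's output, the ratios are only good to about one decimal place.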
NUMA topology:

$ numactl -H
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 0 size: 32112 MB
node 0 free: 30551 MB
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
node 1 size: 32230 MB
node 1 free: 30749 MB
node distances:
node   0   1 
  0:  10  21 
  1:  21  10 
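
To read the per-core tables against this topology, it helps to know which node each logical CPU belongs to. The following is a rough Python sketch that parses the `node N cpus:` lines of the `numactl -H` output above into a lookup table. It assumes that the "Core" column in the pcm-numa tables uses the same logical CPU numbering as numactl, which I have not verified; the function names are illustrative only.

```python
import re

# The "cpus" lines of the `numactl -H` output posted above
# (size, free and distance lines are omitted here).
NUMACTL_OUTPUT = """\
available: 2 nodes (0-1)
node 0 cpus: 0 1 2 3 4 5 6 7 8 9 20 21 22 23 24 25 26 27 28 29
node 1 cpus: 10 11 12 13 14 15 16 17 18 19 30 31 32 33 34 35 36 37 38 39
"""


def parse_node_cpus(text):
    """Map each NUMA node number to the list of logical CPUs it owns."""
    nodes = {}
    for line in text.splitlines():
        match = re.match(r"node (\d+) cpus:\s*(.*)", line)
        if match:
            nodes[int(match.group(1))] = [int(c) for c in match.group(2).split()]
    return nodes


def cpu_to_node(nodes):
    """Invert the node -> CPUs map into a CPU -> node lookup."""
    return {cpu: node for node, cpus in nodes.items() for cpu in cpus}


node_cpus = parse_node_cpus(NUMACTL_OUTPUT)
lookup = cpu_to_node(node_cpus)

# Example: which node do logical CPUs 9, 10, 20 and 30 belong to?
print([lookup[cpu] for cpu in (9, 10, 20, 30)])
```

On this machine, logical CPUs 0-9 and 20-29 are on node 0, and 10-19 and 30-39 are on node 1.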
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: Some NUMA data for Stockfish-dev and Cfish-dev

Post by syzygy »

Cfish's nps was more than 3x that of Stockfish? (Looking at "Time elapsed", maybe not.)

I suppose the difference in local/remote ratio is due to Cfish making sure that the memory for each thread is allocated on the node on which the thread runs.
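
To illustrate the general idea (this is not Cfish's actual code, which is written in C and not shown in this thread), here is a small Python sketch. It assigns threads to nodes in contiguous blocks, which reproduces the "Binding thread N to node M" lines in the log above, and shows one way a thread could be restricted to a node's CPUs on Linux. Under Linux's default local allocation policy, memory that a pinned thread touches first is normally placed on that thread's node. The names `node_for_thread` and `pin_current_thread` are hypothetical.

```python
import os


def node_for_thread(thread_index, num_threads, num_nodes):
    """Block-assign threads to nodes: the first block of threads goes to
    node 0, the next block to node 1, and so on (an assumed scheme)."""
    per_node = -(-num_threads // num_nodes)  # ceiling division
    return thread_index // per_node


def pin_current_thread(cpus):
    """Restrict the caller to the given logical CPUs.

    Linux only: os.sched_setaffinity does not exist on every platform.
    Passing 0 as the first argument means "the caller".
    """
    os.sched_setaffinity(0, cpus)


# 20 search threads on a 2-node machine, as in the bench run above.
assignment = [node_for_thread(i, 20, 2) for i in range(20)]
print(assignment)
```

A worker thread would call `pin_current_thread` with the CPU list of its node before allocating and touching its own data. The sketch above only computes the assignment and does not pin anything.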
zullil

Re: Some NUMA data for Stockfish-dev and Cfish-dev

Post by zullil »

syzygy wrote: Sat Jun 22, 2019 4:52 pm
Cfish's nps was more than 3x that of Stockfish? (Looking at "Time elapsed", maybe not.)

I suppose the difference in local/remote ratio is due to Cfish making sure that the memory for each thread is allocated on the node on which the thread runs.

Here is some recent speed data. Since the two fishes are again in sync, I'll retest using the current development versions.

[EDIT] Same result with the latest development versions. Cfish-dev is about 35% faster than Stockfish-dev using 20 threads, as measured by the command $ ./*fish bench 4096 20 18 >> /dev/null

About 34 Mnps for Cfish-dev and 25 Mnps for Stockfish-dev. Both were compiled with gcc-9 (or g++-9) using make profile-build ARCH=x86-64-bmi2. I added -march=native to Stockfish-dev's optimization flags, since Cfish-dev uses that flag by default.

Something seems very off with Stockfish's performance. I don't recall there being such a disparity in the past.
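
To put the different speed figures side by side, here is a short Python sketch of the arithmetic. It uses only numbers already posted in this thread: the node counts and times from the two bench summaries, the "Time elapsed" values reported by pcm-numa, and the retest figures above. Note that "Time elapsed" covers the whole run of the process, not just the search, so it is not directly comparable to the bench time. The helper name `nps` is illustrative.

```python
def nps(nodes, time_ms):
    """Nodes per second from a node count and a time in milliseconds."""
    return nodes * 1000 / time_ms


# Bench summaries from the pcm-numa runs in the first post.
stockfish_nps = nps(68440190, 7092)
cfish_nps = nps(55178288, 1664)

# Ratio of the bench "Nodes/second" figures.
bench_ratio = cfish_nps / stockfish_nps

# Ratio of the "Time elapsed" values (Stockfish / Cfish), in ms.
elapsed_ratio = 9803 / 7283

# Ratio of the retest figures, in Mnps.
retest_ratio = 34 / 25

print(f"bench nps ratio:    {bench_ratio:.2f}")
print(f"elapsed time ratio: {elapsed_ratio:.2f}")
print(f"retest nps ratio:   {retest_ratio:.2f}")
```

The bench figures from the first post give a ratio well above 3, while the elapsed-time ratio and the retest both come out close to 1.35.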
zullil

Re: Some NUMA data for Stockfish-dev and Cfish-dev

Post by zullil »

syzygy wrote: Sat Jun 22, 2019 4:52 pm
Cfish's nps was more than 3x that of Stockfish? (Looking at "Time elapsed", maybe not.)

I suppose the difference in local/remote ratio is due to Cfish making sure that the memory for each thread is allocated on the node on which the thread runs.

I was so fixated on the local vs. remote DRAM access numbers that I didn't even notice the perplexing bench results for Stockfish in the data I posted. Now that I've finally noticed what you noticed, I'm even more puzzled. Need to review and retest. I should add that I just discovered the pcm-numa tool, and am certainly no expert.