SMP NPS measurements
Posted: Sun Aug 06, 2017 11:16 pm
To check if NPS scales linearly with the number of search threads as expected, I ran a series of tests on an Amazon EC2 r4.16xlarge instance. The instance runs Amazon Linux and has 480GB RAM. "lscpu" gives the following hardware details:
Code:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 1644.842
BogoMIPS: 4661.99
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
I let texel search for 20 seconds from the starting position, using different numbers of search threads and different hash table sizes. Each search was repeated five times.

When using 1 thread and 1MB hash, the speed was 1.47 MN/s.
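For reference, one such 20-second measurement can be set up with plain UCI commands. This is only a minimal sketch, assuming texel's standard Threads and Hash options; the values shown are just one of the tested combinations:

Code:
uci
setoption name Hash value 1024
setoption name Threads value 16
isready
ucinewgame
position startpos
go movetime 20000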
For other combinations of search threads and hash size, the measured speed relative to the 1 thread/1MB value was:
Code:
threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
      1       1   1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
      2   2.010   1.999   2.031   2.015   2.031   2.028   1.970   2.036   2.028   1.980
      4   4.002   3.995   4.008   4.017   4.060   4.033   3.924   4.058   4.038   3.931
      8   7.944   7.966   7.968   8.007   8.012   8.059   7.831   8.068   8.037   7.836
     16  15.718  15.822  15.846  15.747  15.882  15.871  15.418  16.007  15.923  15.392
     32  30.781  30.855  30.790  30.843  31.267  31.187  30.386  31.690  31.830  30.919
     64  36.540  36.070  36.254  36.346  36.779  37.108  36.076  37.836  38.380  38.217
If each entry is divided by the number of search threads used, we get the following relative speeds per thread (for example, 30.781 / 32 ≈ 0.962 for 32 threads and 1MB hash):

Code:
threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
      1       1   1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
      2   1.005   1.000   1.016   1.008   1.016   1.014   0.985   1.018   1.014   0.990
      4   1.000   0.999   1.002   1.004   1.015   1.008   0.981   1.015   1.010   0.983
      8   0.993   0.996   0.996   1.001   1.001   1.007   0.979   1.008   1.005   0.980
     16   0.982   0.989   0.990   0.984   0.993   0.992   0.964   1.000   0.995   0.962
     32   0.962   0.964   0.962   0.964   0.977   0.975   0.950   0.990   0.995   0.966
     64   0.571   0.564   0.566   0.568   0.575   0.580   0.564   0.591   0.600   0.597
The time in seconds required to initialize the hash table is given by the following table:

Code:
threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
      1   0.023   0.022   0.021   0.041   0.099   0.022   0.022   4.817  19.302  84.250
      2   0.022   0.022   0.022   0.041   0.100   0.023   0.023   4.910  19.527  84.641
      4   0.024   0.022   0.023   0.041   0.100   0.023   0.023   4.922  19.663  84.971
      8   0.023   0.024   0.023   0.043   0.100   0.024   0.023   4.949  19.602  85.991
     16   0.026   0.026   0.025   0.044   0.104   0.026   0.026   4.944  19.649  85.229
     32   0.030   0.031   0.031   0.049   0.108   0.030   0.030   4.944  19.773  85.196
     64   0.041   0.042   0.042   0.060   0.118   0.041   0.040   4.933  19.410  85.199
Analysis

NPS scales almost linearly with the number of search threads and is only very slightly affected by the hash table size. There are some anomalies, however:
* For 64 threads the NPS scaling is only 58%, but this is expected because the machine only has 32 cores (but 64 hyperthreads).
* For 4GB hash the NPS is a few percent lower than for other hash sizes. On this machine, 8GB RAM has been reserved for 1GB huge pages, which means that for hash sizes 1GB and 4GB the memory allocation is served using 1GB huge pages. For other hash table sizes, 2MB huge pages are used, since the "transparent huge pages" feature is enabled in the kernel. It can be seen that the initialization time is significantly smaller for 1GB and 4GB hash sizes, which is expected when 1GB huge pages are used. It is, however, surprising that NPS is reduced when four 1GB huge pages are used for the hash table. (A sketch of this kind of allocation is given after this list.)
* There is a small slowdown when 32 threads are used and when 256GB hash is used. For these tests texel was run in NUMA-aware mode, which means that it puts all search threads on the same NUMA node if possible. This is possible when using at most 16 threads, but not when using 32 threads. Also, texel tries to allocate the hash table on NUMA node 0, but for the 256GB hash case there is not enough memory on node 0, so some of the hash table is allocated on node 1. The observed NPS values seem reasonable given how texel handles NUMA hardware.
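For reference, the kind of allocation described in the last two points could look roughly like the sketch below. This is not texel's actual code, just a minimal illustration assuming Linux, 1GB hugetlb pages reserved at boot, and libnuma; the function names are made up for the example:

Code:
#include <sys/mman.h>
#include <numa.h>        // libnuma, link with -lnuma
#include <cstddef>
#include <cstring>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)   // 30 = log2(1GB), 26 = MAP_HUGE_SHIFT
#endif

// Allocate a hash table of "size" bytes, preferring explicit 1GB huge pages
// (size must then be a multiple of 1GB) and falling back to a normal
// anonymous mapping, which the kernel can back with 2MB pages through
// transparent huge pages.
void* allocHashTable(std::size_t size) {
    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                     -1, 0);
    if (mem == MAP_FAILED)
        mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return nullptr;

    // Prefer NUMA node 0 for the table. The policy only takes effect when
    // the pages are first touched, so set it before clearing the memory.
    // If node 0 does not have enough free memory, the kernel places the
    // remaining pages on another node.
    if (numa_available() != -1)
        numa_tonode_memory(mem, size, 0);

    std::memset(mem, 0, size);   // touching every page is the "init" cost
    return mem;
}

// Bind the calling search thread to NUMA node 0, so that all threads run
// on the same node when the thread count allows it.
void bindSearchThreadToNode0() {
    if (numa_available() != -1)
        numa_run_on_node(0);
}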
It is quite remarkable that NPS is almost equal when using a 1MB hash table that fits completely in L3 cache, and when using a 64GB hash table, which is around 1000 times larger than the L3 cache. Texel uses prefetch instructions to improve hash table access times. Possibly that causes most of the memory latency to be hidden.
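As an illustration of the prefetching idea (again not texel's actual code; the entry layout and indexing are simplified for the example), the table address can be computed from the zobrist key as soon as a position is reached, and a prefetch issued so that the memory access overlaps with move generation and evaluation:

Code:
#include <cstdint>

// Simplified transposition table; the real entry layout and replacement
// scheme in texel are different, this only illustrates the prefetch.
struct TTEntry {
    std::uint64_t key;
    std::uint64_t data;
};

static TTEntry* table;            // allocated elsewhere, power-of-two entries
static std::uint64_t indexMask;   // number of entries - 1

// Issue a prefetch for the entry belonging to a position as soon as its
// zobrist key is known (e.g. right after making a move). By the time the
// search actually probes the table, the cache line is hopefully already in
// flight, hiding most of the DRAM latency. __builtin_prefetch is a
// GCC/Clang builtin.
inline void prefetchTTEntry(std::uint64_t zobristKey) {
    __builtin_prefetch(&table[zobristKey & indexMask]);
}

// The later probe then reads the same cache line.
inline bool probeTT(std::uint64_t zobristKey, TTEntry& result) {
    const TTEntry& e = table[zobristKey & indexMask];
    if (e.key != zobristKey)
        return false;
    result = e;
    return true;
}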