petero2 wrote:
Time doubling vs cores doubling vs cluster nodes doubling
To see how well the algorithm scales, a series of test matches was played in which the hardware resources were successively doubled. Three different ways of doubling the hardware resources were tested:
* Doubling the thinking time.
* Doubling the number of cores on a single cluster node.
* Doubling the number of cluster nodes.
The matches were played on identical "E5-2686 v4 @ 2.30GHz" EC2 nodes, on my 24 core computer (E5-2680 v3 @ 2.50GHz), or on my 16 core computer (E5-2670 0 @ 2.60GHz). Turbo boost was enabled during all tests. The transposition table size was 128MB. The EC2 instances, which were used for all cluster tests, are connected by 10 Gigabit Ethernet and the ping time between nodes is around 90 microseconds.
Matches were run using a slightly modified version of cutechess-cli. The modification uses thread affinities to lock each game to a subset of the available cores, which gives more reliable results, especially on NUMA hardware when multiple games are run in parallel.
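The affinity idea can be sketched in a few lines. This is only a Linux-specific illustration using Python's os.sched_setaffinity, not the actual cutechess-cli patch:

```python
import os

# Sketch (Linux-only): pin the current process to a fixed subset of cores,
# similar in spirit to what the modified cutechess-cli does for each game.
available = sorted(os.sched_getaffinity(0))   # cores this process may run on
subset = set(available[:2])                   # e.g. the first two of them
os.sched_setaffinity(0, subset)               # lock this process to them
```

Since child processes inherit the affinity mask, pinning the process that manages a game also pins the engine processes it spawns.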
The results can be summarized by the following table, which shows the elo increase when successively doubling thinking time (row 1), number of cores (row 2), and number of cluster nodes (row 3).
Code: Select all
config X1 X2 X4 X8 X16 X32
tXc1n1 - 112 100 85 72 63
t1cXn1 - 72 76 73 66 44
t1c1nX - 50 46 51 29 --
A configuration tAcBnC means the time control is A times the base time control, the number of cores per cluster node is B, and the number of cluster nodes is C. The base time control was 24s/game+0.24s/move. Each entry in the table shows the elo increase compared to the configuration immediately to the left of the entry. For example, the value 100 in the X4 column in row 1 means that when playing 4*base_time vs 2*base_time on a single core and a single cluster node, the elo increase is 100.
The total elo increase between 1*base_time and 32*base_time is: 112+100+85+72+63 = 432.
The total elo increase between 1 core and 32 cores is: 72+76+73+66+44 = 331.
The total elo increase between 1 cluster node and 16 nodes is: 50+46+51+29 = 176.
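In code form, the totals are just sums over the per-doubling gains from the table:

```python
# Per-doubling elo gains, taken from the summary table above
time_gains = [112, 100, 85, 72, 63]   # tXc1n1 row
core_gains = [72, 76, 73, 66, 44]     # t1cXn1 row
node_gains = [50, 46, 51, 29]         # t1c1nX row

total_time  = sum(time_gains)    # 1*base_time -> 32*base_time: 432 elo
total_cores = sum(core_gains)    # 1 core      -> 32 cores:     331 elo
total_nodes = sum(node_gains)    # 1 node      -> 16 nodes:     176 elo
```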
The 32 node cluster test is missing because my current Amazon instance limit does not let me rent 32 computers at the same time. I might try to get this limit increased in the future.
The 32 core configuration is the only one that was running on more than one NUMA node. The 16 core configuration was run on an EC2 instance having 16 cores per NUMA node so it only used one NUMA node.
Each match was typically 1000 or 2000 games long, and the estimated error in the results was between 4 and 6 elo, one standard deviation. More details about the matches are given in the following table.
Code: Select all
match dElo sDev win loss draw depth1 depth2
2t vs 1t +112.0 4.5 778 153 1075 17.65 16.34
4t vs 2t +100.0 4.3 690 130 1180 19.12 17.81
8t vs 4t +85.4 4.3 615 133 1252 20.58 19.29
16t vs 8t +72.1 4.1 552 143 1305 21.90 20.65
32t vs 16t +62.7 3.9 480 123 1397 23.21 22.05
2c vs 1c +72.2 4.5 604 194 1202 17.19 16.68
4c vs 2c +76.4 4.5 642 209 1149 17.74 17.07
8c vs 4c +73.2 4.3 577 162 1261 18.71 18.00
16c vs 8c +65.7 5.7 260 73 667 19.89 19.17
32c vs 16c +44.0 5.3 203 77 720 20.55 19.97
2n vs 1n +50.0 6.5 260 117 623 16.68 16.55
4n vs 2n +45.8 6.0 252 121 627 17.12 16.71
8n vs 4n +51.1 6.1 251 105 644 17.46 17.08
16n vs 8n +28.6 6.1 203 121 676 18.09 17.70
The estimated standard deviation was computed by grouping game results into pairs (pentanomial distribution), since each opening was played twice with colors reversed.
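One way to compute such a pentanomial error estimate is sketched below; this is my reading of the method, not necessarily texel's exact code. Each pair score is the sum of the two game results of an opening pair (so it lies in {0, 0.5, 1, 1.5, 2}), and the elo error follows from the standard deviation of the mean score via the logistic elo curve:

```python
import math

def pentanomial_elo_sdev(pair_scores):
    """Estimate the elo standard deviation from per-pair scores.

    pair_scores: one entry per opening pair, each the sum of the two
    game results for that pair, i.e. a value in {0, 0.5, 1, 1.5, 2}.
    """
    n = len(pair_scores)
    mean = sum(pair_scores) / n                               # mean pair score (0..2)
    var = sum((s - mean) ** 2 for s in pair_scores) / (n - 1) # sample variance
    score = mean / 2                                          # mean per-game score
    sdev_score = math.sqrt(var / n) / 2                       # s.d. of per-game score
    # elo(s) = -400*log10(1/s - 1); propagate the error via its derivative at s
    delo_ds = 400 / (math.log(10) * score * (1 - score))
    return sdev_score * delo_ds
```

Pairing captures the correlation between the two games of an opening, which a trinomial (per-game) model would miss.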
depth1 and depth2 are the arithmetic mean of the search depth taken over all moves an engine configuration played in the match. Note that this value depends to a non-trivial extent on the number of moves the engine searched in very simple endgame positions where texel's maximum search depth (100 ply) was reached. There may be better ways to compute a representative average search depth from the PGN data.
Other cluster tests
Some test results using more than one core per cluster node:
Code: Select all
match dElo sDev win loss draw depth1 depth2
t1c4n4 vs t1c4n1 +92.8 5.8 316 55 629 18.78 17.97
t1c32n2 vs t1c32n1 +39.2 5.6 217 102 705 20.73 20.17
t1c4n2 vs t1c4n1 +37.7 5.7 221 113 666 18.75 18.34
t1c2n2 vs t1c2n1 +31.4 6.7 228 138 634 17.74 17.43
t1c2n4 vs t1c2n2 +42.2 6.1 237 116 647 18.09 17.66
t1c2n8 vs t1c2n4 +31.4 5.7 203 113 684 18.49 18.19
t1c2n16 vs t1c2n8 +32.1 5.7 206 114 680 18.97 18.58
The last four rows can be used to extend the summary table above with one more row:
Code: Select all
config X1 X2 X4 X8 X16 X32
tXc1n1 - 112 100 85 72 63
t1cXn1 - 72 76 73 66 44
t1c1nX - 50 46 51 29 --
t1c2nX - 31 42 31 32 --
Varying transposition table size
On the given hardware, texel fills the transposition table at around 16 MB/s per thread. This means that the fixed 128MB hash table used in the previous tests is too small to hold the entire search tree when the thinking time, number of cores, or number of cluster nodes is large. To see if this affected the results, some tests were repeated with a 512MB hash table.
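As a rough back-of-envelope check, using the ~16 MB/s per-thread fill rate quoted above:

```python
# Rough estimate: time for a search to fill the transposition table,
# assuming the ~16 MB/s per-thread fill rate stays constant.
def seconds_to_fill(hash_mb, threads, fill_rate_mb_s=16):
    return hash_mb / (fill_rate_mb_s * threads)

one_thread_128mb = seconds_to_fill(128, 1)   # 8.0 s for one thread
```

So a single thread fills the 128 MB table in about 8 seconds, well within the thinking time of the longer configurations, and more threads or nodes fill it proportionally faster.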
Code: Select all
match dElo sDev win loss draw depth1 depth2
t8c1n1 vs t4c1n1 +97.1 4.2 654 109 1237 20.57 19.24
t16c1n1 vs t8c1n1 +83.4 4.2 604 133 1263 21.89 20.60
t32c1n1 vs t16c1n1 +72.4 4.0 538 127 1335 23.18 21.89
Some tests were also repeated with a 1MB transposition table, which is too small to hold the whole search tree for all configurations.
Code: Select all
match dElo sDev win loss draw depth1 depth2
t2c1n1 vs t1c1n1 +103.3 4.1 875 182 1343 16.90 15.82
t4c1n1 vs t2c1n1 +77.5 4.4 619 180 1201 18.06 17.02
t8c1n1 vs t4c1n1 +77.0 4.3 603 167 1230 19.11 18.08
t1c2n1 vs t1c1n1 +48.6 4.4 544 266 1190 16.35 16.11
t1c4n1 vs t1c2n1 +38.8 3.1 953 508 2539 16.89 16.48
t1c8n1 vs t1c4n1 +46.3 4.2 465 200 1335 17.52 17.07
Using the above data the summary table can be extended:
Code: Select all
config X1 X2 X4 X8 X16 X32
tXc1n1 - 112 100 85 72 63
tXc1n1' - 97 83 72 // 512MB hash
tXc1n1'' - 103 78 77 // 1MB hash
t1cXn1 - 72 76 73 66 44
t1cXn1'' - 49 39 46 // 1MB hash
t1c1nX - 50 46 51 29 --
t1c2nX - 31 42 31 32 --
It can be seen that using too small a transposition table hurts more when doubling the number of cores than when doubling the thinking time.
Possible NUMA issues
As mentioned above, the 32 core configuration was the only one running on more than one NUMA node. The 32 cores vs 16 cores result (+44 elo) was also somewhat lower than expected. To test NUMA behavior further, a 16 core vs 8 core match was played on the 16 core computer (which has two NUMA nodes) and a 24 core vs 12 core match was played on the 24 core computer (which also has two NUMA nodes).
Code: Select all
match dElo sDev win loss draw depth1 depth2
t1c16n1 vs t1c8n1 +44.4 5.6 219 92 689 19.62 18.95 // 128MB hash
t1c16n1 vs t1c8n1 +48.3 6.1 240 102 658 19.68 19.00 // 1024MB hash
t1c24n1 vs t1c12n1 +41.5 5.9 211 92 697 20.12 19.58 // 128MB hash
t1c24n1 vs t1c12n1 +47.5 6.0 229 93 678 20.39 19.80 // 1024MB hash
The elo increase is similar to the earlier 32 cores vs 16 cores match, and the 16 cores vs 8 cores result (+44.4 elo) is lower than the earlier non-NUMA 16 cores vs 8 cores result (+66 elo).
Test data
All games played for these measurements are available in
this archive file.
The tests used the texel development version 1.07a27, available for download
here.
Turbo boost effects
Turbo boost was enabled for all tests. This makes the computers run faster than the base clock frequency, but the speed increase is larger when fewer cores are in use. This can potentially skew the results in favor of configurations using few cores, i.e. it can make it look like the SMP algorithm scales worse than it would on hypothetical hardware that runs at the same clock speed regardless of how many cores are in use.
All the computers used have very good cooling systems, so the actual turbo boost frequency can be calculated from the processor's turbo boost specification as explained
here. The following table shows the turbo boost multiplier as a function of the number of active cores:
Code: Select all
18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 base incr
E5-2686 v4 : 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 7 7 2.3GHz 0.1GHz
E5-2680 v3 : 4 4 4 4 4 4 4 4 5 6 8 8 2.5GHz 0.1GHz
E5-2670 0 : 4 4 5 5 6 6 7 7 2.6GHz 0.1GHz
As an example of the worst case, consider a match played between 1 core and 12 cores on the E5-2680 computer. When the 1 core engine is thinking only one core is active so the clock frequency is 2.5+8*0.1 = 3.3GHz. When the 12 core engine is thinking all 12 cores are active so the clock frequency is 2.5+4*0.1 = 2.9GHz. In this case it would be as if the 12 core engine had a 12% time handicap (1-2.9/3.3), so if we wanted to measure the true algorithmic 12 core vs 1 core improvement, we would have to give the 12 core engine 12% more thinking time.
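The worst-case calculation above can be written out explicitly. The multiplier table for the E5-2680 v3 is transcribed from the turbo boost table above:

```python
# Turbo boost multipliers for the E5-2680 v3, indexed by the number of
# active cores (transcribed from the table above).
E5_2680_V3 = {12: 4, 11: 4, 10: 4, 9: 4, 8: 4, 7: 4,
              6: 4, 5: 4, 4: 5, 3: 6, 2: 8, 1: 8}

def clock_ghz(base, incr, multipliers, active_cores):
    """Actual clock frequency with turbo boost for a given active core count."""
    return base + incr * multipliers[active_cores]

f_1  = clock_ghz(2.5, 0.1, E5_2680_V3, 1)    # 3.3 GHz when 1 core is active
f_12 = clock_ghz(2.5, 0.1, E5_2680_V3, 12)   # 2.9 GHz when all 12 are active
handicap = 1 - f_12 / f_1                    # ~12% effective time handicap
```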
However, these tests were run in a way that does not trigger the worst case, since only 2*N vs 1*N core matches were played (for various values of N), and as many parallel games as possible were played on each test computer. For example, consider a 4 cores vs 2 cores match played on the E5-2680 computer. If only one game were played at a time, the speed difference would be 1-(2.5+5*0.1)/(2.5+8*0.1)=9%. However, since each game needs only 4 cores, 3 games are played in parallel. This means that at least 2*3 = 6 cores are active at any time, so it can be seen from the table above that the turbo boost multiplier is always 4, and hence the speed difference is zero.
I think the largest difference that actually occurred in the test data is 8 cores vs 4 cores played on the E5-2670 computer. In this case the speed difference is 1-(2.6+4*0.1)/(2.6+6*0.1)=6%.