Lazy SMP and "lazy cluster" experiments
Moderators: hgm, Rebel, chrisw
-
- Posts: 689
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Lazy SMP and "lazy cluster" experiments
Lately I have been doing some experiments with an implementation of lazy SMP in texel. My motives were to see how lazy SMP behaves in general and to have an SMP algorithm that is easier to generalize to cluster computing than texel's old SMP algorithm.
In the following posts I will describe how the algorithm works, present some test results and suggest some ideas for future improvements.
Lazy SMP and lazy cluster algorithm
Lazy SMP algorithm
The general design principles for texel's old SMP algorithm were "no waiting" and "be asynchronous whenever possible". Typical lazy SMP algorithms also follow these principles, since there is very little synchronization between threads, except for the transposition table. Other than that there is very little similarity between the old and the new algorithm.
Texel's lazy SMP algorithm uses a shared transposition table, which is thread safe and lockless using the "XOR trick". Other data, such as history tables, killer tables and evaluation caches, are private to each search thread.
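The "XOR trick" can be sketched like this (a minimal illustration of the technique; the field layout and memory orderings are assumptions, not texel's actual code). The entry's key field stores key XOR data, so a torn write, where the key comes from one thread and the data from another, fails the checksum and the corrupted entry is simply rejected:

```cpp
#include <atomic>
#include <cstdint>

// One lockless transposition table entry. The layout of the packed
// data word (move/score/depth/bound) is illustrative.
struct TTEntry {
    std::atomic<uint64_t> key{0};   // zobristKey ^ data
    std::atomic<uint64_t> data{0};  // packed move/score/depth/bound
};

inline void ttStore(TTEntry& e, uint64_t zobristKey, uint64_t data) {
    e.key.store(zobristKey ^ data, std::memory_order_relaxed);
    e.data.store(data, std::memory_order_relaxed);
}

inline bool ttProbe(const TTEntry& e, uint64_t zobristKey, uint64_t& dataOut) {
    uint64_t k = e.key.load(std::memory_order_relaxed);
    uint64_t d = e.data.load(std::memory_order_relaxed);
    if ((k ^ d) != zobristKey)
        return false;  // empty slot, different position, or torn write
    dataOut = d;
    return true;
}
```

A probe that races with a store can at worst return "not found"; it can never return another position's data as valid.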
The first thread is the master thread and all other threads are helper threads. Whenever the master thread calls the negaMax function from the root search function, it also tells the helper threads to start searching the same position. The helper threads improve the search in two ways. First, they store partial search results in the shared transposition table, which can be useful for other search threads. Second, if a helper thread finishes its search before the master thread is finished, the master thread will detect this and abort its search and use the result from the helper thread instead.
In pseudo code, the algorithm works basically like this:
Code: Select all
Move iterativeDeepening(Position pos) {
    for (depth = 1; depth <= maxDepth && !timeout; depth++) {
        for (all legal moves m) {
            [alpha, beta] = score_from_previous_iteration -/+ aspiration_window_delta;
            pos.makeMove(m);
            score = -negaScoutRoot(pos, -beta, -alpha, depth);
            pos.unMakeMove(m);
            if (score outside aspiration window)
                widen window and redo search;
            if (score > bestScore) {
                bestMove = m;
                bestScore = score;
            }
        }
    }
    tell all helper threads to stop searching
    return bestMove;
}

int negaScoutRoot(Position pos, int alpha, int beta, int depth) {
    tell all helper threads to start searching with params (pos, alpha, beta, depth)
    try {
        return negaScout(pos, alpha, beta, depth);
    } catch (HelperThreadResult result) {
        return result.getScore();
    }
}

int negaScout(Position pos, int alpha, int beta, int depth) {
    if (time to check for abort) {
        if (timeout)
            throw StopSearch();
        if (helper thread has a search result)
            throw HelperThreadResult(helper_thread_score);
    }
    // Normal recursive negaScout implementation ...
}

void helperThreadMainLoop() {
    while (!quit) {
        while (idle)
            wait for search command
        int extraDepth = thread_number % 2;
        try {
            for ( ; depth + extraDepth <= maxDepth; extraDepth++) {
                negaScout2(pos, alpha, beta, depth + extraDepth);
            }
        } catch (StopSearch) {
        }
    }
}

int negaScout2(Position pos, int alpha, int beta, int depth) { // Like negaScout() except for abort condition
    if (time to check for abort) {
        if (stop command received or new search command received)
            throw StopSearch();
    }
    // Normal recursive negaScout implementation ...
}
Communication between the master thread and the helper threads is asynchronous, meaning that when a thread sends a command to another thread, it does not wait for the other thread to acknowledge the command.
Standard C++11 threading primitives, such as locks and condition variables, are used for the communication.
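A minimal sketch of such asynchronous command passing, assuming a simple one-slot mailbox (the class and names are illustrative, not texel's actual implementation). The sender stores the latest command and returns immediately; the receiver blocks on the condition variable only while it has nothing to do:

```cpp
#include <condition_variable>
#include <mutex>
#include <optional>
#include <thread>

// A command sent between search threads; fields are illustrative.
struct Command { int type; int depth; };

class Mailbox {
public:
    // Called by the sender; never waits for the receiver. A newer
    // command simply overwrites an unconsumed older one.
    void post(const Command& cmd) {
        { std::lock_guard<std::mutex> lk(m_); cmd_ = cmd; }
        cv_.notify_one();
    }
    // Called by the receiver when idle; blocks until a command arrives.
    Command waitForCommand() {
        std::unique_lock<std::mutex> lk(m_);
        cv_.wait(lk, [this]{ return cmd_.has_value(); });
        Command c = *cmd_;
        cmd_.reset();
        return c;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::optional<Command> cmd_;
};
```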
Half of the helper threads search at the same depth as the master thread and the other half search at a one ply deeper depth.
When a helper thread finishes its search it will send the search result to the master thread. The helper thread will then restart its current search at a one ply deeper depth. It is expected that this search will be aborted rather quickly. As soon as the master thread receives the helper thread search result, it will continue its algorithm, which causes it to call negaScoutRoot() again with different parameters, and negaScoutRoot() tells all helper threads to start searching using the new parameters.
The master thread is responsible for all UCI communication and when reporting search depth it uses its own search depth. This means that even if a "depth+1" helper thread finished first, the reported depth is "depth", even though in that case the "quality" of the search is "depth+1".
The master thread does not directly talk to all helper threads. Instead all search threads are arranged in a tree, where the master thread is the tree root node. Each tree node has at most 4 children. The tree is created to have minimal height and children are assigned from left to right. For example if there are 16 search threads they are arranged as follows:
Code: Select all
0
├── 1
│   ├── 5
│   ├── 6
│   ├── 7
│   └── 8
├── 2
│   ├── 9
│   ├── 10
│   ├── 11
│   └── 12
├── 3
│   ├── 13
│   ├── 14
│   └── 15
└── 4
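With at most 4 children per node and children assigned left to right, this layout is the usual 4-ary heap numbering, so parent and child indices can be computed directly (an observation from the example above, not code taken from texel):

```cpp
#include <vector>

// Parent of thread i in the 4-ary tree (thread 0 is the root).
inline int parentThread(int i) { return (i - 1) / 4; }

// Children of thread i when there are numThreads threads in total.
inline std::vector<int> childThreads(int i, int numThreads) {
    std::vector<int> c;
    for (int j = 4 * i + 1; j <= 4 * i + 4 && j < numThreads; j++)
        c.push_back(j);
    return c;
}
```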
A search thread is only allowed to talk to its neighbors in the tree. When a search thread receives a command from a neighbor, it acts on the command and forwards it to its parent or children as appropriate.
This thread tree structure is not strictly necessary for SMP search, since SMP hardware does not have an excessive number of hardware threads, so the master thread could talk directly to all helper threads. For clusters, however, the number of hardware threads can be very large (the largest supercomputers have more than one million cores), and in that case it would scale very badly if the master thread tried to talk directly to every helper thread.
The algorithm is NUMA aware. This means that each search thread is locked to a specific NUMA node, and care is taken to allocate thread local memory on the NUMA node where the thread runs. The shared transposition table is allocated on NUMA node 0 if there is enough free memory on that node.
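The thread-pinning part can be sketched with a Linux-specific call (a sketch, assuming Linux/glibc; texel's real NUMA handling also allocates thread-local memory on the thread's NUMA node, e.g. via a library like libnuma, which is omitted here):

```cpp
#include <sched.h>

// Pin the calling thread to one logical CPU; returns false on failure.
// To lock a thread to a NUMA node rather than a single CPU, set one
// bit per CPU belonging to that node instead.
inline bool pinCurrentThread(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set) == 0;  // 0 = calling thread
}
```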
Lazy cluster algorithm
Support for clusters is implemented using the Message Passing Interface (MPI). An MPI program consists of a set of processes, called "processors" in MPI terminology. Different processors do not share any memory and may execute on different computers (cluster nodes). All communication between processors is performed using message passing, which the MPI library can implement using, for example, TCP/IP over Ethernet.
The texel cluster implementation is a so-called "hybrid" MPI implementation: each processor is a multithreaded program and there is only one processor on each computer in the cluster. Inside each processor the threads are arranged in the same tree structure as in the lazy SMP case. Cluster nodes are also connected in a tree structure, where the main thread in each processor is connected to the main threads in neighboring processors. This means that the set of all search threads in all cluster nodes forms a tree of trees. For example, if each node has 3 cores and there are 7 nodes in the system, the tree structure looks like this:
Code: Select all
n0_t0
├── n0_t1
├── n0_t2
├── n1_t0
│   ├── n1_t1
│   ├── n1_t2
│   ├── n5_t0
│   │   ├── n5_t1
│   │   └── n5_t2
│   └── n6_t0
│       ├── n6_t1
│       └── n6_t2
├── n2_t0
│   ├── n2_t1
│   └── n2_t2
├── n3_t0
│   ├── n3_t1
│   └── n3_t2
└── n4_t0
    ├── n4_t1
    └── n4_t2
Communication between cluster nodes works logically the same way as communication between threads within a node, but the implementation uses MPI messages instead of C++11 threading primitives.
It is not necessary for each cluster node to have the same number of cores, even though that was the case in the example above. The number of cluster nodes in the system is given when the program starts, but the number of search threads actually used is given by the "Threads" UCI parameter. Normally the number of search threads is set to the total number of cores in the cluster, but a different number can be specified. If there are fewer search threads than cores, cores are allocated in breadth-first order. If there are more search threads than cores, the extra search threads are spread out evenly over the cluster nodes and cores, which makes it possible to utilize hyperthreading cores.
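The "spread out evenly" part can be sketched as follows (illustrative only, with surplus threads going to the lowest-numbered nodes; texel's actual allocation, and its breadth-first order for the under-subscribed case, may differ):

```cpp
#include <vector>

// Distribute numThreads search threads over numNodes cluster nodes as
// evenly as possible: counts differ by at most one across nodes.
inline std::vector<int> threadsPerNode(int numThreads, int numNodes) {
    std::vector<int> cnt(numNodes, numThreads / numNodes);
    for (int i = 0; i < numThreads % numNodes; i++)
        cnt[i]++;  // remaining threads go to the lowest-numbered nodes
    return cnt;
}
```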
The cluster communication described so far improves the search by making it possible for a search thread to finish its search before the master thread (on node 0) and report the result up the tree to the master thread. However, since each processor lives in its own address space, the major lazy SMP improvement, transposition table sharing, does not work between cluster nodes. To overcome this limitation, a method to distribute transposition table updates to other cluster nodes has been implemented.
Each processor has one transposition table shared between all threads in the process. Whenever a search thread stores a result in the transposition table, it also checks if the stored depth is large enough to make it worthwhile to distribute this information to other cluster nodes. If so, the TT entry is stored in output queues containing entries to be sent to cluster node neighbors. Each cluster neighbor has its own output queue. Spare bandwidth between cluster nodes is used to transmit the queued entries to neighboring nodes. When a node receives a TT entry, it inserts it in its local transposition table, and, if the depth is large enough, also stores it in the output queues for all neighbors except the neighbor the entry came from. Since the cluster nodes form a tree structure, this ensures that each TT entry update can be distributed to all other cluster nodes, but distribution can stop at any cluster node if the TT entry depth is not large enough.
What TT entry depth is considered "large enough" for distribution to other cluster nodes is determined dynamically using a simple procedure. Each output queue is periodically transmitted to the receiving node. If the queue is less than half full when it is about to be sent, the minimum depth for that queue is decreased by 1. If on the other hand the queue becomes full before it was possible to send it to the receiving node, the minimum depth for that queue is increased by 1.
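This adjustment procedure can be sketched as follows (the class, names, and the initial minimum depth are assumptions, not texel's actual code):

```cpp
// Per-output-queue control of the minimum depth required for a TT
// entry to be queued for a neighbor.
class DepthFilter {
public:
    int minDepth() const { return minDepth_; }

    // Called when the queue is transmitted, with its current fill level.
    void onQueueSent(int queued, int capacity) {
        if (queued < capacity / 2 && minDepth_ > 0)
            minDepth_--;  // spare bandwidth: send shallower entries too
    }

    // Called when the queue filled up before it could be transmitted.
    void onQueueFull() {
        minDepth_++;      // link saturated: only send deeper entries
    }

private:
    int minDepth_ = 0;    // starting value is an assumption
};
```

The effect is a simple feedback loop: the minimum depth settles where the stream of "deep enough" TT entries roughly matches the available network bandwidth.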
Cluster algorithm properties
From the design it should be clear that speed measured in nodes per second (NPS) will increase proportionally to the number of nodes in the cluster, since all threads will always be searching something as soon as the first "start search" command has reached all search threads.
The NPS values reported in the UCI communication are based on information available in the master thread. Each search thread maintains a local "nodes searched" counter, and counter changes are periodically reported to parent threads, so this information propagates up the tree towards the master thread. At any given point in time, however, some counter updates will not yet have reached the master thread. The reported NPS values are therefore somewhat smaller than the real NPS values, but the error shrinks the longer the search runs, and after around 10 seconds it should be quite small even for very large clusters.
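The delta-reporting idea can be sketched like this (illustrative; texel's actual bookkeeping may differ). Each thread counts locally without synchronization and periodically pushes only the unreported delta to its parent's accumulator, so the aggregate can lag but never double counts:

```cpp
#include <atomic>
#include <cstdint>

// Per-thread node counter with periodic delta reporting to the parent.
struct NodeCounter {
    uint64_t local = 0;     // incremented for every searched node
    uint64_t reported = 0;  // portion already pushed to the parent

    void flushTo(std::atomic<uint64_t>& parentTotal) {
        parentTotal.fetch_add(local - reported, std::memory_order_relaxed);
        reported = local;
    }
};
```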
The transposition table design makes TT access from search threads very fast compared to more traditional distributed hash table systems where a thread has to ask another cluster node for the result of a TT probe. This is because no cluster communication is required to access a TT entry. On the other hand the asynchronous TT entry broadcasting mechanism means that there will be a delay between the time when a TT entry is created and the time when it has been distributed to all cluster nodes.
Of course NPS in itself is not important for a chess engine. What really matters is how much the extra hardware contributes to making the program play stronger chess. Tests will have to be performed to measure the strength increase.
SMP NPS measurements
To check whether NPS scales linearly with the number of search threads as expected, I ran a series of tests on an Amazon EC2 r4.16xlarge instance. The instance runs Amazon Linux and has 480GB RAM. "lscpu" gives the following hardware details:
Code: Select all
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 1644.842
BogoMIPS: 4661.99
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
I let texel search for 20 seconds from the starting position, using different numbers of search threads and different hash table sizes. Each search was repeated five times.
When using 1 thread and 1MB hash, the speed was 1.47 MN/s.
For other combinations of search threads and hash size, the measured speed relative to the 1 thread/1MB value was:
Code: Select all
threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
      1   1.000   1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
      2   2.010   1.999   2.031   2.015   2.031   2.028   1.970   2.036   2.028   1.980
      4   4.002   3.995   4.008   4.017   4.060   4.033   3.924   4.058   4.038   3.931
      8   7.944   7.966   7.968   8.007   8.012   8.059   7.831   8.068   8.037   7.836
     16  15.718  15.822  15.846  15.747  15.882  15.871  15.418  16.007  15.923  15.392
     32  30.781  30.855  30.790  30.843  31.267  31.187  30.386  31.690  31.830  30.919
     64  36.540  36.070  36.254  36.346  36.779  37.108  36.076  37.836  38.380  38.217
If each entry is divided by the number of used search threads, we get the following relative speeds per thread:
Code: Select all
threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
      1   1.000   1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
      2   1.005   1.000   1.016   1.008   1.016   1.014   0.985   1.018   1.014   0.990
      4   1.000   0.999   1.002   1.004   1.015   1.008   0.981   1.015   1.010   0.983
      8   0.993   0.996   0.996   1.001   1.001   1.007   0.979   1.008   1.005   0.980
     16   0.982   0.989   0.990   0.984   0.993   0.992   0.964   1.000   0.995   0.962
     32   0.962   0.964   0.962   0.964   0.977   0.975   0.950   0.990   0.995   0.966
     64   0.571   0.564   0.566   0.568   0.575   0.580   0.564   0.591   0.600   0.597
The time in seconds required to initialize the hash table is given by the following table:
Code: Select all
threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
      1   0.023   0.022   0.021   0.041   0.099   0.022   0.022   4.817  19.302  84.250
      2   0.022   0.022   0.022   0.041   0.100   0.023   0.023   4.910  19.527  84.641
      4   0.024   0.022   0.023   0.041   0.100   0.023   0.023   4.922  19.663  84.971
      8   0.023   0.024   0.023   0.043   0.100   0.024   0.023   4.949  19.602  85.991
     16   0.026   0.026   0.025   0.044   0.104   0.026   0.026   4.944  19.649  85.229
     32   0.030   0.031   0.031   0.049   0.108   0.030   0.030   4.944  19.773  85.196
     64   0.041   0.042   0.042   0.060   0.118   0.041   0.040   4.933  19.410  85.199
Analysis
NPS scales almost linearly with the number of search threads and is only very slightly affected by the hash table size. There are some anomalies however:
* For 64 threads the NPS scaling is only 58%, but this is expected because the machine only has 32 cores (but 64 hyperthreads).
* For 4GB hash the NPS is a few percent lower than for other hash sizes. On this machine, 8GB RAM has been reserved for 1GB huge pages, which means that for hash sizes 1GB and 4GB the memory allocation is served by 1GB huge pages. For other hash table sizes 2MB huge pages are used, since the "transparent huge pages" feature is enabled in the kernel. It can be seen that the initialization time is significantly smaller for the 1GB and 4GB hash sizes, which is expected when 1GB huge pages are used. It is, however, surprising that NPS is reduced when four 1GB huge pages are used for the hash table.
* There is a small slowdown when 32 threads are used and when 256GB hash is used. For these tests texel was run in NUMA aware mode, which means that it puts all search threads on the same NUMA node if possible. This is possible when using at most 16 threads, but not when using 32 threads. Also, texel tries to allocate the hash table on NUMA node 0, but in the 256GB case there is not enough memory on node 0, so part of the hash table is allocated on node 1. The observed NPS values seem reasonable given how texel handles NUMA hardware.
It is quite remarkable that NPS is almost equal when using a 1MB hash table that fits completely in L3 cache, and when using a 64GB hash table, which is around 1000 times larger than the L3 cache. Texel uses prefetch instructions to improve hash table access times. Possibly that causes most of the memory latency to be hidden.
ELO measurements
Time doubling vs cores doubling vs cluster nodes doubling
To see how well the algorithm scales, a series of test matches were played where the used hardware resources were successively doubled. Three different ways of doubling hardware resources were tested:
* Doubling the thinking time.
* Doubling the number of cores on a single cluster node.
* Doubling the number of cluster nodes.
The matches were played either on identical "E5-2686 v4 @ 2.30GHz" EC2 nodes, on my 24 core computer (E5-2680 v3 @ 2.50GHz), or on my 16 core computer (E5-2670 0 @ 2.60GHz). Turbo boost was enabled during all tests. The transposition table size was 128MB. The EC2 instances, which were used for all cluster tests, are connected by 10 Gigabit Ethernet and the ping time between nodes is around 90 microseconds.
Matches were run using a slightly modified version of cutechess-cli. The modification uses thread affinities to lock each game to a subset of the available cores, which gives more reliable results, especially on NUMA hardware when running multiple games in parallel.
The results can be summarized by the following table, which shows the elo increase when successively doubling thinking time (row 1), number of cores (row 2), and number of cluster nodes (row 3).
A configuration tAcBnC means the time control is A times the base time control, the number of cores per cluster node is B, and the number of cluster nodes is C. The base time control was 24s/game+0.24s/move. Each entry in the table shows the elo increase compared to the configuration immediately to the left of the entry. For example, the value 100 in the X4 column in row 1 means that when playing 4*base_time vs 2*base_time on a single core and a single cluster node, the elo increase is 100.
The total elo increase between 1*base_time and 32*base_time is: 112+100+85+72+63 = 432.
The total elo increase between 1 core and 32 cores is: 72+76+73+66+44 = 331.
The total elo increase between 1 cluster node and 16 nodes is: 50+46+51+29 = 176.
The 32 node cluster test is missing because my current Amazon instance limit does not let me rent 32 computers at the same time. I might try to get this limit increased in the future.
The 32 core configuration is the only one that was running on more than one NUMA node. The 16 core configuration was run on an EC2 instance having 16 cores per NUMA node so it only used one NUMA node.
Each match was typically 1000 or 2000 games long, and the estimated error in the results was between 4 and 6 elo, one standard deviation. More details about the matches are given in the following table.
The estimated standard deviation was computed by grouping game results in pairs (a pentanomial distribution), since each opening was played twice with colors reversed.
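With that pairing, each pair's combined score (0, 0.5, 1, 1.5 or 2 points over the two games) is treated as one observation, and the match score's standard deviation is estimated from the pair-level sample variance. An illustrative computation (not the author's actual script; requires at least two pairs):

```cpp
#include <cmath>
#include <vector>

// Estimated standard deviation of the match score (as a fraction of
// the maximum score), from per-pair scores in points (0..2 per pair).
inline double matchScoreStdDev(const std::vector<double>& pairScores) {
    double n = static_cast<double>(pairScores.size());
    double mean = 0, var = 0;
    for (double s : pairScores) mean += s;
    mean /= n;
    for (double s : pairScores) var += (s - mean) * (s - mean);
    var /= n - 1;                   // sample variance of one pair's score
    // Var(sum) = n*var, score fraction = sum/(2n), so
    // sd(fraction) = sqrt(n*var)/(2n) = sqrt(var/n)/2.
    return std::sqrt(var / n) / 2;
}
```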
depth1 and depth2 are the arithmetic means of the search depth over all moves an engine configuration played in the match. Note that this value depends to a non-trivial extent on the number of moves the engine searched in very simple endgame positions, where texel's maximum search depth (100 ply) was reached. There may be better ways to compute a representative average search depth from the PGN data.
Other cluster tests
Some test results using more than one core per cluster node:
The last four rows can be used to extend the summary table above with one more row:
Varying transposition table size
With the given hardware texel fills the transposition table at around 16MB/s per thread. This means that the fixed 128MB hash table used in the previous tests is too small to hold the entire search tree when the thinking time, number of cores, or number of cluster nodes is large. To see if this affected the results, some tests were repeated with a 512MB hash table.
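A back-of-envelope helper for this sizing argument, assuming the quoted fill rate of about 16MB/s per thread (the function and its default are illustrative):

```cpp
// Seconds until the hash table has been written through once,
// given its size and the number of search threads.
inline double secondsToFillHash(double hashMB, int threads,
                                double mbPerSecPerThread = 16.0) {
    return hashMB / (threads * mbPerSecPerThread);
}
```

For example, 16 threads fill a 128MB table in about half a second, a tiny fraction of the longer time controls tested, which is why the larger configurations overwrite entries long before the search ends.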
Some tests were also repeated with a 1MB transposition table, which is too small to hold the whole search tree for all configurations.
Using the above data the summary table can be extended:
It can be seen that using a too small transposition table hurts more when doubling the number of cores than when doubling the thinking time.
Possible NUMA issues
As mentioned above the 32 core configuration was the only one running on more than one NUMA node. The result 32 cores vs 16 cores (+44 elo) was also somewhat lower than expected. To further test NUMA behavior a 16 core vs 8 core match was played on the 16 core computer (which has two NUMA nodes) and a 24 core vs 12 core match was played on the 24 core computer (which also has two NUMA nodes.)
The elo increase is similar to the earlier 32 cores vs 16 cores match, and the 16 cores vs 8 cores result (+44.4 elo) is lower than the earlier non-NUMA 16 cores vs 8 cores result (+66 elo).
Test data
All games played for these measurements are available in this archive file.
The tests used the texel development version 1.07a27, available for download here.
Turbo boost effects
Turbo boost was enabled for all tests. This makes the computers run faster than the base clock frequency, but the speed increase is larger when fewer cores are in use. This can potentially skew the results in favor for configurations using few cores, i.e. it can look like the SMP algorithm scales worse than it would scale on hypothetical hardware that runs at the same clock speed regardless of how many cores are in use.
All used computers have very good cooling systems, so the actual turbo boost frequency can be calculated from processor turbo boost specification information as explained here. The following table shows the turbo boost value as a function on the number of active cores:
As an example of the worst case, consider a match played between 1 core and 12 cores on the E5-2680 computer. When the 1 core engine is thinking only one core is active so the clock frequency is 2.5+8*0.1 = 3.3GHz. When the 12 core engine is thinking all 12 cores are active so the clock frequency is 2.5+4*0.1 = 2.9GHz. In this case it would be as if the 12 core engine had a 12% time handicap (1-2.9/3.3), so if we wanted to measure the true algorithmic 12 core vs 1 core improvement, we would have to give the 12 core engine 12% more thinking time.
However, these tests were run in a way that does not trigger the worst case, since only 2*N vs 1*N core matches were played (for various values of N), and as many parallel games as possible were played on each test computer. For example, consider a 4 core vs 2 cores match played on the E5-2680 computer. If only one game were to be played in parallel, the speed difference would be 1-(2.5+5*0.1)/(2.5+8*0.1)=9%. However since only 4 cores are needed for each game, 3 games are played in parallel. This means that at least 2*3 cores are active at any time, so it can be seen from the table above that the turbo boost multiplier is always 4, and hence the speed difference is zero.
I think the largest difference that actually occurred in the test data is 8 cores vs 4 cores played on the E5-2670 computer. In this case the speed difference is 1-(2.6+4*0.1)/(2.6+6*0.1)=6%.
To see how well the algorithm scales, a series of test matches was played in which the available hardware resources were successively doubled. Three different ways of doubling the hardware resources were tested:
* Doubling the thinking time.
* Doubling the number of cores on a single cluster node.
* Doubling the number of cluster nodes.
The matches were played either on identical "E5-2686 v4 @ 2.30GHz" EC2 nodes, on my 24 core computer (E5-2680 v3 @ 2.50GHz), or on my 16 core computer (E5-2670 0 @ 2.60GHz). Turbo boost was enabled during all tests. The transposition table size was 128MB. The EC2 instances, which were used for all cluster tests, are connected by 10 Gigabit Ethernet and the ping time between nodes is around 90 microseconds.
Matches were run using a slightly modified version of cutechess-cli. The modification uses thread affinities to lock each game to a subset of the available cores. This gives more reliable results, especially on NUMA hardware when running multiple games in parallel.
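The core locking itself is a small amount of code. A minimal sketch of what it can look like on Linux (`lockToCores` is my own illustration, not the actual cutechess-cli modification):

```cpp
#include <sched.h>
#include <vector>

// Lock the calling thread to the given set of logical cores (Linux only).
// The modified cutechess-cli applies something like this per game, so that
// parallel games do not migrate between cores or NUMA nodes.
bool lockToCores(const std::vector<int>& cores) {
    cpu_set_t set;
    CPU_ZERO(&set);
    for (int c : cores)
        CPU_SET(c, &set);
    // pid 0 means "the calling thread"
    return sched_setaffinity(0, sizeof(set), &set) == 0;
}
```

A 2 core vs 2 core game would, for example, lock one engine to cores {0,1} and the other to cores {2,3}.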
The results can be summarized by the following table, which shows the elo increase when successively doubling thinking time (row 1), number of cores (row 2), and number of cluster nodes (row 3).
Code: Select all
config X1 X2 X4 X8 X16 X32
tXc1n1 - 112 100 85 72 63
t1cXn1 - 72 76 73 66 44
t1c1nX - 50 46 51 29 --
The total elo increase between 1*base_time and 32*base_time is: 112+100+85+72+63 = 432.
The total elo increase between 1 core and 32 cores is: 72+76+73+66+44 = 331.
The total elo increase between 1 cluster node and 16 nodes is: 50+46+51+29 = 176.
The 32 node cluster test is missing because my current Amazon instance limit does not let me rent 32 computers at the same time. I might try to get this limit increased in the future.
The 32 core configuration is the only one that was running on more than one NUMA node. The 16 core configuration was run on an EC2 instance having 16 cores per NUMA node so it only used one NUMA node.
Each match was typically 1000 or 2000 games long, and the estimated error in the results was between 4 and 6 elo (one standard deviation). The standard deviation was estimated by grouping game results in pairs (a pentanomial distribution), since each opening was repeated with colors reversed. More details about the matches are given in the following table.
Code: Select all
dElo sDev win loss draw depth1 depth2
2t vs 1t +112.0 4.5 778 153 1075 17.65 16.34
4t vs 2t +100.0 4.3 690 130 1180 19.12 17.81
8t vs 4t +85.4 4.3 615 133 1252 20.58 19.29
16t vs 8t +72.1 4.1 552 143 1305 21.90 20.65
32t vs 16t +62.7 3.9 480 123 1397 23.21 22.05
2c vs 1c +72.2 4.5 604 194 1202 17.19 16.68
4c vs 2c +76.4 4.5 642 209 1149 17.74 17.07
8c vs 4c +73.2 4.3 577 162 1261 18.71 18.00
16c vs 8c +65.7 5.7 260 73 667 19.89 19.17
32c vs 16c +44.0 5.3 203 77 720 20.55 19.97
2n vs 1n +50.0 6.5 260 117 623 16.68 16.55
4n vs 2n +45.8 6.0 252 121 627 17.12 16.71
8n vs 4n +51.1 6.1 251 105 644 17.46 17.08
16n vs 8n +28.6 6.1 203 121 676 18.09 17.70
depth1 and depth2 are the arithmetic mean of the search depth, taken over all moves an engine configuration played in the match. Note that this value depends to a non-trivial extent on the number of moves the engine searched in very simple endgame positions, where texel's maximum search depth (100 ply) was reached. There may be better ways to compute a representative average search depth from the PGN data.
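The elo differences and error bars above can be reproduced from the game results roughly as follows (a sketch; `eloEstimate` and the example counts are my own, not texel code). Since each opening was played twice with reversed colors, game pairs are treated as the independent samples, giving a pentanomial outcome distribution:

```cpp
#include <cmath>

// pairCount[k] = number of opening pairs in which the engine scored k/2
// points (k = 0..4), i.e. 0, 0.5, 1, 1.5 or 2 points from the two games.
void eloEstimate(const int pairCount[5], double& elo, double& sDev) {
    int nPairs = 0;
    for (int k = 0; k < 5; k++)
        nPairs += pairCount[k];
    double mean = 0; // mean score per game, in (0,1)
    for (int k = 0; k < 5; k++)
        mean += pairCount[k] * (k / 4.0);
    mean /= nPairs;
    double var = 0; // variance of the mean per-pair score
    for (int k = 0; k < 5; k++)
        var += pairCount[k] * (k / 4.0 - mean) * (k / 4.0 - mean);
    var /= (double)nPairs * nPairs;
    elo = -400 * std::log10(1 / mean - 1);
    // Delta method: d(elo)/d(score) = 400 / (ln(10) * s * (1 - s))
    sDev = 400 / (std::log(10.0) * mean * (1 - mean)) * std::sqrt(var);
}
```

For example, 1000 pairs with a mean score of 0.68 give roughly +131 elo with a standard deviation of about 6 elo, consistent with the 4 to 6 elo range quoted above.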
Other cluster tests
Some test results using more than one core per cluster node:
Code: Select all
dElo sDev win loss draw depth1 depth2
t1c4n4 vs t1c4n1 +92.8 5.8 316 55 629 18.78 17.97
t1c32n2 vs t1c32n1 +39.2 5.6 217 102 705 20.73 20.17
t1c4n2 vs t1c4n1 +37.7 5.7 221 113 666 18.75 18.34
t1c2n2 vs t1c2n1 +31.4 6.7 228 138 634 17.74 17.43
t1c2n4 vs t1c2n2 +42.2 6.1 237 116 647 18.09 17.66
t1c2n8 vs t1c2n4 +31.4 5.7 203 113 684 18.49 18.19
t1c2n16 vs t1c2n8 +32.1 5.7 206 114 680 18.97 18.58
The last four rows can be used to extend the summary table above with one more row:
Code: Select all
config X1 X2 X4 X8 X16 X32
tXc1n1 - 112 100 85 72 63
t1cXn1 - 72 76 73 66 44
t1c1nX - 50 46 51 29 --
t1c2nX - 31 42 31 32 --
Varying transposition table size
With the given hardware, texel fills the transposition table at around 16MB/s per thread. This means that the fixed 128MB hash table used in the previous tests is too small to hold the entire search tree when the thinking time, number of cores, or number of cluster nodes is large. To see if this affected the results, some tests were repeated with a 512MB hash table.
Code: Select all
dElo sDev win loss draw depth1 depth2
t8c1n1 vs t4c1n1 +97.1 4.2 654 109 1237 20.57 19.24
t16c1n1 vs t8c1n1 +83.4 4.2 604 133 1263 21.89 20.60
t32c1n1 vs t16c1n1 +72.4 4.0 538 127 1335 23.18 21.89
Some tests were also repeated with a 1MB transposition table, which is too small to hold the whole search tree for all configurations.
Code: Select all
dElo sDev win loss draw depth1 depth2
t2c1n1 vs t1c1n1 +103.3 4.1 875 182 1343 16.90 15.82
t4c1n1 vs t2c1n1 +77.5 4.4 619 180 1201 18.06 17.02
t8c1n1 vs t4c1n1 +77.0 4.3 603 167 1230 19.11 18.08
t1c2n1 vs t1c1n1 +48.6 4.4 544 266 1190 16.35 16.11
t1c4n1 vs t1c2n1 +38.8 3.1 953 508 2539 16.89 16.48
t1c8n1 vs t1c4n1 +46.3 4.2 465 200 1335 17.52 17.07
Using the above data, the summary table can be extended:
Code: Select all
config X1 X2 X4 X8 X16 X32
tXc1n1 - 112 100 85 72 63
tXc1n1' - 97 83 72 // 512MB hash
tXc1n1'' - 103 78 77 // 1MB hash
t1cXn1 - 72 76 73 66 44
t1cXn1'' - 49 39 46 // 1MB hash
t1c1nX - 50 46 51 29 --
t1c2nX - 31 42 31 32 --
It can be seen that a too small transposition table hurts more when doubling the number of cores than when doubling the thinking time.
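The interaction with the 16MB/s fill rate mentioned above can be made concrete with a trivial helper (my own back-of-the-envelope arithmetic, not texel code):

```cpp
// Seconds until a hash table of the given size has been filled once,
// assuming texel's measured fill rate of ~16MB/s per search thread.
double secondsToFillHash(double hashMB, int numThreads) {
    const double fillRateMBps = 16.0; // per thread
    return hashMB / (fillRateMBps * numThreads);
}
```

With 16 threads a 128MB table is filled after about half a second of search, while a single thread takes 8 seconds, which is one way to see why a too small table hurts the multi-core configurations more.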
Possible NUMA issues
As mentioned above, the 32 core configuration was the only one running on more than one NUMA node. The result 32 cores vs 16 cores (+44 elo) was also somewhat lower than expected. To further test NUMA behavior, a 16 core vs 8 core match was played on the 16 core computer (which has two NUMA nodes) and a 24 core vs 12 core match was played on the 24 core computer (which also has two NUMA nodes).
Code: Select all
dElo sDev win loss draw depth1 depth2
t1c16n1 vs t1c8n1 +44.4 5.6 219 92 689 19.62 18.95 // 128MB hash
t1c16n1 vs t1c8n1 +48.3 6.1 240 102 658 19.68 19.00 // 1024MB hash
t1c24n1 vs t1c12n1 +41.5 5.9 211 92 697 20.12 19.58 // 128MB hash
t1c24n1 vs t1c12n1 +47.5 6.0 229 93 678 20.39 19.80 // 1024MB hash
The elo increase is similar to the earlier 32 core vs 16 core match, and the 16 core vs 8 core result (+44.4 elo) is lower than the earlier non-NUMA 16 core vs 8 core result (+66 elo).
Test data
All games played for these measurements are available in this archive file.
The tests used the texel development version 1.07a27, available for download here.
Turbo boost effects
Turbo boost was enabled for all tests. This makes the computers run faster than the base clock frequency, but the speed increase is larger when fewer cores are in use. This can potentially skew the results in favor of configurations using few cores, i.e. the SMP algorithm can appear to scale worse than it would on hypothetical hardware that runs at the same clock speed regardless of how many cores are in use.
All the computers used have very good cooling, so the actual turbo boost frequency can be calculated from the processor's turbo boost specification, as explained here. The following table shows the turbo boost multiplier as a function of the number of active cores:
Code: Select all
18 17 16 15 14 13 12 11 10 9 8 7 6 5 4 3 2 1 base incr
E5-2686 v4 : 4 4 4 4 4 4 4 4 4 4 4 4 4 4 4 5 7 7 2.3GHz 0.1GHz
E5-2680 v3 : 4 4 4 4 4 4 4 4 5 6 8 8 2.5GHz 0.1GHz
E5-2670 0 : 4 4 5 5 6 6 7 7 2.6GHz 0.1GHz
As an example of the worst case, consider a match between 1 core and 12 cores on the E5-2680 computer. When the 1 core engine is thinking, only one core is active, so the clock frequency is 2.5+8*0.1 = 3.3GHz. When the 12 core engine is thinking, all 12 cores are active, so the clock frequency is 2.5+4*0.1 = 2.9GHz. This is as if the 12 core engine had a 12% time handicap (1-2.9/3.3), so to measure the true algorithmic 12 core vs 1 core improvement, the 12 core engine would have to be given 12% more thinking time.
However, these tests were run in a way that does not trigger the worst case, since only 2*N vs 1*N core matches were played (for various values of N), and as many games as possible were played in parallel on each test computer. For example, consider a 4 core vs 2 core match played on the E5-2680 computer. If only one game were played at a time, the speed difference would be 1-(2.5+5*0.1)/(2.5+8*0.1) = 9%. However, since only 4 cores are needed for each game, 3 games are played in parallel. This means that at least 2*3 cores are active at any time, so the table above shows that the turbo boost multiplier is always 4, and hence the speed difference is zero.
I think the largest difference that actually occurred in the test data is 8 cores vs 4 cores played on the E5-2670 computer. In this case the speed difference is 1-(2.6+4*0.1)/(2.6+6*0.1)=6%.
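The frequency arithmetic in these examples can be written down directly (a sketch; the multiplier row for the E5-2680 v3 is read from the table above, the function names are my own):

```cpp
// Effective turbo clock in GHz, given one CPU's multiplier table row.
// mult[i] = turbo boost multiplier when i+1 cores are active.
double turboFreq(const int* mult, int activeCores, double base, double incr) {
    return base + mult[activeCores - 1] * incr;
}

// Relative time handicap of the side that keeps more cores busy.
double timeHandicap(const int* mult, int fewCores, int manyCores,
                    double base, double incr) {
    return 1.0 - turboFreq(mult, manyCores, base, incr) /
                 turboFreq(mult, fewCores, base, incr);
}
```

For the E5-2680 v3 row (indexed by active cores: 8, 8, 6, 5, then 4 for 5 to 12 cores), `timeHandicap(mult, 1, 12, 2.5, 0.1)` reproduces the roughly 12% worst case from the example above.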
-
- Posts: 689
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Possible improvements
For large clusters, the cluster nodes are often connected in a regular structure, such as a 3D or 5D hyper torus. If the tree of cluster nodes created by the algorithm took the cluster connectivity graph into account, the communication could be more efficient. For example, a breadth-first search starting from node 0 could be used to construct the tree. It is unknown how much this would help, though, since large clusters also often have faster interconnects than Ethernet, so the communication may already be fast enough without this optimization.
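The suggested breadth-first tree construction could look like this (an untested sketch; `bfsTree` and the adjacency-list representation are my own):

```cpp
#include <queue>
#include <vector>

// Build a spanning tree of the cluster by BFS from node 0, so that
// parent/child communication links follow the interconnect topology.
// adj[i] lists the cluster nodes directly connected to node i.
std::vector<int> bfsTree(const std::vector<std::vector<int>>& adj) {
    std::vector<int> parent(adj.size(), -1);
    std::vector<bool> visited(adj.size(), false);
    std::queue<int> q;
    visited[0] = true;
    q.push(0);
    while (!q.empty()) {
        int u = q.front(); q.pop();
        for (int v : adj[u]) {
            if (!visited[v]) {
                visited[v] = true;
                parent[v] = u; // node v talks directly to its parent u
                q.push(v);
            }
        }
    }
    return parent; // parent[0] == -1 for the root
}
```

On a ring of four nodes (0-1-2-3-0), for example, nodes 1 and 3 become children of node 0, and node 2 becomes a child of node 1, so every link in the tree is a direct interconnect link.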
The elo measurements showed that when the transposition table is not large enough to contain the entire search tree, multi core searches suffer more than increased time searches. It might be possible to change the hash replacement strategy to improve both multi core and single core performance when the hash table is too small.
The tests where more than one NUMA node was used showed worse performance than expected. Texel tries to allocate the entire transposition table on NUMA node 0 and threads running on other NUMA nodes skip the hash table probe if the remaining search depth is 0. This logic is inherited from the old texel SMP algorithm, where measurements showed that it gave a noticeable reduction in the number of remote memory accesses. It is possible that this logic is bad for elo though, in which case it should be removed.
The elo measurements showed that adding more cluster nodes gave less elo improvement than adding the same amount of additional cores in the SMP case. There are at least four cluster overheads compared to SMP:
* The main thread on each cluster node has to handle the MPI communication.
* The main thread on each cluster node has to handle the insertion of hash table entries into the node local hash table.
* The delay between generating a hash table entry and the entry being available for all search threads is larger.
* Hash table entries that have a too small depth are not shared between cluster nodes.
Measurements could be performed to determine the size of these effects. It might be possible to reduce the overheads, for example by offloading some of the work to a special communication thread.
The extra depth used by helper threads is currently computed as "threadNumber % 2". If there is a large number of helper threads it might be better to let some of them search at even larger depths.
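To illustrate, here is the current scheme next to one possible generalization (the second function is only my own suggestion, untested):

```cpp
// Current texel scheme: helper threads alternate between +0 and +1 extra depth.
int extraDepthCurrent(int threadNumber) {
    return threadNumber % 2;
}

// Possible generalization: give geometrically shrinking groups of helper
// threads increasing extra depth, so that with many threads a few of them
// search two or more plies deeper than the master thread.
int extraDepthProposed(int threadNumber, int numThreads) {
    int extra = 0;
    int group = numThreads;
    while (group > 1 && threadNumber >= (group + 1) / 2) {
        threadNumber -= (group + 1) / 2; // move past the current group
        group -= (group + 1) / 2;        // remaining threads form the next group
        extra++;
    }
    return extra;
}
```

With 16 helper threads this gives 8 threads at +0, 4 at +1, 2 at +2, and the last two at +3 and +4.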
Finally, tests have shown that when running on 16 cores on a single computer, the lazy SMP algorithm is around 12 elo weaker than texel's old SMP algorithm. It is not obvious how to extend the old algorithm to clusters, but it is possible that some sort of generalization of that algorithm could perform better on clusters than the current lazy algorithm does.
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: Lazy SMP and "lazy cluster" experiments
Hi Peter,
I did similar experiments in scorpio a few years ago with smp and cluster algorithms. Three algorithms were included in my tests: YBW, ABDADA and SHT (shared hash table, or the lazy algorithm), for both smp and cluster, with the default being YBW for both. Of course on the cluster you have the additional issue of a distributed/local transposition table, which is fundamental to the performance of the last two algorithms. I think the depth of transposition table entries shared between nodes was too high for my tests because of the slow interconnect, but maybe ABDADA would be a better default option for the cluster on a faster interconnect. My SHT was pretty lazy in that I don't do different depth searches with different threads/processes. To this day, I feel like playing with root search depths is a kludge and ABDADA is the right algorithm for an approach that relies only on a shared transposition table. Note that ABDADA was actually first proposed as a distributed algorithm, as was YBW for that matter.
Daniel
-
- Posts: 689
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Lazy SMP and "lazy cluster" experiments
Hi Daniel,
Daniel Shawul wrote:Hi Peter,
I did similar experiments in scorpio a few years ago with smp and cluster algorithms. Three algorithms were included in my tests: YBW, ABDADA and SHT (shared hash table, or the lazy algorithm), for both smp and cluster, with the default being YBW for both. Of course on the cluster you have the additional issue of a distributed/local transposition table, which is fundamental to the performance of the last two algorithms. I think the depth of transposition table entries shared between nodes was too high for my tests because of the slow interconnect, but maybe ABDADA would be a better default option for the cluster on a faster interconnect. My SHT was pretty lazy in that I don't do different depth searches with different threads/processes. To this day, I feel like playing with root search depths is a kludge and ABDADA is the right algorithm for an approach that relies only on a shared transposition table. Note that ABDADA was actually first proposed as a distributed algorithm, as was YBW for that matter.
Daniel
I actually read a lot of old interesting posts from you for inspiration while working on texel's cluster algorithm. I remember reading that you had trouble getting good performance because of overhead caused by waiting for responses from other cluster nodes. I wanted to avoid that and therefore only considered algorithms where no node ever has to wait for an answer from any other node during search.
Lazy SMP is easy to implement in a totally asynchronous way, so I went with that. I am not completely happy that I lose around 12 elo on 16 core SMP compared to the old SMP algorithm though. For the time being, lazy cluster easily beats non-lazy SMP, since the cluster makes it possible to use many more cores, so that is what I use in HGM's monthly tournaments.
It is possible that ABDADA could work well even if hash table updates are not atomic. I think I read somewhere that ABDADA can be made fault tolerant by not using an "nproc" counter, but instead just using a boolean "visited" flag in the hash table. If it is fault tolerant, I think the worst that could happen is that it falls back to lazy SMP behavior when several threads/nodes modify the "visited" flag concurrently. It could be interesting to test this.
Regarding minimum sharing depth, I did a quick and dirty measurement of that for the root node when running on two single threaded cluster nodes connected by 1 gigabit Ethernet (ping time around 0.2ms). The average minimum sharing depth when searching in the start position was 1.04, distributed like this:
Code: Select all
min depth % of time
0 7.04
1 82.21
2 10.73
3 0.02
My long term goal is to figure out an algorithm for a distributed asynchronous approximate priority queue, which I could use as a building block when generalizing the old texel SMP algorithm to work in a cluster.
-
- Posts: 570
- Joined: Mon Jul 20, 2015 5:06 pm
Re: Lazy SMP and "lazy cluster" experiments
Interesting. I was playing around with something like this today. Normally, hash depth > current node depth would indicate that the node has already been searched, and the search can therefore safely return the stored hash value. However, if a stored hash flag says the node is "busy" being searched by another thread, what should be done? If the current thread continues, it just duplicates another thread's work. If the current thread wants to avoid this, what hash value should it return?
petero2 wrote:It is possible that ABDADA could work well even if hash table updates are not atomic. I think I read somewhere that ABDADA can be made fault tolerant by not using an "nproc" counter, but instead just using a boolean "visited" flag in the hash table. If it is fault tolerant, I think the worst that could happen is that it falls back to lazy SMP behavior when several threads/nodes modify the "visited" flag concurrently. It could be interesting to test this.
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: Lazy SMP and "lazy cluster" experiments
Last time I tried ABDADA it was a big loser, but that was a long time ago and not on modern hardware.
I am kind of afraid to touch my SMP implementation because it works very reliably and it took a long time to get it there. But it doesn't scale as well as I would like. I have this on my to-do list to look at but it is not at the top of the list.
Thanks Peter for your very detailed writeup.
--Jon
-
- Posts: 689
- Joined: Mon Apr 19, 2010 7:07 pm
- Location: Sweden
- Full name: Peter Osterlund
Re: Lazy SMP and "lazy cluster" experiments
The idea was that the busy flag is only a hint that affects the order in which moves are searched, so if there is a conflict between the busy flag and the result already stored in a TT entry, the busy flag is ignored.
D Sceviour wrote:Interesting. I was playing around with something like this today. Normally, hash depth > current node depth would indicate that the node has already been searched, and the search can therefore safely return the stored hash value. However, if a stored hash flag says the node is "busy" being searched by another thread, what should be done? If the current thread continues, it just duplicates another thread's work. If the current thread wants to avoid this, what hash value should it return?
petero2 wrote:It is possible that ABDADA could work well even if hash table updates are not atomic. I think I read somewhere that ABDADA can be made fault tolerant by not using an "nproc" counter, but instead just using a boolean "visited" flag in the hash table. If it is fault tolerant, I think the worst that could happen is that it falls back to lazy SMP behavior when several threads/nodes modify the "visited" flag concurrently. It could be interesting to test this.
Something like this (untested):
Code: Select all
int abdada(Position& pos, int alpha, int beta, int depth, bool exclusive) {
    // Check for abort
    // Repetition check

    TTEntry& e = tt.probe(pos);
    if (e.cutOff(alpha, beta, depth))
        return e.score;
    if (exclusive && e.busy)
        return BUSY;   // BUSY is a special score outside the normal eval range
    e.busy = true;

    // null move, forward pruning, etc

    MoveList moves = generateMoves(pos);
    bool searched[256] = { false };
    int nextDepth[256];
    int bestScore = -INF;
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < moves.size(); i++) {
            if (pass == 0)
                nextDepth[i] = depth - 1 + extension - reduction;
            else if (searched[i])
                continue;
            // First pass: all moves except the first are searched exclusively,
            // so they return BUSY instead of duplicating another thread's work
            bool nextExclusive = pass == 0 && i > 0;
            pos.makeMove(moves[i]);
            int score = -abdada(pos, -beta, -alpha, nextDepth[i], nextExclusive);
            pos.unMakeMove(moves[i]);
            searched[i] = score != -BUSY;
            // Update alpha and bestScore; a fail high should end both passes
        }
    }
    tt.store(pos, bestScore); // Also sets entry.busy = false
    return bestScore;
}
I don't know if this algorithm would be better than lazy SMP, or if it can be combined with the "depth + 1" idea from lazy SMP. It would have to be tested by playing games.