SMP NPS measurements
Posted: Sun Aug 06, 2017 11:16 pm
To check if NPS scales linearly with the number of search threads as expected, I ran a series of tests on an Amazon EC2 r4.16xlarge instance. The instance runs Amazon Linux and has 480GB RAM. "lscpu" gives the following hardware details:
Code:
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 64
On-line CPU(s) list: 0-63
Thread(s) per core: 2
Core(s) per socket: 16
Socket(s): 2
NUMA node(s): 2
Vendor ID: GenuineIntel
CPU family: 6
Model: 79
Model name: Intel(R) Xeon(R) CPU E5-2686 v4 @ 2.30GHz
Stepping: 1
CPU MHz: 1644.842
BogoMIPS: 4661.99
Hypervisor vendor: Xen
Virtualization type: full
L1d cache: 32K
L1i cache: 32K
L2 cache: 256K
L3 cache: 46080K
NUMA node0 CPU(s): 0-15,32-47
NUMA node1 CPU(s): 16-31,48-63
I let texel search for 20 seconds from the starting position, using different numbers of search threads and different hash table sizes. Each search was repeated five times.

When using 1 thread and 1MB hash, the speed was 1.47 MN/s.
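For reference, one such 20-second measurement can be set up with plain UCI commands. This is only a minimal sketch, assuming texel's standard Threads and Hash options; the values shown are just one of the tested combinations:

Code:
uci
setoption name Hash value 1024
setoption name Threads value 16
isready
ucinewgame
position startpos
go movetime 20000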
For other combinations of search threads and hash size, the measured speed relative to the 1 thread/1MB value was:
Code:
threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
      1       1   1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
      2   2.010   1.999   2.031   2.015   2.031   2.028   1.970   2.036   2.028   1.980
      4   4.002   3.995   4.008   4.017   4.060   4.033   3.924   4.058   4.038   3.931
      8   7.944   7.966   7.968   8.007   8.012   8.059   7.831   8.068   8.037   7.836
     16  15.718  15.822  15.846  15.747  15.882  15.871  15.418  16.007  15.923  15.392
     32  30.781  30.855  30.790  30.843  31.267  31.187  30.386  31.690  31.830  30.919
     64  36.540  36.070  36.254  36.346  36.779  37.108  36.076  37.836  38.380  38.217
If each entry is divided by the number of search threads used, we get the following relative speeds per thread (for example, 30.781 / 32 ≈ 0.962 for 32 threads and 1MB hash):

Code:
threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
      1       1   1.011   1.015   1.013   1.034   1.024   0.980   1.011   1.005   0.986
      2   1.005   1.000   1.016   1.008   1.016   1.014   0.985   1.018   1.014   0.990
      4   1.000   0.999   1.002   1.004   1.015   1.008   0.981   1.015   1.010   0.983
      8   0.993   0.996   0.996   1.001   1.001   1.007   0.979   1.008   1.005   0.980
     16   0.982   0.989   0.990   0.984   0.993   0.992   0.964   1.000   0.995   0.962
     32   0.962   0.964   0.962   0.964   0.977   0.975   0.950   0.990   0.995   0.966
     64   0.571   0.564   0.566   0.568   0.575   0.580   0.564   0.591   0.600   0.597
The time in seconds required to initialize the hash table is given by the following table:

Code:
threads      1M      4M     16M     64M    256M      1G      4G     16G     64G    256G
      1   0.023   0.022   0.021   0.041   0.099   0.022   0.022   4.817  19.302  84.250
      2   0.022   0.022   0.022   0.041   0.100   0.023   0.023   4.910  19.527  84.641
      4   0.024   0.022   0.023   0.041   0.100   0.023   0.023   4.922  19.663  84.971
      8   0.023   0.024   0.023   0.043   0.100   0.024   0.023   4.949  19.602  85.991
     16   0.026   0.026   0.025   0.044   0.104   0.026   0.026   4.944  19.649  85.229
     32   0.030   0.031   0.031   0.049   0.108   0.030   0.030   4.944  19.773  85.196
     64   0.041   0.042   0.042   0.060   0.118   0.041   0.040   4.933  19.410  85.199
Analysis

NPS scales almost linearly with the number of search threads and is only very slightly affected by the hash table size. There are some anomalies, however:
* For 64 threads the NPS scaling is only 58%, but this is expected because the machine only has 32 cores (but 64 hyperthreads).
* For 4GB hash the NPS is a few percent lower than for other hash sizes. On this machine, 8GB RAM has been reserved for 1GB huge pages, which means that for hash sizes 1GB and 4GB the memory allocation is served using 1GB huge pages. For other hash table sizes, 2MB huge pages are used, since the "transparent huge pages" feature is enabled in the kernel. It can be seen that the initialization time is significantly smaller for 1GB and 4GB hash sizes, which is expected when 1GB huge pages are used. It is, however, surprising that NPS is reduced when four 1GB huge pages are used for the hash table. (A sketch of this kind of allocation is given after this list.)
* There is a small slowdown when 32 threads are used and when 256GB hash is used. For these tests texel was run in NUMA-aware mode, which means that it puts all search threads on the same NUMA node if possible. This is possible when using at most 16 threads, but not when using 32 threads. Also, texel tries to allocate the hash table on NUMA node 0, but for the 256GB hash case there is not enough memory on node 0, so some of the hash table is allocated on node 1. The observed NPS values seem reasonable given how texel handles NUMA hardware.
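For reference, the kind of allocation described in the last two points could look roughly like the sketch below. This is not texel's actual code, just a minimal illustration assuming Linux, 1GB hugetlb pages reserved at boot, and libnuma; the function names are made up for the example:

Code:
#include <sys/mman.h>
#include <numa.h>        // libnuma, link with -lnuma
#include <cstddef>
#include <cstring>

#ifndef MAP_HUGE_1GB
#define MAP_HUGE_1GB (30 << 26)   // 30 = log2(1GB), 26 = MAP_HUGE_SHIFT
#endif

// Allocate a hash table of "size" bytes, preferring explicit 1GB huge pages
// (size must then be a multiple of 1GB) and falling back to a normal
// anonymous mapping, which the kernel can back with 2MB pages through
// transparent huge pages.
void* allocHashTable(std::size_t size) {
    void* mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                     MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB | MAP_HUGE_1GB,
                     -1, 0);
    if (mem == MAP_FAILED)
        mem = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (mem == MAP_FAILED)
        return nullptr;

    // Prefer NUMA node 0 for the table. The policy only takes effect when
    // the pages are first touched, so set it before clearing the memory.
    // If node 0 does not have enough free memory, the kernel places the
    // remaining pages on another node.
    if (numa_available() != -1)
        numa_tonode_memory(mem, size, 0);

    std::memset(mem, 0, size);   // touching every page is the "init" cost
    return mem;
}

// Bind the calling search thread to NUMA node 0, so that all threads run
// on the same node when the thread count allows it.
void bindSearchThreadToNode0() {
    if (numa_available() != -1)
        numa_run_on_node(0);
}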
It is quite remarkable that NPS is almost equal when using a 1MB hash table that fits completely in L3 cache, and when using a 64GB hash table, which is around 1000 times larger than the L3 cache. Texel uses prefetch instructions to improve hash table access times. Possibly that causes most of the memory latency to be hidden.
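As an illustration of the prefetching idea (again not texel's actual code; the entry layout and indexing are simplified for the example), the table address can be computed from the zobrist key as soon as a position is reached, and a prefetch issued so that the memory access overlaps with move generation and evaluation:

Code:
#include <cstdint>

// Simplified transposition table; the real entry layout and replacement
// scheme in texel are different, this only illustrates the prefetch.
struct TTEntry {
    std::uint64_t key;
    std::uint64_t data;
};

static TTEntry* table;            // allocated elsewhere, power-of-two entries
static std::uint64_t indexMask;   // number of entries - 1

// Issue a prefetch for the entry belonging to a position as soon as its
// zobrist key is known (e.g. right after making a move). By the time the
// search actually probes the table, the cache line is hopefully already in
// flight, hiding most of the DRAM latency. __builtin_prefetch is a
// GCC/Clang builtin.
inline void prefetchTTEntry(std::uint64_t zobristKey) {
    __builtin_prefetch(&table[zobristKey & indexMask]);
}

// The later probe then reads the same cache line.
inline bool probeTT(std::uint64_t zobristKey, TTEntry& result) {
    const TTEntry& e = table[zobristKey & indexMask];
    if (e.key != zobristKey)
        return false;
    result = e;
    return true;
}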