Lazy SMP and "lazy cluster" experiments


petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Possible improvements

Post by petero2 »

For large clusters the cluster nodes are often connected in a regular structure, such as a 3D or 5D hypertorus. If the tree of cluster nodes created by the algorithm took the cluster connectivity graph into account, the communication could be more efficient. For example, a breadth first search starting from node 0 could be used to construct the tree, as sketched below. It is unknown how much this would help though, since large clusters often also have faster interconnects than Ethernet, so communication may already be fast enough without this optimization.
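
As a rough illustration, the breadth first search could build the tree like this (a minimal sketch; the adjacency list representation is an assumption):

Code: Select all

#include <queue>
#include <vector>

// Build a communication tree from the cluster connectivity graph using a
// breadth first search starting from node 0. "adj" is an adjacency list
// saying which cluster nodes have direct links. Returns parent[i] for each
// node (-1 for the root); the parent/child edges define the tree.
std::vector<int> buildCommTree(const std::vector<std::vector<int>>& adj) {
    std::vector<int> parent(adj.size(), -1);
    std::vector<bool> visited(adj.size(), false);
    std::queue<int> q;
    q.push(0);
    visited[0] = true;
    while (!q.empty()) {
        int n = q.front(); q.pop();
        for (int neighbor : adj[n]) {
            if (!visited[neighbor]) {
                visited[neighbor] = true;
                parent[neighbor] = n;   // neighbor talks only to n
                q.push(neighbor);
            }
        }
    }
    return parent;
}
Since every tree edge is then a direct link in the interconnect, parent/child messages never have to cross the whole machine.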

The elo measurements showed that when the transposition table is not large enough to contain the entire search tree, multi-core searches suffer more than searches that are given extra time instead. It might be possible to improve the hash replacement strategy to improve both multi-core and single-core performance when the hash table is too small.
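
One common family of replacement schemes that could be a starting point (a sketch only, not necessarily what texel does) keeps a small bucket per probe and prefers to evict entries from old searches, falling back to the shallowest entry:

Code: Select all

#include <cstdint>

// Sketch of a two-tier bucket replacement scheme: "generation" is a counter
// bumped for every new search. A new entry first replaces any entry left
// over from an old search; failing that, it replaces the shallowest entry
// in the bucket. Deep entries thus survive longer when the table is too
// small, while stale entries are still recycled quickly.
struct HashEntry {
    uint64_t key;
    int16_t score;
    int8_t depth;
    uint8_t generation;   // id of the search that stored this entry
};

HashEntry* selectVictim(HashEntry* bucket, int bucketSize, uint8_t curGen) {
    HashEntry* victim = &bucket[0];
    for (int i = 0; i < bucketSize; i++) {
        if (bucket[i].generation != curGen)   // entry from an old search
            return &bucket[i];
        if (bucket[i].depth < victim->depth)  // otherwise prefer shallowest
            victim = &bucket[i];
    }
    return victim;
}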

The tests where more than one NUMA node was used showed worse performance than expected. Texel tries to allocate the entire transposition table on NUMA node 0 and threads running on other NUMA nodes skip the hash table probe if the remaining search depth is 0. This logic is inherited from the old texel SMP algorithm, where measurements showed that it gave a noticeable reduction in the number of remote memory accesses. It is possible that this logic is bad for elo though, in which case it should be removed.
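
A minimal sketch of that inherited probe filter (hypothetical names):

Code: Select all

// The table lives on NUMA node 0; threads on other NUMA nodes skip the
// probe when the remaining depth is 0, since depth-0 nodes are by far the
// most numerous and their expected cutoff benefit is smallest relative to
// the remote memory access cost.
bool shouldProbeTT(int threadNumaNode, int depth) {
    const int ttNumaNode = 0;   // texel allocates the table on node 0
    if (threadNumaNode != ttNumaNode && depth == 0)
        return false;           // skip the remote probe
    return true;
}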

The elo measurements showed that adding more cluster nodes gave less elo improvement than adding the same amount of additional cores in the SMP case. There are at least four cluster overheads compared to SMP:
* The main thread on each cluster node has to handle the MPI communication.
* The main thread on each cluster node has to handle the insertion of hash table entries into the node local hash table.
* The delay between generating a hash table entry and the entry being available for all search threads is larger.
* Hash table entries whose depth is too small are not shared between cluster nodes.
Measurements could be performed to determine the size of these effects. It might be possible to reduce the overheads, for example by offloading some of the work to a special communication thread.

The extra depth used by helper threads is currently computed as "threadNumber % 2". If there is a large number of helper threads it might be better to let some of them search at even larger depths.
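
For example, the current scheme and one hypothetical (untested) alternative:

Code: Select all

#include <cmath>

// Current scheme: helper threads alternate between +0 and +1 extra depth.
int extraDepthCurrent(int threadNumber) {
    return threadNumber % 2;
}

// Hypothetical alternative for large thread counts: let the extra depth
// grow logarithmically with the thread number, so that with e.g. 63
// helpers a few of them search several plies deeper instead of everyone
// clustering at +0/+1.
int extraDepthLog(int threadNumber) {
    return (int)std::log2(threadNumber + 1);
}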

Finally, tests have shown that when running on 16 cores on a single computer, the lazy SMP algorithm is around 12 elo weaker than texel's old SMP algorithm. It is not obvious how to extend the old algorithm to clusters, but it is possible that some sort of generalization of that algorithm could perform better on clusters than the current lazy algorithm does.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy SMP and "lazy cluster" experiments

Post by Daniel Shawul »

Hi Peter,
I did similar experiments in scorpio a few years ago with smp and cluster algorithms. Three algorithms were included in my tests: YBW, ABDADA and SHT (shared hash table, i.e. the lazy algorithm), for both smp and cluster, with YBW being the default for both. Of course on the cluster you have the additional issue of a distributed/local transposition table, which is fundamental to the performance of the last two algorithms. I think the minimum depth of transposition table entries shared between nodes was too high in my tests because of the slow interconnect, but maybe ABDADA would be a better default option for the cluster on a faster interconnect. My SHT was pretty lazy in that I don't do different depth searches with different threads/processes. To this day, I feel like playing with root search depths is a kludge and ABDADA is the right choice for an algorithm that relies only on a shared transposition table. Note that ABDADA was actually first proposed as a distributed algorithm, as is YBW as a matter of fact.
Daniel
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Lazy SMP and "lazy cluster" experiments

Post by petero2 »

Daniel Shawul wrote:Hi Peter,
I did similar experiments in scorpio a few years ago with smp and cluster algorithms. Three algorithms were included in my tests: YBW, ABDADA and SHT (shared hash table, i.e. the lazy algorithm), for both smp and cluster, with YBW being the default for both. Of course on the cluster you have the additional issue of a distributed/local transposition table, which is fundamental to the performance of the last two algorithms. I think the minimum depth of transposition table entries shared between nodes was too high in my tests because of the slow interconnect, but maybe ABDADA would be a better default option for the cluster on a faster interconnect. My SHT was pretty lazy in that I don't do different depth searches with different threads/processes. To this day, I feel like playing with root search depths is a kludge and ABDADA is the right choice for an algorithm that relies only on a shared transposition table. Note that ABDADA was actually first proposed as a distributed algorithm, as is YBW as a matter of fact.
Daniel
Hi Daniel,

I actually read a lot of old interesting posts from you for inspiration while working on texel's cluster algorithm. I remember reading that you had trouble getting good performance because of overhead caused by waiting for responses from other cluster nodes. I wanted to avoid that and therefore only considered algorithms where no node ever has to wait for an answer from any other node during search.

Lazy SMP is easy to implement in a totally asynchronous way, so I went with that. I am not completely happy that I lose around 12 elo on 16-core SMP compared to the old SMP algorithm though. For the time being lazy cluster easily beats non-lazy SMP, since the cluster makes it possible to use many more cores, so that is what I use in HGM's monthly tournaments.

It is possible that ABDADA could work well even if hash table updates are not atomic. I think I read somewhere that ABDADA can be made fault tolerant by not using an "nproc" counter, but instead just using a boolean "visited" flag in the hash table. If it is fault tolerant, I think the worst that could happen is that it falls back to lazy SMP behavior if several threads/nodes modify the "visited" flag concurrently. It could be interesting to test this.
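
For illustration, the difference in bookkeeping could look like this (a sketch with hypothetical field layouts):

Code: Select all

#include <cstdint>

// Classic ABDADA: an "nproc" counter that is incremented when a thread
// starts searching the position and decremented when it finishes, which
// needs atomic updates to stay consistent.
struct EntryClassicAbdada {
    uint64_t key;
    int16_t score;
    int8_t depth;
    uint8_t nproc;   // threads currently searching this position
};

// Fault tolerant variant: a single "visited"/"busy" bit, set when a search
// of the position starts and cleared when the result is stored. A lost
// concurrent update only makes the hint wrong; the search result itself
// stays correct, so no atomicity is needed.
struct EntryLazyAbdada {
    uint64_t key;
    int16_t score;
    int8_t depth;
    bool busy;       // hint only
};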

Regarding minimum sharing depth, I did a quick and dirty measurement of that for the root node when running on two single threaded cluster nodes connected by 1 gigabit Ethernet (ping time around 0.2ms). The average minimum sharing depth when searching in the start position was 1.04, distributed like this:

Code: Select all

min depth   % of time
0              7.04
1             82.21
2             10.73
3              0.02
I plan to do a more accurate measurement later in several different cluster configurations.

My long term goal is to figure out an algorithm for a distributed asynchronous approximate priority queue, which I could use as a building block when generalizing the old texel SMP algorithm to work in a cluster.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Lazy SMP and "lazy cluster" experiments

Post by D Sceviour »

petero2 wrote:It is possible that ABDADA could work well even if hash table updates are not atomic. I think I read somewhere that ABDADA can be made fault tolerant by not using an "nproc" counter, but instead just using a boolean "visited" flag in the hash table. If it is fault tolerant, I think the worst that could happen is that it falls back to lazy SMP behavior if several threads/nodes modify the "visited" flag concurrently. It could be interesting to test this.
Interesting. I was playing around with something like this today. Normally, hash depth > current node depth would indicate that the node has already been searched, so the search can safely return the stored hash value. However, if a stored hash flag says the node is "busy" being searched by another thread, then what should be done? If the current thread continues, it is just duplicating another thread's work. If the current thread wants to avoid this, then what hash value should be returned?
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Lazy SMP and "lazy cluster" experiments

Post by jdart »

Last time I tried ABDADA it was a big loser, but that was a long time ago and not on modern hardware.

I am kind of afraid to touch my SMP implementation because it works very reliably and it took a long time to get it there. But it doesn't scale as well as I would like. I have this on my to-do list to look at but it is not at the top of the list.

Thanks Peter for your very detailed writeup.

--Jon
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Lazy SMP and "lazy cluster" experiments

Post by petero2 »

D Sceviour wrote:
petero2 wrote:It is possible that ABDADA could work well even if hash table updates are not atomic. I think I read somewhere that ABDADA can be made fault tolerant by not using an "nproc" counter, but instead just using a boolean "visited" flag in the hash table. If it is fault tolerant, I think the worst that could happen is that it falls back to lazy SMP behavior if several threads/nodes modify the "visited" flag concurrently. It could be interesting to test this.
Interesting. I was playing around with something like this today. Normally, hash depth > current node depth would indicate that the node has already been searched, so the search can safely return the stored hash value. However, if a stored hash flag says the node is "busy" being searched by another thread, then what should be done? If the current thread continues, it is just duplicating another thread's work. If the current thread wants to avoid this, then what hash value should be returned?
The idea was that the busy flag is only a hint that affects the order in which moves are searched, so if there is a conflict between the busy flag and the result already stored in a TT entry, the busy flag is ignored.

Something like this (untested):

Code: Select all

int abdada(Position pos, int alpha, int beta, int depth, bool exclusive) {
    // Check for abort
    // Repetition check

    TTEntry& e = tt.probe(pos);
    if (e.cutOff(alpha, beta, depth))
        return e.score;
    if (exclusive && e.busy)
        return BUSY;
    e.busy = true;

    // null move, forward pruning, etc

    MoveList moves = generateMoves(pos);
    bool searched[256];
    int nextDepth[256];
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < moves.size(); i++) {
            if (pass == 0)
                nextDepth[i] = depth - 1 + extension - reduction;
            else if (searched[i])
                continue;
            bool nextExclusive = pass == 0 && i > 0;
            pos.makeMove(moves[i]);
            int score = -abdada(pos, -beta, -alpha, nextDepth[i], nextExclusive);
            pos.unMakeMove(moves[i]);
            searched[i] = score != -BUSY;

            // Update alpha and bestScore, break if done
        }
    }

    tt.store(pos, bestScore); // Also sets entry.busy = false;
    return bestScore;
}
The idea is that maintaining sequential consistency across cluster nodes is too expensive, so instead the algorithm is designed to work even if e.busy is sometimes wrong. When it is wrong, several threads could search the same subtree simultaneously, possibly causing search overhead. The hope is that this still contributes to an elo increase, like in lazy SMP, but there is a risk that it does not work well when the "depth + 1" trick from lazy SMP is not used.

I don't know if this algorithm would be better than lazy SMP, or if it can be combined with the "depth + 1" idea from lazy SMP. It would have to be tested by playing games.
D Sceviour
Posts: 570
Joined: Mon Jul 20, 2015 5:06 pm

Re: Lazy SMP and "lazy cluster" experiments

Post by D Sceviour »

petero2 wrote:The idea is that maintaining sequential consistency across cluster nodes is too expensive, so instead the algorithm is designed to work even if e.busy is sometimes wrong. When it is wrong, several threads could search the same subtree simultaneously, possibly causing search overhead. The hope is that this still contributes to an elo increase, like in lazy SMP, but there is a risk that it does not work well when the "depth + 1" trick from lazy SMP is not used.
The phrase "even if e.busy is sometimes wrong" sounds like a recipe for disaster. The algorithm would still have to wait somewhere to make sure the "searched(i)" moves do not return a busy signal. The abdada search can only complete after all those busy nodes are finished, otherwise it may return an incorrect value.

Maybe try something strange like: if alpha > hash_value then return hash_value [untested]. This ensures the previous ply will fail high on the search. The final score for ply-1 would be on the safe side. A refutation of the "busy" node by another thread should verify the case condition when it is complete.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Lazy SMP and "lazy cluster" experiments

Post by Daniel Shawul »

I remember reading that you had trouble getting good performance because of overhead caused by waiting for responses from other cluster nodes.
Not sure about that. There is not a lot of waiting in YBW for responses from other nodes; messages are processed asynchronously via MPI_Iprobe(), and can even be processed as they arrive if the hardware allows it. IIRC the BG/Q, for instance, has a dedicated 17th core for that purpose. If you allow any thread to process MPI messages you get a lot more complexity.
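
For reference, a minimal sketch of such a polling loop (only the MPI calls are real; the dispatching is hypothetical):

Code: Select all

#include <mpi.h>
#include <vector>

// Asynchronous message handling with MPI_Iprobe: the search thread calls
// this between searched nodes, receiving only messages that have already
// arrived, so it never blocks waiting for another cluster node.
void pollMessages() {
    int flag;
    MPI_Status status;
    MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
    while (flag) {
        int count;
        MPI_Get_count(&status, MPI_BYTE, &count);
        std::vector<char> buf(count);
        MPI_Recv(buf.data(), count, MPI_BYTE, status.MPI_SOURCE,
                 status.MPI_TAG, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        // handleMessage(status.MPI_TAG, buf);  // hypothetical dispatcher
        MPI_Iprobe(MPI_ANY_SOURCE, MPI_ANY_TAG, MPI_COMM_WORLD, &flag, &status);
    }
}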

Of course the standard YBW has some waiting to do, for the last processor to finish and for the first move to be searched fully before parallelization. But those have barely any effect on performance, and the effect is the same for both smp and cluster machines. The problem with the cluster is that the split depth is often much larger than for smp due to the often slow interconnect. Other than that, the cluster YBW implementation gave me almost the same speedup as the smp implementation, with both running on an smp machine.
Lazy SMP is easy to implement in a totally asynchronous way, so I went with that. I am not completely happy that I lose around 12 elo on 16-core SMP compared to the old SMP algorithm though. For the time being lazy cluster easily beats non-lazy SMP, since the cluster makes it possible to use many more cores, so that is what I use in HGM's monthly tournaments.
What I don't like about lazy is that ABDADA has a work sharing mechanism inside the TT, while lazy fully relies on the threads searching different parts of the tree through ad hoc techniques like using different root depths, different root move ordering, or intentionally starting one thread slower than the others (just came up with this one now). Why not do all of this on top of ABDADA? It should be able to perform better. My lazy smp just shares the TT without using such tricks and served as a baseline for the performance of YBW and ABDADA.

Have you done multi-node tests for your cluster implementation? I am sure losing just 12 elo on an smp machine is something you should be happy about, but performance dips heavily when you start adding more nodes. On an SMP machine, using pure MPI mode should actually be equivalent to smp mode, because message passing could be done more efficiently by just copying data in shared memory.

I think that with the simple approach you have for the cluster, the distributed TT scheme has more impact than anything else. My opinion is that if YBW performed better than lazy on SMP, it should perform better on the cluster too. Asynchronous schemes often have way too much overhead, but given that we have way too much power on the cluster, who knows which algorithm works better for it.

Thanks for the tests,
Daniel
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Lazy SMP and "lazy cluster" experiments

Post by petero2 »

D Sceviour wrote:
petero2 wrote:The idea is that maintaining sequential consistency across cluster nodes is too expensive, so instead the algorithm is designed to work even if e.busy is sometimes wrong. When it is wrong, several threads could search the same subtree simultaneously, possibly causing search overhead. The hope is that this still contributes to an elo increase, like in lazy SMP, but there is a risk that it does not work well when the "depth + 1" trick from lazy SMP is not used.
The phrase "even if e.busy is sometimes wrong" sounds like a recipe for disaster. The algorithm would still have to wait somewhere to make sure the "searched(i)" moves do not return a busy signal. The abdada search can only complete after all those busy nodes are finished, otherwise it may return an incorrect value.
Actually no thread has to wait for any other thread. If you look at my pseudo code, you can see that all moves are searched exactly once (except in the case of a fail high, which stops the move loop immediately). The first pass searches move 0 and all moves where the busy flag indicates that nobody else is likely to be searching that move. The second pass searches all remaining moves (that is what the searched[] array is used for). If somebody else is already searching the same move in the second pass, that information is ignored and the thread searches the move itself anyway.

Therefore you will get "valid" scores returned by all threads, regardless of what other threads are doing. By "valid" I mean that the returned score is based on searching all possible moves, so it is for example not possible to "forget" to search the only move that recaptures a queen.

Whether or not this will actually cause stronger play than a single threaded search has to be tested. In the worst case the busy flag is wrong all the time and there is none of the "widening" effect that has been observed in lazy SMP.

I plan to test this algorithm and report how it performs.

Actually, reading the pseudo code again, I realize that it might not be clear that alpha and bestScore should only be updated if the move was actually searched, like this:

Code: Select all

int approximateAbdada(Position pos, int alpha, int beta, int depth, bool exclusive) {
    // Check for abort
    // Repetition check

    TTEntry& e = tt.probe(pos);
    if (e.cutOff(alpha, beta, depth))
        return e.score;
    if (exclusive && e.busy)
        return BUSY;
    e.busy = true;

    // null move, forward pruning, etc

    MoveList moves = generateMoves(pos);
    bool searched[256];
    int nextDepth[256];
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < moves.size(); i++) {
            if (pass == 0)
                nextDepth[i] = depth - 1 + extension - reduction;
            else if (searched[i])
                continue;
            bool nextExclusive = pass == 0 && i > 0;
            pos.makeMove(moves[i]);
            int score = -approximateAbdada(pos, -beta, -alpha, nextDepth[i], nextExclusive);
            pos.unMakeMove(moves[i]);
            searched[i] = score != -BUSY;
            if (!searched[i])               // <------------------ Missing in previous post
                continue;

            // Update alpha and bestScore, break if done
        }
    }

    tt.store(pos, bestScore); // Also sets entry.busy = false;
    return bestScore;
}
petero2
Posts: 688
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Lazy SMP and "lazy cluster" experiments

Post by petero2 »

Daniel Shawul wrote:
I remember reading that you had trouble getting good performance because of overhead caused by waiting for responses from other cluster nodes.
Not sure about that. There is not a lot of waiting in YBW for responses from other nodes; messages are processed asynchronously via MPI_Iprobe(), and can even be processed as they arrive if the hardware allows it. IIRC the BG/Q, for instance, has a dedicated 17th core for that purpose. If you allow any thread to process MPI messages you get a lot more complexity.

Of course the standard YBW has some waiting to do, for the last processor to finish and for the first move to be searched fully before parallelization. But those have barely any effect on performance, and the effect is the same for both smp and cluster machines. The problem with the cluster is that the split depth is often much larger than for smp due to the often slow interconnect. Other than that, the cluster YBW implementation gave me almost the same speedup as the smp implementation, with both running on an smp machine.
I think the post where I got that idea was this one, but I realize now that that post is very old and probably not very relevant to what you are doing in current versions of Scorpio.
Daniel Shawul wrote:
Lazy SMP is easy to implement in a totally asynchronous way, so I went with that. I am not completely happy that I lose around 12 elo on 16-core SMP compared to the old SMP algorithm though. For the time being lazy cluster easily beats non-lazy SMP, since the cluster makes it possible to use many more cores, so that is what I use in HGM's monthly tournaments.
What I don't like about lazy is that ABDADA has a work sharing mechanism inside the TT, while lazy fully relies on the threads searching different parts of the tree through ad hoc techniques like using different root depths, different root move ordering, or intentionally starting one thread slower than the others (just came up with this one now). Why not do all of this on top of ABDADA? It should be able to perform better. My lazy smp just shares the TT without using such tricks and served as a baseline for the performance of YBW and ABDADA.
I plan to do a hybrid ABDADA/lazy SMP implementation soon and see how it performs in terms of elo. Hopefully within a week.
Daniel Shawul wrote:Have you done multi-node tests for your cluster implementation? I am sure losing just 12 elo on an smp machine is something you should be happy about, but performance dips heavily when you start adding more nodes. On an SMP machine, using pure MPI mode should actually be equivalent to smp mode, because message passing could be done more efficiently by just copying data in shared memory.
Yes, I actually reported several such results in the ELO measurements post.

I repeat the most interesting results here:

Two single threaded MPI processes running on different computers compared to a single threaded search on one computer: +50 elo

16 single threaded MPI processes running on different computers compared to a single threaded search on one computer: +50+46+51+29 = +176 elo

4 MPI processes running on different computers, each process having 4 threads, compared to a non-MPI 4-thread search: +93 elo

2 MPI processes running on different computers, each process having 32 threads running on 2-socket NUMA hardware, compared to a non-MPI 32-thread search running on one computer with the same NUMA hardware: +39 elo
Daniel Shawul wrote:I think that with the simple approach you have for the cluster, the distributed TT scheme has more impact than anything else. My opinion is that if YBW performed better than lazy on SMP, it should perform better on the cluster too. Asynchronous schemes often have way too much overhead, but given that we have way too much power on the cluster, who knows which algorithm works better for it.
The problem is that my old algorithm was not YBW; it was a home-made algorithm that admittedly has many YBW- and DTS-like properties, but is still quite different. For one thing, it keeps all potential parallel tasks in a priority queue data structure, and that structure is only accessed under a lock.

I need to find a way to efficiently parallelize this data structure, and I don't want to implement locking primitives using message passing.

Also, the data structure is actually more complicated than a priority queue, because it also has operations to recompute probability estimates and propagate them along the parent/child relationships between split points.
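
For illustration, the kind of structure described above could look like this on an SMP machine (a minimal sketch with hypothetical names; the real structure also tracks the probability estimates mentioned above):

Code: Select all

#include <mutex>
#include <queue>
#include <vector>

// All potential parallel tasks (split points) live in one priority queue
// ordered by an estimated probability of being useful, and every access
// goes through a single lock. Straightforward with shared memory, but with
// no obvious message passing equivalent.
struct SplitPoint {
    double usefulProb;   // estimate, recomputed as the search evolves
    // position, move list, parent/child pointers, etc.
};

struct SplitPointCompare {
    bool operator()(const SplitPoint* a, const SplitPoint* b) const {
        return a->usefulProb < b->usefulProb;
    }
};

class WorkQueue {
    std::mutex mtx;
    std::priority_queue<SplitPoint*, std::vector<SplitPoint*>,
                        SplitPointCompare> pq;
public:
    void add(SplitPoint* sp) {
        std::lock_guard<std::mutex> lock(mtx);
        pq.push(sp);
    }
    SplitPoint* getBest() {   // highest estimated usefulness, or nullptr
        std::lock_guard<std::mutex> lock(mtx);
        if (pq.empty())
            return nullptr;
        SplitPoint* sp = pq.top();
        pq.pop();
        return sp;
    }
};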