Lazy SMP and "lazy cluster" experiments

Daniel Shawul · Post by **Daniel Shawul** » Sat Aug 19, 2017 4:32 pm

* One 256 byte "lrmVec[]" array to keep track of lmr reductions computed in the first ABDADA pass, to ensure that moves delayed to the second pass use the same lmr as they would have used if they had been searched in the first pass. If I just recompute lmr in the second pass, I may get a different result because the calculation depends on the number of legal moves already tested, not just on the move index in the sorted move list.

My legal_moves counter also counts skipped moves on the first psas so this does not cause me problems.

However, none of those arrays are really needed, because my Move class already has a 4 byte score field, so I can just encode that information as special score range within the move list. I am currently running tests with that modification (and some other tweaks). It looks promising but the tests are not finished yet.

Yes, that is what I do too, i.e. store in a SKIP_MOVE_SCORE in the score_st array on the first pass.

1. It might be the case that lazy SMP (and SHT) works better if you have a per-thread history table. The late moves in ALL nodes are mostly ordered by the history heuristic, and if the history table is thread local, there is a decent chance that it becomes different in different threads. This would have the effect that different threads often don't search the same moves at the same time, and would therefore be able to pick up results from other threads from the transposition table.

Unfortunately, I share the history and refutation tables between threads, so the move list is searched in the same order in different threads.

2. If you do heavy LMR/LMP type reductions, and the reduction a move receives depends on the move order, a side effect of a thread local history table is that different threads will use different reduction amounts for the same move. Since the moves searched early by each thread typically get a lower reduction, and those results are later picked up through the hash table by other threads, the effect is as if the average LMR/LMP reduction is lower than in a single threaded search. It is already known that LMR/LMP is a big loss for fixed depth games, but a big gain in fixed time games, so I expect this effect would cause some elo increase, although not as big an elo increase as when using an SMP algorithm that avoids duplicated work without decreasing LMR/LMP.

Ok, so the effect is re-searching a move with different reduction depths makes LMR less aggressive. This means one could try improve the re-search for LMR reduced moves on the sequential search too.

Daniel Shawul · Post by **Daniel Shawul** » Sat Aug 19, 2017 4:34 pm

Dann Corbit wrote:Lazy has been well known since 1972:
https://www.youtube.com/watch?v=uflsjU95lRM
Explanations are around the 5:40 mark.

ABDADA atleast tries to clutch a straw before drowning

petero2 · Post by **petero2** » Wed Aug 23, 2017 6:59 pm

petero2 wrote:Since "approximate ABDADA" is the best algorithm I have found so far, my next goal is to try to rewrite it in a way that does not cause a slowdown and corresponding elo loss for the 1 core case.

I have now optimized the implementation and ran some more tests. All games were played at time control 24+0.24. The transposition table size was 128MB unless otherwise noted. The single threaded performance is now roughly as good as lazy SMP:

Code: Select all

                    dElo  sDev   win  loss  draw  depth1 depth2
ABDADA vs lazy 1c   -1.7   2.1  1974  2024  6002   16.54  16.57

In all multithreaded tests I have run, ABDADA is better than lazy SMP, often around 10 elo better:

Code: Select all

                               dElo  sDev   win  loss  draw  depth1 depth2
A vs L, 2 cores                +8.3   2.9  1013   894  3093   17.57  17.28
A vs L, 4 cores                +9.8   3.2   779   666  2555   18.63  18.22
A vs L, 2 nodes, 24+16 cores  +14.3   3.9   304   222  1474   21.04  20.60
A vs L, 2 cores, 1MB hash      +9.2   2.9   975   842  3183   16.79  16.48
A vs L, 8 cores, 1MB hash     +12.3   4.4   363   292  1345   17.83  17.35

petero2 · Post by **petero2** » Wed Aug 23, 2017 7:06 pm

Since my experiments to improve on lazy SMP in texel by using ideas from the ABDADA algorithm were quite successful, I decided to write a summary of the algorithm I ended up with.

Differences compared to the original ABDADA algorithm

There are three categories of differences:

* The ABDADA exclusive condition is only approximately enforced, in order to reduce overhead, especially for clusters.
* Ideas from "lazy SMP" type algorithms that were successful in texel.
* Ideas from the texel "lazy cluster" algorithm.

Busy bit

* In transposition table entries, the "nproc" counter has been replaced by a single "busy" bit.
* There is no sequential consistency when updating the transposition table. The busy bit can sometimes be wrong.
* Handling of the busy bit and exclusive searches are only performed if remaining depth >= 7.
* The busy bit is independent of the rest of the information in the transposition table. In the original ABDADA algorithm a transposition table probe cannot return a score for a node currently being searched.
* If a transposition table entry does not already exist for a position, the entry is not created just to be able to store the busy bit.

From lazy SMP

* If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.
* After a helper thread finishes its search, it will immediately restart its current search at a larger depth (1 ply deeper).
* The transposition table is shared between all threads running on an SMP or NUMA computer.

From lazy cluster

* Each node has its own local transposition table. All probe and insert operations act only on the local transposition table.
* A task that runs independently of the tree search algorithm sends interesting transposition table entries to other cluster nodes periodically.

Pseudo code

The part of the algorithm that deals with the recursive search function works as follows:

Code: Select all

int approximateAbdada&#40;Position pos, int alpha, int beta, int depth, bool exclusive&#41; &#123;
    // Check for abort
    // Repetition check

    TTEntry& e = tt.probe&#40;pos&#41;;
    if &#40;e.cutOff&#40;alpha, beta, depth&#41;)
        return e.score;
    if &#40;depth >= 7&#41; &#123;
        if &#40;exclusive && e.busy&#41;
            return BUSY;
        e.busy = true;
    &#125;

    // null move, forward pruning, etc

    MoveList moves = generateMoves&#40;pos&#41;;
    bool searched&#91;256&#93;;
    int nextDepth&#91;256&#93;;
    for &#40;int pass = 0; pass < 2; pass++) &#123;
        for &#40;int i = 0; i < moves.size&#40;); i++) &#123;
            if &#40;pass == 0&#41;
                nextDepth&#91;i&#93; = depth - 1 + extension - reduction;
            else if &#40;searched&#91;i&#93;)
                continue;
            bool nextExclusive = pass == 0 && i > 0;
            pos.makeMove&#40;moves&#91;i&#93;);
            int score = -approximateAbdada&#40;pos, -beta, -alpha, nextDepth&#91;i&#93;, nextExclusive&#41;;
            pos.unMakeMove&#40;moves&#91;i&#93;);
            searched&#91;i&#93; = score != -BUSY;
            if (!searched&#91;i&#93;)
                continue;

            // Update alpha and bestScore, break if done
        &#125;
    &#125;

    tt.store&#40;pos, bestScore&#41;; // Also sets entry.busy = false;
    return bestScore;
&#125;

For the full SMP/cluster algorithm, the approximateAbdada algorithm has to be combined with the original lazy SMP/cluster algorithm, basically by replacing the negaScout function with the approximateAbdada function.

Advantages

* The algorithm is fully asynchronous. No part of the algorithm requires waiting for results from other threads or other cluster nodes.
* Low latency cluster communication is not essential. Gigabit ethernet is enough.

Dann Corbit · Post by **Dann Corbit** » Wed Aug 23, 2017 8:34 pm

I guess that ABDADA gets deeper in the same time (on average).
Is that correct?

petero2 · Post by **petero2** » Wed Aug 23, 2017 8:39 pm

Dann Corbit wrote:I guess that ABDADA gets deeper in the same time (on average).
Is that correct?

Yes that is correct. The depth1 column is the average search depth in plies for the ABDADA version and the depth2 column is the corresponding value for the lazy SMP version. So ABDADA get about 0.3 - 0.5 ply deeper in the same time.

Michel · Post by **Michel** » Wed Aug 23, 2017 8:51 pm

Thanks for the explanations! It is very interesting!

Dann Corbit · Post by **Dann Corbit** » Wed Aug 23, 2017 8:53 pm

petero2 wrote:
Dann Corbit wrote:I guess that ABDADA gets deeper in the same time (on average).
Is that correct?
Yes that is correct. The depth1 column is the average search depth in plies for the ABDADA version and the depth2 column is the corresponding value for the lazy SMP version. So ABDADA get about 0.3 - 0.5 ply deeper in the same time.

Considering the new turns in scaling compute power (consider AMD Epyc, which has 128 threads and a cluster of connected computing cores [4 units per "chip"[*] using infinity fabric, and typically 2 "chips" per motherboard for 7601]) this sort of thing is going to be incredibly important in the advancement of computer chess.

See:
http://www.ipmanchess.yolasite.com/amd- ... -bench.php

[*] The "chip" has a rectangle size about the size of a bar of soap. There are 4 SMP units in the "chip" and the 7601 seems designed to have a couple of these on a motherboard.

D Sceviour · Post by **D Sceviour** » Thu Aug 24, 2017 12:56 am

petero2 wrote:If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.

Assuming both threads are allowed to finish, what do you do if one thread returns a different value than the other thread?

Dann Corbit · Post by **Dann Corbit** » Thu Aug 24, 2017 2:19 am

D Sceviour wrote:
petero2 wrote:If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.
Assuming both threads are allowed to finish, what do you do if one thread returns a different value than the other thread?

My guess:
A. If one is deeper, use the deepest one.
B. If depth is the same, use the one with the most nodes.

Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Approximate ABDADA

Re: Lazy SMP and "lazy cluster" experiments

Re: Lazy SMP and "lazy cluster" experiments

Re: Approximate ABDADA

Re: Lazy SMP and "lazy cluster" experiments

Re: Approximate ABDADA

Re: Approximate ABDADA