Lazy SMP and "lazy cluster" experiments

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Approximate ABDADA

Post by petero2 » Wed Aug 23, 2017 5:06 pm

Since my experiments to improve on lazy SMP in texel by using ideas from the ABDADA algorithm were quite successful, I decided to write a summary of the algorithm I ended up with.

Differences compared to the original ABDADA algorithm

There are three categories of differences:

* The ABDADA exclusive condition is only approximately enforced, in order to reduce overhead, especially for clusters.
* Ideas from "lazy SMP" type algorithms that were successful in texel.
* Ideas from the texel "lazy cluster" algorithm.

Busy bit

* In transposition table entries, the "nproc" counter has been replaced by a single "busy" bit.
* There is no sequential consistency when updating the transposition table. The busy bit can sometimes be wrong.
* Handling of the busy bit and exclusive searches is only performed when the remaining depth is >= 7.
* The busy bit is independent of the rest of the information in the transposition table. In the original ABDADA algorithm a transposition table probe cannot return a score for a node currently being searched.
* If a transposition table entry does not already exist for a position, the entry is not created just to be able to store the busy bit.
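As a rough illustration of the busy-bit idea, a transposition table entry could carry a relaxed atomic flag alongside the usual data. This is only a sketch with made-up names, not texel's actual entry layout; the point is that the flag needs no sequential consistency and is allowed to be stale.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>

// Hypothetical TT entry with an approximate "busy" flag.
// The flag is independent of the rest of the entry, so a probe can
// still return a score for a node that is currently being searched.
struct TTEntry {
    uint64_t key = 0;
    int16_t score = 0;
    int8_t depth = -1;
    std::atomic<bool> busy{false}; // may be stale; that is acceptable

    // Relaxed ordering is enough: a wrong busy bit only costs a little
    // duplicated or delayed work, never correctness.
    void setBusy(bool b) { busy.store(b, std::memory_order_relaxed); }
    bool isBusy() const { return busy.load(std::memory_order_relaxed); }
};
```
Because the bit is merely a hint, races between threads setting and clearing it are harmless, which is what makes the "approximate" enforcement cheap.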

From lazy SMP

* If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.
* After a helper thread finishes its search, it will immediately restart its current search at a larger depth (1 ply deeper).
* The transposition table is shared between all threads running on an SMP or NUMA computer.
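The helper-thread behavior in the first two points can be sketched as a simple loop: finish a depth, publish the result, and immediately go one ply deeper without waiting for anyone. The names and the search stub below are illustrative only.

```cpp
#include <atomic>
#include <cassert>

// Shared state through which a helper publishes results (illustrative).
struct HelperState {
    std::atomic<bool> stop{false};
    std::atomic<int> bestDepth{0};
    std::atomic<int> bestScore{0};
};

// Helper loop: after finishing a search at depth d, restart at d+1
// immediately. The master polls bestDepth/bestScore and may use a
// helper's result instead of waiting for its own search to finish.
template <typename SearchFn>
void helperLoop(HelperState& st, int startDepth, int maxDepth, SearchFn search) {
    for (int d = startDepth; d <= maxDepth && !st.stop.load(); d++) {
        int score = search(d);     // full search at depth d
        st.bestScore.store(score); // publish the result
        st.bestDepth.store(d);     // no waiting: loop straight to d+1
    }
}
```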

From lazy cluster

* Each node has its own local transposition table. All probe and insert operations act only on the local transposition table.
* A task that runs independently of the tree search algorithm sends interesting transposition table entries to other cluster nodes periodically.
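The distribution task can be pictured as a periodic filter over the local table: only entries worth the network bandwidth (for example, those with a high draft) are sent to other cluster nodes. This is a hypothetical sketch of the selection step, not texel's actual code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Simplified local TT entry for illustration.
struct Entry { uint64_t key; int16_t score; int8_t depth; };

// Pick the entries worth broadcasting to neighbor nodes: shallow
// entries are cheap to recompute locally, so only deep ones are sent.
std::vector<Entry> selectInteresting(const std::vector<Entry>& localTT, int minDepth) {
    std::vector<Entry> out;
    for (const Entry& e : localTT)
        if (e.depth >= minDepth)
            out.push_back(e);
    return out;
}
```
A background task would call something like this on a timer and hand the result to the cluster transport; since all searches probe only their local table, the transport can be slow without stalling the search.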

Pseudo code

The part of the algorithm that deals with the recursive search function works as follows:


int approximateAbdada(Position pos, int alpha, int beta, int depth, bool exclusive) {
    // Check for abort
    // Repetition check

    TTEntry& e = tt.probe(pos);
    if (e.cutOff(alpha, beta, depth))
        return e.score;
    if (depth >= 7) {
        if (exclusive && e.busy)
            return BUSY;
        e.busy = true;
    }

    // null move, forward pruning, etc

    MoveList moves = generateMoves(pos);
    bool searched[256];
    int nextDepth[256];
    for (int pass = 0; pass < 2; pass++) {
        for (int i = 0; i < moves.size(); i++) {
            if (pass == 0)
                nextDepth[i] = depth - 1 + extension - reduction;
            else if (searched[i])
                continue;
            bool nextExclusive = pass == 0 && i > 0;
            pos.makeMove(moves[i]);
            int score = -approximateAbdada(pos, -beta, -alpha, nextDepth[i], nextExclusive);
            pos.unMakeMove(moves[i]);
            searched[i] = score != -BUSY;
            if (!searched[i])
                continue;

            // Update alpha and bestScore, break if done
        }
    }

    tt.store(pos, bestScore); // Also sets entry.busy = false
    return bestScore;
}
For the full SMP/cluster algorithm, the approximateAbdada algorithm has to be combined with the original lazy SMP/cluster algorithm, basically by replacing the negaScout function with the approximateAbdada function.
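As a sketch of that replacement: the iterative-deepening driver stays the same, it simply calls the approximateAbdada function where it previously called negaScout, always with exclusive = false at the root. The callback-based form below is purely illustrative; a stub search stands in for the real one.

```cpp
#include <cassert>
#include <functional>

// Illustrative root driver. In the real engine the callback would be
// the approximateAbdada function shown above; here it is abstracted so
// the structure of the loop is visible on its own.
int iterativeDeepening(const std::function<int(int depth, bool exclusive)>& search,
                       int maxDepth) {
    int score = 0;
    for (int depth = 1; depth <= maxDepth; depth++)
        score = search(depth, /*exclusive=*/false); // root is never exclusive
    return score;
}
```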

Advantages

* The algorithm is fully asynchronous. No part of the algorithm requires waiting for results from other threads or other cluster nodes.
* Low latency cluster communication is not essential. Gigabit ethernet is enough.

Dann Corbit
Posts: 10124
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: Lazy SMP and "lazy cluster" experiments

Post by Dann Corbit » Wed Aug 23, 2017 6:34 pm

I guess that ABDADA gets deeper in the same time (on average).
Is that correct?
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: Lazy SMP and "lazy cluster" experiments

Post by petero2 » Wed Aug 23, 2017 6:39 pm

Dann Corbit wrote:I guess that ABDADA gets deeper in the same time (on average).
Is that correct?
Yes, that is correct. The depth1 column is the average search depth in plies for the ABDADA version and the depth2 column is the corresponding value for the lazy SMP version. So ABDADA gets about 0.3 - 0.5 plies deeper in the same time.

Michel
Posts: 2050
Joined: Sun Sep 28, 2008 11:50 pm

Re: Approximate ABDADA

Post by Michel » Wed Aug 23, 2017 6:51 pm

Thanks for the explanations! It is very interesting!
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.

Dann Corbit
Posts: 10124
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: Lazy SMP and "lazy cluster" experiments

Post by Dann Corbit » Wed Aug 23, 2017 6:53 pm

petero2 wrote:
Dann Corbit wrote:I guess that ABDADA gets deeper in the same time (on average).
Is that correct?
Yes that is correct. The depth1 column is the average search depth in plies for the ABDADA version and the depth2 column is the corresponding value for the lazy SMP version. So ABDADA get about 0.3 - 0.5 ply deeper in the same time.
Considering the new directions in scaling compute power (consider AMD Epyc, which has 128 threads and a cluster of connected computing cores [4 units per "chip"[*], connected by Infinity Fabric, with typically 2 "chips" per motherboard for the 7601]), this sort of thing is going to be incredibly important for the advancement of computer chess.

See:
http://www.ipmanchess.yolasite.com/amd- ... -bench.php

[*] The "chip" is a rectangle about the size of a bar of soap. There are 4 SMP units in the "chip", and the 7601 seems designed to have a couple of these on a motherboard.

D Sceviour
Posts: 458
Joined: Mon Jul 20, 2015 3:06 pm
Contact:

Re: Approximate ABDADA

Post by D Sceviour » Wed Aug 23, 2017 10:56 pm

petero2 wrote:If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.
Assuming both threads are allowed to finish, what do you do if one thread returns a different value than the other thread?

Dann Corbit
Posts: 10124
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: Approximate ABDADA

Post by Dann Corbit » Thu Aug 24, 2017 12:19 am

D Sceviour wrote:
petero2 wrote:If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.
Assuming both threads are allowed to finish, what do you do if one thread returns a different value than the other thread?
My guess:
A. If one is deeper, use the deepest one.
B. If depth is the same, use the one with the most nodes.

D Sceviour
Posts: 458
Joined: Mon Jul 20, 2015 3:06 pm
Contact:

Re: Approximate ABDADA

Post by D Sceviour » Thu Aug 24, 2017 1:37 am

Dann Corbit wrote:
D Sceviour wrote:
petero2 wrote:If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.
Assuming both threads are allowed to finish, what do you do if one thread returns a different value than the other thread?
My guess:
A. If one is deeper, use the deepest one.
B. If depth is the same, use the one with the most nodes.
Hello Dann,
A. Makes sense.
B. Not necessarily. Since both threads share the same hash table, perhaps the one with the fewest nodes has the best hash values available - meaning it has the most recent, up-to-date hash entries and cutoffs. Peter's intention might be that the thread that returns first governs the score and PV, regardless of any subsequent thread's value. However, it would be nice to have a good list of the possible errors at the return point.

petero2
Posts: 587
Joined: Mon Apr 19, 2010 5:07 pm
Location: Sweden
Contact:

Re: Approximate ABDADA

Post by petero2 » Thu Aug 24, 2017 4:40 am

D Sceviour wrote:
petero2 wrote:If a helper thread finishes its search before the master thread, the result from the helper thread is used instead of waiting for the master to finish its search.
Assuming both threads are allowed to finish, what do you do if one thread returns a different value than the other thread?
Several threads could finish their searches, but the master thread will always use the first value it sees, without checking if more than one result is available.

During the master thread search, it will periodically check for messages from other threads. If one such message contains a result from another thread, the master thread will immediately abort its own search and use the result from the other thread instead.

If the master thread finishes its own search, it will use its own value without checking whether any received but unprocessed messages also contain results from helper threads.
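The polling behavior described above can be pictured as a small mailbox: the master periodically polls it during its own search, and if a helper result has arrived it aborts and uses that result. The names here are hypothetical, standing in for texel's real message queue.

```cpp
#include <atomic>
#include <cassert>
#include <optional>

// Illustrative single-result mailbox. A helper posts its result; the
// master polls during its search and, on a hit, aborts its own search
// and adopts the posted score.
struct ResultMailbox {
    std::atomic<bool> hasResult{false};
    std::atomic<int> score{0};

    void post(int s) {
        score.store(s);
        hasResult.store(true); // publish after the score is written
    }

    // Non-blocking check: the master never waits, it only polls.
    std::optional<int> poll() const {
        if (!hasResult.load())
            return std::nullopt;
        return score.load();
    }
};
```
Because the check is non-blocking, the whole scheme stays asynchronous: a master that never sees a posted result simply finishes and uses its own value, exactly as described above.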

There is a communication optimization related to this behavior. As I described before, all search threads are arranged in a tree, and a thread is only allowed to talk to its neighbors in the tree. When a search thread receives a message from some other thread, it may decide to not forward it to other threads if it can determine that the other thread does not need the message. For example:

* If a thread receives a search result from one of its child threads, it will normally forward it to its parent thread. However, if it has already sent a search result to its parent thread, it will simply ignore any further search results it receives. (See WorkerThread::sendReportResult() in parallel.cpp in the source code.)

* If a thread receives a "start search position x" command from its parent thread, it will normally forward this message to its child threads. However, if it already has an unprocessed "start search" command in its queue, it will first discard the old "start search" command, since that has been obsoleted by the new one. (See ThreadCommunicator::doSendStartSearch() in parallel.cpp in the source code.) This is mostly useful in the first few iterative deepening iterations, where the master thread probably produces "start search" commands faster than they can be sent to helper threads.
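The second rule above can be sketched as a queue filter: before a new "start search" command is enqueued, any unsent older one is discarded because the new command obsoletes it. The types and names are hypothetical, not texel's actual parallel.cpp code.

```cpp
#include <cassert>
#include <deque>
#include <string>

// Illustrative outgoing-message command.
struct Command { std::string type; int searchId; };

// Enqueue a new start-search command, first dropping any pending one:
// the child only ever needs the most recent search position.
void enqueueStartSearch(std::deque<Command>& outQueue, int searchId) {
    for (auto it = outQueue.begin(); it != outQueue.end();)
        it = (it->type == "startSearch") ? outQueue.erase(it) : it + 1;
    outQueue.push_back({"startSearch", searchId});
}
```
The same shape works for the first rule: once a result report has been sent upward, later reports for the same search are simply dropped instead of enqueued.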

Texel 1.07a29 contains the ABDADA implementation and is available for download here: texel 1.07a29.

Dann Corbit
Posts: 10124
Joined: Wed Mar 08, 2006 7:57 pm
Location: Redmond, WA USA
Contact:

Re: Approximate ABDADA

Post by Dann Corbit » Thu Aug 24, 2017 4:37 pm

It seems to scale very well on my 64-core machine.
