Announcing lczero

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Announcing lczero

Post by Daniel Shawul »

Michel wrote:
Daniel Shawul wrote:
Usually the k-armed-bandit problem is formulated with arms that have a benign distribution, like Gaussians with fixed standard deviation, only differing in expectation value. Unfortunately, random sampling of a Chess tree is not like that at all. It is more like dealing with bandits where most of the contribution to the average comes from a few outcomes with very small probability but huge payout. You really have no idea of the average until you have enough visits to sample the unlikely events representatively. Likewise, in Chess you will only get a meaningful value when you have enough visits to discover the tactics that define the main line, and start exploiting that.
That is why I do not use averaging as a backup operator for MCTS in chess -- I return the minimax value of the best bandit instead. This gives a more stable search and I think is better tactically as well. A typical example is a position where you are threatening a mate-in-1 but the opponent has one or two escape moves that keep the score at 0. With averaging as a backup operator, the engine would hugely prefer these positions, as it can get mate scores from all the other non-escape moves. Since mate scores are very high, even one of them ruins the actual score. My engine with this backup operator exhibits a wildly varying score from move to move. However, with minimax as a backup operator, the engine keeps its score at 0, and the score remains stable like in a standard alpha-beta search. Thus, the averaging backup operator is preferred as the better planning method for games dominated by strategy, and the minimax operator for tactical games like chess. Note that the tactical gain I get from using the minimax operator, though significant, is minuscule compared to that from expanding the tree rapidly every visit.
I have no idea how they removed the tactical blindness from A0 -- I am currently stuck at TSCP + 100 elo with the MCTS + qsearch approach.

You are obviously much more of an expert than I am. But I guess the idea is that an escapable mate threat should not be scored as zero (the minimax score), since it still has significant value as a threat (restricting the opponent), and moreover there could be nearby lines where the near mate is actually a mate.
Exactly, that is why averaging is the better planning method in most cases. At first, I didn't realize why moves like Bxh7, and anything that gets a major piece closer to the opponent's king, were getting such a high score. Once the opponent makes the somewhat obvious escape move after our optimistic Bxh7 with a +10 pawns score, the score goes back down to 0, because the escape actually happened on the board.
BTW. I assume before applying the averaging operator you first convert scores to a winning probability? It seems obvious but I just wanted to make sure.
I had a conversion to/from a logistic score at first, but it seemed to me that it was an unnecessary intermediate step, so now I am using raw scores instead. The stability issue with the minimax operator was still there before I switched to backing up raw scores. Though it was more convenient to do so, maybe I shouldn't have removed it, for the same reasons an Elo scale is needed. The minimax backup returns the score of a single leaf node, so it seems the conversion may not be needed, at least in this case.
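
To make the comparison concrete, here is a minimal C++ sketch of the two backup operators being discussed. The Node layout and all names are hypothetical, not taken from any particular engine; the point is only the difference between returning a node's mean over all simulations and returning the negated value of its best child.

Code: Select all

#include <algorithm>
#include <limits>
#include <vector>

// Hypothetical node layout; field names are illustrative only.
struct Node {
    double total_score = 0.0;          // sum of simulation results, from the
    int    visits      = 0;            // side-to-move's point of view
    std::vector<Node*> children;

    double mean() const { return visits ? total_score / visits : 0.0; }
};

// Averaging backup: a node's value is the mean of every simulation that
// passed through it, so a single +10 "mate threat" leaf inflates the
// average even when the opponent has an escape move.
double backup_average(const Node* n) {
    return n->mean();
}

// Minimax-style backup: a node's value is the negated value of its best
// child, as in alpha-beta, so one escape move keeps the score at 0.
double backup_minimax(const Node* n) {
    if (n->children.empty()) return n->mean();
    double best = -std::numeric_limits<double>::infinity();
    for (const Node* c : n->children)
        best = std::max(best, -c->mean());
    return best;
}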

Daniel
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Announcing lczero

Post by hgm »

Daniel Shawul wrote:That is why I do not use averaging as a backup operator for MCTS in chess -- I return the minimax value of the best bandit instead.
Indeed, that seems an improvement, as long as the deteriorated stats of the best move do not blow up the confidence interval so much that you would be better off averaging.

But it doesn't seem to address the main problem, which is that with a completely neutral score your tree would converge to minimax, rather than alpha-beta. IMO that is the main reason why it has so much difficulty with deep tactics.
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Announcing lczero

Post by hgm »

OK, I have given this some more thought. The basic problem is that in Chess, width is no substitute for depth. It frequently happens in the tree that most of your 30 moves look good at 4 ply, and every single one of them turns sour at 8 ply. I guess this is what they mean by 'search trap'. So it is much better to have an 8-ply refutation tree, only searching a single move for the player that wants to refute the variation, than to have a 4-ply minimax tree (which has approximately the same number of nodes). Having alternatives in the same position is not a very good safeguard, as they are in general not independent enough: the deep tactics that refute one tend to refute them all. It might be many eggs, but they are still in the same basket!

In PV nodes and all-nodes the normal MCTS-UCT selection seems good: you are really looking for the best move in PV nodes, and in all-nodes you want to find the move with the best chances to beat alpha, which is most likely also the one with the best chances to be the best move. In cut-nodes, however, you just want to find a move that is good enough, and then really make sure it is good enough, rather than trying to find other moves that are nearly as good, or perhaps even better.

When you select a node to expand, you know the old root score, and as long as you follow the move with the best value, you stay on the PV. When you follow a sub-optimal move, however (to explore it, because the low visit count made its upper confidence bound high), you get into a cut-node, and from then on all-nodes and cut-nodes alternate. In all-nodes there is little harm in averaging scores, and adding the visit counts to propagate these towards the root. But in cut-nodes we should propagate the score and count of the best move (as Daniel already does), as the other moves are of no concern. This way, the visit count returned to the PV node from where you branched will be the size of the refutation tree, and thus a good measure for the depth and the reliability of the result.

Initially you can use normal MCTS-UCT selection in a cut-node, keeping the upper confidence bounds of all moves equal. But as soon as one move gets its lower confidence bound above the root score, you can be confident that it is a cut move. So you will then select that move, and ignore all others. If it isn't a real cut move, but was just lucky initially (or runs into deep tactical trouble later), the increased sampling will drive its score lower, and probably the lower confidence bound as well (as the confidence interval shrinks only very slowly), and if that bound then gets pressed below the root score again, the move loses its special cut-move status, and we revert to normal MCTS-UCT selection.

This way the tree should converge to an alpha-beta tree, rather than a minimax tree, if every deviation from the PV gives up the same score, no matter what moves follow.
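
A minimal sketch of the cut-node selection rule described above, assuming each move keeps a mean score and a visit count. The bound formula, the exploration constant, and all names are illustrative; a real engine would plug in whatever UCT variant it already uses and take the root score from its current PV value.

Code: Select all

#include <algorithm>
#include <cmath>
#include <vector>

// Illustrative per-move statistics; not from any particular engine.
struct MoveStats {
    double mean   = 0.0;   // average result of simulations through this move
    int    visits = 0;
};

// UCB1-style confidence bound; sign = +1 gives the upper bound, -1 the lower.
static double bound(const MoveStats& m, int parentVisits, double sign) {
    double radius = std::sqrt(2.0 * std::log((double)std::max(1, parentVisits))
                              / std::max(1, m.visits));
    return m.mean + sign * radius;
}

// Selection at a cut-node: once some move's lower confidence bound exceeds
// the root score, treat it as the cut move and keep sampling only that move;
// if no move qualifies (or a former cut move has dropped back below the root
// score), fall back to ordinary UCT selection on the upper bound.
int select_in_cut_node(const std::vector<MoveStats>& moves,
                       int parentVisits, double rootScore) {
    int bestUpper = 0;
    for (int i = 0; i < (int)moves.size(); ++i) {
        if (bound(moves[i], parentVisits, -1.0) > rootScore)
            return i;                              // confident cut move
        if (bound(moves[i], parentVisits, +1.0) >
            bound(moves[bestUpper], parentVisits, +1.0))
            bestUpper = i;
    }
    return bestUpper;                              // normal UCT choice
}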
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Announcing lczero

Post by Daniel Shawul »

hgm wrote:
Daniel Shawul wrote:That is why I do not use averaging as a backup operator for MCTS in chess -- I return the minimax value of the best bandit instead.
Indeed, that seems an improvement, as long as the deteriorated stats of the best move do not blow up the confidence interval so much that you would be better off averaging.
Yes, the best move should have the most visits anyway, so while it doesn't have the same stats as the parent, it should have the best stats among the siblings. In fact, the original MCTS paper from Remi had a mixed minimax+averaging backup operator that is more suited to chess IMO. The backup operator gradually turns from averaging into minimax with more visits. MCTS-UCT that does averaging is apparently what many use in Go, but that is probably because the game is dominated by strategy except in late endgames. That is also why the pure policy network is tremendously strong there.
I think it is better to dump UCT for chess and use the original MCTS https://hal.inria.fr/inria-00116992/document.
But it doesn't seem to address the main problem, which is that with a completely neutral score your tree would converge to minimax, rather than alpha-beta. IMO that is the main reason why it has so much difficulty with deep tactics.
Agreed, MCTS explores the minimax tree if I use a uniform selection method (instead of UCB). So it seems we have to strike a balance between a highly selective MCTS and the need to avoid the "search traps" that are so prevalent in chess.
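
A rough sketch of that kind of mixed backup, interpolating from averaging toward minimax as the visit count grows. The blending schedule and all names below are invented for illustration; this is not the exact formula from Coulom's paper.

Code: Select all

#include <algorithm>
#include <vector>

// Hypothetical per-child summary; names are illustrative only.
struct ChildStats {
    double mean   = 0.0;   // average score from the child's point of view
    int    visits = 0;
};

// Mixed backup: with few visits trust the visit-weighted average over all
// children, with many visits trust the best child (minimax).
double backup_mixed(const std::vector<ChildStats>& children, int parentVisits) {
    if (children.empty()) return 0.0;
    double sumScore = 0.0, best = -1e9;
    int sumVisits = 0;
    for (const ChildStats& c : children) {
        sumScore  += -c.mean * c.visits;   // negate: child values are from
        sumVisits +=  c.visits;            // the opponent's point of view
        best = std::max(best, -c.mean);
    }
    double average = sumVisits ? sumScore / sumVisits : 0.0;
    double w = std::min(1.0, parentVisits / 10000.0);   // ramp toward minimax
    return (1.0 - w) * average + w * best;
}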
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Announcing lczero

Post by Daniel Shawul »

Good idea, I will try to use the node classification scheme to choose between averaging and minimax. In the case where the currently best child (i.e. the one with the highest score) is not selected for exploration, I will pass down a CUT_NODE flag, and alternate between ALL and CUT nodes going down. This should be better than the pure minimax or averaging schemes.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Announcing lczero

Post by Daniel Shawul »

Just tested the mixed approach:

Code: Select all

        /*select move and simulate subtree*/
        Node* next = Node::Max_UCB_select(n);

        /*classify the child node: children of ALL nodes are searched as PV,
          children of CUT nodes as ALL; at a PV node the child stays PV only
          if the explored move is also the current best move, otherwise it
          becomes a CUT node*/
        int next_node_t;
        if(node_t == ALL_NODE)
            next_node_t = PV_NODE;
        else if(node_t == CUT_NODE)
            next_node_t = ALL_NODE;
        else {
            Node* best = Node::Max_score_select(n);
            if(best != next)
                next_node_t = CUT_NODE;
            else
                next_node_t = PV_NODE;
        }
        PUSH_MOVE(next->move);
        play_simulation(next,result,visits,next_node_t);
        POP_MOVE();

        /*Average/Minmax/Mixed style backups: when averaging, or in the mixed
          scheme outside CUT nodes, back up the mean score of the best child;
          otherwise negate the simulated result and pass it up the tree*/
        if(backup_type == AVERAGE ||
          (backup_type == NODETYPE && node_t != CUT_NODE)
          ) {
            Node* best = Node::Max_score_select(n);
            result = best->uct_wins / best->uct_visits;
        } else {
            result = -result;
        }
It seems to work as well as minimax, but I have to play many games to determine which is better. One problem I had with the averaging method was finishing off won games by making the move with the shortest mate. The minimax-style backup solved that, and the mixed backup method doesn't seem to suffer from this problem either. That is all I can say about the methods at this point.
hgm
Posts: 27789
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Announcing lczero

Post by hgm »

Well, it should converge to alpha-beta, not minimax. Minimax is no good.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Announcing lczero

Post by Daniel Shawul »

It turns out there is a rollout-based implementation of alpha-beta! In fact, any fixed-depth best-first search method can be formulated as a rollout-based algorithm using something similar to the MT driver from MTD(f). https://www.aaai.org/ocs/index.php/AAAI ... /9828/9354

I had difficulty understanding the paper before I realized that the alpha-beta rollout formulation is for a fixed-depth tree. MCTS dynamically grows the tree, so it is hard to formulate bounds to be used for pruning, but for a fixed depth (in an iterative-deepening framework) it is possible to formulate a rollout version that does alpha-beta pruning.

Other people have extended Huang's idea to get a hybrid MCTS/alpha-beta algorithm where you choose MCTS over alpha-beta rollouts based on a probability p.
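
The dispatch in such a hybrid is simple; as a sketch, with both rollout routines supplied by the caller and p left as a tuning parameter (the function names here are illustrative, not from any of the papers):

Code: Select all

#include <functional>
#include <random>

struct Node;   // engine-specific search node, declaration only

// Each iteration runs an MCTS rollout with probability p and a rollout-based
// alpha-beta iteration otherwise.  Both routines are placeholders here.
void hybrid_search(Node* root, int iterations, double p,
                   const std::function<void(Node*)>& mctsRollout,
                   const std::function<void(Node*)>& alphaBetaRollout,
                   unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::bernoulli_distribution pickMCTS(p);
    for (int i = 0; i < iterations; ++i) {
        if (pickMCTS(rng))
            mctsRollout(root);
        else
            alphaBetaRollout(root);
    }
}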

The question is whether A0 actually used these hybrids or plain MCTS.

Daniel
pilgrimdan
Posts: 405
Joined: Sat Jul 02, 2011 10:49 pm

Re: Announcing lczero

Post by pilgrimdan »

Daniel Shawul wrote:It turns out there is a rollout-based implementation of alpha-beta! In fact, any fixed-depth best-first search method can be formulated as a rollout-based algorithm using something similar to the MT driver from MTD(f). https://www.aaai.org/ocs/index.php/AAAI ... /9828/9354

I had difficulty understanding the paper before I realized that the alpha-beta rollout formulation is for a fixed-depth tree. MCTS dynamically grows the tree, so it is hard to formulate bounds to be used for pruning, but for a fixed depth (in an iterative-deepening framework) it is possible to formulate a rollout version that does alpha-beta pruning.

Other people have extended Huang's idea to get a hybrid MCTS/alpha-beta algorithm where you choose MCTS over alpha-beta rollouts based on a probability p.

The question is whether A0 actually used these hybrids or plain MCTS.

Daniel
Daniel

I don't know if this helps...

but I found this in the following paper...

16.6.2 AlphaGo Zero

AlphaGo Zero implemented a form of policy iteration (Section 4.3), interleaving policy evaluation with policy improvement.
Figure 16.8 is an overview of AlphaGo Zero’s algorithm.
A significant difference between AlphaGo Zero and AlphaGo is that AlphaGo Zero used MCTS to select moves throughout self-play reinforcement learning, whereas AlphaGo used MCTS for live play after—but not during—learning.
Other differences are that AlphaGo Zero used only one deep convolutional ANN and used a simpler version of MCTS.
AlphaGo Zero’s MCTS was simpler than the version used by AlphaGo in that it did not include rollouts of complete games, and therefore did not need a rollout policy.
Each iteration of AlphaGo Zero’s MCTS ran a simulation that ended at a leaf node of the current search tree instead of at the terminal position of a complete game simulation.
But as in AlphaGo, each iteration of MCTS in AlphaGo Zero was guided by the output of a deep convolutional network, labeled fθ in Figure 16.7, where θ is the network’s weight vector.
The input to the network, whose architecture we describe below, consisted of raw representations of board positions, and its output had two parts:
a scalar value, v, an estimate of the probability that the current player will win from the current board position,
and a vector, p, of move probabilities, one for each possible stone placement on the current board, plus the pass, or resign, move.
Instead of selecting self-play actions according to the probabilities p, however, AlphaGo Zero used these probabilities, together with the network’s value output, to direct each execution of MCTS, which returned new move probabilities, shown in Figure 16.7 as the policies πi.
These policies benefitted from the many simulations that MCTS conducted each time it executed.
The result was that the policy actually followed by AlphaGo Zero was an improvement over the policy given by the network’s outputs p.
Silver et al. (2017) wrote that “MCTS may therefore be viewed as a powerful policy improvement operator.”

Reinforcement Learning: An Introduction
Second edition, in progress ****Complete Draft****
November 5, 2017
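
As a small illustration of the "policy improvement operator" view in the passage above: in AlphaGo Zero the improved policy π is derived from the root visit counts, with the probability of a move proportional to N(a)^(1/τ) for a temperature parameter τ. The sketch below shows only that conversion (the function name is illustrative); the network and the tree search themselves are omitted.

Code: Select all

#include <cmath>
#include <cstddef>
#include <vector>

// Convert root visit counts into the improved policy pi: the probability of
// move a is N(a)^(1/tau) normalized over all moves.  tau = 1 reproduces the
// plain visit-count distribution; small tau approaches greedy selection.
std::vector<double> visits_to_policy(const std::vector<int>& visitCounts,
                                     double tau = 1.0) {
    std::vector<double> pi(visitCounts.size(), 0.0);
    double sum = 0.0;
    for (std::size_t i = 0; i < visitCounts.size(); ++i) {
        pi[i] = std::pow((double)visitCounts[i], 1.0 / tau);
        sum  += pi[i];
    }
    if (sum > 0.0)
        for (double& x : pi) x /= sum;
    return pi;
}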