MCTS with minmax backup operator?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

hgm
Posts: 27802
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: MCTS with minmax backup operator?

Post by hgm »

I still have difficulty grasping how this UCT algorithm can be any good for chess. The way the confidence bounds are calculated as a function of the number of visits doesn't seem to be realistic for chess at all.

I vaguely recall that the original formulation of UCT shrinks the confidence interval as 1/sqrt(visits). This is fine when the search scores behave like the average of draws from a Gaussian distribution (and, by the central-limit theorem, from any finite-variance distribution). But chess search trees don't behave like that at all; they have a pretty constant score until you make the tree large enough to see some new tactic, and then the score can suddenly change by an arbitrarily large amount. Of course, when you average scores in all nodes, the score at the root cannot make sudden jumps. But that just means that once the game-changing tactic comes into view, you have to waste thousands of visits correcting all the averages before the newly found PV comes to dominate them.

Of course AlphaZero already 'cheated' with the UCT term by making it proportional to 1/visits rather than 1/sqrt(visits). This seems a rather arbitrary choice. The efficiency of the search will obviously depend a great deal on how you calculate the UCT term, as this determines how often you visit a move that seems a certain amount worse than the move that currently seems best. If the UCT term decreases very rapidly with visits, you won't waste many visits keeping the UCT bound of a worse move below the score of the best one. But if you overdo that, you will never discover that a move which seems poor at low depth is actually good.
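The two decay laws can be compared directly. Below is a small sketch (my own illustration, not code from any engine; the constants c and the prior are made-up values):

```python
import math

def ucb_term_ucb1(parent_visits, child_visits, c=1.4):
    # Original UCT: the exploration term shrinks as 1/sqrt(visits).
    return c * math.sqrt(math.log(parent_visits) / child_visits)

def ucb_term_puct(parent_visits, child_visits, prior, c=1.5):
    # AlphaZero-style PUCT: shrinks roughly as 1/visits, and is
    # scaled by the policy prior P of the move.
    return c * prior * math.sqrt(parent_visits) / (1 + child_visits)

# For a fixed parent count, the PUCT term dies off much faster,
# so poorly-scoring moves stop being revisited much sooner.
for n in (1, 10, 100, 1000):
    print(n, ucb_term_ucb1(10000, n), ucb_term_puct(10000, n, prior=0.3))
```

Going from 100 to 1000 visits shrinks the UCB1 term by a factor of sqrt(10), but the PUCT term by nearly a factor of 10.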

So it seems one of the first things we should know is how much moves with a certain 'score deficit' compared to the PV can be reduced. There will be an optimum for this; reducing them too much will make you overlook too many things, reducing them too little will waste nodes on their fruitless search. This will be important in particular in all-nodes (i.e. an even number of ply removed from a PV node), because there we are desperately searching for a move that will get us back to the PV score, and it is important to distribute that effort over moves in such a way that we have the maximum chance to find something. In an alpha-beta searcher we do this by LMR, guided by a move ordering based on static criteria (as we do not have true scores in alpha-beta, we cannot base it on those). In the AlphaZero approach we also start with (very advanced) static criteria, namely the output of the policy network. But eventually the UCT term will dominate that. So it seems we should design a UCT term that mimics the LMR we know works well in alpha-beta engines (using the NN policy for sorting).

In cut-nodes we have another problem; MCTS-UCT tends to treat those the same as all-nodes, resulting in a mini-max search. This is not so bad if the cut-move is far better than all others, because then these others get reduced / their visit count suppressed, so that they cause relatively little overhead. In alpha-beta we also have some overhead involved in finding a proper cut-move, e.g. by IID, searching many alternative moves that are not good enough at a much reduced depth.

But if there are many approximately equally good cut-moves, MCTS-UCT starts to perform very poorly. It will keep searching all these cut-moves to equal depth, rather than just focusing on one that seems to do the job. In chess you get a much more reliable score lower bound by searching a single move to d=N+2 than by searching all moves to d=N and taking the best. The problem is that if there is a deep threat against you, most moves will do nothing to prevent it, and all these moves will be axed at the same moment, once the execution of that threat comes within the horizon.

I guess the whole concept of UCT is a bust for cut-nodes. Because there you are not looking for the move with the best chances to be the best, but for the move with the best chances to not be below the PV score. So you are actually interested in the LOWER confidence bound. If you have two moves with (persistently) equal scores, and the confidence interval shrinks with the number of visits, the upper confidence limit of the most-visited move will be lower. So you will then pick the least-visited move, to equalize the number of visits. But if you picked the highest lower confidence limit instead, the move with the most visits would continue to be picked, and the other one would never be visited again. Such instability is exactly what you want for a cut-move: concentrate on one that does the job. Of course when the new visits make the score drop, so that it is no longer a cut-move (or it gets so much worse than the other that, despite the narrower confidence interval, its lower confidence limit drops below that of the other), you will start visiting the other move again.

You will probably need some minimum number of visits to a node before you can even suspect what the PV is, and thus whether the node is a cut-node. Until that point you would just select moves by UCT. That 'pilot search' (equivalent to IID in an alpha-beta searcher) would get you to a stage where you can be reasonably confident that what seems the best move is a cut-move, and then start to apply LCT, or a mix of LCT/UCT where you slowly scale down the UCT fraction.
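The pilot-search-then-LCT scheme above could be sketched roughly like this (a toy illustration of the idea only; `looks_like_cut_node` and the pilot-visit threshold are hypothetical names and values):

```python
import math

def select_child(node, pilot_visits=32, c=1.4):
    """Select by UCT until the node has enough visits to guess its type;
    after that, if it looks like a cut-node, select by the LOWER
    confidence limit, so the search concentrates on one cut-move."""
    log_n = math.log(node.visits)

    def ucb(child):   # upper confidence limit: spreads visits around
        return child.score + c * math.sqrt(log_n / (1 + child.visits))

    def lcb(child):   # lower confidence limit: sticks with one cut-move
        return child.score - c * math.sqrt(log_n / (1 + child.visits))

    if node.visits < pilot_visits or not node.looks_like_cut_node():
        return max(node.children, key=ucb)
    return max(node.children, key=lcb)
```

Note the instability described above: with equal scores, the LCB rule keeps picking the most-visited child (its interval is narrower), while the UCB rule would keep equalizing visit counts.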
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: MCTS with minmax backup operator?

Post by Daniel Shawul »

hgm wrote:I still have difficulty grasping how this UCT algorithm can be any good for chess. The way the confidence bounds are calculated as a function of the number of visits doesn't seem to be realistic for chess at all.

I vaguely recall that the original formulation of UCT shrinks the confidence interval as 1/sqrt(visits). This is fine when the search scores behave like the average of draws from a Gaussian distribution (and, by the central-limit theorem, from any finite-variance distribution). But chess search trees don't behave like that at all; they have a pretty constant score until you make the tree large enough to see some new tactic, and then the score can suddenly change by an arbitrarily large amount. Of course, when you average scores in all nodes, the score at the root cannot make sudden jumps. But that just means that once the game-changing tactic comes into view, you have to waste thousands of visits correcting all the averages before the newly found PV comes to dominate them.

Of course AlphaZero already 'cheated' with the UCT term by making it proportional to 1/visits rather than 1/sqrt(visits). This seems a rather arbitrary choice. The efficiency of the search will obviously depend a great deal on how you calculate the UCT term, as this determines how often you visit a move that seems a certain amount worse than the move that currently seems best. If the UCT term decreases very rapidly with visits, you won't waste many visits keeping the UCT bound of a worse move below the score of the best one. But if you overdo that, you will never discover that a move which seems poor at low depth is actually good.
Yes, the fastest way to find a tactic is to forget all previous outcomes at a node when backing up and use only the latest. That is what the "max" version I mentioned does, and its score swings wildly from iteration to iteration, unlike the averaging method.
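As a toy illustration (not Scorpio's actual code), the two backup operators differ like this:

```python
def backup_average(node, leaf_value):
    # Classic MCTS backup: the node value is the running mean of all
    # outcomes seen so far, so a newly found tactic shifts it slowly.
    node.visits += 1
    node.value += (leaf_value - node.value) / node.visits

def backup_max(node, child_values):
    # "Max" backup: forget previous outcomes and keep only the current
    # best child value, so the score can jump the moment the tree
    # sees a tactic.
    node.visits += 1
    node.value = max(child_values)
```

If three quiet rollouts return 0.0 and a fourth finds a win (1.0), the averaged value only moves to 0.25, while the max backup jumps straight to 1.0.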


About the PUCB formula, I don't know where they got it exactly, but the paper they referred to as containing the formula actually has something else, resembling the original UCB formula in that the confidence intervals are calculated proportional to sqrt(log(N)/N_i). As you pointed out, their formula can focus on the best node much faster than standard UCB.

The paper is here, and the formula given in Figure 1 is quite different from what they used.
So it seems one of the first things we should know is how much moves with a certain 'score deficit' compared to the PV can be reduced. There will be an optimum for this; reducing them too much will make you overlook too many things, reducing them too little will waste nodes on their fruitless search. This will be important in particular in all-nodes (i.e. an even number of ply removed from a PV node), because there we are desperately searching for a move that will get us back to the PV score, and it is important to distribute that effort over moves in such a way that we have the maximum chance to find something. In an alpha-beta searcher we do this by LMR, guided by a move ordering based on static criteria (as we do not have true scores in alpha-beta, we cannot base it on those). In the AlphaZero approach we also start with (very advanced) static criteria, namely the output of the policy network. But eventually the UCT term will dominate that. So it seems we should design a UCT term that mimics the LMR we know works well in alpha-beta engines (using the NN policy for sorting).
The way I do LMR in the alpha-beta rollouts version is as follows. The nodes in the tree are sorted at the beginning of the search (a fixed-depth search) and are given ranks. These ranks remain static until the search is finished -- meaning no dynamic information is used to sort the moves while the search is going on. The LMR reduction depth for each move is based on the rank of the move and remains static until the one ID search is finished. This is because the search is best-first, so the next time you visit the node you have to search it to the same depth, otherwise you get nonsense results.
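A sketch of such a static-rank reduction (my own guess at a reduction table, not Scorpio's actual one):

```python
def lmr_depth(base_depth, rank, pv_node=False):
    """Depth to search a move of the given static rank. The reduction
    depends only on the rank fixed at the start of the ID iteration,
    so repeated best-first visits search the node to a consistent
    depth."""
    if pv_node or rank < 2:
        reduction = 0                                  # never reduce early moves
    else:
        reduction = min(base_depth - 1, 1 + rank // 4) # hypothetical table
    return base_depth - reduction
```

The key property is determinism: given the same base depth and rank, the function always returns the same depth, which is what keeps repeated visits to the same node consistent.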

To mimic this in a standard UCT implementation should be equivalent to tuning the exploration constant. At first I had difficulty beating the alpha-beta rollouts version with a UCT implementation with minimax backup, but after I lowered the exploration coefficient the selectivity increased and it started beating it.
In cut-nodes we have another problem; MCTS-UCT tends to treat those the same as all-nodes, resulting in a mini-max search. This is not so bad if the cut-move is far better than all others, because then these others get reduced / their visit count suppressed, so that they cause relatively little overhead. In alpha-beta we also have some overhead involved in finding a proper cut-move, e.g. by IID, searching many alternative moves that are not good enough at a much reduced depth.

But if there are many approximately equally good cut-moves, MCTS-UCT starts to perform very poorly. It will keep searching all these cut-moves to equal depth, rather than just focusing on one that seems to do the job. In chess you get a much more reliable score lower bound by searching a single move to d=N+2 than by searching all moves to d=N and taking the best. The problem is that if there is a deep threat against you, most moves will do nothing to prevent it, and all these moves will be axed at the same moment, once the execution of that threat comes within the horizon.

I guess the whole concept of UCT is a bust for cut-nodes. Because there you are not looking for the move with the best chances to be the best, but for the move with the best chances to not be below the PV score. So you are actually interested in the LOWER confidence bound. If you have two moves with (persistently) equal scores, and the confidence interval shrinks with the number of visits, the upper confidence limit of the most-visited move will be lower. So you will then pick the least-visited move, to equalize the number of visits. But if you picked the highest lower confidence limit instead, the move with the most visits would continue to be picked, and the other one would never be visited again. Such instability is exactly what you want for a cut-move: concentrate on one that does the job. Of course when the new visits make the score drop, so that it is no longer a cut-move (or it gets so much worse than the other that, despite the narrower confidence interval, its lower confidence limit drops below that of the other), you will start visiting the other move again.

You will probably need some minimum number of visits to a node before you can even suspect what the PV is, and thus whether the node is a cut-node. Until that point you would just select moves by UCT. That 'pilot search' (equivalent to IID in an alpha-beta searcher) would get you to a stage where you can be reasonably confident that what seems the best move is a cut-move, and then start to apply LCT, or a mix of LCT/UCT where you slowly scale down the UCT fraction.
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: MCTS with minmax backup operator?

Post by Rémi Coulom »

Daniel Shawul wrote:The result for me clearly demonstrates that any kind of averaging deteriorates Elo, except for the case of "avg" and "avg-old" (look below for definitions).

Code: Select all

 1.                        scorpio-mcts-min    425     35     35    687    84.9%   -69    14.8% 
 2.                        scorpio-mcts-mix    321     32     32    687    77.7%   -52    22.6% 
 3.                         scorpio-mcts-ab    284     31     31    687    73.7%   -46    16.6% 
 4.                    scorpio-mcts-min-old    -58     30     30    688    42.3%    10    20.6% 
 5.                    scorpio-mcts-mix-old   -102     31     31    687    38.6%    16    20.5% 
 6.                    scorpio-mcts-avg-old   -194     32     32    687    30.0%    32    23.3% 
 7.                        scorpio-mcts-avg   -672     60     60    687     2.9%   111     4.7% 
Hi Daniel,

Thanks for sharing your result.

Your message made me try using negamax instead of averaging in my AlphaZero shogi engine. Negamax turned out to be much weaker (about 150 Elo points at 40 playouts per move). So I wonder how your MCTS works. Do you use a policy to prune bad moves? Not using a policy would explain why averaging is so bad in your case.

In the early stages of Crazy Stone, I tried many backup operators. I also tried to mix averaging with negamaxing. I never managed to do better than averaging.

What I found is that an important aspect of MCTS is that evaluation errors are not symmetric: over-estimating a move is not a big problem, because the over-estimation will cause it to be searched more, and the error will eventually be corrected. Under-estimating a move, on the other hand, may have much worse consequences, because an under-estimated move may not be searched any more, so the error will never be fixed.

Averaging values produces a pessimistic estimation of the current board. So the move that leads to it will have an optimistic evaluation, which is necessary for the good dynamics of MCTS.
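A toy numeric illustration of that last point (made-up values): suppose a node has three replies worth 0.9, 0.1 and 0.1 from its own side's point of view.

```python
children = [0.9, 0.1, 0.1]            # child values, node's point of view

minimax_value = max(children)                  # 0.9
average_value = sum(children) / len(children)  # ~0.37: pessimistic

# Negamax sign-flip at the parent: the move INTO this node scores
# -0.9 under minimax backup, but only about -0.37 under averaging.
# Averaging thus evaluates the move optimistically, so it keeps
# receiving visits until the estimate is corrected.
assert -average_value > -minimax_value
```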
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: MCTS with minmax backup operator?

Post by Michel »

Remi Coulom wrote:Averaging values produces a pessimistic estimation of the current board. So the move that leads to it will have an optimistic evaluation, which is necessary for the good dynamics of MCTS.
That's a very nice theory!
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: MCTS with minmax backup operator?

Post by Daniel Shawul »

Rémi Coulom wrote:
Daniel Shawul wrote:The result for me clearly demonstrates that any kind of averaging deteriorates Elo, except for the case of "avg" and "avg-old" (look below for definitions).

Code: Select all

 1.                        scorpio-mcts-min    425     35     35    687    84.9%   -69    14.8% 
 2.                        scorpio-mcts-mix    321     32     32    687    77.7%   -52    22.6% 
 3.                         scorpio-mcts-ab    284     31     31    687    73.7%   -46    16.6% 
 4.                    scorpio-mcts-min-old    -58     30     30    688    42.3%    10    20.6% 
 5.                    scorpio-mcts-mix-old   -102     31     31    687    38.6%    16    20.5% 
 6.                    scorpio-mcts-avg-old   -194     32     32    687    30.0%    32    23.3% 
 7.                        scorpio-mcts-avg   -672     60     60    687     2.9%   111     4.7% 
Hi Daniel,

Thanks for sharing your result.

Your message made me try using negamax instead of averaging in my AlphaZero shogi engine. Negamax turned out to be much weaker (about 150 Elo points at 40 playouts per move). So I wonder how your MCTS works. Do you use a policy to prune bad moves? Not using a policy would explain why averaging is so bad in your case.
Hi Remi,
There is no policy or eval network; I did this experiment by replacing the eval network with standard hand-written evaluation of Scorpio. So that is one difference.

The other one, as HGM pointed out, is that AlphaZero uses a modified PUCT formula that shrinks the confidence interval in inverse proportion to the number of visits, not to sqrt(visits) as in the standard UCB formula.

So what I do is the following. Each child gets a score computed by a quiescent_search() at creation, so later, when I want to minimax/average, all children will have some score. The pure minimax ("min") approach always picks the child with the best score and backs that up to the parent, without mixing it with the old score the parent may have. So the "min" version results in the most dynamic root score. If I add a little bit of averaging ("min-old") by mixing with the old score stored at the parent node, the performance drops. I wanted to try mixing with the old score stored at the parent because that is how standard UCT with averaging works. The equivalent in an iterative-deepening alpha-beta search would be averaging the scores of the depth d and d+1 searches; one would expect this not to work in chess.
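In toy form (not actual Scorpio code; `alpha` is a hypothetical mixing weight), the difference between "min" and "min-old" is:

```python
def backup_min(parent, child_scores):
    # "min": pure minimax backup (negamax convention: scores are from
    # the side to move, so the minimax here is a max). The parent's
    # old score is simply overwritten, so the root score can swing.
    parent.score = max(child_scores)

def backup_min_old(parent, child_scores, alpha=0.25):
    # "min-old": blend the minimax value with the score already stored
    # at the parent, the way UCT's incremental averaging folds new
    # results into the old mean.
    parent.score = (1 - alpha) * max(child_scores) + alpha * parent.score
```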

One peculiar result is that while any form of averaging seems to hurt performance in chess, it doesn't hurt going from "avg" to "avg-old". That is, if you decide to compute the parent score from the average of the children, it doesn't hurt to average it further with the old score.
In the early stages of Crazy Stone, I tried many backup operators. I also tried to mix averaging with negamaxing. I never managed to do better than averaging.

What I found is that an important aspect of MCTS is that evaluation errors are not symmetric: over-estimating a move is not a big problem, because the over-estimation will cause it to be searched more, and the error will eventually be corrected. Under-estimating a move, on the other hand, may have much worse consequences, because an under-estimated move may not be searched any more, so the error will never be fixed.
That is a good point. I think that is the cause of most tactical errors in chess, because a bad-looking trap move hardly gets any visits. LCzero can suffer from 2-ply tactics because of that.
Averaging values produces a pessimistic estimation of the current board. So the move that leads to it will have an optimistic evaluation, which is necessary for the good dynamics of MCTS.
How can being more pessimistic or optimistic in our evaluation of moves be better one way or the other? I observed that the mcts-min version has a root score that swings very wildly, like an alpha-beta engine, while mcts-avg has a more or less stable score. We want the former unless we think our evaluation has a lot of uncertainty. I think mcts-min fits chess better, because there is a lot of certainty in our evaluation scores, even when considering only material.

Btw there are situations where mcts-avg is actually more optimistic. When our opponent has a single escape move, we might think we are doing better than we actually are. I have seen situations of this kind where mcts-avg reported +6.00 pawns in some fuzzy king attack that evaporated on the next move.

I am very interested to learn about your results with the AlphaZero approach in Shogi. Did you already participate with it in the Shogi tournament?

regards,
Daniel
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: MCTS with minmax backup operator?

Post by Rémi Coulom »

I see. Using a policy would probably help a lot.

The shogi tournament was this week, and finished today. I was late in my preparation, and started to train my network on Monday. I was pleasantly surprised by how fast it made progress. The tournament started on Thursday. I could not qualify past the preliminary round, but the results were rather good for a newcomer.

http://blog.computer-shogi.org/wp-conte ... esult.html

Crazy Shogi took 12th place out of 40, ahead of famous programs such as YSS. It was running on 4xV100.

I can't play shogi at all, so it is difficult for me to appreciate how the program plays. I have the feeling that it has very good positional understanding and is less materialistic than its opponents. It won a spectacular game against CGP, where its opponent evaluated itself at +800 cp for most of the game:

https://twitter.com/katsudonshogi/statu ... 4357672961
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: MCTS with minmax backup operator?

Post by Daniel Shawul »

Rémi Coulom wrote:I see. Using a policy would probably help a lot.

The shogi tournament was this week, and finished today. I was late in my preparation, and started to train my network on Monday. I was pleasantly surprised by how fast it made progress. The tournament started on Thursday. I could not qualify past the preliminary round, but the results were rather good for a newcomer.

http://blog.computer-shogi.org/wp-conte ... esult.html

Crazy Shogi took 12th place out of 40, ahead of famous programs such as YSS. It was running on 4xV100.

I can't play shogi at all, so it is difficult for me to appreciate how the program plays. I have the feeling that it has very good positional understanding and is less materialistic than its opponents. It won a spectacular game against CGP, where its opponent evaluated itself at +800 cp for most of the game:

https://twitter.com/katsudonshogi/statu ... 4357672961
That is a very good result, congrats!

It looks like there were also other CNN engines, guessing from their names: DNNc, softmax, etc.

The issue I have with the plain A0 approach is tactics. LCzero still suffers from 2-ply tactics, and against an opponent that aims to exploit that it might not perform as well.

The policy network is a static rule (unlike an actual shallow-depth search). If it is mostly good, it won't give enough visits to trap moves that look bad initially but turn out to be good on exhaustive search. If it is bad, there is no hope. So the question is: how likely are traps in actual games? This paper says they are very common in chess, and I suppose in Shogi too. Shogi has a large branching factor, so it might be more suitable for DNNs than chess.
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: MCTS with minmax backup operator?

Post by Rémi Coulom »

Daniel Shawul wrote:The policy network is a static rule (unlike an actual shallow-depth search). If it is mostly good, it won't give enough visits to trap moves that look bad initially but turn out to be good on exhaustive search. If it is bad, there is no hope. So the question is: how likely are traps in actual games? This paper says they are very common in chess, and I suppose in Shogi too. Shogi has a large branching factor, so it might be more suitable for DNNs than chess.
You underestimate what a policy network can do. The policy network of Alpha Zero is pro strength by itself.

Training a network that combines policy and value produces a better value than training a network that learns value only, anyway. So you get the policy for free, and it should be considerably better than any man-made reduction heuristic.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: MCTS with minmax backup operator?

Post by Michel »

The policy network is a static rule (unlike an actual shallow-depth search). If it is mostly good, it won't give enough visits to trap moves that look bad initially but turn out to be good on exhaustive search. If it is bad, there is no hope.
You keep repeating this, but it does not have to be true. Personally, I would guess that the task of the policy network is to select _potentially_ good moves.

If the NN were to provide a score interval instead of a single value, then in the spirit of UCT the policy ordering should be according to the upper bound of the score, whereas for the actual evaluation one would use the middle of the interval.

Note this is just a theory. I have not verified if it corresponds in any way to reality.
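In code, the theory might look like this (purely hypothetical; no existing network outputs such intervals, and all names here are made up):

```python
def order_and_evaluate(moves, interval):
    """interval(move) -> (lo, hi), a score interval from the NN.
    Order moves by the optimistic upper bound, so potentially good
    'trap' moves are not starved of visits; evaluate each move by
    the middle of its interval."""
    ordered = sorted(moves, key=lambda m: interval(m)[1], reverse=True)
    values = {m: (interval(m)[0] + interval(m)[1]) / 2 for m in moves}
    return ordered, values
```

A sharp-looking trap move with interval (0.1, 0.9) would then be searched before a solid move with interval (0.4, 0.6), even though its midpoint evaluation is lower.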
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: MCTS with minmax backup operator?

Post by Daniel Shawul »

Rémi Coulom wrote: Sun May 06, 2018 2:16 pm
Daniel Shawul wrote:The policy network is a static rule (unlike an actual shallow-depth search). If it is mostly good, it won't give enough visits to trap moves that look bad initially but turn out to be good on exhaustive search. If it is bad, there is no hope. So the question is: how likely are traps in actual games? This paper says they are very common in chess, and I suppose in Shogi too. Shogi has a large branching factor, so it might be more suitable for DNNs than chess.
You underestimate what a policy network can do. The policy network of Alpha Zero is pro strength by itself.

Training a network that combines policy and value produces a better value than training a network that learns value only, anyway. So you get the policy for free, and it should be considerably better than any man-made reduction heuristic.
Note that it is pro strength in Go, not in chess. Go is a strategic game and the moves the policy network gives are very good -- so that is a different story from chess. I strongly believe you cannot get more than SEE-level tactics into the policy NN for chess. You can actually play against the LCzero policy network online and see that it plays at maybe no more than 1800 Elo. Also, some people here have matched it against Stockfish at depth=1 and it loses against that as well. That is far from pro-level performance in chess.

I had to switch to alpha-beta rollouts (a form of MCTS) to get the tactics up. MCTS misses too many tactics. Here is the final table I posted with my experiments at a time control of 20+0.1. I probably need to raise the alpha-beta depth above 16 at longer time controls.

Code: Select all

 1.                       scorpio-mcts-ab16    375     25     25   1209    85.1%   -26    17.1%
 2.                                 scorpio    357     23     23   1310    83.9%   -23    19.3%
 3.                       scorpio-mcts-ab12    285     21     21   1478    77.9%   -25    16.8%
 4.                       scorpio-mcts-ab10    187     21     21   1309    68.1%   -11    20.6%
 5.                        scorpio-mcts-ab9    168     21     21   1308    66.3%   -10    19.6%
 6.                        scorpio-mcts-ab8    131     20     20   1308    62.6%    -7    18.5%
 7.                        scorpio-mcts-ab7     78     20     20   1308    57.2%    -3    18.3%
 8.                        scorpio-mcts-ab6     27     20     20   1308    52.3%     0    20.1%
 9.                        scorpio-mcts-ab5    -50     20     20   1308    44.3%     5    18.1%
10.                        scorpio-mcts-ab4   -126     21     21   1308    36.9%    10    16.3%
11.                        scorpio-mcts-ab3   -177     21     21   1308    32.3%    14    14.2%
12.                        scorpio-mcts-min   -236     22     22   1310    26.1%    18    18.5%
13.                        scorpio-mcts-ab2   -258     22     22   1308    24.7%    19    13.8%
14.                        scorpio-mcts-ab1   -365     25     25   1310    16.6%    27    11.0%
15.                         scorpio-mcts-ab   -389     25     25   1310    14.6%    29    11.7%
Daniel
Last edited by Daniel Shawul on Mon May 07, 2018 7:23 am, edited 1 time in total.