MCTS with minmax backup operator?

fierz · Post by **fierz** » Tue May 01, 2018 9:39 pm

This is a question for Daniel Shawul (but also anyone else who might know...): You mentioned in some posts that a "minmax backup operator" worked better than the averaging operator for you. But you never mentioned exactly what you were doing - could you please elaborate a bit?
I wanted to try something like this, but I'm confused because in the leaf nodes you always have 1 win, loss or draw, and if you back up maximal/minimal scores, don't you always end up with 1/0/-1 and nothing in between? What am I missing?

cheers
Martin

Daniel Shawul · Post by **Daniel Shawul** » Tue May 01, 2018 10:32 pm

fierz wrote:This is a question for Daniel Shawul (but also anyone else who might know...): You mentioned in some posts that a "minmax backup operator" worked better than the averaging operator for you. But you never mentioned exactly what you were doing - could you please elaborate a bit?
I wanted to try something like this, but I'm confused because in the leaf nodes you always have 1 win, loss or draw, and if you back up maximal/minimal scores, don't you always end up with 1/0/-1 and nothing in between? What am I missing?

cheers
Martin

I keep actual scores in centipawns but that is irrelvant to the minmax/averaging stuff.

In the internal nodes, the score of a node is calculated from its children. Given childrens with scores of say [1,-1,-1,-1,0,0,], the minmax methods picks the first child and assigns a score of +1 to the parent node; the averaging method assigns a score of -1/3 to the parent.

Now you could also do the minmax and averaging by considering results from previous searches (expansions in case of MCTS). Infact, what AlphaZero does is to keep a single averaged score from different searches. So if the node in the above example had an old score of +1 from 1 previous visit, the average score will be (-1/3+1)/2. You could do the same with the minimax score as well; average it with the previous minimax score. So this one is similar to averaging scores from different iterative depths ...

The result for me clearly demonstrates that any kind of averaging deterioriates elo, except for the case of "avg" and "avg-old" (look below for definitions).

Code: Select all

 1.                        scorpio-mcts-min    425     35     35    687    84.9%   -69    14.8% 
 2.                        scorpio-mcts-mix    321     32     32    687    77.7%   -52    22.6% 
 3.                         scorpio-mcts-ab    284     31     31    687    73.7%   -46    16.6% 
 4.                    scorpio-mcts-min-old    -58     30     30    688    42.3%    10    20.6% 
 5.                    scorpio-mcts-mix-old   -102     31     31    687    38.6%    16    20.5% 
 6.                    scorpio-mcts-avg-old   -194     32     32    687    30.0%    32    23.3% 
 7.                        scorpio-mcts-avg   -672     60     60    687     2.9%   111     4.7%

min = pure minmax
mix = 3:1 ration minmax averaging
avg = Averaging
min-old = Averages current minmax score with previous minmax score
avg-old = Averages current average score with previous average score ( AlphaZero uses this)
mix-od = Averages current mix score with old mix score

Daniel

fierz · Post by **fierz** » Tue May 01, 2018 10:55 pm

Thanks for the quick reply!

On the averaging part, perhaps I was doing something stupid so far, I have been keeping track of wins/draws/losses for each playout; in your example

Given childrens with scores of say [1,-1,-1,-1,0,0,]

and a parent value of -1/3, if I did a new expansion and added a child with +1, I would have [1,-1,-1,-1,0,0,1] and a new average of -1/7.

Do you mean to say that it's better to compute (old average = -1/3) + 1 / 2 = +1/3? Did you ever try both versions?

And on the minmax part, you do the same then? I will try!

Daniel Shawul · Post by **Daniel Shawul** » Tue May 01, 2018 11:09 pm

fierz wrote:Thanks for the quick reply!

On the averaging part, perhaps I was doing something stupid so far, I have been keeping track of wins/draws/losses for each playout; in your example

Given childrens with scores of say [1,-1,-1,-1,0,0,]
and a parent value of -1/3, if I did a new expansion and added a child with +1, I would have [1,-1,-1,-1,0,0,1] and a new average of -1/7.

It is Ok to keep track of the total number of wins/losses with the number of visits which allows you to compute the average and standard deviation of the score eventually. Standard implementation of UCT actually does this.

Do you mean to say that it's better to compute (old average = -1/3) + 1 / 2 = +1/3? Did you ever try both versions?

And on the minmax part, you do the same then? I will try!

Yes, as I mentioned above. Those are "avg" and "avg-old" and this is the only one helped by averaging with previous score of the node.

Three of the six versions (those with "old" suffix) I mentioned before average with old scores kept for a node. This only helps the "avg" operator though; both the "min" and "mix" versions suffer from averaging with old scores. I hope the definitions I gave above are not confusing.

Note that when doing one playout from root to leaf, I have scores for siblings at each leve, that were not selected for expansion, from a qsearch() I do for each child when they are first created.

fierz · Post by **fierz** » Tue May 01, 2018 11:25 pm

Thanks again for the prompt reply.

One more dumb question: if you average a single value of the last rollout with say 100'000 previous results, isn't that a bit extreme (it kind of makes sense with 1 vs 6 as before, but with large numbers....)?

Is the A0 algorithm published somewhere? I looked at the A0 paper but there are no details in there

Daniel Shawul · Post by **Daniel Shawul** » Tue May 01, 2018 11:31 pm

fierz wrote:Thanks again for the prompt reply.

One more dumb question: if you average a single value of the last rollout with say 100'000 previous results, isn't that a bit extreme (it kind of makes sense with 1 vs 6 as before, but with large numbers....)?

The averaging is weighted by the number of visits, so in your example it hardly changes anything. The "min" version I have doesn't care if you had an old score with 100000 vistis, it just takes the value from the current ply. That is what an alpha-beta engine actually does, i.e the score can swing wildy going from depth d to d+1 since it doesnt average those two. This is what I found to perform the best for chess.

Is the A0 algorithm published somewhere? I looked at the A0 paper but there are no details in there

The A0 algorithm is a standard UCT implementation for which there are many sources. You should probably look into the AlphaGo / AlphaGo Zero paper for mode details not the alphazero arxiv paper.

fierz · Post by **fierz** » Wed May 02, 2018 6:38 am

Dear Daniel,

I'm sorry to ask again (feeling pretty dumb...) but I just don't get it: with only 6 nodes, and adding a seventh node, you explained that the seventh node is added to the previous average directly, i.e.

new average = 1/2 ( old average(of six) + new value)

instead of

new average = 1/7 ( 6*old average + new value)

Now you say that the averaging is weighted by the number of visits? I don't get it

- that seems to imply the second formula above, but before you said it was the first one? What am I missing? I'm doing the second one, which is standard UCT as far as I can find it in the web, so how is A0 and your program different?

Regarding max/min: my engine is doing rollouts in the end, I can only get the values win/lose/draw. If I min/max from there, all I can back up to the root is also win/lose/draw, and that's different from alphabeta with eval where I can back up intermediate values and will switch....

Daniel Shawul · Post by **Daniel Shawul** » Wed May 02, 2018 3:39 pm

fierz wrote:Dear Daniel,

I'm sorry to ask again (feeling pretty dumb...) but I just don't get it: with only 6 nodes, and adding a seventh node, you explained that the seventh node is added to the previous average directly, i.e.

new average = 1/2 ( old average(of six) + new value)

instead of

new average = 1/7 ( 6*old average + new value)

Now you say that the averaging is weighted by the number of visits? I don't get it - that seems to imply the second formula above, but before you said it was the first one? What am I missing? I'm doing the second one, which is standard UCT as far as I can find it in the web, so how is A0 and your program different?

Regarding max/min: my engine is doing rollouts in the end, I can only get the values win/lose/draw. If I min/max from there, all I can back up to the root is also win/lose/draw, and that's different from alphabeta with eval where I can back up intermediate values and will switch....

I didn't care to be precise because all I wanted is to convey how the averaging and minmax backup are done to get the parent node score.

The averaging is weighted by the number of visits to each child, not by the number of children.

In my previous example with average scores for the six children of [1,-1,-1,-1,0,0] and corresponding number of visits of say [4,2,2,1,1,2], you would get an average score of

Code: Select all

    parent_score = (4*1+2*-1+2*-1+1*-1+1*0+2*0) / (4+2+2+1+1+2)

Daniel Shawul · Post by **Daniel Shawul** » Wed May 02, 2018 4:54 pm

Regarding max/min: my engine is doing rollouts in the end, I can only get the values win/lose/draw. If I min/max from there, all I can back up to the root is also win/lose/draw, and that's different from alphabeta with eval where I can back up intermediate values and will switch....

I think what is confusing is that you are probably doing montecarlo playouts until you encounter terminal postions, i.e. where you have +1,0,-1 as a score. AlphaZero and my engine do not do montecarlo rollouts but use either a NN or a standard evaluation function to determine the winning probability from the leaf position. For example if the eval says +100 centi pawns up, that is a winning probability of 0.64. In your case, if you do multiple, not one, rollouts to evaluate a leaf position you can determine the winning probability by averaging the results from those rollouts. Say 10 montecarlo playouts are used to evaluate a leaf and you have 5 wins, 3 losses, 2 draws then your winning probability is a fraction which is (5*1+3*-1+2*0)/10=0.2.

It is only when you do a single rollout, which has a very large standard deviation not to be reliable, that you have to back up a win/lose/draw score. Infact it is recommended to use averaging when the standard deviation is high, and minmax when it is low. AlphaZero and my engine assume that the scores returned from NN/evaluation function give reliable estimates of the winning probability of the position so you could start using minmax right after one visist. Note that AlphaGo used to mix monte carlo rollouts and NN evaluation in a 50-50 manner but they found out the result from the NN is reliable enough to not use the costly montecarlo rollouts.

I hope this helps.

fierz · Post by **fierz** » Wed May 02, 2018 6:23 pm

Dear Daniel,

thanks, now it makes a lot more sense to me

I wanted to try MCTS/UCT because of its nice property of not needing an eval, just to learn something, so to remain pure I will try the multi-rollout.

best regards
Martin

MCTS with minmax backup operator?

MCTS with minmax backup operator?

Re: MCTS with minmax backup operator?

Re: MCTS with minmax backup operator?

Re: MCTS with minmax backup operator?

Re: MCTS with minmax backup operator?

Re: MCTS with minmax backup operator?

Re: MCTS with minmax backup operator?

Re: MCTS with minmax backup operator?

Re: MCTS with minmax backup operator?

Re: MCTS with minmax backup operator?