Selection in Monte Carlo Alpha-Beta

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
RedBedHed
Posts: 84
Joined: Wed Aug 04, 2021 12:42 am
Full name: Ellie Moore

Selection in Monte Carlo Alpha-Beta

Post by RedBedHed »

Hi again,

I've been working on a Monte Carlo Tree Search for a while now. Initially, I used an immediate evaluation in the simulation step. However, I recently switched to 2-ply Alpha-Beta searches. I am finding that the Alpha-Beta searches are much better at estimating win probabilities, and they also help my MCTS pick up on subtle tactics that it missed before. I believe the playing strength has improved significantly with the new approach, even though the search depth is only 8-20 ply at 20 seconds per move.
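
For context, my simulation step now looks roughly like the sketch below. It is simplified: alphaBeta(), Position, and the 400.0 scaling constant are stand-ins for my real implementation, and the logistic mapping from centipawns to a win probability is just one reasonable choice.

Code: Select all

#include <cmath>

// Placeholders for the engine's own board and search routines.
struct Position;
int alphaBeta(Position& pos, int depth, int alpha, int beta);

constexpr int INF_SCORE = 32000;

// Simulation step: instead of a random rollout, run a shallow
// alpha-beta search and squash its centipawn score into [0, 1].
double simulate(Position& pos) {
    const int score = alphaBeta(pos, /*depth=*/2, -INF_SCORE, +INF_SCORE);
    // Logistic mapping: 0 cp -> 0.5; the 400.0 scale is a tunable
    // constant borrowed from Elo-style win-probability curves.
    return 1.0 / (1.0 + std::pow(10.0, -score / 400.0));
}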

I am using the bandit algorithm UCB1, but I'm unsure if it is the right tool for this search. I also wonder what the best policy for selecting a final move is, given that the number of nodes expanded is about 20x fewer than it used to be. I am currently using "robust child."

Does anyone have experience with Monte Carlo Alpha-Beta? If so, what selection policies do you use?
>> Move Generator: Charon
>> Engine: Homura
void life() { playCapitalism(); return; live(); }
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Selection in Monte Carlo Alpha-Beta

Post by dangi12012 »

RedBedHed wrote: Fri Sep 16, 2022 8:45 am Hi again,

I've been working on a Monte Carlo Tree Search for a while now. Initially, I used an immediate evaluation in the simulation step. However, I recently switched to 2-ply Alpha-Beta searches. I am finding that the Alpha-Beta searches are much better at estimating win probabilities, and they also help my MCTS pick up on subtle tactics that it missed before. I believe the playing strength has improved significantly with the new approach, even though the search depth is only 8-20 ply at 20 seconds per move.

I am using the bandit algorithm UCB1, but I'm unsure if it is the right tool for this search. I also wonder what the best policy for selecting a final move is, given that the number of nodes expanded is about 20x fewer than it used to be. I am currently using "robust child."

Does anyone have experience with Monte Carlo Alpha-Beta? If so, what selection policies do you use?
UCB1 is the mathematically optimal solution to the "exploration vs. exploitation" trade-off, so it is a correct answer to the question: this move looks good, but is it really the best one?
UCB1 is guaranteed to converge to the best solution. I could share my implementation of it, but suffice it to say it can be done insanely fast.
Food for thought: the fast inverse square root trick, ln(x) = c * log2(x), and most importantly of all, log2(x) for a power of two == std::countr_zero(x).
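
Something along these lines - just a rough sketch of the idea, not my production code. The exploration term sqrt(2 ln N / n) factors into sqrt(2 ln N) * (1 / sqrt(n)), so the integer log2 trick and a fast inverse square root combine naturally:

Code: Select all

#include <bit>    // std::countr_zero, std::bit_width, std::bit_cast (C++20)
#include <cmath>
#include <cstdint>

constexpr double LN2 = 0.6931471805599453;

// ln(x) = ln(2) * log2(x); when x is a power of two, log2(x) is just
// the number of trailing zero bits -- no transcendental call needed.
inline double lnPow2(uint64_t x) { return LN2 * std::countr_zero(x); }

// floor(log2(x)) is a cheap approximation for arbitrary x >= 1.
inline double lnFloor(uint64_t x) { return LN2 * (std::bit_width(x) - 1); }

// Quake-style fast inverse square root with one Newton refinement step.
inline float fastInvSqrt(float x) {
    uint32_t i = std::bit_cast<uint32_t>(x);
    i = 0x5f3759df - (i >> 1);
    float y = std::bit_cast<float>(i);
    return y * (1.5f - 0.5f * x * y * y);
}

// UCB1 = value/n + sqrt(2 ln N) * (1 / sqrt(n)), all approximated.
inline double ucb1Fast(double value, uint64_t n, uint64_t N) {
    return value / n
         + std::sqrt(2.0 * lnFloor(N)) * fastInvSqrt((float)n);
}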

Also, you can keep a direct pointer to the best leaf node in each node and update it while backpropagating. Selection of a new leaf node can then be done instantly (I found).
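
Roughly like this (again only a sketch; bestOf() is whatever comparison the engine uses):

Code: Select all

#include <cstdint>

struct Node {
    Node*    parent   = nullptr;
    Node*    bestLeaf = nullptr; // cached best descendant leaf
    double   value    = 0.0;
    uint64_t visits   = 0;
};

// Engine-specific: returns whichever leaf currently scores higher
// (and must tolerate a null first argument).
Node* bestOf(Node* a, Node* b);

// Refresh each ancestor's cached pointer while backpropagating, so
// the next selection pass can jump straight to the best leaf.
void backpropagate(Node* leaf, double reward) {
    for (Node* n = leaf; n != nullptr; n = n->parent) {
        n->visits += 1;
        n->value  += reward;
        n->bestLeaf = bestOf(n->bestLeaf, leaf);
        reward = 1.0 - reward; // flip perspective each ply
    }
}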

Stronger selection policies than UCB1 exist, so convergence can be achieved faster. But the only way to do this is to train a policy network that selects the next node among all nodes. (Just train it such that the engine strength increases, that's all.)


Classic MCTS requires a noisy playout to the end of the game, so you really just get -1, 0, or 1 as the output of a given simulation. You can skip that random playout with a very strong evaluation function (like Lc0's). So in summary, the strongest MCTS engine is this: policy network + evaluation network.
One fact I was surprised by: the Lc0 evaluation network alone is so good that with 1 node visit per move it plays above 2000 Elo.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
hgm
Posts: 28353
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Selection in Monte Carlo Alpha-Beta

Post by hgm »

UCB1 is not much good, is it? A tree search in chess is nothing like repeated sampling of a distribution with finite variance.
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Selection in Monte Carlo Alpha-Beta

Post by dangi12012 »

hgm wrote: Fri Sep 16, 2022 1:34 pm UCB1 is not much good, is it? A tree search in chess is nothing like repeated sampling of a distribution with finite variance.
It's good. UCB1 selects a move depending on whether it leads to a good outcome. A move whose average outcome is good may still be a VERY bad move (one forced losing line among 100 winning ones).

So you don't select the child with the best average winrate - you select the child with the highest number of picks according to UCB1.
Chess is a perfect-information game and exactly the same problem as the https://en.wikipedia.org/wiki/Multi-armed_bandit

This move looks good, but 10 moves later it turns out to be bad. Alpha-Beta solves this. MCTS can solve it too.
The only problem with MCTS is that you have to grow a tree in memory.
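
Final move selection ("robust child") is then just a max over visit counts, something like:

Code: Select all

#include <cstdint>
#include <vector>

struct Node {
    double   value  = 0.0;
    uint64_t visits = 0;
    std::vector<Node> children;
};

// "Robust child": after the search, play the root child with the
// most visits rather than the one with the best average value.
const Node* robustChild(const Node& root) {
    const Node* best = nullptr;
    for (const Node& c : root.children)
        if (best == nullptr || c.visits > best->visits)
            best = &c;
    return best;
}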
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
smatovic
Posts: 3230
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Selection in Monte Carlo Alpha-Beta

Post by smatovic »

AFAIK you can use MCTS-AB with UCT and UCB1 with win/draw/loss as the score, but you can also add a score-based heuristic function for position evaluation and then use PUCT:

https://www.chessprogramming.org/Christ ... Rosin#PUCT

then it would look more like a Best-First Minimax Search:

https://www.chessprogramming.org/Best-F ... max_Search
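
The AlphaZero-style PUCT selection term looks roughly like the sketch below, where q is the mean value, prior is P(s,a) from the policy/heuristic, and cPuct is a tunable constant:

Code: Select all

#include <cmath>
#include <cstdint>

// score(a) = Q(s,a) + cPuct * P(s,a) * sqrt(N(s)) / (1 + N(s,a))
double puct(double q, double prior,
            uint64_t parentVisits, uint64_t childVisits,
            double cPuct = 1.5) {
    return q + cPuct * prior
             * std::sqrt((double)parentVisits)
             / (1.0 + (double)childVisits);
}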

--
Srdja
User avatar
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Selection in Monte Carlo Alpha-Beta

Post by lithander »

RedBedHed wrote: Fri Sep 16, 2022 8:45 am I recently switched to 2-ply Alpha-Beta searches. I am finding that the Alpha-Beta searches are much better at estimating win probabilities. The Alpha-Beta searches also help my MCTS pick up on subtle tactics that it missed before.
Apparently Komodo Dragon's MCTS mode is doing it like that, too: https://youtu.be/JUQoSVqiavw?t=1689
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
User avatar
RedBedHed
Posts: 84
Joined: Wed Aug 04, 2021 12:42 am
Full name: Ellie Moore

Re: Selection in Monte Carlo Alpha-Beta

Post by RedBedHed »

dangi12012 wrote: Fri Sep 16, 2022 11:43 am
RedBedHed wrote: Fri Sep 16, 2022 8:45 am Hi again,

I've been working on a Monte Carlo Tree Search for a while now. Initially, I used an immediate evaluation in the simulation step. However, I recently switched to 2-ply Alpha-Beta searches. I am finding that the Alpha-Beta searches are much better at estimating win probabilities, and they also help my MCTS pick up on subtle tactics that it missed before. I believe the playing strength has improved significantly with the new approach, even though the search depth is only 8-20 ply at 20 seconds per move.

I am using the bandit algorithm UCB1, but I'm unsure if it is the right tool for this search. I also wonder what the best policy for selecting a final move is, given that the number of nodes expanded is about 20x fewer than it used to be. I am currently using "robust child."

Does anyone have experience with Monte Carlo Alpha-Beta? If so, what selection policies do you use?
UCB1 is the mathematically optimal solution to the "exploration vs. exploitation" trade-off, so it is a correct answer to the question: this move looks good, but is it really the best one?
UCB1 is guaranteed to converge to the best solution. I could share my implementation of it, but suffice it to say it can be done insanely fast.
Food for thought: the fast inverse square root trick, ln(x) = c * log2(x), and most importantly of all, log2(x) for a power of two == std::countr_zero(x).

Also, you can keep a direct pointer to the best leaf node in each node and update it while backpropagating. Selection of a new leaf node can then be done instantly (I found).

Stronger selection policies than UCB1 exist, so convergence can be achieved faster. But the only way to do this is to train a policy network that selects the next node among all nodes. (Just train it such that the engine strength increases, that's all.)


Classic MCTS requires a noisy playout to the end of the game, so you really just get -1, 0, or 1 as the output of a given simulation. You can skip that random playout with a very strong evaluation function (like Lc0's). So in summary, the strongest MCTS engine is this: policy network + evaluation network.
One fact I was surprised by: the Lc0 evaluation network alone is so good that with 1 node visit per move it plays above 2000 Elo.
Woah. Yeah, it's pretty wild that Lc0 and AlphaZero are so strong evaluation-wise. I don't think I could ever hope to train a network like that on my laptop, sadly :/

I haven't really thought about the speed of UCB1... I have it implemented with std::log and std::sqrt... I will look into this asap :)
>> Move Generator: Charon
>> Engine: Homura
void life() { playCapitalism(); return; live(); }
User avatar
RedBedHed
Posts: 84
Joined: Wed Aug 04, 2021 12:42 am
Full name: Ellie Moore

Re: Selection in Monte Carlo Alpha-Beta

Post by RedBedHed »

hgm wrote: Fri Sep 16, 2022 1:34 pm UCB1 is not much good, is it? A tree search in chess is nothing like repeated sampling of a distribution with finite variance.
Yeah, I was thinking the same thing... But I can confirm that UCB1 does work. My understanding of it is this:

Each node is a multi-armed bandit, and each action is an arm. Rewards are winning probabilities based on normalized evaluations back-propagated up the tree. UCB1 starts by exploring a lot, and it exploits more as nodes accrue visits.
The algorithm is:

Code: Select all

child_value / child_visits + sqrt(2 * log(parent_visits) / child_visits)

The exploration term is:

Code: Select all

sqrt(2 * log(parent_visits) / child_visits)

So when the child's visit count is low, UCB1 will likely try it.

The exploitation term is just the winning probability:

Code: Select all

child_value / child_visits

So when child (and parent) visit counts are high, UCB1 will pick the nodes with stronger winning probabilities.

During a search, nodes that have been visited very frequently (closer to the root) are mostly exploited, while nodes closer to the leaves still need to be explored. But I think the idea is that, given enough time, the search will eventually converge on what minimax would consider the PV line, with only a fraction of the nodes expanded.
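
Put together, my selection step looks roughly like this (a simplified sketch, not my exact code):

Code: Select all

#include <cmath>
#include <cstdint>
#include <vector>

struct Node {
    double   value  = 0.0;   // summed win probabilities
    uint64_t visits = 0;
    std::vector<Node> children;
};

Node* selectChild(Node& parent) {
    Node* best = nullptr;
    double bestScore = -1.0;
    for (Node& c : parent.children) {
        // An unvisited child has an infinite UCB1 score: try it first.
        if (c.visits == 0) return &c;
        const double score =
            c.value / c.visits
          + std::sqrt(2.0 * std::log((double)parent.visits) / c.visits);
        if (score > bestScore) { bestScore = score; best = &c; }
    }
    return best;
}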
Last edited by RedBedHed on Fri Sep 16, 2022 8:12 pm, edited 1 time in total.
>> Move Generator: Charon
>> Engine: Homura
void life() { playCapitalism(); return; live(); }
User avatar
RedBedHed
Posts: 84
Joined: Wed Aug 04, 2021 12:42 am
Full name: Ellie Moore

Re: Selection in Monte Carlo Alpha-Beta

Post by RedBedHed »

smatovic wrote: Fri Sep 16, 2022 5:14 pm AFAIK you can use MCTS-AB with UCT and UCB1 with win/draw/loss as the score, but you can also add a score-based heuristic function for position evaluation and then use PUCT:

https://www.chessprogramming.org/Christ ... Rosin#PUCT

then it would look more like a Best-First Minimax Search:

https://www.chessprogramming.org/Best-F ... max_Search

--
Srdja
Ah okay, so PUCT really is worth trying then... I just wish I understood it lol. I need to take some probability courses. I've been studying computer science for too long. :'( Do you happen to know where I can find lecture materials related to PUCT?
>> Move Generator: Charon
>> Engine: Homura
void life() { playCapitalism(); return; live(); }
smatovic
Posts: 3230
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Selection in Monte Carlo Alpha-Beta

Post by smatovic »

RedBedHed wrote: Fri Sep 16, 2022 7:56 pm Ah okay, so PUCT really is worth trying then... I just wish I understood it lol. I need to take some probability courses. I've been studying computer science for too long. :'( Do you happen to know where I can find lecture materials related to PUCT?
https://www.chessprogramming.org/Christ ... ite_note-4
https://www.chessprogramming.org/Christ ... ite_note-5
https://www.chessprogramming.org/Christ ... ite_note-6

You can take a look at the original paper, at DeepMind's AlphaZero paper, or at the implementation in Lc0, for example.

--
Srdja