It turns out there is a rollouts-based implementation of alpha-beta! In fact, any fixed-depth best-first search method can be formulated as a rollouts-based algorithm using something similar to the MT driver from MTD(f). https://www.aaai.org/ocs/index.php/AAAI ... /9828/9354
I had difficulty understanding the paper before I realized that the alpha-beta rollout formulation is for a fixed-depth tree. MCTS grows the tree dynamically, so it is hard to formulate bounds to be used for pruning, but for a fixed depth (in an iterative-deepening framework) it is possible to formulate a rollouts version that does alpha-beta pruning.
Others have extended Huang's idea to get a hybrid MCTS/alpha-beta algorithm where you choose MCTS over alpha-beta rollouts based on a probability p.
The question is whether A0 actually used these hybrids or plain MCTS.
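Here is a rough Python sketch of how I read the rollout formulation on a fixed-depth tree (my own simplification, not the paper's exact pseudocode — the full algorithm also propagates ancestor windows down the path, which I omit):

```python
import math

# Rollout view of alpha-beta on a fixed-depth tree (simplified sketch).
# Each node keeps bounds [lo, hi] on its minimax value; one "rollout"
# walks root -> leaf through children whose bound intervals still
# overlap the parent's, then tightens bounds back up the path.
# Repeating rollouts until the root's bounds meet yields the minimax
# value, and already-decided subtrees are never re-entered (pruning).

class Node:
    def __init__(self, children=None, value=None, is_max=True):
        self.children = children or []
        self.value = value          # leaf value; None for interior nodes
        self.is_max = is_max
        self.lo, self.hi = -math.inf, math.inf

def rollout(root):
    path, node = [root], root
    while node.children:
        # descend into the first child whose bound interval still
        # overlaps the current node's (leftmost-first, like alpha-beta)
        nxt = next(c for c in node.children
                   if max(c.lo, node.lo) < min(c.hi, node.hi))
        path.append(nxt)
        node = nxt
    node.lo = node.hi = node.value  # leaf: bounds collapse to the value
    for n in reversed(path[:-1]):   # back up tightened bounds
        if n.is_max:
            n.lo = max(c.lo for c in n.children)
            n.hi = max(c.hi for c in n.children)
        else:
            n.lo = min(c.lo for c in n.children)
            n.hi = min(c.hi for c in n.children)

def solve(root):
    while root.lo < root.hi:
        rollout(root)
    return root.lo                  # == root.hi == minimax value
```

On a small test tree (max root, two min children with leaves {3,5} and {2,9}) this returns 3 without ever touching the 9-leaf, the same cut classic alpha-beta makes.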
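The hybrid selection rule itself is tiny — a minimal sketch (the two simulation procedures are hypothetical placeholders, not code from those papers):

```python
import random

# Hybrid search sketch: each iteration flips a biased coin and runs
# either one MCTS simulation or one alpha-beta rollout on the same
# shared tree.  mcts_simulation and alphabeta_rollout are placeholder
# callables standing in for the two procedures being mixed.
def hybrid_search(root, iterations, p, mcts_simulation, alphabeta_rollout):
    for _ in range(iterations):
        if random.random() < p:       # with probability p: MCTS step
            mcts_simulation(root)
        else:                         # otherwise: alpha-beta rollout
            alphabeta_rollout(root)
```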
Daniel
Announcing lczero

Re: Announcing lczero
Daniel Shawul wrote:It turns out there is a rollouts-based implementation of alpha-beta! In fact, any fixed-depth best-first search method can be formulated as a rollouts-based algorithm using something similar to the MT driver from MTD(f). https://www.aaai.org/ocs/index.php/AAAI ... /9828/9354
I had difficulty understanding the paper before I realized that the alpha-beta rollout formulation is for a fixed-depth tree. MCTS grows the tree dynamically, so it is hard to formulate bounds to be used for pruning, but for a fixed depth (in an iterative-deepening framework) it is possible to formulate a rollouts version that does alpha-beta pruning.
Others have extended Huang's idea to get a hybrid MCTS/alpha-beta algorithm where you choose MCTS over alpha-beta rollouts based on a probability p.
The question is whether A0 actually used these hybrids or plain MCTS.
Daniel
I don't know if this helps... but I found this in the following book...
16.6.2 AlphaGo Zero
AlphaGo Zero implemented a form of policy iteration (Section 4.3), interleaving policy evaluation with policy improvement.
Figure 16.8 is an overview of AlphaGo Zero’s algorithm.
A significant difference between AlphaGo Zero and AlphaGo is that AlphaGo Zero used MCTS to select moves throughout self-play reinforcement learning, whereas AlphaGo used MCTS for live play after, but not during, learning.
Other differences are that AlphaGo Zero used only one deep convolutional ANN and used a simpler version of MCTS.
AlphaGo Zero’s MCTS was simpler than the version used by AlphaGo in that it did not include rollouts of complete games, and therefore did not need a rollout policy.
Each iteration of AlphaGo Zero’s MCTS ran a simulation that ended at a leaf node of the current search tree instead of at the terminal position of a complete game simulation.
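Roughly, the kind of leaf-terminated, network-guided simulation the book describes works like this (my own sketch, not the book's or DeepMind's code — `evaluate` is a placeholder for the network f_theta, `state.play` an assumed move-applying method, and c_puct a tunable exploration constant):

```python
import math

class TreeNode:
    def __init__(self, prior):
        self.prior = prior        # P(s, a) from the network
        self.visits = 0           # N(s, a)
        self.value_sum = 0.0      # W(s, a)
        self.children = {}        # move -> TreeNode

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def simulate(node, state, evaluate, c_puct=1.5):
    """One MCTS iteration: descend by PUCT, stop at a leaf of the
    current tree, expand it with the network's move priors, and back
    up the network's value v instead of playing out a full game."""
    if not node.children:                   # leaf of the current tree
        priors, v = evaluate(state)         # network: (move->prob, value)
        for move, p in priors.items():
            node.children[move] = TreeNode(p)
        node.visits += 1
        node.value_sum += v
        return v
    # PUCT selection: value estimate plus prior-weighted exploration
    total_n = sum(c.visits for c in node.children.values())
    def puct(item):
        move, c = item
        u = c_puct * c.prior * math.sqrt(total_n + 1) / (1 + c.visits)
        return c.q() + u
    move, child = max(node.children.items(), key=puct)
    # value flips sign each ply: the opponent moves at the child
    v = -simulate(child, state.play(move), evaluate, c_puct)
    node.visits += 1
    node.value_sum += v
    return v
```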
But as in AlphaGo, each iteration of MCTS in AlphaGo Zero was guided by the output of a deep convolutional network, labeled fθ in Figure 16.7, where θ is the network’s weight vector.
The input to the network, whose architecture we describe below, consisted of raw representations of board positions, and its output had two parts:
a scalar value, v, an estimate of the probability that the current player will win from the current board position,
and a vector, p, of move probabilities, one for each possible stone placement on the current board, plus the pass, or resign, move.
Instead of selecting self-play actions according to the probabilities p, however, AlphaGo Zero used these probabilities, together with the network’s value output, to direct each execution of MCTS, which returned new move probabilities, shown in Figure 16.7 as the policies πi.
These policies benefitted from the many simulations that MCTS conducted each time it executed.
The result was that the policy actually followed by AlphaGo Zero was an improvement over the policy given by the network’s outputs p.
Silver et al. (2017) wrote that “MCTS may therefore be viewed as a powerful policy improvement operator.”
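The "improvement" step amounts to reading move probabilities off the search's visit counts. A minimal sketch of the visit-count scheme Silver et al. describe, with a temperature tau (pi(a|s) proportional to N(s,a)^(1/tau)):

```python
# Turn MCTS visit counts N(s, a) into an improved policy pi(a|s).
# Low tau sharpens toward the most-visited move; tau = 1 is
# proportional to raw visit counts.
def improved_policy(visit_counts, tau=1.0):
    powered = {m: n ** (1.0 / tau) for m, n in visit_counts.items()}
    total = sum(powered.values())
    return {m: x / total for m, x in powered.items()}
```

For example, counts {"a": 3, "b": 1} give pi = {a: 0.75, b: 0.25} at tau = 1, and nearly all mass on "a" at small tau.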
Reinforcement Learning: An Introduction
Second edition, in progress ****Complete Draft****
November 5, 2017