Announcing lczero

Daniel Shawul
Posts: 3757
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Announcing lczero

Post by Daniel Shawul » Sun Jan 21, 2018 8:37 pm

It turns out there is a rollouts-based implementation of alpha-beta! In fact, any fixed-depth best-first search method can be formulated as a rollouts-based algorithm using something similar to the MT driver from MTD(f). https://www.aaai.org/ocs/index.php/AAAI ... /9828/9354

I had difficulty understanding the paper until I realized that the alpha-beta rollout formulation is for a fixed-depth tree. MCTS grows the tree dynamically, so it is hard to formulate bounds to use for pruning, but for a fixed depth (in an iterative-deepening framework) it is possible to formulate a rollouts version that does alpha-beta pruning.
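To make the idea concrete, here is a minimal Python sketch of rollouts on a fixed-depth tree with alpha-beta-style bound backup. This is my own simplified illustration of the bound-propagation idea, not Huang's exact selection rule (his paper derives a child-selection policy from (alpha, beta) windows that reproduces alpha-beta pruning exactly); the `Node` class and the "pick the open child with the best bound" heuristic are assumptions for the sketch.

```python
import math

class Node:
    """Fixed-depth game-tree node; empty children list means leaf."""
    def __init__(self, children=None, value=None):
        self.children = children or []
        self.value = value                       # static eval for leaves
        self.lo, self.hi = -math.inf, math.inf   # proven value bounds

def rollout(node, maximizing=True):
    """One rollout: descend to an unresolved leaf, then back up bounds."""
    if not node.children:
        node.lo = node.hi = node.value           # leaf: bounds collapse
        return
    # only children whose bounds are still open can change the result
    open_kids = [c for c in node.children if c.lo < c.hi]
    # heuristic: follow the child that could still move this node's bound
    child = (max(open_kids, key=lambda c: c.hi) if maximizing
             else min(open_kids, key=lambda c: c.lo))
    rollout(child, not maximizing)
    # back up minimax bounds over all children
    if maximizing:
        node.lo = max(c.lo for c in node.children)
        node.hi = max(c.hi for c in node.children)
    else:
        node.lo = min(c.lo for c in node.children)
        node.hi = min(c.hi for c in node.children)

def solve(root):
    """Repeat rollouts until the root's bounds meet (the minimax value)."""
    while root.lo < root.hi:
        rollout(root)
    return root.lo

# tiny example: root is max over two min nodes, value = max(min(3,5), min(2,9)) = 3
A = Node([Node(value=3), Node(value=5)])
B = Node([Node(value=2), Node(value=9)])
root = Node([A, B])
print(solve(root))  # 3
```

Note that on this example the search resolves the root in three rollouts and never evaluates the 9-leaf, which is the pruning effect the paper gets from rollouts alone.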

Other people have extended Huang's idea to get a hybrid MCTS-alpha-beta algorithm where you choose MCTS over alpha-beta rollouts based on a probability p.

The question is whether A0 actually used these hybrids or plain MCTS.

Daniel

pilgrimdan
Posts: 402
Joined: Sat Jul 02, 2011 8:49 pm

Re: Announcing lczero

Post by pilgrimdan » Mon Jan 22, 2018 8:22 am

Daniel Shawul wrote:It turns out there is a rollouts-based implementation of alpha-beta! In fact, any fixed-depth best-first search method can be formulated as a rollouts-based algorithm using something similar to the MT driver from MTD(f). https://www.aaai.org/ocs/index.php/AAAI ... /9828/9354

I had difficulty understanding the paper until I realized that the alpha-beta rollout formulation is for a fixed-depth tree. MCTS grows the tree dynamically, so it is hard to formulate bounds to use for pruning, but for a fixed depth (in an iterative-deepening framework) it is possible to formulate a rollouts version that does alpha-beta pruning.

Other people have extended Huang's idea to get a hybrid MCTS-alpha-beta algorithm where you choose MCTS over alpha-beta rollouts based on a probability p.

The question is whether A0 actually used these hybrids or plain MCTS.

Daniel
Daniel

I don't know if this helps...

but I found this in the following book...

16.6.2 AlphaGo Zero

AlphaGo Zero implemented a form of policy iteration (Section 4.3), interleaving policy evaluation with policy improvement.
Figure 16.8 is an overview of AlphaGo Zero’s algorithm.
A significant difference between AlphaGo Zero and AlphaGo is that AlphaGo Zero used MCTS to select moves throughout self-play reinforcement learning, whereas AlphaGo used MCTS for live play after—but not during—learning.
Other differences are that AlphaGo Zero used only one deep convolutional ANN and used a simpler version of MCTS.
AlphaGo Zero’s MCTS was simpler than the version used by AlphaGo in that it did not include rollouts of complete games, and therefore did not need a rollout policy.
Each iteration of AlphaGo Zero’s MCTS ran a simulation that ended at a leaf node of the current search tree instead of at the terminal position of a complete game simulation.
But as in AlphaGo, each iteration of MCTS in AlphaGo Zero was guided by the output of a deep convolutional network, labeled fθ in Figure 16.7, where θ is the network’s weight vector.
The input to the network, whose architecture we describe below, consisted of raw representations of board positions, and its output had two parts:
a scalar value, v, an estimate of the probability that the current player will win from the current board position,
and a vector, p, of move probabilities, one for each possible stone placement on the current board, plus the pass, or resign, move.
Instead of selecting self-play actions according to the probabilities p, however, AlphaGo Zero used these probabilities, together with the network’s value output, to direct each execution of MCTS, which returned new move probabilities, shown in Figure 16.7 as the policies πi.
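(For reference, the rule AlphaGo Zero used to direct each MCTS simulation down the tree is a PUCT-style formula: pick the action maximizing Q(s,a) + c_puct * P(s,a) * sqrt(sum_b N(s,b)) / (1 + N(s,a)), where P comes from the network's policy output. A minimal sketch; the value of c_puct here is illustrative, not the one used in the paper:)

```python
import math

def puct_select(q, n, p, c_puct=1.5):
    """Return the index of the action maximizing Q + U, where
    U = c_puct * P * sqrt(total visits) / (1 + N).
    q, n, p: per-action value estimates, visit counts, and network priors."""
    total = sum(n)
    scores = [qi + c_puct * pi * math.sqrt(total) / (1 + ni)
              for qi, ni, pi in zip(q, n, p)]
    return scores.index(max(scores))

# a high-prior, lightly visited move can outscore a well-visited one
print(puct_select(q=[0.5, 0.4], n=[10, 2], p=[0.3, 0.7]))  # 1
```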
These policies benefitted from the many simulations that MCTS conducted each time it executed.
The result was that the policy actually followed by AlphaGo Zero was an improvement over the policy given by the network’s outputs p.
Silver et al. (2017) wrote that “MCTS may therefore be viewed as a powerful policy improvement operator.”
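(The improved policy pi that MCTS returns is derived from the visit counts at the root with a temperature parameter tau: pi(a) is proportional to N(a)^(1/tau). A minimal sketch of that step:)

```python
def mcts_policy(visit_counts, tau=1.0):
    """Improved policy from MCTS root visit counts:
    pi(a) proportional to N(a)**(1/tau).
    tau = 1 plays proportionally to visits; tau -> 0 approaches greedy play."""
    weights = [n ** (1.0 / tau) for n in visit_counts]
    total = sum(weights)
    return [w / total for w in weights]

print(mcts_policy([90, 5, 5]))  # [0.9, 0.05, 0.05]
```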

Reinforcement Learning: An Introduction
Second edition, in progress (complete draft)
November 5, 2017
