syzygy wrote: They are not playing out any games "to the very end". And the randomness of the selection is also quite limited (if not completely absent - the paper states that the edge is selected that maximizes an "upper confidence bound").
Where do you read that? When I glanced through the paper, I got the impression that what they do is normal MCTS, with the caveat that 'random' in the move selection for the playouts does not mean 'homogeneously distributed probabilities', but probabilities according to the policy of their NN. In the absence of scoring at the game ends (e.g. declaring every game a draw) that would make the MCTS a self-fulfilling prophecy, generating a tree that has exactly the same statistics as the NN prediction. But the actual game results will steer the MCTS away from poor moves and make it focus on good moves. And the NN will then be tuned to already produce such focusing on good moves for the initial visiting probabilities, and so on.
So you get progressively more realistic 'random' playouts, which will define more and more narrow MCTS trees in every position of the test games.
There are no playouts. Did you even read the AGZ Nature paper???
Compared to the MCTS in AlphaGo Fan and AlphaGo Lee, the principal differences are that AlphaGo Zero does not use any rollouts; it uses a single neural network instead of separate policy and value networks; leaf nodes are always expanded, rather than using dynamic expansion; each search thread simply waits for the neural network evaluation, rather than performing evaluation and backup asynchronously; and there is no tree policy.
They don't do random playouts, which means that when you are at a leaf node you give the position to the NN to get a win probability, instead of doing a random playout to the very end of the game.
At the end of the game, the terminal position sT is scored according to the rules of the game to compute the game outcome z: −1 for a loss, 0 for a draw, and +1 for a win.
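The distinction being argued about can be put in a few lines of code. This is only an illustrative sketch (the function and the `net` stand-in are my own names, not anything from the paper): a terminal position is scored by the rules of the game, while a non-terminal leaf gets a single network evaluation instead of a random playout.

```python
def evaluate_leaf(position, is_terminal, result=None, net=lambda pos: 0.0):
    """Value of a leaf: rules of the game if terminal, otherwise the network."""
    if is_terminal:
        # Terminal positions are scored by the rules: -1 loss, 0 draw, +1 win.
        return {"loss": -1, "draw": 0, "win": +1}[result]
    # Non-terminal leaves get one network evaluation, not a rollout.
    return net(position)
```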
Here is a link to the Nature article (given at DeepMind site):
Maybe just that they obviously don't use their NN to score positions where the game has ended?
This. Maybe HG forgot that this paper isn't only about chess but about any board game; therefore terminal positions are scored according to the rules of the game. That quote was about the self-play games, which obviously need to be scored!
Strictly speaking, the Nature article is only about Go. But apart from that nitpicking you are right. The phrase quoted by HG clearly refers to the terminal positions of the self-play games, not to leaf nodes of the MCTS search. It is crucial to understand the difference: training was done by playing a huge number of self-play games ("iterations"), and at each position of those games MCTS was used to calculate move probabilities, which in turn were used to improve the NN. According to the section "Methods", an MCTS leaf node is reached after L "time-steps", which sounds like some fixed search depth. AlphaGo Zero (and thus also AlphaZero for chess) does not use Monte Carlo "playouts" or "rollouts". So even though they call their method MCTS-based, it is not like a standard MCTS, due to the complete lack of random playouts.
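The training-data flow described above can be sketched in a toy form. This is not DeepMind's code: `dummy_mcts`, the toy game, and calling every game a draw are all placeholder assumptions. The point is the shape of the data: each self-play game yields (position, pi, z) triples, where pi is the MCTS visit distribution at that position (the policy target) and z is the final game outcome (the value target for every position in that game).

```python
import random

def dummy_mcts(position):
    """Stand-in for the ~800-simulation search: a uniform visit distribution."""
    return {"a": 0.5, "b": 0.5}

def self_play_game(num_plies=10):
    """Play one toy self-play game and return (position, pi, z) triples."""
    history = []                      # (position, pi) pairs
    position = 0
    for _ in range(num_plies):
        pi = dummy_mcts(position)     # MCTS visit distribution = policy target
        history.append((position, pi))
        move = random.choice(list(pi))
        position = 2 * position + (1 if move == "b" else 0)
    z = 0  # toy game: every game is a draw; real games are scored by the rules
    # Every position in the game receives the final outcome as its value target
    # (per-side sign flips are omitted in this sketch).
    return [(pos, pi, z) for pos, pi in history]
```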
Yes.
I meant the AlphaZero paper, not the AlphaGoZero Nature paper. I guess we will find out more about how A0 works once they publish the entire paper.
Sven wrote: According to the section "Methods", an MCTS leaf node is reached after L "time-steps", which sounds like some fixed search depth.
As I understand it, they basically start with a tree consisting of just the root node. This node is at the same time the single leaf node and is expanded by running it through the NN. The move probabilities returned by the NN are assigned to the edges from the root node to the new leaf nodes. In the next step, one of these edges is chosen by applying the UCB1 strategy. We are then in a leaf node, which is expanded, etc. After every expansion, the move probabilities of the new leaf nodes are backed up to the root. So with 800 "simulations" (but you skip the simulations) you end up with a tree of 801 nodes. The most promising root move is then chosen and played.
One of the papers (probably the Nature paper) explains that the subtree for the move played is carried over to the next move, which obviously makes sense.
So there is no fixed search depth, but there is a fixed tree size (at least in self-play games for training).
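The tree-building procedure described above can be sketched as follows. This is a hedged toy implementation, not the real thing: `uniform_net` stands in for the policy/value network, the "game" is an artificial binary tree, the exploration constant is a guess, and the sign flips for the side to move are omitted. It does show the structure under discussion: start from a root-only tree, descend by an upper-confidence-bound rule, expand exactly one leaf per "simulation" with a single network call, and back the value up, with no random playout anywhere.

```python
import math

C_PUCT = 1.0  # exploration constant; an assumption, the papers tune this

class Node:
    def __init__(self, prior):
        self.prior = prior      # P(s, a) assigned by the network
        self.visits = 0         # N(s, a)
        self.value_sum = 0.0    # W(s, a)
        self.children = {}      # move -> Node

    def q(self):
        return self.value_sum / self.visits if self.visits else 0.0

def legal_moves(state):
    """Toy game tree: every state has moves 0 and 1."""
    return [0, 1]

def uniform_net(state):
    """Stand-in for the policy/value network: uniform priors, value 0."""
    moves = legal_moves(state)
    return {m: 1.0 / len(moves) for m in moves}, 0.0

def select_child(node):
    """Upper-confidence-bound (PUCT-style) selection: argmax of Q + U."""
    total = sum(c.visits for c in node.children.values())
    def puct(item):
        _, c = item
        return c.q() + C_PUCT * c.prior * math.sqrt(total) / (1 + c.visits)
    return max(node.children.items(), key=puct)

def search(root_state, num_simulations=800, net=uniform_net):
    root = Node(prior=1.0)
    for _ in range(num_simulations):
        node, state, path = root, root_state, [root]
        # Selection: descend by the UCB rule until an unexpanded leaf.
        while node.children:
            move, node = select_child(node)
            state = 2 * state + move
            path.append(node)
        # Expansion/evaluation: one network call, no random playout.
        priors, value = net(state)
        for move, p in priors.items():
            node.children[move] = Node(prior=p)
        # Backup: update every node on the path with the leaf value
        # (per-side sign flips are omitted in this toy sketch).
        for n in path:
            n.visits += 1
            n.value_sum += value
    return root

def best_move(root):
    """Play the most-visited root move."""
    return max(root.children.items(), key=lambda kv: kv[1].visits)[0]
```

With the subtree of the chosen move carried over to the next position, as the Nature paper describes, the statistics accumulated for that subtree are reused instead of rebuilt from scratch.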