Article: "How Alpha Zero Sees/Wins"
Posted: Wed Jan 17, 2018 4:50 pm
A new article discusses how AlphaZero thinks and calculates variations:
http://www.danamackenzie.com/blog/?p=5072
"So far I have looked at three games from the AlphaZero-Stockfish match: #5, #9, and #10 from the ten games provided in the arXiv preprint. All three are amazingly similar, and at the same time they are amazingly unlike almost any other game I’ve ever seen. In each case AlphaZero won by sacrificing a piece for compensation that didn’t fully emerge until at least 15 or 20 moves later."
"How does AlphaZero avoid the horizon effect? To evaluate a position, it simply plays hundreds of random games from that position. To you or me this may seem like a crazy idea, but actually it makes a certain amount of sense."
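The rollout idea described above can be illustrated with a toy sketch (this is a made-up number-line game, not chess and not AlphaZero's actual code): to score a position, play many random games from it to the end and average the results.

```python
import random

# Toy game for illustration: a state is an integer; each ply adds
# +1 or -1 at random; after the remaining plies, a positive total
# counts as a win. Purely a stand-in for a real game tree.

def random_rollout(state, plies_left):
    """Classic Monte Carlo evaluation: play random moves to the
    end of the game, return 1 for a win and 0 otherwise."""
    while plies_left > 0:
        state += random.choice([1, -1])
        plies_left -= 1
    return 1 if state > 0 else 0

def evaluate(state, rollouts=200, plies_left=10):
    """Average many random playouts to estimate the win rate
    of the given position."""
    wins = sum(random_rollout(state, plies_left) for _ in range(rollouts))
    return wins / rollouts

random.seed(1)
good = evaluate(3)    # a clearly better starting position...
bad = evaluate(-3)    # ...scores higher than a clearly worse one
```

The point of averaging is that a position's long-term strength shows up statistically even though each individual playout is noisy, which is why this sidesteps the horizon effect of a fixed-depth search.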
"Are you sure AlphaZero plays many random games out to the end? That is how Monte Carlo Tree Search Go bots used to work before AlphaGo, but I was under the impression AlphaZero doesn’t calculate all the way to the end of the game any more. Doesn’t it explore a range of plausible moves to a certain depth, and then evaluate the resulting position using its network?"
"While the first AlphaGo did indeed play out entire games – something that makes more sense in Go than in chess – the next version, AlphaGo Zero, did not. Instead, AlphaGo Zero can judge positions directly (the value network) and does not need to play out the entire game. AlphaZero is in turn based on AlphaGo Zero and also has a value network.
Besides the value network, the neural net has another output: the policy network, which decides which moves are most likely good. The "random playouts" AlphaZero uses are not completely random; they are guided by the scores given by the policy network. Within a single playout it does play stochastically – if the policy net says "this move is 1% good", it might occasionally play that 1% move. But out of 80,000 moves played, roughly 79,000 will be the move a grandmaster (well, AlphaZero) would prefer. So instead of writing "White usually wins this position when it's played by weaker players", it really is "White usually wins this position when it's played by grandmasters"."
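The policy-weighted sampling described in that comment can be sketched as follows (the moves and probabilities here are made up for illustration; this is not AlphaZero's actual interface):

```python
import random

def sample_move(policy):
    """Sample a move in proportion to its policy prior.

    `policy` maps moves to probabilities summing to 1 – think of it
    as the policy network's output for the current position.
    """
    moves = list(policy)
    weights = [policy[m] for m in moves]
    # random.choices picks in proportion to weight, so a move scored
    # 1% is chosen about 1% of the time, while the top-scored move
    # dominates the playouts.
    return random.choices(moves, weights=weights, k=1)[0]

# Toy policy for illustration (not real network output):
policy = {"e4": 0.79, "d4": 0.15, "Nf3": 0.05, "a4": 0.01}
counts = {m: 0 for m in policy}
random.seed(0)
for _ in range(10_000):
    counts[sample_move(policy)] += 1
# counts now reflects the priors: "e4" is picked roughly 79% of the
# time, "a4" only about 1% – random per move, grandmaster-like overall.
```

This is exactly the distinction the commenter draws: each individual choice is random, but the aggregate statistics are dominated by the moves the policy network rates highly.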
"Thanks for the in-depth explanation! In particular I appreciate the explanation of the terms “value network” and “policy network,” which I didn’t fully understand from the AlphaZero and AlphaGo papers."
Is this correct? Thx AR