You seem to forget that the "innovation" is mostly the neural network they use. There isn't anything special about MCTS itself, but apparently using a neural network + reinforcement learning instead of random playouts works quite well.

syzygy wrote:
Yes, it seems the randomness refers just to the random playouts. But without any randomness left, I don't see what the M and the C are still doing in AlphaZero's MCTS.

AlvaroBegue wrote:
Randomness has never been a requirement in the tree part of the search.

syzygy wrote:
[...] And the randomness of the selection is also quite limited (if not completely absent - the paper states that the edge is selected that maximizes an "upper confidence bound").
The "move probabilities" term in the paper initially suggested to me that the selection of the next node followed those probabilities. Apparently it does not, so why they are called probabilities is still rather unclear to me. (Probably it has something to do with the "confidence" in UCB.)

The earliest description I remember of this type of algorithm is MoGo's UCT search, which picked the move that maximized a formula called UCB1 ("UCB" stands for "Upper Confidence Bound", just like in the recent paper). Actually, I imagine most implementations don't have any random element in the tree part of the playout.
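For reference, here is a minimal sketch of UCB1-style selection (names and data layout are my own, not from any of the papers). Note that the rule is fully deterministic: the only thing resembling randomness is that unvisited children are forced to be tried first.

```python
import math

def ucb1(value_sum, visits, parent_visits, c=1.4):
    """UCB1 score: mean value (exploitation) plus an exploration bonus
    that shrinks as a child accumulates visits."""
    if visits == 0:
        return float("inf")  # always try unvisited children first
    mean = value_sum / visits
    bonus = c * math.sqrt(math.log(parent_visits) / visits)
    return mean + bonus

def select(children, parent_visits):
    # Deterministically pick the child maximizing the upper confidence bound.
    return max(children, key=lambda ch: ucb1(ch["w"], ch["n"], parent_visits))
```

A rarely-visited child with a mediocre mean can still be selected because its exploration bonus is large; that is where the "upper confidence bound" name comes from.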
To me it feels more like the tree expansion of proof-number search or of book-building algorithms. The innovation in MCTS, at least to me, lay solely in the random playouts (which should not work very well for chess).

AlphaGo initially combined the results of playouts with the evaluation at the leaf of the tree (that was probably its biggest innovation), and then AlphaGo Zero and AlphaZero do away with the playouts altogether, using only an evaluation function. Still, the whole mechanism for growing the tree and for backing up values is the same, so the MCTS label feels like it fits.
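The shared backup mechanism being described can be sketched as follows (a toy illustration of my own; the `Node` layout is hypothetical, and whether the leaf value comes from a random playout or a neural-network evaluation makes no difference to this step):

```python
class Node:
    def __init__(self):
        self.children = {}    # move -> Node
        self.visits = 0
        self.value_sum = 0.0

def backup(path, leaf_value):
    """Propagate the leaf evaluation up the selected path, flipping the
    sign at each ply (a win for one side is a loss for the other)."""
    v = leaf_value
    for node in reversed(path):
        node.visits += 1
        node.value_sum += v
        v = -v
```

Only the source of `leaf_value` changed between classic MCTS and AlphaZero; the tree growth and this backup stay the same, which is why the label arguably still fits.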
Standard MCTS doesn't really work for chess because random playouts aren't any good. The move probabilities (or whatever you want to call them) from the NN provide a kind of "soft pruning": completely stupid moves get a very small probability and therefore won't be visited very often, unlike standard MCTS, which may visit those nodes far more often.
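The soft-pruning effect can be seen in a PUCT-style selection term of the kind the AlphaZero paper describes (this is a simplified sketch with a made-up constant, not the paper's exact formula): the NN prior multiplies the exploration bonus, so a low-prior move starts with almost no incentive to be visited.

```python
import math

def puct_score(q, prior, child_visits, parent_visits, c_puct=1.5):
    """Mean value Q plus an exploration bonus scaled by the NN prior:
    moves the network considers stupid get a tiny bonus and are
    therefore rarely visited (soft pruning)."""
    u = c_puct * prior * math.sqrt(parent_visits) / (1 + child_visits)
    return q + u

# Two unvisited moves with equal Q: the prior alone decides which
# one gets explored first.
good = puct_score(q=0.0, prior=0.4, child_visits=0, parent_visits=100)
bad = puct_score(q=0.0, prior=0.01, child_visits=0, parent_visits=100)
```

Here `good` is 40x larger than `bad`, so the low-prior move is effectively pruned without ever being hard-excluded: if its Q turns out to be high after a few visits, it can still overtake.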