Ozymandias wrote: The training phase... didn't it consist of 44 million games or something like that? If that's the case, I don't see how they could be played in just four hours.
Just read the paper:
https://arxiv.org/pdf/1712.01815.pdf
The paper reports 9 hours for 44 million self-play games, corresponding to 700,000 training batches of 4096 positions (so about 65 positions per game, which seems reasonable); that works out to a bit less than 20 million games in the first 4 hours. Each position corresponded to an MCTS with 800 "simulations"/NN evaluations.
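For anyone who wants to check those figures, here's the arithmetic spelled out (all numbers are from the paper, as quoted above):

```python
# Back-of-envelope check of the AlphaZero training numbers.
games = 44_000_000        # self-play games in 9 hours
batches = 700_000         # training batches
batch_size = 4096         # positions per batch

positions = batches * batch_size
positions_per_game = positions / games       # ≈ 65
games_in_4_hours = games * 4 / 9             # ≈ 19.6 million

print(positions_per_game)   # ≈ 65.2
print(games_in_4_hours)     # ≈ 19,555,556
```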
At the 4-hour mark (300,000 training batches), AlphaZero became stronger than Stockfish; see Figure 1. The network seems to have reached saturation before then. This could be improved upon by using a bigger network (which would then need still more training, but with Google's resources that would just be a matter of weeks).
Figure 1 was created from the results of a tournament between various iterations of AlphaZero and Stockfish as a base player. This tournament was played at 1 second per move.
Since AlphaZero is a bit stronger than Stockfish at 1 second per move and, apparently, scales better than Stockfish, it is no surprise that AlphaZero beats Stockfish handily at 1 minute per move.
Is 700,000 x 4096 searches in 9 hours possible? Let's see: they used 5000 TPUs, so each TPU had to do 573,440 searches in 9 hours, which is 17.7 positions per second, or 56.5 ms per MCTS. According to the paper, each 800-simulation MCTS took 40 ms.
So there was about 2.5 hours of slack left per TPU!
But not really: they also needed time to process each batch to adjust the weights. The paper doesn't say how much time that took per batch, but I currently have no reason to doubt that those roughly 2.5 hours sufficed.
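The per-TPU budget and the leftover slack can be redone the same way (5000 TPUs and the 40 ms figure are taken from the post/paper above):

```python
# Per-TPU throughput check for the 9-hour training run.
searches = 700_000 * 4096      # total MCTS searches (one per training position)
tpus = 5000
seconds = 9 * 3600

per_tpu = searches / tpus                    # 573,440 searches per TPU
ms_budget = seconds / per_tpu * 1000         # ≈ 56.5 ms available per search
ms_mcts = 40                                 # measured MCTS time from the paper
slack_hours = per_tpu * (ms_budget - ms_mcts) / 1000 / 3600  # ≈ 2.6 h

print(per_tpu, ms_budget, slack_hours)
```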
This is all based on the paper, not on speculation.