Note that this video is not what I would call a clear explanation. For one, it is heavily geared towards an audience of mathematicians, using concepts they are familiar with, but which are actually not needed at all to understand what is going on, and which just make it complete mumbo-jumbo for the non-mathematician.
Apart from that, the way they finally seem to adjust the weights is highly stupid and inefficient...
The whole idea is actually quite simple, and only requires elementary-school arithmetic:
You have this huge network of connections that performs an unfathomable calculation you would rather remain oblivious of. You present it with inputs (like a Chess position), and it gives you outputs (like a winning probability for white). Now the outputs are not as you would like them to be (e.g. it predicts a win in a position from a game that it lost). So you want to improve it. How do you go about it?
Well, one by one you start tweaking the weight of every connection inside the NN a tiny bit, and rerun the network on the given input with this altered setting, to see how changing this single weight affects the outputs, and how much that changes the difference between what it gives you and what you would have wanted (the 'error'). After having done that for all weights in the entire network, you change these weights in the direction that reduces the error, in proportion to the effect that they had. So weights that had no effect are not changed at all, and weights that had a lot of effect are changed a lot.
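In code, this tweak-and-rerun procedure could look something like the following minimal sketch. All names here are hypothetical; `network(weights, inp)` just stands in for whatever the NN computes, assumed to produce a single output:

```python
def finite_difference_update(weights, network, inp, target, step=0.01, rate=1.0):
    """Nudge every weight in proportion to its measured effect on the error."""
    base = network(weights, inp)          # output with the current weights
    updated = list(weights)
    for i in range(len(weights)):
        tweaked = list(weights)
        tweaked[i] += step                # tweak one weight a tiny bit
        effect = network(tweaked, inp) - base  # how much the output moved
        # Move the weight in the direction that reduces the error,
        # in proportion to the effect it had.
        updated[i] += rate * (target - base) * effect / step
    return updated
```

With a small enough `rate` this nudges the output toward the target; it is in effect a finite-difference approximation of one step of gradient descent.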
This is the principle of minimal change. Weights that had no effect for the position at hand could be important for getting the correct output on other positions, so you don't want to mess with those when it is not needed, and not without using those other positions to gauge the effect of the change. By changing the things that contributed most, you get the largest reduction of the output error for a given amount of change to the weights. (The total weight change is measured as the sum of the squares of all the individual changes, to make sure that negative changes still count as increasing the total change.)
As a practical example: suppose you have 3 weights w1, w2 and w3, and the output is 0.7, while the correct output would be 1.0. Increasing w1 by 0.01 raises the output to 0.71, increasing w2 by 0.01 lowers it to 0.695, and increasing w3 by 0.01 raises it to 0.701. You would then increase w1 by 0.01 (because that went in the right direction), decrease w2 by 0.005 (because increasing it had only half as much effect as w1, and in the wrong direction), and increase w3 by 0.001 (because it hardly had any effect). Or, if you want the NN to learn a bit faster, change the weights by +0.1, -0.05 and +0.01.
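The arithmetic of this example can be written out as a tiny script (a `scale` of 1 reproduces the first set of changes; 10 would give the faster-learning variant):

```python
output, target = 0.7, 1.0     # the NN says 0.7, we wanted 1.0
step = 0.01                   # how much each weight was tweaked
after = {'w1': 0.71, 'w2': 0.695, 'w3': 0.701}  # output after each tweak
scale = 1                     # 10 would give the faster-learning variant

changes = {}
for name, out in after.items():
    effect = out - output     # how the tweak moved the output
    # The output must go up (target > output), so each weight is changed
    # in the direction of its effect, in proportion to that effect.
    changes[name] = scale * effect
# changes comes out as approximately {'w1': +0.01, 'w2': -0.005, 'w3': +0.001}
```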
You might wonder why we bother to change w3 at all, since it mattered so little. Wouldn't it be better to leave it alone, and only change w1? The point is that there can be very many weights that each contribute very little, but together contribute a lot. E.g. if there were 9 more weights like w3, changing all 10 of those by 0.01 would have as much effect as changing w1 alone by 0.01, and spreading the update over all weights in proportion to their effects keeps the total weight change small by the sum-of-squares measure. So to make sure we don't miss out on the cooperative effect of many small changes, we change every weight in proportion to its effect on the output, to nudge the output in the wanted direction.
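A quick sanity check of this arithmetic, using the sensitivities of the example above (0.01 of output change per 0.01 of weight change for w1, 0.001 for the w3-like weights). It also illustrates the sum-of-squares point: spreading the update over all weights in proportion to their effects reaches the same output change at a smaller sum-of-squares cost than changing w1 alone.

```python
# Ten weights whose tweaks each move the output by only 0.001 together
# move it as much as w1, whose tweak moved it by 0.01.
effect_w1 = 0.01
effects_small = [0.001] * 10
assert abs(sum(effects_small) - effect_w1) < 1e-12

# Spreading the update over all eleven weights in proportion to their
# effects moves the output by the same 0.01 as changing w1 alone by 0.01,
# at a smaller sum-of-squares cost.
sens = [1.0] + [0.1] * 10            # output change per unit weight change
c = 0.01 / sum(s * s for s in sens)  # scale so the output moves by 0.01
update = [c * s for s in sens]       # change each weight proportionally
sos_spread = sum(d * d for d in update)
sos_w1_only = 0.01 ** 2              # changing only w1 by 0.01
assert sos_spread < sos_w1_only
```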
If you don't want to run into problems with inputs having conflicting requirements, which alternately spoil the settings for each other, it is better not to perform this process one particular input at a time, but on the average or total error for a large batch of inputs (e.g. chess positions sampled from many self-play games), so that you immediately find the best compromise, instead of oscillating between settings that are good for one input or the other.
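A minimal sketch of this batched variant (again, all names hypothetical; `network(weights, inp)` stands in for the NN, assumed to have a single output):

```python
def batch_update(weights, network, batch, step=0.01, rate=1.0):
    """One update step against the average error over a batch of (input, target) pairs."""
    updated = list(weights)
    for i in range(len(weights)):
        tweaked = list(weights)
        tweaked[i] += step            # tweak one weight a tiny bit
        total = 0.0
        for inp, target in batch:
            base = network(weights, inp)
            effect = network(tweaked, inp) - base  # effect on this input
            total += (target - base) * effect      # weighted by this input's error
        # Average over the batch, so conflicting inputs settle on a
        # compromise instead of tugging the weight back and forth.
        updated[i] += rate * total / (len(batch) * step)
    return updated
```

E.g. two training pairs that demand outputs 1.0 and 0.0 for the same input would pull single-input updates back and forth forever, while the batched update heads straight for the compromise of 0.5.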
That is really all the guy is saying, dressed up in an avalanche of totally unnecessary mathematical jargon.
pilgrimdan wrote: how did alphazero do this for chess in 4 hours...
By throwing large amounts of hardware at it. But note that adjustment of the NN doesn't have to be done all that often. You train the NN on positions sampled from self-play games, and most of the work is playing those games. Therefore they used 5,000 generation-1 TPUs for playing games, and only 64 generation-2 TPUs for training the NN on positions occurring in these games. That still makes it 64 times faster than if they had used a single gen-2 TPU, of course.