To TPU or not to TPU...

Michael Sherwin · Post by **Michael Sherwin** » Thu Dec 21, 2017 12:05 am

phhnguyen wrote:If your PC has a TPU, I guess you could run a NN a few times faster than on a normal PC with a good GPU card.

It is good, but not worth waiting or hoping for.

I think the key to AlphaZero's success is having 5000 TPUs for training, not a single one.

Yes! Thank you--someone that understands that most of the strength of Alpha Zero is due to the learning!

Add similar learning for alpha-beta to any top engine and Alpha Zero would be nicknamed Alpha Flop.

pilgrimdan · Post by **pilgrimdan** » Thu Dec 21, 2017 10:10 am

Vinvin wrote:Video about backpropagation : https://www.youtube.com/watch?v=q555kfIFUCM

thanks for the video...

wow ... that was an awful lot in 5 min...

need to go back to school and learn calculus...

this seems awfully time consuming...

how did alphazero do this for chess in 4 hours...

hgm · Post by **hgm** » Thu Dec 21, 2017 10:59 am

Note that this video is not what I would call a clear explanation. For one, it is heavily geared towards an audience of mathematicians, using concepts they are familiar with, but which are actually not needed at all to understand what is going on, and which would just make it complete mumbo-jumbo for the non-mathematician.

Apart from that, the way they finally seem to adjust the weights seems highly stupid and inefficient...

The whole idea is actually quite simple, and only requires elementary-school arithmetic:

You have this huge network of connections, which does an unfathomable calculation that you would rather remain oblivious of. You present it inputs (like a Chess position), and it gives you outputs (like a winning probability for white). Now the outputs are not as you would like them (e.g. it predicts a win in a position of a game that it lost). So you want to improve it. How do you go about it?

Well, one by one you start tweaking the weight of every connection inside the NN a tiny bit, and rerun the network on the given input with this altered setting, to see how changing this single weight affects the outputs, and how much that changes the difference between what it gives you and what you would have wanted (the 'error'). After having done that for all weights in the entire network, you change these weights in the direction that reduces the error, in proportion to the effect that they had. So weights that had no effect are not changed at all, and weights that had a lot of effect are changed a lot.

This is the principle of minimal change. Weights that had no effect for the position at hand could be important for getting the correct output on other positions, so you don't want to mess with those if it is not needed and without using that other position to gauge the effect of the change. By changing the things that contributed most, you can make the largest reduction of the output error for a given amount of change to the weights. (The total weight change is measured as the sum of the squares of all the individual changes, to make sure that negative changes are still counted as increasing the total change.)
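The probing procedure described above can be sketched in a few lines of Python. This is only a minimal illustration: the one-layer `forward` function, the squared-error measure, and the `eps` and `rate` values are assumptions made for the example, not anything taken from AlphaZero. Real training gets the same per-weight effects far more cheaply through backpropagation, instead of rerunning the network once per weight.

```python
import math

def forward(weights, inputs):
    """Toy 'network' (an assumption): a weighted sum squashed into (0, 1)."""
    s = sum(w * x for w, x in zip(weights, inputs))
    return 1.0 / (1.0 + math.exp(-s))

def error(weights, inputs, target):
    """Squared difference between what we got and what we wanted."""
    return (forward(weights, inputs) - target) ** 2

def nudge_weights(weights, inputs, target, eps=1e-4, rate=0.5):
    """Probe every weight once, then change all of them in proportion to their effect."""
    base = error(weights, inputs, target)
    effects = []
    for i in range(len(weights)):
        probe = list(weights)
        probe[i] += eps                      # tweak a single weight a tiny bit
        effects.append((error(probe, inputs, target) - base) / eps)
    # Move each weight in the direction that reduces the error
    return [w - rate * e for w, e in zip(weights, effects)]

weights = [0.2, -0.4, 0.1]
inputs = [1.0, 0.5, 0.0]     # third input is 0, so w3 cannot affect this output
target = 1.0

before = error(weights, inputs, target)
new_weights = nudge_weights(weights, inputs, target)
after = error(new_weights, inputs, target)
```

Here the third input is 0, so w3 has no effect on this position and is left untouched, while the error after the update is smaller than before.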

As a practical example: when you have 3 weights w1, w2 and w3, and the output is 0.7, while the correct output would be 1.0, and increasing w1 by 0.01 increases the output to 0.71, increasing w2 by 0.01 decreases the output to 0.695, and increasing w3 by 0.01 would increase the output to 0.701, you would increase w1 by 0.01 (because that went in the right direction), decrease w2 by 0.005 (because increasing it had only half as much effect as for w1, and in the wrong direction), and increase w3 by 0.001 (because it hardly had any effect). Or, if you want the NN to learn a bit faster, change the weights by +0.1, -0.05 and +0.01.
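The arithmetic of this example can be checked with a short Python sketch. The variable names and the `updates` helper are made up for illustration; the numbers are the ones quoted above, and the results are rounded to hide floating-point noise.

```python
base_output = 0.7
target = 1.0
probe_step = 0.01
probed_outputs = {"w1": 0.71, "w2": 0.695, "w3": 0.701}

# Effect of each weight: change in output per unit change of the weight
effects = {name: (out - base_output) / probe_step
           for name, out in probed_outputs.items()}

# The output is too low, so we want to push it up (direction = +1)
direction = 1.0 if target > base_output else -1.0

def updates(learning_rate):
    """Weight changes proportional to each weight's effect (hypothetical helper)."""
    return {name: round(direction * learning_rate * e, 6)
            for name, e in effects.items()}

slow = updates(0.01)   # {'w1': 0.01, 'w2': -0.005, 'w3': 0.001}
fast = updates(0.1)    # {'w1': 0.1, 'w2': -0.05, 'w3': 0.01}
```

The only difference between the two sets of changes is the proportionality factor, which plays the role of a learning rate.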

You might wonder why we bother to change w3 at all, since it mattered so little. Wouldn't it be better to leave it alone, and only change w1? The point is that there can be very many weights that each contribute very little, but together contribute a lot. E.g. if there were 999 more like w3, changing all 1000 of those by 0.001 would have as much effect as changing w1 by 0.1, while the total weight change according to the sum-of-squares measure would be ten times smaller. So to make sure we don't miss out on the cooperative effect of many small changes, we change every weight in proportion to its effect on the output, to nudge the output in the wanted direction.
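A quick way to check this kind of trade-off is to compute both the effect on the output and the sum-of-squares cost directly. The sketch below assumes the sensitivities from the worked example (1.0 for w1 and 0.1 for each weight like w3); the count of 1000 weak weights and the step sizes are illustrative choices, and `effect_and_cost` is a hypothetical helper.

```python
def effect_and_cost(sensitivity, step, count=1):
    """Output change and sum-of-squares weight change for `count` equal weights."""
    effect = count * sensitivity * step
    cost = count * step ** 2
    return effect, cost

# One strong weight (like w1) changed by 0.1
strong_effect, strong_cost = effect_and_cost(sensitivity=1.0, step=0.1)

# 1000 weak weights (like w3) each changed by 0.001
weak_effect, weak_cost = effect_and_cost(sensitivity=0.1, step=0.001, count=1000)
```

Both choices move the output by about 0.1, but the many small changes cost about 0.001 in the sum-of-squares measure against 0.01 for the single large one.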

If you don't want to run into problems with inputs having conflicting requirements, which alternately spoil the settings for each other, it is better not to perform this process one particular input at a time, but on the average or total error for a large batch of inputs (e.g. chess positions sampled from many self-play games), so that you immediately find the best compromise instead of oscillating between settings good for one or the other.
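The difference between updating per input and updating on a batch can be seen in a toy case. The sketch below assumes a single-weight model and a "batch" of two identical inputs with opposite desired outputs (say, the same position from one won and one lost game); the model, step size and rate are all assumptions for illustration.

```python
import math

def forward(w, x):
    """Toy one-weight model (an assumption), output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-w * x))

def batch_error(w, batch):
    """Average squared error over all (input, target) pairs."""
    return sum((forward(w, x) - t) ** 2 for x, t in batch) / len(batch)

def step(w, batch, eps=1e-4, rate=1.0):
    """Probe the weight and move it against its effect on the batch error."""
    effect = (batch_error(w + eps, batch) - batch_error(w, batch)) / eps
    return w - rate * effect

# The same input seen twice with conflicting results (one win, one loss)
batch = [(1.0, 1.0), (1.0, 0.0)]

# Batch training: every step uses the average error of both samples
w_batch = 1.0
for _ in range(500):
    w_batch = step(w_batch, batch)

# One-at-a-time training: the two samples take turns pulling the weight
w_single = 1.0
trace = []
for i in range(6):
    w_single = step(w_single, [batch[i % 2]])
    trace.append(w_single)
```

With the batch, the weight settles where the output is close to 0.5, the best compromise between the two conflicting targets. Trained one sample at a time, the values in `trace` zigzag up and down as each sample undoes part of what the other did.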

That is really all the guy is saying, dressed up in an avalanche of totally unnecessary mathematical jargon.

pilgrimdan wrote:how did alphazero do this for chess in 4 hours...

By throwing large amounts of hardware at it. But note that adjustment of the NN doesn't have to be done all that often. You train the NN on positions sampled from self-play games, and most of the work is playing those games. Therefore they used 5,000 generation-1 TPUs for playing games, and only 64 generation-2 TPUs for training the NN based on positions occurring in these games. That still makes it 64 times faster than if they had used a single gen-2 TPU, of course.

pilgrimdan · Post by **pilgrimdan** » Thu Dec 21, 2017 1:06 pm

hgm wrote:Therefore they used 5,000 generation-1 TPUs for playing games, and only 64 generation-2 TPUs for training the NN based on positions occurring in these games.

5,000 processing units ... okay ... how much does one of these cost ...

hgm · Post by **hgm** » Thu Dec 21, 2017 1:20 pm

They are made by Google for private use, and not sold. They are not more complex or expensive to produce than x86 CPU cores, but it doesn't seem there would be enough market for them to make development of something similar by other companies economically attractive.

I don't know how far you would get when trying to implement something similar in an FPGA. As multipliers are very specific components, I am afraid that building them out of the more general logic elements offered by FPGAs involves a large overhead, strongly reducing the number of them you could cram on a chip. (Or their speed.)

pilgrimdan · Post by **pilgrimdan** » Thu Dec 21, 2017 6:14 pm

hgm wrote:They are made by Google for private use, and not sold. They are not more complex or expensive to produce than x86 CPU cores, but it doesn't seem there would be enough market for them to make development of something similar by other companies economically attractive.

I don't know how far you would get when trying to implement something similar in an FPGA. As multipliers are very specific components, I am afraid that building them out of the more general logic elements offered by FPGAs involves a large overhead, strongly reducing the number of them you could cram on a chip. (Or their speed.)

okay .. thanks ..

syzygy · Post by **syzygy** » Thu Dec 21, 2017 8:03 pm

hgm wrote:They are made by Google for private use, and not sold.

But they can be rented on Google's Cloud platform:
https://cloud.google.com/tpu/
https://www.blog.google/topics/google-c ... -learning/

hgm · Post by **hgm** » Thu Dec 21, 2017 9:43 pm

I thought that was only for the generation-2 TPUs.

phhnguyen · Post by **phhnguyen** » Thu Dec 21, 2017 10:11 pm

pilgrimdan wrote: 5,000 processing units ... okay ... how much does one of these cost ...

25 million USD for the whole system of 5000 TPUs, according to the AlphaGo Zero article.

Google does not sell them. But I think someone with over 25m USD who is willing to spend it could still buy them.

syzygy · Post by **syzygy** » Thu Dec 21, 2017 10:22 pm

hgm wrote:I thought that was only for the generation-2 TPUs.

You might be right. Their "Cloud TPUs" seem to be 2nd generation TPUs.
