To TPU or not to TPU...

Discussion of chess software programming and technical issues.

Moderators: hgm, Dann Corbit, Harvey Williamson


to TPU or not to TPU?

Poll ended at Wed Jan 17, 2018 9:20 am

be patient and use TPUs via Frameworks: 3 votes (18%)
optimize now for current Hardware: 14 votes (82%)
Total votes: 17

Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 1:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: To TPU or not to TPU...

Post by Michael Sherwin » Wed Dec 20, 2017 11:05 pm

phhnguyen wrote:If your PC has a TPU, I guess you could run a NN a few times faster than on a normal PC with a good GPU card.

It is good, but not worth waiting / hoping for.

I think the key to AlphaZero's success is having 5000 TPUs for training, not a single one.
Yes! Thank you--someone who understands that most of the strength of Alpha Zero is due to the learning! :D

Add similar learning for alpha-beta to any top engine and Alpha Zero would be nicknamed Alpha Flop.
If you are on a sidewalk and the covid goes beep beep
Just step aside or you might have a bit of heat
Covid covid runs through the town all day
Can the people ever change their ways
Sherwin the covid's after you
Sherwin if it catches you you're through

pilgrimdan
Posts: 405
Joined: Sat Jul 02, 2011 8:49 pm

Re: To TPU or not to TPU...

Post by pilgrimdan » Thu Dec 21, 2017 9:10 am

Vinvin wrote:Video about backpropagation : https://www.youtube.com/watch?v=q555kfIFUCM
thanks for the video...

wow ... that was an awful lot in 5 min...

need to go back to school and learn calculus...

this seems awfully time consuming...

how did alphazero do this for chess in 4 hours...

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: To TPU or not to TPU...

Post by hgm » Thu Dec 21, 2017 9:59 am

Note that this video is not what I would call a clear explanation. For one, it is heavily geared towards an audience of mathematicians, using concepts they are familiar with, but which are actually not needed at all to understand what is going on, and which just make it complete mumbo-jumbo for the non-mathematician.

Apart from that, the way they finally seem to adjust the weights seems highly stupid and inefficient...

The whole idea is actually quite simple, and only requires elementary-school arithmetic:

You have this huge network of connections, that does an unfathomable calculation that you would rather remain oblivious of. You present it inputs (like a Chess position), and it gives you outputs (like a winning probability for white). Now the outputs are not as you would like them (e.g. it predicts a win in a position of a game that it lost). So you want to improve it. How do you go about it?

Well, one by one you start tweaking the weight of every connection inside the NN a tiny bit, and rerun the network on the given input with this altered setting, to see how changing this single weight affects the outputs, and how much that changes the difference between what it gives you and what you would have wanted (the 'error'). After having done that for all weights in the entire network, you change these weights in the direction that reduces the error, in proportion with the effect that they had. So weights that had no effect are not changed at all, and weights that had a lot of effect are changed a lot.

This is the principle of minimal change. Weights that had no effect for the position at hand could be important for getting the correct output on other positions, so you don't want to mess with those if it is not needed, and without using those other positions to gauge the effect of the change. By changing the things that contributed most, you can make the largest reduction of the output error for a given amount of change to the weights. (The total weight change being measured as the sum of the squares of all the individual changes, to make sure that negative changes still count as increasing the total change.)

As a practical example: when you have 3 weights w1, w2 and w3, and the output is 0.7, while the correct output would be 1.0, and increasing w1 by 0.01 increases the output to 0.71, increasing w2 by 0.01 decreases the output to 0.695, and increasing w3 by 0.01 would increase the output to 0.701, you would increase w1 by 0.01 (because that went in the right direction), decrease w2 by 0.005 (because increasing it had only half as much effect as for w1, and in the wrong direction), and increase w3 by 0.001 (because it hardly had any effect). Or, if you want the NN to learn a bit faster, change the weights by +0.1, -0.05 and +0.01.
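In code, the whole recipe is only a few lines. Here is a toy Python sketch (nothing from the thread itself; `toy_network` is a made-up stand-in whose output and sensitivities are chosen to match the numbers in the example above):

```python
def finite_difference_update(weights, run_network, target, h=0.01, eta=1.0):
    """Tweak each weight a tiny bit, measure the effect on the output,
    then nudge every weight in proportion to that effect, in the
    direction that reduces the error."""
    base = run_network(weights)
    error = target - base                          # 1.0 - 0.7 = 0.3 here
    updated = []
    for i, w in enumerate(weights):
        tweaked = list(weights)
        tweaked[i] += h                            # tweak one weight
        slope = (run_network(tweaked) - base) / h  # its effect on the output
        updated.append(w + eta * slope * error)    # proportional nudge
    return updated

# Stand-in "network": output 0.7 at weights (0, 0, 0), with the same
# sensitivities as the example (1.0 for w1, -0.5 for w2, 0.1 for w3).
def toy_network(w):
    return 0.7 + 1.0 * w[0] - 0.5 * w[1] + 0.1 * w[2]
```

With eta = 1/30 this reproduces the example's updates of +0.01, -0.005 and +0.001; a larger eta simply scales all three nudges up, which is the "learn a bit faster" variant.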

You might wonder why we bother to change w3 at all, since it mattered so little. Wouldn't it be better to leave it alone, and only change w1? The point is that there can be very many weights that each contribute very little, but together contribute a lot. E.g. if there were 9 more weights like w3, changing all 10 of them by 0.01 would have as much effect on the output as changing w1 alone by 0.01. So to make sure we don't miss out on the cooperative effect of many small changes, we change every weight in proportion to its effect on the output, to nudge the output in the wanted direction; of all combinations of changes with the same sum-of-squares total, this proportional one moves the output the most.
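The cooperative effect is easy to check with the toy numbers (slopes as in the example: 1.0 for w1 and 0.1 for each of ten w3-like weights; each weight's update is proportional to its slope):

```python
slopes = [1.0] + [0.1] * 10             # w1 plus ten weights like w3
eta = 0.01
deltas = [eta * s for s in slopes]      # proportional update per weight

# Output change contributed by each weight is slope * delta.
from_w1 = slopes[0] * deltas[0]                                  # 0.01
from_small = sum(s * d for s, d in zip(slopes[1:], deltas[1:]))  # 0.001
total_cost = sum(d * d for d in deltas)   # sum-of-squares measure
```

The ten small weights together add a 10% bonus on top of w1's contribution, at only a 10% increase in the sum-of-squares cost, so ignoring them would throw that cheap extra correction away.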

If you don't want to run into problems with inputs having conflicting requirements, which alternately spoil the settings for each other, it is better not to perform this process one particular input at a time, but on the average or total error for a large batch of inputs (e.g. chess positions sampled from many self-play games), so that you immediately find the best compromise, instead of oscillating between settings good for one or the other.
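The batch variant is the same sketch with the nudges accumulated over many (inputs, target) pairs before any weight actually moves (again toy Python; the function names are made up for illustration):

```python
def batch_update(weights, run_network, batch, h=0.01, eta=1.0):
    """Accumulate each weight's nudge over a whole batch of
    (inputs, target) pairs, then apply the average, so that inputs
    with conflicting requirements settle on a compromise instead of
    dragging the weights back and forth."""
    nudges = [0.0] * len(weights)
    for inputs, target in batch:
        base = run_network(weights, inputs)
        error = target - base
        for i in range(len(weights)):
            tweaked = list(weights)
            tweaked[i] += h
            slope = (run_network(tweaked, inputs) - base) / h
            nudges[i] += slope * error            # this input's pull
    return [w + eta * n / len(batch) for w, n in zip(weights, nudges)]
```

For example, a two-weight linear "network" trained on two conflicting examples moves straight to the compromise in one step instead of oscillating between the settings each example would want on its own.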

That is really all the guy is saying, dressed up in an avalanche of totally unnecessary mathematical jargon.
pilgrimdan wrote:how did alphazero do this for chess in 4 hours...
By throwing large amounts of hardware at it. But note that adjustment of the NN doesn't have to be done all that often. You train the NN on positions sampled from self-play games, and most of the work is playing those games. Therefore they used 5,000 generation-1 TPUs for playing games, and only 64 generation-2 TPUs for training the NN based on positions occurring in these games. That still makes it 64 times faster than if they had used a single gen-2 TPU, of course.

pilgrimdan
Posts: 405
Joined: Sat Jul 02, 2011 8:49 pm

Re: To TPU or not to TPU...

Post by pilgrimdan » Thu Dec 21, 2017 12:06 pm

hgm wrote:[...] Therefore they used 5,000 generation-1 TPUs for playing games, and only 64 generation-2 TPUs for training the NN based on positions occurring in these games.
5,000 processing units ... okay ... how much does one of these cost ...

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: To TPU or not to TPU...

Post by hgm » Thu Dec 21, 2017 12:20 pm

They are made by Google for private use, and not sold. They are not more complex or expensive to produce than x86 CPU cores, but it doesn't seem there would be enough market for them to make development of something similar by other companies economically attractive.

I don't know how far you would get when trying to implement something similar in a FPGA. As multipliers are very specific components, I am afraid that building them out of the more general logic elements offered by FPGAs involves a large overhead, strongly reducing the number of them you could cram on a chip. (Or their speed.)

pilgrimdan
Posts: 405
Joined: Sat Jul 02, 2011 8:49 pm

Re: To TPU or not to TPU...

Post by pilgrimdan » Thu Dec 21, 2017 5:14 pm

hgm wrote:They are made by Google for private use, and not sold. [...]
okay .. thanks ..

syzygy
Posts: 5000
Joined: Tue Feb 28, 2012 10:56 pm

Re: To TPU or not to TPU...

Post by syzygy » Thu Dec 21, 2017 7:03 pm

hgm wrote:They are made by Google for private use, and not sold.
But they can be rented on Google's Cloud platform:
https://cloud.google.com/tpu/
https://www.blog.google/topics/google-c ... -learning/

hgm
Posts: 26134
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: To TPU or not to TPU...

Post by hgm » Thu Dec 21, 2017 8:43 pm

I thought that was only for the generation-2 TPUs.

phhnguyen
Posts: 939
Joined: Wed Apr 21, 2010 2:58 am
Location: Australia
Full name: Nguyen Hong Pham
Contact:

Re: To TPU or not to TPU...

Post by phhnguyen » Thu Dec 21, 2017 9:11 pm

pilgrimdan wrote: 5,000 processing units ... okay ... how much does one of these cost ...
25 million USD for the whole system of 5,000 TPUs, according to the AlphaGo Zero article.

Google does not sell them. But I think a guy with over 25M USD who is willing to spend it could still buy one ;)

syzygy
Posts: 5000
Joined: Tue Feb 28, 2012 10:56 pm

Re: To TPU or not to TPU...

Post by syzygy » Thu Dec 21, 2017 9:22 pm

hgm wrote:I thought that was only for the generation-2 TPUs.
You might be right. Their "Cloud TPUs" seem to be 2nd generation TPUs.
