To TPU or not to TPU...

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

to TPU or not to TPU?

Poll ended at Wed Jan 17, 2018 10:20 am

be patient and use TPUs via Frameworks: 3 votes (18%)
optimize now for current Hardware: 14 votes (82%)

Total votes: 17

Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: To TPU or not to TPU...

Post by Rémi Coulom »

I have a bit of experience with these algorithms and hardware. It's a really exciting technology to learn, and I encourage all chess programmers to try it now. You might not beat AlphaZero rapidly, but it is really refreshing to experience radically new algorithms. And AlphaZero still leaves a lot of room for engineering creativity, which is a lot less frustrating than the current state of computer chess, where huge efforts bring very little improvement.

You can approach programming neural networks at different levels:
- High level: use a deep-learning framework. In my opinion, reasonable options are limited to tensorflow and caffe2.
- Middle level: use cudnn (works on nvidia hardware only).
- Low level: write your own code.

I started by writing my own vectorized cpu code (because my commercial code has to run everywhere, and is mainly used on cell phones). But it is really not a good solution for training the neural network. GPUs can be orders of magnitude faster.

Because I like writing my own code, I made my own implementation of backpropagation (with the help of Fabien Letouzey). I tried OpenCL and CUDA. OpenCL is more open/portable in theory, but CUDA is considerably more convenient to use, and has much better tools and documentation. In practice, nvidia offers the best deep-learning hardware, and since it is competing against OpenCL with CUDA, it offers a crippled OpenCL driver for its cards. If you want to understand the architecture of a GPU and how it is programmed, I recommend the CUDA documentation.

Writing your own implementation is likely to be a waste of time if your objective is to get something that works efficiently. I think there is no chance you can beat CUDNN with your own implementation, especially since CUDNN is coded at a lower level than CUDA.

So, the reasonable choice is to use a high-level framework. I am using tensorflow. It can run on GPU, and can also produce CPU code. There is also "tensorflow lite" for cell phones, and the experimental "xla compiler" that can produce optimized CPU code. Neither tool is mature yet, but both should improve rapidly. Setting up tensorflow for use in C++ is complicated, but works nicely once you understand how it works.
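For the Python side, a minimal sketch in the TF 1.x API of the time (the tensor names and sizes here are made up for illustration); the same saved graph can then be loaded from the C++ API:

Code: Select all

import tensorflow as tf  # TF 1.x API, current when this was written

# A toy evaluation "network": 64 board features in, one win probability out.
board = tf.placeholder(tf.float32, [None, 64], name="board")
W = tf.Variable(tf.random_normal([64, 1]))
win_prob = tf.nn.sigmoid(tf.matmul(board, W), name="win_prob")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(win_prob, feed_dict={board: [[0.0] * 64]}))  # 0.5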

For HGM: the Fast Fourier Transform is profitable only for large convolutions. For 3x3 convolutions, it is not worth it. But there is another very nice trick, called the Winograd transform:
https://arxiv.org/abs/1509.09308
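To make the trick concrete: here is a minimal sketch of the 1-D building block F(2,3) from that paper, which computes two outputs of a 3-tap filter with 4 multiplications instead of 6 (the 2-D 3x3 version nests the same transforms over rows and columns; the numbers below are arbitrary test data):

Code: Select all

import numpy as np

# Winograd F(2,3): 2 outputs of a 3-tap filter with 4 multiplications
# instead of 6 (Lavin & Gray, arXiv:1509.09308).
def winograd_f23(d, g):
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 inputs
g = np.array([0.5, -1.0, 0.25])      # 3 filter taps
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(winograd_f23(d, g), direct)    # identical results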

The Volta has ordinary cores that work like a traditional GPU's. In addition, it has "tensor cores" that are specialized for deep-learning operations. I am not sure about the details, but my impression is that they are specialized for matrix-matrix multiplication. Convolution can be expressed as a matrix multiplication, and is often implemented this way:
https://petewarden.com/2015/04/20/why-g ... -learning/
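The usual name for this trick is im2col. A rough numpy illustration, with padding and strides omitted for brevity:

Code: Select all

import numpy as np

# im2col: unroll every 3x3 patch of the image into a row, then the whole
# convolution collapses into a single matrix product.
def im2col_conv(img, kernel):
    H, W = img.shape
    k = kernel.shape[0]                        # square kernel, no padding
    patches = np.array([img[i:i+k, j:j+k].ravel()
                        for i in range(H - k + 1)
                        for j in range(W - k + 1)])
    out = patches @ kernel.ravel()             # one big matrix product
    return out.reshape(H - k + 1, W - k + 1)

img = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0                 # 3x3 box filter
print(im2col_conv(img, kernel))
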
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: To TPU or not to TPU...

Post by trulses »

I don't think that using tensorflow means it's not optimized for current consumer hardware. If you compile tensorflow yourself, you can get it optimized for your CPU (this includes SIMD instruction sets like AVX). You can also compile whatever CUDA code they have in there that isn't cuDNN to target the shader model/compute capability of your GPU. In any event, it only makes sense to start coding your own implementation if you've profiled your code and found a specific performance shortfall in that department; even then it might be better to alter their code rather than start from scratch.

If you want to code a full framework for practice or for your own satisfaction, that is a different matter altogether. It's a great exercise and I recommend it wholeheartedly; even if you only implement fully connected layers and one activation function, it's pretty fun.
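In that spirit, here is a minimal sketch of such an exercise: one fully connected layer with a ReLU activation, with the forward pass and the backward pass for plain SGD (all names and hyperparameters here are made up for illustration).

Code: Select all

import numpy as np

# One fully connected layer with a ReLU activation: forward pass and
# the gradients needed for plain SGD backprop.
class Dense:
    def __init__(self, n_in, n_out, lr=0.01):
        self.W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):            # x: (batch, n_in)
        self.x = x
        self.z = x @ self.W + self.b
        return np.maximum(self.z, 0.0)           # ReLU

    def backward(self, grad_out):    # gradient of the loss w.r.t. our output
        grad_z = grad_out * (self.z > 0)         # ReLU derivative
        grad_x = grad_z @ self.W.T               # gradient for the layer below
        self.W -= self.lr * (self.x.T @ grad_z)  # SGD weight update
        self.b -= self.lr * grad_z.sum(axis=0)
        return grad_x
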
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: To TPU or not to TPU...

Post by Milos »

smatovic wrote:I guess I am not the only one who ponders about a chess NN implementation.

So what do you think: should programmers use frameworks/libraries like TensorFlow,
to make use of rising TPUs
(Nvidia Volta, Google Cloud TPU, special chips in smartphones, ASICs),
or should we write our own NN implementations, optimized for current consumer hardware
(CPU, AVX, GPU)?
I would prefer to use Google's TensorFlow TPU stack, but unfortunately it is proprietary.
So even if you made your own TPU as an ASIC or FPGA (there are quite a few startups actually doing that), you'd still have to do everything from scratch with OpenCL, writing your own kernel and driver.
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: To TPU or not to TPU...

Post by smatovic »

trulses wrote: I don't think that using tensorflow means it's not optimized for current consumer hardware.
Maybe I asked the wrong question:

Frameworks vs. own implementation
rising TPUs vs. current hardware
trulses wrote: If you want to code a full framework for practice or for your own satisfaction then this is a different matter altogether. It's a great exercise and I recommend it wholeheartedly, even if you only implement fully connected layers and one activation function it's pretty fun.
Fun seems to be a pretty good measure for a project :-)

--
Srdja
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: To TPU or not to TPU...

Post by smatovic »

I see, CUDA seems to be omnipresent in NN frameworks:

https://en.wikipedia.org/wiki/Compariso ... g_software

But what about Leela Zero? It looks like GCP uses OpenCL with his own implementation:

https://github.com/gcp/leela-zero

--
Srdja
User avatar
phhnguyen
Posts: 1434
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: To TPU or not to TPU...

Post by phhnguyen »

If your PC had a TPU, I guess you might run a NN a few times faster than on a normal PC with a good GPU card.

That is good, but not worth waiting/hoping for.

I think the key to AlphaZero's success is having 5,000 TPUs for training, not a single one.
Vinvin
Posts: 5228
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: To TPU or not to TPU...

Post by Vinvin »

Video about backpropagation : https://www.youtube.com/watch?v=q555kfIFUCM
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: To TPU or not to TPU...

Post by Michael Sherwin »

phhnguyen wrote:If your PC had a TPU, I guess you might run a NN a few times faster than on a normal PC with a good GPU card.

That is good, but not worth waiting/hoping for.

I think the key to AlphaZero's success is having 5,000 TPUs for training, not a single one.
Yes! Thank you, someone who understands that most of the strength of AlphaZero is due to the learning! :D

Add similar learning for alpha-beta to any top engine and AlphaZero would be nicknamed Alpha Flop.
pilgrimdan
Posts: 405
Joined: Sat Jul 02, 2011 10:49 pm

Re: To TPU or not to TPU...

Post by pilgrimdan »

Vinvin wrote:Video about backpropagation : https://www.youtube.com/watch?v=q555kfIFUCM
thanks for the video...

wow ... that was an awful lot in 5 min...

need to go back to school and learn calculus...

this seems awfully time consuming...

how did AlphaZero do this for chess in 4 hours...
User avatar
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: To TPU or not to TPU...

Post by hgm »

Note that this video is not what I would call a clear explanation. For one, it is heavily geared towards an audience of mathematicians, using concepts they are familiar with, but which are actually not needed at all to understand what is going on, and which just make it complete mumbo-jumbo for the non-mathematician.

Apart from that, the way they finally adjust the weights seems highly stupid and inefficient...

The whole idea is actually quite simple, and only requires elementary-school arithmetic:

You have this huge network of connections that does an unfathomable calculation you would rather remain oblivious of. You present it with inputs (like a chess position), and it gives you outputs (like a winning probability for white). Now the outputs are not as you would like them (e.g. it predicts a win in a position from a game that it lost). So you want to improve it. How do you go about it?

Well, one by one you start tweaking the weight of every connection inside the NN a tiny bit, and rerun the network on the given input with this altered setting, to see how changing this single weight affects the outputs, and how much that changes the difference between what it gives you and what you would have wanted (the 'error'). After having done that for all weights in the entire network, you change these weights in the direction that reduces the error, in proportion to the effect that they had. So weights that had no effect are not changed at all, and weights that had a lot of effect are changed a lot.

This is the principle of minimal change. Weights that had no effect for the position at hand could be important for getting the correct output on other positions, so you don't want to mess with those if it is not needed, and not without using that other position to gauge the effect of the change. By changing the things that contributed most, you can make the largest reduction of the output error for a given amount of change to the weights. (The total weight change is measured as the sum of the squares of all the individual changes, to make sure that negative changes still count as increasing the total change.)

As a practical example: suppose you have 3 weights w1, w2 and w3, the output is 0.7 while the correct output would be 1.0, and increasing w1 by 0.01 increases the output to 0.71, increasing w2 by 0.01 decreases the output to 0.695, and increasing w3 by 0.01 increases the output to 0.701. Then you would increase w1 by 0.01 (because that went in the right direction), decrease w2 by 0.005 (because increasing it had only half as much effect as for w1, and in the wrong direction), and increase w3 by 0.001 (because it hardly had any effect). Or, if you want the NN to learn a bit faster, change the weights by +0.1, -0.05 and +0.01.
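In code, that recipe looks like this (a toy sketch; the "network" is a made-up stand-in chosen so the sensitivities match the numbers above):

Code: Select all

# Toy version of the scheme above: probe each weight with a tiny nudge,
# measure its effect on the output, then move all weights at once in
# proportion to their effect. The stand-in "network" reproduces the
# example numbers: sensitivities 1.0, -0.5 and 0.1.
def output(w):
    return 0.7 + 1.0 * w[0] - 0.5 * w[1] + 0.1 * w[2]

w = [0.0, 0.0, 0.0]
target, eps, rate = 1.0, 0.01, 0.01

base = output(w)
effects = []
for i in range(len(w)):
    w[i] += eps                               # tweak one weight a tiny bit
    effects.append((output(w) - base) / eps)  # effect on the output
    w[i] -= eps                               # undo the tweak

step = rate if target > base else -rate       # direction that reduces the error
for i in range(len(w)):
    w[i] += step * effects[i]

print(w)   # approximately [0.01, -0.005, 0.001], as in the example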

You might wonder why we bother to change w3 at all, since it mattered so little. Wouldn't it be better to leave it alone and only change w1? The point is that there can be very many weights that each contribute very little, but together contribute a lot. E.g. if there were 99 more weights like w3, changing all 100 of them by 0.01 would have as much effect as changing w1 by 0.1, for the same total weight change according to the sum-of-squares measure. So to make sure we don't miss out on the cooperative effect of many small changes, we change every weight in proportion to its effect on the output, to nudge the output in the wanted direction.

If you don't want to run into problems with inputs having conflicting requirements, which alternately spoil the settings for each other, it is better not to perform this process one input at a time, but on the average or total error for a large batch of inputs (e.g. chess positions sampled from many self-play games). That way you immediately find the best compromise, instead of oscillating between settings good for one input or the other.
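A sketch of this batched variant, continuing the toy finite-difference approach from above (the linear "network" and the made-up data are stand-ins, not anything from AlphaZero):

Code: Select all

import numpy as np

# Batched variant: measure each weight's effect on the batch-average
# error, then take one compromise step for the whole batch.
def net(w, X):                         # stand-in "network": linear layer
    return X @ w

X = np.random.randn(64, 3)             # a batch of 64 inputs
targets = X @ np.array([0.3, -0.2, 0.5])    # made-up ground truth
w = np.zeros(3)
eps, rate = 1e-4, 0.1

for _ in range(200):
    base = np.mean((targets - net(w, X)) ** 2)       # batch-average error
    effects = np.zeros_like(w)
    for i in range(len(w)):
        w[i] += eps                                  # tweak one weight
        effects[i] = (np.mean((targets - net(w, X)) ** 2) - base) / eps
        w[i] -= eps                                  # undo the tweak
    w -= rate * effects                 # one step against the batch error

print(w)                                # close to [0.3, -0.2, 0.5]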

That is really all the guy is saying, dressed up in an avalanche of totally unnecessary mathematical jargon.
pilgrimdan wrote:how did AlphaZero do this for chess in 4 hours...
By throwing large amounts of hardware at it. But note that adjusting the NN doesn't have to be done all that often. You train the NN on positions sampled from self-play games, and most of the work is playing those games. Therefore they used 5,000 generation-1 TPUs for playing the games, and only 64 generation-2 TPUs for training the NN on positions occurring in these games. That still makes it 64 times faster than if they had used a single gen-2 TPU, of course.