To TPU or not to TPU...

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

to TPU or not to TPU?

Poll ended at Wed Jan 17, 2018 10:20 am

be patient and use TPUs via Frameworks: 3 votes (18%)
optimize now for current Hardware: 14 votes (82%)

Total votes: 17

Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: To TPU or not to TPU...

Post by Rémi Coulom »

I have a bit of experience with these algorithms and hardware. It's a really exciting technology to learn, and I encourage all chess programmers to try it now. You might not beat AlphaZero rapidly, but it is really refreshing to experience radically new algorithms. And AlphaZero still leaves a lot of room for engineering creativity, which is a lot less frustrating than the current state of computer chess, where huge efforts bring very little improvement.

You can approach programming neural networks at different levels:
- High level: use a deep-learning framework. In my opinion, reasonable options are limited to tensorflow and caffe2.
- Middle level: use cudnn (works on nvidia hardware only).
- Low level: write your own code.

I started by writing my own vectorized cpu code (because my commercial code has to run everywhere, and is mainly used on cell phones). But it is really not a good solution for training the neural network. GPUs can be orders of magnitude faster.

Because I like writing my own code, I made my own implementation of backpropagation (with the help of Fabien Letouzey). I tried OpenCL and CUDA. OpenCL is more open/portable in theory, but CUDA is considerably more convenient to use, and has much better tools and documentation. In practice, nvidia offers the best deep-learning hardware, and since it is competing against OpenCL with CUDA, it offers a crippled OpenCL driver for its cards. If you want to understand the architecture of a GPU and how it is programmed, I recommend the CUDA documentation.

Writing your own implementation is likely to be a waste of time if your objective is to get something that works efficiently. I think there is no chance you can beat CUDNN with your own implementation, especially since CUDNN is coded at a lower level than CUDA.

So, the reasonable choice is to use a high-level framework. I am using tensorflow. It can run on GPU, and can also produce CPU code. There is also "tensorflow lite" for cell phones, and the experimental "xla compiler" that can produce optimized CPU code. Neither tool is mature yet, but both should improve rapidly. Setting up tensorflow for use in C++ is complicated, but works nicely once you understand how it works.
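For the Python side, a minimal sketch in the TF 1.x API of the time (the tensor names and sizes here are made up for illustration); the same saved graph can then be loaded from the C++ API:

Code: Select all

import tensorflow as tf  # TF 1.x API, current when this was written

# A toy evaluation "network": 64 board features in, one win probability out.
board = tf.placeholder(tf.float32, [None, 64], name="board")
W = tf.Variable(tf.random_normal([64, 1]))
win_prob = tf.nn.sigmoid(tf.matmul(board, W), name="win_prob")

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    print(sess.run(win_prob, feed_dict={board: [[0.0] * 64]}))  # 0.5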

For HGM: the Fast Fourier Transform is profitable only for large convolutions. For 3x3 convolutions, it is not worth it. But there is another very nice trick, called the Winograd transform:
https://arxiv.org/abs/1509.09308
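To make the trick concrete: here is a minimal sketch of the 1-D building block F(2,3) from that paper, which computes two outputs of a 3-tap filter with 4 multiplications instead of 6 (the 2-D 3x3 version nests the same transforms over rows and columns; the numbers below are arbitrary test data):

Code: Select all

import numpy as np

# Winograd F(2,3): 2 outputs of a 3-tap filter with 4 multiplications
# instead of 6 (Lavin & Gray, arXiv:1509.09308).
def winograd_f23(d, g):
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.array([1.0, 2.0, 3.0, 4.0])   # 4 inputs
g = np.array([0.5, -1.0, 0.25])      # 3 filter taps
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
print(winograd_f23(d, g), direct)    # identical results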

The Volta has ordinary cores that work like a traditional GPU's. In addition, it has "tensor cores" that are specialized for deep-learning operations. I am not sure about the details, but my impression is that they are specialized for matrix-matrix multiplication. Convolution can be expressed as a matrix multiplication, and is often implemented this way:
https://petewarden.com/2015/04/20/why-g ... -learning/
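The usual name for this trick is im2col. A rough numpy illustration, with padding and strides omitted for brevity:

Code: Select all

import numpy as np

# im2col: unroll every 3x3 patch of the image into a row, then the whole
# convolution collapses into a single matrix product.
def im2col_conv(img, kernel):
    H, W = img.shape
    k = kernel.shape[0]                        # square kernel, no padding
    patches = np.array([img[i:i+k, j:j+k].ravel()
                        for i in range(H - k + 1)
                        for j in range(W - k + 1)])
    out = patches @ kernel.ravel()             # one big matrix product
    return out.reshape(H - k + 1, W - k + 1)

img = np.arange(36, dtype=float).reshape(6, 6)
kernel = np.ones((3, 3)) / 9.0                 # 3x3 box filter
print(im2col_conv(img, kernel))
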
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: To TPU or not to TPU...

Post by trulses »

I don't think that using tensorflow means it's not optimized for current consumer hardware. If you compile tensorflow yourself, you can get it optimized for your CPU (this includes SIMD instruction sets like AVX). You can also compile whatever CUDA code they have in there that isn't cuDNN to target the shader model/compute capability of your GPU. In any event, it only makes sense to start coding your own implementation if you've profiled your code and found a specific performance shortfall in that department; even then it might be better to alter their code rather than start from scratch.

If you want to code a full framework for practice or for your own satisfaction, that is a different matter altogether. It's a great exercise and I recommend it wholeheartedly; even if you only implement fully connected layers and one activation function, it's pretty fun.
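In that spirit, here is a minimal sketch of such an exercise: one fully connected layer with a ReLU activation, with the forward pass and the backward pass for plain SGD (all names and hyperparameters here are made up for illustration).

Code: Select all

import numpy as np

# One fully connected layer with a ReLU activation: forward pass and
# the gradients needed for plain SGD backprop.
class Dense:
    def __init__(self, n_in, n_out, lr=0.01):
        self.W = np.random.randn(n_in, n_out) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)
        self.lr = lr

    def forward(self, x):            # x: (batch, n_in)
        self.x = x
        self.z = x @ self.W + self.b
        return np.maximum(self.z, 0.0)           # ReLU

    def backward(self, grad_out):    # gradient of the loss w.r.t. our output
        grad_z = grad_out * (self.z > 0)         # ReLU derivative
        grad_x = grad_z @ self.W.T               # gradient for the layer below
        self.W -= self.lr * (self.x.T @ grad_z)  # SGD weight update
        self.b -= self.lr * grad_z.sum(axis=0)
        return grad_x
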
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: To TPU or not to TPU...

Post by Milos »

smatovic wrote:I guess I am not the only one who ponders about a chess NN implementation.

So what do you think: should programmers use frameworks/libraries like TensorFlow,
to make use of rising TPUs
(Nvidia Volta, Google Cloud TPU, special chips in smartphones, ASICs),
or should we write our own NN implementations, optimized for current consumer hardware
(CPU, AVX, GPU)?
I would prefer to use Google's TensorFlow TPU stack, but unfortunately it is proprietary.
So even if you made your own TPU as an ASIC or FPGA (there are quite a few startups actually doing that), you'd still have to do everything from scratch with OpenCL, writing your own kernel and driver.
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: To TPU or not to TPU...

Post by smatovic »

trulses wrote: I don't think that using tensorflow means it's not optimized for current consumer hardware.
Maybe I asked the wrong question:

Frameworks vs. own implementation
rising TPUs vs. current hardware
trulses wrote: If you want to code a full framework for practice or for your own satisfaction then this is a different matter altogether. It's a great exercise and I recommend it wholeheartedly, even if you only implement fully connected layers and one activation function it's pretty fun.
Fun seems to be a pretty good measure for a project :-)

--
Srdja
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: To TPU or not to TPU...

Post by smatovic »

I see, CUDA seems to be omnipresent in NN frameworks:

https://en.wikipedia.org/wiki/Compariso ... g_software

But what about Leela Zero? It looks like GCP uses OpenCL with his own implementation:

https://github.com/gcp/leela-zero

--
Srdja
User avatar
phhnguyen
Posts: 1434
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: To TPU or not to TPU...

Post by phhnguyen »

If your PC had a TPU, I guess you might run a NN a few times faster than on a normal PC with a good GPU card.

That is good, but not worth waiting/hoping for.

I think the key to AlphaZero's success is having 5,000 TPUs for training, not a single one.
Vinvin
Posts: 5228
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: To TPU or not to TPU...

Post by Vinvin »

Video about backpropagation : https://www.youtube.com/watch?v=q555kfIFUCM
Michael Sherwin
Posts: 3196
Joined: Fri May 26, 2006 3:00 am
Location: WY, USA
Full name: Michael Sherwin

Re: To TPU or not to TPU...

Post by Michael Sherwin »

phhnguyen wrote:If your PC had a TPU, I guess you might run a NN a few times faster than on a normal PC with a good GPU card.

That is good, but not worth waiting/hoping for.

I think the key to AlphaZero's success is having 5,000 TPUs for training, not a single one.
Yes! Thank you, someone who understands that most of the strength of AlphaZero is due to the learning! :D

Add similar learning for alpha-beta to any top engine and AlphaZero would be nicknamed Alpha Flop.
pilgrimdan
Posts: 405
Joined: Sat Jul 02, 2011 10:49 pm

Re: To TPU or not to TPU...

Post by pilgrimdan »

Vinvin wrote:Video about backpropagation : https://www.youtube.com/watch?v=q555kfIFUCM
thanks for the video...

wow ... that was an awful lot in 5 min...

need to go back to school and learn calculus...

this seems awfully time consuming...

how did AlphaZero do this for chess in 4 hours...
User avatar
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: To TPU or not to TPU...

Post by hgm »

Note that this video is not what I would call a clear explanation. For one, it is heavily geared towards an audience of mathematicians, using concepts they are familiar with, but which are actually not needed at all to understand what is going on, and which just make it complete mumbo-jumbo for the non-mathematician.

Apart from that, the way they finally adjust the weights seems highly stupid and inefficient...

The whole idea is actually quite simple, and only requires elementary-school arithmetic:

You have this huge network of connections that does an unfathomable calculation you would rather remain oblivious of. You present it with inputs (like a chess position), and it gives you outputs (like a winning probability for white). Now the outputs are not as you would like them (e.g. it predicts a win in a position from a game that it lost). So you want to improve it. How do you go about it?

Well, one by one you start tweaking the weight of every connection inside the NN a tiny bit, and rerun the network on the given input with this altered setting, to see how changing this single weight affects the outputs, and how much that changes the difference between what it gives you and what you would have wanted (the 'error'). After having done that for all weights in the entire network, you change these weights in the direction that reduces the error, in proportion to the effect that they had. So weights that had no effect are not changed at all, and weights that had a lot of effect are changed a lot.

This is the principle of minimal change. Weights that had no effect for the position at hand could be important for getting the correct output on other positions, so you don't want to mess with those if it is not needed, and not without using that other position to gauge the effect of the change. By changing the things that contributed most, you can make the largest reduction of the output error for a given amount of change to the weights. (The total weight change is measured as the sum of the squares of all the individual changes, to make sure that negative changes still count as increasing the total change.)

As a practical example: suppose you have 3 weights w1, w2 and w3, the output is 0.7 while the correct output would be 1.0, and increasing w1 by 0.01 increases the output to 0.71, increasing w2 by 0.01 decreases the output to 0.695, and increasing w3 by 0.01 increases the output to 0.701. Then you would increase w1 by 0.01 (because that went in the right direction), decrease w2 by 0.005 (because increasing it had only half as much effect as for w1, and in the wrong direction), and increase w3 by 0.001 (because it hardly had any effect). Or, if you want the NN to learn a bit faster, change the weights by +0.1, -0.05 and +0.01.
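In code, that recipe looks like this (a toy sketch; the "network" is a made-up stand-in chosen so the sensitivities match the numbers above):

Code: Select all

# Toy version of the scheme above: probe each weight with a tiny nudge,
# measure its effect on the output, then move all weights at once in
# proportion to their effect. The stand-in "network" reproduces the
# example numbers: sensitivities 1.0, -0.5 and 0.1.
def output(w):
    return 0.7 + 1.0 * w[0] - 0.5 * w[1] + 0.1 * w[2]

w = [0.0, 0.0, 0.0]
target, eps, rate = 1.0, 0.01, 0.01

base = output(w)
effects = []
for i in range(len(w)):
    w[i] += eps                               # tweak one weight a tiny bit
    effects.append((output(w) - base) / eps)  # effect on the output
    w[i] -= eps                               # undo the tweak

step = rate if target > base else -rate       # direction that reduces the error
for i in range(len(w)):
    w[i] += step * effects[i]

print(w)   # approximately [0.01, -0.005, 0.001], as in the example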

You might wonder why we bother to change w3 at all, since it mattered so little. Wouldn't it be better to leave it alone and only change w1? The point is that there can be very many weights that each contribute very little, but together contribute a lot. E.g. if there were 99 more weights like w3, changing all 100 of them by 0.01 would have as much effect as changing w1 by 0.1, for the same total weight change according to the sum-of-squares measure. So to make sure we don't miss out on the cooperative effect of many small changes, we change every weight in proportion to its effect on the output, to nudge the output in the wanted direction.

If you don't want to run into problems with inputs having conflicting requirements, which alternately spoil the settings for each other, it is better not to perform this process one input at a time, but on the average or total error for a large batch of inputs (e.g. chess positions sampled from many self-play games). That way you immediately find the best compromise, instead of oscillating between settings good for one input or the other.
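A sketch of this batched variant, continuing the toy finite-difference approach from above (the linear "network" and the made-up data are stand-ins, not anything from AlphaZero):

Code: Select all

import numpy as np

# Batched variant: measure each weight's effect on the batch-average
# error, then take one compromise step for the whole batch.
def net(w, X):                         # stand-in "network": linear layer
    return X @ w

X = np.random.randn(64, 3)             # a batch of 64 inputs
targets = X @ np.array([0.3, -0.2, 0.5])    # made-up ground truth
w = np.zeros(3)
eps, rate = 1e-4, 0.1

for _ in range(200):
    base = np.mean((targets - net(w, X)) ** 2)       # batch-average error
    effects = np.zeros_like(w)
    for i in range(len(w)):
        w[i] += eps                                  # tweak one weight
        effects[i] = (np.mean((targets - net(w, X)) ** 2) - base) / eps
        w[i] -= eps                                  # undo the tweak
    w -= rate * effects                 # one step against the batch error

print(w)                                # close to [0.3, -0.2, 0.5]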

That is really all the guy is saying, dressed up in an avalanche of totally unnecessary mathematical jargon.
pilgrimdan wrote:how did AlphaZero do this for chess in 4 hours...
By throwing large amounts of hardware at it. But note that adjusting the NN doesn't have to be done all that often. You train the NN on positions sampled from self-play games, and most of the work is playing those games. Therefore they used 5,000 generation-1 TPUs for playing the games, and only 64 generation-2 TPUs for training the NN on positions occurring in these games. That still makes it 64 times faster than if they had used a single gen-2 TPU, of course.