To TPU or not to TPU...

to TPU or not to TPU?

Poll ended at Wed Jan 17, 2018 10:20 am

be patient and use TPUs via Frameworks: 3 votes (18%)
optimize now for current Hardware: 14 votes (82%)

Total votes: 17

smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

To TPU or not to TPU...

Post by smatovic »

I guess I am not the only one pondering chess NN implementations.

So what do you think: should programmers use frameworks/libraries like TensorFlow
to make use of the rising TPUs
(Nvidia Volta, Google Cloud TPU, special chips in smartphones, ASICs),
or should we write our own NN implementations, optimized for current consumer hardware
(CPU, AVX, GPU)?

--
Srdja
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: To TPU or not to TPU...

Post by hgm »

I am largely ignorant about GPU architecture. I understand the main computational task of image rendering (filling triangular planes of pixels with a color or depth gradient), and I can see how a very wide SIMD architecture would be helpful for that. But even in SIMD, one of the operands of each operation usually has to be freshly read from memory (or stored there), so the 'MD' part makes a large demand on memory bandwidth.

From the descriptions I have seen, I understand TPUs are different. They save not only on instruction-fetch bandwidth but also on data-fetch bandwidth, by performing multiple operations on the same (vector) data, so that (in the most extreme case) you would only have to fetch 2 x 256 operands in order to do 256 x 256 multiplications + additions. This is tailored to automatically calculating a convolution of the data vectors, rather than just the inner product (as other SIMD architectures usually do).
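
A toy numpy illustration of that data reuse (just the arithmetic, not how a TPU is actually programmed): from two freshly fetched 256-element vectors you get a full 256 x 256 grid of multiply-accumulates without touching memory again.

```python
import numpy as np

# Two freshly fetched operand vectors: 2 x 256 values loaded from memory.
a = np.random.rand(256)
b = np.random.rand(256)

# The outer product performs 256 x 256 = 65,536 multiplications on just
# those 512 operands: exactly the data reuse a systolic matrix unit exploits.
acc = np.zeros((256, 256))
acc += np.outer(a, b)  # one multiply-accumulate per output cell
```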

I was a bit surprised by this, because the most efficient way to calculate convolutions is usually by Fast Fourier Transform. Perhaps the assumption here is that the filter kernels of neural networks are typically so small that direct shift-multiply-add is competitive.
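
That trade-off is easy to check with a quick sketch in scipy (fftconvolve is the FFT route): both methods give the same answer, but for a length-3 kernel the direct shift-multiply-add is usually faster, and the O(n log n) advantage of the FFT only shows up for much larger kernels.

```python
import numpy as np
from scipy.signal import convolve, fftconvolve

signal = np.random.rand(1024)
kernel = np.random.rand(3)          # tiny kernel, as in typical conv layers

direct = convolve(signal, kernel)   # direct shift-multiply-add
viafft = fftconvolve(signal, kernel)  # FFT-based convolution
assert np.allclose(direct, viafft)  # identical result either way
```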

Is the Nvidia Volta really such a convolving TPU, or just a powerful GPU? I would love to have TPU capability in my PC, even if I had to build it myself from an FPGA (if those are up to the task).
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: To TPU or not to TPU...

Post by smatovic »

hgm wrote: Is the Nvidia Volta really such a convolving TPU, or just a powerful GPU? I would love to have TPU capability in my PC, even if I had to build it myself from an FPGA (if those are up to the task).
Tensor Cores with mixed-precision matrix math: pages 48 and 50,

http://on-demand.gputechconf.com/gtc/20 ... -volta.pdf

SM Layout on page 24.
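
For reference, the operation those slides describe is a small matrix multiply-accumulate, D = A x B + C over 4x4 tiles, with half-precision inputs and single-precision accumulation. The arithmetic contract in numpy terms (a sketch of the math, not of the CUDA API):

```python
import numpy as np

# FP16 inputs, FP32 accumulator: the Volta tensor core contract per 4x4 tile.
A = np.random.rand(4, 4).astype(np.float16)
B = np.random.rand(4, 4).astype(np.float16)
C = np.zeros((4, 4), dtype=np.float32)

# Multiply the fp16-sourced values, accumulate the products in fp32.
D = A.astype(np.float32) @ B.astype(np.float32) + C
```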


--
Srdja
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: To TPU or not to TPU...

Post by Rémi Coulom »

I have a bit of experience with these algorithms and this hardware. It is really exciting technology to learn, and I encourage all chess programmers to try it now. You might not beat AlphaZero rapidly, but it is really refreshing to experience radically new algorithms. And AlphaZero still leaves a lot of room for engineering creativity, which is a lot less frustrating than the current state of computer chess, where huge efforts bring very little improvement.

You can approach programming neural networks at different levels:
- High-level deep learning frameworks: in my opinion, the reasonable solutions are limited to TensorFlow and Caffe2.
- Middle level: use cuDNN (works on Nvidia hardware only).
- Low level: write your own code.

I started by writing my own vectorized CPU code (because my commercial code has to run everywhere, and is mainly used on cell phones). But it is really not a good solution for training the neural network: GPUs can be orders of magnitude faster.

Because I like writing my own code, I made my own implementation of backpropagation (with the help of Fabien Letouzey). I tried OpenCL and CUDA. OpenCL is more open/portable in theory, but CUDA is considerably more convenient to use and has much better tools and documentation. In practice, Nvidia offers the best deep learning hardware and is competing against OpenCL with CUDA, so it ships a crippled OpenCL driver for its cards. If you want to understand the architecture of a GPU and how it is programmed, I recommend the CUDA documentation.

Writing your own implementation is likely to be a waste of time if your objective is to get something that works efficiently. I think there is no chance you can beat cuDNN with your own implementation, especially since cuDNN is coded at a lower level than CUDA.

So, the reasonable choice is to use a high-level framework. I am using TensorFlow. It can run on the GPU, and can also produce CPU code. There is also "TensorFlow Lite" for cell phones, and the experimental "XLA compiler" that can produce optimized CPU code. Neither tool is mature yet, but both should improve rapidly. Setting up TensorFlow for use in C++ is complicated, but works nicely once you understand how it works.
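
A minimal sketch of what the high-level route looks like, assuming the TensorFlow 1.x Python API that was current at the time; the 8x8x12 board encoding and the layer sizes are my own illustrative choices, not anyone's actual engine:

```python
import numpy as np
import tensorflow as tf  # assumes the 1.x API (tf.placeholder, tf.Session)

# A batch of chess positions: 8x8 board, 12 piece planes (illustrative encoding).
board = tf.placeholder(tf.float32, [None, 8, 8, 12])

# One 3x3 convolutional layer, the building block of AlphaZero-style networks.
filters = tf.Variable(tf.random_normal([3, 3, 12, 64], stddev=0.1))
conv = tf.nn.relu(tf.nn.conv2d(board, filters,
                               strides=[1, 1, 1, 1], padding='SAME'))

# A scalar evaluation head on top of the convolution output.
flat = tf.reshape(conv, [-1, 8 * 8 * 64])
value = tf.layers.dense(flat, 1, activation=tf.nn.tanh)

with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    v = sess.run(value, feed_dict={board: np.zeros((1, 8, 8, 12), np.float32)})
```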

For HGM: the Fast Fourier Transform is profitable only for large convolutions; for 3x3 convolutions it is not worth it. But there is another very nice trick, called the Winograd transform:
https://arxiv.org/abs/1509.09308
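
For the curious, the smallest case, F(2,3), computes two outputs of a 3-tap filter with 4 multiplications instead of 6. A numpy sketch following the construction in the paper above:

```python
import numpy as np

def winograd_f23(d, g):
    """Two outputs of a 3-tap filter from 4 inputs, using only 4 multiplies."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])

d = np.random.rand(4)   # input tile
g = np.random.rand(3)   # filter
direct = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],    # 6 multiplies
                   d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), direct)
```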

The Volta has ordinary cores that work like a traditional GPU's. In addition, it has "tensor cores" that are specialized for deep-learning operations. I am not sure about the details, but my impression is that they are specialized for matrix-matrix multiplication. Convolution can be expressed as a matrix multiplication, and is often implemented this way:
https://petewarden.com/2015/04/20/why-g ... -learning/
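
The trick described there is im2col: unroll every 3x3 input patch into a row of a matrix, so the whole convolution (cross-correlation, in the deep-learning convention) becomes one big matrix multiply that GEMM routines, and hence tensor cores, can handle. A minimal single-channel sketch:

```python
import numpy as np

def conv2d_as_matmul(image, kernel):
    """3x3 'valid' convolution of a single-channel image via im2col + GEMM."""
    h, w = image.shape
    k = kernel.shape[0]
    # im2col: gather every kxk patch into one row of a matrix.
    patches = np.array([image[i:i+k, j:j+k].ravel()
                        for i in range(h - k + 1)
                        for j in range(w - k + 1)])
    # One matrix product replaces the whole sliding-window loop.
    out = patches @ kernel.ravel()
    return out.reshape(h - k + 1, w - k + 1)

image = np.random.rand(8, 8)
kernel = np.random.rand(3, 3)
result = conv2d_as_matmul(image, kernel)  # shape (6, 6)
```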
trulses
Posts: 39
Joined: Wed Dec 06, 2017 5:34 pm

Re: To TPU or not to TPU...

Post by trulses »

I don't think using TensorFlow means your code is not optimized for current consumer hardware. If you compile TensorFlow yourself, you can get it optimized for your CPU (this includes SIMD instruction sets like AVX). You can also compile whatever CUDA code they have in there that is not cuDNN to target the shader model/compute capability of your GPU. In any event, it only makes sense to start coding your own implementation if you have profiled your code and noticed a specific lack of performance in that department; even then, it might be better to alter their code rather than start from scratch.
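
For reference, the kind of build invocation that enables those instruction sets looked roughly like `bazel build -c opt --copt=-march=native //tensorflow/tools/pip_package:build_pip_package` at the time (a sketch from memory of the 1.x build instructions; `-march=native` tells gcc to enable every SIMD extension the build machine's CPU supports).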

If you want to code a full framework for practice or for your own satisfaction, then that is a different matter altogether. It's a great exercise and I recommend it wholeheartedly; even if you only implement fully connected layers and one activation function, it's pretty fun.
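
As a starting point, that exercise fits in a page of numpy: one fully connected layer, one activation, and the backward pass. A sketch with plain gradient descent and a squared-error loss (all sizes and names are just illustrative):

```python
import numpy as np

# One fully connected layer with ReLU, trained by backprop on a toy target.
W = np.random.randn(4, 8) * 0.1   # weights: 8 inputs -> 4 outputs
b = np.zeros(4)

x = np.random.randn(8)            # a single training input
t = np.random.randn(4)            # its target output

for step in range(100):
    # Forward pass.
    z = W @ x + b
    y = np.maximum(z, 0.0)        # ReLU activation
    loss = 0.5 * np.sum((y - t) ** 2)

    # Backward pass: chain rule through the loss, the ReLU, and the affine map.
    dy = y - t                    # dLoss/dy
    dz = dy * (z > 0)             # ReLU gradient
    dW = np.outer(dz, x)          # dLoss/dW
    db = dz                       # dLoss/db

    # Plain gradient descent update.
    W -= 0.1 * dW
    b -= 0.1 * db
```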
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: To TPU or not to TPU...

Post by Milos »

smatovic wrote: I guess I am not the only one pondering chess NN implementations.

So what do you think: should programmers use frameworks/libraries like TensorFlow
to make use of the rising TPUs
(Nvidia Volta, Google Cloud TPU, special chips in smartphones, ASICs),
or should we write our own NN implementations, optimized for current consumer hardware
(CPU, AVX, GPU)?
I would prefer to use Google's TensorFlow TPU stack, but unfortunately it is proprietary.
So even if you made your own TPU as an ASIC or on an FPGA (there are quite a few startups actually doing this), you would still have to do everything from scratch with OpenCL, writing your own kernels and driver.
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: To TPU or not to TPU...

Post by smatovic »

trulses wrote: I don't think using TensorFlow means your code is not optimized for current consumer hardware.
Maybe I asked the wrong question:

frameworks vs. own implementation
rising TPUs vs. current hardware
trulses wrote: If you want to code a full framework for practice or for your own satisfaction, then that is a different matter altogether. It's a great exercise and I recommend it wholeheartedly; even if you only implement fully connected layers and one activation function, it's pretty fun.
Fun seems to be a pretty good measure for a project :-)

--
Srdja
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: To TPU or not to TPU...

Post by smatovic »

I see, CUDA seems to be omnipresent in NN frameworks:

https://en.wikipedia.org/wiki/Compariso ... g_software

But what about Leela Zero?
It looks like GCP uses OpenCL with his own implementation:

https://github.com/gcp/leela-zero

--
Srdja
phhnguyen
Posts: 1434
Joined: Wed Apr 21, 2010 4:58 am
Location: Australia
Full name: Nguyen Hong Pham

Re: To TPU or not to TPU...

Post by phhnguyen »

If your PC has a TPU, I guess you may run a NN a few times faster than on a normal PC with a good GPU card.

That is nice, but not worth waiting or hoping for.

I think the key to AlphaZero's success is having 5000 TPUs for training, not a single one.
Vinvin
Posts: 5228
Joined: Thu Mar 09, 2006 9:40 am
Full name: Vincent Lejeune

Re: To TPU or not to TPU...

Post by Vinvin »

Video about backpropagation : https://www.youtube.com/watch?v=q555kfIFUCM