To TPU or not to TPU...

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob


to TPU or not to TPU?

Poll ended at Wed Jan 17, 2018 9:20 am

be patient and use TPUs via frameworks: 3 votes (18%)
optimize now for current hardware: 14 votes (82%)

Total votes: 17

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

To TPU or not to TPU...

Post by smatovic » Sat Dec 16, 2017 9:20 am

I guess I am not the only one who ponders about a chess NN implementation,

so what do you think: should programmers use frameworks/libraries like TensorFlow
to make use of the rising TPUs
(Nvidia Volta, Google Cloud TPU, special chips in smartphones, ASICs),
or should we write our own NN implementations, optimized for current consumer hardware
(CPU, AVX, GPU)?

--
Srdja

hgm
Posts: 23474
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

Re: To TPU or not to TPU...

Post by hgm » Sat Dec 16, 2017 10:06 am

I am largely ignorant about GPU architecture. I understand the main computational task of image rendering (filling triangular planes of pixels with a color or depth gradient), and I can understand how a very wide SIMD architecture would be helpful for that. But even in SIMD, one of the operands of each operation usually has to be freshly read from memory (or stored there), so the 'MD' part makes a large demand on memory bandwidth.

From the descriptions I have seen, I understand TPUs are different. They not only save on instruction-fetch bandwidth, but also on data-fetch bandwidth, by performing multiple operations on the same (vector) data. So that (in the most extreme case) you would only have to fetch 2 x 256 operands in order to do 256 x 256 multiplications + additions. This is tailored to automatically calculating a convolution of the data vectors, rather than just the inner product (as other SIMD architectures usually do).
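
If that reading is right, the arithmetic pattern is essentially an outer-product accumulate: fetch two vectors once, then reuse them for all N x N multiply-adds. A toy NumPy sketch of the arithmetic (of course not of the hardware itself):

```python
import numpy as np

N = 256
a = np.random.rand(N).astype(np.float32)   # 256 operands, fetched once
b = np.random.rand(N).astype(np.float32)   # 256 more operands, fetched once

acc = np.outer(a, b)                       # 256 x 256 = 65536 multiply-adds

# Repeating this over the columns of A and the rows of B yields A @ B,
# with only 2*N fresh operands fetched per N*N multiply-add step:
A = np.random.rand(N, N).astype(np.float32)
B = np.random.rand(N, N).astype(np.float32)
C = np.zeros((N, N), dtype=np.float32)
for k in range(N):
    C += np.outer(A[:, k], B[k, :])
assert np.allclose(C, A @ B, atol=1e-2)    # matches an ordinary matmul
```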

I was a bit surprised by this, because the most efficient way to calculate convolutions is usually by Fast Fourier Transform. Perhaps the assumption here is that the filter kernels of neural networks are typically so small that direct shift-multiply-add is competitive.
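
To make the comparison concrete, here is an illustrative NumPy check of the convolution theorem with a tiny 3-tap kernel (zero-padded to the full output length to avoid circular wrap-around):

```python
import numpy as np

x = np.random.rand(64)
k = np.random.rand(3)            # tiny kernel, as typical in NN layers

direct = np.convolve(x, k)       # full convolution, length 64 + 3 - 1 = 66

n = len(x) + len(k) - 1          # zero-pad so circular == linear convolution
via_fft = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(k, n), n)

assert np.allclose(direct, via_fft)
# For a length-3 kernel the direct route costs only 3 MACs per output,
# so the O(n log n) FFT machinery pays off only for much larger kernels.
```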

Is the Nvidia Volta really such a convoluting TPU, or just a powerful GPU? I would love to have TPU capability in my PC. Even if I would have to make it myself from a FPGA (if these would be up to the task).

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: To TPU or not to TPU...

Post by smatovic » Sat Dec 16, 2017 10:51 am

hgm wrote: Is the Nvidia Volta really such a convoluting TPU, or just a powerful GPU? I would love to have TPU capability in my PC. Even if I would have to make it myself from a FPGA (if these would be up to the task).
Tensor Core with Mixed Precision Matrix Math,
page 48 and page 50

http://on-demand.gputechconf.com/gtc/20 ... -volta.pdf

SM Layout on page 24.


--
Srdja

Rémi Coulom
Posts: 429
Joined: Mon Apr 24, 2006 6:06 pm
Contact:

Re: To TPU or not to TPU...

Post by Rémi Coulom » Sat Dec 16, 2017 11:15 am

I have a bit of experience with these algorithms and this hardware. It's a really exciting technology to learn, and I encourage all chess programmers to try it now. You might not beat AlphaZero rapidly, but it is really refreshing to experience radically new algorithms. And AlphaZero still leaves a lot of room for engineering creativity, which is a lot less frustrating than the current state of computer chess, where huge efforts bring very little improvement.

You can approach programming neural networks at different levels:
- High-level deep learning frameworks. In my opinion, reasonable solutions are limited to tensorflow and caffe2.
- The middle-level solution is to use cudnn (works on nvidia hardware only)
- The low-level approach: write your own code.

I started by writing my own vectorized cpu code (because my commercial code has to run everywhere, and is mainly used on cell phones). But it is really not a good solution for training the neural network. GPUs can be orders of magnitude faster.

Because I like writing my own code, I made my own implementation of backpropagation (with the help of Fabien Letouzey). I tried OpenCL and CUDA. OpenCL is more open/portable in theory. But CUDA is considerably more convenient to use, and has much better tools and documentation. In practice, nvidia offers the best deep learning hardware, and is competing against OpenCL with CUDA, so it offers a crippled OpenCL driver for its cards. If you want to understand the architecture of a GPU and how it is programmed, I recommend the CUDA documentation.

Writing your own implementation is likely to be a waste of time if your objective is to get something that works efficiently. I think there is no chance you can beat cuDNN with your own implementation, especially since cuDNN is coded at a lower level than CUDA.

So, the reasonable choice is to use a high-level framework. I am using tensorflow. It can run on GPU, and can also produce CPU code. There is also a "tensorflow lite" for cell phones, and the experimental "xla compiler" that can produce optimized CPU code. Neither tool is mature yet, but both should improve rapidly. Setting up tensorflow for use in C++ is complicated, but works nicely once you understand how it works.

For HGM: the Fast Fourier Transform is profitable only for large convolutions. For 3x3 convolution, they are not worth it. But there is another very nice trick, called the Winograd transform:
https://arxiv.org/abs/1509.09308
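
For the curious, the smallest case from that paper, F(2,3), fits in a few lines: two outputs of a 3-tap filter with 4 multiplications instead of the naive 6 (illustrative NumPy, not production code):

```python
import numpy as np

def winograd_f23(d, g):
    """d: 4 input values, g: 3 filter taps -> 2 convolution outputs."""
    m1 = (d[0] - d[2]) * g[0]
    m2 = (d[1] + d[2]) * (g[0] + g[1] + g[2]) / 2
    m3 = (d[2] - d[1]) * (g[0] - g[1] + g[2]) / 2
    m4 = (d[1] - d[3]) * g[2]
    return np.array([m1 + m2 + m3, m2 - m3 - m4])   # 4 multiplies total

d = np.random.rand(4)
g = np.random.rand(3)
naive = np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],    # 6 multiplies
                  d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])
assert np.allclose(winograd_f23(d, g), naive)
```

In a real implementation the filter transform is precomputed once per layer, so the per-tile cost is even lower than it looks here.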

The Volta has ordinary cores that work like traditional GPUs. In addition, it has "tensor cores" that are specialized for deep-learning operation. I am not sure about the details, but my impression is that they are specialized for matrix-matrix multiplication. Convolution can be expressed as a matrix multiplication, and is often implemented this way:
https://petewarden.com/2015/04/20/why-g ... -learning/
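
The lowering described in that article can be sketched as follows: an illustrative NumPy `im2col` that unrolls every kernel-sized patch into a column, so the convolution becomes one matrix product (channels, stride, and padding ignored for brevity):

```python
import numpy as np

def im2col(img, kh, kw):
    """Unroll every kh x kw patch of img into a column."""
    h, w = img.shape
    cols = [img[i:i+kh, j:j+kw].ravel()
            for i in range(h - kh + 1)
            for j in range(w - kw + 1)]
    return np.array(cols).T                  # shape (kh*kw, num_patches)

img = np.random.rand(6, 6)
kernel = np.random.rand(3, 3)

cols = im2col(img, 3, 3)                     # (9, 16)
out = (kernel.ravel() @ cols).reshape(4, 4)  # the whole conv as one matmul

# Reference: a direct sliding-window convolution (cross-correlation)
ref = np.array([[np.sum(img[i:i+3, j:j+3] * kernel)
                 for j in range(4)] for i in range(4)])
assert np.allclose(out, ref)
```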

trulses
Posts: 39
Joined: Wed Dec 06, 2017 4:34 pm

Re: To TPU or not to TPU...

Post by trulses » Sun Dec 17, 2017 12:45 am

I don't think that using tensorflow means it's not optimized for current consumer hardware. If you compile tensorflow yourself you can get it optimized for your CPU (this includes SIMD instruction sets like AVX). You can also compile whatever CUDA code they have in there that's not cuDNN to target the shader model/compute capability of your GPU. In any event it only makes sense to start coding your own implementation if you've profiled your code and notice a specific lack of performance in that department, even then it might be better to alter their code rather than starting from scratch.

If you want to code a full framework for practice or for your own satisfaction then this is a different matter altogether. It's a great exercise and I recommend it wholeheartedly, even if you only implement fully connected layers and one activation function it's pretty fun.
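
A fully connected layer plus one activation really does fit in a screenful. A toy sketch in that spirit (NumPy, squared-error loss, hand-derived gradients; the hyperparameters are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])   # XOR inputs
y = np.array([[0.], [1.], [1.], [0.]])                   # XOR targets

W1 = rng.normal(size=(2, 8)); b1 = np.zeros(8)           # hidden layer
W2 = rng.normal(size=(8, 1)); b2 = np.zeros(1)           # output layer
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

losses = []
for _ in range(5000):
    # forward pass
    h = sigmoid(X @ W1 + b1)
    out = sigmoid(h @ W2 + b2)
    losses.append(0.5 * np.sum((out - y) ** 2))
    # backward pass: chain rule, using sigmoid'(z) = s * (1 - s)
    d_out = (out - y) * out * (1 - out)
    d_h = (d_out @ W2.T) * h * (1 - h)
    # plain gradient-descent update
    W2 -= 0.5 * h.T @ d_out;  b2 -= 0.5 * d_out.sum(axis=0)
    W1 -= 0.5 * X.T @ d_h;    b1 -= 0.5 * d_h.sum(axis=0)

print(losses[0], losses[-1])   # the loss should shrink substantially
```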

Milos
Posts: 3383
Joined: Wed Nov 25, 2009 12:47 am

Re: To TPU or not to TPU...

Post by Milos » Sun Dec 17, 2017 1:23 am

smatovic wrote: I guess I am not the only one who ponders about a chess NN implementation,

so what do you think: should programmers use frameworks/libraries like TensorFlow
to make use of the rising TPUs
(Nvidia Volta, Google Cloud TPU, special chips in smartphones, ASICs),
or should we write our own NN implementations, optimized for current consumer hardware
(CPU, AVX, GPU)?
I would prefer to get Google's TensorFlow TPU stack, but unfortunately it is proprietary.
So even if you made your own TPU as an ASIC or FPGA (there are quite a few startups actually doing that), you'd still have to do everything from scratch using OpenCL and write your own kernel and driver.

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: To TPU or not to TPU...

Post by smatovic » Sun Dec 17, 2017 7:30 am

trulses wrote: I don't think that using tensorflow means it's not optimized for current consumer hardware.
Maybe I asked the wrong question:

Frameworks vs. own implementation
rising TPUs vs. current hardware
trulses wrote: If you want to code a full framework for practice or for your own satisfaction then this is a different matter altogether. It's a great exercise and I recommend it wholeheartedly, even if you only implement fully connected layers and one activation function it's pretty fun.
Fun seems to be a pretty good measure for a project :-)

--
Srdja

smatovic
Posts: 784
Joined: Wed Mar 10, 2010 9:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic
Contact:

Re: To TPU or not to TPU...

Post by smatovic » Sun Dec 17, 2017 7:37 am

I see, CUDA seems to be omnipresent in NN Frameworks,

https://en.wikipedia.org/wiki/Compariso ... g_software

But what about Leela Zero?
It looks like GCP uses OpenCL with his own implementation:

https://github.com/gcp/leela-zero

--
Srdja

phhnguyen
Posts: 360
Joined: Wed Apr 21, 2010 2:58 am
Location: Australia
Full name: Nguyen Hong Pham
Contact:

Re: To TPU or not to TPU...

Post by phhnguyen » Mon Dec 18, 2017 2:20 am

If your PC has a TPU, I guess you may run a NN a few times faster than on a normal PC with a good GPU card.

That is good, but not worth waiting or hoping for.

I think the key to AlphaZero's success is having 5000 TPUs for training, not a single one.

Vinvin
Posts: 4333
Joined: Thu Mar 09, 2006 8:40 am
Full name: Vincent Lejeune

Re: To TPU or not to TPU...

Post by Vinvin » Wed Dec 20, 2017 8:52 pm

Video about backpropagation : https://www.youtube.com/watch?v=q555kfIFUCM
