I have a bit of experience with these algorithms and hardware. It's a really exciting technology to learn, and I encourage all the chess programmers to try it now. You might not beat AlphaZero any time soon, but it is really refreshing to experience radically new algorithms. And AlphaZero still leaves a lot of room for engineering creativity, which is a lot less frustrating than the current state of computer chess, where huge efforts bring very little improvement.
You can approach programming neural networks at different levels:
- High-level deep learning frameworks. In my opinion, the reasonable options are limited to TensorFlow and Caffe2.
- The middle-level solution is to use cuDNN (which works on NVIDIA hardware only).
- The low-level approach: write your own code.
I started by writing my own vectorized CPU code (because my commercial code has to run everywhere, and is mainly used on cell phones). But it is really not a good solution for training neural networks: GPUs can be orders of magnitude faster.
Because I like writing my own code, I made my own implementation of backpropagation (with the help of Fabien Letouzey). I tried OpenCL and CUDA. OpenCL is more open/portable in theory. But CUDA is considerably more convenient to use, and has much better tools and documentation. In practice, NVIDIA offers the best deep learning hardware, and is competing against OpenCL with CUDA, so it offers a crippled OpenCL driver for its cards. If you want to understand the architecture of a GPU and how it is programmed, I recommend the CUDA documentation.
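To give an idea of what "writing your own backpropagation" involves, here is a minimal sketch (not my actual code, and with made-up sizes): a one-hidden-layer network with a tanh activation and a squared-error loss, with the analytic gradient checked against a numerical one.

```python
import math, random

random.seed(0)

def forward(x, w1, w2):
    # hidden layer with tanh activation, then a linear output
    h = [math.tanh(sum(wi * xi for wi, xi in zip(row, x))) for row in w1]
    y = sum(w * hi for w, hi in zip(w2, h))
    return h, y

def grads(x, target, w1, w2):
    # loss: L = 0.5 * (y - target)^2
    h, y = forward(x, w1, w2)
    dy = y - target                               # dL/dy
    dw2 = [dy * hi for hi in h]                   # dL/dw2
    # backpropagate through tanh: d tanh(z)/dz = 1 - tanh(z)^2
    dh = [dy * w * (1 - hi * hi) for w, hi in zip(w2, h)]
    dw1 = [[d * xi for xi in x] for d in dh]      # dL/dw1
    return dw1, dw2

x = [0.5, -0.3]
w1 = [[random.uniform(-1, 1) for _ in x] for _ in range(3)]
w2 = [random.uniform(-1, 1) for _ in range(3)]
dw1, dw2 = grads(x, 1.0, w1, w2)

# sanity check: compare one weight's gradient to a central difference
eps = 1e-6
w1[0][0] += eps
_, y_plus = forward(x, w1, w2)
w1[0][0] -= 2 * eps
_, y_minus = forward(x, w1, w2)
w1[0][0] += eps
num = (0.5 * (y_plus - 1.0) ** 2 - 0.5 * (y_minus - 1.0) ** 2) / (2 * eps)
print(abs(num - dw1[0][0]) < 1e-6)  # True: the gradients agree
```

The numerical check is the important part: it is very easy to get a sign or an index wrong in hand-written backpropagation, and finite differences catch that immediately.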
Writing your own implementation is likely to be a waste of time if your objective is to get something that works efficiently. I think there is no chance you can beat cuDNN with your own implementation. Especially since cuDNN is coded at a lower level than CUDA.
So, the reasonable choice is to use a high-level framework. I am using TensorFlow. It can run on the GPU, and can also produce CPU code. There is a "TensorFlow Lite" for cell phones, and an experimental "XLA compiler" that can produce optimized CPU code. Neither tool is mature yet, but both should improve rapidly. Setting up TensorFlow for use in C++ is complicated, but works nicely once you understand how it works.
For HGM: the Fast Fourier Transform is profitable only for large convolutions. For 3x3 convolutions, it is not worth it. But there is another very nice trick, called the Winograd transform:
https://arxiv.org/abs/1509.09308
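The 1-D case F(2,3) shows the idea behind the paper: computing two outputs of a 3-tap filter normally costs 6 multiplications, but the Winograd formulas do it with 4 (and the transform of the filter taps can be precomputed once per filter). A small sketch:

```python
def winograd_f23(d, g):
    """Winograd F(2,3): two outputs of a 3-tap filter with 4 multiplies."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # the filter-side sums can be precomputed once per filter
    m1 = (d0 - d2) * g0
    m2 = (d1 + d2) * (g0 + g1 + g2) / 2
    m3 = (d2 - d1) * (g0 - g1 + g2) / 2
    m4 = (d1 - d3) * g2
    return [m1 + m2 + m3, m2 - m3 - m4]

def direct(d, g):
    # reference: ordinary sliding-window computation, 6 multiplies
    return [sum(d[i + j] * g[j] for j in range(3)) for i in range(2)]

d = [1.0, 2.0, -1.0, 3.0]
g = [0.5, -1.0, 2.0]
print(winograd_f23(d, g))  # → [-3.5, 8.0]
print(direct(d, g))        # → [-3.5, 8.0] — same result, fewer multiplies
```

The 2-D version used for 3x3 convolutions nests this transform in both dimensions; the saving in multiplications is what makes it attractive for deep-learning workloads.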
The Volta has ordinary cores that work like traditional GPUs. In addition, it has "tensor cores" that are specialized for deep-learning operations. I am not sure about the details, but my impression is that they are specialized for matrix-matrix multiplication. Convolution can be expressed as a matrix multiplication, and is often implemented this way:
https://petewarden.com/2015/04/20/why-g ... -learning/
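The standard way to do this is usually called "im2col": every kernel-sized patch of the input is unrolled into a row of a matrix, so the whole convolution becomes one matrix product that a GEMM routine (or a tensor core) can chew on. A pure-Python sketch for a single channel, stride 1, no padding (real implementations generalize all of that):

```python
def im2col(image, kh, kw):
    # unroll every kh x kw patch of the image into one row of a matrix
    h, w = len(image), len(image[0])
    rows = []
    for i in range(h - kh + 1):
        for j in range(w - kw + 1):
            rows.append([image[i + di][j + dj]
                         for di in range(kh) for dj in range(kw)])
    return rows

def conv_via_matmul(image, kernel):
    # each output value is the dot product of a patch row with the
    # flattened kernel, i.e. one row of a matrix-vector product
    kh, kw = len(kernel), len(kernel[0])
    flat_k = [v for row in kernel for v in row]
    return [sum(a * b for a, b in zip(row, flat_k))
            for row in im2col(image, kh, kw)]

image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
kernel = [[1, 0],
          [0, -1]]
print(conv_via_matmul(image, kernel))  # → [-4, -4, -4, -4]
```

With many filters, the flattened kernels stack into a second matrix and the whole layer becomes a single matrix-matrix multiplication, which is exactly the operation the tensor cores accelerate. The cost is memory: im2col duplicates each input value once per patch it appears in.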