Daniel Shawul wrote: About the 3x3 filters, why does LCzero have code for inference anyway? My first take at an NN chess engine would probably use the C++ TensorFlow backend for inference, which would use cuDNN as its backend, which probably has better-optimized algorithms than hand-written ones. Is this done to avoid a dependency on TensorFlow, or am I missing something?

Many reasons.
Leela (Zero) was written before TensorFlow had a C++ API that didn't suck. I originally used Caffe, even, because TensorFlow itself didn't yet exist outside Google. Meanwhile, someone has already done a re-implementation using TF as the backend; it's in the same GitHub repository, so you're not "missing something" in that sense.
Using OpenCL avoided a dependency on Caffe (or TensorFlow), on cuDNN (I don't see the point of arguing with Mr. Stanisavljevic here, as anyone can just go out and find the actual license, and AFAIK Linscott/Huizinga are negotiating with NVIDIA about it, so it may not even matter), and consequently on NVIDIA (I did not anticipate AMD basically self-destructing). This made distributing the original Leela (a Windows GUI program with a simple installer) manageable even with GPU support. Losing those dependencies makes the program usable by a lot more people. (Even the commercial Go engines don't bother to include GPU support so far.)
Re-implementing the filters manually was very interesting and fun. I like hardware, hadn't done GPU programming before, and it was a good learning experience. I got to learn OpenCL and CUDA (quiz: how do you optimize OpenCL code for NVIDIA? Answer: you port it to CUDA, optimize, and port the changes back, because NVIDIA got rid of their OpenCL profilers), and more about modern GPU hardware than I wanted to know.
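To make concrete what "re-implementing the filters" means at its simplest, here's a minimal sketch (illustrative only, not Leela Zero's actual kernels, which are far more optimized OpenCL with tiling and transforms): a direct 3x3 convolution with zero padding:

#include <vector>

// in:  [c_in][h][w], w3: [c_out][c_in][3][3], out: [c_out][h][w]
// Naive reference implementation; real GPU kernels restructure this
// heavily to exploit shared memory and coalesced loads.
void conv3x3(const std::vector<float>& in, const std::vector<float>& w3,
             std::vector<float>& out, int c_in, int c_out, int h, int w) {
    for (int o = 0; o < c_out; ++o)
        for (int y = 0; y < h; ++y)
            for (int x = 0; x < w; ++x) {
                float acc = 0.0f;
                for (int c = 0; c < c_in; ++c)
                    for (int ky = -1; ky <= 1; ++ky)
                        for (int kx = -1; kx <= 1; ++kx) {
                            int sy = y + ky, sx = x + kx;
                            if (sy < 0 || sy >= h || sx < 0 || sx >= w)
                                continue;               // zero padding
                            acc += in[(c * h + sy) * w + sx]
                                 * w3[((o * c_in + c) * 3 + ky + 1) * 3 + kx + 1];
                        }
                out[(o * h + y) * w + x] = acc;
            }
}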
There is a port of Leela Zero to cuDNN, and it's slower than the OpenCL code. This seems to be because cuDNN is not very well optimized for small batch sizes (though TensorRT is!), which Leela uses by default (there have been some forks to address this). I was skeptical of large batches because of how they make the tree search less efficient, but current evidence seems to indicate the gain on the DCNN side overwhelms the loss in search efficiency. There may be other things at play; IIRC Huizinga compared cuDNN with TensorRT and got basically batshit insane performance differences.
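A back-of-the-envelope way to see the tradeoff (the numbers below are invented for illustration, not measurements): every NN call pays a roughly fixed launch/setup overhead, so per-position throughput only amortizes at larger batches, while each extra position in a batch was selected by the search without seeing the evaluations of the others:

#include <cstdio>
#include <initializer_list>

int main() {
    const double fixed_us = 300.0;   // assumed per-call overhead
    const double per_pos_us = 25.0;  // assumed marginal cost per position
    for (int batch : {1, 2, 4, 8, 16, 32, 64}) {
        double call_us = fixed_us + per_pos_us * batch;
        std::printf("batch %2d: %8.0f positions/s\n",
                    batch, batch / call_us * 1e6);
    }
}

With those made-up numbers, batch 1 gets about 3k positions/s and batch 64 over 33k, and that kind of gap can swallow a lot of search inefficiency.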
Also, I judged that just a simple exe that anyone can click would bring more users than requiring a Python/TensorFlow/cuDNN install. Sure, you can make an installer that deals with all of those, in theory. You go ahead and do that, because I won't.
As for CPU-only operation, you can just link with MKL; it's fully supported, and the closed-source version ships with it. I think nowadays you'd probably use MKL-DNN instead, but no one has bothered porting the backend to it. I even have a version that doesn't need a BLAS library at all and uses my own code plus autovectorization. But it's very easy to suddenly lose a factor of 2 somewhere if the cache layout of the target system doesn't exactly match what you were expecting, so this is one of the few parts where I gave up trying to do things manually.
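For reference, the standard trick for a BLAS-backed CPU path is to lower the convolution to a single matrix multiply (im2col followed by SGEMM). This is a hedged sketch of the idea against the generic CBLAS interface, not Leela Zero's exact code:

#include <cblas.h>   // or mkl.h when building against MKL

// in:  [c_in][h][w], col: [c_in*9][h*w]  (3x3, zero padding, stride 1)
void im2col3x3(const float* in, int c_in, int h, int w, float* col) {
    for (int c = 0; c < c_in; ++c)
        for (int ky = 0; ky < 3; ++ky)
            for (int kx = 0; kx < 3; ++kx) {
                int row = (c * 3 + ky) * 3 + kx;
                for (int y = 0; y < h; ++y)
                    for (int x = 0; x < w; ++x) {
                        int sy = y + ky - 1, sx = x + kx - 1;
                        col[row * h * w + y * w + x] =
                            (sy >= 0 && sy < h && sx >= 0 && sx < w)
                                ? in[(c * h + sy) * w + sx] : 0.0f;
                    }
            }
}

// weights: [c_out][c_in*9], out: [c_out][h*w]
// One SGEMM does the whole layer: out = weights * col.
void conv_blas(const float* weights, const float* col, float* out,
               int c_out, int c_in, int h, int w) {
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                c_out, h * w, c_in * 9,
                1.0f, weights, c_in * 9, col, h * w,
                0.0f, out, h * w);
}

The SGEMM then runs at whatever the BLAS library can sustain on that machine, which is exactly the "lose a factor of 2" trap you avoid by not writing it yourself.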
tl;dr: Yes, you can replace the whole thing with 2000 lines of Python (https://github.com/tensorflow/minigo) and 500 dependencies, but I actually like programming.