This code still forms a computational bottleneck. Maybe specialized code that checks for sparsity would help; I don't know. Perhaps I even have to use sparse matrices. Don't know yet. It only gets more complicated, which might introduce bad bugs. I already encountered a bug a few days ago.
One epoch already costs me about twelve minutes, and at least 200 epochs are needed.
And then you find out you need more filters/layers, making it even slower, and then it starts all over again.
Sparse matrices are jagged dictionaries, so accessing a matrix element gets O(log n) slower.
So I doubt it will get any faster. But the only way to know for sure is to try it (master of pain).
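To make the trade-off concrete, here is a minimal sketch (in Python, my own illustration, not the author's actual code) of a "jagged dictionary" sparse matrix: a dict of rows, each row a dict mapping column to value. Note that with a hash-based dict a lookup is amortized O(1); the O(log n) cost described above applies when the rows are tree-based sorted dictionaries. Either way, the win from sparsity comes from iterating only the stored entries:

```python
# Hypothetical "jagged dictionary" sparse matrix: dict of rows,
# each row a dict of {column index: value}.
class SparseMatrix:
    def __init__(self):
        self.rows = {}  # row index -> {col index: value}

    def set(self, r, c, v):
        self.rows.setdefault(r, {})[c] = v

    def get(self, r, c):
        # Missing entries are implicit zeros.
        return self.rows.get(r, {}).get(c, 0.0)

    def row_items(self, r):
        # Iterating only the stored entries is where sparsity pays off.
        return self.rows.get(r, {}).items()

def sparse_dot(m, r, dense_vec):
    # Dot product of sparse row r with a dense vector:
    # cost is proportional to the nonzeros, not the full row width.
    return sum(v * dense_vec[c] for c, v in m.row_items(r))

m = SparseMatrix()
m.set(0, 2, 3.0)
m.set(0, 7, 1.5)
print(m.get(0, 2))                  # 3.0
print(m.get(0, 3))                  # 0.0 (implicit zero)
print(sparse_dot(m, 0, [1.0] * 8))  # 3.0 + 1.5 = 4.5
```

Whether this beats a dense array depends entirely on how sparse the matrices really are: each element access now pays dictionary overhead, so dense inner loops can easily end up slower.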
Henk wrote: ↑Mon Nov 05, 2018 11:10 am
Changed the interface of the matrix class. What a disaster. I have to repair a great many lines. And when I make mistakes, bugs creep in, costing too much time.
Only because I also wanted to support non-jagged sparse matrices. The Idiot keeps going on.
Best is to minimize your code, check all classes that are heavily used, and check whether their interfaces are generic enough.
But that is only theory.
Main problem is that I am starting to forget how backpropagation works, seeing the network as a black box.
But it's the same problem with PVS. Last week I had to recall what PVS is.
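As a refresher on what backpropagation actually computes, here is a minimal sketch (my own illustration, not code from this thread): a one-hidden-layer network with sigmoid activations and squared error, where the weight gradients are just the chain rule applied layer by layer:

```python
import numpy as np

# Minimal one-hidden-layer network with sigmoid units and squared error.
def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(W1, W2, x):
    h = sigmoid(W1 @ x)   # hidden activations
    y = sigmoid(W2 @ h)   # output activations
    return h, y

def backward(W1, W2, x, t):
    # Gradients of E = 0.5 * ||y - t||^2 with respect to W1 and W2,
    # obtained by pushing the error signal back through each layer.
    h, y = forward(W1, W2, x)
    delta2 = (y - t) * y * (1.0 - y)          # error signal at the output
    dW2 = np.outer(delta2, h)
    delta1 = (W2.T @ delta2) * h * (1.0 - h)  # error propagated to hidden layer
    dW1 = np.outer(delta1, x)
    return dW1, dW2

# One gradient-descent step on a toy example.
rng = np.random.default_rng(0)
x = rng.standard_normal(3)
t = np.array([1.0])
W1 = rng.standard_normal((4, 3))
W2 = rng.standard_normal((1, 4))
dW1, dW2 = backward(W1, W2, x, t)
W1 = W1 - 0.1 * dW1
W2 = W2 - 0.1 * dW2
```

A finite-difference check against these gradients is a cheap way to catch exactly the kind of backprop bug that is hard to see when the network has become a black box.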
Daniel Shawul wrote: ↑Thu Oct 11, 2018 8:20 pm
I don't disagree on the need to understand the inner workings, but you will have a hard time beating vendor-supplied optimized libraries such as Intel MKL, cuDNN, TensorRT, etc.
Lczero already tried the former approach first and eventually switched to cuDNN, MKL BLAS, etc.
I am sure GCP put a lot of effort into coding Winograd etc., but these AI libraries are used by a lot of industry, so NVIDIA/Intel have a lot to gain from offering highly optimized libraries.
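For reference, the Winograd trick mentioned above reduces the number of multiplications in small convolutions. A minimal 1-D illustration (my own sketch using the F(2,3) transform matrices from Lavin & Gray's paper, not code from any of the engines discussed): two outputs of a 3-tap filter computed with 4 multiplies instead of 6:

```python
import numpy as np

# Winograd F(2,3) transform matrices (Lavin & Gray).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=float)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=float)

def winograd_f23(d, g):
    # d: 4 inputs, g: 3 filter taps -> 2 outputs.
    # The elementwise product is the 4 multiplies.
    return AT @ ((G @ g) * (BT @ d))

def direct(d, g):
    # Reference: plain sliding-window correlation (6 multiplies).
    return np.array([d[0]*g[0] + d[1]*g[1] + d[2]*g[2],
                     d[1]*g[0] + d[2]*g[1] + d[3]*g[2]])

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([0.5, -1.0, 2.0])
print(winograd_f23(d, g))  # matches direct(d, g)
```

In 2-D the same transforms are applied along both axes, and amortizing the filter transform over many tiles is what makes it pay off in convolution layers, in hand-written kernels and in vendor libraries alike.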
FWIW, Leela Zero and lc0 in OpenCL mode still use my code (though Henrik Forsten co-wrote large parts of the current implementation and he should get credit). When we benchmarked it against cuDNN in Leela Zero, it was faster. It seems we need much more aggressive batching for cuDNN to outperform it (for chess, things are very different). This may have changed with RTX cards and tensor cores, which is why I was asking about this in the other thread. People are working on more aggressive batching for Leela Zero as well, but that should remind you that these days you cannot separate the DCNN implementation from the search specifics and tuning.
The main reason to write my own implementation was in any case to not have to depend on the whims of the vendors' licensing, and to not have split versions for both card vendors.
Note that lc0's cuDNN backend was written by an NVIDIA driver engineer, who also dealt with getting the redistribution permission. I'm sure the implementation is state of the art.