Lc0 v0.24 dev DX backend for AMD Radeon GPU


ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: Lc0 v0.24 dev DX backend for AMD Radeon GPU

Post by ankan »

Laskos wrote: Thu Feb 13, 2020 2:32 pm
ankan wrote: Thu Feb 13, 2020 12:48 pm Thanks Laskos for testing. It's good to know that the speed increase translates into an improvement in playing strength.
The dx backend (which defaults to fp16 precision) uses a different convolution algorithm (Winograd) that scales better with bigger networks than the one cudnn-fp16 uses (implicit_gemm). We will be adding that path to the cudnn-fp16 backend too.
Thanks for the info. I thought the cuDNN backend also used fast Winograd convolutions; at least that was the talk more than a year ago, and in fact I used them 10-12 years ago in image processing.
In theory the Winograd algorithm needs fewer math operations and should be faster, but in practice, with fp16 and tensor cores, it is often slower because of the extra memory bandwidth required. The Winograd algorithm needs multiple stages - transforms of the inputs and the filters, the batched matrix multiplication, and then the inverse transform of the output. All of these stages need to read and write memory, which becomes the bottleneck. cuDNN has a Winograd algorithm, but for all network sizes I tested (including the huge 512-filter one) it runs slower than cuDNN's implicit_gemm on RTX GPUs with fp16. This is because the algorithmic reduction in the number of math operations isn't enough to offset the additional time spent reading and writing intermediate data between the stages, and with tensor cores math is relatively cheap. With the fp32 datatype, Winograd is relatively much faster, and we use it there (because the entire operation is still mostly math bound).
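To put rough numbers on that trade-off (an illustration using the textbook F(2x2, 3x3) tiling; the post does not say which tile size the backends actually use): direct convolution needs 2*2*3*3 = 36 multiplications per 2x2 output tile and channel pair, while Winograd needs only 4*4 = 16 multiplications on the transformed 4x4 tile, roughly 2.25x fewer - but the 4x4 transformed inputs and the 4x4 products are extra intermediate data that have to be written out and read back between the stages, which is where the additional bandwidth goes.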

The dx backend's custom implementation of Winograd is a bit faster for bigger networks because we do the filter transform offline (i.e., at load time) and need "only" three passes (input transform, batched GEMM, output transform) instead of four.
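For readers who want to see what those passes look like, here is a minimal single-tile sketch in NumPy using the standard F(2x2, 3x3) Winograd transform matrices. It is only an illustration of the idea, not Lc0's actual code; the real backends process batches of tiles and channels at once and run on the GPU.

[code]
import numpy as np

# F(2x2, 3x3): a 2x2 output tile is computed from a 4x4 input tile and a 3x3 filter.
B_T = np.array([[1,  0, -1,  0],
                [0,  1,  1,  0],
                [0, -1,  1,  0],
                [0,  1,  0, -1]], dtype=np.float32)
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float32)
A_T = np.array([[1, 1,  1,  0],
                [0, 1, -1, -1]], dtype=np.float32)

g = np.random.rand(3, 3).astype(np.float32)  # one 3x3 filter
d = np.random.rand(4, 4).astype(np.float32)  # one 4x4 input tile

# Done once at network-load time: filter transform.
U = G @ g @ G.T                     # 4x4 transformed filter

# Pass 1 (per evaluation): input transform.
V = B_T @ d @ B_T.T                 # 4x4 transformed input tile

# Pass 2: for a single tile and channel the "batched GEMM" degenerates to an
# element-wise product; over many tiles and channels it becomes 16 independent
# matrix multiplications, one per position in the 4x4 transformed tile.
M = U * V

# Pass 3: output (inverse) transform back to the spatial domain.
Y = A_T @ M @ A_T.T                 # 2x2 output tile

# Check against direct 3x3 cross-correlation over the same tile.
ref = np.array([[np.sum(d[i:i+3, j:j+3] * g) for j in range(2)]
                for i in range(2)])
print(np.allclose(Y, ref, atol=1e-4))  # prints True
[/code]

The intermediate tensors U, V and M are exactly the extra data that has to go through memory between the passes, which is the bandwidth cost discussed above.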
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Lc0 v0.24 dev DX backend for AMD Radeon GPU

Post by Laskos »

ankan wrote: Thu Feb 13, 2020 4:37 pm
Laskos wrote: Thu Feb 13, 2020 2:32 pm
ankan wrote: Thu Feb 13, 2020 12:48 pm Thanks Laskos for testing. It's good to know that the speed increase translates into an improvement in playing strength.
The dx backend (which defaults to fp16 precision) uses a different convolution algorithm (Winograd) that scales better with bigger networks than the one cudnn-fp16 uses (implicit_gemm). We will be adding that path to the cudnn-fp16 backend too.
Thanks for the info. I thought the cuDNN backend also used fast Winograd convolutions; at least that was the talk more than a year ago, and in fact I used them 10-12 years ago in image processing.
In theory the Winograd algorithm needs fewer math operations and should be faster, but in practice, with fp16 and tensor cores, it is often slower because of the extra memory bandwidth required. The Winograd algorithm needs multiple stages - transforms of the inputs and the filters, the batched matrix multiplication, and then the inverse transform of the output. All of these stages need to read and write memory, which becomes the bottleneck. cuDNN has a Winograd algorithm, but for all network sizes I tested (including the huge 512-filter one) it runs slower than cuDNN's implicit_gemm on RTX GPUs with fp16. This is because the algorithmic reduction in the number of math operations isn't enough to offset the additional time spent reading and writing intermediate data between the stages, and with tensor cores math is relatively cheap. With the fp32 datatype, Winograd is relatively much faster, and we use it there (because the entire operation is still mostly math bound).

The dx backend's custom implementation of Winograd is a bit faster for bigger networks because we do the filter transform offline (i.e., at load time) and need "only" three passes (input transform, batched GEMM, output transform) instead of four.
I didn't quite understand the factor of 2x and more (!) speed-up of fp16 versus fp32, and for fp32 Winograd seems a no-brainer (almost a factor of 2 faster, isn't it?). The other things are beyond me; I don't know how cheap extra math comes relative to reading and writing memory, etc.