Checking the backends with the new lc0 binary

Laskos · Post by **Laskos** » Thu Oct 01, 2020 1:50 pm

Lc0 v0.26.3-rc1 comes with the new CUDA backend, so I decided to check the best backends on my PC with an RTX 2070 GPU.

The net is J92-190 (30 blocks x 384 filters), and the default backend with my GPU is cuDNN FP16 with custom_winograd=true enabled. The benchmarks are here:

cuDNN FP16 (default)
lc0_v263rc1.exe benchmark --minibatch-size=240
Total time (ms) : 342097
Nodes searched : 2372484
Nodes/second : 6935

CUDA FP16
lc0_v263rc1.exe benchmark --backend=cuda-fp16 --minibatch-size=240
Total time (ms) : 342239
Nodes searched : 2122476
Nodes/second : 6202

DX12
lc0_v263rc1_dx.exe benchmark --minibatch-size=240
Total time (ms) : 341409
Nodes searched : 3077528
Nodes/second : 9014
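
The Nodes/second figure is simply nodes searched divided by wall-clock time, so the three runs above can be cross-checked in a few lines. A minimal Python sketch using only the totals posted here (the `nps` helper and `RUNS` table are illustrative names, not part of lc0):

```python
# Benchmark totals copied from the three runs above
# (RTX 2070, J92-190 30x384 net): (total time in ms, nodes searched).
RUNS = {
    "cuDNN FP16": (342097, 2372484),
    "CUDA FP16": (342239, 2122476),
    "DX12": (341409, 3077528),
}


def nps(total_ms: int, nodes: int) -> float:
    """Nodes per second from total time in milliseconds and nodes searched."""
    return nodes / (total_ms / 1000.0)


for name, (total_ms, nodes) in RUNS.items():
    print(f"{name:<10} {nps(total_ms, nodes):8.0f} nps")
```

Since all three runs lasted about 342 seconds, the NPS ranking is the same as the ranking by nodes searched.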

Note the excellent result of the DX12 backend, which by NPS seems vastly superior to the other two (about 30% faster than cuDNN FP16 and about 45% faster than CUDA FP16). A glitch occurred with this command line:

lc0_v263rc1.exe benchmark --backend=cudnn-fp16 --minibatch-size=240

which sometimes exits with this error message:

Position: 1/34 rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Unhandled exception in worker thread: CUDA error: an illegal memory access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:789)
Unhandled exception in worker thread:

=============================

To check the playing strength as well:

300 games at 15s + 0.25s, round-robin (200 games per engine):

Rank Name                          Elo     +/-   Games   Score    Draw
   1 DX12                           26      20     200   53.8%   55.5%
   2 cuda_fp16                      -3      20     200   49.5%   53.0%
   3 cudnn_fp16                    -23      21     200   46.8%   50.5%

Finished match

DX12 seems to be ahead of cuDNN FP16 by more than the error margins. I don't know why the DX12 backend is not used much more often with these large 30x384 nets. Does DX12 work well on AMD GPUs too? Can they be competitive with NVIDIA?

AdminX · Post by **AdminX** » Thu Oct 01, 2020 2:22 pm

Wow!

Laskos · Post by **Laskos** » Thu Oct 01, 2020 6:40 pm

A further check that the DX12 backend is the best on my PC (RTX 2070 GPU) with the J92-190 30x384 net:

400 games 15s + 0.25s:

Score of cudnn_fp16 vs DX12: 67 - 98 - 235  [0.461] 400
...      cudnn_fp16 playing White: 64 - 7 - 129  [0.642] 200
...      cudnn_fp16 playing Black: 3 - 91 - 106  [0.280] 200
...      White vs Black: 155 - 10 - 235  [0.681] 400
Elo difference: -27.0 +/- 14.8, LOS: 0.0 %, DrawRatio: 58.8 %
Finished match
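
The score and Elo difference in that output follow from the win/loss/draw counts. A short Python sketch, assuming the standard logistic Elo formula (function names are illustrative); it leaves out the error margin and LOS, whose exact method isn't stated in the post:

```python
import math


def score_fraction(wins: int, losses: int, draws: int) -> float:
    """Points per game, counting a draw as half a point."""
    return (wins + 0.5 * draws) / (wins + losses + draws)


def elo_difference(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic model."""
    return -400.0 * math.log10(1.0 / score - 1.0)


# cudnn_fp16 vs DX12, from cudnn_fp16's side: 67 wins, 98 losses, 235 draws.
wins, losses, draws = 67, 98, 235
score = score_fraction(wins, losses, draws)
draw_ratio = draws / (wins + losses + draws)
print(f"score {score:.3f}, Elo {elo_difference(score):+.1f}, draws {draw_ratio:.1%}")
```

The same two functions can be applied to the per-color lines to see how strongly the openings used favor White.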

AdminX · Post by **AdminX** » Thu Oct 01, 2020 10:14 pm

Laskos wrote: ↑Thu Oct 01, 2020 1:50 pm
To remark the excellent result of DX12 backend, which seems by NPS vastly superior to the other two. A glitch occurred with this command line:

lc0_v263rc1.exe benchmark --backend=cudnn-fp16 --minibatch-size=240

which sometimes exits with this error message:

Position: 1/34 rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Unhandled exception in worker thread: CUDA error: an illegal memory access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:789)
Unhandled exception in worker thread:

I found it works better with this format:

lc0_v263rc1.exe benchmark --minibatch-size=240 --backend=cudnn-fp16

As you can see, all I did was swap the order of the two arguments.
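
Because the crash is described as intermittent ("sometimes exits"), one clean run with either argument order doesn't settle much; running each variant several times and counting failures gives a fairer comparison. A rough Python sketch, assuming a crashed lc0 run returns a non-zero exit code (not confirmed in this thread). `count_failures` is a hypothetical helper, and the Python interpreter stands in for lc0 so the snippet runs anywhere:

```python
import subprocess
import sys


def count_failures(cmd: list, runs: int) -> int:
    """Run cmd `runs` times and count the runs that exit with a non-zero code."""
    failures = 0
    for _ in range(runs):
        result = subprocess.run(
            cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
        )
        if result.returncode != 0:
            failures += 1
    return failures


# Stand-in commands: one always succeeds, one always fails.
ok_cmd = [sys.executable, "-c", "raise SystemExit(0)"]
bad_cmd = [sys.executable, "-c", "raise SystemExit(1)"]
print(count_failures(ok_cmd, 3), count_failures(bad_cmd, 3))

# For the real comparison, swap in the two command lines from this thread:
# ["lc0_v263rc1.exe", "benchmark", "--backend=cudnn-fp16", "--minibatch-size=240"]
# ["lc0_v263rc1.exe", "benchmark", "--minibatch-size=240", "--backend=cudnn-fp16"]
```

Each benchmark run reported here took about 340 seconds, so ten runs per variant is roughly an hour of GPU time each.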

AdminX · Post by **AdminX** » Thu Oct 01, 2020 10:55 pm

I was able to replicate your results on a 2070 Super:

DX12 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341461
Nodes searched  : 3506458
Nodes/second    : 10269

Cudnn-fp16 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341514
Nodes searched  : 2762948
Nodes/second    : 8090

schack · Post by **schack** » Thu Oct 01, 2020 11:08 pm

Any sense of whether DX12 will be faster for 20x256 nets? My 2060 will struggle to keep up with the 30x384.

Laskos · Post by **Laskos** » Thu Oct 01, 2020 11:51 pm

AdminX wrote: ↑Thu Oct 01, 2020 10:55 pm I was able to replicate your results on 2070 Super

DX12 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341461
Nodes searched  : 3506458
Nodes/second    : 10269

Cudnn-fp16 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341514
Nodes searched  : 2762948
Nodes/second    : 8090

Thanks. I was wondering why I was alone in reporting this; if correct, it's pretty important, probably for AMD GPUs too.

Laskos · Post by **Laskos** » Thu Oct 01, 2020 11:53 pm

schack wrote: ↑Thu Oct 01, 2020 11:08 pm Any sense of whether DX12 will be faster for 20x256 nets? My 2060 will struggle to keep up with the 30x384.

You can keep up with a 2060 and 30x384. Just keep in mind that with your GPU these large nets are at their best in slow blitz and slower time controls.

mwyoung · Post by **mwyoung** » Fri Oct 02, 2020 1:00 am

AdminX wrote: ↑Thu Oct 01, 2020 10:14 pm
Laskos wrote: ↑Thu Oct 01, 2020 1:50 pm
To remark the excellent result of DX12 backend, which seems by NPS vastly superior to the other two. A glitch occurred with this command line:

lc0_v263rc1.exe benchmark --backend=cudnn-fp16 --minibatch-size=240

which sometimes exits with this error message:

Position: 1/34 rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Unhandled exception in worker thread: CUDA error: an illegal memory access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:789)
Unhandled exception in worker thread:

I found it works better with this format:

lc0_v263rc1.exe benchmark --minibatch-size=240 --backend=cudnn-fp16

As you can see all I did was invert the two arguments.

No, there is an issue with 0.26.3-rc1. I tested it the day it came out in a 200-game blitz match. It was faster, but it also crashed about 23 times in 200 games, causing a big loss in the match. I hope this will be corrected in rc2.

Laskos · Post by **Laskos** » Sat Oct 03, 2020 7:00 pm

AdminX wrote: ↑Thu Oct 01, 2020 10:55 pm I was able to replicate your results on 2070 Super

DX12 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341461
Nodes searched  : 3506458
Nodes/second    : 10269

Cudnn-fp16 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341514
Nodes searched  : 2762948
Nodes/second    : 8090

Someone directed me to this test version built with CUDA 11.1 and cuDNN 8.0.4:
https://appveyorcidatav2.blob.core.wind ... a-cuda.zip
and replace lc0 with this one:
https://appveyorcidatav2.blob.core.wind ... ld/lc0.exe

I am getting much improved results for cudnn-fp16 and cuda-fp16 (about 50% and 70% faster, respectively):

cudnn-fp16
Total time (ms) : 341515
Nodes searched : 3547152
Nodes/second : 10386

cuda-fp16
Total time (ms) : 341370
Nodes searched : 3630548
Nodes/second : 10635

dx12
Total time (ms) : 341409
Nodes searched : 3077528
Nodes/second : 9014

cuda-fp16 now seems even faster than cudnn-fp16, and both are above DX12.
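
To put numbers on that comparison, the relative change between the stock v0.26.3-rc1 binary and the CUDA 11.1 test build can be worked out from the reported NPS. A small Python sketch using only figures posted in this thread (the `percent_faster` helper is an illustrative name):

```python
# NPS reported in this thread on the RTX 2070 with the J92-190 30x384 net.
OLD_NPS = {"cudnn-fp16": 6935, "cuda-fp16": 6202}    # stock v0.26.3-rc1
NEW_NPS = {"cudnn-fp16": 10386, "cuda-fp16": 10635}  # CUDA 11.1 test build
DX12_NPS = 9014


def percent_faster(new: float, old: float) -> float:
    """How much faster `new` is than `old`, in percent."""
    return (new / old - 1.0) * 100.0


for backend in OLD_NPS:
    gain = percent_faster(NEW_NPS[backend], OLD_NPS[backend])
    vs_dx12 = percent_faster(NEW_NPS[backend], DX12_NPS)
    print(f"{backend}: {gain:+.1f}% vs old build, {vs_dx12:+.1f}% vs DX12")
```

Note that the dx12 figures listed with the new results are identical to the first DX12 run in this thread, so the comparison is against that earlier number.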
