Checking the backends with the new lc0 binary

Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Checking the backends with the new lc0 binary

Post by Laskos »

Lc0 v0.26.3-rc1 comes with the new CUDA backend, so I decided to check all the best backends on my PC with an RTX 2070 GPU.

The net is J92-190 (30 blocks x 384 filters), and the default backend with my GPU is cuDNN FP16 with custom_winograd=true enabled. The benchmarks are here:

cuDNN FP16 (default)
lc0_v263rc1.exe benchmark --minibatch-size=240
Total time (ms) : 342097
Nodes searched : 2372484
Nodes/second : 6935

CUDA FP16
lc0_v263rc1.exe benchmark --backend=cuda-fp16 --minibatch-size=240
Total time (ms) : 342239
Nodes searched : 2122476
Nodes/second : 6202

DX12
lc0_v263rc1_dx.exe benchmark --minibatch-size=240
Total time (ms) : 341409
Nodes searched : 3077528
Nodes/second : 9014
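
For anyone who wants to cross-check the figures: Nodes/second is just nodes searched divided by elapsed time. A minimal Python sketch that recomputes it from the three runs above (the helper name `nps` is only for illustration, it is not part of lc0):

```python
def nps(nodes: int, total_ms: int) -> int:
    """Nodes per second, rounded to the nearest integer."""
    return round(nodes * 1000 / total_ms)

# (nodes searched, total time in ms) copied from the benchmark output above
runs = {
    "cudnn-fp16": (2372484, 342097),
    "cuda-fp16": (2122476, 342239),
    "dx12": (3077528, 341409),
}

for name, (nodes, ms) in runs.items():
    print(f"{name:11s} {nps(nodes, ms):6d} nps")
```

Since all three runs last about 342 seconds, the node counts alone already give the same ranking.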

Note the excellent result of the DX12 backend, which by NPS is clearly faster than the other two (about 30% above cuDNN FP16 and about 45% above CUDA FP16). A glitch occurred with this command line:

lc0_v263rc1.exe benchmark --backend=cudnn-fp16 --minibatch-size=240

which sometimes exits with this error message:

Position: 1/34 rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Unhandled exception in worker thread: CUDA error: an illegal memory access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:789)
Unhandled exception in worker thread:

=============================

To also check the playing strength:

300 games in total at 15s + 0.25s, round robin (200 games per engine):

Rank Name                          Elo     +/-   Games   Score    Draw
   1 DX12                           26      20     200   53.8%   55.5%
   2 cuda_fp16                      -3      20     200   49.5%   53.0%
   3 cudnn_fp16                    -23      21     200   46.8%   50.5%

Finished match
DX12 seems to be above cuDNN FP16 by more than the error margins. I don't know why the DX12 backend is not used much more often with these large 30x384 nets. Does DX12 work well on AMD GPUs too? Can they be competitive with NVIDIA?
AdminX
Posts: 6339
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: Checking the backends with the new lc0 binary

Post by AdminX »

Wow! :shock:
"Good decisions come from experience, and experience comes from bad decisions."
__________________________________________________________________
Ted Summers
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Checking the backends with the new lc0 binary

Post by Laskos »

A further check that the DX12 backend is the best on my PC (RTX 2070 GPU) with the J92-190 30x384 net:

400 games 15s + 0.25s:

Score of cudnn_fp16 vs DX12: 67 - 98 - 235  [0.461] 400
...      cudnn_fp16 playing White: 64 - 7 - 129  [0.642] 200
...      cudnn_fp16 playing Black: 3 - 91 - 106  [0.280] 200
...      White vs Black: 155 - 10 - 235  [0.681] 400
Elo difference: -27.0 +/- 14.8, LOS: 0.0 %, DrawRatio: 58.8 %
Finished match
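
The score and Elo difference in that summary can be recomputed from the win/loss/draw counts. A small Python sketch, assuming the usual logistic Elo formula (the thread does not say which formula the match tool uses, and the function names are only illustrative). It does not reproduce the error margin or LOS, which depend on details the output does not show:

```python
import math

def score(wins: int, losses: int, draws: int) -> float:
    """Fraction of points scored, counting a draw as half a point."""
    games = wins + losses + draws
    return (wins + 0.5 * draws) / games

def elo_from_score(s: float) -> float:
    """Logistic Elo difference for a score fraction s with 0 < s < 1."""
    return -400.0 * math.log10(1.0 / s - 1.0)

# cudnn_fp16 vs DX12: 67 wins, 98 losses, 235 draws
w, l, d = 67, 98, 235
s = score(w, l, d)
print(f"score {s:.3f}, Elo {elo_from_score(s):+.1f}, draw ratio {d / (w + l + d):.3f}")
```

The same `elo_from_score` applied to the 53.8% score in the round-robin table gives roughly +26, in line with the Elo column there.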
AdminX
Posts: 6339
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: Checking the backends with the new lc0 binary

Post by AdminX »

Laskos wrote: Thu Oct 01, 2020 1:50 pm
To remark the excellent result of DX12 backend, which seems by NPS vastly superior to the other two. A glitch occurred with this command line:

lc0_v263rc1.exe benchmark --backend=cudnn-fp16 --minibatch-size=240

which sometimes exits with this error message:

Position: 1/34 rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Unhandled exception in worker thread: CUDA error: an illegal memory access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:789)
Unhandled exception in worker thread:
I found it works better with this format:

lc0_v263rc1.exe benchmark --minibatch-size=240 --backend=cudnn-fp16

As you can see, all I did was swap the order of the two arguments.
AdminX
Posts: 6339
Joined: Mon Mar 13, 2006 2:34 pm
Location: Acworth, GA

Re: Checking the backends with the new lc0 binary

Post by AdminX »

I was able to replicate your results on a 2070 Super:

DX12 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341461
Nodes searched  : 3506458
Nodes/second    : 10269

Cudnn-fp16 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341514
Nodes searched  : 2762948
Nodes/second    : 8090
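
When comparing several runs like these, it can help to pull the three summary numbers out of the output automatically. A minimal Python sketch, assuming only the summary-line format shown in this thread (the function name `parse_benchmark` and the regular expression are illustrative, not part of lc0):

```python
import re

# Matches the three summary lines printed at the end of a benchmark run.
SUMMARY = re.compile(
    r"^(Total time \(ms\)|Nodes searched|Nodes/second)[ \t]*:[ \t]*(\d+)",
    re.MULTILINE,
)

def parse_benchmark(text: str) -> dict:
    """Return the summary fields found in the output as {label: integer}."""
    return {label: int(value) for label, value in SUMMARY.findall(text)}

sample = """\
===========================
Total time (ms) : 341461
Nodes searched  : 3506458
Nodes/second    : 10269
"""

print(parse_benchmark(sample))
```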
schack
Posts: 172
Joined: Thu May 27, 2010 3:32 am

Re: Checking the backends with the new lc0 binary

Post by schack »

Any sense of whether DX12 will be faster for 20x256 nets? My 2060 will struggle to keep up with the 30x384. :)
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Checking the backends with the new lc0 binary

Post by Laskos »

AdminX wrote: Thu Oct 01, 2020 10:55 pm I was able to replicate your results on 2070 Super

DX12 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341461
Nodes searched  : 3506458
Nodes/second    : 10269

Cudnn-fp16 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341514
Nodes searched  : 2762948
Nodes/second    : 8090
Thanks, I was wondering why I was the only one reporting this. If it is correct, it's pretty important, probably for AMD GPUs too.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Checking the backends with the new lc0 binary

Post by Laskos »

schack wrote: Thu Oct 01, 2020 11:08 pm Any sense of whether DX12 will be faster for 20x256 nets? My 2060 will struggle to keep up with the 30x384. :)
You can keep up with a 2060 and 30x384. Just keep in mind that with your GPU these large nets are best for slow blitz and slower time controls.
mwyoung
Posts: 2727
Joined: Wed May 12, 2010 10:00 pm

Re: Checking the backends with the new lc0 binary

Post by mwyoung »

AdminX wrote: Thu Oct 01, 2020 10:14 pm
Laskos wrote: Thu Oct 01, 2020 1:50 pm
To remark the excellent result of DX12 backend, which seems by NPS vastly superior to the other two. A glitch occurred with this command line:

lc0_v263rc1.exe benchmark --backend=cudnn-fp16 --minibatch-size=240

which sometimes exits with this error message:

Position: 1/34 rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Unhandled exception in worker thread: CUDA error: an illegal memory access was encountered (c:\projects\lc0\src\neural\cuda\network_cudnn.cc:789)
Unhandled exception in worker thread:
I found it works better with this format:

lc0_v263rc1.exe benchmark --minibatch-size=240 --backend=cudnn-fp16

As you can see all I did was invert the two arguments.
No, there is an issue with 0.26.3-rc1. I tested it the day it came out in a 200-game blitz match. It was faster, but it also crashed about 23 times in 200 games, causing a big loss in the match. I hope this will be corrected in rc2.
"The worst thing that can happen to a forum is a running wild attacking moderator(HGM) who is not corrected by the community." - Ed Schröder
But my words like silent raindrops fell. And echoed in the wells of silence.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Checking the backends with the new lc0 binary

Post by Laskos »

AdminX wrote: Thu Oct 01, 2020 10:55 pm I was able to replicate your results on 2070 Super

DX12 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341461
Nodes searched  : 3506458
Nodes/second    : 10269

Cudnn-fp16 

lc0.exe benchmark --minibatch-size=240 --threads=2 --backend-opts=gpu=0

===========================
Total time (ms) : 341514
Nodes searched  : 2762948
Nodes/second    : 8090

Someone directed me to this test version with CUDA 11.1 and cuDNN 8.0.4:
https://appveyorcidatav2.blob.core.wind ... a-cuda.zip
and replace lc0 with this one:
https://appveyorcidatav2.blob.core.wind ... ld/lc0.exe

I am getting much improved results compared with my first post (roughly 50% faster for cudnn-fp16 and roughly 70% faster for cuda-fp16):

cudnn-fp16
Total time (ms) : 341515
Nodes searched : 3547152
Nodes/second : 10386

cuda-fp16
Total time (ms) : 341370
Nodes searched : 3630548
Nodes/second : 10635

dx12
Total time (ms) : 341409
Nodes searched : 3077528
Nodes/second : 9014

cuda-fp16 now seems even faster than cudnn-fp16, and both are above DX12.
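
To put numbers on the change, here is a short Python sketch comparing the NPS from my first post with the NPS from the test build (the figures are copied from this thread; the helper name `gain_percent` is only illustrative). The dx12 figure is the same in both posts:

```python
# Nodes/second on the RTX 2070 with the J92-190 30x384 net
first_post = {"cudnn-fp16": 6935, "cuda-fp16": 6202, "dx12": 9014}
test_build = {"cudnn-fp16": 10386, "cuda-fp16": 10635, "dx12": 9014}

def gain_percent(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after / before - 1.0) * 100.0

for backend, before in first_post.items():
    after = test_build[backend]
    print(f"{backend:11s} {before:6d} -> {after:6d}  {gain_percent(before, after):+.1f}%")
```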