Lc0 speedup using CUDA 11

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Lc0 speedup using CUDA 11

Post by zullil »

I seem to get about a 15% gain in nps by upgrading to the latest CUDA 11 runtime. But do not upgrade to cuDNN v8.0.1 RC2 (June 26th, 2020) for CUDA 11.0; that one reduces nps by about 50%. Stick with cuDNN 7.6.5.

Code: Select all

$ ./lc0 benchmark --num-positions=1
       _
|   _ | |
|_ |_ |_| v0.26.0 built Jul  9 2020
Found pb network file: ./T40B.4-160
Creating backend [cudnn-auto]...
Switching to [cudnn-fp16]...
CUDA Runtime version: 11.0.0
Cudnn version: 7.6.5
Latest version of CUDA supported by the driver: 11.0.0
GPU: GeForce RTX 2080 Ti
GPU memory: 10.7534 Gb
GPU clock frequency: 1635 MHz
GPU compute capability: 7.5

Position: 1/1 rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1
Benchmark time 31ms, 3 nodes, 272 nps, move e2e4
Benchmark time 34ms, 5 nodes, 357 nps, move e2e4
Benchmark time 37ms, 10 nodes, 625 nps, move d2d4
Benchmark time 39ms, 18 nodes, 947 nps, move e2e4
Benchmark time 41ms, 27 nodes, 1285 nps, move d2d4
Benchmark time 45ms, 46 nodes, 1916 nps, move e2e4
Benchmark time 47ms, 75 nodes, 2777 nps, move e2e4
Benchmark time 52ms, 97 nodes, 3031 nps, move e2e4
Benchmark time 57ms, 135 nodes, 3648 nps, move e2e4
Benchmark time 59ms, 153 nodes, 3923 nps, move e2e4
Benchmark time 64ms, 183 nodes, 4159 nps, move e2e4
Benchmark time 70ms, 247 nodes, 4940 nps, move e2e4
Benchmark time 76ms, 336 nodes, 6000 nps, move e2e4
Benchmark time 82ms, 402 nodes, 6590 nps, move e2e4
Benchmark time 87ms, 467 nodes, 6970 nps, move e2e4
Benchmark time 94ms, 493 nodes, 6662 nps, move e2e4
Benchmark time 99ms, 569 nodes, 7202 nps, move e2e4
Benchmark time 105ms, 578 nodes, 6880 nps, move e2e4
Benchmark time 107ms, 584 nodes, 6712 nps, move e2e4
Benchmark time 110ms, 588 nodes, 6533 nps, move e2e4
Benchmark time 112ms, 645 nodes, 7010 nps, move e2e4
Benchmark time 116ms, 744 nodes, 7750 nps, move e2e4
Benchmark time 125ms, 884 nodes, 8419 nps, move e2e4
Benchmark time 133ms, 1039 nodes, 9194 nps, move e2e4
Benchmark time 143ms, 1276 nodes, 10373 nps, move e2e4
Benchmark time 151ms, 1506 nodes, 11584 nps, move e2e4
Benchmark time 163ms, 1802 nodes, 12601 nps, move e2e4
Benchmark time 176ms, 2126 nodes, 13628 nps, move e2e4
Benchmark time 194ms, 2651 nodes, 15235 nps, move e2e4
Benchmark time 200ms, 2905 nodes, 16138 nps, move e2e4
Benchmark time 210ms, 3230 nodes, 17000 nps, move e2e4
Benchmark time 219ms, 3582 nodes, 18000 nps, move e2e4
Benchmark time 229ms, 3917 nodes, 18741 nps, move e2e4
Benchmark time 236ms, 4158 nodes, 19250 nps, move e2e4
Benchmark time 241ms, 4364 nodes, 19746 nps, move e2e4
Benchmark time 252ms, 4742 nodes, 20439 nps, move e2e4
Benchmark time 262ms, 5088 nodes, 21112 nps, move e2e4
Benchmark time 273ms, 5502 nodes, 21747 nps, move e2e4
Benchmark time 287ms, 6058 nodes, 22774 nps, move e2e4
Benchmark time 303ms, 6726 nodes, 23766 nps, move e2e4
Benchmark time 312ms, 7110 nodes, 24432 nps, move e2e4
Benchmark time 321ms, 7471 nodes, 24820 nps, move e2e4
Benchmark time 356ms, 8874 nodes, 26410 nps, move e2e4
Benchmark time 374ms, 9605 nodes, 27132 nps, move e2e4
Benchmark time 394ms, 10426 nodes, 27877 nps, move e2e4
Benchmark time 413ms, 11097 nodes, 28236 nps, move e2e4
Benchmark time 430ms, 11596 nodes, 28352 nps, move e2e4
Benchmark time 445ms, 12200 nodes, 28705 nps, move e2e4
Benchmark time 461ms, 12985 nodes, 29444 nps, move e2e4
Benchmark time 526ms, 16009 nodes, 31638 nps, move e2e4
Benchmark time 544ms, 16871 nodes, 32196 nps, move e2e4
Benchmark time 620ms, 20329 nodes, 33938 nps, move e2e4
Benchmark time 642ms, 21030 nodes, 33810 nps, move e2e4
Benchmark time 766ms, 26992 nodes, 36182 nps, move e2e4
Benchmark time 971ms, 37556 nodes, 39491 nps, move e2e4
Benchmark time 1053ms, 41909 nodes, 40570 nps, move e2e4
Benchmark time 1394ms, 60623 nodes, 44121 nps, move e2e4
Benchmark time 1748ms, 80245 nodes, 46438 nps, move e2e4
Benchmark time 2120ms, 100292 nodes, 47758 nps, move e2e4
Benchmark time 2641ms, 131050 nodes, 50000 nps, move e2e4
Benchmark time 2904ms, 145050 nodes, 50294 nps, move e2e4
Benchmark time 4920ms, 271921 nodes, 55494 nps, move e2e4
Benchmark time 4983ms, 275155 nodes, 55441 nps, move e2e4
Benchmark time 7695ms, 438956 nodes, 57192 nps, move e2e4
Benchmark time 8303ms, 474040 nodes, 57230 nps, move e2e4
Benchmark time 10000ms, 570464 nodes, 57160 nps, move e2e4
bestmove e2e4

===========================
Total time (ms) : 10014
Nodes searched  : 570976
Nodes/second    : 57012
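For anyone comparing runs across CUDA/cuDNN versions, the "Benchmark time …" lines above have a fixed format and can be parsed mechanically. A minimal sketch (the regex and helper function are my own, not part of lc0):

```python
import re

# Matches lines like: "Benchmark time 10000ms, 570464 nodes, 57160 nps, move e2e4"
LINE_RE = re.compile(
    r"Benchmark time (?P<ms>\d+)ms, (?P<nodes>\d+) nodes, "
    r"(?P<nps>\d+) nps, move (?P<move>\S+)"
)

def parse_benchmark_line(line):
    """Return (time_ms, nodes, nps, move), or None if the line doesn't match."""
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    return int(m["ms"]), int(m["nodes"]), int(m["nps"]), m["move"]

sample = "Benchmark time 10000ms, 570464 nodes, 57160 nps, move e2e4"
print(parse_benchmark_line(sample))  # (10000, 570464, 57160, 'e2e4')
```

Feeding the whole log through this and keeping the last matching line gives the steady-state nps for each build, which is the number worth comparing.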
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: Lc0 speedup using CUDA 11

Post by corres »

zullil wrote: Thu Jul 09, 2020 6:37 pm I seem to get about a 15% gain in nps by upgrading to the latest CUDA 11 runtime. But do not upgrade to cuDNN v8.0.1 RC2 (June 26th, 2020) for CUDA 11.0; that one reduces nps by about 50%. Stick with cuDNN 7.6.5.

[quoted benchmark output snipped]
What percentage is the speedup?
Is the default source code appropriate for CUDA 11?
Which NVIDIA driver version did you use?
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Lc0 speedup using CUDA 11

Post by zullil »

corres wrote: Fri Jul 10, 2020 1:49 pm
zullil wrote: Thu Jul 09, 2020 6:37 pm I seem to get about a 15% gain in nps by upgrading to the latest CUDA 11 runtime. But do not upgrade to cuDNN v8.0.1 RC2 (June 26th, 2020) for CUDA 11.0; that one reduces nps by about 50%. Stick with cuDNN 7.6.5.

[quoted benchmark output snipped]
What percentage is the speedup?
Is the default source code appropriate for CUDA 11?
Which NVIDIA driver version did you use?
I haven't done precise testing; my estimate is that the speedup is between 10% and 15%.
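For a rough sense of what "between 10% and 15%" means in nps terms, the relative gain is just a ratio (the nps figures below are illustrative, not measurements):

```python
def percent_gain(old_nps, new_nps):
    """Relative speedup in percent: (new - old) / old * 100."""
    return (new_nps - old_nps) / old_nps * 100

# Illustrative numbers only, roughly in the range reported in this thread.
print(round(percent_gain(50000, 55000), 1))  # 10.0  -> a 10% gain
print(round(percent_gain(50000, 57500), 1))  # 15.0  -> a 15% gain
```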

Code: Select all

CUDA Toolkit     Linux x86_64 Driver Version     Windows x86_64 Driver Version
CUDA 11.0.194    >= 450.51.05                    >= 451.48
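The minimum-driver check implied by the table is a simple component-wise version comparison. A sketch (my own hypothetical helper, assuming dotted numeric version strings like NVIDIA's):

```python
def version_tuple(v):
    """Turn a driver version string like '450.51.05' into a comparable tuple."""
    return tuple(int(part) for part in v.split("."))

def driver_ok(installed, required):
    """True if the installed driver meets the minimum required version."""
    return version_tuple(installed) >= version_tuple(required)

# Minimums from the CUDA 11.0.194 row above; installed versions are examples.
print(driver_ok("450.51.05", "450.51.05"))  # True
print(driver_ok("440.33.01", "450.51.05"))  # False
```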
I built Lc0 v0.26.0 from source after installing CUDA 11, without changing the source code or build script in any way.
MOBMAT
Posts: 385
Joined: Sat Feb 04, 2017 11:57 pm
Location: USA

Re: Lc0 speedup using CUDA 11

Post by MOBMAT »

zullil wrote: Fri Jul 10, 2020 5:25 pm
corres wrote: Fri Jul 10, 2020 1:49 pm
zullil wrote: Thu Jul 09, 2020 6:37 pm I seem to get about a 15% gain in nps by upgrading to the latest CUDA 11 runtime. But do not upgrade to cuDNN v8.0.1 RC2 (June 26th, 2020) for CUDA 11.0; that one reduces nps by about 50%. Stick with cuDNN 7.6.5.

[quoted benchmark output snipped]
What percentage is the speedup?
Is the default source code appropriate for CUDA 11?
Which NVIDIA driver version did you use?
I haven't done precise testing; my estimate is that the speedup is between 10% and 15%.

Code: Select all

CUDA Toolkit     Linux x86_64 Driver Version     Windows x86_64 Driver Version
CUDA 11.0.194    >= 450.51.05                    >= 451.48
I built Lc0 v0.26.0 from source after installing CUDA 11, without changing the source code or build script in any way.
I installed CUDA Toolkit 11 and downloaded the latest version of Lc0 (v0.26.0). When I start it, it still reports "CUDA Runtime version: 10.0.0", not 11, even though the output says "Latest version of CUDA supported by the driver: 11.0.0". Did I miss something?
i7-6700K @ 4.00Ghz 32Gb, Win 10 Home, EGTBs on PCI SSD
Benchmark: Stockfish15.1 NNUE x64 bmi2 (nps): 1277K
zullil
Posts: 6442
Joined: Tue Jan 09, 2007 12:31 am
Location: PA USA
Full name: Louis Zulli

Re: Lc0 speedup using CUDA 11

Post by zullil »

MOBMAT wrote: Tue Jul 14, 2020 4:10 pm
zullil wrote: Fri Jul 10, 2020 5:25 pm
corres wrote: Fri Jul 10, 2020 1:49 pm
zullil wrote: Thu Jul 09, 2020 6:37 pm I seem to get about a 15% gain in nps by upgrading to the latest CUDA 11 runtime. But do not upgrade to cuDNN v8.0.1 RC2 (June 26th, 2020) for CUDA 11.0; that one reduces nps by about 50%. Stick with cuDNN 7.6.5.

[quoted benchmark output snipped]
What percentage is the speedup?
Is the default source code appropriate for CUDA 11?
Which NVIDIA driver version did you use?
I haven't done precise testing; my estimate is that the speedup is between 10% and 15%.

Code: Select all

CUDA Toolkit     Linux x86_64 Driver Version     Windows x86_64 Driver Version
CUDA 11.0.194    >= 450.51.05                    >= 451.48
I built Lc0 v0.26.0 from source after installing CUDA 11, without changing the source code or build script in any way.
I installed CUDA Toolkit 11 and downloaded the latest version of Lc0 (v0.26.0). When I start it, it still reports "CUDA Runtime version: 10.0.0", not 11, even though the output says "Latest version of CUDA supported by the driver: 11.0.0". Did I miss something?
I think you'd need to build Lc0 from source. The existing binaries are built against CUDA 10.0.