Code: Select all
with cudnn 7.3 and 411.63 driver available at nvidia.com
minibatch-size=512, network id: 11250, go nodes 1000000
            fp32   fp16
Titan V:    13295  29379
RTX 2080Ti: 12208  32472
Thank you!

jkiliani wrote: ↑Thu Sep 20, 2018 10:10 pm
Ankan posted Lc0 benchmarks for the RTX 2080 Ti on Leela Discord today, since nondisclosure clauses regarding benchmarks of those are no longer in force now that the hardware is released. So, the (top) RTX card actually outperforms a Titan V for Lc0 when using fp16. Ankan will also post some benchmarks for the RTX 2080 soon.

Code: Select all
with cudnn 7.3 and 411.63 driver available at nvidia.com
minibatch-size=512, network id: 11250, go nodes 1000000
            fp32   fp16
Titan V:    13295  29379
RTX 2080Ti: 12208  32472
Small update, Ankan added the 2080 to his benchmarks:
Code: Select all
with cudnn 7.3 and 411.63 driver available at nvidia.com
minibatch-size=512, network id: 11250, go nodes 1000000
            fp32   fp16
Titan V:    13295  29379
RTX 2080:    9708  26678
RTX 2080Ti: 12208  32472
Thank you very much for the explanation!

jkiliani wrote: ↑Fri Sep 21, 2018 8:27 am
Small update, Ankan added the 2080 to his benchmarks. About fp32 and fp16: this is the calculation precision of the neural network inference. fp32 refers to 32-bit floats, fp16 to 16-bit floats. It has been experimentally confirmed that the reduced floating-point accuracy of 16-bit NN inference does not significantly reduce playing strength for Lc0. However, there is not much point with GTX 10xx GPUs, since those are not optimised for fp16. The RTX cards, on the other hand, are; in their case fp16 gains a large amount of speed, as can be seen from these benchmarks.

Code: Select all
with cudnn 7.3 and 411.63 driver available at nvidia.com
minibatch-size=512, network id: 11250, go nodes 1000000
            fp32   fp16
Titan V:    13295  29379
RTX 2080:    9708  26678
RTX 2080Ti: 12208  32472

As for how to use it, you initialise Lc0 with "backend=cudnn-fp16" instead of "backend=cudnn".
Presumably, if we keep going down in precision, there will be a penalty at some point, as the weights won't be as precise.

jkiliani wrote: ↑Fri Sep 21, 2018 8:27 am
Small update, Ankan added the 2080 to his benchmarks. About fp32 and fp16: this is the calculation precision of the neural network inference. fp32 refers to 32-bit floats, fp16 to 16-bit floats. It has been experimentally confirmed that the reduced floating-point accuracy of 16-bit NN inference does not significantly reduce playing strength for Lc0. However, there is not much point with GTX 10xx GPUs, since those are not optimised for fp16. The RTX cards, on the other hand, are; in their case fp16 gains a large amount of speed, as can be seen from these benchmarks.

Code: Select all
with cudnn 7.3 and 411.63 driver available at nvidia.com
minibatch-size=512, network id: 11250, go nodes 1000000
            fp32   fp16
Titan V:    13295  29379
RTX 2080:    9708  26678
RTX 2080Ti: 12208  32472

As for how to use it, you initialise Lc0 with "backend=cudnn-fp16" instead of "backend=cudnn".
It is just another UCI option:
Code: Select all
0.343: < option name Network weights file path type string default <autodiscover>
0.343: < option name Number of worker threads type spin default 2 min 1 max 128
0.343: < option name NNCache size type spin default 200000 min 0 max 999999999
0.343: < option name NN backend to use type combo default cudnn var cudnn var cudnn-fp16 var check var random var multiplexing
0.343: < option name NN backend parameters type string default
0.343: < option name Scale thinking time type string default 2.400000
0.343: < option name Move time overhead in milliseconds type spin default 100 min 0 max 10000
0.343: < option name Time weight curve peak ply type string default 26.200001
0.343: < option name Time weight curve width left of peak type string default 82.000000
0.343: < option name Time weight curve width right of peak type string default 74.000000
0.343: < option name List of Syzygy tablebase directories type string default
0.343: < option name Ponder type check default false
0.343: < option name Minibatch size for NN inference type spin default 256 min 1 max 1024
0.343: < option name Max prefetch nodes, per NN call type spin default 32 min 0 max 1024
0.343: < option name Cpuct MCTS option type string default 3.400000
0.343: < option name Initial temperature type string default 0.000000
0.343: < option name Moves with temperature decay type spin default 0 min 0 max 100
0.343: < option name Add Dirichlet noise at root node type check default false
0.343: < option name Display verbose move stats type check default false
0.343: < option name Aversion to search if change unlikely type string default 1.330000
0.343: < option name First Play Urgency Reduction type string default 0.900000
0.343: < option name Length of history to include in cache type spin default 1 min 0 max 7
0.343: < option name Policy softmax temperature type string default 2.200000
0.343: < option name Allowed node collisions, per batch type spin default 32 min 0 max 1024
0.343: < option name Out-of-order cache backpropagation type check default false
0.343: < option name Ignore alternatives to checkmate type check default false
0.343: < option name Configuration file path type string default lc0.config
0.343: < option name Do debug logging into file type string default
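Since the backend is exposed as an ordinary UCI option (the "NN backend to use" combo in the list above), a GUI or driver script just sends a setoption command during the handshake. A minimal sketch, assuming an lc0 binary on PATH and that option name (which may differ between Lc0 versions):

```python
# Sketch: the UCI commands a driver would send to select the fp16 backend.
# The option name "NN backend to use" is taken from the option list above.

def uci_commands(backend="cudnn-fp16"):
    """Build the UCI handshake that switches Lc0 to the given backend."""
    return [
        "uci",
        f"setoption name NN backend to use value {backend}",
        "isready",
    ]

print("\n".join(uci_commands()))

# To actually drive Lc0 (requires the binary on PATH):
# import subprocess
# subprocess.run(["lc0"], text=True,
#                input="\n".join(uci_commands()) + "\nquit\n")
```

The same mechanism works for any of the options listed, e.g. the minibatch size used in the benchmarks.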
The penalty for going from fp32 to fp16 was found to be small enough to be easily compensated by the speed increase. There were also some preliminary experiments with int8 inference, which the RTX cards support as well, but at least there the accuracy loss was severe enough to cost considerable strength. For fp16 we're fine (it has actually been used both for TCEC bonus games and CCCC).
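The precision gap between the formats is easy to see with the standard library: Python's struct module can round-trip a value through IEEE-754 half and single precision. This is a toy illustration of the formats' resolution, not Lc0's actual inference path:

```python
# Toy illustration of fp16 vs fp32 rounding error on a single weight value.
# Not Lc0's inference path; just the resolution of the two formats.
import struct

def roundtrip(fmt, x):
    """Round x through an IEEE-754 format: 'e' = half, 'f' = single."""
    return struct.unpack(fmt, struct.pack(fmt, x))[0]

w = 0.123456789  # a typical-magnitude NN weight
err16 = abs(roundtrip('e', w) - w) / w
err32 = abs(roundtrip('f', w) - w) / w
print(f"fp16 relative error: {err16:.2e}")  # on the order of 1e-4
print(f"fp32 relative error: {err32:.2e}")  # on the order of 1e-8
```

fp16 keeps about 11 significant bits versus fp32's 24, which is evidently still enough for NN inference here, while int8 drops well below that.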
Code: Select all
            fp32   fp16
GTX 1080Ti:  8996      -
Titan V:    13295  29379
RTX 2080:    9708  26678
RTX 2080Ti: 12208  32472
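For reference, the fp16/fp32 ratios implied by that table work out to roughly 2.2x on the Titan V and about 2.7x on both RTX cards:

```python
# fp16 speedup factors implied by the benchmark numbers quoted above
bench = {  # gpu: (fp32 nps, fp16 nps)
    "Titan V":    (13295, 29379),
    "RTX 2080":   (9708, 26678),
    "RTX 2080Ti": (12208, 32472),
}
for gpu, (fp32, fp16) in bench.items():
    print(f"{gpu}: {fp16 / fp32:.2f}x")
```

Note that the 2080 Ti trails the Titan V at fp32 but overtakes it at fp16, which is the mode that matters for Lc0 on these cards.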