
How good is the RTX 2080 Ti for Leela?

Posted: Sat Sep 15, 2018 11:12 pm
by Hai
How good is the RTX 2080 Ti for Leela?

Re: How good is the RTX 2080 Ti for Leela?

Posted: Sun Sep 16, 2018 1:43 am
by Robert Pope
Nobody has one, so nobody knows. According to Milos, it's only going to gain from the clock speed, so maybe 10-15% faster. We'll know for sure once they're actually on the market.

Re: How good is the RTX 2080 Ti for Leela?

Posted: Sun Sep 16, 2018 5:46 am
by ankan
It should be very similar to a Titan V for lc0.

It has tensor cores enabled, and its peak fp16 tensor math throughput is almost exactly the same as a Titan V's (114 TFLOPS vs 110 TFLOPS):
https://www.anandtech.com/show/13282/nv ... eep-dive/6

I have one, but I can't post any benchmarks before reviews are out :)

Note that right now lc0 can't make use of int8 (or int4) math, but Google did it with A0 on their TPUs, so it's something the lc0 team wants to try in the future. If successful, we hope to get another 2x speedup.
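To give a feel for what int8 inference involves, below is a toy sketch of symmetric per-tensor weight quantization, the basic trick behind int8 math. Everything here is made up for illustration; it's not lc0 code, and a real int8 path would also need calibrated scales for the activations:

```cpp
// Toy sketch of symmetric int8 weight quantization (illustrative only,
// not lc0 code). Each fp32 weight is mapped to [-127, 127] with one
// per-tensor scale; the scale is kept to dequantize results later.
#include <algorithm>
#include <cmath>
#include <cstdint>
#include <cstdio>
#include <vector>

// Quantize fp32 weights to int8; returns the scale used.
float quantize_int8(const std::vector<float>& w, std::vector<int8_t>& q) {
    float max_abs = 0.0f;
    for (float x : w) max_abs = std::max(max_abs, std::fabs(x));
    float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    q.resize(w.size());
    for (size_t i = 0; i < w.size(); ++i)
        q[i] = static_cast<int8_t>(std::lround(w[i] / scale));
    return scale;
}

int main() {
    std::vector<float> w = {0.8f, -0.31f, 0.05f, -1.2f};
    std::vector<int8_t> q;
    float scale = quantize_int8(w, q);
    // Round-trip check: the dequantized values show the precision you
    // trade away for the faster integer math.
    for (size_t i = 0; i < w.size(); ++i)
        std::printf("%+.3f -> %4d -> %+.3f\n", w[i], q[i], q[i] * scale);
    return 0;
}
```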

Re: How good is the RTX 2080 Ti for Leela?

Posted: Mon Sep 17, 2018 6:05 am
by Werewolf
ankan wrote: Sun Sep 16, 2018 5:46 am It should be very similar to a Titan V for lc0.

It has tensor cores enabled, and its peak fp16 tensor math throughput is almost exactly the same as a Titan V's (114 TFLOPS vs 110 TFLOPS):
https://www.anandtech.com/show/13282/nv ... eep-dive/6

I have one, but I can't post any benchmarks before reviews are out :)

Note that right now lc0 can't make use of int8 (or int4) math, but Google did it with A0 on their TPUs, so it's something the lc0 team wants to try in the future. If successful, we hope to get another 2x speedup.
I'm not accusing you of lying, but why would Nvidia cripple the CUDA cores on the 2080 Ti for FP16 (presumably to protect Quadro) and then allow the tensor cores to run at full speed?

In a week or two Lc0's speed on this card will finally be revealed. I hope you're right.

Re: How good is the RTX 2080 Ti for Leela?

Posted: Mon Sep 17, 2018 6:25 am
by ankan
Werewolf wrote: Mon Sep 17, 2018 6:05 am I'm not accusing you of lying, but why would Nvidia cripple the CUDA cores on the 2080 Ti for FP16 (presumably to protect Quadro) and then allow the tensor cores to run at full speed?

In a week or two Lc0's speed on this card will finally be revealed. I hope you're right.
I don't know where people got the rumor that Nvidia crippled non-tensor fp16 math on the 2080 Ti.

See pages 8 and 9 of this document for the full specs:
https://www.nvidia.com/content/dam/en-z ... epaper.pdf

The only thing that differs from Quadro is "Peak FP16 Tensor TFLOPS with FP32 Accumulate", which lc0 doesn't use.

Milos has no idea what he is talking about...

Re: How good is the RTX 2080 Ti for Leela?

Posted: Mon Sep 17, 2018 6:37 am
by Werewolf
ankan wrote: Mon Sep 17, 2018 6:25 am
I don't know where people got the rumor that Nvidia crippled non-tensor fp16 math on the 2080 Ti.

Milos has no idea what he is talking about...
But it's not from Milos, it's from Wikipedia:

https://en.wikipedia.org/wiki/List_of_N ... _20_series

Unless you're saying the wiki page is wrong, the CUDA cores are crippled for FP16. In addition, there's the debate as to whether Lc0 can use tensor cores at all, but that's not something I know about.

Re: How good is the RTX 2080 Ti for Leela?

Posted: Mon Sep 17, 2018 9:26 am
by Error323
Werewolf wrote: Mon Sep 17, 2018 6:37 am
ankan wrote: Mon Sep 17, 2018 6:25 am
I don't know where people got the rumor that Nvidia crippled non-tensor fp16 math on the 2080 Ti.

Milos has no idea what he is talking about...
But it's not from Milos, it's from Wikipedia:

https://en.wikipedia.org/wiki/List_of_N ... _20_series

Unless you're saying the wiki page is wrong, the CUDA cores are crippled for FP16. In addition, there's the debate as to whether Lc0 can use tensor cores at all, but that's not something I know about.
It's not the CUDA cores that will be doing the FP16 computations, but the tensor cores. They are specifically designed for neural network inference, because that's what the new ray-tracing technique uses to make it work in real time. Fortunately for us, those cores are perfect for Lc0, as we use a very similar neural network architecture for chess (convolutional layers).
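To make that concrete, here is roughly how a convolution gets routed to the tensor cores in cuDNN, the library lc0's backend sits on. This is a descriptor-setup sketch only (no forward pass); the shapes are lc0-ish guesses (8x8 board, 256 channels), error checks are omitted, and it is not the actual lc0 backend:

```cpp
// Sketch: opting an fp16 convolution into tensor cores with cuDNN 7+.
// Descriptor setup only; shapes are illustrative guesses, not lc0 code.
#include <cudnn.h>
#include <cstdio>

int main() {
    cudnnHandle_t handle;
    cudnnCreate(&handle);

    cudnnTensorDescriptor_t in_desc;
    cudnnFilterDescriptor_t filt_desc;
    cudnnConvolutionDescriptor_t conv_desc;
    cudnnCreateTensorDescriptor(&in_desc);
    cudnnCreateFilterDescriptor(&filt_desc);
    cudnnCreateConvolutionDescriptor(&conv_desc);

    // fp16 input: batch 1, 256 channels, 8x8 "board".
    cudnnSetTensor4dDescriptor(in_desc, CUDNN_TENSOR_NCHW, CUDNN_DATA_HALF,
                               1, 256, 8, 8);
    // 256 fp16 filters of shape 256x3x3.
    cudnnSetFilter4dDescriptor(filt_desc, CUDNN_DATA_HALF, CUDNN_TENSOR_NCHW,
                               256, 256, 3, 3);
    // 3x3 convolution, pad 1, stride 1, dilation 1, fp16 compute.
    cudnnSetConvolution2dDescriptor(conv_desc, 1, 1, 1, 1, 1, 1,
                                    CUDNN_CROSS_CORRELATION, CUDNN_DATA_HALF);

    // The key call: allow cuDNN to dispatch this convolution to tensor cores.
    cudnnStatus_t s = cudnnSetConvolutionMathType(conv_desc, CUDNN_TENSOR_OP_MATH);
    std::printf("tensor op math: %s\n", cudnnGetErrorString(s));

    cudnnDestroyConvolutionDescriptor(conv_desc);
    cudnnDestroyFilterDescriptor(filt_desc);
    cudnnDestroyTensorDescriptor(in_desc);
    cudnnDestroy(handle);
    return 0;
}
```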

Also, you should listen to ankan, he's got that 2080 for a reason ;) And he wrote our cudnn backend!

Re: How good is the RTX 2080 Ti for Leela?

Posted: Mon Sep 17, 2018 10:40 am
by Werewolf
Error323 wrote: Mon Sep 17, 2018 9:26 am
It's not the CUDA cores that will be doing the FP16 computations, but the tensor cores. They are specifically designed for neural network inference, because that's what the new ray-tracing technique uses to make it work in real time. Fortunately for us, those cores are perfect for Lc0, as we use a very similar neural network architecture for chess (convolutional layers).

Also, you should listen to ankan, he's got that 2080 for a reason ;) And he wrote our cudnn backend!
Well, if that's correct it's great news for everyone; I'm not complaining!
However, there do seem to be some differences in the CUDA cores between Quadro and GeForce.

Re: How good is the RTX 2080 Ti for Leela?

Posted: Mon Sep 17, 2018 12:25 pm
by Werewolf
Werewolf wrote: Mon Sep 17, 2018 10:40 am
However, there do seem to be some differences in the CUDA cores between Quadro and GeForce.
Forget that comment; I see why it's wrong now.

What's FP16 accumulate?

Re: How good is the RTX 2080 Ti for Leela?

Posted: Mon Sep 17, 2018 5:29 pm
by ankan
Werewolf wrote: Mon Sep 17, 2018 12:25 pm
Werewolf wrote: Mon Sep 17, 2018 10:40 am
However, there do seem to be some differences in the CUDA cores between Quadro and GeForce.
Forget that comment; I see why it's wrong now.

What's FP16 accumulate?
Tensor cores perform small matrix multiplies and accumulates. See https://devblogs.nvidia.com/programming ... es-cuda-9/ for more details.
They support two modes: either you do everything in fp16, or you do the multiply in fp16 and the accumulation in fp32. From the whitepaper, it seems that on the gaming cards (RTX 20xx) the performance of the fp32-accumulate mode has been cut in half compared to the Quadro cards. AFAIK, 32-bit accumulation is more useful for training; for inference, doing everything in fp16 is generally sufficient (and that's what we use for lc0).