Next-Gen GPUs for LC0

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

mehmet123
Posts: 670
Joined: Sun Jan 26, 2020 10:38 pm
Location: Turkey
Full name: Mehmet Karaman

Re: Next-Gen GPUs for LC0

Post by mehmet123 »

Laskos wrote: Mon Sep 28, 2020 6:18 pm
What is "baseline" and "optimized"?
https://github.com/LeelaChessZero/lc0/pull/1428
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Next-Gen GPUs for LC0

Post by smatovic »

Milos wrote: Mon Sep 28, 2020 4:41 pm
Alayan wrote: Mon Sep 28, 2020 4:17 pm Nvidia changed the definition of CUDA cores. You need a workload that fully saturates the FP32 units to get close (but not quite) to the effect the same number of CUDA cores would have had in Turing.

1 CUDA core in Turing: 1x FP32 unit + 1x INT32 unit, able to execute concurrently
2 CUDA cores in Ampere: 1x FP32 unit + 1x (INT32 OR FP32) unit, able to execute concurrently

But more importantly, isn't Leela supposed to use FP16 operations, with most of the relevant FP16 compute on RTX cards coming from tensor cores and not from the 2xFP16 mode of the FP32 units?
Leela mainly uses FP16 multipliers from the CUDA cores. I am really not aware that this definition changed. Tensor cores are only used for 3x3 convolutions in the input layer (rather inefficiently). You can't use Tensor cores for 1x1 convolutions (which are the great majority of operations in Lc0 DNN inference), i.e. you can, but it is grossly inefficient.
Are you sure Lc0 does not use 3x3 convolutions in its CNN filters?

Further, it seems Nvidia switched from defining fixed matrix sizes for TensorCores to defining FMA throughput per TensorCore, see page 22:

https://www.nvidia.com/content/dam/en-z ... per-V1.pdf

If the TensorCores are not bound to a specific matrix size but can perform arbitrary FMA operations, that could explain the difference between the A100 and RTX 3080 numbers Werewolf posted... and give a hint about the close performance of the RTX 2080 Ti and RTX 3080 - all just my speculation, I guess only Ankan knows for sure.
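Alayan's distinction between the two architectures can be sketched with a toy per-SM issue model (my simplification of the whitepaper figures; the 64-lane counts come from Nvidia's public docs, everything else is illustrative):

```python
# Toy per-clock issue model for one SM, Turing vs Ampere (GA10x).
# int_fraction = share of the instruction stream that is INT32.
# This ignores scheduling, occupancy and memory entirely -- it only
# shows why Ampere's doubled "CUDA core" count needs a pure-FP32
# workload to pay off.

def turing_fp32_per_clock(int_fraction: float) -> float:
    # Turing SM: 64 FP32 lanes plus 64 separate INT32 lanes, so
    # INT32 work never displaces FP32 work.
    return 64.0

def ampere_fp32_per_clock(int_fraction: float) -> float:
    # GA10x SM: 64 dedicated FP32 lanes plus 64 shared lanes that run
    # either FP32 or INT32; INT32 work eats into the shared half.
    return 64.0 + 64.0 * (1.0 - int_fraction)

for f in (0.0, 0.3, 1.0):
    print(f"INT32 share {f:.0%}: Turing {turing_fp32_per_clock(f):.0f}, "
          f"Ampere {ampere_fp32_per_clock(f):.0f} FP32 ops/clock")
```

At 0% INT32 the Ampere SM issues twice as many FP32 ops per clock; with a mixed instruction stream the advantage shrinks, which would be consistent with the 2080 Ti and 3080 landing closer together than the raw core counts suggest.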

--
Srdja
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: Next-Gen GPUs for LC0

Post by Werewolf »

smatovic wrote: Tue Sep 29, 2020 10:01 am
Milos wrote: Mon Sep 28, 2020 4:41 pm
Alayan wrote: Mon Sep 28, 2020 4:17 pm Nvidia changed the definition of CUDA cores. You need a workload that fully saturates the FP32 units to get close (but not quite) to the effect the same number of CUDA cores would have had in Turing.

1 CUDA core in Turing: 1x FP32 unit + 1x INT32 unit, able to execute concurrently
2 CUDA cores in Ampere: 1x FP32 unit + 1x (INT32 OR FP32) unit, able to execute concurrently

But more importantly, isn't Leela supposed to use FP16 operations, with most of the relevant FP16 compute on RTX cards coming from tensor cores and not from the 2xFP16 mode of the FP32 units?
Leela mainly uses FP16 multipliers from the CUDA cores. I am really not aware that this definition changed. Tensor cores are only used for 3x3 convolutions in the input layer (rather inefficiently). You can't use Tensor cores for 1x1 convolutions (which are the great majority of operations in Lc0 DNN inference), i.e. you can, but it is grossly inefficient.
Are you sure Lc0 does not use 3x3 convolutions in its CNN filters?

Further, it seems Nvidia switched from defining fixed matrix sizes for TensorCores to defining FMA throughput per TensorCore, see page 22:

https://www.nvidia.com/content/dam/en-z ... per-V1.pdf

If the TensorCores are not bound to a specific matrix size but can perform arbitrary FMA operations, that could explain the difference between the A100 and RTX 3080 numbers Werewolf posted... and give a hint about the close performance of the RTX 2080 Ti and RTX 3080 - all just my speculation, I guess only Ankan knows for sure.

--
Srdja
One thing that was weird: the previous results I saw for the A100 were only a little better than the Titan RTX, about 10% better IIRC.

These numbers are much better.
However, Tilips confirmed last night that it gets pretty pointless with multiple A100 cards, as it's hard for the CPU to keep up.
(Not that many of us will buy even one A100, let alone three of them...)
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: Next-Gen GPUs for LC0

Post by Werewolf »

By the way, there are still rumours online that Nvidia could release an Ampere Titan, depending on how fast Big Navi turns out to be. A pinch-of-salt rumour, of course...
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Next-Gen GPUs for LC0

Post by smatovic »

Werewolf wrote: Tue Sep 29, 2020 12:01 pm By the way, there are still rumours online that Nvidia could release an Ampere Titan, depending on how fast Big Navi turns out to be. A pinch-of-salt rumour, of course...
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)

--
Srdja
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Next-Gen GPUs for LC0

Post by Milos »

smatovic wrote: Tue Sep 29, 2020 12:21 pm
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)
Both OpenCL and ROCm are crap compared to CUDA and cuDNN, so I see little point in mentioning Big Navi in the context of DL.
You need an RDNA2 card that is 2x faster in terms of TFLOPS to match the performance of an RTX card.
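For what it's worth, the headline TFLOPS figures on both sides come from the same simple formula, so the "2x" here is a claim about software efficiency, not about how the figures are computed. A quick sketch (the unit counts and clocks are rounded public spec-sheet numbers, used purely for illustration):

```python
def peak_tflops(shader_units: int, boost_ghz: float,
                flops_per_unit_per_clock: int = 2) -> float:
    # An FMA counts as 2 FLOPs, hence the default of 2 per unit per
    # clock; use 4 for cards that run FP16 at double rate through the
    # same units.
    return shader_units * flops_per_unit_per_clock * boost_ghz * 1e9 / 1e12

# Illustrative only (rounded spec-sheet numbers, not benchmarks):
print(peak_tflops(8704, 1.71))     # FP32 on an RTX 3080-like part
print(peak_tflops(5120, 2.25, 4))  # FP16 at 2x rate on a Big-Navi-like part
```

If Milos's rule of thumb holds, the second card's paper figure would have to be roughly double the first's before it catches up in actual Lc0 throughput.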
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Next-Gen GPUs for LC0

Post by smatovic »

Milos wrote: Tue Sep 29, 2020 12:39 pm
smatovic wrote: Tue Sep 29, 2020 12:21 pm
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)
Both OpenCL and ROCm are crap compared to CUDA and cuDNN, so I see little point in mentioning Big Navi in the context of DL.
You need an RDNA2 card that is 2x faster in terms of TFLOPS to match the performance of an RTX card.
- some people prefer AMD over Nvidia
- the DX12 backend of Lc0 runs on AMD too?
- you miss the point: competition is good for us end users; if we have three gaming-GPU vendors competing, we profit from the performance/price competition

--
Srdja
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Next-Gen GPUs for LC0

Post by Milos »

smatovic wrote: Tue Sep 29, 2020 10:01 am Are you sure Lc0 does not use 3x3 convolutions in its CNN filters?
At least according to the original AlphaZero implementation (not sure if Lc0 really changed anything besides the size), the bulk of the 3x3 convolutions is in the input layer (convolutional block).
3x3 convolutions are also present in each residual block of the tower, but not further on in the policy and value heads, which only have 1x1 convolutions.
Further, it seems Nvidia switched from defining fixed matrix sizes for TensorCores to defining FMA throughput per TensorCore, see page 22:

https://www.nvidia.com/content/dam/en-z ... per-V1.pdf

If the TensorCores are not bound to a specific matrix size but can perform arbitrary FMA operations, that could explain the difference between the A100 and RTX 3080 numbers Werewolf posted... and give a hint about the close performance of the RTX 2080 Ti and RTX 3080 - all just my speculation, I guess only Ankan knows for sure.
That is just PR. Nothing really changed in terms of how FP16 FLOPS throughput for TensorCores is calculated, or regarding the fact that those FMAs can only be used in tensor operations (otherwise you waste a full TensorCore to perform a single 1x1 convolution).
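The split between 3x3 and 1x1 work being debated here can be sanity-checked with a rough multiply count for an AlphaZero-style net on an 8x8 board. The sizes below (112 input planes, 256 filters, 20 residual blocks, small head stubs) are my assumptions for illustration; real Lc0 nets vary:

```python
# Rough multiply count per position for an AlphaZero-style CNN,
# to see how compute splits between 3x3 and 1x1 convolutions.
# All layer sizes are assumptions, not read from an actual Lc0 net.

BOARD = 8 * 8
FILTERS = 256
BLOCKS = 20  # residual blocks, two 3x3 convolutions each

def conv_mults(c_in: int, c_out: int, k: int) -> int:
    # one k x k convolution applied at every board square
    return BOARD * c_in * c_out * k * k

input_3x3 = conv_mults(112, FILTERS, 3)                   # input block
tower_3x3 = BLOCKS * 2 * conv_mults(FILTERS, FILTERS, 3)  # residual tower
heads_1x1 = conv_mults(FILTERS, 80, 1) + conv_mults(FILTERS, 32, 1)

share_3x3 = (input_3x3 + tower_3x3) / (input_3x3 + tower_3x3 + heads_1x1)
print(f"3x3 convolutions: {share_3x3:.1%} of all multiplies")
```

Under these assumptions the residual tower's 3x3 convolutions dominate the multiply count, so which units execute them (CUDA cores or TensorCores, and at what precision) would matter far more for NPS than the 1x1 head layers.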
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Next-Gen GPUs for LC0

Post by Milos »

smatovic wrote: Tue Sep 29, 2020 12:48 pm
Milos wrote: Tue Sep 29, 2020 12:39 pm
smatovic wrote: Tue Sep 29, 2020 12:21 pm
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)
Both OpenCL and ROCm are crap compared to CUDA and cuDNN, so I see little point in mentioning Big Navi in the context of DL.
You need an RDNA2 card that is 2x faster in terms of TFLOPS to match the performance of an RTX card.
- some people prefer AMD over Nvidia
- the DX12 backend of Lc0 runs on AMD too?
- you miss the point: competition is good for us end users; if we have three gaming-GPU vendors competing, we profit from the performance/price competition

--
Srdja
Gamers profit for sure, ML scientists not at all. Who wants to buy an AMD card that costs $1200 and has worse ML performance than an NVIDIA card that costs $500?
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Next-Gen GPUs for LC0

Post by smatovic »

Milos wrote: Tue Sep 29, 2020 1:01 pm
smatovic wrote: Tue Sep 29, 2020 12:48 pm
Milos wrote: Tue Sep 29, 2020 12:39 pm
smatovic wrote: Tue Sep 29, 2020 12:21 pm
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)
Both OpenCL and ROCm are crap compared to CUDA and cuDNN, so I see little point in mentioning Big Navi in the context of DL.
You need an RDNA2 card that is 2x faster in terms of TFLOPS to match the performance of an RTX card.
- some people prefer AMD over Nvidia
- the DX12 backend of Lc0 runs on AMD too?
- you miss the point: competition is good for us end users; if we have three gaming-GPU vendors competing, we profit from the performance/price competition

--
Srdja
Gamers profit for sure, ML scientists not at all. Who wants to buy an AMD card that costs $1200 and has worse ML performance than an NVIDIA card that costs $500?
Hmm, why did the DOE choose Intel (Aurora), AMD (Frontier), and AMD (El Capitan) for its upcoming exaFLOP systems, and not IBM/Nvidia?

--
Srdja