Next-Gen GPUs for LC0

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

mehmet123
Posts: 670
Joined: Sun Jan 26, 2020 10:38 pm
Location: Turkey
Full name: Mehmet Karaman

Re: Next-Gen GPUs for LC0

Post by mehmet123 »

Laskos wrote: Mon Sep 28, 2020 6:18 pm
What is "baseline" and "optimized"?
https://github.com/LeelaChessZero/lc0/pull/1428
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Next-Gen GPUs for LC0

Post by smatovic »

Milos wrote: Mon Sep 28, 2020 4:41 pm
Alayan wrote: Mon Sep 28, 2020 4:17 pm Nvidia changed the definition of CUDA cores. You need a workload that fully saturates the FP32 units to get close (but not quite) to the effect the same number of CUDA cores would have had in Turing.

1 CUDA core in Turing: 1x FP32 unit + 1x INT32 unit, able to execute concurrently
2 CUDA cores in Ampere: 1x FP32 unit + 1x (INT32 OR FP32) unit, able to execute concurrently

But more importantly, isn't Leela supposed to use FP16 operations, with most of the relevant FP16 compute on RTX cards coming from tensor cores and not from the 2xFP16 mode of the FP32 units?
Leela mainly uses FP16 multipliers from the CUDA cores. I am really not aware that this definition changed. Tensor cores are only used for 3x3 convolutions in the input layer (rather inefficiently). You can't use Tensor cores for 1x1 convolutions (which are the great majority of operations in Lc0 DNN inference), i.e. you can, but it is grossly inefficient.
Are you sure Lc0 does not use 3x3 convolutions in its CNN filters?

Further, it seems Nvidia switched from defining fixed matrix sizes for TensorCores to defining FMA throughput per TensorCore, see page 22:

https://www.nvidia.com/content/dam/en-z ... per-V1.pdf

If the TensorCores are not bound to a specific matrix size but can perform arbitrary FMA operations, that could explain the difference between the A100 and RTX 3080 numbers Werewolf posted... and give a hint about the close performance of the RTX 2080 Ti and RTX 3080 - all just my speculation, I guess only Ankan knows for sure.
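Alayan's distinction between the two architectures can be sketched with a toy per-SM issue model (my simplification of the whitepaper figures; the 64-lane counts come from Nvidia's public docs, everything else is illustrative):

```python
# Toy per-clock issue model for one SM, Turing vs Ampere (GA10x).
# int_fraction = share of the instruction stream that is INT32.
# This ignores scheduling, occupancy and memory entirely -- it only
# shows why Ampere's doubled "CUDA core" count needs a pure-FP32
# workload to pay off.

def turing_fp32_per_clock(int_fraction: float) -> float:
    # Turing SM: 64 FP32 lanes plus 64 separate INT32 lanes, so
    # INT32 work never displaces FP32 work.
    return 64.0

def ampere_fp32_per_clock(int_fraction: float) -> float:
    # GA10x SM: 64 dedicated FP32 lanes plus 64 shared lanes that run
    # either FP32 or INT32; INT32 work eats into the shared half.
    return 64.0 + 64.0 * (1.0 - int_fraction)

for f in (0.0, 0.3, 1.0):
    print(f"INT32 share {f:.0%}: Turing {turing_fp32_per_clock(f):.0f}, "
          f"Ampere {ampere_fp32_per_clock(f):.0f} FP32 ops/clock")
```

At 0% INT32 the Ampere SM issues twice as many FP32 ops per clock; with a mixed instruction stream the advantage shrinks, which would be consistent with the 2080 Ti and 3080 landing closer together than the raw core counts suggest.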

--
Srdja
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: Next-Gen GPUs for LC0

Post by Werewolf »

smatovic wrote: Tue Sep 29, 2020 10:01 am
Milos wrote: Mon Sep 28, 2020 4:41 pm
Alayan wrote: Mon Sep 28, 2020 4:17 pm Nvidia changed the definition of CUDA cores. You need a workload that fully saturates the FP32 units to get close (but not quite) to the effect the same number of CUDA cores would have had in Turing.

1 CUDA core in Turing: 1x FP32 unit + 1x INT32 unit, able to execute concurrently
2 CUDA cores in Ampere: 1x FP32 unit + 1x (INT32 OR FP32) unit, able to execute concurrently

But more importantly, isn't Leela supposed to use FP16 operations, with most of the relevant FP16 compute on RTX cards coming from tensor cores and not from the 2xFP16 mode of the FP32 units?
Leela mainly uses FP16 multipliers from the CUDA cores. I am really not aware that this definition changed. Tensor cores are only used for 3x3 convolutions in the input layer (rather inefficiently). You can't use Tensor cores for 1x1 convolutions (which are the great majority of operations in Lc0 DNN inference), i.e. you can, but it is grossly inefficient.
Are you sure Lc0 does not use 3x3 convolutions in its CNN filters?

Further, it seems Nvidia switched from defining fixed matrix sizes for TensorCores to defining FMA throughput per TensorCore, see page 22:

https://www.nvidia.com/content/dam/en-z ... per-V1.pdf

If the TensorCores are not bound to a specific matrix size but can perform arbitrary FMA operations, that could explain the difference between the A100 and RTX 3080 numbers Werewolf posted... and give a hint about the close performance of the RTX 2080 Ti and RTX 3080 - all just my speculation, I guess only Ankan knows for sure.

--
Srdja
One thing that was weird: the previous results I saw for the A100 were only a little better than the Titan RTX, about 10% better IIRC.

These numbers are much better.
However, Tilips confirmed last night that it gets pretty pointless with multiple A100 cards, as it's hard for the CPU to keep up.
(Not that many of us will buy even one A100, let alone three of them...)
Werewolf
Posts: 1796
Joined: Thu Sep 18, 2008 10:24 pm

Re: Next-Gen GPUs for LC0

Post by Werewolf »

By the way, there are still rumours online that Nvidia could release an Ampere Titan, depending on how fast Big Navi turns out to be. A pinch-of-salt rumour, of course...
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Next-Gen GPUs for LC0

Post by smatovic »

Werewolf wrote: Tue Sep 29, 2020 12:01 pm By the way, there are still rumours online that Nvidia could release an Ampere Titan, depending on how fast Big Navi turns out to be. A pinch-of-salt rumour, of course...
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)

--
Srdja
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Next-Gen GPUs for LC0

Post by Milos »

smatovic wrote: Tue Sep 29, 2020 12:21 pm
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)
Both OpenCL and ROCm are crap compared to CUDA and cuDNN, so I see little point in mentioning Big Navi in the context of DL.
You need an RDNA2 card that is 2x faster in terms of TFLOPS to match the performance of an RTX card.
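For what it's worth, the headline TFLOPS figures on both sides come from the same simple formula, so the "2x" here is a claim about software efficiency, not about how the figures are computed. A quick sketch (the unit counts and clocks are rounded public spec-sheet numbers, used purely for illustration):

```python
def peak_tflops(shader_units: int, boost_ghz: float,
                flops_per_unit_per_clock: int = 2) -> float:
    # An FMA counts as 2 FLOPs, hence the default of 2 per unit per
    # clock; use 4 for cards that run FP16 at double rate through the
    # same units.
    return shader_units * flops_per_unit_per_clock * boost_ghz * 1e9 / 1e12

# Illustrative only (rounded spec-sheet numbers, not benchmarks):
print(peak_tflops(8704, 1.71))     # FP32 on an RTX 3080-like part
print(peak_tflops(5120, 2.25, 4))  # FP16 at 2x rate on a Big-Navi-like part
```

If Milos's rule of thumb holds, the second card's paper figure would have to be roughly double the first's before it catches up in actual Lc0 throughput.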
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Next-Gen GPUs for LC0

Post by smatovic »

Milos wrote: Tue Sep 29, 2020 12:39 pm
smatovic wrote: Tue Sep 29, 2020 12:21 pm
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)
Both OpenCL and ROCm are crap compared to CUDA and cuDNN, so I see little point in mentioning Big Navi in the context of DL.
You need an RDNA2 card that is 2x faster in terms of TFLOPS to match the performance of an RTX card.
- some people prefer AMD over Nvidia
- the DX12 backend of Lc0 runs on AMD too?
- you miss the point: competition is good for us end users; if we have three gaming-GPU vendors competing, we profit from the performance/price competition

--
Srdja
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Next-Gen GPUs for LC0

Post by Milos »

smatovic wrote: Tue Sep 29, 2020 10:01 am Are you sure Lc0 does not use 3x3 convolutions in its CNN filters?
At least according to the original AlphaZero implementation (not sure if Lc0 really changed anything besides the size), the bulk of the 3x3 convolutions is in the input layer (convolutional block).
3x3 convolutions are also present in each residual block of the tower, but not further on in the policy and value heads, which only have 1x1 convolutions.
Further, it seems Nvidia switched from defining fixed matrix sizes for TensorCores to defining FMA throughput per TensorCore, see page 22:

https://www.nvidia.com/content/dam/en-z ... per-V1.pdf

If the TensorCores are not bound to a specific matrix size but can perform arbitrary FMA operations, that could explain the difference between the A100 and RTX 3080 numbers Werewolf posted... and give a hint about the close performance of the RTX 2080 Ti and RTX 3080 - all just my speculation, I guess only Ankan knows for sure.
That is just PR. Nothing really changed in terms of how FP16 FLOPS throughput for TensorCores is calculated, or regarding the fact that those FMAs can only be used in tensor operations (otherwise you waste a full TensorCore to perform a single 1x1 convolution).
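The split between 3x3 and 1x1 work being debated here can be sanity-checked with a rough multiply count for an AlphaZero-style net on an 8x8 board. The sizes below (112 input planes, 256 filters, 20 residual blocks, small head stubs) are my assumptions for illustration; real Lc0 nets vary:

```python
# Rough multiply count per position for an AlphaZero-style CNN,
# to see how compute splits between 3x3 and 1x1 convolutions.
# All layer sizes are assumptions, not read from an actual Lc0 net.

BOARD = 8 * 8
FILTERS = 256
BLOCKS = 20  # residual blocks, two 3x3 convolutions each

def conv_mults(c_in: int, c_out: int, k: int) -> int:
    # one k x k convolution applied at every board square
    return BOARD * c_in * c_out * k * k

input_3x3 = conv_mults(112, FILTERS, 3)                   # input block
tower_3x3 = BLOCKS * 2 * conv_mults(FILTERS, FILTERS, 3)  # residual tower
heads_1x1 = conv_mults(FILTERS, 80, 1) + conv_mults(FILTERS, 32, 1)

share_3x3 = (input_3x3 + tower_3x3) / (input_3x3 + tower_3x3 + heads_1x1)
print(f"3x3 convolutions: {share_3x3:.1%} of all multiplies")
```

Under these assumptions the residual tower's 3x3 convolutions dominate the multiply count, so which units execute them (CUDA cores or TensorCores, and at what precision) would matter far more for NPS than the 1x1 head layers.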
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Next-Gen GPUs for LC0

Post by Milos »

smatovic wrote: Tue Sep 29, 2020 12:48 pm
Milos wrote: Tue Sep 29, 2020 12:39 pm
smatovic wrote: Tue Sep 29, 2020 12:21 pm
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)
Both OpenCL and ROCm are crap compared to CUDA and cuDNN, so I see little point in mentioning Big Navi in the context of DL.
You need an RDNA2 card that is 2x faster in terms of TFLOPS to match the performance of an RTX card.
- some people prefer AMD over Nvidia
- the DX12 backend of Lc0 runs on AMD too?
- you miss the point: competition is good for us end users; if we have three gaming-GPU vendors competing, we profit from the performance/price competition

--
Srdja
Gamers profit for sure, ML scientists not at all. Who wants to buy an AMD card that costs $1200 and has worse ML performance than an NVIDIA card that costs $500?
smatovic
Posts: 2645
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Next-Gen GPUs for LC0

Post by smatovic »

Milos wrote: Tue Sep 29, 2020 1:01 pm
smatovic wrote: Tue Sep 29, 2020 12:48 pm
Milos wrote: Tue Sep 29, 2020 12:39 pm
smatovic wrote: Tue Sep 29, 2020 12:21 pm
Yes, Big Navi with RDNA 2 is supposed to catch up with Nvidia's high-end line,
at least for gaming. We already saw, with the RTX 20xx Super series, a second launch
of the same architecture at a better performance/price; it remains open whether such
a thing will happen after AMD launches its own high-end series... sad that Intel
is not launching its Xe-HPG this year... that would have been fun :)
Both OpenCL and ROCm are crap compared to CUDA and cuDNN, so I see little point in mentioning Big Navi in the context of DL.
You need an RDNA2 card that is 2x faster in terms of TFLOPS to match the performance of an RTX card.
- some people prefer AMD over Nvidia
- the DX12 backend of Lc0 runs on AMD too?
- you miss the point: competition is good for us end users; if we have three gaming-GPU vendors competing, we profit from the performance/price competition

--
Srdja
Gamers profit for sure, ML scientists not at all. Who wants to buy an AMD card that costs $1200 and has worse ML performance than an NVIDIA card that costs $500?
Hmm, why did the DOE choose Intel (Aurora), AMD (Frontier), and AMD (El Capitan) for its upcoming exaFLOP systems, and not IBM/Nvidia?

--
Srdja