GPU rumors 2020

Discussion of anything and everything relating to chess playing software and machines.

Moderators: hgm, Rebel, chrisw

smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

Nvidia Ampere 7nm, the A100, for HPC/AI has been released; in short (simplified), Nvidia doubled the performance of the Tensor Cores per SM.

https://en.wikipedia.org/wiki/Ampere_(m ... hitecture)

https://devblogs.nvidia.com/nvidia-ampe ... -in-depth/

...a whopping 54.2 billion transistor monster.

If Nvidia uses these 3rd-gen Tensor Cores in its upcoming RTX 3000 series, you can expect at least a doubling of NPS for LC0. The alternative is that Nvidia splits the HPC and consumer architectures further.

--
Srdja
corres
Posts: 3657
Joined: Wed Nov 18, 2015 11:41 am
Location: hungary

Re: GPU rumors 2020

Post by corres »

I am afraid that once NVIDIA 3000 GPUs become widespread, Stockfish will not be able to keep pace with Leela.
This will be the real end of the era of AB engines.
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

corres wrote: Fri May 15, 2020 10:09 am I am afraid that once NVIDIA 3000 GPUs become widespread, Stockfish will not be able to keep pace with Leela.
This will be the real end of the era of AB engines.
https://eta-chess.app26.de/post/i-see-/

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

smatovic wrote: Fri May 15, 2020 9:15 am Nvidia Ampere 7nm, the A100, for HPC/AI has been released; in short (simplified), Nvidia doubled the performance of the Tensor Cores per SM.

https://en.wikipedia.org/wiki/Ampere_(m ... hitecture)

https://devblogs.nvidia.com/nvidia-ampe ... -in-depth/

...a whopping 54.2 billion transistor monster.

If Nvidia uses these 3rd-gen Tensor Cores in its upcoming RTX 3000 series, you can expect at least a doubling of NPS for LC0. The alternative is that Nvidia splits the HPC and consumer architectures further.

--
Srdja
Okay, from the Ampere whitepaper
...
Each of the A100 Tensor Cores can execute 256 FP16 FMA operations per clock, allowing it to compute the results for an 8x4x8 mixed-precision matrix multiplication per clock.
---
https://www.nvidia.com/content/dam/en-z ... epaper.pdf

I am not into CNNs and the LC0 implementation, hence I cannot tell whether these improved matrix multiplications lead to a doubling of NPS for LC0...
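
For a rough sense of scale, here is a back-of-envelope sketch (plain Python) of what the whitepaper figure implies for the whole chip; the SM count, Tensor Cores per SM and boost clock are the published A100 numbers, nothing LC0-specific:

```python
# Back-of-envelope peak FP16 tensor throughput of the A100, from the
# whitepaper figure quoted above (256 FP16 FMA per Tensor Core per clock).
fma_per_tc_per_clock = 256                           # 8 x 4 x 8 = 256 multiply-adds
flops_per_tc_per_clock = 2 * fma_per_tc_per_clock    # 1 FMA = 2 FLOPs

sms = 108                                            # SMs enabled on the A100
tensor_cores_per_sm = 4
boost_clock_hz = 1.41e9                              # ~1410 MHz boost clock

peak_tflops = (flops_per_tc_per_clock * tensor_cores_per_sm * sms
               * boost_clock_hz) / 1e12
print(f"peak dense FP16 tensor throughput: {peak_tflops:.0f} TFLOPS")   # ~312
```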

--
Srdja
M ANSARI
Posts: 3707
Joined: Thu Mar 16, 2006 7:10 pm

Re: GPU rumors 2020

Post by M ANSARI »

Until NN engines start playing endgames properly, there will always be a place for a good sanity check with an AB engine. I wonder how much additional power an NN engine needs to overcome its weaknesses. Most likely much, much more, since even with four RTX 2080 Ti GPUs and tons of time, some poor moves really don't change. Most likely a fundamental change in NN engine algorithms would be needed to fix that. I think that will come in time, and of course AI hardware will only keep improving. I do think the death of AB engines is overstated, though, as their hardware and software will also continue to improve. It will certainly be an interesting next 5 years!
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

Yea, despite the talk about the death of Moore's Law we still have the 5nm and 3nm fabrication processes in the pipeline, and maybe some unknown reserve, hence roughly three more doublings, i.e. a 2^3 = 8x improvement of NPS for neural networks, seems possible within 10 years... not to mention new kinds of combinations of search algorithms with NNs, or new kinds of NN structures. Interesting times ahead.
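
A trivial sketch of that arithmetic (one doubling of NN throughput per remaining node step is of course just an assumption):

```python
# Assume roughly one doubling per remaining node step (7nm -> 5nm -> 3nm ->
# "unknown reserve"); purely illustrative numbers, not a forecast.
doublings = 3
print(f"projected NPS factor over ~10 years: {2 ** doublings}x")   # 8x
```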

--
Srdja
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: GPU rumors 2020

Post by Milos »

smatovic wrote: Mon Jun 08, 2020 7:14 am
smatovic wrote: Fri May 15, 2020 9:15 am Nvidia Ampere 7nm, the A100, for HPC/AI has been released; in short (simplified), Nvidia doubled the performance of the Tensor Cores per SM.

https://en.wikipedia.org/wiki/Ampere_(m ... hitecture)

https://devblogs.nvidia.com/nvidia-ampe ... -in-depth/

...a whopping 54.2 billion transistor monster.

If Nvidia uses these 3rd-gen Tensor Cores in its upcoming RTX 3000 series, you can expect at least a doubling of NPS for LC0. The alternative is that Nvidia splits the HPC and consumer architectures further.

--
Srdja
Okay, from the Ampere whitepaper
...
Each of the A100 Tensor Cores can execute 256 FP16 FMA operations per clock, allowing it to compute the results for an 8x4x8 mixed-precision matrix multiplication per clock.
---
https://www.nvidia.com/content/dam/en-z ... epaper.pdf

I am not into CNNs and the LC0 implementation, hence I cannot tell whether these improved matrix multiplications lead to a doubling of NPS for LC0...

--
Srdja
Most probably not. The most common operations are dot products and 3x3 convolutions. Tensor cores don't speed up dot products, and they are not very efficient for 3x3 convolutions. I guess NVIDIA made this increase from 4x4 tiles to 8x8 tiles mainly for newer CNN types that have 7x7 convolutions in the first layer. AFAIK Lc0 doesn't use 7x7 convolutions. So I expect those super-fast tensor cores would still be used for 3x3 convolutions at 2x lower efficiency, and the real gain would come only from the CUDA-core increase from 15 to 19 TFLOPS in FP16.
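
To make the shapes concrete, here is a rough sketch of what a 3x3 convolution looks like once lowered via im2col to a matrix multiplication, plus the CUDA-core-only gain; the channel counts are assumptions for a typical large Lc0 net, not read from the Lc0 source:

```python
# Rough shape sketch: a 3x3 convolution in an Lc0-style residual block,
# lowered via im2col to a GEMM. Channel counts below are illustrative
# assumptions, not taken from the Lc0 source.
channels_in = channels_out = 256   # filters of a large net (assumed)
board_squares = 8 * 8              # spatial positions per chess position

M = board_squares                  # rows of the im2col matrix (per position)
K = 9 * channels_in                # one 3x3 patch flattened over input channels
N = channels_out

print(f"GEMM per conv layer and position: {M} x {K} x {N}")   # 64 x 2304 x 256
# With only 64 rows per position, the larger Ampere tensor-core tiles are hard
# to keep busy unless many positions are batched together.

# Gain from the plain CUDA cores alone, using the FP16 figures above:
print(f"CUDA-core FP16 gain: {19 / 15:.2f}x")                 # ~1.27x
```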
JohnW
Posts: 381
Joined: Thu Nov 22, 2012 12:20 am
Location: New Hampshire

Re: GPU rumors 2020

Post by JohnW »

NVIDIA’s GeForce RTX 3080 Flagship GPU Pictured For The First Time

https://wccftech.com/nvidia-rtx-3080-pictured/
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

Seems Intel plans to boost NNs on CPUs not only via AVX-VNNI but also via its
upcoming AMX (Advanced Matrix Extensions), maybe with the new Xeons in the Aurora
supercomputer aimed at 2021...

https://www.tomshardware.com/news/intel ... d-for-2021

We will see how the upcoming Intel Xe GPUs will compete against these upcoming
Xeon AMX CPUs.

Maybe AMX and AVX-VNNI will be interesting for tasks with lower latencies, like
AB chess engines with smaller kinds of NNs.
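
As a rough illustration of the kind of work AMX and VNNI accelerate (plain NumPy rather than the actual instructions, and the layer sizes are made up), the inner loop of such a small net is essentially int8 dot products accumulated into int32:

```python
import numpy as np

# Illustration of the operation that VNNI-style instructions (per 32-bit lane)
# and AMX (per register tile) accelerate: int8 multiplies accumulated into
# int32, which is also the inner loop of NNUE-style evaluation.
rng = np.random.default_rng(0)
activations = rng.integers(0, 127, size=512, dtype=np.int8)      # clipped activations
weights = rng.integers(-64, 64, size=(32, 512), dtype=np.int8)   # one small layer (made up)

# One VNNI instruction performs 4 of these multiply-accumulates per lane;
# AMX does the same over whole tiles of the weight matrix.
out = weights.astype(np.int32) @ activations.astype(np.int32)
print(out.shape, out.dtype)   # (32,) int32
```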

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

Here is one about Apple...

As you know, Apple plans to transition from Intel to its own ARM-based CPUs,
but it remains open how things like LC0 and SF NNUE will perform on them...

Apple has its own in-house GPU design running on the A13 Bionic chip, programmed
via its Metal API, and it is not yet known what kind of GPU will sit in these
Apple desktop/laptop ARM-based machines:

https://fudzilla.com/news/graphics/5115 ... screte-gpu

Further, the ARM design allows up to 16 co-processors to be connected; known
examples are the SIMD/vector units Neon (128 bit), Helium (128 bit) and SVE
(512 bit), which could be useful for AB engines with NNUE.
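
As simple arithmetic on those vector widths (nothing Apple-specific), the lane counts available per register for NNUE-style int8/int16 layers would be:

```python
# int8 and int16 lanes per vector register for the units mentioned above
# (SVE width as quoted; the architecture itself is scalable).
for name, bits in [("Neon", 128), ("Helium", 128), ("SVE (as quoted)", 512)]:
    print(f"{name:16s} {bits // 8:3d} int8 lanes, {bits // 16:3d} int16 lanes")
```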

The current line of A12/A13 mobile chips has some neural network accelerators
on board, called AMX blocks, which are used for matrix multiplications. I am not
sure which kind of API these use, but if they are put into the desktop/laptop
line they are for sure a speed-up candidate for LC0 and NNUE...

--
Srdja