Follow-up:
Intel's CFO seems not satisfied with the 10nm node:
https://www.anandtech.com/show/15580/in ... -than-22nm
Intel splits its upcoming Xe GPUs into three categories, Xe-LP, Xe-HP, Xe-HPC:
https://www.anandtech.com/show/15188/an ... -vecchio/2
They speak of variable vector width; not sure if they will have dedicated Tensor Processing Units.
AMD splits its GPUs into the RDNA and CDNA architectures, aimed at consumer and High Performance Computing markets respectively.
https://www.anandtech.com/show/15593/am ... ta-centers
Again, no word about dedicated Tensor Processing Units. It looks like CDNA is based on the GCN architecture.
AMD talks about its upcoming RDNA2 GPU, "Big Navi" / "Navi 2x":
https://www.anandtech.com/show/15591/am ... erfperwatt
--
Srdja
GPU rumors 2020
Moderators: hgm, Rebel, chrisw
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
-
- Posts: 560
- Joined: Sun Nov 08, 2015 11:10 pm
-
- Posts: 1080
- Joined: Fri Sep 16, 2016 6:55 pm
- Location: USA/Minnesota
- Full name: Leo Anger
Re: GPU rumors 2020
noobpwnftw wrote: ↑Sun Mar 08, 2020 5:42 pm
Maybe people just don't need one.
https://arxiv.org/abs/1903.03129
Interesting article.
Advanced Micro Devices fan.
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: GPU rumors 2020
Wouldn't SLIDE run better on a GPU (or perhaps an integrated memory system where GPU and CPU share the same very fast memory)? The AMD Infinity Fabric approach seems to be reaching for that.
For example, Matthew Lai first wrote a general purpose CPU version of his program, which did well.
But it was the tensor processor that made it knock everybody's socks off.
Of course, smart is good. The thing about GPU game solutions that is amazing is perhaps not that they do so well, but that they don't do nearly so well as one might think, given the compute resources at their command. There are consumer GPUs that do about 14,000 GFlops. The AMD 3990x at full tilt does about 425 GFlops. That's five doublings in speed. How is it that Stockfish can ever win a game against that thing?
In fp16, one 2080 Ti does about 22 TFlops, so two of them give 44 TFlops:
44000 / 425 ≈ 100 times faster.
And the FLOPS per dollar are even better.
Two 2080 TI cards are about $1500 and one 3990x is about $4000.
I think we need to learn how to use them better.
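The arithmetic above is easy to check in a few lines. A quick sketch, using only the peak figures quoted in this thread (rough vendor peaks, not benchmarks):

```python
# Back-of-the-envelope check of the speedup figures quoted above.
# All numbers are the rough peaks and street prices from the post.
import math

gpu_gflops = 14_000          # consumer GPU, quoted fp32 peak
cpu_gflops = 425             # AMD 3990x, quoted peak

ratio = gpu_gflops / cpu_gflops
print(f"GPU/CPU: {ratio:.1f}x, about {math.log2(ratio):.1f} doublings")

# Two 2080 Ti class cards at 22 TFlops fp16 each:
pair_gflops = 2 * 22_000
print(f"fp16 pair vs CPU: {pair_gflops / cpu_gflops:.0f}x")

# Dollars per GFlop at the quoted street prices:
print(f"GPU pair: ${1500 / pair_gflops:.3f}/GFlop")
print(f"3990x:    ${4000 / cpu_gflops:.2f}/GFlop")
```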
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 43
- Joined: Thu Oct 11, 2018 2:26 pm
- Full name: Graham Jones
Re: GPU rumors 2020
The SLIDE algorithm is not going to make GPUs redundant. I doubt it will make much of a dent in their sales.
From the article:
- SLIDE only applies to fully connected nets, not to convolutional nets as used in LCZero and, I think, in almost all processing of visual information.
- SLIDE was only shown to speed up certain 'extreme classification' tasks.
- SLIDE only speeds up training, not inference.
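For readers curious what the speedup mechanism looks like: SLIDE's core trick is to hash the input activation with locality-sensitive hashing and only evaluate the output neurons whose weight vectors fall in the same bucket, rather than all of them. A toy sketch of that idea (not the authors' code; the sizes and names here are made up for illustration, and real SLIDE uses multiple hash tables and updates them during training):

```python
# Toy sketch of LSH-based sparse evaluation of a wide last layer.
# With K output classes in the hundreds of thousands and >99% of the
# work in the final layer, evaluating only a bucket of candidate
# neurons is where SLIDE's savings come from.
import random
from collections import defaultdict

random.seed(0)
D, K, BITS = 16, 1000, 8      # input dim, output classes, hash bits

# SimHash: sign pattern of a vector against BITS random hyperplanes.
planes = [[random.gauss(0, 1) for _ in range(D)] for _ in range(BITS)]

def simhash(v):
    """Return the tuple of signs of v against the random hyperplanes."""
    return tuple(sum(p[i] * v[i] for i in range(D)) >= 0 for p in planes)

# Random last-layer weights; bucket each output neuron by its hash.
W = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
buckets = defaultdict(list)
for k in range(K):
    buckets[simhash(W[k])].append(k)

def sparse_logits(x):
    """Compute logits only for the neurons in x's hash bucket."""
    return {k: sum(W[k][i] * x[i] for i in range(D))
            for k in buckets[simhash(x)]}

x = [random.gauss(0, 1) for _ in range(D)]
logits = sparse_logits(x)
print(f"evaluated {len(logits)} of {K} output neurons")
```

Neurons that share x's bucket are the ones whose weight vectors point roughly the same way as x, i.e. the ones likely to have the largest logits, which is why the approximation works for classification.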
More about one of the two data sets they used. From the article:
To show SLIDE's real advantage, we will need large networks where even a slight decrease in performance is noticeable. Thus, the publicly available extreme classification datasets, requiring more than 100 million parameters to train due to their extremely wide last layer, fit this setting appropriately. For these tasks, most of the computations (more than 99%) are in the final layer.
There is some confusion over the hardware used, among people who find it easier to speculate than to read, e.g. https://wccftech.com/intel-ai-breakthro ... -v100-gpu/.
The objective in extreme multi-label learning is to learn a classifier that can automatically tag a datapoint with the most relevant subset of labels from an extremely large label set. This competition provides the Amazon-670 dataset which is a product to product recommendation dataset. The task here is to learn an extreme classifier such that given a product's description it can recommend other products (out of possible 670K products) that a user might be interested in buying.
From the article:
All the experiments are conducted on a server equipped with two 22-core/44-thread processors (Intel Xeon E5-2699A v4 2.40GHz) and one NVIDIA Tesla V100 Volta 32GB GPU.
Graham Jones, www.indriid.com
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: GPU rumors 2020
Maybe also worth mentioning: AMD/Cray will build the third U.S. exaFLOP system, named El Capitan, planned for 2023. Not sure why the DOE did not choose IBM/Nvidia; it would make more sense to me to take a third domestic player besides Intel and AMD.
https://www.anandtech.com/show/15581/el ... 2-exaflops
They speak of 'unified memory' across CPUs and GPUs via Infinity Fabric gen 3.
https://www.anandtech.com/show/15596/am ... everything
Nvidia's NVLink and AMD's Infinity Fabric are already used to couple several GPUs together and run, for example, neural networks with billions of parameters.
https://en.wikipedia.org/wiki/NVLink
https://en.wikipedia.org/wiki/HyperTran ... ity_Fabric
--
Srdja
-
- Posts: 919
- Joined: Sat May 31, 2014 8:28 am
Re: GPU rumors 2020
The more traditional definition of a teraflop is a trillion double-precision floating point operations per second. How this got redefined to mean floating point operations of "any" precision is beyond me, so I'm not sure your comparison is technically correct. It takes four FP16 mults to make an FP32 mult and four FP32 mults to make an FP64 mult, so an FP64 mult is 16 times the work of an FP16 mult.
Dann Corbit wrote: ↑Sun Mar 08, 2020 8:34 pm
wouldn't slide run better on a GPU (or perhaps an integrated memory system where GPU and CPU share the same very fast memory)? The AMD infinity fabric approach seems to be reaching for that.
For example, Matthew Lai first wrote a general purpose CPU version of his program, which did well.
But it was the tensor processor that made it knock everybody's socks off.
Of course, smart is good. The thing about GPU game solutions that is amazing is perhaps not that they do so well, but that they don't do nearly so well as one might think, given the compute resources at their command. There are consumer GPUs that do about 14,000 GFlops. The AMD 3990x at full tilt does about 425 GFlops. That's five doublings in speed. How is it that Stockfish can ever win a game against that thing?
In fp16, one 2080TI Super does 22 TFlops, so two of them gives 44 TFlops.
44000/425=100 times faster
And the FPS per dollar is even better.
Two 2080 TI cards are about $1500 and one 3990x is about $4000.
I think we need to learn how to use them better.
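Under that 16:1 assumption the thread's comparison can be renormalized to a common precision. A small sketch; note the 425 GFlops CPU figure is assumed here to be FP64, which may not be what the original poster meant:

```python
# Normalizing the thread's peak-throughput figures to FP64-equivalent,
# under the poster's assumption that each halving of width quadruples
# throughput (FP64 work = 4x FP32 = 16x FP16). Quoted peaks, not
# benchmarks.
def to_fp64_equiv(tflops, precision):
    """Scale a peak-TFlops figure down to its FP64-equivalent."""
    factor = {16: 16, 32: 4, 64: 1}[precision]
    return tflops / factor

two_2080ti_fp16 = 44.0       # TFlops fp16, quoted in the thread
cpu_3990x_fp64 = 0.425       # TFlops, assumed FP64

gpu_equiv = to_fp64_equiv(two_2080ti_fp16, 16)
print(f"GPU pair, FP64-equivalent: {gpu_equiv:.2f} TFlops")
print(f"ratio vs 3990x: {gpu_equiv / cpu_3990x_fp64:.1f}x")
```

On this accounting the 100x gap shrinks to roughly 6.5x, which is the poster's point.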
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: GPU rumors 2020
FLOP is floating point operation. I guess originally it was 32 bit, since most work was done in 32 bits back when the term came about (but I am not sure about that). I believe that this was the paper that coined the term:
https://pdfs.semanticscholar.org/4343/5 ... 8d2180.pdf
Often, the GFlops or TFlops figure will have the precision specified, as in the cases mentioned.
https://en.wikipedia.org/wiki/FLOPS
At any rate, for chess the width is largely irrelevant (though there may be some limit; there is even talk of 8- and 4-bit floats, and it is possible that the loss of resolution would matter). For LC0, for instance, 16-bit float is plenty good, and using it makes the engine a good deal stronger.
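The resolution loss at narrow widths is easy to demonstrate with Python's stdlib half-precision codec (the `struct` format 'e'), as a small illustration:

```python
# How much resolution is lost at fp16: encode a value as IEEE 754
# half precision and decode it back. fp16 has a 10-bit mantissa, so
# integers above 2048 are no longer exactly representable.
import struct

def roundtrip_fp16(x):
    """Encode x as IEEE 754 half precision ('e') and decode it back."""
    return struct.unpack('e', struct.pack('e', x))[0]

for x in [0.1, 1.0 / 3.0, 2048.0, 2049.0]:
    print(f"{x!r:22} -> fp16 -> {roundtrip_fp16(x)!r}")
```

Note that 2049.0 comes back as 2048.0: above 2048 the fp16 spacing is already 2.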
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 919
- Joined: Sat May 31, 2014 8:28 am
Re: GPU rumors 2020
The term was in use well before that paper was written, as far back as the Cyber 205 series of supercomputers, although back then it was megaflops.
Dann Corbit wrote: ↑Tue Mar 10, 2020 3:28 am
FLOP is floating point operation. I guess originally it was 32 bit, since most work was done in 32 bits back when the term came about (but I am not sure about that). I believe that this was the paper that coined the term:
https://pdfs.semanticscholar.org/4343/5 ... 8d2180.pdf
Often, the GFlops or TFlops will have the precision specified, such as the cases mentioned.
https://en.wikipedia.org/wiki/FLOPS
At any rate for chess anyway, the width is irrelevant (though there may be some limit, there is even talk of 8 and 4 bit floats and it is possible that the loss of resolution would matter). But for LC0, for instance, 16 bit float is plenty good and using 16 bit float makes it a good deal stronger.
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: GPU rumors 2020
In the super-computer realm FLOP usually refers to FP64, double precision:
Zenmastur wrote: ↑Mon Mar 09, 2020 11:21 pm
The more traditional definition of a teraflop is a trillion double precision floating point operations per second. How this got redefined to be "any" precision floating point operations per second is beyond me. So, I'm not sure your comparison is technically correct. It takes four FP16 mult to make a FP32 mult and four FP32 mult to make an FP64 mult. So a FP64 mult is 16 times the work of a FP16 mult.
Dann Corbit wrote: ↑Sun Mar 08, 2020 8:34 pm
wouldn't slide run better on a GPU (or perhaps an integrated memory system where GPU and CPU share the same very fast memory)? The AMD infinity fabric approach seems to be reaching for that.
For example, Matthew Lai first wrote a general purpose CPU version of his program, which did well.
But it was the tensor processor that made it knock everybody's socks off.
Of course, smart is good. The thing about GPU game solutions that is amazing is perhaps not that they do so well, but that they don't do nearly so well as one might think, given the compute resources at their command. There are consumer GPUs that do about 14,000 GFlops. The AMD 3990x at full tilt does about 425 GFlops. That's five doublings in speed. How is it that Stockfish can ever win a game against that thing?
In fp16, one 2080TI Super does 22 TFlops, so two of them gives 44 TFlops.
44000/425=100 times faster
And the FPS per dollar is even better.
Two 2080 TI cards are about $1500 and one 3990x is about $4000.
I think we need to learn how to use them better.
The LINPACK benchmark, which determines the TOP500, uses 64-bit floating point operations. Consumer GPUs are usually crippled in this regard: they can have FP32:FP64 ratios from 4:1 down to 32:1 or so, while the current HPC server-class GPUs have a ratio of 2:1. I'm still not sure if they use variable vector width for this or have dedicated FP64 units. Meanwhile, people are starting to use simply 'OPs' to refer to tensor operations per second or the like, with different bit widths...
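As a quick illustration of what those ratios mean for peak throughput (the 20 TFlops FP32 figure is a made-up round number for illustration, not any particular card):

```python
# How the FP32:FP64 ratio caps double-precision peak throughput.
# Ratios are the ranges mentioned above; the FP32 peak is illustrative.
fp32_peak_tflops = 20.0

for label, ratio in [("HPC part (2:1)", 2),
                     ("consumer (4:1)", 4),
                     ("consumer (32:1)", 32)]:
    print(f"{label:16} FP64 peak: {fp32_peak_tflops / ratio:6.3f} TFlops")
```

So the same die can look great on an FP32 (or fp16) spec sheet and still be nearly useless for LINPACK-style FP64 work.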
--
Srdja