GPU rumors 2020

smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

Followup -

Intel's CFO does not seem satisfied with the 10nm node:

https://www.anandtech.com/show/15580/in ... -than-22nm

Intel splits its upcoming Xe GPUs into three categories, Xe-LP, Xe-HP, Xe-HPC:

https://www.anandtech.com/show/15188/an ... -vecchio/2

They speak of variable vector width; I am not sure whether they will have
dedicated tensor processing units.

AMD splits its GPUs into the RDNA and CDNA architectures, for consumer and
high-performance computing respectively:

https://www.anandtech.com/show/15593/am ... ta-centers

Again, no word about dedicated tensor processing units. It looks like CDNA is
based on the GCN architecture.

AMD talks about the upcoming RDNA2 ("Big-Navi", "Navi 2x") GPUs:

https://www.anandtech.com/show/15591/am ... erfperwatt

--
Srdja
noobpwnftw
Posts: 560
Joined: Sun Nov 08, 2015 11:10 pm

Re: GPU rumors 2020

Post by noobpwnftw »

Maybe people just don't need one.

https://arxiv.org/abs/1903.03129
Leo
Posts: 1080
Joined: Fri Sep 16, 2016 6:55 pm
Location: USA/Minnesota
Full name: Leo Anger

Re: GPU rumors 2020

Post by Leo »

noobpwnftw wrote: Sun Mar 08, 2020 5:42 pm Maybe people just don't need one.

https://arxiv.org/abs/1903.03129
Interesting article.
Advanced Micro Devices fan.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: GPU rumors 2020

Post by Dann Corbit »

Wouldn't SLIDE run better on a GPU (or perhaps an integrated memory system where GPU and CPU share the same very fast memory)? The AMD Infinity Fabric approach seems to be reaching for that.

For example, Matthew Lai first wrote a general purpose CPU version of his program, which did well.
But it was the tensor processor that made it knock everybody's socks off.

Of course, smart is good. The thing about GPU game solutions that is amazing is perhaps not that they do so well, but that they don't do nearly so well as one might think, given the compute resources at their command. There are consumer GPUs that do about 14,000 GFlops. The AMD 3990x at full tilt does about 425 GFlops. That's five doublings in speed. How is it that Stockfish can ever win a game against that thing?

In fp16, one 2080TI Super does 22 TFlops, so two of them give 44 TFlops.
44000/425 is roughly 100 times faster.
And the FLOPS per dollar is even better.
Two 2080 TI cards are about $1500 and one 3990x is about $4000.
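
The arithmetic above can be checked with a short sketch. All throughput and price figures are the ones quoted in this post, not independent measurements, and the variable names are made up for illustration.

```python
import math

# Figures as quoted in the post (assumptions, not verified benchmarks).
GPU_GFLOPS = 14_000            # "consumer GPUs that do about 14,000 GFlops"
CPU_GFLOPS = 425               # AMD 3990x "at full tilt"
GPU_FP16_GFLOPS_EACH = 22_000  # one card in fp16
GPU_PAIR_PRICE_USD = 1_500     # two cards
CPU_PRICE_USD = 4_000


def doublings(ratio):
    """Number of times a value must double to grow by `ratio`."""
    return math.log2(ratio)


single_ratio = GPU_GFLOPS / CPU_GFLOPS
pair_ratio = 2 * GPU_FP16_GFLOPS_EACH / CPU_GFLOPS

gpu_gflops_per_usd = 2 * GPU_FP16_GFLOPS_EACH / GPU_PAIR_PRICE_USD
cpu_gflops_per_usd = CPU_GFLOPS / CPU_PRICE_USD
per_dollar_advantage = gpu_gflops_per_usd / cpu_gflops_per_usd

print(f"one GPU vs CPU: {single_ratio:.1f}x ({doublings(single_ratio):.2f} doublings)")
print(f"two GPUs (fp16) vs CPU: {pair_ratio:.1f}x")
print(f"GFlops per dollar advantage: {per_dollar_advantage:.0f}x")
```

Note that this mixes precisions (fp16 on the GPU side), so it is a comparison of headline numbers rather than of like-for-like work.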

I think we need to learn how to use them better.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
grahamj
Posts: 43
Joined: Thu Oct 11, 2018 2:26 pm
Full name: Graham Jones

Re: GPU rumors 2020

Post by grahamj »

The SLIDE algorithm is not going to make GPUs redundant. I doubt it will make much of a dent in their sales.
  • SLIDE only applies to fully connected nets, not to convolutional nets as used in LCZero and, I think, in almost all processing of visual information.
  • SLIDE was only shown to speed up certain 'extreme classification' tasks.
  • SLIDE only speeds up training, not inference.
The authors are quite open about choosing tasks where their algorithm shines. This is fine - it's a significant achievement to outperform the state of the art even for a small subset of ML tasks.
"To show SLIDE's real advantage, we will need large networks where even a slight decrease in performance is noticeable. Thus, the publicly available extreme classification datasets, requiring more than 100 million parameters to train due to their extremely wide last layer, fit this setting appropriately. For these tasks, most of the computations (more than 99%) are in the final layer."
More about one of the two data sets they used:
"The objective in extreme multi-label learning is to learn a classifier that can automatically tag a datapoint with the most relevant subset of labels from an extremely large label set. This competition provides the Amazon-670 dataset which is a product to product recommendation dataset. The task here is to learn an extreme classifier such that given a product's description it can recommend other products (out of possible 670K products) that a user might be interested in buying."
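
To see why the final layer can dominate, here is a rough count of multiply-accumulates per sample. Only the 670K label count comes from the text above; the hidden width and the number of non-zero input features are assumptions picked for illustration, not values taken from the paper.

```python
def dense_layer_macs(active_inputs, outputs):
    """Multiply-accumulates for one sample through a fully connected layer."""
    return active_inputs * outputs


NUM_LABELS = 670_000          # "670K products" from the dataset description
HIDDEN_WIDTH = 128            # assumption: a single narrow hidden layer
NONZERO_INPUT_FEATURES = 100  # assumption: sparse input, few active features

# Only the non-zero inputs contribute work in the first layer.
first_layer = dense_layer_macs(NONZERO_INPUT_FEATURES, HIDDEN_WIDTH)
# Every label gets a score, so the last layer is dense and very wide.
last_layer = dense_layer_macs(HIDDEN_WIDTH, NUM_LABELS)

last_layer_share = last_layer / (first_layer + last_layer)
print(f"first layer MACs: {first_layer:,}")
print(f"last layer MACs:  {last_layer:,}")
print(f"share of work in last layer: {last_layer_share:.4%}")
```

Under these assumed sizes the last layer accounts for well over 99% of the work, which is the regime the authors say they chose.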
There is some confusion over the hardware used, among people who find it easier to speculate than to read, e.g. https://wccftech.com/intel-ai-breakthro ... -v100-gpu/.

From the article:
"All the experiments are conducted on a server equipped with two 22-core/44-thread processors (Intel Xeon E5-2699A v4 2.40GHz) and one NVIDIA Tesla V100 Volta 32GB GPU."
Graham Jones, www.indriid.com
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

It may also be worth mentioning that AMD/Cray will build the third U.S.
exaFLOP system, named El Capitan, planned for 2023. I am not sure why the DOE
did not choose IBM/Nvidia; it would make more sense to me to bring in a third
domestic player besides Intel and AMD.

https://www.anandtech.com/show/15581/el ... 2-exaflops

They speak of 'unified memory' across CPUs and GPUs via Infinity Fabric gen 3.

https://www.anandtech.com/show/15596/am ... everything

Nvidia's NVLink and AMD's Infinity Fabric are already used to couple several
GPUs together and run, for example, neural networks with billions of parameters.

https://en.wikipedia.org/wiki/NVLink

https://en.wikipedia.org/wiki/HyperTran ... ity_Fabric

--
Srdja
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: GPU rumors 2020

Post by Zenmastur »

Dann Corbit wrote: Sun Mar 08, 2020 8:34 pm wouldn't slide run better on a GPU (or perhaps an integrated memory system where GPU and CPU share the same very fast memory)? The AMD infinity fabric approach seems to be reaching for that.

For example, Matthew Lai first wrote a general purpose CPU version of his program, which did well.
But it was the tensor processor that made it knock everybody's socks off.

Of course, smart is good. The thing about GPU game solutions that is amazing is perhaps not that they do so well, but that they don't do nearly so well as one might think, given the compute resources at their command. There are consumer GPUs that do about 14,000 GFlops. The AMD 3990x at full tilt does about 425 GFlops. That's five doublings in speed. How is it that Stockfish can ever win a game against that thing?

In fp16, one 2080TI Super does 22 TFlops, so two of them gives 44 TFlops.
44000/425=100 times faster
And the FPS per dollar is even better.
Two 2080 TI cards are about $1500 and one 3990x is about $4000.

I think we need to learn how to use them better.
The more traditional definition of a teraflop is a trillion double precision floating point operations per second. How this got redefined to be "any" precision floating point operations per second is beyond me. So, I'm not sure your comparison is technically correct. It takes four FP16 mults to make an FP32 mult and four FP32 mults to make an FP64 mult. So an FP64 mult is 16 times the work of an FP16 mult.
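
The "four narrow multiplies per wide multiply" rule can be illustrated with unsigned integers. This is only an analogy and a sketch: real floating-point significands are 11, 24, and 53 bits wide (half, single, double), so the factor of four per step is approximate, and nothing here is meant to describe how any actual GPU implements it. The function names are hypothetical.

```python
def split_multiply(a, b, half_bits):
    """Multiply two (2 * half_bits)-bit unsigned integers using
    four half-width multiplies (schoolbook decomposition)."""
    mask = (1 << half_bits) - 1
    a_lo, a_hi = a & mask, a >> half_bits
    b_lo, b_hi = b & mask, b >> half_bits

    # The four half-width partial products.
    lo_lo = a_lo * b_lo
    lo_hi = a_lo * b_hi
    hi_lo = a_hi * b_lo
    hi_hi = a_hi * b_hi

    # Recombine the partial products at their bit positions.
    return (hi_hi << (2 * half_bits)) + ((lo_hi + hi_lo) << half_bits) + lo_lo


def narrow_multiplies_needed(width_doublings):
    """Narrow multiplies per wide multiply if each doubling of width costs 4x."""
    return 4 ** width_doublings


x, y = 0xDEADBEEF, 0x12345678
print(split_multiply(x, y, 16) == x * y)
print(narrow_multiplies_needed(2))  # two doublings: 16-bit to 32-bit to 64-bit
```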
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
Dann Corbit
Posts: 12538
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: GPU rumors 2020

Post by Dann Corbit »

FLOP stands for floating-point operation. I guess originally it was 32-bit, since most work was done in 32 bits back when the term came about (but I am not sure about that). I believe that this was the paper that coined the term:
https://pdfs.semanticscholar.org/4343/5 ... 8d2180.pdf

Often, the GFlops or TFlops will have the precision specified, such as the cases mentioned.
https://en.wikipedia.org/wiki/FLOPS

At any rate, for chess the width is largely irrelevant, though there may be some limit: there is even talk of 8-bit and 4-bit floats, and it is possible that the loss of resolution would matter. But for LC0, for instance, 16-bit float is plenty good, and using 16-bit float makes it a good deal stronger.
Zenmastur
Posts: 919
Joined: Sat May 31, 2014 8:28 am

Re: GPU rumors 2020

Post by Zenmastur »

Dann Corbit wrote: Tue Mar 10, 2020 3:28 am FLOP is floating point operation. I guess originally it was 32 bit, since most work was done in 32 bits back when the term came about (but I am not sure about that). I believe that this was the paper that coined the term:
https://pdfs.semanticscholar.org/4343/5 ... 8d2180.pdf

Often, the GFlops or TFlops will have the precision specified, such as the cases mentioned.
https://en.wikipedia.org/wiki/FLOPS

At any rate for chess anyway, the width is irrelevant (though there may be some limit, there is even talk of 8 and 4 bit floats and it is possible that the loss of resolution would matter). But for LC0, for instance, 16 bit float is plenty good and using 16 bit float makes it a good deal stronger.
The term was in use long before that paper was written. It was in use as far back as the Cyber 205 series of supercomputers. Although back then it was Megaflops. :D :D :D
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

Zenmastur wrote: Mon Mar 09, 2020 11:21 pm
Dann Corbit wrote: Sun Mar 08, 2020 8:34 pm wouldn't slide run better on a GPU (or perhaps an integrated memory system where GPU and CPU share the same very fast memory)? The AMD infinity fabric approach seems to be reaching for that.

For example, Matthew Lai first wrote a general purpose CPU version of his program, which did well.
But it was the tensor processor that made it knock everybody's socks off.

Of course, smart is good. The thing about GPU game solutions that is amazing is perhaps not that they do so well, but that they don't do nearly so well as one might think, given the compute resources at their command. There are consumer GPUs that do about 14,000 GFlops. The AMD 3990x at full tilt does about 425 GFlops. That's five doublings in speed. How is it that Stockfish can ever win a game against that thing?

In fp16, one 2080TI Super does 22 TFlops, so two of them gives 44 TFlops.
44000/425=100 times faster
And the FPS per dollar is even better.
Two 2080 TI cards are about $1500 and one 3990x is about $4000.

I think we need to learn how to use them better.
The more traditional definition of a teraflop is a trillion double precision floating point operations per second. How this got redefined to be "any" precision floating point operations per second is beyond me. So, I'm not sure your comparison is technically correct. It takes four FP16 mult to make a FP32 mult and four FP32 mult to make an FP64 mult. So a FP64 mult is 16 times the work of a FP16 mult.
In the supercomputer realm, FLOP usually refers to FP64 (double precision),
because the LINPACK benchmark uses 64-bit floating-point operations to
determine the TOP500. Consumer GPUs are usually crippled in this regard: they
can have FP32:FP64 ratios from 4:1 down to 32:1 or so, while the current HPC
server-class GPUs have a ratio of 2:1. I am still not sure whether they use
variable vector width for this or have dedicated FP64 units. Meanwhile, people
have started to use simply OPs to refer to tensor operations per second, or
the like, with different bit widths...
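
As a rough sketch of what those ratios mean in practice, the snippet below derives FP64 throughput from an FP32 figure. The ratios are the ones mentioned above; the 14,000 GFlops starting point is borrowed from earlier in the thread and treated here as a hypothetical FP32 peak, not a measured value for any specific card.

```python
def fp64_gflops(fp32_gflops, fp32_to_fp64_ratio):
    """Peak FP64 throughput implied by an FP32 peak and an FP32:FP64 ratio."""
    return fp32_gflops / fp32_to_fp64_ratio


HYPOTHETICAL_FP32_GFLOPS = 14_000  # assumption, taken from earlier in the thread

# Ratios mentioned in the post: 2:1 (HPC class), 4:1 and 32:1 (consumer).
for ratio in (2, 4, 32):
    peak = fp64_gflops(HYPOTHETICAL_FP32_GFLOPS, ratio)
    print(f"FP32:FP64 = {ratio}:1 -> {peak:,.1f} GFlops FP64")
```

At 32:1 such a card would land close to the 425 GFlops quoted earlier for the CPU, which is why the precision behind a FLOPS figure matters for the comparison.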

--
Srdja