Follow-up:
Intel's CFO seems not satisfied with the 10nm node:
https://www.anandtech.com/show/15580/in ... -than-22nm
Intel splits its upcoming Xe GPUs into three categories, Xe-LP, Xe-HP, Xe-HPC:
https://www.anandtech.com/show/15188/an ... -vecchio/2
They speak of variable vector width; not sure if they will have dedicated Tensor Processing Units.
AMD splits its GPUs into the RDNA and CDNA architectures, aimed at consumer and High Performance Computing markets respectively.
https://www.anandtech.com/show/15593/am ... ta-centers
Again, no word about dedicated Tensor Processing Units. It looks like CDNA is based on the GCN architecture.
AMD talks about its upcoming RDNA2 GPU, "Big Navi" / "Navi 2x":
https://www.anandtech.com/show/15591/am ... erfperwatt
--
Srdja
GPU rumors 2020
Moderators: hgm, Rebel, chrisw
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
-
- Posts: 560
- Joined: Sun Nov 08, 2015 11:10 pm
-
- Posts: 1080
- Joined: Fri Sep 16, 2016 6:55 pm
- Location: USA/Minnesota
- Full name: Leo Anger
Re: GPU rumors 2020
noobpwnftw wrote: ↑Sun Mar 08, 2020 5:42 pm
Maybe people just don't need one.
https://arxiv.org/abs/1903.03129
Interesting article.
Advanced Micro Devices fan.
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: GPU rumors 2020
Wouldn't SLIDE run better on a GPU (or perhaps an integrated memory system where GPU and CPU share the same very fast memory)? The AMD Infinity Fabric approach seems to be reaching for that.
For example, Matthew Lai first wrote a general purpose CPU version of his program, which did well.
But it was the tensor processor that made it knock everybody's socks off.
Of course, smart is good. The thing about GPU game solutions that is amazing is perhaps not that they do so well, but that they don't do nearly so well as one might think, given the compute resources at their command. There are consumer GPUs that do about 14,000 GFlops. The AMD 3990x at full tilt does about 425 GFlops. That's five doublings in speed. How is it that Stockfish can ever win a game against that thing?
In fp16, one 2080 Ti does about 22 TFlops, so two of them give 44 TFlops:
44000 / 425 ≈ 100 times faster.
And the FLOPS per dollar are even better.
Two 2080 TI cards are about $1500 and one 3990x is about $4000.
I think we need to learn how to use them better.
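The arithmetic above is easy to check in a few lines. A quick sketch, using only the peak figures quoted in this thread (rough vendor peaks, not benchmarks):

```python
# Back-of-the-envelope check of the speedup figures quoted above.
# All numbers are the rough peaks and street prices from the post.
import math

gpu_gflops = 14_000          # consumer GPU, quoted fp32 peak
cpu_gflops = 425             # AMD 3990x, quoted peak

ratio = gpu_gflops / cpu_gflops
print(f"GPU/CPU: {ratio:.1f}x, about {math.log2(ratio):.1f} doublings")

# Two 2080 Ti class cards at 22 TFlops fp16 each:
pair_gflops = 2 * 22_000
print(f"fp16 pair vs CPU: {pair_gflops / cpu_gflops:.0f}x")

# Dollars per GFlop at the quoted street prices:
print(f"GPU pair: ${1500 / pair_gflops:.3f}/GFlop")
print(f"3990x:    ${4000 / cpu_gflops:.2f}/GFlop")
```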
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 43
- Joined: Thu Oct 11, 2018 2:26 pm
- Full name: Graham Jones
Re: GPU rumors 2020
The SLIDE algorithm is not going to make GPUs redundant. I doubt it will make much of a dent in their sales.
From the article:
- SLIDE only applies to fully connected nets, not to convolutional nets as used in LCZero and, I think, in almost all processing of visual information.
- SLIDE was only shown to speed up certain 'extreme classification' tasks.
- SLIDE only speeds up training, not inference.
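For readers curious what the speedup mechanism looks like: SLIDE's core trick is to hash the input activation with locality-sensitive hashing and only evaluate the output neurons whose weight vectors fall in the same bucket, rather than all of them. A toy sketch of that idea (not the authors' code; the sizes and names here are made up for illustration, and real SLIDE uses multiple hash tables and updates them during training):

```python
# Toy sketch of LSH-based sparse evaluation of a wide last layer.
# With K output classes in the hundreds of thousands and >99% of the
# work in the final layer, evaluating only a bucket of candidate
# neurons is where SLIDE's savings come from.
import random
from collections import defaultdict

random.seed(0)
D, K, BITS = 16, 1000, 8      # input dim, output classes, hash bits

# SimHash: sign pattern of a vector against BITS random hyperplanes.
planes = [[random.gauss(0, 1) for _ in range(D)] for _ in range(BITS)]

def simhash(v):
    """Return the tuple of signs of v against the random hyperplanes."""
    return tuple(sum(p[i] * v[i] for i in range(D)) >= 0 for p in planes)

# Random last-layer weights; bucket each output neuron by its hash.
W = [[random.gauss(0, 1) for _ in range(D)] for _ in range(K)]
buckets = defaultdict(list)
for k in range(K):
    buckets[simhash(W[k])].append(k)

def sparse_logits(x):
    """Compute logits only for the neurons in x's hash bucket."""
    return {k: sum(W[k][i] * x[i] for i in range(D))
            for k in buckets[simhash(x)]}

x = [random.gauss(0, 1) for _ in range(D)]
logits = sparse_logits(x)
print(f"evaluated {len(logits)} of {K} output neurons")
```

Neurons that share x's bucket are the ones whose weight vectors point roughly the same way as x, i.e. the ones likely to have the largest logits, which is why the approximation works for classification.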
More about one of the two data sets they used. From the article:
To show SLIDE's real advantage, we will need large networks where even a slight decrease in performance is noticeable. Thus, the publicly available extreme classification datasets, requiring more than 100 million parameters to train due to their extremely wide last layer, fit this setting appropriately. For these tasks, most of the computations (more than 99%) are in the final layer.
There is some confusion over the hardware used, among people who find it easier to speculate than to read, e.g. https://wccftech.com/intel-ai-breakthro ... -v100-gpu/.
The objective in extreme multi-label learning is to learn a classifier that can automatically tag a datapoint with the most relevant subset of labels from an extremely large label set. This competition provides the Amazon-670 dataset which is a product to product recommendation dataset. The task here is to learn an extreme classifier such that given a product's description it can recommend other products (out of possible 670K products) that a user might be interested in buying.
From the article:
All the experiments are conducted on a server equipped with two 22-core/44-thread processors (Intel Xeon E5-2699A v4 2.40GHz) and one NVIDIA Tesla V100 Volta 32GB GPU.
Graham Jones, www.indriid.com
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: GPU rumors 2020
Maybe also worth mentioning: AMD/Cray will build the third U.S. exaFLOP system, named El Capitan, planned for 2023. Not sure why the DOE did not choose IBM/Nvidia; it would make more sense to me to take a third domestic player besides Intel and AMD.
https://www.anandtech.com/show/15581/el ... 2-exaflops
They speak of 'unified memory' across CPUs and GPUs via Infinity Fabric gen 3.
https://www.anandtech.com/show/15596/am ... everything
Nvidia's NVLink and AMD's Infinity Fabric are already used to couple several GPUs together and run, for example, neural networks with billions of parameters.
https://en.wikipedia.org/wiki/NVLink
https://en.wikipedia.org/wiki/HyperTran ... ity_Fabric
--
Srdja
-
- Posts: 919
- Joined: Sat May 31, 2014 8:28 am
Re: GPU rumors 2020
The more traditional definition of a teraflop is a trillion double-precision floating point operations per second. How this got redefined to mean floating point operations of "any" precision is beyond me, so I'm not sure your comparison is technically correct. It takes four FP16 mults to make an FP32 mult and four FP32 mults to make an FP64 mult, so an FP64 mult is 16 times the work of an FP16 mult.
Dann Corbit wrote: ↑Sun Mar 08, 2020 8:34 pm
wouldn't slide run better on a GPU (or perhaps an integrated memory system where GPU and CPU share the same very fast memory)? The AMD infinity fabric approach seems to be reaching for that.
For example, Matthew Lai first wrote a general purpose CPU version of his program, which did well.
But it was the tensor processor that made it knock everybody's socks off.
Of course, smart is good. The thing about GPU game solutions that is amazing is perhaps not that they do so well, but that they don't do nearly so well as one might think, given the compute resources at their command. There are consumer GPUs that do about 14,000 GFlops. The AMD 3990x at full tilt does about 425 GFlops. That's five doublings in speed. How is it that Stockfish can ever win a game against that thing?
In fp16, one 2080TI Super does 22 TFlops, so two of them gives 44 TFlops.
44000/425=100 times faster
And the FPS per dollar is even better.
Two 2080 TI cards are about $1500 and one 3990x is about $4000.
I think we need to learn how to use them better.
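Under that 16:1 assumption the thread's comparison can be renormalized to a common precision. A small sketch; note the 425 GFlops CPU figure is assumed here to be FP64, which may not be what the original poster meant:

```python
# Normalizing the thread's peak-throughput figures to FP64-equivalent,
# under the poster's assumption that each halving of width quadruples
# throughput (FP64 work = 4x FP32 = 16x FP16). Quoted peaks, not
# benchmarks.
def to_fp64_equiv(tflops, precision):
    """Scale a peak-TFlops figure down to its FP64-equivalent."""
    factor = {16: 16, 32: 4, 64: 1}[precision]
    return tflops / factor

two_2080ti_fp16 = 44.0       # TFlops fp16, quoted in the thread
cpu_3990x_fp64 = 0.425       # TFlops, assumed FP64

gpu_equiv = to_fp64_equiv(two_2080ti_fp16, 16)
print(f"GPU pair, FP64-equivalent: {gpu_equiv:.2f} TFlops")
print(f"ratio vs 3990x: {gpu_equiv / cpu_3990x_fp64:.1f}x")
```

On this accounting the 100x gap shrinks to roughly 6.5x, which is the poster's point.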
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
-
- Posts: 12541
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: GPU rumors 2020
FLOP is floating point operation. I guess originally it was 32 bit, since most work was done in 32 bits back when the term came about (but I am not sure about that). I believe that this was the paper that coined the term:
https://pdfs.semanticscholar.org/4343/5 ... 8d2180.pdf
Often, the GFlops or TFlops figure will have the precision specified, as in the cases mentioned.
https://en.wikipedia.org/wiki/FLOPS
At any rate, for chess the width is largely irrelevant (though there may be some limit; there is even talk of 8- and 4-bit floats, and it is possible that the loss of resolution would matter). For LC0, for instance, 16-bit float is plenty good, and using it makes the engine a good deal stronger.
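The resolution loss at narrow widths is easy to demonstrate with Python's stdlib half-precision codec (the `struct` format 'e'), as a small illustration:

```python
# How much resolution is lost at fp16: encode a value as IEEE 754
# half precision and decode it back. fp16 has a 10-bit mantissa, so
# integers above 2048 are no longer exactly representable.
import struct

def roundtrip_fp16(x):
    """Encode x as IEEE 754 half precision ('e') and decode it back."""
    return struct.unpack('e', struct.pack('e', x))[0]

for x in [0.1, 1.0 / 3.0, 2048.0, 2049.0]:
    print(f"{x!r:22} -> fp16 -> {roundtrip_fp16(x)!r}")
```

Note that 2049.0 comes back as 2048.0: above 2048 the fp16 spacing is already 2.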
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
-
- Posts: 919
- Joined: Sat May 31, 2014 8:28 am
Re: GPU rumors 2020
The term was in use well before that paper was written, as far back as the Cyber 205 series of supercomputers, although back then it was megaflops.
Dann Corbit wrote: ↑Tue Mar 10, 2020 3:28 am
FLOP is floating point operation. I guess originally it was 32 bit, since most work was done in 32 bits back when the term came about (but I am not sure about that). I believe that this was the paper that coined the term:
https://pdfs.semanticscholar.org/4343/5 ... 8d2180.pdf
Often, the GFlops or TFlops will have the precision specified, such as the cases mentioned.
https://en.wikipedia.org/wiki/FLOPS
At any rate for chess anyway, the width is irrelevant (though there may be some limit, there is even talk of 8 and 4 bit floats and it is possible that the loss of resolution would matter). But for LC0, for instance, 16 bit float is plenty good and using 16 bit float makes it a good deal stronger.
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: GPU rumors 2020
In the super-computer realm FLOP usually refers to FP64, double precision:
Zenmastur wrote: ↑Mon Mar 09, 2020 11:21 pm
The more traditional definition of a teraflop is a trillion double precision floating point operations per second. How this got redefined to be "any" precision floating point operations per second is beyond me. So, I'm not sure your comparison is technically correct. It takes four FP16 mult to make a FP32 mult and four FP32 mult to make an FP64 mult. So a FP64 mult is 16 times the work of a FP16 mult.
Dann Corbit wrote: ↑Sun Mar 08, 2020 8:34 pm
wouldn't slide run better on a GPU (or perhaps an integrated memory system where GPU and CPU share the same very fast memory)? The AMD infinity fabric approach seems to be reaching for that.
For example, Matthew Lai first wrote a general purpose CPU version of his program, which did well.
But it was the tensor processor that made it knock everybody's socks off.
Of course, smart is good. The thing about GPU game solutions that is amazing is perhaps not that they do so well, but that they don't do nearly so well as one might think, given the compute resources at their command. There are consumer GPUs that do about 14,000 GFlops. The AMD 3990x at full tilt does about 425 GFlops. That's five doublings in speed. How is it that Stockfish can ever win a game against that thing?
In fp16, one 2080TI Super does 22 TFlops, so two of them gives 44 TFlops.
44000/425=100 times faster
And the FPS per dollar is even better.
Two 2080 TI cards are about $1500 and one 3990x is about $4000.
I think we need to learn how to use them better.
The LINPACK benchmark, which determines the TOP500, uses 64-bit floating point operations. Consumer GPUs are usually crippled in this regard: they can have FP32:FP64 ratios from 4:1 down to 32:1 or so, while the current HPC server-class GPUs have a ratio of 2:1. I'm still not sure if they use variable vector width for this or have dedicated FP64 units. Meanwhile, people are starting to use simply 'OPs' to refer to tensor operations per second or the like, with different bit widths...
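As a quick illustration of what those ratios mean for peak throughput (the 20 TFlops FP32 figure is a made-up round number for illustration, not any particular card):

```python
# How the FP32:FP64 ratio caps double-precision peak throughput.
# Ratios are the ranges mentioned above; the FP32 peak is illustrative.
fp32_peak_tflops = 20.0

for label, ratio in [("HPC part (2:1)", 2),
                     ("consumer (4:1)", 4),
                     ("consumer (32:1)", 32)]:
    print(f"{label:16} FP64 peak: {fp32_peak_tflops / ratio:6.3f} TFlops")
```

So the same die can look great on an FP32 (or fp16) spec sheet and still be nearly useless for LINPACK-style FP64 work.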
--
Srdja