GPU rumors 2020

smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

GPU rumors 2020

Post by smatovic »

Nvidia plans to release its next architecture, named Ampere, in 2020, based on
Samsung's and TSMC's 7nm processes; the current Turing architecture is based on
TSMC's 12nm.

Intel plans to release its Xe series of discrete GPUs, based on its own 10nm
process (comparable to TSMC 7nm?), in 2020.

AMD plans to enter the high-end GPU market in 2020 with Navi 23, based on the
RDNA2 architecture (TSMC 7nm+), to counter Nvidia's RTX 2080 line of GPUs.

Nvidia currently has little pressure in the high-end market, so I see two options
for Ampere: a) keep current core counts and only increase clock rates and lower
power consumption, or b) double overall performance to counter the upcoming GPUs
from Intel and AMD and keep the upper hand.

2020/2021 will be interesting, because Intel will work on the exaflop supercomputer
named Aurora and AMD on Frontier; both will use 'GPU-like' accelerators with new
kinds of interconnects that offer new features and may make it into the
desktop market in a few years.

I have heard nothing about Nvidia-TensorCore-like units (which boost NN inference
in LC0) for Intel or AMD GPU architectures. Intel will probably try to push its
own ASIC inference hardware (Nervana), FPGAs, and even CPUs with AVX-512 in
this segment. So AMD seems to have a gap in its lineup here.

By quantity, the low-end segment of GPUs (GTX 1050/RX 580) is the most important,
and AMD was able to gain some market share here, so I guess Nvidia will keep the
split into an RTX series with TensorCores and a GTX series without.

On the architecture side it is possible that all 2020 GPUs will support INT8
math with higher throughput; AFAIK AlphaZero used INT8 NN inference on its
first-generation TPUs, so this could be a further step for LC0 in the future...
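
As a toy illustration (plain C++ on the CPU, not code from LC0 or AlphaZero, and with made-up numbers), this is roughly what symmetric per-tensor INT8 quantization looks like: weights and inputs are scaled into the int8 range, multiplied with int32 accumulation, and the result is scaled back to FP32. INT8-capable hardware runs this kind of arithmetic at much higher throughput than FP32.

Code: Select all

    // Toy example: symmetric per-tensor INT8 quantization and an int8 dot
    // product with int32 accumulation, dequantized back to float at the end.
    #include <algorithm>
    #include <cmath>
    #include <cstdint>
    #include <iostream>
    #include <vector>

    int main() {
        std::vector<float> weights = {0.75f, -1.20f, 0.05f, 2.40f};
        std::vector<float> inputs  = {1.00f,  0.50f, -2.00f, 0.25f};

        // One scale per tensor: map the largest magnitude onto +/-127.
        auto scaleOf = [](const std::vector<float>& v) {
            float maxAbs = 0.0f;
            for (float x : v) maxAbs = std::max(maxAbs, std::fabs(x));
            return maxAbs / 127.0f;
        };
        const float wScale = scaleOf(weights);
        const float iScale = scaleOf(inputs);

        auto quantize = [](const std::vector<float>& v, float scale) {
            std::vector<int8_t> q;
            for (float x : v) q.push_back(static_cast<int8_t>(std::lround(x / scale)));
            return q;
        };
        const std::vector<int8_t> qw = quantize(weights, wScale);
        const std::vector<int8_t> qi = quantize(inputs, iScale);

        // INT8 multiplies accumulated in INT32 -- the operation that
        // INT8-capable GPU/TPU hardware accelerates.
        int32_t acc = 0;
        for (std::size_t k = 0; k < qw.size(); ++k)
            acc += static_cast<int32_t>(qw[k]) * static_cast<int32_t>(qi[k]);

        float fp32 = 0.0f;
        for (std::size_t k = 0; k < weights.size(); ++k) fp32 += weights[k] * inputs[k];

        std::cout << "int8 dot product (dequantized): " << acc * wScale * iScale << "\n";
        std::cout << "fp32 reference:                 " << fp32 << "\n";
    }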

--
Srdja
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: GPU rumors 2020

Post by dragontamer5788 »

The big news for GPUs seems to be better support for memory coherence with the host. OpenCL 2.0 introduced shared virtual memory and CUDA has pinned memory, so "memory coherence with the host" is an old trick by now... but what's changing is how fast this will get in the future.

* PCIe 4.0 will double bandwidth from 15.7 GB/s to 31.4 GB/s (x16). Still slower than DDR4, but that's a very thick pipe for communications.

* CCIX is creating a standard for cache coherence between CPUs and accelerators over PCIe 4.0. If CCIX (or a similar protocol) becomes widely deployed, then a "lock cmpxchg" instruction on the x86 host will be coherent with GPU caches. Technically OpenCL 2.0 is ready for this with "Fine-grained Shared Virtual Memory", but I don't believe that has been practically implemented yet (AMD and Intel only support "coarse-grained SVM"); a small capability-check sketch is at the end of this post.

* Competing protocols, such as Compute Express Link (Intel), OpenCAPI (POWER9), and NVLink, are all pushing for cache coherence. So even if CCIX fails as a protocol, cache coherence is firmly going to be supported by somebody in the future.

In effect: the near-term future of GPUs is that they're going to get easier to program. The CPU <---> GPU bridge has always been a weak spot in terms of high-performance programming, so simplifying it and speeding it up is a good thing.
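
To make the SVM point concrete, here is a minimal host-side sketch (C++ against the standard OpenCL 2.0 API, error handling mostly omitted): it queries a device's SVM capabilities and allocates a fine-grained SVM buffer with atomics if the device reports support, which, as noted above, is rare in practice today.

Code: Select all

    // Host-side sketch: query OpenCL 2.0 shared-virtual-memory capabilities
    // and, if fine-grained SVM with atomics is reported, allocate such a
    // buffer. Error handling is mostly omitted for brevity.
    #define CL_TARGET_OPENCL_VERSION 200
    #include <CL/cl.h>
    #include <iostream>

    int main() {
        cl_platform_id platform;
        cl_device_id device;
        clGetPlatformIDs(1, &platform, nullptr);
        clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 1, &device, nullptr);

        cl_device_svm_capabilities svm = 0;
        clGetDeviceInfo(device, CL_DEVICE_SVM_CAPABILITIES, sizeof(svm), &svm, nullptr);
        std::cout << "coarse-grain buffer: " << ((svm & CL_DEVICE_SVM_COARSE_GRAIN_BUFFER) != 0) << "\n";
        std::cout << "fine-grain buffer:   " << ((svm & CL_DEVICE_SVM_FINE_GRAIN_BUFFER) != 0) << "\n";
        std::cout << "SVM atomics:         " << ((svm & CL_DEVICE_SVM_ATOMICS) != 0) << "\n";

        if ((svm & CL_DEVICE_SVM_FINE_GRAIN_BUFFER) && (svm & CL_DEVICE_SVM_ATOMICS)) {
            cl_int err;
            cl_context ctx = clCreateContext(nullptr, 1, &device, nullptr, nullptr, &err);
            // A pointer that host and device may touch concurrently; platform
            // atomics keep both views of this memory coherent.
            int* shared = static_cast<int*>(clSVMAlloc(ctx,
                CL_MEM_READ_WRITE | CL_MEM_SVM_FINE_GRAIN_BUFFER | CL_MEM_SVM_ATOMICS,
                sizeof(int), 0));
            if (shared) { *shared = 42; clSVMFree(ctx, shared); }
            clReleaseContext(ctx);
        }
        return 0;
    }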
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

Yeah, I guess Intel and AMD will use something new in their exaflop systems,
and Nvidia will continue its partnership with IBM; not sure when and if this
will trickle down to consumer products.

I agree, real memory coherence will enable a new kind of GPGPU, and I wish to add
that significantly lower host-device latencies would be a further step, maybe one
that makes AB engines with GPU ANNs feasible in the future.

--
Srdja
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: GPU rumors 2020

Post by dragontamer5788 »

smatovic wrote: Wed Nov 13, 2019 5:28 pm Yeah, I guess Intel and AMD will use something new in their exaflop systems,
and Nvidia will continue its partnership with IBM; not sure when and if this
will trickle down to consumer products.

I agree, real memory coherence will enable a new kind of GPGPU, and I wish to add
that significantly lower host-device latencies would be a further step, maybe one
that makes AB engines with GPU ANNs feasible in the future.

--
Srdja
Host-device latencies are probably not going away: the raw speed of electricity is not going to change, and the hardware engineers responsible for PCIe are already using all known tricks to drop latency as low as possible.

What cache coherence gets us is a unified language, where atomics can "give permission" to the CPU, bus, and GPU to selectively execute some statements out of order with respect to each other. Relaxed atomics say "here's a bunch of atomic operations, but I don't actually care what order they're executed in".

While acquire/release atomics say "THIS acquire has to come before this release... but feel free to reorder other statements partially around it". (An acquire is a half-barrier that allows operations to be moved to after it, but not to before it; a release is a half-barrier that allows operations to be moved to before it, but not to after it.)

And of course, sequentially consistent atomics define a total ordering: the order of all reads/writes to all memory locations is fully defined. This is the expectation of programmers who are untrained in the ways of parallel processing, but it's too slow to actually do in practice.

------

So it's not so much that host-device latencies are going away. It's that we are getting a new standard language that gives our machines permission to execute in parallel: out of order with respect to each other. This gives us the illusion that the latency disappeared, but in reality the machines just "found other work to do" while waiting out the same latency.

Acquire/release semantics are becoming very popular. Outside of a couple of obscure situations... most people only need acquire/release semantics for their parallel code to be correct. Full sequential consistency is too slow; fully relaxed atomics are too loose and poorly understood.

In particular, Acquire-release seems "consistent enough" to implement spinlocks.
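
For illustration, here is a minimal CPU-side sketch (C++11 std::atomic_flag, not tied to any particular engine or GPU runtime): the lock spins with memory_order_acquire, the unlock uses memory_order_release, and that pairing alone is enough to make the plain increments inside the critical section correct.

Code: Select all

    // Minimal acquire/release spinlock sketch using std::atomic_flag.
    #include <atomic>
    #include <iostream>
    #include <thread>
    #include <vector>

    std::atomic_flag lockFlag = ATOMIC_FLAG_INIT;
    long long counter = 0;  // protected by the spinlock, no atomics needed

    void lockSpin()   { while (lockFlag.test_and_set(std::memory_order_acquire)) { /* spin */ } }
    void unlockSpin() { lockFlag.clear(std::memory_order_release); }

    int main() {
        std::vector<std::thread> threads;
        for (int t = 0; t < 4; ++t) {
            threads.emplace_back([] {
                for (int i = 0; i < 100000; ++i) {
                    lockSpin();
                    ++counter;   // ordinary write, ordered by acquire/release
                    unlockSpin();
                }
            });
        }
        for (auto& th : threads) th.join();
        std::cout << "counter = " << counter << " (expected 400000)\n";
    }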
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

dragontamer5788 wrote: Thu Nov 14, 2019 8:03 pm
smatovic wrote: Wed Nov 13, 2019 5:28 pm Yeah, I guess Intel and AMD will use something new in their exaflop systems,
and Nvidia will continue its partnership with IBM; not sure when and if this
will trickle down to consumer products.

I agree, real memory coherence will enable a new kind of GPGPU, and I wish to add
that significantly lower host-device latencies would be a further step, maybe one
that makes AB engines with GPU ANNs feasible in the future.

--
Srdja
Host-device latencies are probably not going away: the raw speed of electricity is not going to change, and the hardware engineers responsible for PCIe are already using all known tricks to drop latency as low as possible.

What cache coherence gets us is a unified language, where atomics can "give permission" to the CPU, bus, and GPU to selectively execute some statements out of order with respect to each other. Relaxed atomics say "here's a bunch of atomic operations, but I don't actually care what order they're executed in".

While acquire/release atomics say "THIS acquire has to come before this release... but feel free to reorder other statements around it". (The order of operations on a particular memory location is well defined... but ordering across different memory locations is not.)

And of course, sequentially consistent atomics define a total ordering: the order of all reads/writes to all memory locations is fully defined. This is the expectation of programmers who are untrained in the ways of parallel processing, but it's too slow to actually do in practice.

------

So it's not so much that host-device latencies are going away. It's that we are getting a new standard language that gives our machines permission to execute in parallel: out of order with respect to each other. This gives us the illusion that the latency disappeared, but in reality the machines just "found other work to do" while waiting out the same latency.
I meant the host-device latencies, also known as kernel launch overhead: they
can vary between 5ms and 100ms on different systems for null-kernels.
This is primarily not caused by the PCIe connection, but (speculation) by the
little embedded CPU controller in all GPUs that launches the kernels; I admit
there is little documentation out there to make further statements. But I guess
it would be possible to remove this embedded CPU and access the SIMD
units directly... just thinking out loud.

---
Srdja

***edit***

Link to a PDF about Nvidia's Falcon microcontroller:

https://riscv.org/wp-content/uploads/20 ... ory_V2.pdf
grahamj
Posts: 43
Joined: Thu Oct 11, 2018 2:26 pm
Full name: Graham Jones

Re: GPU rumors 2020

Post by grahamj »

smatovic wrote: Thu Nov 14, 2019 8:15 pm I meant the host-device latencies, also known as kernel launch overhead: they
can vary between 5ms and 100ms on different systems for null-kernels.
This is primarily not caused by the PCIe connection, but (speculation) by the
little embedded CPU controller in all GPUs that launches the kernels; I admit
there is little documentation out there to make further statements. But I guess
it would be possible to remove this embedded CPU and access the SIMD
units directly... just thinking out loud.
I don't understand these timings. I presume you meant usec not ms, but they still seem wrong. If you synchronise the host and the device (and transferring data one way or the other requires synchronisation) you can get up to 100usec on Windows due to WDDM.

This (first answer) makes sense to me, and accords with my experience:
https://stackoverflow.com/questions/270 ... ls-in-cuda
Graham Jones, www.indriid.com
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

grahamj wrote: Fri Nov 15, 2019 10:17 pm
smatovic wrote: Thu Nov 14, 2019 8:15 pm I meant the host-device latencies, also known as kernel launch overhead: they
can vary between 5ms and 100ms on different systems for null-kernels.
This is primarily not caused by the PCIe connection, but (speculation) by the
little embedded CPU controller in all GPUs that launches the kernels; I admit
there is little documentation out there to make further statements. But I guess
it would be possible to remove this embedded CPU and access the SIMD
units directly... just thinking out loud.
I don't understand these timings. I presume you meant usec not ms, but they still seem wrong. If you synchronise the host and the device (and transferring data one way or the other requires synchronisation) you can get up to 100usec on Windows due to WDDM.

This (first answer) makes sense to me, and accords with my experience:
https://stackoverflow.com/questions/270 ... ls-in-cuda
Ah, yes, usec - microseconds, not milliseconds.

I know this is not the way to utilize GPUs, but I measured sequential kernel calls on my outdated system

http://www.talkchess.com/forum3/viewtop ... 47#p791580

to see if an AlphaBeta search on the CPU with small GPU ANNs is feasible, and it is not, because of the latencies.

--
Srdja
smatovic
Posts: 2642
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU rumors 2020

Post by smatovic »

smatovic wrote: Sat Nov 16, 2019 10:23 am
grahamj wrote: Fri Nov 15, 2019 10:17 pm
smatovic wrote: Thu Nov 14, 2019 8:15 pm I meant the host-device latencies, also known as kernel launch overhead: they
can vary between 5ms and 100ms on different systems for null-kernels.
This is primarily not caused by the PCIe connection, but (speculation) by the
little embedded CPU controller in all GPUs that launches the kernels; I admit
there is little documentation out there to make further statements. But I guess
it would be possible to remove this embedded CPU and access the SIMD
units directly... just thinking out loud.
I don't understand these timings. I presume you meant usec not ms, but they still seem wrong. If you synchronise the host and the device (and transferring data one way or the other requires synchronisation) you can get up to 100usec on Windows due to WDDM.

This (first answer) makes sense to me, and accords with my experience:
https://stackoverflow.com/questions/270 ... ls-in-cuda
Ah, yes, usec - microseconds, not milliseconds.

I know this is not the way to utilize GPUs, but I measured sequential kernel calls on my outdated system

http://www.talkchess.com/forum3/viewtop ... 47#p791580

to see if an AlphaBeta search on the CPU with small GPU ANNs is feasible, and it is not, because of the latencies.

--
Srdja
Follow-up: AFAIK Nvidia did not publish official numbers on kernel launch overhead,
and it also depends on the system (OS, CPU/GPU, PCIe), but here is the 'official'
statement from the ROCm documentation for AMD devices:
...
For CPU devices, the kernel launch time is fast (tens of μs), but for discrete
GPU devices it can be several hundred μs.
...
https://rocm-documentation.readthedocs. ... ation.html

I have no numbers at hand, but IIRC the AMD Fury X had significantly higher host-device latencies
than my Nvidia GPUs.

--
Srdja
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: GPU rumors 2020

Post by dragontamer5788 »

smatovic wrote: Sat Nov 16, 2019 4:12 pm I have no numbers at hand, but IIRC the AMD Fury X had significantly higher host-device latencies
than my Nvidia GPUs.
My AMD Vega64 takes 250052us for the 1st kernel invocation (probably for driver initialization), but only 7.7us for the 2nd kernel invocation over ROCm.

The timing code was:

Code: Select all

    // Timing a single HIP kernel launch with clock_gettime; without a
    // hipDeviceSynchronize() this measures only the host-side launch call,
    // i.e. the launch overhead, not kernel completion.
    struct timespec start, end;
    long long nanoseconds;

    clock_gettime(CLOCK_MONOTONIC, &start);
    hipLaunchKernelGGL(testScan, dim3(1), dim3(256), 0, 0, toScan, NUM_SCAN);
    clock_gettime(CLOCK_MONOTONIC, &end);

    nanoseconds = (end.tv_sec - start.tv_sec) * 1000000000ll + (end.tv_nsec - start.tv_nsec);
    using namespace std;
    cout << "2nd Kernel took: " << nanoseconds << "ns\n";
--------------

Intel has formally announced Ponte Vecchio (Xe GPU) today. https://www.anandtech.com/show/15119/in ... oming-2021
As part of the keynote speech today, Koduri has explained that Intel will have a single ‘Xe’ architecture but multiple sub-architectures / microarchitectures (or however you want to characterize it in a GPU) in order to address different parts of the market. The ultra-mobile parts of the product stack might focus on small die size and high efficiency libraries, whereas a compute product might have high double-precision performance and run high-performance libraries. Some variants might have the equivalent of tensor accelerators for AI, or some variants might have bigger cache variants to manage specific customer needs. Rather than a ‘one size fits all’ approach, it appears that Intel are going to stretch Xe as wide as they need to in order to accommodate as many customers as possible.
Ponte Vecchio will be the showcase accelerator in the Aurora supercomputer, to be installed at Argonne National Laboratory in 2021. Intel has announced that this will feature two Sapphire Rapids CPUs (the one after Ice Lake), along with six Ponte Vecchio GPUs, in a single node. More details on this in our Aurora coverage.
Sounds like there are at least three different Intel Xe designs being made under the Ponte Vecchio architecture: a high double-precision FP64 part, a deep-learning tensor part, and a high-bandwidth / high-capacity memory-oriented version.
dragontamer5788
Posts: 201
Joined: Thu Jun 06, 2019 8:05 pm
Full name: Percival Tiglao

Re: GPU rumors 2020

Post by dragontamer5788 »

[Image: GPU package with 16 chiplets and HBM2]

I don't know how many $10,000+ this will cost, but that's 16 chiplets loaded with HBM2. :shock: :shock: