GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.
How to implement GPU-based ANN evaluation for CPU-based tree search with such
limitations?
--
Srdja
GPU ANN, how to deal with host-device latencies?
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
-
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: GPU ANN, how to deal with host-device latencies?
Double buffering: transfer the next batch to the device while the previous batch is being computed.
The biggest problem is not host-device latencies, but the inefficiency of small batches. With cuDNN, it is impossible to get good performance with batch_size=1, especially on big GPUs with tensor cores. For my network, I measured that a batch of 8 is faster than a batch of 1 (because the batch of 1 cannot use the tensor cores), and a batch of 16 is almost as fast as a batch of 8 on the Titan V.
So the search has to be very parallel.
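Rémi's double-buffering idea can be sketched host-side. This is a minimal illustration, not a real GPU pipeline: `evaluate_batch` is a hypothetical stand-in for the device call, and a bounded queue lets a producer thread stage the next batch while the consumer "computes" the current one.

```python
import queue
import threading

def evaluate_batch(batch):
    # Hypothetical stand-in for the GPU call; in a real engine this would
    # be a cuDNN/OpenCL inference over the whole batch at once.
    return [hash(p) % 2 for p in batch]

def pipeline(positions, batch_size=8):
    """Double buffering: prepare the next batch on the host while the
    'device' works on the previous one."""
    results = []
    pending = queue.Queue(maxsize=1)  # at most one batch staged ahead

    def producer():
        for i in range(0, len(positions), batch_size):
            pending.put(positions[i:i + batch_size])
        pending.put(None)  # sentinel: no more batches

    t = threading.Thread(target=producer)
    t.start()
    while True:
        batch = pending.get()
        if batch is None:
            break
        results.extend(evaluate_batch(batch))
    t.join()
    return results
```

With real device calls the consumer loop would issue an asynchronous transfer for batch N+1 before waiting on the compute of batch N; the queue here only models that overlap.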
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: GPU ANN, how to deal with host-device latencies?
smatovic wrote: GPGPU host-device latencies are, afaik, tens of microseconds, so you can lose >100K CPU clock cycles for each GPU function call. How to implement GPU-based ANN evaluation for CPU-based tree search with such limitations?
As Remi mentioned, go increase the batch size: with cuDNN (as can be seen from the LC0 project), the real speed gain starts only when the batch size reaches 128.
Regarding buffering and efficient sending of data, have a look at the paper by my colleagues: https://arxiv.org/abs/1803.06333
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: GPU ANN, how to deal with host-device latencies?
smatovic wrote: GPGPU host-device latencies are, afaik, tens of microseconds, so you can lose >100K CPU clock cycles for each GPU function call. How to implement GPU-based ANN evaluation for CPU-based tree search with such limitations?
You need to use a highly parallel asynchronous search like MCTS. I expect alpha-beta rollouts to work equally well. Since I am not using a policy network, I am going to have to evaluate each child with a value network during expansion. This gives me on average 40 positions to evaluate simultaneously, and maybe I could batch those together with requests from other threads so that latency won't be a problem. I am still running a convnet on the CPU, so I haven't actually dealt with the problem yet.
Daniel
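Daniel's idea of batching the ~40 child evaluations together with requests from other search threads can be sketched as a shared collector. This is only an illustration: `net_eval` is a hypothetical batched value-network call, and the batch-size threshold of 40 is taken from his post.

```python
import threading
from concurrent.futures import Future

class BatchEvaluator:
    """Collects value-net requests from many search threads and flushes
    them to the device as one batch (sketch; net_eval is a stand-in)."""
    def __init__(self, net_eval, max_batch=40):
        self.net_eval = net_eval
        self.max_batch = max_batch
        self.lock = threading.Lock()
        self.pending = []  # list of (position, Future) pairs

    def submit(self, position):
        """Queue one position; returns a Future the search thread waits on."""
        fut = Future()
        with self.lock:
            self.pending.append((position, fut))
            if len(self.pending) >= self.max_batch:
                self._flush_locked()
        return fut

    def flush(self):
        """Force evaluation of a partial batch (e.g. on a timeout)."""
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        values = self.net_eval([p for p, _ in batch])
        for (_, fut), v in zip(batch, values):
            fut.set_result(v)
```

A real engine would add a timeout-driven flush so a thread with few children is not stalled waiting for the batch to fill.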
Re: GPU ANN, how to deal with host-device latencies?
Rémi Coulom wrote: ↑Sun May 06, 2018 12:13 pm Double buffering: transfer the next batch to the device while the previous batch is being computed. [...] So the search has to be very parallel.
Got it, buffering and batch size, thx.
--
Srdja
Re: GPU ANN, how to deal with host-device latencies?
Milos wrote: ↑Sun May 06, 2018 12:24 pm As Remi mentioned, go increase the batch size: with cuDNN (as can be seen from the LC0 project), the real speed gain starts only when the batch size reaches 128. [...] Regarding buffering and efficient sending of data, have a look at the paper by my colleagues: https://arxiv.org/abs/1803.06333
Thx, I will take a look at the paper.
--
Srdja
Re: GPU ANN, how to deal with host-device latencies?
Daniel Shawul wrote: ↑Sun May 06, 2018 1:24 pm You need to use a highly parallel asynchronous search like MCTS.
Yes, that was the point, I was thinking in serial alpha-beta... thx.
--
Srdja
Re: GPU ANN, how to deal with host-device latencies?
I am back at the drawing board, and still struggle with OpenCL latencies;
maybe someone can comment on whether my numbers look correct?
OS: Ubuntu 18.04 x86-64
Device: Nvidia GTX 750, 1 GHz, 512 cores, ~1 TFLOPS
OpenCL GPU kernel calls (terminated with clFinish), 1 million threads, empty kernel, no memory buffer transfer:
~35K calls per second
OpenCL GPU kernel calls (terminated with clFinish), 1 million threads, empty kernel, with an 8 KB memory write and a 4 KB memory read transfer:
~10K calls per second
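For reference, those measured rates translate directly into an average host-visible latency per kernel call (simple arithmetic, nothing device-specific):

```python
# Convert a measured call rate into the average host-visible latency
# per kernel call (in microseconds).
def per_call_us(calls_per_second):
    return 1e6 / calls_per_second

empty_kernel = per_call_us(35_000)   # ~28.6 us per empty-kernel launch
with_transfer = per_call_us(10_000)  # 100 us with the 8 KB write + 4 KB read
```

So each call with transfers costs roughly 100 µs on this machine, consistent with launch overhead plus transfer latency dominating an empty kernel.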
Note that my machine is a bit outdated:
- PCIe via Northbridge
- PCIe 2.0
- only 8 lanes per slot
Maybe on newer systems the latencies do not hurt at all?
--
Srdja
Re: GPU ANN, how to deal with host-device latencies?
Got this answer on the Nvidia developer forum, maybe it is of interest for others:
https://devtalk.nvidia.com/default/topi ... atencies-/
"I have no idea what you are measuring, and I have had zero exposure to OpenCL. Under CUDA, the minimal observed kernel launch time is 5 microseconds for null kernels, meaning that there can be at most 200,000 kernel invocations per second. That minimal launch overhead has basically not changed much in about a decade, and the limiter appears to be the basic latency of the PCIe link. It is generally a good idea to design for a minimal kernel execution time > 1 millisecond.
PCIe version and width impact primarily PCIe throughput, with little impact on PCIe latency. For minimum software overhead in the host-side driver stack, a CPU with high single-thread performance is recommended. At this time I would recommend a CPU with > 3.5 GHz base frequency as optimal."
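The "> 1 millisecond" advice can be sanity-checked with a quick calculation, assuming the ~5 µs launch overhead quoted in that answer:

```python
# Fraction of total time lost to kernel-launch overhead, assuming a
# fixed ~5 us launch cost (the figure quoted in the forum answer).
def overhead_fraction(kernel_ms, launch_us=5.0):
    total_us = kernel_ms * 1000.0 + launch_us
    return launch_us / total_us
```

A 1 ms kernel loses under 0.5% of its time to the launch, while a 10 µs kernel loses about a third, which is why batching work into fewer, larger kernel calls pays off.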
--
Srdja