GPGPU host-device latencies are, afaik, tens of microseconds,
so you can lose >100K CPU clock cycles for each GPU function call.
How to implement GPU-based ANN evaluation for CPU-based tree search with such
limitations?
--
Srdja
GPU ANN, how to deal with host-device latencies?
-
- Posts: 2658
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
-
- Posts: 438
- Joined: Mon Apr 24, 2006 8:06 pm
Re: GPU ANN, how to deal with host-device latencies?
Double buffering: transfer the next batch to the device while the previous batch is being computed.
The biggest problem is not host-device latencies, but the inefficiency of small batches. With cuDNN, it is impossible to get good performance with batch_size=1, especially on big GPUs with tensor cores. For my network, I measured that a batch of 8 is faster than a batch of 1 (because the batch of 1 cannot use the tensor cores), and a batch of 16 is almost as fast as a batch of 8 on the Titan V.
So the search has to be very parallel.
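Rémi's double-buffering idea can be sketched host-side. This is a minimal illustration, not a real GPU pipeline: `evaluate_batch` is a hypothetical stand-in for the device call, and a bounded queue lets a producer thread stage the next batch while the consumer "computes" the current one.

```python
import queue
import threading

def evaluate_batch(batch):
    # Hypothetical stand-in for the GPU call; in a real engine this would
    # be a cuDNN/OpenCL inference over the whole batch at once.
    return [hash(p) % 2 for p in batch]

def pipeline(positions, batch_size=8):
    """Double buffering: prepare the next batch on the host while the
    'device' works on the previous one."""
    results = []
    pending = queue.Queue(maxsize=1)  # at most one batch staged ahead

    def producer():
        for i in range(0, len(positions), batch_size):
            pending.put(positions[i:i + batch_size])
        pending.put(None)  # sentinel: no more batches

    t = threading.Thread(target=producer)
    t.start()
    while True:
        batch = pending.get()
        if batch is None:
            break
        results.extend(evaluate_batch(batch))
    t.join()
    return results
```

With real device calls the consumer loop would issue an asynchronous transfer for batch N+1 before waiting on the compute of batch N; the queue here only models that overlap.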
-
- Posts: 4190
- Joined: Wed Nov 25, 2009 1:47 am
Re: GPU ANN, how to deal with host-device latencies?
smatovic wrote: GPGPU host-device latencies are, afaik, tens of microseconds, so you can lose >100K CPU clock cycles for each GPU function call. How to implement GPU-based ANN evaluation for CPU-based tree search with such limitations?
As Remi mentioned, go increase the batch size: with cuDNN (as can be seen from the LC0 project), the real speed gain starts only when the batch size reaches 128.
Regarding buffering and efficient sending of data, have a look at the paper by my colleagues: https://arxiv.org/abs/1803.06333
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: GPU ANN, how to deal with host-device latencies?
smatovic wrote: GPGPU host-device latencies are, afaik, tens of microseconds, so you can lose >100K CPU clock cycles for each GPU function call. How to implement GPU-based ANN evaluation for CPU-based tree search with such limitations?
You need to use a highly parallel asynchronous search like MCTS. I expect alpha-beta rollouts to work equally well. Since I am not using a policy network, I am going to have to evaluate each child with a value network during expansion. This gives me on average 40 positions to evaluate simultaneously, and maybe I could batch those together with requests from other threads so that latency won't be a problem. I am still running a convnet on the CPU, so I haven't actually dealt with the problem yet.
Daniel
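Daniel's idea of batching the ~40 child evaluations together with requests from other search threads can be sketched as a shared collector. This is only an illustration: `net_eval` is a hypothetical batched value-network call, and the batch-size threshold of 40 is taken from his post.

```python
import threading
from concurrent.futures import Future

class BatchEvaluator:
    """Collects value-net requests from many search threads and flushes
    them to the device as one batch (sketch; net_eval is a stand-in)."""
    def __init__(self, net_eval, max_batch=40):
        self.net_eval = net_eval
        self.max_batch = max_batch
        self.lock = threading.Lock()
        self.pending = []  # list of (position, Future) pairs

    def submit(self, position):
        """Queue one position; returns a Future the search thread waits on."""
        fut = Future()
        with self.lock:
            self.pending.append((position, fut))
            if len(self.pending) >= self.max_batch:
                self._flush_locked()
        return fut

    def flush(self):
        """Force evaluation of a partial batch (e.g. on a timeout)."""
        with self.lock:
            self._flush_locked()

    def _flush_locked(self):
        if not self.pending:
            return
        batch, self.pending = self.pending, []
        values = self.net_eval([p for p, _ in batch])
        for (_, fut), v in zip(batch, values):
            fut.set_result(v)
```

A real engine would add a timeout-driven flush so a thread with few children is not stalled waiting for the batch to fill.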
Re: GPU ANN, how to deal with host-device latencies?
Rémi Coulom wrote: ↑Sun May 06, 2018 12:13 pm Double buffering: transfer the next batch to the device while the previous batch is being computed. [...] So the search has to be very parallel.
Got it, buffering and batch size, thx.
--
Srdja
Re: GPU ANN, how to deal with host-device latencies?
Milos wrote: ↑Sun May 06, 2018 12:24 pm As Remi mentioned, go increase the batch size: with cuDNN (as can be seen from the LC0 project), the real speed gain starts only when the batch size reaches 128. [...] Regarding buffering and efficient sending of data, have a look at the paper by my colleagues: https://arxiv.org/abs/1803.06333
Thx, I will take a look at the paper.
--
Srdja
Re: GPU ANN, how to deal with host-device latencies?
Daniel Shawul wrote: ↑Sun May 06, 2018 1:24 pm You need to use a highly parallel asynchronous search like MCTS.
Yes, that was the point, I was thinking in serial alpha-beta... thx.
--
Srdja
Re: GPU ANN, how to deal with host-device latencies?
I am back at the drawing board, and still struggle with OpenCL latencies;
maybe someone can comment on whether my numbers look correct?
OS: Ubuntu 18.04 x86-64
Device: Nvidia GTX 750, 1 GHz, 512 cores, ~1 TFLOPS
OpenCL GPU kernel calls (terminated with clFinish), 1 million threads, empty kernel, no memory buffer transfer:
~35K calls per second
OpenCL GPU kernel calls (terminated with clFinish), 1 million threads, empty kernel, with an 8 KB memory write and a 4 KB memory read transfer:
~10K calls per second
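For reference, those measured rates translate directly into an average host-visible latency per kernel call (simple arithmetic, nothing device-specific):

```python
# Convert a measured call rate into the average host-visible latency
# per kernel call (in microseconds).
def per_call_us(calls_per_second):
    return 1e6 / calls_per_second

empty_kernel = per_call_us(35_000)   # ~28.6 us per empty-kernel launch
with_transfer = per_call_us(10_000)  # 100 us with the 8 KB write + 4 KB read
```

So each call with transfers costs roughly 100 µs on this machine, consistent with launch overhead plus transfer latency dominating an empty kernel.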
Note that my machine is a bit outdated:
- PCIe via Northbridge
- PCIe 2.0
- only 8 lanes per slot
Maybe on newer systems the latencies do not hurt at all?
--
Srdja
Re: GPU ANN, how to deal with host-device latencies?
Got this answer on the Nvidia developer forum, maybe it is of interest for others:
https://devtalk.nvidia.com/default/topi ... atencies-/
"I have no idea what you are measuring, and I have had zero exposure to OpenCL. Under CUDA, the minimal observed kernel launch time is 5 microseconds for null kernels, meaning that there can be at most 200,000 kernel invocations per second. That minimal launch overhead has basically not changed much in about a decade, and the limiter appears to be the basic latency of the PCIe link. It is generally a good idea to design for a minimal kernel execution time > 1 millisecond.
PCIe version and width impact primarily PCIe throughput, with little impact on PCIe latency. For minimum software overhead in the host-side driver stack, a CPU with high single-thread performance is recommended. At this time I would recommend a CPU with > 3.5 GHz base frequency as optimal."
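The "> 1 millisecond" advice can be sanity-checked with a quick calculation, assuming the ~5 µs launch overhead quoted in that answer:

```python
# Fraction of total time lost to kernel-launch overhead, assuming a
# fixed ~5 us launch cost (the figure quoted in the forum answer).
def overhead_fraction(kernel_ms, launch_us=5.0):
    total_us = kernel_ms * 1000.0 + launch_us
    return launch_us / total_us
```

A 1 ms kernel loses under 0.5% of its time to the launch, while a 10 µs kernel loses about a third, which is why batching work into fewer, larger kernel calls pays off.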
--
Srdja