Wouldn't it be nice if C++ GPU

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

Rémi Coulom
Posts: 426
Joined: Mon Apr 24, 2006 6:06 pm
Contact:

Re: Wouldn't it be nice if C++ GPU

Post by Rémi Coulom » Thu Apr 25, 2019 6:51 pm

smatovic wrote:
Thu Apr 25, 2019 5:38 pm
https://github.com/ankan-ban/ConvTest
Very interesting, thanks. I will try to make a tensor-core version.

Rein Halbersma
Posts: 685
Joined: Tue May 22, 2007 9:13 am

Re: Wouldn't it be nice if C++ GPU

Post by Rein Halbersma » Thu Apr 25, 2019 7:18 pm

Rémi Coulom wrote:
Thu Apr 25, 2019 6:48 pm
If you require GPU support on Ubuntu, please also install Bazel
(from https://github.com/FloopCZ/tensorflow_cc)
Thanks for correcting me! But at least it's a one-time build, and you don't need to integrate Bazel into your own project's build.

I've also just found another package that is installable on Debian without having to build with Bazel: https://github.com/kecsap/tensorflow_cpp_packaging Not sure how mature it is though.

chrisw
Posts: 1548
Joined: Tue Apr 03, 2012 2:28 pm

Re: Wouldn't it be nice if C++ GPU

Post by chrisw » Thu Apr 25, 2019 7:40 pm

Rémi Coulom wrote:
Thu Apr 25, 2019 6:51 pm
smatovic wrote:
Thu Apr 25, 2019 5:38 pm
https://github.com/ankan-ban/ConvTest
Very interesting, thanks. I will try to make a tensor-core version.
Some integrable C++ source plus the necessary headers, working on both Windows and Linux/Ubuntu, would be a great resource. It should also increase the variety of engines, since developers wouldn't be locked into particular inputs and could use preprocessed input. For the moment, CUDA/Nvidia support would do.

Daniel Shawul
Posts: 3645
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Wouldn't it be nice if C++ GPU

Post by Daniel Shawul » Fri Apr 26, 2019 12:50 am

Rein Halbersma wrote:
Thu Apr 25, 2019 7:18 pm
Rémi Coulom wrote:
Thu Apr 25, 2019 6:48 pm
If you require GPU support on Ubuntu, please also install Bazel
(from https://github.com/FloopCZ/tensorflow_cc)
Thanks for correcting me! But at least it's a one-time build, and you don't need to integrate Bazel into your own project's build.

I've also just found another package that is installable on Debian without having to build with Bazel: https://github.com/kecsap/tensorflow_cpp_packaging Not sure how mature it is though.
I have the option of building with either bazel or tensorflow_cc, but there are some serious issues with the latter:

a) tensorflow_cc is available only on Linux.
b) There are multi-GPU problems with libtensorflow_cc. I reported the issue here https://github.com/FloopCZ/tensorflow_cc/issues/136
but there is still no solution for it. Building with bazel does not have this problem. This was the deal breaker for me.
c) It adds one more dependency, libtensorflow_cc.so, and maybe more.

Bazel cons:

The Windows bazel build is kind of broken when compiling with GPU support. There is a known issue which I can't find at the moment.
The CPU build is OK though, so I use that and get one binary, egbbdll.so, without dependencies.
For GPU builds I compile directly against TensorRT. Note that you can configure TensorFlow with TensorRT, MKL, experimental OpenCL etc., so theoretically you don't have to use anything other than TensorFlow. By the way, TensorFlow does have TPU support, which I have never explored. But building directly against TensorRT is so much easier (it is just another library) and avoids the pain of compiling TensorFlow (either via bazel or tensorflow_cc -- both equally painful). TensorRT gives at least a 2x speedup compared to TensorFlow compiled without TensorRT support. I am curious to know whether TensorFlow with TensorRT support can perform equally well.

@Remi Why do you want to write even a single CUDA kernel? cuDNN has lots of convolution kernels to choose from anyway, and the TensorRT performance is as good as Ankan's hand-written CUDA kernels, as I have detailed here:
viewtopic.php?f=2&t=69885&hilit=Scorpio+Leela&start=10
And when you factor in supporting tensor cores, fp16, and maybe int8/int4 on Turing, the old fp16 units in the 1070, etc., one is inclined to conclude that this is better left to a library. I don't even build the graph explicitly like lc0 does, because I don't intend to do any manual optimization of the graph by writing convolution kernels etc.

Rémi Coulom
Posts: 426
Joined: Mon Apr 24, 2006 6:06 pm
Contact:

Re: Wouldn't it be nice if C++ GPU

Post by Rémi Coulom » Fri Apr 26, 2019 8:26 am

Daniel Shawul wrote:
Fri Apr 26, 2019 12:50 am
@Remi Why do you want to write even a single CUDA kernel? cuDNN has lots of convolution kernels to choose from anyway, and the TensorRT performance is as good as Ankan's hand-written CUDA kernels, as I have detailed here:
viewtopic.php?f=2&t=69885&hilit=Scorpio+Leela&start=10
And when you factor in supporting tensor cores, fp16, and maybe int8/int4 on Turing, the old fp16 units in the 1070, etc., one is inclined to conclude that this is better left to a library. I don't even build the graph explicitly like lc0 does, because I don't intend to do any manual optimization of the graph by writing convolution kernels etc.
Writing my own CUDA kernels is certainly not a reasonable choice, but it is fun to try, and I still believe it is possible to outperform cuDNN for small batches.

Gian-Carlo Pascutto
Posts: 1135
Joined: Sat Dec 13, 2008 6:00 pm
Contact:

Re: Wouldn't it be nice if C++ GPU

Post by Gian-Carlo Pascutto » Fri Apr 26, 2019 4:04 pm

Rémi Coulom wrote:
Fri Apr 26, 2019 8:26 am
Writing my own CUDA kernels is certainly not a reasonable choice, but it is fun to try, and I still believe it is possible to outperform cuDNN for small batches.
You don't need to guess. lc0 got a commit recently where the contributor who also works on NVIDIA's drivers added a hand-written CUDA kernel for small batches, to be used instead of the cuDNN one.

I believe Henrik Forsten observed before that, for Leela Zero (the Go engine), the OpenCL kernels were faster than cuDNN for very small batches.

Daniel Shawul
Posts: 3645
Joined: Tue Mar 14, 2006 10:34 am
Location: Ethiopia
Contact:

Re: Wouldn't it be nice if C++ GPU

Post by Daniel Shawul » Fri Apr 26, 2019 6:16 pm

Which begs the question: why use small batch sizes at all? I don't use a batch size of less than 128.
Even launching 128 to 256 threads for multi-threaded batching on a 4-core CPU, I see no problems...
Lc0 uses single-threaded batching and defaults to a batch size of 256 -- though a smaller batch size of 32 is used for training...

chrisw
Posts: 1548
Joined: Tue Apr 03, 2012 2:28 pm

Re: Wouldn't it be nice if C++ GPU

Post by chrisw » Fri Apr 26, 2019 8:03 pm

Daniel Shawul wrote:
Fri Apr 26, 2019 6:16 pm
Which begs the question: why use small batch sizes at all? I don't use a batch size of less than 128.
Even launching 128 to 256 threads for multi-threaded batching on a 4-core CPU, I see no problems...
Lc0 uses single-threaded batching and defaults to a batch size of 256 -- though a smaller batch size of 32 is used for training...
Well, if batch=1, you get full NN guidance whenever you want it. If batch=infinity, you get no NN guidance at all. As batch size increases from 1, you get increasingly less NN guidance at and near leaf nodes. On the other side of the balance, you get faster NN lookups.

The best N for batch size is easily established by results. So far, so obvious. I guess the interesting bit is how, or whether, one provides useful search guidance in the absence of the NN. The original paper, I think, just did a random selection. Does it make sense to use some kind of handcrafted selector? I've been trying selecting obvious captures over the last few days, but have only succeeded in breaking everything, so nothing useful to report right now. Which is basically why I want C++ GPU: fiddling around at the chess level with Python is not good. Need C++.

Rémi Coulom
Posts: 426
Joined: Mon Apr 24, 2006 6:06 pm
Contact:

Re: Wouldn't it be nice if C++ GPU

Post by Rémi Coulom » Sat Apr 27, 2019 12:13 am

Daniel Shawul wrote:
Fri Apr 26, 2019 6:16 pm
Which begs the question: why use small batch sizes at all? I don't use a batch size of less than 128.
Even launching 128 to 256 threads for multi-threaded batching on a 4-core CPU, I see no problems...
Lc0 uses single-threaded batching and defaults to a batch size of 256 -- though a smaller batch size of 32 is used for training...
Did you measure the effect of batch size on playing strength? Nodes per second is not a good measurement of performance. I am using very large batches for self-play game generation, because I can play N self-play games in parallel. But when playing a single game in a tournament, I feel such huge batches may hurt performance, especially when the number of nodes is small. I have not measured this very seriously, though. I will run some tests during the weekend.

Rémi Coulom
Posts: 426
Joined: Mon Apr 24, 2006 6:06 pm
Contact:

Re: Wouldn't it be nice if C++ GPU

Post by Rémi Coulom » Sat Apr 27, 2019 12:20 am

chrisw wrote:
Fri Apr 26, 2019 8:03 pm
Daniel Shawul wrote:
Fri Apr 26, 2019 6:16 pm
Which begs the question: why use small batch sizes at all? I don't use a batch size of less than 128.
Even launching 128 to 256 threads for multi-threaded batching on a 4-core CPU, I see no problems...
Lc0 uses single-threaded batching and defaults to a batch size of 256 -- though a smaller batch size of 32 is used for training...
Well, if batch=1, you get full NN guidance whenever you want it. If batch=infinity, you get no NN guidance at all. As batch size increases from 1, you get increasingly less NN guidance at and near leaf nodes. On the other side of the balance, you get faster NN lookups.

The best N for batch size is easily established by results. So far, so obvious. I guess the interesting bit is how, or whether, one provides useful search guidance in the absence of the NN. The original paper, I think, just did a random selection. Does it make sense to use some kind of handcrafted selector? I've been trying selecting obvious captures over the last few days, but have only succeeded in breaking everything, so nothing useful to report right now. Which is basically why I want C++ GPU: fiddling around at the chess level with Python is not good. Need C++.
In my program, when I reach a leaf that is already queued for evaluation, I simply add a virtual loss and search again. If necessary, a leaf may accumulate more than one virtual loss. I believe DeepMind used virtual loss, too.
