Pigeon is now running on the GPU

Discussion of chess software programming and technical issues.

StuartRiffle
Posts: 25
Joined: Tue Apr 05, 2016 9:34 pm
Location: Canada

Pigeon is now running on the GPU

Post by StuartRiffle »

Parallel search is working under CUDA in the dev branch! I am still fixing bugs on the CPU side. I just wanted to share because it's a big milestone. :)

Benchmarks later this week!

Code:

     /O_    Pigeon 1.6.0 (UCI)
     ||     SSE4/POPCNT/CUDA
    / \\
  =/__//    pigeonengine.com
     ^^

uci
id name Pigeon 1.6.0
id author Stuart Riffle
option name Clear Hash type button
option name Hash type spin min 4 max 8192 default 512
option name OwnBook type check default true
option name Threads type spin default 1 min 1 max 24
option name Early Move type check default true
option name SIMD type check default true
option name POPCNT type check default true
option name CUDA type check default true
option name GPU Hash type spin min 4 max 8192 default 512
option name GPU Batch Size type spin min 32 max 8192 default 1024
option name GPU Batch Count type spin min 4 max 1024 default 32
option name GPU Plies type spin min 0 max 8 default 2
uciok
isready
info string CUDA 0: GeForce GTX 660 (CC 3.0, 960 cores, 1084 mHz, 2048 MB)
readyok
-Stuart
(Pigeon)
smatovic
Posts: 2658
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: Pigeon is now running on the GPU

Post by smatovic »

Kudos.

May I ask why you chose CUDA over OpenCL?

...if you are in need of a sparring partner:

http://zeta-chess.app26.de/page-1.html#Zeta-098e

--
Srdja
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Pigeon is now running on the GPU

Post by cdani »

Nice!!! We look forward to hearing the details of this achievement :-)
StuartRiffle
Posts: 25
Joined: Tue Apr 05, 2016 9:34 pm
Location: Canada

Re: Pigeon is now running on the GPU

Post by StuartRiffle »

Thanks Srdja,

I'm using CUDA because Pigeon uses heavily templated C++, and the same code is compiled for the scalar, SIMD, and GPU paths. As far as I can tell, using templates would require vendor-specific extensions until OpenCL 2.1 is ready.
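A minimal sketch of that idea, not Pigeon's actual source: one template body is instantiated for a scalar type and for a vector-like type, and a macro adds the CUDA execution-space qualifiers only when the file is compiled by nvcc. The names SKETCH_SHARED, Lanes2 and stepNorth are made up for illustration.

```cpp
#include <cstdint>

// Under nvcc, let shared functions be called from host and device code;
// under a plain C++ compiler the macro expands to nothing.
#ifdef __CUDACC__
#define SKETCH_SHARED __host__ __device__
#else
#define SKETCH_SHARED
#endif

// Hypothetical two-lane type standing in for a real SIMD wrapper:
// each lane holds one 64-bit bitboard.
struct Lanes2 {
    std::uint64_t v[2];
};

SKETCH_SHARED inline Lanes2 operator&(Lanes2 a, Lanes2 b) {
    Lanes2 r;
    r.v[0] = a.v[0] & b.v[0];
    r.v[1] = a.v[1] & b.v[1];
    return r;
}

SKETCH_SHARED inline Lanes2 operator<<(Lanes2 a, int s) {
    Lanes2 r;
    r.v[0] = a.v[0] << s;
    r.v[1] = a.v[1] << s;
    return r;
}

// One template body for every path: T can be a plain 64-bit integer
// (scalar or GPU thread) or a multi-lane type (SIMD).
// Shifts a pawn bitboard one rank up and keeps only the empty targets.
template <typename T>
SKETCH_SHARED inline T stepNorth(T pawns, T empty) {
    return (pawns << 8) & empty;
}
```

Calling stepNorth<std::uint64_t>(...) gives the scalar version, and stepNorth(Lanes2, Lanes2) runs the same expression on two boards at once. Plain OpenCL C kernels have no templates, which is the limitation mentioned above.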

I expect to support OpenCL eventually, but for now CUDA just works. :)

I will definitely use Zeta for testing!
-Stuart
(Pigeon)
matthewlai
Posts: 793
Joined: Sun Aug 03, 2014 4:48 am
Location: London, UK

Re: Pigeon is now running on the GPU

Post by matthewlai »

StuartRiffle wrote: Parallel search is working under CUDA in the dev branch! I am still fixing bugs on the CPU side. I just wanted to share because it's a big milestone. :)

Benchmarks later this week!

Do you actually get a speedup from CUDA?

I would imagine all the branching in minimax will make a naive implementation very slow on CUDA.
Disclosure: I work for DeepMind on the AlphaZero project, but everything I say here is personal opinion and does not reflect the views of DeepMind / Alphabet.
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Pigeon is now running on the GPU

Post by brianr »

Thank you for sharing.
Got it to compile on my development system and was just wondering what GPU xxx options you might suggest for the following graphics card:

Code:

info string CUDA 0: GeForce GTX 770 (CC 3.0, 1536 cores, 1163 mHz, 2048 MB)
StuartRiffle
Posts: 25
Joined: Tue Apr 05, 2016 9:34 pm
Location: Canada

Re: Pigeon is now running on the GPU

Post by StuartRiffle »

Egads!

It's cool that you got a build, but the dev branch will probably give you crazy output because I'm still working on the CPU-side code to gather the CUDA results as they return and fold them into the search.

(But if you do a pull you will get some more bugfixes, FWIW. The CUDA part does appear to be doing batches of searches correctly).

I expect to have the CPU part working this afternoon. I'll let you know when those changes are checked in.
-Stuart
(Pigeon)
StuartRiffle
Posts: 25
Joined: Tue Apr 05, 2016 9:34 pm
Location: Canada

Re: Pigeon is now running on the GPU

Post by StuartRiffle »

matthewlai wrote: Do you actually get a speedup from CUDA?

I would imagine all the branching in minimax will make a naive implementation very slow on CUDA.
Oh, I'm nowhere near getting a speedup yet. I am doing horrible things to this poor GPU. At this stage I'm still pleasantly surprised it ticks over at all.

Current problems include:
- Warp occupancy under 5% (!!)
- Spilling registers left and right
- Heavy memory traffic in general (though L1 hit rate ~70%)
- 64-bit integer operations on current hardware are emulated with 32-bit registers (!!). This also blows up the code size, which is unfortunate, because the code is pretty big to start with.
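To illustrate the last point, here is a rough sketch of what a 64-bit add and shift cost when only 32-bit registers are available. It is written as ordinary C++ for readability and is not the code the CUDA compiler actually emits; U64Pair, add64 and shl64 are illustrative names.

```cpp
#include <cstdint>

// A 64-bit value held in two 32-bit halves, as on hardware with
// 32-bit integer registers.
struct U64Pair {
    std::uint32_t lo;
    std::uint32_t hi;
};

inline U64Pair split(std::uint64_t x) {
    return { static_cast<std::uint32_t>(x),
             static_cast<std::uint32_t>(x >> 32) };
}

inline std::uint64_t join(U64Pair p) {
    return (static_cast<std::uint64_t>(p.hi) << 32) | p.lo;
}

// One 64-bit add becomes two 32-bit adds plus carry handling.
inline U64Pair add64(U64Pair a, U64Pair b) {
    U64Pair r;
    r.lo = a.lo + b.lo;                          // wraps modulo 2^32
    const std::uint32_t carry = (r.lo < a.lo) ? 1u : 0u;
    r.hi = a.hi + b.hi + carry;
    return r;
}

// One 64-bit left shift becomes several shifts, an OR, and a case split.
inline U64Pair shl64(U64Pair a, unsigned s) {
    s &= 63u;
    if (s == 0) return a;
    if (s >= 32) return { 0u, a.lo << (s - 32) };  // low half moves up
    return { a.lo << s, (a.hi << s) | (a.lo >> (32 - s)) };
}
```

Bitboard code is almost entirely 64-bit shifts and masks, so each of those operations expanding into several instructions adds up in both register use and code size.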

The branch divergence could honestly be worse. I converted the negamax to an iterative implementation, and set things up so that after a thread finishes a search, it can start up another one and fall back into line with the rest of the warp.
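A minimal CPU-side sketch of that shape, under simplifying assumptions that are mine rather than Pigeon's: a toy game with a fixed branching factor, leaf scores supplied in an array, plain negamax with no pruning, and jobs handed out through an atomic counter. Job, Frame, worker and searchBatch are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <vector>

constexpr int kBranching = 2;        // toy game: every node has 2 moves
constexpr int kMaxDepth  = 8;
constexpr int kInf       = 1 << 20;

// One search: a complete tree of the given depth (at most kMaxDepth)
// whose leaf scores are from the point of view of the side to move
// at the leaf.
struct Job {
    std::vector<int> leaves;         // size == kBranching^depth
    int depth;
};

// Explicit stack frame replacing one level of recursion.
struct Frame {
    int node;                        // index of this node within its level
    int nextChild;                   // next move to try
    int best;                        // best score found so far
};

// A single loop does all the work. When a search finishes, the worker
// grabs the next job and keeps executing the same loop body.
void worker(const std::vector<Job>& jobs, std::atomic<int>& nextJob,
            std::vector<int>& results) {
    Frame stack[kMaxDepth + 1];
    int sp  = -1;                    // -1 means no search in flight
    int job = -1;
    for (;;) {
        if (sp < 0) {                // pick up a new search
            job = nextJob.fetch_add(1);
            if (job >= static_cast<int>(jobs.size())) return;
            sp = 0;
            stack[0] = { 0, 0, -kInf };
        }
        const Job& j = jobs[job];
        Frame& f = stack[sp];
        const bool leaf = (sp == j.depth);
        if (leaf || f.nextChild == kBranching) {
            // Node finished: hand its score to the parent, negated.
            const int score = leaf ? j.leaves[f.node] : f.best;
            --sp;
            if (sp < 0) results[job] = score;
            else stack[sp].best = std::max(stack[sp].best, -score);
        } else {
            // Descend into the next child.
            const int child = f.node * kBranching + f.nextChild++;
            stack[++sp] = { child, 0, -kInf };
        }
    }
}

std::vector<int> searchBatch(const std::vector<Job>& jobs) {
    std::vector<int> results(jobs.size(), 0);
    std::atomic<int> nextJob(0);
    worker(jobs, nextJob, results);  // one worker here; a GPU would run many
    return results;
}
```

Because every iteration goes through the same loop body, a thread that starts a fresh search executes the same instructions as its neighbours that are still mid-search, instead of sitting idle until the slowest one finishes.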

There is still a lot of room for improvement though. :/
-Stuart
(Pigeon)
StuartRiffle
Posts: 25
Joined: Tue Apr 05, 2016 9:34 pm
Location: Canada

Re: Pigeon is now running on the GPU

Post by StuartRiffle »

Just a quick update.

The highest throughput I've been able to achieve so far with the current code on a GTX 660 is about 1 million nodes per second, which is slower than a CPU, so... not compelling yet.

On the bright side, the code is utilizing the GPU very poorly, so there's a lot of room for improvement. :) I'm working on it.
-Stuart
(Pigeon)
ankan
Posts: 77
Joined: Sun Apr 21, 2013 3:29 pm
Full name: Ankan Banerjee

Re: Pigeon is now running on the GPU

Post by ankan »

What search algorithm are you using?
You mentioned negamax - is that with or without alpha-beta pruning?