Fast perft on GPU (upto 20 Billion nps w/o hashing)

ankan · Post by **ankan** » Sat Jun 22, 2013 7:35 pm

I am starting to write a new chess engine, and I plan to attempt making a GPU version also.

As a first step, I wrote a move generator using bit-boards (kogge-stone for sliding piece move generation).

I know perft speed is not a measure of chess engine strength and there are many other things that I should be doing before trying to optimize move generation but I got somewhat addicted and decided to try a GPU version.

Here is my move-generator/perft tool for GPUs:
https://github.com/ankan-ban/perft_gpu

The program will work only on Nvidia GK110 or later as it uses CUDA dynamic parallelism (currently GTX 780, GTX Titan, or more expensive server class K20/K20X GPUs).

Note that my code is optimized only for perft (e.g, bulk-counting at leaves, no separate generation of captures, etc) and probably would need complete re-write to be used in a chess engine.

Currently it doesn't make use of hash tables (probably the next thing I am going to try...). It's almost 100X faster than my 32 bit single-threaded CPU version running on a core i7 2600k.

Here are some runs on a GTX 780:

start position:

Code: Select all


GPU Perft 1&#58; 20,   Time taken&#58; 2.16e-005 seconds, nps&#58; 925925

GPU Perft 2&#58; 400,   Time taken&#58; 0.000134368 seconds, nps&#58; 2976899

GPU Perft 3&#58; 8902,   Time taken&#58; 0.000221888 seconds, nps&#58; 40119338

GPU Perft 4&#58; 197281,   Time taken&#58; 0.000602656 seconds, nps&#58; 327352582

GPU Perft 5&#58; 4865609,   Time taken&#58; 0.00123642 seconds, nps&#58; 3935252431

GPU Perft 6&#58; 119060324,   Time taken&#58; 0.0105882 seconds, nps&#58; 11244597713

GPU Perft 7&#58; 3195901860,   Time taken&#58; 0.262085 seconds, nps&#58; 12194124526

GPU Perft 8&#58; 84998978956,   Time taken&#58; 7.13521 seconds, nps&#58; 11912610760

GPU Perft 9&#58; 2439530234167,   Time taken&#58; 199.261 seconds, nps&#58; 12242907844

position 2 (from CPW):

Code: Select all

GPU Perft 1&#58; 48,   Time taken&#58; 2.1248e-005 seconds, nps&#58; 2259036

GPU Perft 2&#58; 2039,   Time taken&#58; 0.000131424 seconds, nps&#58; 15514670

GPU Perft 3&#58; 97862,   Time taken&#58; 0.00034736 seconds, nps&#58; 281730781

GPU Perft 4&#58; 4085603,   Time taken&#58; 0.00159635 seconds, nps&#58; 2559337193

GPU Perft 5&#58; 193690690,   Time taken&#58; 0.0090281 seconds, nps&#58; 21454212020

GPU Perft 6&#58; 8031647685,   Time taken&#58; 0.49178 seconds, nps&#58; 16331775849

GPU Perft 7&#58; 374190009323,   Time taken&#58; 18.8954 seconds, nps&#58; 19803269462

I know the biggest challenge in building a gpu chess engine is parallel search and it might turn out to be very difficult make it work efficiently - as from whatever I have seen at least a few tens of thousand parallel threads are needed to utilize the GPU properly. At-least now I know that GPUs are not that bad at integer performance as I previously feared.

Daniel Shawul · Post by **Daniel Shawul** » Sat Jun 22, 2013 8:22 pm

That looks great! Maybe you can help in computing larger perft counts for chess. So far we are up to 13-14, and a 100x speed up can really help to reduce computation time. I don't think I saw 200 sec for perft(9) before.
Well my knowledge of GPUs is probably too outdated by now but last time I tried recursion was not possible. I guess that dynamic parallelism will simplify coding a lot. Also registers were very few so I could not store the board and moves generated there, that is when using warps as the minimum search unit. I don't know how hard a transition would it be for you to do parallel perfts but for alpha-beta search you would need to do that. But it is not needed for perft since you can simply feed the device with another perft call i.e. if the device allows you to do that. So does perft(10) take 20x200 sec = 40000sec or just a fraction of those due to overlapped computation?
Daniel

Ajedrecista · Post by **Ajedrecista** » Sat Jun 22, 2013 8:47 pm

Hello Ankan:

Truly outstanding! Congratulations.

Please make your tool hash compatible. Daniel's thought of computing large perft results like the 20 draft 13 results of Perft(14) or the 400 draft 12 results of Perft(14) makes sense.

Regarding the time of Perft(9) around 200 seconds without hashing... it is fantastic. As Daniel pointed out, few perft counters can do something similar... surely Paul Byrne, Steven Edwards and Peter Österlund only. Here is a time-to-ply perft table I made a time ago using JetChess (it has hash tables up to 1 GB for perft).

You said that your GPU perft counter is around 100 times faster than the CPU, single threaded, 32-bit version. Could you give an estimate of benefit with reasonable implementation of hash tables to both CPU and GPU versions, please? Thanks in advance and good luck for the task.

Regards from Spain.

Ajedrecista.

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Jun 22, 2013 9:10 pm

Wow, impressive!

squaresInBetween is in the meantime corrected:
https://chessprogramming.wikispaces.com ... alculation

Good luck and fun!

Gerd

ibid · Post by **ibid** » Sat Jun 22, 2013 10:45 pm

Very impressive indeed!

I am much slower (on my phenom 1090T):
- 6797 seconds for perft 9 of the initial position for a single-thread no-hash version. (It would probably be ~1200 seconds with multiple threads.)
- 47 seconds for the multi-threaded version with 6 GB of hash table. (Hash tables are a major win...)
- 51 seconds for the program that uses unique positions (this approach is not well suited to such quick searches -- a conventional program can detect the same transpositions via the hash table).

(The first 2 methods are in a program called gperft that is almost ready for release -- it is basically a more general version of the gperftd binary written for the distributed setup.)

I have to imagine a hybrid setup that computes unique positions to a decent depth and feeds the resulting positions to a GPU program with hash tables for shorter perfts would be capable of wonders. :)

-paul

sje · Post by **sje** » Sat Jun 22, 2013 10:58 pm

Can you try running perft(14)? It needs independent verification and it would be helpful to have the 20 perft(13) subtotals available.

ankan · Post by **ankan** » Sun Jun 23, 2013 6:27 am

Daniel Shawul wrote:Well my knowledge of GPUs is probably too outdated by now but last time I tried recursion was not possible. I guess that dynamic parallelism will simplify coding a lot. Also registers were very few so I could not store the board and moves generated there, that is when using warps as the minimum search unit.

Recursion is available on nvidia compute capability 2.0 and later devices (GTX 4xx and later), but apart from making single threaded code easier to write it didn't help much. Dynamic parallelism (i.e, ability to launch new threads from GPU code) is much more useful.
For perft I let each thread in a warp count/generate moves for a different position. As normally the positions at the same ply/level are generally similar, there is thread divergence in warp is not a big issue. Trying to process multiple levels in every thread (e.g, using recursion) causes too much divergence and performance falls drastically.

The first gpu kernel I tried to compute perft was something like this:
1. if depth == 1 count moves at the given position and use atomicAdd/parallel reduction to update global sum. Exit.
2. generate all moves at this position (let's say nMoves moves). Copy-make these moves to get next level boards.
3. launch a new kernel (same function at depth-1) with nMoves threads - each for processing a newly generated position.

This was only slightly faster than my CPU version (less than 2X faster). The gpu tries running multiple kernels in parallel (if they are launched on different streams) but I soon realized that launching kernels with such small no. of threads (20-50) is not efficient.

The next thing I tried was instead of launching a new kernel for every thread, make only the first thread in thread block launch a single kernel to process child positions generated by all threads in the thread block. This results in launching kernel with ~(256 * 45) threads where 256 is my block size and let's say 45 is the average branching factor. This improved performance a lot.

To improve performance further, my program now launches a single kernel after gathering all positions for two levels. E.g, launching kernels with ~256*45*45 threads.

There were many other small improvements that helped and many more things that didn't.

Daniel Shawul wrote: I don't know how hard a transition would it be for you to do parallel perfts but for alpha-beta search you would need to do that. But it is not needed for perft since you can simply feed the device with another perft call i.e. if the device allows you to do that. So does perft(10) take 20x200 sec = 40000sec or just a fraction of those due to overlapped computation?
Daniel

The GPU tries overlapping kernels to improve performance but for perft I have observed that the GPU gets completely utilized starting at depths 5-7, so perft(10) takes nearly 20x200 seconds.

ankan · Post by **ankan** » Sun Jun 23, 2013 6:49 am

Ajedrecista wrote:Hello Ankan:

Truly outstanding! Congratulations.

Please make your tool hash compatible. Daniel's thought of computing large perft results like the 20 draft 13 results of Perft(14) or the 400 draft 12 results of Perft(14) makes sense.

Regarding the time of Perft(9) around 200 seconds without hashing... it is fantastic. As Daniel pointed out, few perft counters can do something similar... surely Paul Byrne, Steven Edwards and Peter Österlund only. Here is a time-to-ply perft table I made a time ago using JetChess (it has hash tables up to 1 GB for perft).

You said that your GPU perft counter is around 100 times faster than the CPU, single threaded, 32-bit version. Could you give an estimate of benefit with reasonable implementation of hash tables to both CPU and GPU versions, please? Thanks in advance and good luck for the task.

Regards from Spain.

Ajedrecista.

Thanks Jesús.

I haven't tried hash tables in either my CPU or GPU version yet but Paul's numbers show a great improvement using hashing. Benefit of hashing is highly dependent on the position but it's always useful.

I think the GPU version will benefit less from hashing as memory accesses are relatively expensive. It might be faster to use hashing only at interior nodes - upto second last or third last nodes from the leafs. I guess I will have to try and see.

Regards,
-Ankan

ankan · Post by **ankan** » Sun Jun 23, 2013 6:53 am

Gerd Isenberg wrote:Wow, impressive!

squaresInBetween is in the meantime corrected:
https://chessprogramming.wikispaces.com ... alculation

Good luck and fun!

Gerd

Thanks Gred.
The chess programming wiki is extremely helpful!

Regards,
-Ankan

ankan · Post by **ankan** » Sun Jun 23, 2013 7:13 am

sje wrote:Can you try running perft(14)? It needs independent verification and it would be helpful to have the 20 perft(13) subtotals available.

Sure, I will definitely try that after adding hash table support to my program. Without hashing it will take my program a few years to compute perft(13)

I also think Paul's approach of counting unique position to a certain depth and then using perft with hash for all unique positions is likely to be fastest for computing large perfts.

Regards,
-Ankan

Fast perft on GPU (upto 20 Billion nps w/o hashing)

Fast perft on GPU (upto 20 Billion nps w/o hashing)

Re: Fast perft on GPU (upto 20 Billion nps w/o hashing)

Re: Fast perft on GPU (up to 20 billion nps w/o hashing).

Re: Fast perft on GPU (upto 20 Billion nps w/o hashing)

Re: Fast perft on GPU (upto 20 Billion nps w/o hashing)

Re: Fast perft on GPU (upto 20 Billion nps w/o hashing)

Re: Fast perft on GPU (upto 20 Billion nps w/o hashing)

Re: Fast perft on GPU (up to 20 billion nps w/o hashing).

Re: Fast perft on GPU (upto 20 Billion nps w/o hashing)

Re: Fast perft on GPU (upto 20 Billion nps w/o hashing)