perft(15)

ankan · Post by **ankan** » Sat Aug 26, 2017 6:39 am

A cluster of DGX1 server systems was used to compute this result. The computation took about 8 days running on average of 4 systems (32 GPUs total) at a time. A second verification run took about 5 days running on average of 8 systems at a time. The verification run was performed with a different set or random numbers for zobrist hashes. Obviously this doesn't guarantee correctness as there could be bugs in my program - but at least the verification run avoids the possibility wrong result due to hardware instability or hash collisions.

My program makes use of GPUs to explore the tree and count no. of leaf nodes. It can make use of multiple GPUs in a system and now also supports a network mode where one can run the program on multiple systems on a network and these instances will communicate with each other to avoid duplicate work.

perft is basically a brute force problem - just exploring a tree and counting leaf nodes. When I first started with GPU version of perft - back in 2013, I was doing it just to see if GPUs are suitable for integer calculations and irregular problems like move generation for chess.
Most of the initial optimizations were focussed in improving speed of move generation. After many experiments I found that a breadth first search kind of approach works best for the GPU. My program starts exploring the tree in depth first manner on CPU. When a certain depth is reached a GPU function (kernel) is launched to compute perft of the subtree. The launch depth (currently 6) is chosen to make sure GPU kernel will be able to fit the entire subtree in memory. If it fails to do so, the kernel returns failure and multiple launches are made at depth-1. The GPU kernel explores the tree in breadth first manner. At every level, GPU kernels are launched to count and then generate child boards - which are then used for next level. The no of threads launched is equal to the no of frontier nodes (i.e, every thread processes one board position). At depth = 1 (just above leaves), a special kernel is launched to perform bulk counting (without actually generating child boards). For details of simple GPU perft routine see the function perft_bb_gpu_simple().

Soon I realized that if I want my perft program to be competitive and to be able to compute bigger perfts, I need transposition tables to avoid duplicate work. I still launch GPU work in breadth first manner. Before launching new kernel with all frontier nodes, a small separate hash table is used to find (and discard) duplicates in the current set of frontier nodes. The bigger transpositions tables are used in similar way as a CPU perft routine would use - i.e, in case of a hit, return value from hash table without exploring the node any further (for breadth first search, the node is removed from the list of frontier nodes). There are more than a single main hash table - different hash tables for different 'depths'. For depths that are explored on CPU, a lossless hash table is used (hash table with chain to hold multiple entries at same slot). For depths that are explored on the GPU, there are some hash tables in video/device memory (shallower levels) and some in system memory (deeper levels). These hash-tables can't hold all entries and act more like cache. See perft_bb_gpu_simple_hash() for more details.

Once I had my perft programming running and working well on single GPU, I wanted to experiment with scaling/distributing the work across multiple GPUs. Last year, for perft(14) verification I added some code to distribute sub-trees to different GPUs on a single system. The distribution happens at the root - every GPU gets a different set of sub-trees to explore. CPU worker threads are launched corresponding to each GPU on the system. These CPU threads are responsible for managing one GPU each. Each thread starts exploring a sub-tree and launches GPU work when GPU launch depth is reached. The CPU side hash tables are shared among these threads, and even the system memory hash tables used by GPU perft are also shared. However the device memory hash tables are not shared.

A few months back I decided to try distributing the problem to multiple systems (nodes on a cluster) - in the hope of computing perft(15) in reasonable amount of time. A shared file on network is used to hold a list (ip addresses) of all currently active nodes. Whenever a new instance comes online, it requests the first node in the list to send the complete CPU side hash table (the lossless one). It then adds its own IP address to the list. Whenever a node adds a new entry to CPU side hash table, it broadcasts this entry to all other nodes in the list. In case a node is no longer reachable, it's removed from the list (by the first peer that detects). This way the CPU side hash table is kept in sync across all instances of the program.

Disclosure: I work for Nvidia but chess programming is a personal hobby. I would like to thank my employer for letting me to use cycles on the cluster for this computation.

Dann Corbit · Post by **Dann Corbit** » Sat Aug 26, 2017 8:58 am

ankan wrote:
Dann Corbit wrote: Do you have stored statistics, such as how many mates are in the first 15 ply, how many stalemates, etc.?
Unfortunately no. I can modify my program to count these statistics - but it will probably become quite a bit slow. I would guess counting these will for first 14 ply would take about the same time as the perft 15 computation took.

I strongly suspect that your move generator (since it must detect mates to get the correct answer ) would create the strongest mate prover eve imagined.

Isn't it time to give Chest319 some competition?

ankan · Post by **ankan** » Sat Aug 26, 2017 9:10 am

Dann Corbit wrote: I strongly suspect that your move generator (since it must detect mates to get the correct answer ) would create the strongest mate prover eve imagined.

Isn't it time to give Chest319 some competition?

Yes, that's the next thing I plan to attempt. However the main issue is alpha-beta pruning (and current parallel search algorithms based on that) is not something that scales well to thousands to millions of threads. I have some plans and am going to try out a few things. Will update if I have any success with it.

Dann Corbit · Post by **Dann Corbit** » Sat Aug 26, 2017 9:31 am

ankan wrote:
Dann Corbit wrote: I strongly suspect that your move generator (since it must detect mates to get the correct answer ) would create the strongest mate prover eve imagined.

Isn't it time to give Chest319 some competition?
Yes, that's the next thing I plan to attempt. However the main issue is alpha-beta pruning (and current parallel search algorithms based on that) is not something that scales well to thousands to millions of threads. I have some plans and am going to try out a few things. Will update if I have any success with it.

I am so excited that someone with your immense talent is looking at this.

I expect the best mate finder (by at least two orders of magnitude) will be the result.

You have to really understand these GPU boards to get the horsepower out of them. I am really geeked to see people of great talent putting their brains at risk to solve these difficult problems.

!

cdani · Post by **cdani** » Sat Aug 26, 2017 12:20 pm

Nice work! Congratulations!!

ankan · Post by **ankan** » Sat Aug 26, 2017 2:28 pm

Thanks all. Wouldn't have been possible without encouragement from everyone in this forum (especially Steven).

smatovic · Post by **smatovic** » Sat Aug 26, 2017 6:39 pm

Kudos from Germany......years of compute cycles compressed into 5 days, impressive.

--
Srdja

Ajedrecista · Post by **Ajedrecista** » Sat Aug 26, 2017 8:34 pm

Hello Dann:

Dann Corbit wrote:Do you have stored statistics, such as how many mates are in the first 15 ply, how many stalemates, etc.?

The current shortest known stalemates are 19-ply stalemates. It was discussed in the following thread:

estimating the number of possible stalemates in perft(n)

Regarding checks, checkmates, discovered checks and so on, you can get interesting info for some depths at this site:

Statistics on chess games

Dann Corbit wrote:I strongly suspect that your move generator (since it must detect mates to get the correct answer ) would create the strongest mate prover eve imagined.

Isn't it time to give Chest319 some competition?

It would be great. By the way, should not ChestUCI 5.2 be somewhat better than Chest 319? Just asking.

Regards from Spain.

Ajedrecista.

bhamadicharef · Post by **bhamadicharef** » Thu Aug 31, 2017 8:46 am

What is Chest319 ? What method does it use ?

bhamadicharef · Post by **bhamadicharef** » Thu Aug 31, 2017 8:50 am

Nice job Ankan ! With 4 DGX-1 the job is getting faster then

I look forward to try your tool on my servers at work too !

perft(15)

Re: Perft(15): comparison of estimates with Ankan's result.

Re: perft(15)

Re: perft(15)

Re: perft(15)

Re: perft(15)

Re: perft(15)

Re: Perft(15): comparison of estimates with Ankan's result.

Re: Perft(15).

Re: perft(15)

Re: Perft(15): comparison of estimates with Ankan's result.