A cluster of DGX-1 server systems was used to compute this result. The computation took about 8 days, running on an average of 4 systems (32 GPUs total) at a time. A second verification run took about 5 days, running on an average of 8 systems at a time. The verification run was performed with a different set of random numbers for the Zobrist hashes. Obviously this doesn't guarantee correctness, as there could be bugs in my program, but at least the verification run rules out a wrong result due to hardware instability or hash collisions.
The program makes use of GPUs to explore the tree and count the number of leaf nodes. It can make use of multiple GPUs in a system, and now also supports a network mode where it can be run on multiple systems on a network and these instances will communicate with each other to avoid duplicate work.
Perft is basically a brute-force problem: just exploring a tree and counting leaf nodes. When I first started on the GPU version of perft back in 2013, I was doing it just to see whether GPUs are suitable for integer calculations and irregular problems like move generation for chess.
Most of the initial optimizations were focused on improving the speed of move generation. After many experiments I found that a breadth-first-search style approach works best for the GPU. My program starts exploring the tree in a depth-first manner on the CPU. When a certain depth is reached, a GPU function (kernel) is launched to compute the perft of the subtree. The launch depth (currently 6) is chosen to make sure the GPU kernel will be able to fit the entire subtree in memory. If it fails to do so, the kernel returns failure and multiple launches are made at depth - 1. The GPU kernel explores the tree in a breadth-first manner. At every level, GPU kernels are launched to count and then generate child boards, which are then used for the next level. The number of threads launched is equal to the number of frontier nodes (i.e., every thread processes one board position). At depth = 1 (just above the leaves), a special kernel is launched to perform bulk counting (without actually generating child boards). For details of the simple GPU perft routine, see the function perft_bb_gpu_simple().
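The level-by-level expansion scheme can be illustrated with a CPU-side sketch. The toy move generator below is purely an assumption for illustration (the real program generates chess moves inside CUDA kernels, one thread per frontier board); what the sketch does show faithfully is the structure: depth-first recursion replaced by a frontier that is expanded one level at a time, with bulk counting instead of board generation at depth 1.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

using Board = uint64_t;  // stand-in for a chess position

// Toy move generator: position b has (b % 3) + 1 children (illustrative only).
static std::vector<Board> genMoves(Board b) {
    std::vector<Board> children;
    uint64_t n = b % 3 + 1;
    for (uint64_t i = 0; i < n; ++i)
        children.push_back(b * 6364136223846793005ULL + i + 1);  // derive a child "position"
    return children;
}

// Breadth-first perft: keep a frontier of all nodes at the current level and
// expand level by level; at depth == 1 only *count* moves (bulk counting)
// instead of generating child boards.
static uint64_t perftBFS(Board root, int depth) {
    if (depth == 0) return 1;
    std::vector<Board> frontier{root};
    for (int d = depth; d > 1; --d) {
        std::vector<Board> next;       // on the GPU this is built by a kernel,
        for (Board b : frontier) {     // one thread per frontier board
            auto c = genMoves(b);
            next.insert(next.end(), c.begin(), c.end());
        }
        frontier.swap(next);
    }
    uint64_t count = 0;
    for (Board b : frontier)           // the "bulk counting" kernel equivalent
        count += genMoves(b).size();
    return count;
}

// Reference depth-first perft for cross-checking the two traversal orders.
static uint64_t perftDFS(Board b, int depth) {
    if (depth == 0) return 1;
    uint64_t count = 0;
    for (Board c : genMoves(b)) count += perftDFS(c, depth - 1);
    return count;
}
```

Both traversals visit the same tree, so they must agree; on the GPU the breadth-first form wins because each level is one wide, regular, data-parallel launch.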
Soon I realized that if I wanted my perft program to be competitive and to be able to compute bigger perfts, I needed transposition tables to avoid duplicate work. I still launch GPU work in a breadth-first manner. Before launching a new kernel with all frontier nodes, a small separate hash table is used to find (and discard) duplicates in the current set of frontier nodes. The bigger transposition tables are used in a similar way as a CPU perft routine would use them, i.e., in case of a hit, return the value from the hash table without exploring the node any further (for breadth-first search, the node is removed from the list of frontier nodes). There is more than one main hash table: different hash tables for different depths. For depths that are explored on the CPU, a lossless hash table is used (a hash table with chaining to hold multiple entries at the same slot). For depths that are explored on the GPU, there are some hash tables in video/device memory (shallower levels) and some in system memory (deeper levels). These hash tables can't hold all entries and act more like caches. See perft_bb_gpu_simple_hash() for more details.
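The two table roles can be sketched as follows. The struct names and the use of std::unordered_map are my assumptions for a minimal CPU illustration; the real tables are fixed-size, live partly in device memory, and behave as lossy caches rather than exact maps.

```cpp
#include <cassert>
#include <cstdint>
#include <unordered_map>
#include <vector>

using Hash = uint64_t;  // Zobrist-style hash of a position

struct FrontierNode {
    Hash hash;
    uint64_t count;  // how many paths reached this position (multiplicity)
};

// (1) Collapse duplicate positions within the current frontier, summing
// multiplicities, so each unique position is expanded only once.
static std::vector<FrontierNode> dedupFrontier(const std::vector<FrontierNode>& in) {
    std::unordered_map<Hash, uint64_t> seen;
    for (const auto& n : in) seen[n.hash] += n.count;
    std::vector<FrontierNode> out;
    for (const auto& [h, c] : seen) out.push_back({h, c});
    return out;
}

// (2) One transposition table per depth, mapping position hash -> subtree
// count. A hit means the node can be dropped from the frontier entirely.
struct PerDepthTT {
    std::vector<std::unordered_map<Hash, uint64_t>> tables;
    explicit PerDepthTT(int maxDepth) : tables(maxDepth + 1) {}
    bool probe(int depth, Hash h, uint64_t& val) const {
        auto it = tables[depth].find(h);
        if (it == tables[depth].end()) return false;
        val = it->second;
        return true;
    }
    void store(int depth, Hash h, uint64_t val) { tables[depth][h] = val; }
};
```

Keeping the multiplicity alongside each unique position is what makes the dedup sound: a transposition reached by k paths contributes k times its subtree count to the total.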
Once I had my perft program running and working well on a single GPU, I wanted to experiment with scaling/distributing the work across multiple GPUs. Last year, for the perft(14) verification, I added some code to distribute sub-trees to different GPUs on a single system. The distribution happens at the root: every GPU gets a different set of sub-trees to explore. One CPU worker thread is launched per GPU on the system, and each thread is responsible for managing its GPU. Each thread starts exploring a sub-tree and launches GPU work when the GPU launch depth is reached. The CPU-side hash tables are shared among these threads, and even the system-memory hash tables used by GPU perft are shared too. However, the device-memory hash tables are not shared.
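The root-level distribution amounts to a shared work index that the per-GPU worker threads pull from. In this sketch, exploreSubtree() is a placeholder of my own for "DFS the sub-tree, launching GPU kernels at the launch depth", and the function names are assumptions, not the program's actual API:

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <thread>
#include <vector>

// Placeholder for exploring one root sub-tree (real code would do CPU DFS
// and launch GPU kernels at the launch depth).
static uint64_t exploreSubtree(int rootMoveIndex) {
    return static_cast<uint64_t>(rootMoveIndex) + 1;  // dummy count
}

// One CPU worker thread per GPU; a shared atomic counter hands out root
// moves so every GPU gets a different set of sub-trees.
static uint64_t distributedPerft(int numRootMoves, int numGpus) {
    std::atomic<int> nextMove{0};
    std::atomic<uint64_t> total{0};
    std::vector<std::thread> workers;
    for (int gpu = 0; gpu < numGpus; ++gpu) {
        workers.emplace_back([&] {
            // a real worker would first bind to its device, e.g. cudaSetDevice(gpu)
            for (int i = nextMove++; i < numRootMoves; i = nextMove++)
                total += exploreSubtree(i);
        });
    }
    for (auto& t : workers) t.join();
    return total.load();
}
```

Pulling work from a shared counter (rather than statically splitting the root moves) keeps all GPUs busy even when sub-trees have very different sizes.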
A few months back I decided to try distributing the problem to multiple systems (nodes on a cluster), in the hope of computing perft(15) in a reasonable amount of time. A shared file on the network holds a list (IP addresses) of all currently active nodes. Whenever a new instance comes online, it requests the first node in the list to send the complete CPU-side hash table (the lossless one). It then adds its own IP address to the list. Whenever a node adds a new entry to the CPU-side hash table, it broadcasts this entry to all other nodes in the list. If a node is no longer reachable, it is removed from the list by the first peer that detects the failure. This way the CPU-side hash table is kept in sync across all instances of the program.
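The join/broadcast/remove protocol can be sketched in memory, with direct method calls standing in for the network traffic and the peer vector standing in for the shared file of IP addresses. All names here are illustrative assumptions, not the program's real interfaces:

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <string>
#include <vector>

struct Node {
    std::string ip;
    std::map<uint64_t, uint64_t> cpuHashTable;  // the lossless CPU-side table
};

struct Cluster {
    std::vector<Node*> peers;  // stands in for the shared file of active IPs

    // New instance: fetch the full table from the first peer, then register.
    void join(Node& n) {
        if (!peers.empty()) n.cpuHashTable = peers.front()->cpuHashTable;
        peers.push_back(&n);
    }
    // A node that computes a new entry broadcasts it to all other nodes.
    void addEntry(Node& from, uint64_t key, uint64_t value) {
        from.cpuHashTable[key] = value;
        for (Node* p : peers)
            if (p != &from) p->cpuHashTable[key] = value;
    }
    // An unreachable node is dropped from the list.
    void remove(Node& n) {
        peers.erase(std::remove(peers.begin(), peers.end(), &n), peers.end());
    }
};
```

The full-table copy at join time plus incremental broadcasts afterwards is what keeps every instance's lossless table identical without any central server.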
Disclosure: I work for Nvidia, but chess programming is a personal hobby. I would like to thank my employer for letting me use cycles on the cluster for this computation.