Very interesting project. Your #3 is closest to what I'm trying to do, so it's good to hear it gave good raw nodes-per-second throughput. As you point out, Alpha-Beta pruning in parallel seems to be an unsolved problem, but I hope to accomplish it in this topic.

I've searched for some of the words in your post, and apparently this old topic exists: http://www.talkchess.com/forum3/viewtop ... =7&t=41853

I'll have to go through that topic and see what other parallel methodologies were proposed and tried. I'll take a look at your code as well, and maybe I'll be inspired with alternative approaches.

smatovic wrote: ↑Fri Jun 21, 2019 9:14 pm
dragontamer5788 wrote: ↑Fri Jun 21, 2019 7:35 pm ...
Maybe a word on what I have tried on gpu. During development I got 6 different approaches playing chess on gpu; the latter two played decent chess...
1) 'SPPS' AlphaBeta parallelization, 128 threads work on up to 128 moves per node in parallel
2) MCTS without UCT across multiple Compute Units of the GPU
3) LIFO-Stack based load balancing for AlphaBeta search on one Compute Unit of the GPU
4) 'Nagging' and 'Spam' parallelization approach for AlphaBeta search on one Compute Unit of the GPU
5) Parallel BestFirstMiniMax Search, similar to MCTS but with AlphaBeta playouts at leaf nodes
6) Classic parallel AlphaBeta approach with ABDADA and RMO, randomized move order, for parallelization
5) uses the 'one thread one board' approach with thousands of threads, and 6) uses the 'one SIMD one board' approach with hundreds of workers, where 64 threads are coupled into one work-group to work on the same node in parallel during move generation, move sorting and position evaluation.
IIRC, 3), the LIFO-Stack, gave the best nps throughput, but I could not figure out how to implement AlphaBeta pruning efficiently.
Zeta milestones:
https://zeta-chess.app26.de/post/zeta-milestones/
Zeta v099k parallel scaling results:
https://github.com/smatovic/Zeta/blob/m ... esults.txt
--
Srdja
Instead of using a LIFO stack per compute unit, I plan to use a deque as if it were a stack within each compute unit. Whenever the deque fills up, it "work donates" its oldest entries to the global stack, which will allow other compute units to work-steal. A smaller "local" deque will allow for more work-stealing but will require more synchronization; a larger deque will have less work-stealing but will require less communication between compute units. The two operations are cheap (see the sketch after this list):
* Using the deque as a stack is as simple as "add to tail" and "remove from tail" (local operations within a compute unit).
* The work-donation step is as simple as "remove from head". My current plan is to use simple spinlocks for the global work buffer.
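To make that concrete, here is a minimal host-side C++ sketch of the scheme. The names (Task, GlobalStack, LocalDeque) and the capacity parameter are my own illustration, and std::mutex stands in for the planned spinlock; none of this is GPU code yet, just the shape of the data structure.

#include <deque>
#include <mutex>
#include <optional>
#include <vector>

struct Task { int node_id; };    // stand-in for a search node / board position

// Global work buffer shared by all compute units. std::mutex models the
// planned spinlock; on a GPU this would be an atomic compare-and-swap loop.
struct GlobalStack {
    std::mutex lock;
    std::vector<Task> buf;

    void donate(Task t) {
        std::lock_guard<std::mutex> g(lock);
        buf.push_back(t);
    }

    std::optional<Task> steal() {
        std::lock_guard<std::mutex> g(lock);
        if (buf.empty()) return std::nullopt;
        Task t = buf.back();
        buf.pop_back();
        return t;
    }
};

// Per-compute-unit deque, used as a LIFO stack at the tail. It has a single
// owner (its compute unit), so the hot path needs no lock.
struct LocalDeque {
    std::deque<Task> dq;
    std::size_t capacity;        // tuning knob: smaller => more donation
    GlobalStack* global_work;

    LocalDeque(std::size_t cap, GlobalStack* g)
        : capacity(cap), global_work(g) {}

    // "Add to tail". When the deque is full, the oldest task is removed
    // from the head and donated to the global stack for others to steal.
    void push(Task t) {
        if (dq.size() >= capacity) {
            global_work->donate(dq.front());   // "remove from head"
            dq.pop_front();
        }
        dq.push_back(t);
    }

    // "Remove from tail"; falls back to stealing from the global stack
    // when the local deque runs dry.
    std::optional<Task> pop() {
        if (!dq.empty()) {
            Task t = dq.back();
            dq.pop_back();
            return t;
        }
        return global_work->steal();
    }
};

The point of this layout is that the hot path (push/pop at the tail) never touches the lock; only donation and stealing do, so the deque capacity directly trades lock traffic against load balance, which is exactly the smaller-vs-larger trade-off described above.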