Part 1: The open secret
GPUs run integer arithmetic around 100x faster than a 32-core processor, and you can fit 8 of them into one machine. Floating-point network operations are around 100x faster as well, and memory is about 20x faster. From actual VK compute shader code I can measure this memory bandwidth compared to the host:
Code:
GQueens: 113.976
GB/s : 911.805
Host GB/s : 23.1888
Part 2: The Architecture
You don't use depth-first AB but breadth-first MCTS with AB at the leaves. You keep all nodes in memory, with your working set backed by a memory-mapped file that is imported into the GPU, which is possible in both VK and CUDA. (So you can access hundreds of GB of PCIe-attached memory from inside the GPU without any issues, and work in chunks of 8 GB of video memory which get evicted to PCIe SSDs implicitly.)
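As a rough illustration of the CUDA side, here is a minimal sketch of mapping a node-pool file and handing the same pointer to the device. This assumes Linux mmap; whether the driver accepts registering a file-backed mapping depends on OS and driver version, and the path, sizes and function name are placeholders:
Code:
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a large node-pool file from a PCIe SSD and expose it to GPU kernels.
void* map_node_pool(const char* path, size_t bytes, void** dev_ptr)
{
    int fd = open(path, O_RDWR);
    void* host = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    // Pin the mapping and obtain a device-side pointer to the same memory.
    cudaHostRegister(host, bytes, cudaHostRegisterMapped);
    cudaHostGetDevicePointer(dev_ptr, host, 0);

    return host;   // CPU and GPU now share one source of truth (error checks omitted)
}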
Having all nodes in memory means you build the tree of relevant nodes and prune them similarly to AB, but you don't need to incrementally increase depth and hope the TT gives you hints about what to prune - you literally delete nodes and whole subtrees whenever you need to. Can this be done fast? Yes, very - with CUDA Thrust you can mark and delete dead nodes at 900 GB/s in a single line of code, something that would have taken months of effort developing a parallel stream compaction a few years ago.
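The prune itself is essentially one call to thrust::remove_if. A sketch, assuming a node struct with a 'dead' flag set during the marking pass (the struct and field names are illustrative, not a fixed layout):
Code:
#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/remove.h>

struct Node {
    uint64_t hash;      // position hash
    uint32_t parent;    // index of the parent node
    uint16_t move_id;   // move that led to this node
    uint8_t  dead;      // set to 1 when the node's subtree was pruned
};

struct IsDead {
    __host__ __device__ bool operator()(const Node& n) const { return n.dead != 0; }
};

// Stream-compacts the node pool at memory bandwidth: survivors are packed to
// the front and the vector is shrunk, freeing the slots of the dead subtree.
void prune(thrust::device_vector<Node>& nodes)
{
    nodes.erase(thrust::remove_if(nodes.begin(), nodes.end(), IsDead()), nodes.end());
}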
You run an MCTS algorithm where the "random playouts" are replaced by the CUDA cores running AB tree search at the leaf nodes. LC0 has a big NN at the leaf nodes - that is not what I am proposing. At the leaf nodes you run classical AB tree search with zero additional memory consumption - and here comes the cool part:
You run the AB tree search algorithm on 32 different leaf nodes at the same time (the warp size), meaning the second big problem on GPUs - thread divergence - can be solved with the small overhead of a few sync() calls. At the leaf nodes of AB you have a GPU-ported NNUE implementation. GPU memory is 20x faster, so loading and applying weights will be very fast, and the integer and uint16_t ReLU arithmetic will be fast too. I am proposing an NNUE binary-compatible interface so we don't have to research and train twice.
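A rough kernel sketch of the warp idea - every lane gets its own MCTS leaf and runs the same AB code on different data. Position, alpha_beta and the fixed depth are placeholders, not a real implementation:
Code:
#include <cstdint>

struct Position { uint64_t bitboards[8]; };   // placeholder board representation

// Placeholder for the real device-side alpha-beta with NNUE eval at its leaves.
__device__ int alpha_beta(const Position& pos, int depth, int alpha, int beta)
{
    return 0;   // real search goes here
}

__global__ void playout_kernel(const Position* leaves, int* scores, int n_leaves)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    int score = 0;
    if (tid < n_leaves) {
        // Each lane searches its own leaf; all 32 lanes of a warp run the same
        // code, so divergence stays inside the per-node move loops.
        score = alpha_beta(leaves[tid], /*depth=*/6, -32000, 32000);
    }

    __syncwarp();                               // cheap warp re-convergence point
    if (tid < n_leaves) scores[tid] = score;
}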
With MCTS running on the GPU itself, the kernel never stops executing; all the CPU needs to do is provide UCI - or does it?
With memory mapping we have the choice to run the same approach on the CPU AND on other GPUs as well - you end up with a single pointer valid on all devices. Having a single source of truth backed by fast memory is an approach that allows arbitrary parallelism.
Why this was not possible 5 years ago
1) Support for accessing memory-mapped pointers on the device was not widespread. Today it is.
2) PCIe 2.0/3.0 SSDs were too slow. If you buy two PCIe 4.0 SSDs - both can be had for less than 100 dollars - you can effectively keep a huge number of relevant nodes in memory. If a node gets pruned you need to move a lot of memory, so you need to reach at least 20 GB/s, which is not expensive today.
3) CUDA intrinsics! Not only do you get 10k cores, not only can each core execute at 3 GHz, not only do you get 32 instructions per clock - on top of that Nvidia has added __brevll, a bit reversal that is important for color-agnostic boards, and they have also added __viaddmin_s16x2_relu, which performs per-halfword max(min(a + b, c), 0); see the sketch after this list. You can see the significance here: https://www.talkchess.com/forum3/viewto ... 42#p948672
Which brings me to:
4) Atomics, fences and semaphores. Essentially you can have all workers execute on the same pointer, mapped to a multi-TB region of PCIe-backed fast memory, but when the time comes to prune you need to tell everyone. From time to time you have to sync, essentially telling the caches of all devices the new outcomes of the simulations for UCB1 (https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) so they can adjust their exploration strategies; see the atomic backup sketch after this list.
5) An MCTS node needs at least a uint64_t, a pointer and the move_id. This is not the place for implementation details, but it shows that it is easily possible to keep many billions of nodes in working memory without any slowdown, together with efficient pruning etc.
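For 3), a toy device snippet showing the two intrinsics. This assumes a recent CUDA toolkit (__viaddmin_s16x2_relu is one of the newer DPX-style SIMD intrinsics, hardware-accelerated on Hopper), and the variable names are illustrative only:
Code:
// Clipped ReLU on two int16 NNUE accumulator lanes packed into one 32-bit word:
// per halfword max(min(acc + weight, clip), 0) in a single instruction.
__device__ unsigned int nnue_pair_relu(unsigned int acc_pair,
                                       unsigned int weight_pair,
                                       unsigned int clip_pair)
{
    return __viaddmin_s16x2_relu(acc_pair, weight_pair, clip_pair);
}

// Bit-reverse a 64-bit occupancy mask, useful for color-agnostic boards.
__device__ unsigned long long mirror_occupancy(unsigned long long bb)
{
    return __brevll(bb);
}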
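And for 4), a sketch of backing up a playout result with atomics, so that many workers (or devices) can update the same UCB1 statistics concurrently. The struct and field names are assumptions, not a fixed node layout:
Code:
#include <cstdint>

struct MctsNode {
    unsigned int visits;      // N(node), used in the UCB1 exploration term
    int          value_sum;   // summed playout scores
};

__device__ void backup(MctsNode* nodes, uint32_t node_idx, int playout_score)
{
    // Lock-free statistic updates; other warps see consistent counters.
    atomicAdd(&nodes[node_idx].visits, 1u);
    atomicAdd(&nodes[node_idx].value_sum, playout_score);

    // Make the update visible across the whole system (other GPUs / the host).
    __threadfence_system();
}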
If the opponent makes a move you instantly evict all positions that are no longer reachable from the tree - something current AB engines cannot do in any way, shape or form.
There are a lot more details to be shared - but switching from pure AB, which by itself demands serial execution, to a parallel approach, and turning the GPU's weakness of 32 threads needing to execute the same instruction into a strength by running AB on 32 different MCTS leaf nodes, is very powerful.
As a bonus we can, on top of that, use multiple devices effectively working on the same pointer as the tree root - picking promising nodes with different strategies and verifying them with AB + NNUE in the playout phase.
Bonus - what you don't need to invent anew:
Any AB code and ideas written in C++ will be mostly compatible. Warps can even share a small TT without any issues inside a 64 KB shared-memory buffer of ultra-fast memory; see the sketch below. Even plain VRAM is 20x faster than current approaches.
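A sketch of such a per-block shared-memory TT; the entry layout is illustrative, and 64 KB of dynamic shared memory requires the opt-in shown below on Volta or newer:
Code:
#include <cstdint>
#include <cuda_runtime.h>

struct TTEntry {
    uint64_t key;        // zobrist hash of the position
    int16_t  score;
    int16_t  depth;
    uint16_t best_move;
    uint16_t flags;      // bound type: exact / lower / upper
};                       // 16 bytes -> 4096 entries fit in 64 KB

__global__ void search_block()
{
    extern __shared__ TTEntry tt[];   // sized at launch time

    // Probe/store pattern used by the per-lane AB searches, e.g.:
    //   TTEntry& slot = tt[key & 4095];
    //   if (slot.key == key && slot.depth >= depth) { /* use slot.score */ }
}

void launch()
{
    // Opt in to more than 48 KB of dynamic shared memory, then launch with 64 KB.
    cudaFuncSetAttribute(search_block,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);
    search_block<<<1, 32, 64 * 1024>>>();
}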
Any NNUE networks already trained will be binary compatible - you essentially copy the weights to VRAM and execute there.
MCTS as a base does not need to be invented either, and it allows for massive unconstrained parallelism. In my approach it is neither bogged down by a huge NN nor by thread divergence, nor is PCIe latency an issue.
Greetings - Daniel