Part 1: The open secret
GPUs run integer arithmetic around 100x faster than a 32-core processor, and you can fit 8 of them into one machine. Floating-point network operations are around 100x faster as well, and memory is about 20x faster. From actual VK compute shader code I can measure this memory bandwidth compared to the host:
Code:
GQueens: 113.976
GB/s : 911.805
Host GB/s : 23.1888
Part 2: The Architecture
You don't use depth-first AB but breadth-first MCTS with AB at the leaves. You keep all nodes in memory, with your working set backed by a memory-mapped file that is imported into the GPU, which is possible in both VK and CUDA. (So you can access hundreds of GB of PCIe-attached memory from inside the GPU without any issues, and work in chunks of 8 GB of video memory which get evicted to PCIe SSDs implicitly.)
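As a rough illustration of the CUDA side, here is a minimal sketch of mapping a node-pool file and handing the same pointer to the device. This assumes Linux mmap; whether the driver accepts registering a file-backed mapping depends on OS and driver version, and the path, sizes and function name are placeholders:
Code:
#include <cuda_runtime.h>
#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

// Map a large node-pool file from a PCIe SSD and expose it to GPU kernels.
void* map_node_pool(const char* path, size_t bytes, void** dev_ptr)
{
    int fd = open(path, O_RDWR);
    void* host = mmap(nullptr, bytes, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    close(fd);

    // Pin the mapping and obtain a device-side pointer to the same memory.
    cudaHostRegister(host, bytes, cudaHostRegisterMapped);
    cudaHostGetDevicePointer(dev_ptr, host, 0);

    return host;   // CPU and GPU now share one source of truth (error checks omitted)
}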
Having all nodes in memory means you build the tree of relevant nodes and prune them similarly to AB, but you don't need to incrementally increase depth and hope the TT gives you hints about what to prune - you literally delete nodes and whole subtrees whenever you need to. Can this be done fast? Yes, very - with CUDA Thrust you can mark and delete dead nodes at 900 GB/s in a single line of code, something that would have taken months of effort developing a parallel stream compaction a few years ago.
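The prune itself is essentially one call to thrust::remove_if. A sketch, assuming a node struct with a 'dead' flag set during the marking pass (the struct and field names are illustrative, not a fixed layout):
Code:
#include <cstdint>
#include <thrust/device_vector.h>
#include <thrust/remove.h>

struct Node {
    uint64_t hash;      // position hash
    uint32_t parent;    // index of the parent node
    uint16_t move_id;   // move that led to this node
    uint8_t  dead;      // set to 1 when the node's subtree was pruned
};

struct IsDead {
    __host__ __device__ bool operator()(const Node& n) const { return n.dead != 0; }
};

// Stream-compacts the node pool at memory bandwidth: survivors are packed to
// the front and the vector is shrunk, freeing the slots of the dead subtree.
void prune(thrust::device_vector<Node>& nodes)
{
    nodes.erase(thrust::remove_if(nodes.begin(), nodes.end(), IsDead()), nodes.end());
}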
You run an MCTS algorithm where the "random playouts" are replaced by the CUDA cores running AB tree search at the leaf nodes. LC0 has a big NN at the leaf nodes - that is not what I am proposing. At the leaf nodes you run classical AB tree search with zero additional memory consumption - and here comes the cool part:
You run the AB tree search algorithm on 32 different leaf nodes at the same time (the warp size), meaning the second big problem on GPUs - thread divergence - can be solved with the small overhead of a few sync() calls. At the leaf nodes of AB you have a GPU-ported NNUE implementation. GPU memory is 20x faster, so loading and applying weights will be very fast, and the integer and uint16_t ReLU arithmetic will be fast too. I am proposing an NNUE binary-compatible interface so we don't have to research and train twice.
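A rough kernel sketch of the warp idea - every lane gets its own MCTS leaf and runs the same AB code on different data. Position, alpha_beta and the fixed depth are placeholders, not a real implementation:
Code:
#include <cstdint>

struct Position { uint64_t bitboards[8]; };   // placeholder board representation

// Placeholder for the real device-side alpha-beta with NNUE eval at its leaves.
__device__ int alpha_beta(const Position& pos, int depth, int alpha, int beta)
{
    return 0;   // real search goes here
}

__global__ void playout_kernel(const Position* leaves, int* scores, int n_leaves)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    int score = 0;
    if (tid < n_leaves) {
        // Each lane searches its own leaf; all 32 lanes of a warp run the same
        // code, so divergence stays inside the per-node move loops.
        score = alpha_beta(leaves[tid], /*depth=*/6, -32000, 32000);
    }

    __syncwarp();                               // cheap warp re-convergence point
    if (tid < n_leaves) scores[tid] = score;
}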
With MCTS running on the GPU itself, the kernel never stops executing; all the CPU needs to do is provide UCI - or does it?
With memory mapping we have the choice to run the same approach on the CPU AND on other GPUs as well - you end up with a single pointer valid on all devices. Having a single source of truth backed by fast memory is an approach that allows arbitrary parallelism.
Why this was not possible 5 years ago
1) Support for accessing memory-mapped pointers on the device was not widespread. Today it is.
2) PCIe 2.0/3.0 SSDs were too slow. If you buy two PCIe 4.0 SSDs - both can be had for less than 100 dollars - you can effectively keep a huge number of relevant nodes in memory. If a node gets pruned you need to move a lot of memory, so you need to reach at least 20 GB/s, which is not expensive today.
3) CUDA intrinsics! Not only do you get 10k cores, not only can each core execute at 3 GHz, not only do you get 32 instructions per clock - on top of that Nvidia has added __brevll, a bit reversal that is important for color-agnostic boards, and they have also added __viaddmin_s16x2_relu, which performs per-halfword max(min(a + b, c), 0); see the sketch after this list. You can see the significance here: https://www.talkchess.com/forum3/viewto ... 42#p948672
Which brings me to:
4) Atomics, fences and semaphores. Essentially you can have all workers execute on the same pointer, mapped to a multi-TB region of PCIe-backed fast memory, but when the time comes to prune you need to tell everyone. From time to time you have to sync, essentially telling the caches of all devices the new outcomes of the simulations for UCB1 (https://en.wikipedia.org/wiki/Monte_Carlo_tree_search) so they can adjust their exploration strategies; see the atomic backup sketch after this list.
5) An MCTS node needs at least a uint64_t, a pointer and the move_id. This is not the place for implementation details, but it shows that it is easily possible to keep many billions of nodes in working memory without any slowdown, together with efficient pruning etc.
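For 3), a toy device snippet showing the two intrinsics. This assumes a recent CUDA toolkit (__viaddmin_s16x2_relu is one of the newer DPX-style SIMD intrinsics, hardware-accelerated on Hopper), and the variable names are illustrative only:
Code:
// Clipped ReLU on two int16 NNUE accumulator lanes packed into one 32-bit word:
// per halfword max(min(acc + weight, clip), 0) in a single instruction.
__device__ unsigned int nnue_pair_relu(unsigned int acc_pair,
                                       unsigned int weight_pair,
                                       unsigned int clip_pair)
{
    return __viaddmin_s16x2_relu(acc_pair, weight_pair, clip_pair);
}

// Bit-reverse a 64-bit occupancy mask, useful for color-agnostic boards.
__device__ unsigned long long mirror_occupancy(unsigned long long bb)
{
    return __brevll(bb);
}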
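And for 4), a sketch of backing up a playout result with atomics, so that many workers (or devices) can update the same UCB1 statistics concurrently. The struct and field names are assumptions, not a fixed node layout:
Code:
#include <cstdint>

struct MctsNode {
    unsigned int visits;      // N(node), used in the UCB1 exploration term
    int          value_sum;   // summed playout scores
};

__device__ void backup(MctsNode* nodes, uint32_t node_idx, int playout_score)
{
    // Lock-free statistic updates; other warps see consistent counters.
    atomicAdd(&nodes[node_idx].visits, 1u);
    atomicAdd(&nodes[node_idx].value_sum, playout_score);

    // Make the update visible across the whole system (other GPUs / the host).
    __threadfence_system();
}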
If the opponent makes a move you instantly evict all positions that are no longer reachable from the tree - something current AB engines cannot do in any way, shape or form.
There are a lot more details to be shared - but switching from pure AB, which by itself demands serial execution, to a parallel approach, and turning the GPU's weakness of 32 threads needing to execute the same instruction into a strength by running AB on 32 different MCTS leaf nodes, is very powerful.
As a bonus we can, on top of that, use multiple devices effectively working on the same pointer as the tree root - picking promising nodes with different strategies and verifying them with AB + NNUE in the playout phase.
Bonus - what you don't need to invent anew:
Any AB code and ideas written in C++ will be mostly compatible. Warps can even share a small TT without any issues inside a 64 KB shared-memory buffer of ultra-fast memory; see the sketch below. Even plain VRAM is 20x faster than current approaches.
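A sketch of such a per-block shared-memory TT; the entry layout is illustrative, and 64 KB of dynamic shared memory requires the opt-in shown below on Volta or newer:
Code:
#include <cstdint>
#include <cuda_runtime.h>

struct TTEntry {
    uint64_t key;        // zobrist hash of the position
    int16_t  score;
    int16_t  depth;
    uint16_t best_move;
    uint16_t flags;      // bound type: exact / lower / upper
};                       // 16 bytes -> 4096 entries fit in 64 KB

__global__ void search_block()
{
    extern __shared__ TTEntry tt[];   // sized at launch time

    // Probe/store pattern used by the per-lane AB searches, e.g.:
    //   TTEntry& slot = tt[key & 4095];
    //   if (slot.key == key && slot.depth >= depth) { /* use slot.score */ }
}

void launch()
{
    // Opt in to more than 48 KB of dynamic shared memory, then launch with 64 KB.
    cudaFuncSetAttribute(search_block,
                         cudaFuncAttributeMaxDynamicSharedMemorySize, 64 * 1024);
    search_block<<<1, 32, 64 * 1024>>>();
}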
Any NNUE networks already trained will be binary compatible - you essentially copy the weights to VRAM and execute there.
MCTS as a base does not need to be invented either, and it allows for massive unconstrained parallelism. In my approach it is neither bogged down by a huge NN nor by thread divergence, nor is PCIe latency an issue.
Greetings - Daniel