Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

dragontamer5788 · Post by **dragontamer5788** » Thu Jun 20, 2019 9:50 pm

Introduction

A few weeks ago, I theorized that GPUs can effectively perform work-efficient alpha-beta search. It led to a long discussion elsewhere on the internet, which pretty much ended with "If you think its possible, then you should do it". A couple of weeks later, here I am today ready to learn from the Chess community and begin tackling this project.

From my understanding: a work-efficient parallel algorithm for Alpha-Beta search is already a "unicorn project" never seen before... and I'm complicating the project by trying to make it work on a grossly non-traditional architecture: the GPU. Finally, I'm not even using CUDA, but instead the lesser-used AMD ROCm.

But despite all of these unknowns, the power of the Vega 64 GPU I'm targeting gives me much hope. With 4096 shaders (supporting 16,384 GPU-threads of execution) clocked at 1.5GHz, and with 8GB of HBM2 RAM delivering 483GB/s, the Vega 64 GPU is a monster of a machine: far wider than your typical 6-core CPU. Far more memory bandwidth than your typical 20GB/s DDR4 channel.

Furthermore, the GPGPU architecture is largely similar between NVidia and AMD, any ideas I implement on AMD's GPU should port over to NVidia's GPUs (although the code may have to change slightly). All modern GPUs have specs of this nature: incredible numbers of SIMD cores, and stupefying amounts of RAM bandwidth. It behooves the modern chess programmer to leverage the GPU.

There are many unanswered questions about how this can be done, but this topic hopes to at least answer the problem of search trees.

Previous Research

* https://github.com/StuartRiffle/Jaglavak -- An NVidia GPU-based chess engine that uses MCTS (calculated on the CPU) + Random Rollouts (calculated on the GPU). Not my project, but its proof that a GPU can implement a bitboard-based chess program, even if the MCTS search-tree remains strictly on the CPU-side of this implementation.

* Intel's Task-based Fibonnaci example: https://software.intel.com/en-us/node/506102 -- This is one of my main inspirations for the project. Task-based parallelism captures the structure of recursive algorithms (like Alpha-beta search) elegantly. Unfortunately, Task-based parallelism doesn't seem to have been ported to the GPU architecture yet.

* Stream Compaction: http://www.cse.chalmers.se/~uffe/streamcompaction.pdf -- Stream Compaction is the algorithm used for GPGPU-based raytracers. This is the algorithm I'll be using to implement parallel Task-queues (or parallel Task-stacks). Task-based parallelism for GPUs doesn't exist yet, but the Stream Compaction algorithm should make it possible.

* Futures / Promises -- https://en.wikipedia.org/wiki/Futures_and_promises . The C++ community (and other languages) have implemented future<int> as a synchronization primitive. This does not exist on a GPU yet.

* Tomasulo's Algorithm -- https://en.wikipedia.org/wiki/Tomasulo_algorithm. Tomasulo's Algorithm is the CPUs use to discover which operations can be safely executed out-of-order. This algorithm "discovers" parallelism, even in the (seemingly sequential) CPU Assembly language. I hope to use Tomasulo's Algorithm to implement Futures<> efficiently. Virtually every modern CPU implements this algorithm (or a variant of it) in the CPU's microcode to provide out-of-order execution of code.

Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Now that the introductions are out of the way, here's my gameplan that I want to discuss. I hope to implement the following rough algorithm on the GPU (based on ChessProgramming.org Negamax AlphaBeta):

Code: Select all

    future<int> alphaBeta( future<int> alpha, future<int> beta, int depthleft ) {
       if( depthleft == 0 ) return quiesce( alpha, beta );
       for ( all moves)  {
          future<int> score = Spawn(-alphaBeta( -beta, -alpha, depthleft - 1 )); // "Lazy calculate" the "negate" operation without actually evaluating it.
          if(beta != infinity){ //Blocking is unnecessary while beta is infinity
             //Block until "score >= beta" can be evaluated.
             beta.wait(); // All stored "negate" and "max" operations will be evaluated here, and here only.
             score.wait(); // Ditto
             if( score >= beta ) 
                 return beta;
          }
         score = max(score, alpha); // "lazy calculate" the "max" operation without actually evaluating it until a "wait" statement is called
       }
       return alpha; // This is a "lazy symbolic fugure"
    }

Some notes:

1. "Spawn" will be Task-based parallelism. Spawn roughly correlates to "Place a new task into the work-queue". It is assumed that the available threads (the 16,000+ threads of the Vega64) will be grabbing tasks from this queue. Obviously, efficient implementation of this work-queue is of utter-importance for this whole approach to work.

2. Lazy evaluation of score = max(score, alpha) is the hard part. Presume "alpha" initially equals [A, B, max, -], while "score" initially equals "[C, D, Max]. Then max(score, alpha) will generate [C, D, Max, A, B, Max, -, Max], representing the "lazy evaluation" of the alpha-tree... where "A", "B", "C", and "D" are values that will be returned by some Spawned task. ("Register renaming" from Tomasulo's Algorithm). These lists are organized into postfix notation. That is, [C, D, Max, A, B, Max, -, Max] is equivalent to the computation Max(-Max(A,B), Max(C, D)).

3. "beta.wait()" is when this tree is actually evaluated. If beta = [A, B, max, -] on this line, then "beta.wait" will stay on the "Blocked Queue" until A and B are available.

The mechanism described in #2 should "discover" the innate parallelism across alpha-beta trees. There's a lot of edge cases (what to do if the list fills up? How big should the list be for chess?). But I've mapped out a lot of examples by hand, and I'm confident that a system like this would work now. It appears that the square-root(nodes) of Minimax are immediately available to this parallelism strategy.

As I program this, I'll update the thread. But I wanted to at least generate some discussion. Maybe someone else has made a system similar to mine that I should be aware of? Its always good to read other research papers on this subject.

smatovic · Post by **smatovic** » Thu Jun 20, 2019 10:18 pm

dragontamer5788 wrote: ↑Thu Jun 20, 2019 9:50 pm Introduction

...
As I program this, I'll update the thread. But I wanted to at least generate some discussion. Maybe someone else has made a system similar to mine that I should be aware of? Its always good to read other research papers on this subject.

I tinkered a bit with gpu chess, maybe you find this linklist of papers useful:

https://zeta-chess.app26.de/post/papers ... ee-search/

Don't know about ROCm, but Intel, Nvidia and AMD should support at least OpenCL 1.x,
but maybe you need kernel level recursion, so 1.x is no option...

Good luck and have fun,
Srdja

dragontamer5788 · Post by **dragontamer5788** » Thu Jun 20, 2019 10:38 pm

smatovic wrote: ↑Thu Jun 20, 2019 10:18 pm
dragontamer5788 wrote: ↑Thu Jun 20, 2019 9:50 pm Introduction

...
As I program this, I'll update the thread. But I wanted to at least generate some discussion. Maybe someone else has made a system similar to mine that I should be aware of? Its always good to read other research papers on this subject.
I tinkered a bit with gpu chess, maybe you find this linklist of papers useful:

https://zeta-chess.app26.de/post/papers ... ee-search/

Thank you very much for these links! they look very helpful.

Don't know about ROCm, but Intel, Nvidia and AMD should support at least OpenCL 1.x,
but maybe you need kernel level recursion, so 1.x is no option...

Unfortunately, I'm going to need automatic reference-counting on those "future" objects. Which means I'll need C++ destructors / RAII pattern, maybe even "movement semantics" to do it all efficiently. Since I need proper C++ to build the abstractions listed in this thread, I need to use ROCm (AMD's C++ GPU compiler), or CUDA (NVidia's C++ GPU compiler). My home-machine is AMD-based, so ROCm is what I'm going with.

These futures, symbolic-execution, and other such concepts are quite advanced. I definitely could write them in C, but there are way too many C++isms that will be beneficial to this particular project.

mphuget · Post by **mphuget** » Fri Jun 21, 2019 8:55 am

Hello,

Even if this is different in the objectives, you could consider as well Anka's work on Perft:

https://github.com/ankan-ban/perft_gpu

mph

grahamj · Post by **grahamj** » Fri Jun 21, 2019 11:16 am

I am working on a GPU chess engine, but it is not like yours, and does not use alpha-beta search. I will be interested to see how you get on. I described my ideas here. http://indriid.com/2019/2019-01-06-tinsmith.pdf

dragontamer5788 wrote: ↑Thu Jun 20, 2019 9:50 pm
* Tomasulo's Algorithm -- https://en.wikipedia.org/wiki/Tomasulo_algorithm. Tomasulo's Algorithm is the CPUs use to discover which operations can be safely executed out-of-order. This algorithm "discovers" parallelism, even in the (seemingly sequential) CPU Assembly language. I hope to use Tomasulo's Algorithm to implement Futures<> efficiently. Virtually every modern CPU implements this algorithm (or a variant of it) in the CPU's microcode to provide out-of-order execution of code.

Hmmm. Are trying to turn a GPU back into a CPU? I recommend this video
https://www.youtube.com/watch?v=KHa-OSrZPGo

Code: Select all

 
             beta.wait(); // All stored "negate" and "max" operations will be evaluated here, and here only.
             score.wait(); // Ditto
    }

My immediate reaction is that those wait()s will be problematic. The only way to obtain full synchronization is to return to the CPU and start another kernel.

dragontamer5788 · Post by **dragontamer5788** » Fri Jun 21, 2019 1:46 pm

grahamj wrote: ↑Fri Jun 21, 2019 11:16 am
dragontamer5788 wrote: ↑Thu Jun 20, 2019 9:50 pm
* Tomasulo's Algorithm -- https://en.wikipedia.org/wiki/Tomasulo_algorithm. Tomasulo's Algorithm is the CPUs use to discover which operations can be safely executed out-of-order. This algorithm "discovers" parallelism, even in the (seemingly sequential) CPU Assembly language. I hope to use Tomasulo's Algorithm to implement Futures<> efficiently. Virtually every modern CPU implements this algorithm (or a variant of it) in the CPU's microcode to provide out-of-order execution of code.
Hmmm. Are trying to turn a GPU back into a CPU? I recommend this video
https://www.youtube.com/watch?v=KHa-OSrZPGo

No. Tomasulo's Algorithm is simply "inspiration" for this project.

Here's a picture of Intel Skylake's execution engine:

As you can see, every core of Intel Skylake (i7-6700k) has 8 ports. These 8 ports execute assembly language in parallel. With 180 "renamed registers", a modern Intel Skylake core will search your assembly code and execute as many instructions out-of-order as possible. As such, CPU cores are physically executing Alpha-Beta Pruning in parallel already.

However, instead of searching for parallelism at the assembly level, my hope is to search for parallelism at the node-level. Alpha-beta pruning has a large number of nodes that could be executed in parallel (should an "oracle" be properly programmed). I hypothesize that Tomasulo's algorithm is the proper "oracle", and that Tomasulo's algorithm will be able to dynamically discover the parallelism innate to Alpha-Beta pruning.

To a C++ programmer, Tomasulo's algorithm is hardware based futures for scheduling parallel hardware execution resources. Because GPUs have 16000+ hardware threads available, it is very, very difficult to figure out an algorithm that properly utilizes the entire machine. My hypothesis: use Tomasulo's Algorithm, the hardware scheduling algorithm that has had huge-success in the CPU design world, to schedule these 16,000+ hardware threads.

Code: Select all
 
             beta.wait(); // All stored "negate" and "max" operations will be evaluated here, and here only.
             score.wait(); // Ditto
    }
My immediate reaction is that those wait()s will be problematic. The only way to obtain full synchronization is to return to the CPU and start another kernel.

I plan to write a scheduler inside of GPU code. That "wait" is a cooperative task switch. The GPU-thread will simply place the current task onto the blocked-task list, and then pick a "ready task" off of the ready queue. If no ready tasks are available, the GPU-thread will idle (AMD EXEC flag = false, or NVidia predicate register = off).

EDIT: In practice, that code will look like the following:

Code: Select all

    FutureExpression f = spawnBlockedBetaCuttoffTask(current_board, move-generator, alpha, beta, score);

The scheduler will keep "BetaCuttoffTasks" in the blocked queue until "beta" and "score" are ready to be evaluated. The "Beta Cutoff Task" is approximately:

Code: Select all

// Beta and Score are no longer "futures" by now.
Future BetaCutoffTask(Board b, MoveGenerator m, FutureExpression alpha, int16_t beta, int16_score){
    if(score >= beta){
        return beta;
    }
    
    alpha = max(alpha, score); // Alpha isn't necessarily ready to be evaluated yet. Lazy evaluate "max(alpha, score" here.
    m.GenerateNextMove(); // We will only spawn one move before getting Beta-blocked again.
    FutureExpression = Spawn(-alphaBeta( -beta, -alpha, depthleft - 1 ));
    return spawnBlockedBetaCuttoffTask(current_board, move-generator, alpha, beta, score); 
    // Remember, "spawnBlockedBetaCutoffTask" is equivalent to beta.wait() / score.wait();
}

I know I haven't explained "FutureExpressions" or "FutureSymbols" yet. But I'm still working out these details anyway. Ultimately, these "futures" haven't been written yet. But all together, they'll constitute a GPU-cooperative task scheduler.

"Spawn" returns a Future, a value that promises to hold an int16_t (centipawns evaluation). In Tomasulo's algorithm, it is an entry in the Reservation Station, a "renamed register". Instead of doing nothing while the "future" is being calculated, I say code should march forward and keep executing instead.

Notice that even the highly-sequential BetaCutoffTask will spawn 2-tasks in the worst case scenario (one unblocked alpha-beta task-spawn, and one Beta-Blocked task). So even in this worst-case scenario, you're still extracting parallelism from the Alpha-beta search tree.

grahamj · Post by **grahamj** » Fri Jun 21, 2019 3:29 pm

Thanks for the further details. I won't pretend I understand them yet, but your project sounds interesting. You have at least understood that the main problem with using a GPU for chess is how to implement the search tree.

Daniel Shawul · Post by **Daniel Shawul** » Fri Jun 21, 2019 3:58 pm

Nothing says one should use a recursive form of alpha-beta, so one could use a rollouts version of alpha-beta that is easier to do on the GPU instead.
I already did this for my mcts experiments on the cpu. Adding null-move and lmr was kind of complicated but managed to get them working.

I had implemented standard mcts for chess and hex on the GPU some 6 years ago, so I am sure i can get the alpha-beta rollouts version implemnted as well but i have no time for it. The alpha-go way of demoting the mcts to the CPU and using gpu for neural network evaluation works pretty well anyway.

Daniel

dragontamer5788 · Post by **dragontamer5788** » Fri Jun 21, 2019 7:35 pm

mphuget wrote: ↑Fri Jun 21, 2019 8:55 am Hello,

Even if this is different in the objectives, you could consider as well Anka's work on Perft:

https://github.com/ankan-ban/perft_gpu

mph

Thank you very much for this link. Its a good "companion project" because its looking at a lot of the bitboard-level optimizations, while my project is focusing on a different part of the chess engine problem.

- peak speeds on overclocked GTX 780:

-- without transposition tables:
--- 27 Billion moves per second (pos2 of cpw)
--- 16 Billion moves per second (starting pos)
--- 1.5 hours for perft(10) of start position

-- with transposition tables:
--- start position perft(11) in 1 hour!

I don't have an NVidia GPU to verify those claims, but you can clearly see the amount of parallelism a GPU offers from those benchmarks. A GTX 780 (Kepler) is quite slow by today's standards. An NVidia 1060 (~$130 refub, ~$200 new) would be roughly on the same speed, while $300 cards like Vega56 or NVidia 1660 Ti would be much faster. In any case, that project is clearly written by a GPU expert (The written code is clearly high-performance). So there's a lot I can probably learn from it.

Daniel Shawul wrote: ↑Fri Jun 21, 2019 3:58 pm Nothing says one should use a recursive form of alpha-beta, so one could use a rollouts version of alpha-beta that is easier to do on the GPU instead.
I already did this for my mcts experiments on the cpu. Adding null-move and lmr was kind of complicated but managed to get them working.

I had implemented standard mcts for chess and hex on the GPU some 6 years ago, so I am sure i can get the alpha-beta rollouts version implemnted as well but i have no time for it. The alpha-go way of demoting the mcts to the CPU and using gpu for neural network evaluation works pretty well anyway.

Daniel

Thanks for the links. I'm going to need time to grok those papers.

So strangely enough: I see a path forward for GPU-efficient recursive Alpha-beta, but I don't actually see a path forward for efficient GPU-based MCTS. Virtual-loss makes sense in the CPU-world where each thread has its own instruction pointer, but the GPU-world of SIMD-compute just doesn't make "sense" of the virtual-loss style MCTS parallelism.

Case in point: L1 cache of AMD Vega64 is non-cohesive. This means that to "share" information on the AMD Vega64, I must flush the cache. Or more specifically, __threadfence() compiles into a full L1 cache-flush operation (!!!). So the "virtual loss" message that will be passed through memory is going to be quite slow and inefficient on a GPU. I did a quickie spinlock test on the GPU, and that gives roughly 1-million spinlock / unlocks per second.

So yeah, GPUs are quite slow at passing messages around. Far slower than CPU cache-coherent communications. Whatever search methodology is invented for a GPU will have to keep this in mind.

I've searched for some of the words in your post, and apparently this old topic exists: http://www.talkchess.com/forum3/viewtop ... =7&t=41853

I'll have to go through that topic and see what other parallel-methodologies were proposed and tried. I'll take a look at your code as well, and maybe I'll be inspired with alternative approaches.

smatovic · Post by **smatovic** » Fri Jun 21, 2019 9:14 pm

dragontamer5788 wrote: ↑Fri Jun 21, 2019 7:35 pm ...
I've searched for some of the words in your post, and apparently this old topic exists: http://www.talkchess.com/forum3/viewtop ... =7&t=41853

I'll have to go through that topic and see what other parallel-methodologies were proposed and tried. I'll take a look at your code as well, and maybe I'll be inspired with alternative approaches.

Maybe a word on what I have tried on gpu, during development I got 6 different approaches playing
chess on gpu, the latter two played decent chess...

1) 'SPPS' AlphaBeta parallelization, 128 threads work on up to 128 moves per node in parallel
2) MCTS without UCT across multiple Compute Units of the GPU
3) LIFO-Stack based load balancing for AlphaBeta search on one Compute Unit of the GPU
4) 'Nagging' and 'Spam' parallelization approach for AlphaBeta search on one Compute Unit of the GPU
5) Parallel BestFirstMiniMax Search, similar to MCTS but with AlphaBeta playouts at leaf nodes
6) Classic parallel AlphaBeta approach with ABDADA and RMO, randomized move order, for parallelization

5) uses the 'one thread one board' approach with thousands of threads and 6) the 'one SIMD one board'
approach with hundreds of workers where 64 threads are coupled to one work-group to work on the same
node in parallel during move generation, move sorting and position evaluation.

IIRC, 3), the LIFO-Stack, gave the best nps throughput, but I could not figure out how to implement
AlphaBeta pruning efficient.

Zeta milestones:
https://zeta-chess.app26.de/post/zeta-milestones/

Zeta v099k parallel scaling results:
https://github.com/smatovic/Zeta/blob/m ... esults.txt

--
Srdja

Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search

Re: Lazy-evaluation of futures for parallel work-efficient Alpha-Beta search