GPU chess engine

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

SamuelS
Posts: 5
Joined: Thu Feb 21, 2013 9:54 am
Location: Espoo, Finland

GPU chess engine

Post by SamuelS »

Hi,

I am new to this forum, but the topic has been discussed here before. I have been reading previous posts about GPUs and chess. There have been a few attempts to implement a fully functional chess engine on the GPU (e.g. by Srdja Matovic). The limitations of GPUs have often been mentioned. Different search algorithms, move generation algorithms, and board representations have been discussed. Obviously, there is still a lot of work ahead before we can see a GPU chess engine compete with top CPU chess engines.

Although GPUs have much more computing power than CPUs on desktop computers (e.g. the NVIDIA GeForce GTX 690 can do 5.5 Tflops), there are major limitations which make it difficult to implement a chess engine on them:

1) Thread divergence issues. Groups of threads (warps, typically 32 threads) must follow the same execution path to be efficient. This is tough, because in alpha-beta-like searches it would be nice to have the threads search different branches of the search tree.

2) Limited memory. Although modern GPUs can have gigabytes of memory, it is mostly slow global memory; it takes hundreds of clock cycles to access. What we want to use in the core of the engine is the on-chip shared memory, which incurs only a small fraction of that latency. However, there are only tens of kilobytes of it per block of threads. It is difficult to fit even the current position and a move list for each thread (a rough sketch of the budget follows below).

3) Communication issues. Threads in different blocks cannot communicate with each other, except via global memory. Communication inside a block of threads is limited to shared memory and simple barrier-style synchronization.

I think these are the most important limitations. What do you think?
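
To make the memory point concrete, here is a rough CUDA sketch of the shared-memory budget. The Board and Move layouts and all the constants are just illustrative assumptions, not taken from any existing engine.

Code:

#include <cstdint>

struct Board { uint64_t bb[8]; };      // 64 bytes per position (assumed layout)
struct Move  { uint16_t fromTo; };     // 2 bytes per move (assumed encoding)

constexpr int WARPS_PER_BLOCK = 4;
constexpr int MAX_PLY         = 32;
constexpr int MAX_MOVES       = 64;

__global__ void searchKernel()
{
    // One search stack per warp, not per thread:
    //   boards: 4 * 32 * 64 B     =  8 KB
    //   moves:  4 * 32 * 64 * 2 B = 16 KB
    // That is already half of a typical 48 KB shared-memory budget,
    // before any hash entries, history tables, etc.
    __shared__ Board boards[WARPS_PER_BLOCK][MAX_PLY];
    __shared__ Move  moves [WARPS_PER_BLOCK][MAX_PLY][MAX_MOVES];

    (void)boards; (void)moves;         // the search itself is omitted here
}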

I have some ideas about how to make progress, but I do not have solutions to all of the problems yet. First, I would not abandon the Principal Variation Search / Negascout approach. It is difficult to find an algorithm that performs better. Second, the parallel search should use blocks of threads, not individual threads, as the independent "processors". This solves most of the divergence and memory issues, although it means that the number of processors is only a little larger than on multicore CPUs. However, there is still SIMD-style parallelism available inside the processors.

Then there is the question of which parallel search algorithm to use. I think the Young Brothers Wait Concept could be used. More complicated algorithms require too much communication between the processors. Do you have ideas about which parallel search algorithms could work on the GPU?
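
For the "blocks as processors" part, the simplest organization I can picture is a split at the root; Young Brothers Wait would then add the rule that the first move of a node is searched alone before its siblings are handed out. A rough sketch, where Board and blockAlphaBeta are only placeholders for a real position type and a real block-wide search:

Code:

struct Board { int dummy; };                         // placeholder position type

// Placeholder standing in for a real block-parallel alpha-beta search.
__device__ int blockAlphaBeta(const Board& b, int alpha, int beta, int depth)
{
    return 0;
}

__global__ void rootSplitKernel(const Board* rootChildren, int depth, int* scores)
{
    // Block i = "processor" i: it searches root move i independently.
    int score = -blockAlphaBeta(rootChildren[blockIdx.x], -30000, 30000, depth - 1);
    if (threadIdx.x == 0)
        scores[blockIdx.x] = score;                  // the host later picks the best move
}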
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: GPU chess engine

Post by Daniel Shawul »

It is difficult to efficiently implement an inherently sequential algorithm such as alpha-beta on the GPU. Unless all search threads follow the same line of execution, you can't use anything less than a Warp as the smallest search unit. With a different algorithm that combines tree growth on top of Monte Carlo simulations, you can have every thread working on a separate search. This worked pretty well for Hex, and I got about a 60x speedup on a very old GPU. This game is very convenient because it avoids all the problems you mentioned above. MC simulations require very little memory, so the board can be stored in on-chip memory. There is almost no thread divergence if you let the simulations go to the very end without testing for game termination in the middle, and there is no communication or synchronization except when updating the tree, which is stored in global memory. Unfortunately I was not able to get the same speedup for chess using a similar method. Memory constraints (the fact that you can't even store the move list) and divergence problems creep in. But it served its purpose to demonstrate that selective tree search is possible on GPUs, and that it doesn't have to be alpha-beta.
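
(Not the actual GpuHex code, just a structural sketch of the layout described above: each thread runs its own playouts with its RNG state in registers, and only the aggregated win counts of the sampled tree node touch global memory. The playout itself is a placeholder stub standing in for real game logic.)

Code:

#include <curand_kernel.h>

struct LeafStats { unsigned int wins; unsigned int visits; };

// Placeholder for the real game logic: a real playout would make random
// moves on an on-chip board until the game ends; a coin flip stands in here.
__device__ int randomPlayout(curandState* rng)
{
    return curand(rng) & 1;
}

__global__ void playoutKernel(LeafStats* leaf, int playoutsPerThread,
                              unsigned long long seed)
{
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    curandState rng;
    curand_init(seed, tid, 0, &rng);

    int wins = 0;
    for (int i = 0; i < playoutsPerThread; ++i)
        wins += randomPlayout(&rng);        // every thread runs its own games

    // One global-memory update per thread, not per playout: the tree node
    // only ever sees the aggregated statistics.
    atomicAdd(&leaf->wins,   (unsigned int)wins);
    atomicAdd(&leaf->visits, (unsigned int)playoutsPerThread);
}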
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: GPU chess engine

Post by Gerd Isenberg »

SamuelS wrote:Then there is the question of which parallel search algorithm to use. I think the Young Brothers Wait Concept could be used. More complicated algorithms require too much communication between the processors. Do you have ideas about which parallel search algorithms could work on the GPU?
I have no idea, but as recently posted by Erdogan Günes in the German CSS forum, it seems the guys introducing the NIKOLA Supercomputer Chess Engine might have an idea ...
The goal is 16000 ELO and 32 men-tablebase completed by the end of the year.
Alphabeta pruning as an improvement over minimax. We might be able to find other, unknown, pruning techniques...
http://nikolachess.com/

;-)

already discussed here as well ...

http://www.talkchess.com/forum/viewtopic.php?t=46424
http://www.talkchess.com/forum/viewtopic.php?t=47192
SamuelS
Posts: 5
Joined: Thu Feb 21, 2013 9:54 am
Location: Espoo, Finland

Re: GPU chess engine

Post by SamuelS »

Daniel Shawul wrote:Unless all search threads follow the same line of execution, you can't use anything less than a Warp as the smallest search unit.
This is exactly what I have thought. I will try to use warps as the smallest search unit. Inside a warp, the threads can do different things in parallel, e.g. move ordering or evaluation. In particular, if the evaluation can be made fast by exploiting parallelism, the threads can perform useful tasks most of the time even if they follow the same search path.
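For example, something like this is what I have in mind for a warp-parallel evaluation. It is only a material-count sketch, and the mailbox board encoding (signed piece values, + for White, - for Black) is an assumption:

Code:

// 32 lanes each score two squares of a 64-square mailbox board, then the
// warp reduces the partial sums with shuffles so every lane gets the total.
__device__ int warpEvaluate(const signed char board[64])
{
    int lane    = threadIdx.x & 31;
    int partial = board[lane] + board[lane + 32];     // two squares per lane

    // Warp-wide sum: after five shuffle steps lane 0 holds the full score.
    for (int offset = 16; offset > 0; offset >>= 1)
        partial += __shfl_down_sync(0xffffffffu, partial, offset);

    return __shfl_sync(0xffffffffu, partial, 0);      // broadcast lane 0's sum
}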
Daniel Shawul wrote:With a different algorithm that combines tree growth on top of Monte Carlo simulations, you can have every thread working on a separate search. This worked pretty well for Hex, and I got about a 60x speedup on a very old GPU. ... Unfortunately I was not able to get the same speedup for chess using a similar method. Memory constraints (the fact that you can't even store the move list) and divergence problems creep in. But it served its purpose to demonstrate that selective tree search is possible on GPUs, and that it doesn't have to be alpha-beta.
I have read about the Monte Carlo technique working well in other games, but as you said, it does not work that well in chess. I will try alpha-beta, unless someone can find a better algorithm.

I have implemented some algorithms on the GPU in the past. One was a search algorithm that traversed a search tree (not minimax, but still a tree), and it could be implemented without branching by using table look-ups to choose the next node in the tree. I considered a similar approach with alpha-beta, allowing thread-level parallelism, but the memory constraints prevented that. Then, because I apparently have to use warps as the smallest search units, I might as well use branching, because divergence is not an issue there and it is easier to code.
smatovic
Posts: 2662
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU chess engine

Post by smatovic »

Heyho Samuel,

my current approach uses a best-first / Monte Carlo alpha-beta algorithm,
and it is not able to outperform an alpha-beta searcher with the same evaluation function.

If you wish to split alpha-beta across warps, maybe the following idea could be a solution:

One warp takes x moves of the same position in parallel: each thread takes one child move (or board) of the same position, computes its children and stores them in an array; then the whole warp goes one ply deeper and takes the first board and its bunch of moves in parallel, until the maximum ply is reached; then the whole warp goes back one ply and takes the second board and its bunch of moves... and so on.
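
Roughly, the traversal could look like the following structural sketch. Board, the move helpers and all constants are placeholder stubs, and the evaluation / alpha-beta bookkeeping at the leaf ply is left out:

Code:

constexpr int WARPS_PER_BLOCK = 2;
constexpr int MAX_PLY         = 8;
constexpr int MAX_MOVES       = 32;        // capped at the warp size in this sketch

struct Board { unsigned long long bb[4]; };          // placeholder position, 32 bytes

// Placeholder stubs standing in for real move generation.
__device__ int   countMoves(const Board&)          { return 8; }
__device__ Board childBoard(const Board& b, int)   { return b; }

// One warp owns one search: the 32 lanes expand the children of the current
// node in parallel, then the whole warp walks the tree together, depth first.
__device__ void warpSearch(const Board& root, Board (*stack)[MAX_MOVES])
{
    int   lane = threadIdx.x & 31;
    int   idx[MAX_PLY];            // which child is being explored at each ply
    int   cnt[MAX_PLY];            // how many children each ply has
    Board current = root;
    int   ply = 0;

    for (;;) {
        // Parallel expansion: lane i builds child i of the current position.
        int n = min(countMoves(current), MAX_MOVES);
        if (lane < n) stack[ply][lane] = childBoard(current, lane);
        cnt[ply] = n;              // identical in every lane, kept in registers
        idx[ply] = 0;
        __syncwarp();

        if (ply + 1 < MAX_PLY && n > 0) {
            current = stack[ply][0];       // go one ply deeper on the first child
            ++ply;
            continue;
        }

        // Leaf ply: the lanes could now evaluate all n siblings in parallel
        // (evaluation and alpha-beta bookkeeping are omitted in this sketch).
        --ply;

        // Go back up to the next unexplored sibling, popping finished plies.
        while (ply >= 0 && ++idx[ply] >= cnt[ply]) --ply;
        if (ply < 0) return;
        current = stack[ply][idx[ply]];    // take the next board and its moves
        ++ply;
    }
}

__global__ void searchKernel(Board root)   // launch with 64 threads = 2 warps per block
{
    __shared__ Board stacks[WARPS_PER_BLOCK][MAX_PLY][MAX_MOVES];   // 16 KB here
    warpSearch(root, stacks[threadIdx.x / 32]);
}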

I implemented this idea on a GT 8800 and it performed very well with one multiprocessor running, but I failed to implement a quiescence search with stand pat correctly, so I am not sure how this idea will behave in qsearch, where fewer child moves of the same position are present...


Anyway, good luck with your engine.

--
Srdja
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: GPU chess engine

Post by Don »

smatovic wrote:Heyho Samuel,

my current approach uses a best-first / Monte Carlo alpha-beta algorithm,
and it is not able to outperform an alpha-beta searcher with the same evaluation function.
Monte Carlo tree search does not use the "same evaluation function" that an alpha beta searcher uses. So you are already doing something wrong!

MCTS (Monte Carlo Tree Search) scores a position statistically. The move that wins the most games is the best move. Imagine playing a bunch of random games from some position very quickly and keeping a tally of which move has the best score. Games are played to the END of the game; that is the MC in Monte Carlo Tree Search, and one random game is called a "playout." There is NO evaluation function; the combined score of many playouts IS the evaluation function.

The "Tree Search" part is based on using the statistics from the random games (or playouts as they are called) to direct which branch to explore, so it's a hybrid of tree search and Monte Carlo playouts. It's fairly complicated but you use some forumula to balance exploration with exploitation. You want to try bad move too, sometimes, but mostly the better moves. It requires keeping the tree in memory and building up statistics for each node. One playout consists of using that tree structure for guidance and expanding the statistics by 1 more game (and the tree by one more move.) and doing as many of these as possible to choose a move. If done well, the win statistics will tend to act like mini-max even when not explicitly doing mini-max.

With chess, playing random games is too crude. Actually, that is true even in Go, and the best Go programs superimpose some knowledge on the "random" playouts so that they are not quite so random. But there must be a reasonable degree of randomness or it's not really Monte Carlo!

There are, of course, tricks to speed up the process. You probably don't have to play the game to the bitter end and can have "adjudication" rules for when it's quite clear that one side is winning. You can use statistics about the move choices to good effect - the history heuristic in chess would be a very good table for providing some "intelligence" to the playouts.

A good goal for chess is to create a player that plays good chess even if it's several hundred ELO below the top programs. The reason that would be good is that MCTS would be a totally different kind of chess and would be interesting even if it was not nearly as strong as the best programs. In some of the positions that no chess program seems to understand, these programs would think much more like humans, and they would appear to have a lot of positional understanding, even if tactically weak. For example:

A position was posted once where all the pawns were locked up but White was up by a queen and a rook or more. The key move was easy to understand: the queen had to give herself up to enable the pieces to break open the position and win. Even a very weak player could see that. You can give up the queen and still have a winning position! An MCTS chess program would very quickly discover this theme because it couldn't care less whether it loses the queen or not. It's going to do whatever is needed in order to win.

A strange side effect of MCTS is that a program doesn't care HOW it wins, so you will see some really odd-looking behavior. For example, you might be up a huge amount of material and let the opponent have most of it back if that has no impact on your winning chances.



If you wish to split alpha-beta across warps, maybe the following idea could be a solution:

One warp takes x moves of the same position in parallel: each thread takes one child move (or board) of the same position, computes its children and stores them in an array; then the whole warp goes one ply deeper and takes the first board and its bunch of moves in parallel, until the maximum ply is reached; then the whole warp goes back one ply and takes the second board and its bunch of moves... and so on.

I implemented this idea on a GT 8800 and it performed very well with one multiprocessor running, but I failed to implement a quiescence search with stand pat correctly, so I am not sure how this idea will behave in qsearch, where fewer child moves of the same position are present...


Anyway, good luck with your engine.

--
Srdja
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
smatovic
Posts: 2662
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: GPU chess engine

Post by smatovic »

Monte Carlo tree search does not use the "same evaluation function" that an alpha beta searcher uses. So you are already doing something wrong!
Monte-Carlo alpha-beta (MCαβ) is a best-first search algorithm based on Monte-Carlo Tree Search and a shallow alpha-beta depth-first search. It is used in Lines of Action and, in conjunction with UCT (Upper Confidence bounds applied to Trees), also in chess. It is similar to the Best-First Minimax Search proposed by Korf and Chickering [1].


http://chessprogramming.wikispaces.com/MC%CE%B1%CE%B2

--
Srdja
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: GPU chess engine

Post by Don »

smatovic wrote:
Monte Carlo tree search does not use the "same evaluation function" that an alpha beta searcher uses. So you are already doing something wrong!
Monte-Carlo alpha-beta (MCαβ) is a best-first search algorithm based on Monte-Carlo Tree Search and a shallow alpha-beta depth-first search. It is used in Lines of Action and, in conjunction with UCT (Upper Confidence bounds applied to Trees), also in chess. It is similar to the Best-First Minimax Search proposed by Korf and Chickering [1].


http://chessprogramming.wikispaces.com/MC%CE%B1%CE%B2

--
Srdja
Ok, so it's a different thing altogether.

I actually believe MCTS is feasible in chess and is more promising because it tries to do something different. MCAB is limited by the same broken evaluation function chess programs use.

When I say "feasible" I'm not claiming it will ever be as strong as what we are doing now - but I think it might be possible to create a very strong player with a really interesting style of play. It might be possible to combine this (with MP programs) with the standard approach. After so many processors chess becomes difficult to scale and and having such a program to supplement a standard program could be a big win if anyone ever figures out how to combine the opinions of different search agents.
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: GPU chess engine

Post by Daniel Shawul »

SamuelS wrote:
Daniel Shawul wrote:Unless all search threads follow the same line of execution, you can't use anything less than a Warp as the smallest search unit.
This is exactly what I have thought. I will try to use warps as the smallest search unit. Inside a warp, the threads can do different things in parallel, e.g. move ordering or evaluation. In particular, if the evaluation can be made fast by exploiting parallelism, the threads can perform useful tasks most of the time even if they follow the same search path.
I have reservations about using SIMD that way, because chess evaluation has lots of branches, except for some bitboard operations that already benefit from SSE. There are not many operations to do on all pieces or all squares, etc., at the same time. A good evaluation is usually full of branches.
Daniel Shawul wrote:With a different algorithm that combines tree growth on top of Monte Carlo simulations, you can have every thread working on a separate search. This worked pretty well for Hex, and I got about a 60x speedup on a very old GPU. ... Unfortunately I was not able to get the same speedup for chess using a similar method. Memory constraints (the fact that you can't even store the move list) and divergence problems creep in. But it served its purpose to demonstrate that selective tree search is possible on GPUs, and that it doesn't have to be alpha-beta.
I have read about the Monte Carlo technique working well in other games, but as you said, it does not work that well in chess. I will try alpha-beta, unless someone can find a better algorithm.
Yes, you should try that. After my failed experiment with a chess UCT searcher (which, by the way, should not be regarded as MC search), I get frustrated thinking about implementing alpha-beta. It is a pain implementing it even on a cluster. Don't listen to me though, because my negativity comes from bad experience with an unrelated device.
I have implemented some algorithms on the GPU in the past. One was a search algorithm that traversed a search tree (not minimax, but still a tree), and it could be implemented without branching by using table look-ups to choose the next node in the tree. I considered a similar approach with alpha-beta, allowing thread-level parallelism, but the memory constraints prevented that. Then, because I apparently have to use warps as the smallest search units, I might as well use branching, because divergence is not an issue there and it is easier to code.
At first I used as many warps as the SM can handle. That is OK because I do MC simulations at the tips. But later I rewrote it so that a warp can do MC simulations by itself AND update the tree in a safe way. That really helped to grow the tree pretty quickly; I got deeper trees (+7 average depth) with that. But chess evaluation with MC is very inaccurate, so the infeasibility of chess for MC does not really come from device constraints; my UCT chess on the CPU also plays a really bad game. The speedup comes down from 60x to only 2x for chess due to memory constraints. I replaced the move generation in such a way that any move can be generated when needed; I can't generate them all and store them. If you are interested, you can look at my GpuHex on GitHub, and there is also the failed experiment with chess in a different branch. https://github.com/dshawul/GpuHex
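
One possible shape for such on-demand generation (illustrative only, not the actual GpuHex code; the types and per-piece helpers are placeholder stubs):

Code:

struct Board { int dummy; };                 // placeholder position type
struct Move  { int from, to; };

// Placeholder stubs standing in for real per-piece move generation.
__device__ int  pieceMoveCount(const Board&, int piece)   { return 4; }
__device__ Move pieceMove(const Board&, int piece, int n) { return Move{piece, n}; }

// Return move number k of a position without ever materializing a move list:
// O(pieces) time per call, O(1) memory per thread.
__device__ Move kthMove(const Board& b, int k)
{
    for (int piece = 0; piece < 16; ++piece) {
        int n = pieceMoveCount(b, piece);
        if (k < n) return pieceMove(b, piece, k);
        k -= n;                              // skip past this piece's moves
    }
    return Move{-1, -1};                     // k is past the last move
}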

Good luck