GPUs better for chess than CPUs?

rbarreira · Post by **rbarreira** » Tue Jun 07, 2011 11:56 pm

To anyone who is interested in working with this, I recommend making a GPU application which can calculate the perft of a chess position. No eval or alpha-beta needed.

If you can't do that with good performance, it's probably time to forget making a chess engine for a GPU.

Daniel Shawul · Post by **Daniel Shawul** » Wed Jun 08, 2011 2:06 am

I will definately try and take my dose of failure

. But not with alpha-beta though. I think that it is difficult to finish the job on gpu with one kernel call and return a move. Hash tables, pawn tt, EGTBs etc can't be used unless you consult CPU... Also all non depth-first searches consume a lot of memory to be feasible on GPUs. Only feasible option I see is plain MC at the root only. Or to improve upon that I could build a shallow depth tree (say d=4) at startup and start simulations from the leaves... This is more or less what is done in APHID algorithm used for cluster computing, and it can give it a good tactical awareness...
I will try it on a simple game like Hex or Havannah (or some other game that I create for ease of implementation). Move generation would be easier for 'place a stone' games such as Go. But Go has captures so Hex is better. Also i am not sure if a good PRNG is available for GPUs yet. But anyway this is too much talk before typing one line of code.

Daniel Shawul · Post by **Daniel Shawul** » Wed Jun 08, 2011 2:19 am

Hello Srdja
Sad to hear you stopped working on it. What was your major difficulty ? Can you elaborate on your experience a bit ? I read that you have used it for move generation but I do not quite get what you have tried regarding search..

I don't think recursion is a problem because alpha-beta can be done iteratively. I do not think YBW or similar schemes are feasible since you have to sync() somewhere, even if we assume there was a "sub threading" mechanism in gpus.

regards,

Dann Corbit · Post by **Dann Corbit** » Wed Jun 08, 2011 3:35 am

Daniel Shawul wrote:I will definately try and take my dose of failure . But not with alpha-beta though. I think that it is difficult to finish the job on gpu with one kernel call and return a move. Hash tables, pawn tt, EGTBs etc can't be used unless you consult CPU... Also all non depth-first searches consume a lot of memory to be feasible on GPUs. Only feasible option I see is plain MC at the root only. Or to improve upon that I could build a shallow depth tree (say d=4) at startup and start simulations from the leaves... This is more or less what is done in APHID algorithm used for cluster computing, and it can give it a good tactical awareness...
I will try it on a simple game like Hex or Havannah (or some other game that I create for ease of implementation). Move generation would be easier for 'place a stone' games such as Go. But Go has captures so Hex is better. Also i am not sure if a good PRNG is available for GPUs yet. But anyway this is too much talk before typing one line of code.

Since the GPUs have memory, perhaps each GPU can have its own small local hash table. That should be fine if they only work on their own subset of the problem.

smatovic · Post by **smatovic** » Wed Jun 08, 2011 5:20 am

Hello Srdja
Sad to hear you stopped working on it. What was your major difficulty ? Can you elaborate on your experience a bit ? I read that you have used it for move generation but I do not quite get what you have tried regarding search..

I don't think recursion is a problem because alpha-beta can be done iteratively. I do not think YBW or similar schemes are feasible since you have to sync() somewhere, even if we assume there was a "sub threading" mechanism in gpus.

1) Memory
GPUs have a lot of RAM but only a few registers for use. The more threads you run the less registers you have, but you have to run a lot of threads to hide GPU internal latencies.
I calculate with 32 to 512 bytes by thread. The moves and the state of the non-recursive alpha-beta has to be stored in slow global memory(RAM).

2) 32 bit / 64 bit
My last idea was to use a Quad-Bitboard presentation (32 Bytes) and a modified Kogge-Stone Move Generator for all Pieces.
Means less memory and a SIMD friendly Movegenerator, but i will loose some computation power of the GPU which is 32 bit native.

3) Search algorithm
Thought on YBW, but have no idea how to implement the communication between Threads.
Maybe somekind of a clever stack-meachnism could do the job instead of a complex-work-delivery process.

--
Srdja

Daniel Shawul · Post by **Daniel Shawul** » Wed Jun 08, 2011 1:47 pm

Yes possible even for medium size egbbs I think. But I was under the impression that global memory latency is very high on the gpu especially with strided access and no caches. This approach also exculdes the CPU from getting involved in the search. I think that even the regular move list and score needs to be stored on RAM so a contigious block of moves should be fetched at once for execution by a warp. My plan is for a game with simplistic board representation that can avoid this by storing the board in shared mem, and moves can be generated as needed on the fly(no storage). Then each thread executes a random move on its board (say 32 times) and evaluate the result (which can introduce some branching).

Daniel Shawul · Post by **Daniel Shawul** » Wed Jun 08, 2011 1:56 pm

Thought on YBW, but have no idea how to implement the communication between Threads.
Maybe somekind of a clever stack-meachnism could do the job instead of a complex-work-delivery process.

I don't think that is possible. You can only synchronize threads in the same block. Here

Classical parallel programming techniques can't be used here. Threads can't spawn more threads; threads on one multiprocessor can't send results to threads on another multiprocessor; there's no facility for a critical section among all the threads across the whole system. Trying to use a Pthreads or OpenMP programming model will lead to pain, frustration, and failure.

smatovic · Post by **smatovic** » Wed Jun 08, 2011 2:42 pm

to pain, frustration, and failure.

As an alternative to YBW i thought on using a depth*128x128 board array.

If you run 128 threads and assume 128 max possible moves you get an array of

boards[max_searchdepth][128 Threads][128 possible boards]

With such an array you could move up and down in blocks of 128 threads through the search tree, it should behave like a non-recursive alphabeta, but with 128 boards in parallel.

When you got 128 Threads running you could extent the search with x * 128 Threads.

--
Srdja

bob · Post by **bob** » Wed Jun 08, 2011 7:03 pm

parrish wrote:
hgm wrote:Because Chess algorithms are by nature serial rather than parallel.
A search tree is, by it's nature, parallel, and not serial.

Alpha/beta is, _by definition_, serial. You have to establish the bounds by searching the first move at a node _before_ you can search the others using that bound to reduce required effort...

sluijten · Post by **sluijten** » Sat Jun 11, 2011 5:53 am

bob wrote:
parrish wrote:
hgm wrote:Because Chess algorithms are by nature serial rather than parallel.
A search tree is, by it's nature, parallel, and not serial.
Alpha/beta is, _by definition_, serial. You have to establish the bounds by searching the first move at a node _before_ you can search the others using that bound to reduce required effort...

Well I think it's a mix, it's not 100% serial and it's not 100% parallel.
On a high level it is serial, but significant parts of the tree can be searched in parallel (depending on the node type), without doing more work than a serial search.

GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?

Re: GPUs better for chess than CPUs?