Zeta, a chess engine in OpenCL

smatovic · Post by **smatovic** » Wed Mar 17, 2010 8:14 pm

Hello,

i just joined this board and want to introduce "Zeta", a project to implement a chess engine in OpenCL.

OpenCL is a C like programming language suited for SIMD-devices like GPUs. It has extensions to run concurrently threads.

The past months i managed to get into the bitboard technics and wrote a small CPU engine in C. Now i am going to design a Negamax search algorithm in OpenCL. Becouse of hardware and OpenCL limitations this is not that easy and i am not sure if this is possible in a performant matter at all.

If somebody has already experience with OpenCL or SIMD optimized search algorithms i would appreciate to share information.

You can find more about the project on my blog:
http://zeta-chess.blogspot.com/

Friendly Regards,
Srdja Matovic

ps: sorry for my english, i am more trained in reading than writing.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Mar 17, 2010 9:28 pm

smatovic wrote:Hello,

i just joined this board and want to introduce "Zeta", a project to implement a chess engine in OpenCL.

OpenCL is a C like programming language suited for SIMD-devices like GPUs. It has extensions to run concurrently threads.

The past months i managed to get into the bitboard technics and wrote a small CPU engine in C. Now i am going to design a Negamax search algorithm in OpenCL. Becouse of hardware and OpenCL limitations this is not that easy and i am not sure if this is possible in a performant matter at all.

If somebody has already experience with OpenCL or SIMD optimized search algorithms i would appreciate to share information.

You can find more about the project on my blog:
http://zeta-chess.blogspot.com/

Friendly Regards,
Srdja Matovic

ps: sorry for my english, i am more trained in reading than writing.

Interesting! I have no real idea on GPU programming and how to implement alpha-beta on those devices. I can imagine for such SIMD devices pure register based direction-wise, multi-piece, fill-based attack generation (Kogge-Stone) is quite competitive compared to magic lookups.

Your English looks quite perfect to me

Welcome & Cheers,
Gerd

smatovic · Post by **smatovic** » Wed Mar 17, 2010 10:10 pm

Thanks for the "Kogge-Stone" hint. I ll read the wiki article carefully.

For move-generation i thought on x*6*64 threads (x boards, 6 piecetypes and 64 destination squares), with magic bitboard technic. x*6*64*64 (x boards, 6 piecetypes, 64 possible positions and 64 destination squares) would be nicer but current GPUs support only about 512 threads by computing unit (with 10-20 computing units by device).

By the way: i found an paper from Holger Hopp and Peter Sanders, describing an alpha-beta algorithm for SIMD machines. Maybe this is what i was looking for....

vladstamate · Post by **vladstamate** » Thu Mar 18, 2010 12:22 am

Here is what I was thinking: maybe you can take advantage of the many SIMD units to do special jobs, like move generation and evaluation. Think of them kind of like coprocessors.

On CPU/PPU/Main processor

Code: Select all

search()
{
    put board in gen_move processing queue
    put board in eval processing queue

    do nullmove, futility pruning, TT probing here

    wait until move_generation done
    for all move
        search()
}

quiesce()
{
    put board in gen_move processing queue
    put board in eval processing queue - for stand pat

    do delta pruning, TT probing here

    wait until move_generation done
    for all move
        search()
}

On SIMD units: GPU/SPU/etc
Thread type 1: move gen

Code: Select all

poll for any board
if found
    generate move and put them in a queue
    update board status that move generated
    store pointer to where the gen moves are put
goto poll

Thread type 2:evaluation

Code: Select all

poll for any board
if found
    evaluate, and mark board as evaluated
goto poll

Have as many of Thread type 1 and 2 as you have "threads" on the CPU. Individually, those threads 1 and 2 could be broken pretty much like you said so you get a lot of HW threads so that you hide latency.

Regards,
Vlad.

smatovic · Post by **smatovic** » Thu Mar 18, 2010 12:53 am

Thanks for the suggestions Vlad.

Unfortunately the OpenCL-Device has to be initialized by the host (CPU) for every calculation. My first tests with an array-based (12x10) board presentation showed that using a GPU as a single move-generator is possible but too slow for real practice. So i will try to package the whole search algorithm, including move generation and evaluation on the GPU.

The earlier mentioned paper describes some kind of master/slave Young Brothers Can Wait concept for SIMD. I think i will go deeper into this approach.

Regards,
Srdja

jwes · Post by **jwes** » Thu Mar 18, 2010 1:20 am

I was thinking of something similar but the problem is the latency between CPU and GPU is 10s of microseconds.

vladstamate · Post by **vladstamate** » Thu Mar 18, 2010 4:20 am

Well, in my solution you do not have to launch a threadgroup for every calculation. Rather you launch them when it is your turn to move. Then have the threads spin on a primitive, say a mutes, label, event, whatever you can. They will pick up jobs as needed.

Even if the latency is 10 microseconds, if you could "launch" a lot of "jobs" every time, you could make up for it. So maybe don't do the communication every time you need some move generation done, but once you have a larger number of boards for which you need to generate moves.

Also, be very mindful of conditionals. Every conditional will effectively (at the best) half you speed. Since if one thread takes a branch, all the threads in that threadgroup will do.

In the following scenario:

Code: Select all

if(...)
    a;
else
   b;

both a and b will execute.

Regards,
Vlad.

smatovic · Post by **smatovic** » Fri May 21, 2010 9:27 pm

They will pick up jobs as needed.

Memory between GPU and CPU is buffered, so it takes some time to transmitt these jobs...

Even if the latency is 10 microseconds, if you could "launch" a lot of "jobs" every time, you could make up for it. So maybe don't do the communication every time you need some move generation done, but once you have a larger number of boards for which you need to generate moves.

Yes, but the next step would be to generate these jobs/boards directly on the GPU.

Code: Select all

CPU -> actual Board -> GPU 
                                  |
                               do search on GPU
                                  |
CPU <- best move     <-

Even the Root-Search has to be performed on the GPU, becouse of the latency it takes too much time to transmit in serial 20-40 boards to the GPU.

Also, be very mindful of conditionals. Every conditional will effectively (at the best) half you speed. Since if one thread takes a branch, all the threads in that threadgroup will do.

Yes, this is true for convential SIMD Processing Elements, as far as i know GPUs and OpenCL although support SPMD, Single Programm Multible Data. So every Processing Element has its own instruction counter.

...i took a look in some papers about parallel tree-search algorithms, YBWC and DTS seems to have the best speedups, but PV-Split is more easy to implement....so i will start first with PV-Split, so one Processing Element of the GPU works on one subtree...

Regards,
Srdja

Dann Corbit · Post by **Dann Corbit** » Fri May 21, 2010 10:17 pm

smatovic wrote:
They will pick up jobs as needed.
Memory between GPU and CPU is buffered, so it takes some time to transmitt these jobs...

Even if the latency is 10 microseconds, if you could "launch" a lot of "jobs" every time, you could make up for it. So maybe don't do the communication every time you need some move generation done, but once you have a larger number of boards for which you need to generate moves.
Yes, but the next step would be to generate these jobs/boards directly on the GPU.
Code: Select all
CPU -> actual Board -> GPU 
                                  |
                               do search on GPU
                                  |
CPU <- best move     <-
Even the Root-Search has to be performed on the GPU, becouse of the latency it takes too much time to transmit in serial 20-40 boards to the GPU.

Also, be very mindful of conditionals. Every conditional will effectively (at the best) half you speed. Since if one thread takes a branch, all the threads in that threadgroup will do.
Yes, this is true for convential SIMD Processing Elements, as far as i know GPUs and OpenCL although support SPMD, Single Programm Multible Data. So every Processing Element has its own instruction counter.

...i took a look in some papers about parallel tree-search algorithms, YBWC and DTS seems to have the best speedups, but PV-Split is more easy to implement....so i will start first with PV-Split, so one Processing Element of the GPU works on one subtree...

Regards,
Srdja

This is a really fascinating project. I have heard a lot of people say that it can't be done or it's a waste of time to try it. It seems to me that some creative thinking might make a very powerful chess engine using this approach.

bob · Post by **bob** » Fri May 21, 2010 11:10 pm

Dann Corbit wrote:
smatovic wrote:
They will pick up jobs as needed.
Memory between GPU and CPU is buffered, so it takes some time to transmitt these jobs...

Even if the latency is 10 microseconds, if you could "launch" a lot of "jobs" every time, you could make up for it. So maybe don't do the communication every time you need some move generation done, but once you have a larger number of boards for which you need to generate moves.
Yes, but the next step would be to generate these jobs/boards directly on the GPU.
Code: Select all
CPU -> actual Board -> GPU 
                                  |
                               do search on GPU
                                  |
CPU <- best move     <-
Even the Root-Search has to be performed on the GPU, becouse of the latency it takes too much time to transmit in serial 20-40 boards to the GPU.

Also, be very mindful of conditionals. Every conditional will effectively (at the best) half you speed. Since if one thread takes a branch, all the threads in that threadgroup will do.
Yes, this is true for convential SIMD Processing Elements, as far as i know GPUs and OpenCL although support SPMD, Single Programm Multible Data. So every Processing Element has its own instruction counter.

...i took a look in some papers about parallel tree-search algorithms, YBWC and DTS seems to have the best speedups, but PV-Split is more easy to implement....so i will start first with PV-Split, so one Processing Element of the GPU works on one subtree...

Regards,
Srdja
This is a really fascinating project. I have heard a lot of people say that it can't be done or it's a waste of time to try it. It seems to me that some creative thinking might make a very powerful chess engine using this approach.

Note that PV split is just a special case of YBW. In PV split, you go down one edge of the tree until you reach your max allowable split depth, and split there. When everyone finishes, you back up one ply and split there, until you back all the way to the root and split there. The down-side is that as processors go idle at a split point, particularly split points closer to the root, they sit and do nothing.

We wrote a paper on an early enhancement to PVS called EPVS (E=enhanced). The idea was that when a processor goes idle, rather than having it sit and wait, we stop everyone, advance two plies deeper into the tree, and split there. This will almost always be an ALL node since it is two plies deeper than the PVS split point which is also an ALL node.

From there you go to more relaxed models that basically behave like PVS at the beginning, but as a processor goes idle, it can be invited to help any processor that remains busy, whenever a processor hits a node where the YBW criterion is satisfied..

PVsplit is not that bad for 2 and even 4 processors. Beyond that, idle time piles up. And when you get to 16 cores, you really don't want them all to split at the same node since endgame positions might not even have 16 moves to search.

Zeta, a chess engine in OpenCL

Zeta, a chess engine in OpenCL

Re: Zeta, a chess engine in OpenCL

Re: Zeta, a chess engine in OpenCL

Re: Zeta, a chess engine in OpenCL

Re: Zeta, a chess engine in OpenCL

Re: Zeta, a chess engine in OpenCL

Re: Zeta, a chess engine in OpenCL

Re: Zeta, a chess engine in OpenCL

Re: Zeta, a chess engine in OpenCL

Re: Zeta, a chess engine in OpenCL