Monte carlo on a NVIDIA GPU ?

mcostalba · Post by **mcostalba** » Fri Aug 01, 2008 9:17 am

Nowadays you can get a super powerful massive parallel monster for few hundred bucks: a video card GPU

These toys can be programmed in C language (well, very similar to C actually) for tasks not only related to video applications.

As example the latest NVIDIA GPU, the GTX 280 sports 240 processors!!

Each processor runs at 1296 MHz and memory bandwidth is an incredible 141.7 GB/sec.

What is most important each processor is programmable in C with so called CUDA environment (see http://www.nvidia.com/object/cuda_learn.html)

This thread title suggests Monte Carlo analysis mostly because it is a technique well suited for parallel processing being each "little game" independent from each other.

So the advantage of a chess engine implemented on a GPU are

- Massively multi parallel: row power order of magnitudes higher then any octal or even bigger traditional box

- Programmable in C

- Low cost, easily affordable

- Evolution of GPU has been much faster then CPU in the past and the trend is still that: GPU market (video games driven) is far more important then quad or octal big boxes one (mainly enthusiast driven)

Finally the question

As anyone ever tried the GPU way to develop a chess engine ?

Thanks
Marco

Harald Johnsen · Post by **Harald Johnsen** » Fri Aug 01, 2008 10:14 am

mcostalba wrote: As anyone ever tried the GPU way to develop a chess engine ?

Thanks
Marco

A gpu is not what you think it is.

There is no processor or threads. You should see it as simd, ie your pixel shader code will be executed in parallel, input data is one frame buffer, output data is another frame buffer, textures can contains all kind of parameters.

So you can not write your search() function in your pixel shader, at most you could write things like domove() or eval(). There is no recursivity and loops were *very* limited last time i checked.

It's not a simple task, it's not just about copy paste a search function to the gpu.

But the real problem that I see is that all the gain you could have from gpu processing will be lost at the moment you start to exchange data between the cpu and the gpu. You know, the gpu is 'fast' because it is an asynchronous co-processor, all is lost a the moment you add synchro point between the two (transfers from/to the gpu have allways been a bottleneck in 3d graphics).

HJ.

mcostalba · Post by **mcostalba** » Fri Aug 01, 2008 10:50 am

Harald Johnsen wrote:
A gpu is not what you think it is.

There is no processor or threads. You should see it as simd, ie your pixel shader code will be executed in parallel, input data is one frame buffer, output data is another frame buffer, textures can contains all kind of parameters.

Current programming model for latest GPU (see CUDA docs for reference) allows to download from CPU to GPU more then one "pixel shader" (i.e. a specific code chunk): you can address different GPU processors with different shaders and each processor can operate on different data sets. So it is more a mimd then a simd.

Harald Johnsen wrote:
There is no recursivity and loops were *very* limited last time i checked.

It is always possible to rewrite a recursive algorithm in a stack/array based form, avoiding explicit recursion.

Harald Johnsen wrote:
It's not a simple task, it's not just about copy paste a search function to the gpu.

That's for sure !!!

Harald Johnsen wrote:
But the real problem that I see is that all the gain you could have from gpu processing will be lost at the moment you start to exchange data between the cpu and the gpu. You know, the gpu is 'fast' because it is an asynchronous co-processor, all is lost a the moment you add synchro point between the two (transfers from/to the gpu have allways been a bottleneck in 3d graphics).

HJ.

Well, this is a difficult topic to hand waving on as I'm going to do now but... it depends A LOT on the volume of information to exchange between CPU and GPU.

In case of 3d graphics, CPU must download texture info + geometrical models (vertex) and it is a lot (easily above 100MB for 3D scene).

In chess case IMHO I don't see such a volume if the search is all inside the GPU. In a ideal world CPU sends FEN positions and some search parameter and GPU return the list of evaluations for each different tested tree. Of course this is an over-simplicistic view.

The point here is just to state that assuming the bottleneck to be CPU-GPU bandwitdh is at least premature without sketching out a possible design of a GPU-aided chess engnine implementation.

mcostalba · Post by **mcostalba** » Fri Aug 01, 2008 8:52 pm

Harald Johnsen wrote:
A gpu is not what you think it is.

I have just found this interesting link:

http://www.gpuchess.com/

It seems that the idea of CUDA for a chess engine has already been implemented ! well it's just few months old actually !

plattyaj · Post by **plattyaj** » Fri Aug 01, 2008 9:35 pm

mcostalba wrote:
Harald Johnsen wrote:
A gpu is not what you think it is.

I have just found this interesting link:

http://www.gpuchess.com/

It seems that the idea of CUDA for a chess engine has already been implemented ! well it's just few months old actually !

Interesting - a couple of months ago I tried playing around with the Brook toolkit to see how that might work out. I realized just understanding the base technology would take too long for me to think about on my limited development time but I was wondering if I could implement a move generator using the gpu.

Andy.

bob · Post by **bob** » Fri Aug 01, 2008 11:57 pm

mcostalba wrote:Nowadays you can get a super powerful massive parallel monster for few hundred bucks: a video card GPU

These toys can be programmed in C language (well, very similar to C actually) for tasks not only related to video applications.

As example the latest NVIDIA GPU, the GTX 280 sports 240 processors!!

Each processor runs at 1296 MHz and memory bandwidth is an incredible 141.7 GB/sec.

What is most important each processor is programmable in C with so called CUDA environment (see http://www.nvidia.com/object/cuda_learn.html)

This thread title suggests Monte Carlo analysis mostly because it is a technique well suited for parallel processing being each "little game" independent from each other.

So the advantage of a chess engine implemented on a GPU are

- Massively multi parallel: row power order of magnitudes higher then any octal or even bigger traditional box

- Programmable in C

- Low cost, easily affordable

- Evolution of GPU has been much faster then CPU in the past and the trend is still that: GPU market (video games driven) is far more important then quad or octal big boxes one (mainly enthusiast driven)

Finally the question

As anyone ever tried the GPU way to develop a chess engine ?

Thanks
Marco

we have students doing a couple of interesting (but non-chess) things on GPUs. The problem is memory. It is painfully slow to copy things in/out of the GPU's memory to main memory. They have some neat features, but some issues as well if you want to use multiple GPUs to distribute a tree search, as the communication is a real issue.

mcostalba · Post by **mcostalba** » Sat Aug 02, 2008 12:12 am

bob wrote:
They have some neat features, but some issues as well if you want to use multiple GPUs to distribute a tree search, as the communication is a real issue.

Currently a multi-thread / multi-process approach is used in chess engines to divide the search job among different CPU cores.

Is the data exchange among threads /process of current chess engines so heavy ? Any idea of the required bandwidth ?

If the same splitting of functions would be implemented among the GPU processors instead of CPU cores perhaps the bandwidth requirements would be more or less the same. Am I missing something ?

bob · Post by **bob** » Sat Aug 02, 2008 12:17 am

mcostalba wrote:
bob wrote:
They have some neat features, but some issues as well if you want to use multiple GPUs to distribute a tree search, as the communication is a real issue.
Currently a multi-thread / multi-process approach is used in chess engines to divide the search job among different CPU cores.

Is the data exchange among threads /process of current chess engines so heavy ? Any idea of the required bandwidth ?

If the same splitting of functions would be implemented among the GPU processors instead of CPU cores perhaps the bandwidth requirements would be more or less the same. Am I missing something ?

There are lots of issues. Hashing. Done at every node multiple times (pawn hash, normal hash, some use others like king safety hash, etc.) Repetition list, board position, move lists, locks (a real killer on gpus) to synchronize updates. And the splitting operation itself is significant due to the data that must be replicated... there is actually a significant amount of sharing going on depending on how the program is designed/written.

mcostalba · Post by **mcostalba** » Sat Aug 02, 2008 12:42 am

bob wrote:
There are lots of issues. Hashing. Done at every node multiple times (pawn hash, normal hash, some use others like king safety hash, etc.) Repetition list, board position, move lists, locks (a real killer on gpus) to synchronize updates. And the splitting operation itself is significant due to the data that must be replicated... there is actually a significant amount of sharing going on depending on how the program is designed/written.

Thanks for your explanation.

If I have understood correctly, the problem is the data exchange between CPU and GPU, because the intra-gpu processors data exchange would not suffer from this given that the bandwidth between the GPU and the RAM embedded on the GPU card is stellar.

A possible gpu-aided chess engine design, so, should try to minimize CPU-GPU data exchange while, eventually split the functions to use intra-gpu processors data exchange instead ?

bob · Post by **bob** » Sat Aug 02, 2008 2:54 am

mcostalba wrote:
bob wrote:
There are lots of issues. Hashing. Done at every node multiple times (pawn hash, normal hash, some use others like king safety hash, etc.) Repetition list, board position, move lists, locks (a real killer on gpus) to synchronize updates. And the splitting operation itself is significant due to the data that must be replicated... there is actually a significant amount of sharing going on depending on how the program is designed/written.
Thanks for your explanation.

If I have understood correctly, the problem is the data exchange between CPU and GPU, because the intra-gpu processors data exchange would not suffer from this given that the bandwidth between the GPU and the RAM embedded on the GPU card is stellar.

A possible gpu-aided chess engine design, so, should try to minimize CPU-GPU data exchange while, eventually split the functions to use intra-gpu processors data exchange instead ?

I haven't given it a lot of thought, so I can't really comment intelligently on exactly how a parallel search would/should look on the GPUs. Since there are several different vendors, and each one has different characteristics, it would be a bit of a pain. And the real issue is a "locking" mechanism which the gpus don't currently have a need for.

Monte carlo on a NVIDIA GPU ?

Monte carlo on a NVIDIA GPU ?

Re: Monte carlo on a NVIDIA GPU ?

Re: Monte carlo on a NVIDIA GPU ?

Re: Monte carlo on a NVIDIA GPU ?

Re: Monte carlo on a NVIDIA GPU ?

Re: Monte carlo on a NVIDIA GPU ?

Re: Monte carlo on a NVIDIA GPU ?

Re: Monte carlo on a NVIDIA GPU ?

Re: Monte carlo on a NVIDIA GPU ?

Re: Monte carlo on a NVIDIA GPU ?