uct on gpu

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 15, 2012 6:51 pm

I did a quick run and got even better results doing the comparison entirely on the GPU. As you can see, the average efficiency is about 91%, which is even better than I predicted. One problem with measuring speedups on the GPU is that I cannot go down to 1 thread, for obvious reasons, so I compared by number of blocks instead. The watchdog timer kills my kernel automatically if it runs longer than 10 seconds, so the job has to stay small enough for the 1-block run to finish, and the 28-block kernel call then takes too little time. So the 0.73 efficiency result could probably be improved with a bigger job, but I can't run that on my GPU. That proves my point.

Blocks   Time    Ratio         Efficiency
28       453     20.66004415   0.73785872
14       750     12.47866667   0.891333333
7        1406    6.656472262   0.950924609
4        2406    3.889858687   0.972464672
1        9359    1             1

                 Average       0.910516267
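
The Ratio and Efficiency columns in the table above can be reproduced from the timings alone. A minimal sketch in Python, assuming Ratio is the 1-block time divided by the N-block time and Efficiency is Ratio divided by the number of blocks (the names `times` and `scaling` are only for illustration):

```python
# Timings copied from the table above: number of blocks -> elapsed time.
times = {28: 453, 14: 750, 7: 1406, 4: 2406, 1: 9359}

def scaling(times):
    """Return {blocks: (ratio, efficiency)} relative to the 1-block run."""
    base = times[1]
    result = {}
    for blocks, t in times.items():
        ratio = base / t                  # speedup over a single block
        result[blocks] = (ratio, ratio / blocks)
    return result

table = scaling(times)
# Mean over all rows, including the 1-block baseline.
average = sum(eff for _, eff in table.values()) / len(table)

for blocks in sorted(table, reverse=True):
    ratio, eff = table[blocks]
    print(f"{blocks:>6} {times[blocks]:>6} {ratio:12.6f} {eff:10.6f}")
print(f"average efficiency: {average:.6f}")
```

Note that the 0.91 average includes the 1-block baseline row, whose efficiency is 1 by definition; over the four multi-block rows alone the mean is about 0.89.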

That checkers is suitable for UCT is a fact. I have a working engine that clearly shows it works very well, and Bjornsson and co. have also written a paper on it using their GGP engine. I gave a link to it some time back. There is no substitute for getting your hands dirty ...

bob · Post by **bob** » Thu Mar 15, 2012 8:36 pm

Daniel Shawul wrote: What are you missing? If I launch one CUDA thread vs. many threads that engage the whole device, I get something close to 100. That tells you the efficiency of your implementation, as opposed to comparing against some random CPU.

If that is what you wrote, then I misunderstood. The correct comparison is always one CPU vs. N CPUs, where the CPUs are identical. If that is what you did, you did it correctly. My reading suggested that you ran 41x faster on 100+ cores than on just one, and then corrected that based on a clock speed difference...

It appears to be my error... sorry...

diep · Post by **diep** » Thu Mar 15, 2012 9:14 pm

Which type of GPU are you using?

Are you programming in CUDA or in OpenCL?

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 15, 2012 9:21 pm

bob wrote: If that is what you wrote, then I misunderstood. The correct comparison is always one CPU vs. N CPUs, where the CPUs are identical. If that is what you did, you did it correctly. My reading suggested that you ran 41x faster on 100+ cores than on just one, and then corrected that based on a clock speed difference...

It appears to be my error... sorry...

That is all right. I post here to get feedback. And I may not have been clear about my setup and what I am trying to achieve with it.

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 15, 2012 9:27 pm

diep wrote: Which type of GPU are you using?

Are you programming in CUDA or in OpenCL?

Hello Vincent,
My GPU is a Quadro FX 3700 with 14 SMs = 112 cores. The other details are in the results I posted elsewhere in this thread. It is used for display as well, so now and then I have to reboot when I get the blue screen. I use CUDA for programming. For now I am just experimenting with simpler games, so there is nothing to report on chess yet except that it runs.

diep · Post by **diep** » Sat Mar 17, 2012 2:17 pm

Daniel Shawul wrote:
diep wrote: Which type of GPU are you using?

Are you programming in CUDA or in OpenCL?

Hello Vincent,
My GPU is a Quadro FX 3700 with 14 SMs = 112 cores. The other details are in the results I posted elsewhere in this thread. It is used for display as well, so now and then I have to reboot when I get the blue screen. I use CUDA for programming. For now I am just experimenting with simpler games, so there is nothing to report on chess yet except that it runs.

The famous 5-second rule of Windows before it decides things are hung...

Why don't you try installing Linux, or adding another NVIDIA graphics card?

As a matter of fact, I'm doing a new install on Linux now for a few Tesla 2075s, for tuning computer chess.

If you install a cheap second card and use that as the first device, you're there.

smatovic · Post by **smatovic** » Mon Mar 19, 2012 4:04 pm

Hey Daniel,

I just want to mention that I doubled my move gen performance simply by using a vector datatype, "long4", for the QuadBitboards.

Running a simple loop on the starting position, I get about 300 Knps per SIMD unit;
if I turn off the legality check, it goes up to 1 Mnps.

How does your move generator perform?

--
Srdja
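
For readers unfamiliar with the term: a quad-bitboard stores the whole board in four 64-bit words, with the piece code of each square spread as one bit per word, which is why it fits a vector type of four 64-bit integers such as the "long4" mentioned above. A minimal Python sketch of the layout, assuming a 4-bit piece code per square with 0 meaning empty (the helper names and the code values are hypothetical, not taken from Srdja's engine, and plain Python integers only stand in for the vector lanes):

```python
MASK64 = (1 << 64) - 1

def empty_board():
    """Four 64-bit words; bit `sq` of word `i` is bit `i` of that square's piece code."""
    return [0, 0, 0, 0]

def set_piece(qbb, sq, code):
    """Return a new quad-bitboard with the 4-bit `code` stored on square `sq` (0..63)."""
    bit = 1 << sq
    return [((w & ~bit) | (bit if (code >> i) & 1 else 0)) & MASK64
            for i, w in enumerate(qbb)]

def get_piece(qbb, sq):
    """Reassemble the 4-bit piece code of square `sq` from the four words."""
    return sum(((w >> sq) & 1) << i for i, w in enumerate(qbb))

def occupied(qbb):
    """Squares holding any piece: OR of the four words (code 0 means empty)."""
    return qbb[0] | qbb[1] | qbb[2] | qbb[3]

# Usage: put a piece with (hypothetical) code 0b1011 on square 4.
board = set_piece(empty_board(), 4, 0b1011)
print(get_piece(board, 4), bin(occupied(board)))
```

On a GPU the four words would be loaded and stored together as one vector value instead of four separate scalars, which is one plausible source of the reported gain.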

Daniel Shawul · Post by **Daniel Shawul** » Mon Mar 19, 2012 9:01 pm

smatovic wrote: Hey Daniel,

I just want to mention that I doubled my move gen performance simply by using a vector datatype, "long4", for the QuadBitboards.

That is good. GPUs don't have SSE instructions, so I suppose your speedup must have come from memory optimizations.

smatovic wrote: Running a simple loop on the starting position, I get about 300 Knps per SIMD unit;
if I turn off the legality check, it goes up to 1 Mnps.

Yes, legality testing can be a killer. For my GGP engine Nebiyu, I got a 100% speedup by allowing the king to be captured. If your attack detection is slow, you may want to try that.

smatovic wrote: How does your move generator perform?

I don't have a full move generator, only a random legal move generator. Unlike you, I also don't store moves in global memory. Your engine is probably memory bound because of that. 300 Knps is not that much when you divide it by the number of threads in a SIMD unit. Once you give up relying on registers (and shared memory), bear in mind that the kernel may end up no faster than a single CPU. Hash tables are a typical example of something better handled on the CPU. Chess is probably not well suited to GPU computation because Monte Carlo misses a lot of tactics, but my effort is concentrated on MCTS and on games that suit that approach. The MC part is really needed to harness the power of the GPU and to hide global memory latency as well.
----
cheers
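
The king-capture idea mentioned above can be illustrated with a toy example. This is only a sketch under stated assumptions, not Nebiyu's code: the game is a made-up 1 x N board holding kings (`K`/`k`) and rooks (`R`/`r`), and `pseudo_legal_moves` and `playout` are hypothetical names. The point is that the generator never tests for check; a playout simply ends when a king is actually captured.

```python
import random

def pseudo_legal_moves(board, side):
    """Generate (src, dst) moves on a 1 x N board, ignoring check entirely."""
    own = str.isupper if side == 0 else str.islower
    moves = []
    for src, piece in enumerate(board):
        if piece == "." or not own(piece):
            continue
        for step in (-1, 1):
            dst = src + step
            while 0 <= dst < len(board):
                target = board[dst]
                if target != "." and own(target):
                    break                  # blocked by an own piece
                moves.append((src, dst))
                if target != "." or piece in ("K", "k"):
                    break                  # capture made, or king moves one step only
                dst += step
    return moves

def playout(board, side, rng, max_plies=200):
    """Random playout; returns the winning side (0 or 1), or None if undecided."""
    board = list(board)
    for _ in range(max_plies):
        moves = pseudo_legal_moves(board, side)
        if not moves:
            return None
        src, dst = rng.choice(moves)
        captured = board[dst]
        board[dst], board[src] = board[src], "."
        if captured in ("K", "k"):
            return side                    # king captured: the mover wins
        side ^= 1
    return None

# Usage: uppercase (side 0) moves first from a symmetric start.
print(playout("KR....rk", 0, random.Random(1)))
```

The trade-off is that playouts may pass through positions that would be illegal in the real game, in exchange for skipping an attack test on every generated move.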

diep · Post by **diep** » Mon Mar 19, 2012 9:33 pm

smatovic wrote: Hey Daniel,

I just want to mention that I doubled my move gen performance simply by using a vector datatype, "long4", for the QuadBitboards.

Running a simple loop on the starting position, I get about 300 Knps per SIMD unit;
if I turn off the legality check, it goes up to 1 Mnps.

How does your move generator perform?

--
Srdja

You're building a chess program on a GPU as well, Srdja?

Can you remind us which GPU you're using, Srdja, and how high it is clocked?

AMD has 64 PEs per SIMD, and Nvidia's Fermi has around 32 stream cores per SIMD.

SIMDs also get called compute units, by the way, for those who wonder.

The Teslas I have here are 1.15 GHz, it seems, with 448 cores and 6 GB RAM.

Regards,
Vincent
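
Putting these lane counts next to the 300 Knps and 1 Mnps per SIMD figures quoted above gives a rough per-lane rate. A back-of-the-envelope sketch, assuming every lane of the SIMD is busy and that one of the two widths applies (the thread does not say which GPU produced the figures; `nps_per_lane` is just an illustrative name):

```python
def nps_per_lane(nps_per_simd, lanes):
    """Average nodes per second handled by one lane of a SIMD unit."""
    return nps_per_simd / lanes

for label, lanes in (("64-wide SIMD", 64), ("32-wide SIMD", 32)):
    for nps in (300_000, 1_000_000):
        rate = nps_per_lane(nps, lanes)
        print(f"{label}: {nps:>9} nps per SIMD -> {rate:.1f} nps per lane")
```

Either way the per-lane rate is in the thousands to low tens of thousands of nodes per second.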

diep · Post by **diep** » Mon Mar 19, 2012 9:43 pm

Daniel Shawul wrote:
smatovic wrote: Hey Daniel,

I just want to mention that I doubled my move gen performance simply by using a vector datatype, "long4", for the QuadBitboards.

That is good. GPUs don't have SSE instructions, so I suppose your speedup must have come from memory optimizations.

smatovic wrote: Running a simple loop on the starting position, I get about 300 Knps per SIMD unit;
if I turn off the legality check, it goes up to 1 Mnps.

Yes, legality testing can be a killer. For my GGP engine Nebiyu, I got a 100% speedup by allowing the king to be captured. If your attack detection is slow, you may want to try that.

smatovic wrote: How does your move generator perform?

I don't have a full move generator, only a random legal move generator. Unlike you, I also don't store moves in global memory. Your engine is probably memory bound because of that. 300 Knps is not that much when you divide it by the number of threads in a SIMD unit. Once you give up relying on registers (and shared memory), bear in mind that the kernel may end up no faster than a single CPU. Hash tables are a typical example of something better handled on the CPU. Chess is probably not well suited to GPU computation because Monte Carlo misses a lot of tactics, but my effort is concentrated on MCTS and on games that suit that approach. The MC part is really needed to harness the power of the GPU and to hide global memory latency as well.
----
cheers

You just shouldn't use Monte Carlo or UCT for chess.

Of course chess can work brilliantly on GPUs; it's just a lot of work, and you need 3 layers of SMP.

One within a single compute unit, one between compute units (only works for Nvidia), and one between the GPUs (using the RAM on the motherboard).

For those who wonder, you can easily stack a bunch of GPUs onto a single machine with riser cards. You can give each group of a few GPUs its own PSU.

Nvidia works brilliantly there; with AMD you see massive numbers of bug reports. But I guess everyone here is programming for Nvidia, not AMD, anyway.

Riser cards are pretty cheap, by the way. I ordered a few from Hong Kong for a few dollars apiece.

They will probably take a few weeks to arrive here, though.
