uct on gpu

smatovic · Post by **smatovic** » Wed Mar 14, 2012 2:00 pm

I am sorry, but that is just utterly unbelievable, at least for any NVIDIA GPU I know. There the latency is at least 400-800 cycles. The designers have gone to great lengths to reduce that through coalesced access, adding more caches, introducing the warp execution system and what have you. But Fermi still quotes the same figures for latency. So whatever you have in there is not a solution for common GPUs.

Oops, you are right, the 24 cycles may match the constant memory...

--
Srdja

bob · Post by **bob** » Thu Mar 15, 2012 3:13 pm

First, this is the kind of approach that fits GPUs. Independent calculations.

Second, it is not quite correct to "normalize" the clock speeds. There is a reason the GPU clock is not any faster. I would report the actual numbers, not manipulated numbers that would be correct if the GPU could somehow be made 2+x faster...

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 15, 2012 4:24 pm

First, this is the kind of approach that fits GPUs. Independent calculations.

You got that right. You yourself should start experimenting with non alpha-beta search methods and exploring other options. Not long ago you were _shocked_ to find out the method worked so well for checkers. I am pretty sure if I implemented checkers on gpu it would turn out to be a pretty strong one.

Second, it is not quite correct to "normalize" the clock speeds. There is a reason the GPU clock is not any faster. I would report the actual numbers, not manipulated numbers that would be correct if the GPU could somehow be made 2+x faster...

It will be a 100x speedup relative to a 1.25 GHz computer and a 41x speedup compared to a 3 GHz computer, as I stated. I find the former better because it tells me how efficient my code is. I have 114 cores in total, so a 100x speedup means about 87% efficiency. But depending on the audience (e.g. marketing) you may be required to report numbers against an i7 running on 4 cores, if that is what is being used right now for their business. I have even seen comparisons based on equal cost (multi-GPU vs CPU cluster)...
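The arithmetic behind the two figures can be sketched as follows. This is only an illustration: the function names are made up for this example, and the normalization assumes performance scales linearly with clock frequency, which is exactly the assumption questioned in the reply.

```python
def normalize_speedup(measured_speedup: float, cpu_ghz: float, gpu_ghz: float) -> float:
    """Rescale a speedup measured against a faster CPU to a CPU clocked like the GPU.

    Assumes (hypothetically) that run time is inversely proportional to clock speed.
    """
    return measured_speedup * (cpu_ghz / gpu_ghz)


def efficiency(speedup: float, cores: int) -> float:
    """Fraction of the ideal linear speedup that was achieved."""
    return speedup / cores


# Figures quoted in the post: 41x against a 3 GHz CPU, GPU cores at 1.25 GHz, 114 cores.
normalized = normalize_speedup(41, cpu_ghz=3.0, gpu_ghz=1.25)
print(f"normalized speedup: {normalized:.1f}x")
print(f"efficiency at 100x on 114 cores: {efficiency(100, 114):.3f}")
print(f"efficiency at the raw 41x on 114 cores: {efficiency(41, 114):.3f}")
```

The clock ratio 3 / 1.25 = 2.4 is what turns the measured 41x into roughly 100x, so the efficiency one reports depends entirely on which baseline is chosen.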

bob · Post by **bob** » Thu Mar 15, 2012 5:35 pm

Daniel Shawul wrote:
First, this is the kind of approach that fits GPUs. Independent calculations.
You got that right. You yourself should start experimenting with non alpha-beta search methods and exploring other options. Not long ago you were _shocked_ to find out the method worked so well for checkers. I am pretty sure if I implemented checkers on gpu it would turn out to be a pretty strong one.
Second, it is not quite correct to "normalize" the clock speeds. There is a reason the GPU clock is not any faster. I would report the actual numbers, not manipulated numbers that would be correct if the GPU could somehow be made 2+x faster...
It will be a 100x speedup relative to a 1.25 GHz computer and a 41x speedup compared to a 3 GHz computer, as I stated. I find the former better because it tells me how efficient my code is. I have 114 cores in total, so a 100x speedup means about 87% efficiency. But depending on the audience (e.g. marketing) you may be required to report numbers against an i7 running on 4 cores, if that is what is being used right now for their business. I have even seen comparisons based on equal cost (multi-GPU vs CPU cluster)...

That last statement is wrong. You are not getting 87% of 100 cores. You are getting 41% of 100 cores. Regardless of their speed. Stating it the other way is technically misleading and incorrect. I'm still not convinced of this as an approach for chess/checkers. Perhaps someone will do one that proves it (or fails to prove)...

Speedup is defined as

speedup = (time required for one cpu) / (time required for N cpus)

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 15, 2012 6:11 pm

What are you missing? If I launch one CUDA thread vs many threads that engage the whole device, I get something close to 100. That tells you the efficiency of your implementation, as opposed to comparing against some random CPU.

bob · Post by **bob** » Thu Mar 15, 2012 6:12 pm

Daniel Shawul wrote: The reason why they run slower is my crappy old GPU. Only Fermi (2.0 and later) has hardware support for them, but it also means whoever implemented the software __ffsll() in earlier versions used the 64-bit version and did not use Matt Taylor's faster folding trick, which is suitable for 32-bit. I got almost the same slowdowns using the 64-bit version as with the intrinsic __ffsll().
Also they have a leading zero count, which is not all that common in CPUs, I think.
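For readers unfamiliar with the folding trick mentioned in the quote, here is a minimal sketch of it. Python integers are unbounded, so the 32-bit arithmetic is emulated with masks. The multiplier `0x78291ACF` is the constant usually given for Matt Taylor's folded bitscan; the lookup table is rebuilt at import time here instead of being hard-coded, and all names are illustrative. The result is a 0-based bit index, whereas CUDA's `__ffsll()` returns a 1-based position (and 0 for a zero input).

```python
MASK32 = 0xFFFFFFFF
MAGIC = 0x78291ACF  # multiplier commonly cited for the folded 64-bit bitscan


def _fold_index(bb: int) -> int:
    """Hash the lowest set bit of a 64-bit value into a table slot 0..63."""
    bb ^= bb - 1                         # set every bit up to and including the LSB
    folded = (bb & MASK32) ^ (bb >> 32)  # fold 64 bits into 32, so only 32-bit math follows
    return ((folded * MAGIC) & MASK32) >> 26


# Build the lookup table by hashing each single-bit value once.
LSB_TABLE = [0] * 64
for _bit in range(64):
    LSB_TABLE[_fold_index(1 << _bit)] = _bit


def bit_scan_forward(bb: int) -> int:
    """Return the 0-based index of the least significant set bit of a nonzero 64-bit value."""
    if not 0 < bb < (1 << 64):
        raise ValueError("expected a nonzero 64-bit value")
    return LSB_TABLE[_fold_index(bb)]


print(bit_scan_forward(0b101000))
```

The point of the fold is that the multiply and shift operate on a 32-bit quantity, which is why the trick suits hardware without native 64-bit integer support.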

Leading zeros was a hardware instruction on the Crays. If you think about it, it is nothing more than a BSR, computed like this on Intel:

leading_zeros = 63 - BSR ( value )

That is why early versions of Crafty numbered bits from left to right, rather than Intel-style right to left (lsb=0 to msb=31 or 63, depending on word length)...
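The relation above can be checked with a small sketch. It assumes a 64-bit word and uses Python's `int.bit_length()` as a stand-in for the BSR instruction; the function names are illustrative only.

```python
def bsr(value: int) -> int:
    """Index of the most significant set bit, as BSR reports it (undefined for 0)."""
    if value <= 0:
        raise ValueError("BSR is undefined for zero")
    return value.bit_length() - 1


def leading_zeros64(value: int) -> int:
    """Count of zero bits above the most significant set bit in a 64-bit word."""
    if value >= (1 << 64):
        raise ValueError("value does not fit in 64 bits")
    return 63 - bsr(value)


for sample in (1, 0xFF, 1 << 63):
    print(hex(sample), leading_zeros64(sample))
```

With left-to-right bit numbering, the leading zero count is itself the index of the first set bit, which is the convenience the post describes.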

Crypto guys used this all the time, and they were the driving force to get a popcnt instruction on Intel as well.

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 15, 2012 6:51 pm

I did a quick run and I got even better results doing the comparison entirely on the GPU. As you can see, the efficiency is about 91%, which is even better than I predicted. One problem I have with measuring speedups on the GPU is that I cannot go down to 1 thread, for obvious reasons, so I compared with blocks instead. The watchdog timer kills my kernel automatically if it exceeds 10 seconds, so the 28-block kernel call takes too little time to finish. So the 0.73 efficiency result could probably be improved with a bigger job, but I can't do that on my GPU. That proves my point.


| Blocks | Time | Ratio | Efficiency |
|---|---|---|---|
| 28 | 453 | 20.66004415 | 0.73785872 |
| 14 | 750 | 12.47866667 | 0.891333333 |
| 7 | 1406 | 6.656472262 | 0.950924609 |
| 4 | 2406 | 3.889858687 | 0.972464672 |
| 1 | 9359 | 1 | 1 |
| | | Average | 0.910516267 |
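The Ratio and Efficiency columns appear to follow from the timings alone: ratio is the 1-block time divided by the N-block time, and efficiency is that ratio divided by N. A short sketch that recomputes them from the posted numbers (the time unit is not stated in the post, and the function names are illustrative):

```python
# (blocks, time) pairs copied from the table above; the time unit is not stated in the post.
MEASUREMENTS = [(28, 453), (14, 750), (7, 1406), (4, 2406), (1, 9359)]


def scaling_table(measurements):
    """Return (blocks, time, ratio, efficiency) rows relative to the 1-block run."""
    base_time = next(time for blocks, time in measurements if blocks == 1)
    rows = []
    for blocks, time in measurements:
        ratio = base_time / time  # speedup over the single-block run
        rows.append((blocks, time, ratio, ratio / blocks))
    return rows


def average_efficiency(rows):
    """Plain mean of the efficiency column, including the 1-block baseline row."""
    return sum(row[3] for row in rows) / len(rows)


rows = scaling_table(MEASUREMENTS)
for blocks, time, ratio, eff in rows:
    print(f"{blocks:>6} {time:>6} {ratio:>12.8f} {eff:>12.9f}")
print(f"average efficiency: {average_efficiency(rows):.9f}")
```

Note that the average includes the baseline row, whose efficiency is 1 by definition, so it is slightly higher than the mean over the multi-block runs alone.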

That checkers is suitable for UCT is a fact. I have a working engine that clearly shows it is a very good fit, and Bjornsson and co, using their GGP engine, have also written a paper on it. I gave a link to it some time back. There is no substitute for getting your hands dirty...

bob · Post by **bob** » Thu Mar 15, 2012 8:36 pm

Daniel Shawul wrote: What are you missing? If I launch one CUDA thread vs many threads that engage the whole device, I get something close to 100. That tells you the efficiency of your implementation, as opposed to comparing against some random CPU.

If that is what you wrote, then I misunderstood. The correct comparison is always one CPU vs N CPUs, where the CPUs are identical. If that is what you did, you did it correctly. My reading suggested that you ran 41x faster on 100+ cores than on just one, and then you corrected that based on a clock speed difference...

It appears to be my error... sorry...

diep · Post by **diep** » Thu Mar 15, 2012 9:14 pm

Which type of GPU are you using?

Are you programming in CUDA or in OpenCL?

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 15, 2012 9:21 pm

If that is what you wrote, then I misunderstood. The correct comparison is always one CPU vs N CPUs, where the CPUs are identical. If that is what you did, you did it correctly. My reading suggested that you ran 41x faster on 100+ cores than on just one, and then you corrected that based on a clock speed difference...

It appears to be my error... sorry...

That is all right. I post here to get feedback. And I may not have been so clear about my setup and what I am trying to achieve with it.
