uct on gpu

Discussion of chess software programming and technical issues.


Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

intrinsic popcnt

Post by Daniel Shawul »

c) 32-bit firstone and bit counts are OK but still very slow. I need to count pseudo-legal moves before generating them.
The bitboards are very sparse and maybe a simpler version of population count would help.
Good lord. I was dead wrong on this. There are all sorts of integer instructions even for 64-bit bitboards: __popcll(), __clzll() and what have you. If your integers are 24-bit there are faster multiplications with __mul24() as well. I got some speedup with the popcount already. I had assumed they wouldn't bother implementing integer intrinsics since floats are faster and that is what GPUs are intended for. Anyway I hope Gerd sees this and does his magic.

Edit: Wait. That was a fluke, the intrinsics are slower, especially __ffsll(). But I shaved off 5 registers and removed some tables by using the intrinsics. I will do unit testing to find out whether they are actually slower.

The intrinsics are hardware implemented according to this page so they should be faster.
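For illustration, a minimal sketch of what these intrinsics look like inside a device function (the helper names and the move-target framing are made up for the example, not taken from anyone's engine):

Code:
__device__ int count_targets(unsigned long long targets) {
    // population count of a 64-bit move-target bitboard
    return __popcll(targets);
}

__device__ int pop_first_target(unsigned long long *targets) {
    // __ffsll() returns a 1-based bit index, or 0 for an empty word
    int sq = __ffsll((long long)*targets) - 1;
    *targets &= *targets - 1;   // clear the bit just extracted
    return sq;                  // -1 if the bitboard was empty
}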

Btw the bitmagic library has some SSE popcnt routines that can be used when there is no hardware support for it. Here is a straightforward translation of the well-known 32-bit pop count.
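A sketch of the well-known 32-bit SWAR count together with its straightforward 64-bit widening (not necessarily the exact routine from the post, just the standard software fallback for when there is no hardware popcount):

Code:
#include <stdint.h>

/* classic SWAR count: sum bits in pairs, then nibbles, then let the
   multiply accumulate the byte sums into the top byte */
unsigned int popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

/* the same idea widened to 64 bits for bitboards */
unsigned int popcount64(uint64_t x) {
    x = x - ((x >> 1) & 0x5555555555555555ull);
    x = (x & 0x3333333333333333ull) + ((x >> 2) & 0x3333333333333333ull);
    x = (x + (x >> 4)) & 0x0F0F0F0F0F0F0F0Full;
    return (unsigned int)((x * 0x0101010101010101ull) >> 56);
}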
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: intrinsic popcnt

Post by Daniel Shawul »

The reason why they run slower is my crappy old GPU. Only Fermi (compute 2.0 and later) has hardware support for them, but it also means that whoever implemented the software __ffsll() for earlier devices used a 64-bit algorithm and did not use Matt Taylor's faster folding trick, which is suited to 32-bit hardware. I got almost the same slowdown from the 64-bit version as from the intrinsic __ffsll().
They also have a leading zero count, which is not all that common in CPUs I think.
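Not Matt Taylor's folding trick itself (that one relies on a 64-entry de Bruijn lookup table), but a sketch of one 32-bit-friendly way to do a forward scan: isolate the lowest set bit and count the bits below it, one 32-bit half at a time:

Code:
#include <stdint.h>

/* SWAR population count of a 32-bit half */
static unsigned int popcount32(uint32_t x) {
    x = x - ((x >> 1) & 0x55555555u);
    x = (x & 0x33333333u) + ((x >> 2) & 0x33333333u);
    x = (x + (x >> 4)) & 0x0F0F0F0Fu;
    return (x * 0x01010101u) >> 24;
}

/* index (0..63) of the least significant one bit; bb must be nonzero */
int bit_scan_forward(uint64_t bb) {
    uint64_t below = (bb & (~bb + 1)) - 1;   /* all bits strictly below the LSB */
    return (int)(popcount32((uint32_t)below)
               + popcount32((uint32_t)(below >> 32)));
}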
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: uct for chess

Post by smatovic »

Daniel Shawul wrote:
Even though I use 60 registers for chess, I do not read any variables from global memory, so I am not all that unhappy about it. I can control latency by simply doing more Monte-Carlo simulations, which would be impossible if that too used global memory.
It depends on the computation; it can sometimes be faster to access global memory. Afaik my device needs 24 cycles to fetch from global memory, and if you consider the SIMT architecture of the device, it can be faster to use global memory than to compute different branches in one warp.
Daniel Shawul wrote:
I have in total 9 bitboards (18 registers), so it is not a lot, but the calculations to generate one random legal move are intensive, and before you know it you hit 60 registers. Also, quad bitboards require SIMD operations for decoding, if I am not mistaken. GPUs are SIMT and as such do not have SSE instructions.
I use QuadBitboards only for the board representation; move generation is done with magic bitboards (global memory and constant memory).
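As a rough sketch of what decoding a quad bitboard involves, under one common layout (four bit planes, with the 4-bit piece code of each square spread across them; the actual encoding used in the engines discussed here may differ):

Code:
#include <stdint.h>

/* qbb[0..3] are four bit planes; the four bits at position sq, read across
   the planes, form the piece code of that square */
int piece_at(const uint64_t qbb[4], int sq) {
    return (int)( ((qbb[0] >> sq) & 1ull)
                | ((qbb[1] >> sq) & 1ull) << 1
                | ((qbb[2] >> sq) & 1ull) << 2
                | ((qbb[3] >> sq) & 1ull) << 3 );
}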
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: uct for chess

Post by Daniel Shawul »

smatovic wrote:
It depends on the computation; it can sometimes be faster to access global memory. Afaik my device needs 24 cycles to fetch from global memory, and if you consider the SIMT architecture of the device, it can be faster to use global memory than to compute different branches in one warp.
I am sorry but that is just utterly unbelievable, at least for any NVIDIA GPU I know. There the latency is at least 400-800 cycles. The designers have gone to great lengths to reduce that through coalesced access, adding more caches, introducing the warp execution system and what have you. But Fermi still quotes the same figures for latency. So whatever you have there is not a solution for common GPUs.
smatovic wrote:
I use QuadBitboards only for the board representation; move generation is done with magic bitboards (global memory and constant memory).
I don't use global memory anywhere, and have gone to great lengths even for generating a single legal move. Using 4 vs 8 bitboards for the representation is not much of a difference for me. Register usage stays the same whether I put the board for each thread in registers or in shared memory, but it does get slower.
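For illustration only, the two placements being contrasted look roughly like this (the 9-bitboard figure is taken from the post above; the kernel names and block size are made up):

Code:
#define THREADS_PER_BLOCK 64

__global__ void playout_registers(void) {
    // per-thread board in local variables; with small, compile-time indices
    // the compiler can keep the whole array in registers
    unsigned long long board[9];
    // ... move generation works on board[] ...
    (void)board;
}

__global__ void playout_shared(void) {
    // per-thread board placed in shared memory, one row per thread
    __shared__ unsigned long long board[THREADS_PER_BLOCK][9];
    unsigned long long *my = board[threadIdx.x];
    // ... move generation works on my[] ...
    (void)my;
}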
smatovic
Posts: 2639
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: uct for chess

Post by smatovic »

Daniel Shawul wrote:
I am sorry but that is just utterly unbelievable, at least for any NVIDIA GPU I know. There the latency is at least 400-800 cycles. The designers have gone to great lengths to reduce that through coalesced access, adding more caches, introducing the warp execution system and what have you. But Fermi still quotes the same figures for latency. So whatever you have there is not a solution for common GPUs.
Oops, you are right, the 24 cycles may match constant memory...


--
Srdja
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 100x speed up

Post by bob »

First, this is the kind of approach that fits GPUs. Independent calculations.

Second, it is not quite correct to "normalize" the clock speeds. There is a reason the GPU clock is not any faster. I would report the actual numbers, not manipulated numbers that would be correct if the GPU could somehow be made 2+x faster...
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: 100x speed up

Post by Daniel Shawul »

bob wrote:
First, this is the kind of approach that fits GPUs. Independent calculations.
You got that right. You yourself should start experimenting with non-alpha-beta search methods and exploring other options. Not long ago you were _shocked_ to find out the method worked so well for checkers. I am pretty sure that if I implemented checkers on the GPU it would turn out to be a pretty strong one.
bob wrote:
Second, it is not quite correct to "normalize" the clock speeds. There is a reason the GPU clock is not any faster. I would report the actual numbers, not manipulated numbers that would be correct if the GPU could somehow be made 2+x faster...
It will be 100x relative to a 1.25 GHz computer and a 41x speedup compared to a 3 GHz computer, as I stated. I find the former better because it tells me how efficient my code is. I have 114 cores in total, so a 100x speedup says about 87% efficiency. But depending on the audience (e.g. marketing) you may be required to report numbers against an i7 running on 4 cores, if that is what is being used right now for their business. I have even seen comparisons based on equal cost (multi-GPU vs CPU cluster)...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 100x speed up

Post by bob »

Daniel Shawul wrote:
First, this is the kind of approach that fits GPUs. Independent calculations.
You got that right. You yourself should start experimenting with non-alpha-beta search methods and exploring other options. Not long ago you were _shocked_ to find out the method worked so well for checkers. I am pretty sure that if I implemented checkers on the GPU it would turn out to be a pretty strong one.
Second, it is not quite correct to "normalize" the clock speeds. There is a reason the GPU clock is not any faster. I would report the actual numbers, not manipulated numbers that would be correct if the GPU could somehow be made 2+x faster...
It will be 100x relative to a 1.25 GHz computer and a 41x speedup compared to a 3 GHz computer, as I stated. I find the former better because it tells me how efficient my code is. I have 114 cores in total, so a 100x speedup says about 87% efficiency. But depending on the audience (e.g. marketing) you may be required to report numbers against an i7 running on 4 cores, if that is what is being used right now for their business. I have even seen comparisons based on equal cost (multi-GPU vs CPU cluster)...
That last statement is wrong. You are not getting 87% of 100 cores. You are getting 41% of 100 cores. Regardless of their speed. Stating it the other way is technically misleading and incorrect. I'm still not convinced of this as an approach for chess/checkers. Perhaps someone will do one that proves it (or fails to prove)...

Speedup is defined as

speedup = (time required for one cpu) / (time required for N cpus)
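The matching definition of parallel efficiency, with the thread's own figures plugged in as a worked check (taking N = 114 cores as quoted above):

efficiency = speedup / N

100 / 114 ≈ 88% when the baseline time is that of a single GPU thread, and 41 / 114 ≈ 36% when the baseline is the 3 GHz CPU.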
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: 100x speed up

Post by Daniel Shawul »

What are you missing? If I launch one CUDA thread vs many threads that engage the whole device, I get something close to 100. That tells you the efficiency of your implementation, as opposed to comparing against some random CPU.
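Illustratively, that comparison amounts to timing the same kernel under two launch configurations (the kernel below is an empty stand-in and the block/thread counts are made up):

Code:
#include <cstdio>

// empty stand-in: each thread would run its own Monte-Carlo playouts here
__global__ void playout_kernel(int n_playouts) { }

static float time_launch(int blocks, int threads) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start);
    playout_kernel<<<blocks, threads>>>(1000);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}

int main(void) {
    float one_thread  = time_launch(1, 1);    // baseline: a single CUDA thread
    float full_device = time_launch(56, 64);  // enough threads to fill the device
    printf("relative speedup: %.1fx\n", one_thread / full_device);
    return 0;
}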
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: intrinsic popcnt

Post by bob »

Daniel Shawul wrote: The reason why they run slower is my crappy old GPU. Only Fermi (compute 2.0 and later) has hardware support for them, but it also means that whoever implemented the software __ffsll() for earlier devices used a 64-bit algorithm and did not use Matt Taylor's faster folding trick, which is suited to 32-bit hardware. I got almost the same slowdown from the 64-bit version as from the intrinsic __ffsll().
They also have a leading zero count, which is not all that common in CPUs I think.
Leading zero count was a hardware instruction on the Crays. If you think about it, it is nothing more than a BSR, computed like this on Intel:

leading_zeros = 63 - BSR ( value )

That is why early versions of Crafty numbered bits from left to right, rather than Intel-style right to left (LSB = 0 up to MSB = 31 or 63, depending on word length)...
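A minimal sketch of that relation using CUDA's __clzll() leading-zero intrinsic mentioned earlier in the thread (the function name is made up):

Code:
// for a nonzero 64-bit word, the index of the most significant one bit
// (what x86's BSR returns) is 63 minus the leading-zero count
__device__ int bit_scan_reverse(unsigned long long b) {
    return 63 - __clzll((long long)b);
}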

The crypto guys used this all the time, and they were the driving force behind getting a popcnt instruction on Intel as well.