uct on gpu

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

uct on gpu

Post by Daniel Shawul »

I now have a partially working version of UCT on the GPU. The tree is generated and stored on the device, so there is no overhead from CPU-GPU data transfer. Also, the way I implemented it, there doesn't seem to be a significant slowdown going from pure Monte Carlo to UCT.

a) The tree is stored in global memory, so it is shared across multiprocessors.
b) The minimum number of simulations you can ask for is currently 8192, as compared to 1 in the conventional CPU method. For my current GPU device with 14 multiprocessors, I launch 56 blocks with 64 threads each, so all of the blocks are active. Each thread does 128 simulations, for a total of 64x128=8192 simulations, and then consults the tree stored in global memory. Then a UCT_select is done, and a different node could be selected for simulation.
c) I use spin locks for each node, which can be acquired from each of the 56 active blocks. I do not want to allow each thread to work independently on generating and grabbing nodes, because that would increase the idle time on the spinlocks. Later on I may allow a warp to be the smallest unit of work on the tree, but for now it is a block.
Another reason is that the game I am using right now has simple move generation, which allows a game to be run on registers alone. So I want to avoid global memory accesses as much as possible. Chess or Go will probably require some tables that won't fit in shared memory, but for now I want to keep it simple.
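A per-node spin lock like the one in (c) might look like the following host-side C++ analogue. On the device this would be built from atomicCAS/atomicExch instead of std::atomic; the names here are purely illustrative, not from Daniel's code.

```cpp
#include <atomic>
#include <cassert>

// Illustrative per-node spinlock: one lock per tree node, acquired
// block-wise so that a whole block owns a node while it expands it.
struct NodeLock {
    std::atomic<int> flag{0}; // 0 = free, 1 = held

    void lock() {
        int expected = 0;
        // Spin until we flip the flag from 0 (free) to 1 (held).
        while (!flag.compare_exchange_weak(expected, 1)) {
            expected = 0; // compare_exchange_weak overwrites expected on failure
        }
    }
    void unlock() { flag.store(0); }
};
```

With block-wise locking only one lock operation happens per 8192-simulation batch, which is why contention stays low.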

What side effects (if any) do you think the block simulation will have on UCT? For the perft approximations we did before, I used 2 simulations per call, but the results were not added up.

Any ideas and suggestions are welcome
Daniel

Edit: I tested 32 threads in a block (1 warp!) and doubled the number of blocks, i.e. 112 x 32, and it performed only slightly worse: 13 sec vs 11 sec for completing 90 million simulations. The maximum number of active blocks per MP is 8, so this new setting is the best I can do: 8 x 14 = 112. In the test the tree was expanded up to depth=3 and 4600 nodes were added. The game is 8x8 hex, so the branching factor goes like 64, 63, 62... The rate of tree growth is definitely slower than what would be possible if UCT_select were done after 1 simulation. Each thread does 128 cycles before consulting the shared tree, so I will try to lower that and see how it affects tree growth and speed. Maybe I am being paranoid about the global memory access; a few cycles could perform better, who knows...

Edit2: Indeed I was paranoid! I lowered the number of cycles from 128 gradually down to 16, and it finished the solution in the same time but with a much bigger tree. At 16 cycles it in fact used up all of the 2 megabytes of memory I reserved (about 65536 nodes). It does 512 block simulations before checking the tree. I think the CUDA warp execution model hides the latency that well...
smatovic
Posts: 2641
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: uct on gpu

Post by smatovic »

Cool, congrats Daniel!
What side effects (if any) do you think will the block simulation have on UCT? For perft approximations we did before, I used 2 simulations per call but the results were not added up.
I didn't get it... if you use spinlocks block-wise, can a single thread not access a node on its own?


--
Srdja
smatovic
Posts: 2641
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: uct on gpu

Post by smatovic »

Edit: I tested 32 threads in a block (1 warp!) and doubled the block sizes i.e 112 x 32 and it performed only slighlty worse. 13sec vs 11sec for completing
90 million simulations. The maximum number of active blocks per MP is 8 so this new setting is the best I can do 8 X 14 = 112. .
CUDA devices with compute capability 2.0 should have 8 blocks per MP and 256 threads per block, so you should be able to run 8*32 threads per block... if you are not running out of registers...

--
Srdja
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: uct on gpu

Post by Daniel Shawul »

Hi Srdja,
Thanks for thumbs up.
I didnt get it....if you use spinlocks block-wise you can not access a node by one thread alone?
Yes, I use block-wise spinlocks, so there is no need for a thread-wise spinlock. At first I was afraid the tree growth would be too slow,
but now, after reducing the number of cycles (simulations) that a thread does before checking the tree, I see that in fact the tree
grows so fast that I am going to have to find a way to control it.
Cuda devices with compute capabiltiy 2.0 should have 8 Blocks per MP and 256 threads per Block, you should be able to run 8*32 threads per block...if you are not running out of registers...
Yes, but in my case there is little need to share information between threads, unlike what you do with YBW for alpha-beta. My basic Monte Carlo kernel uses

Code: Select all

ptxas info    : Compiling entry function '_Z7playouti' for 'sm_11'
ptxas info    : Used 20 registers, 186+16 bytes smem, 67 bytes cmem[0], 84 bytes cmem[1], 36 bytes cmem[14]
So I can fit roughly 352 threads in a block. The maximum allowed is 512, but the 20 registers used per thread limit it. Due to some other avoidable constraint,
I can only use a power-of-two number of threads, so I use 256 threads per block. In fact I don't need to cram all those threads into one block (even though I could), because
I don't share much between the threads. The device has 14 multiprocessors, so I launch 8 active blocks per MP = 8 x 14 = 112 blocks and 1 warp per block = 32 threads.
Those threads will always be active (no batching), so I keep them busy until the specified number of simulations is reached. Increasing the number of threads or blocks
doesn't increase performance, because then it would start timeslicing to accommodate all the threads. I just have to make sure that there are enough warps (possibly from different
blocks loaded on the same MP) to avoid latency due to global memory reads/writes.
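The register arithmetic behind that thread limit can be sketched roughly as below. I am assuming the 8192 registers per multiprocessor of a compute 1.x device; the real limit also depends on the register allocation granularity the occupancy calculator applies, which is presumably why the figure in the post is 352 rather than this rough bound.

```cpp
#include <cassert>

// Back-of-envelope occupancy bound: how many threads fit on one MP
// given the register budget. 8192 registers per MP is the compute 1.x
// figure; 20 registers per thread is the ptxas number quoted above.
int max_threads_per_mp(int regs_per_mp, int regs_per_thread, int warp_size) {
    int raw = regs_per_mp / regs_per_thread; // 8192 / 20 = 409
    return (raw / warp_size) * warp_size;    // round down to whole warps
}
```

Allocation granularity (registers are reserved per block in fixed-size chunks) shaves this bound down further in practice.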

Here is the tree growth rate for different number of cycles and a 112x32 setup

Code: Select all

cycles   Nodes      Time(sec)
128      4200       13
64       5833       13
32       14327      13
16       45202      13.5
8        80858      14
4        286716     15.6
As you can see, at 4 cycles, i.e. 4 x 32 = 128 simulations per block, the tree growth is really high with only a slight increase in simulation time.
So letting each thread grow the tree will blow up the tree. My Windows "watchdog" is timing out the kernel, so I could not test lower numbers of cycles.
---
cheers
smatovic
Posts: 2641
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: uct on gpu

Post by smatovic »

The device has 14 multiprocessors, so I launch 8 active blocks per MP = 8 x 14 = 112 blocks and 1 warp per block = 32 threads.
Ah, now I see, you run 8*32 threads per MP to hide latency.

May i ask how you store a position with n children in global memory?

I am currently working on an algorithm similar to UCT, RBFM, which also needs to store the tree in memory.
One solution would be to fill a global 2D array with nodes and store the array indices of the parent node and the first child:

Code: Select all

node.move = move
node.children = 20
node.firstchildindex = 1
node.parentnodeindex =  -1

nodes[0] = node
nodes[1] = child1
...
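The flat-array layout sketched above could be fleshed out as follows in C++. The field names follow Srdja's snippet; the `expand` helper and the concrete types are my own illustrative additions.

```cpp
#include <vector>
#include <cassert>

// Flat-array tree node: children are stored contiguously and addressed
// by index, so no pointers are needed (convenient for device memory).
struct ANode {
    int move = 0;
    int children = 0;         // number of children
    int firstchildindex = -1; // index of first child in the array
    int parentnodeindex = -1; // index of parent (-1 for root)
};

// Append n children of node `parent` to the array and wire up the indices.
void expand(std::vector<ANode>& nodes, int parent, int n) {
    nodes[parent].children = n;
    nodes[parent].firstchildindex = (int)nodes.size();
    for (int i = 0; i < n; ++i) {
        ANode c;
        c.parentnodeindex = parent;
        nodes.push_back(c);
    }
}
```

With the root at index 0, its 20 children occupy indices 1..20, matching the example in the post.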
--
Srdja
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: uct on gpu

Post by Daniel Shawul »

My node

Code: Select all


typedef U64 MOVE;

struct Node {
	U32 uct_wins;        // accumulated wins
	U32 uct_visits;      // visit count
	MOVE move;           // move leading to this node (a bitboard)
	Node* parent;        // parent node
	Node* child;         // first child
	Node* next;          // next sibling (left-child/right-sibling tree)
	LOCK lock;           // per-node spinlock
	
	__device__ void clear() {
		uct_wins = 0;
		uct_visits = 0;
		parent = 0;
		child = 0;
		next = 0;
		move = MOVE();
		l_create(lock);
	}
};
I allocate about 128 MB (4 million entries) at startup, and the blocks grab nodes from there. The move is currently a bitboard, so 32 bytes per node are allocated.
I have three pointers: to the parent, the first child, and the next sibling.
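Grabbing nodes from a preallocated pool is typically done with an atomic bump counter (an atomicAdd on the device). The post does not show how Daniel's code does it, so treat this host-side C++ sketch as one plausible scheme, with std::atomic standing in for the device atomic.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>

// Bump allocator over a preallocated node pool: each "grab" atomically
// claims the next free slot, so many blocks can allocate concurrently
// without locks.
template <typename NodeT>
struct NodePool {
    NodeT* pool;
    std::size_t capacity;
    std::atomic<std::size_t> next;

    NodePool(NodeT* p, std::size_t cap) : pool(p), capacity(cap), next(0) {}

    NodeT* grab() {
        std::size_t i = next.fetch_add(1); // atomic bump
        return (i < capacity) ? &pool[i] : nullptr; // nullptr when exhausted
    }
};
```

Running out of pool space is exactly the 65536-node exhaustion Daniel describes in Edit2 above.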
I think that the RBFM approach is a good choice. Keep us posted.
I will make a git page for my engine after I clean up the code.
---
cheers
smatovic
Posts: 2641
Joined: Wed Mar 10, 2010 10:18 pm
Location: Hamburg, Germany
Full name: Srdja Matovic

Re: uct on gpu

Post by smatovic »

My node
Thanks,
Until now I have avoided pointers and structs on the GPU and used arrays.

I will post my results, hope the next approach will produce better ones.

--
Srdja
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: uct on gpu

Post by Daniel Shawul »

until now i avoided pointers and structs on the gpu and used arrays.
Ah, structs of arrays (SoA) vs. arrays of structs (AoS). GPUs prefer SoA, but in this case it may not be worth it, since hash tables and similar data structures are accessed in a random manner. Also, your program will probably look like spaghetti Fortran 77 code. But I didn't really try SoA.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: uct on gpu

Post by Daniel Shawul »

What side effects (if any) do you think will the block simulation have on UCT? For perft approximations we did before, I used 2 simulations per call but the results were not added up.
I think this has a significant effect for GPU UCT. The problem was so severe I had to fix it by modifying the UCT formula for node selection. The problem is that if one multiprocessor takes a node to simulate, it is not going to return after just one simulation (which, by the way, is the case for CPU UCT). It could produce 8192 simulations, for instance. In fact, all the multiprocessors (14 of them) will follow the same path to that node, as there is no randomness in the selection formula!! That was very bad indeed, and it resulted in an effect where the whole device works on only one node at a time, then moves to another, and another... I fixed this problem by adding a workers counter to each node that is incremented and decremented atomically. Then, during selection, nodes that have already been grabbed are given lower priority. For example, unvisited nodes will have the score:

Code: Select all

score = FPU - (n->workers / nBlocks)
The same modification is applied to nodes that have already been visited.
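A sketch of the modified selection score in host-side C++: the unvisited case is exactly the formula above; applying the same workers/nBlocks penalty to a standard UCB1 term for visited nodes is my reading of "same modification", and the exploration constant C is an assumed parameter.

```cpp
#include <cmath>
#include <cassert>

// UCT score with the "workers" penalty from the post: nodes already
// grabbed by other multiprocessors are de-prioritized so the device
// does not pile onto a single path.
double uct_score(int wins, int visits, int parent_visits,
                 int workers, int nBlocks,
                 double C = 1.0, double FPU = 1.1) {
    double penalty = (double)workers / nBlocks;
    if (visits == 0)
        return FPU - penalty; // unvisited: score = FPU - workers/nBlocks
    double mean = (double)wins / visits;
    double ucb  = mean + C * std::sqrt(std::log((double)parent_visits) / visits);
    return ucb - penalty;     // visited: same penalty on the UCB1 term
}
```

With nBlocks = 112, even a handful of blocks parked on a node pushes its score down enough that the remaining blocks fan out across siblings.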
Dave_N
Posts: 153
Joined: Fri Sep 30, 2011 7:48 am

Re: uct on gpu

Post by Dave_N »

Can you use render-to-texture to increase memory? (i.e. some of the register uses could maybe be texture implementations)