UCT on GPU
Posted: Fri Feb 24, 2012 6:52 am
I now have a partially working version of UCT on the GPU. The tree is generated and stored on the device, so there is no overhead from CPU-GPU data transfer. Also, the way I implemented it, there doesn't seem to be a significant slowdown going from pure Monte Carlo to UCT.
a) The tree is stored in global memory, so it is shared across multiprocessors.
b) The minimum number of simulations you can ask for is currently 8192, compared to 1 in the conventional CPU method. For my current GPU device with 14 multiprocessors, I launch 56 blocks with 64 threads each, so all of the blocks are active. Each thread does 128 simulations, for a total of 64x128 = 8192 simulations, and then consults the tree stored in global memory. Then a UCT_select is done, and a different node could be selected for simulation.
c) I use a spin lock for each node, which can be acquired from any of the 56 active blocks. I do not want to allow each thread to work independently on generating and grabbing nodes, because that would increase the idle time on the spinlocks. Later on I may allow a warp to be the smallest unit of work on the tree, but for now it is a block.
Another reason is that the game I am using right now has simple move generation, which allows a game to be run in registers alone. So I want to avoid global memory accesses as much as possible. Chess or Go will probably require some tables that won't fit in shared memory, but for now I want to keep it simple.
What side effects (if any) do you think block simulation will have on UCT? For the perft approximations we did before, I used 2 simulations per call, but the results were not added up.
Any ideas and suggestions are welcome.
Daniel
Edit: I tested 32 threads in a block (1 warp!) and doubled the block count, i.e. 112 x 32, and it performed only slightly worse: 13 sec vs 11 sec for completing 90 million simulations. The maximum number of active blocks per MP is 8, so this new setting is the best I can do: 8 x 14 = 112. In the test the tree was expanded up to depth=3 and 4600 nodes were added. The game is 8x8 hex, so the branching factor goes 64, 63, 62, ... The rate of tree growth is definitely slower than what would be possible if UCT_select were done after every simulation. Each thread does 128 cycles before consulting the shared tree, so I will try to lower that and see how it affects tree growth and speed. Maybe I am being paranoid about the global memory access; a few cycles could perform better, who knows..
Edit2: Indeed I was paranoid! I lowered the number of cycles from 128 gradually down to 16, and it finished the solution in the same time but with a much bigger tree. At 16 cycles it in fact used up all of the 2 megabytes of memory I reserved (about 65536 nodes). It does 512 block simulations before checking the tree. I think the CUDA warp execution model hides the latency that well ...