uct on gpu

Daniel Shawul · Post by **Daniel Shawul** » Mon Feb 27, 2012 9:02 pm

I didn't manipulate results to get that suitable (100x) figure often reported in literature

It is what I got according to my calculation and this result is even on an old gpu of compatibility 1.1. Imagine what the beast fermi could do to it! Without further adieu the result:

I did 45 mil uct simulations on GPU (cuda) and on CPU (OpenMP) :
CPU = 230000 milli seconds
GPU = 5563 milli seconds
GPU/CPU = 41.3

Then the "manuplation" comes in. CPU is 3ghz and cuda processors are 1.25ghz, so on equal grounds that would be (3/1.25) * 41.3 = 99.6 !!

Since the CPU code generates the tree much quickly, I rerun it so that it generates less nodes (checks the tree less often ,about 100x less). I got slightly less speedup but still in the same ballpark.
------------------------------------------------------------------------------
The complete run follows
------------------------------------------------------------------------------

Code: Select all

E&#58;\Alltests\hex>nvcc --maxrregcount 64 hex.cu -o hex -arch=sm_11 -ccbin "C&#58;\Program Files\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler /op
enmp --ptxas-options=-v
hex.cu
tmpxft_00001a48_00000000-3_hex.cudafe1.gpu
tmpxft_00001a48_00000000-8_hex.cudafe2.gpu
hex.cu
./hex.cu&#40;328&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;332&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;338&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;344&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;422&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;422&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;328&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;332&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;338&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;344&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;429&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;273&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
./hex.cu&#40;290&#41;&#58; Warning&#58; Cannot tell what pointer points to, assuming global memory space
ptxas info    &#58; Compiling entry function '_ZN5TABLE10print_treeEi' for 'sm_11'
ptxas info    &#58; Used 13 registers, 4+16 bytes smem, 53 bytes cmem&#91;0&#93;, 32 bytes cmem&#91;1&#93;, 36 bytes cmem&#91;14&#93;
ptxas info    &#58; Compiling entry function '_ZN5TABLE5resetEv' for 'sm_11'
ptxas info    &#58; Used 5 registers, 53 bytes cmem&#91;0&#93;, 36 bytes cmem&#91;14&#93;
ptxas info    &#58; Compiling entry function '_Z7playoutj' for 'sm_11'
ptxas info    &#58; Used 18 registers, 552+16 bytes smem, 53 bytes cmem&#91;0&#93;, 92 bytes cmem&#91;1&#93;, 36 bytes cmem&#91;14&#93;
tmpxft_00001a48_00000000-3_hex.cudafe1.cpp
tmpxft_00001a48_00000000-14_hex.ii

E&#58;\Alltests\hex>hex
 --- General Information for device 0 ---
Name&#58; Quadro FX 3700
Compute capability&#58; 1.1
Clock rate&#58; 1250000
Device copy overlap&#58; Enabled
Kernel execition timeout &#58; Enabled
 --- Memory Information for device 0 ---
Total global mem&#58; 536543232
Total constant Mem&#58; 65536
Max mem pitch&#58; 2147483647
Texture Alignment&#58; 256
 --- MP Information for device 0 ---
Multiprocessor count&#58; 14
Shared mem per mp&#58; 16384
Registers per mp&#58; 8192
Threads in warp&#58; 32
Max threads per block&#58; 512
Max thread dimensions&#58; &#40;512, 512, 64&#41;
Max grid dimensions&#58; &#40;65535, 65535, 1&#41;

nBlocks=42 X nThreads=128
0.0 26051534 45959168 0.566841
        1.0 320443 749568 0.427504
        1.1 309591 712704 0.434389
        1.2 318579 720896 0.441921
        1.3 324636 722944 0.449047
        1.4 334108 735232 0.454425
        1.5 333075 722944 0.460720
        1.6 328187 714752 0.459162
        1.7 321134 696320 0.461187
        1.8 317725 737280 0.430942
        1.9 294177 702464 0.418779
        1.10 296311 706560 0.419371
        1.11 319123 741376 0.430447
        1.12 313588 727040 0.431322
        1.13 325349 731136 0.444991
        1.14 320649 716800 0.447334
        1.15 327181 714752 0.457755
        1.16 311085 716800 0.433991
        1.17 296036 704512 0.420200
        1.18 284808 696320 0.409019
        1.19 303228 743424 0.407880
        1.20 298766 702464 0.425311
        1.21 305367 708608 0.430939
        1.22 313238 708608 0.442047
        1.23 340089 747520 0.454956
        1.24 317631 724992 0.438117
        1.25 308738 731136 0.422272
        1.26 301522 735232 0.410105
        1.27 292630 712704 0.410591
        1.28 288996 698368 0.413816
        1.29 301258 710656 0.423915
        1.30 311704 714752 0.436101
        1.31 309933 694272 0.446414
        1.32 321523 720896 0.446005
        1.33 314992 727040 0.433253
        1.34 280126 694272 0.403482
        1.35 287599 702464 0.409415
        1.36 283096 694272 0.407759
        1.37 286109 698368 0.409682
        1.38 301178 708608 0.425028
        1.39 313143 704512 0.444482
        1.40 344140 751616 0.457867
        1.41 321821 733184 0.438936
        1.42 303774 706560 0.429934
        1.43 293859 704512 0.417110
        1.44 292651 714752 0.409444
        1.45 299109 739328 0.404569
        1.46 292907 708608 0.413355
        1.47 311510 712704 0.437082
        1.48 342379 741376 0.461816
        1.49 331490 747520 0.443453
        1.50 314933 714752 0.440619
        1.51 302164 698368 0.432672
        1.52 302040 704512 0.428722
        1.53 301562 720896 0.418316
        1.54 298117 720896 0.413537
        1.55 311379 714752 0.435646
        1.56 330249 712704 0.463375
        1.57 331215 716800 0.462074
        1.58 325603 724992 0.449113
        1.59 313257 698368 0.448556
        1.60 330508 743424 0.444575
        1.61 306871 702464 0.436849
        1.62 301675 698368 0.431971
        1.63 294899 708608 0.416167
Total nodes in tree&#58; 44146
Errors&#58; no error
time 5563

E&#58;\Alltests\hex>mynvcc

E&#58;\Alltests\hex>nvcc --maxrregcount 64 hex.cu -o hex -arch=sm_11 -ccbin "C&#58;\Program Files\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler /op
enmp --ptxas-options=-v
hex.cu
tmpxft_00001088_00000000-3_hex.cudafe1.gpu
tmpxft_00001088_00000000-8_hex.cudafe2.gpu
hex.cu
ptxas info    &#58; Compiling entry function '__cuda_dummy_entry__' for 'sm_11'
ptxas info    &#58; Used 0 registers
tmpxft_00001088_00000000-3_hex.cudafe1.cpp
tmpxft_00001088_00000000-14_hex.ii

E&#58;\Alltests\hex>hex
Block 0 &#58; Thread 0 of 1
0.0 26956049 45875200 0.587595
        1.0 285697 716800 0.398573
        1.1 296215 716800 0.413246
        1.2 302066 716800 0.421409
        1.3 304944 716800 0.425424
        1.4 308393 716800 0.430236
        1.5 311334 716800 0.434339
        1.6 312983 716800 0.436639
        1.7 316357 716800 0.441346
        1.8 295229 716800 0.411871
        1.9 282094 716800 0.393546
        1.10 286715 716800 0.399993
        1.11 291940 716800 0.407282
        1.12 296803 716800 0.414067
        1.13 301366 716800 0.420432
        1.14 307084 716800 0.428410
        1.15 312986 716800 0.436643
        1.16 298152 716800 0.415949
        1.17 284801 716800 0.397323
        1.18 281071 716800 0.392119
        1.19 282984 716800 0.394788
        1.20 287125 716800 0.400565
        1.21 293889 716800 0.410001
        1.22 301135 716800 0.420110
        1.23 309922 716800 0.432369
        1.24 302701 716800 0.422295
        1.25 289184 716800 0.403438
        1.26 281007 716800 0.392030
        1.27 278771 716800 0.388910
        1.28 281615 716800 0.392878
        1.29 286664 716800 0.399922
        1.30 295080 716800 0.411663
        1.31 305713 716800 0.426497
        1.32 305216 716800 0.425804
        1.33 295183 716800 0.411807
        1.34 285659 716800 0.398520
        1.35 280260 716800 0.390988
        1.36 278022 716800 0.387866
        1.37 282298 716800 0.393831
        1.38 288800 716800 0.402902
        1.39 303020 716800 0.422740
        1.40 309028 716800 0.431122
        1.41 300008 716800 0.418538
        1.42 292816 716800 0.408504
        1.43 285548 716800 0.398365
        1.44 281900 716800 0.393276
        1.45 280186 716800 0.390884
        1.46 284531 716800 0.396946
        1.47 299205 716800 0.417418
        1.48 313206 716800 0.436950
        1.49 305785 716800 0.426597
        1.50 300395 716800 0.419078
        1.51 296256 716800 0.413304
        1.52 290044 716800 0.404637
        1.53 284538 716800 0.396956
        1.54 281402 716800 0.392581
        1.55 294216 716800 0.410458
        1.56 315878 716800 0.440678
        1.57 312280 716800 0.435658
        1.58 310249 716800 0.432825
        1.59 306552 716800 0.427667
        1.60 304670 716800 0.425042
        1.61 301157 716800 0.420141
        1.62 295130 716800 0.411733
        1.63 283681 716784 0.395769
Total nodes in tree&#58; 4194304
time 230953

E&#58;\Alltests\hex>mynvcc

E&#58;\Alltests\hex>nvcc --maxrregcount 64 hex.cu -o hex -arch=sm_11 -ccbin "C&#58;\Program Files\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler /op
enmp --ptxas-options=-v
hex.cu
tmpxft_0000193c_00000000-3_hex.cudafe1.gpu
tmpxft_0000193c_00000000-8_hex.cudafe2.gpu
hex.cu
ptxas info    &#58; Compiling entry function '__cuda_dummy_entry__' for 'sm_11'
ptxas info    &#58; Used 0 registers
tmpxft_0000193c_00000000-3_hex.cudafe1.cpp
tmpxft_0000193c_00000000-14_hex.ii

E&#58;\Alltests\hex>mynvcc

E&#58;\Alltests\hex>nvcc --maxrregcount 64 hex.cu -o hex -arch=sm_11 -ccbin "C&#58;\Program Files\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler /op
enmp --ptxas-options=-v
hex.cu
tmpxft_00001a78_00000000-3_hex.cudafe1.gpu
tmpxft_00001a78_00000000-8_hex.cudafe2.gpu
hex.cu
ptxas info    &#58; Compiling entry function '__cuda_dummy_entry__' for 'sm_11'
ptxas info    &#58; Used 0 registers
tmpxft_00001a78_00000000-3_hex.cudafe1.cpp
tmpxft_00001a78_00000000-14_hex.ii

E&#58;\Alltests\hex>hex
Block 0 &#58; Thread 0 of 1
0.0 25917002 45875200 0.564946
        1.0 309973 716800 0.432440
        1.1 320298 716800 0.446844
        1.2 325234 716800 0.453730
        1.3 326913 716800 0.456073
        1.4 328267 716800 0.457962
        1.5 329356 716800 0.459481
        1.6 331198 716800 0.462051
        1.7 333611 716800 0.465417
        1.8 308014 716800 0.429707
        1.9 297346 716800 0.414824
        1.10 302469 716800 0.421971
        1.11 308482 716800 0.430360
        1.12 314012 716800 0.438075
        1.13 318265 716800 0.444008
        1.14 324154 716800 0.452224
        1.15 330633 716800 0.461263
        1.16 314286 716800 0.438457
        1.17 298969 716800 0.417088
        1.18 294477 716800 0.410822
        1.19 298057 716800 0.415816
        1.20 302509 716800 0.422027
        1.21 309711 716800 0.432074
        1.22 317887 716800 0.443481
        1.23 327544 716800 0.456953
        1.24 318512 716800 0.444353
        1.25 303714 716800 0.423708
        1.26 296283 716800 0.413341
        1.27 293188 716800 0.409023
        1.28 295658 716800 0.412469
        1.29 301671 716800 0.420858
        1.30 311441 716800 0.434488
        1.31 323817 716800 0.451754
        1.32 321556 716800 0.448599
        1.33 309537 716800 0.431832
        1.34 300712 716800 0.419520
        1.35 294625 716800 0.411028
        1.36 293905 716800 0.410024
        1.37 296831 716800 0.414106
        1.38 305926 716800 0.426794
        1.39 319627 716800 0.445908
        1.40 325168 716800 0.453638
        1.41 315082 716800 0.439568
        1.42 307522 716800 0.429021
        1.43 300571 716800 0.419323
        1.44 296205 716800 0.413232
        1.45 294149 716800 0.410364
        1.46 299416 716800 0.417712
        1.47 315161 716800 0.439678
        1.48 329004 716800 0.458990
        1.49 323159 716800 0.450836
        1.50 316023 716800 0.440880
        1.51 310920 716800 0.433761
        1.52 305969 716800 0.426854
        1.53 300227 716800 0.418843
        1.54 295512 716800 0.412266
        1.55 309296 716800 0.431496
        1.56 332796 716800 0.464280
        1.57 328419 716800 0.458174
        1.58 326375 716800 0.455322
        1.59 322991 716800 0.450601
        1.60 319545 716800 0.445794
        1.61 316442 716800 0.441465
        1.62 310132 716800 0.432662
        1.63 298636 715200 0.417556
Total nodes in tree&#58; 254081
time 221734

E&#58;\Alltests\hex>

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 08, 2012 2:08 pm

A ton of very serious bugs were fixed. For starters the uct select was completely broken producing rather strange shallow and uniform tree. But after the many bug fixes, it starts producing good result as shown below. Also the occupancy of my last test was at 50% so I have a chance to increase that to 100% by simply changing configurations. Infact the occupancy has been slightly increased on my current gpu to 67% after the register usage reduced when the bugs are fixed. So that directly translated to performance increase. On capability >1.3 block size of 256 gives 100% occupancy so that will be doubled performance if everything goes well.
One other thing is that now there is absolutely no synchronization (__syncthreads) except for the inherent warp sync. A warp is the smallest working unit to fetch a node from DRAM; previously it was a block. The threads in a warp can diverge since each does its own monte carlo game on the same initial board. That has speeded up the code and simplified it as well.
On the initial position e4! (this is HEX btw!) is chosen as the best move and it was expanded to depth 7 after 45 mil simulation. Time is decreased to 5.375sec compared to previous ~5.5. Tree is also very selective.

Code: Select all

0.h8     22323755     45989376     0.485411
        1.a1       148364       291840     0.508374
        1.b1        88821       177152     0.501383
        1.c1        10350        20992     0.493045
        1.d1         7756        15872     0.488659
        1.e1         8021        16384     0.489563
        1.f1        36382        75264     0.483392
        1.g1        54957       116224     0.472854
        1.h1          947         2048     0.462402
        1.a2        58208       116224     0.500826
        1.b2       254948       501248     0.508626
        1.c2       174206       344064     0.506319
        1.d2        59118       116736     0.506425
        1.e2        58413       116224     0.502590
        1.f2        57629       116736     0.493669
        1.g2         1467         3072     0.477539
        1.h2          680         1536     0.442708
        1.a3        58347       117760     0.495474
        1.b3       239661       470528     0.509345
        1.c3       185939       364032     0.510777
        1.d3       238775       470528     0.507462
        1.e3       174534       344576     0.506518
        1.f3        53076       105472     0.503224
        1.g3        57767       117248     0.492691
        1.h3         1972         4096     0.481445
        1.a4        54223       111616     0.485799
        1.b4       156572       313344     0.499681
        1.c4       259251       508928     0.509406
        1.d4       402122       789504     0.509335
        1.e4      1890305      3686400     0.512778
        1.f4       120209       241152     0.498478
        1.g4        54437       108544     0.501520
        1.h4        57571       119296     0.482590
        1.a5        37182        76800     0.484141
        1.b5        58508       116736     0.501199
        1.c5       172019       337408     0.509825
        1.d5       149083       292352     0.509943
        1.e5     15977301     30803968     0.518677
        1.f5       256793       503296     0.510223
        1.g5        59556       117248     0.507949
        1.h5        57704       118784     0.485789
        1.a6         1699         3584     0.474051
        1.b6        58560       118784     0.492996
        1.c6        59423       117760     0.504611
        1.d6       225991       443392     0.509687
        1.e6       177571       348160     0.510027
        1.f6       211449       414720     0.509860
        1.g6       115356       231424     0.498462
        1.h6        57993       117760     0.492468
        1.a7         2211         4608     0.479818
        1.b7         1935         4096     0.472412
        1.c7        57972       118272     0.490158
        1.d7        58431       117760     0.496187
        1.e7        60376       119296     0.506102
        1.f7       225979       445952     0.506734
        1.g7       130554       256000     0.509977
        1.h7        59264       118784     0.498922
        1.a8         1896         4096     0.462891
        1.b8         1966         4096     0.479980
        1.c8         2192         4608     0.475694
        1.d8        57759       119296     0.484165
        1.e8         1983         4096     0.484131
        1.f8        58658       119296     0.491701
        1.g8        58889       117760     0.500076
        1.h8       127233       249856     0.509225
Total nodes   &#58; 31499
Leaf  nodes   &#58; 30985
Maximum depth &#58; 7
Average depth &#58; 3.72
Errors&#58; no error
time 5375

Daniel Shawul · Post by **Daniel Shawul** » Thu Mar 08, 2012 2:26 pm

Nope texture is constant memory so i can't use it for computation. However I have used the constant memory to store bitboard tables for move generation in chess. I use minimal kindergarten and all other bitboard tables summed up to ~3kb. Since constant/texture memory is cached and infact each mp has 8kb (64kb total) , that made first_one and all attacks (rooks & bishops) calculation as fast what can be achieved if they were all loaded in shared memory or registers. I measured it infact. When those tables are put in global memory I get 10x slow down.
Now the obstacle is where to store the generated moves. I have to use global memory if I am to do any kind of alpha beta. But for monte carlo I need only to generate a random legal move which simplifies that requirement. The problem is I can't even store one move generation step (256 moves max) on shared memory to check which moves are legal. I am left with something that can stor 16 moves in shared mem. May be I will try a sliding test with that available memory. Or finish up the random legal move generation without using any. If that is done I will probably have world's fastest chess perft estimator with a uct_perft. Stay tuned..

Daniel Shawul · Post by **Daniel Shawul** » Mon Mar 12, 2012 11:30 pm

I have finally finished a uct for chess on gpu. It is not that it is difficult but that move generation and other stuff consume space. Go would be ideal for UCT but requirement of linked list of stones for capture detection is a performance killer. Checkers would be another game ideally suited for gpu if i manage to generate those captures ...
It turned out to be a lot more slower than Hex since move generation and legality testing is a pain in the neck. The most annoying thing is that
there is not enough space to declare an int[256] for each thread to hold on the moves generated.
Bear in mind it is not a stack but just one entry. Here are the callenges:

a) The move generation step I mentioned above. I have gone through great length to generate one legal random move:
1 - Count pseudo legal moves
2 - generate a random number and pick which move to test
3 - generate only that move
4 - make the move and test for legality
5 - keep a bitset of 256 bits to mark moves that are already tested. This added 7 more registers (total = 59 now)
6 - If all bits are set as is the case when checkmated, break out.
I use kindergarten bitboards (8 bitboards total) which are light weight (512bytes for bishop/rook attacks).
But legal move generation is a pain due to either big table requirement in_between, direction[64][64] etc..
or something else. If I can generate a random legal move, it would save me from going through all those trouble.

b) Due to high register pressure (almost 60 right now), the occupancy is very low. Only 17% at my GPU right now.
But it is still 2-3x faster. I can offload some of those to shared memory but it is debatable whether trying to increase
occupancy that way is better. Some guys who did FFT using cuda claim that lower occupancy with high register usage actually
runs faster. Also the nvcc optimizer is very good so trying to squeeze out a register or two takes a lot of effort!

b) I use a copy / make instead of do / undo basically because there is n't space for a stack. But to test for legality
of pseudo legal moves I needed to save enpassant,castle,fifty,player flags just for one step. So undo is used there.
For the rest of MC playout a copy/make is used. BTW the 8 bitboards for positions representation simplifies do/undo and
my undo_move is basically do_move itself. A little problem I encounted with that board representation is that it is difficult
/slower to know what kind of piece is being captured. I delay that detection until I actually make the move ..

c) 32 bit firstone and bitcounts are ok but still very slow. I need to count psedudo legal moves before generating them.
The bitboards are very sparse and maybe a simpler version of population count would help.

e) Can't use arrays if you want your variable to be stored on register. Could be a pain for coding with many switch staments needed.

Probably some more. Ideas are welcome especially on how to generate legal random moves for chess preferable without using
large tables.

I guess that if I modify the uct code a bit it would be a fast UCT_perft approximator.

---
cheers

Karlo Bala · Post by **Karlo Bala** » Tue Mar 13, 2012 12:14 am

Daniel Shawul wrote:I have finally finished a uct for chess on gpu. It is not that it is difficult but that move generation and other stuff consume space. Go would be ideal for UCT but requirement of linked list of stones for capture detection is a performance killer. Checkers would be another game ideally suited for gpu if i manage to generate those captures ...
It turned out to be a lot more slower than Hex since move generation and legality testing is a pain in the neck. The most annoying thing is that
there is not enough space to declare an int[256] for each thread to hold on the moves generated.
Bear in mind it is not a stack but just one entry. Here are the callenges:

a) The move generation step I mentioned above. I have gone through great length to generate one legal random move:
1 - Count pseudo legal moves
2 - generate a random number and pick which move to test
3 - generate only that move
4 - make the move and test for legality
5 - keep a bitset of 256 bits to mark moves that are already tested. This added 7 more registers (total = 59 now)
6 - If all bits are set as is the case when checkmated, break out.
I use kindergarten bitboards (8 bitboards total) which are light weight (512bytes for bishop/rook attacks).
But legal move generation is a pain due to either big table requirement in_between, direction[64][64] etc..
or something else. If I can generate a random legal move, it would save me from going through all those trouble.

b) Due to high register pressure (almost 60 right now), the occupancy is very low. Only 17% at my GPU right now.
But it is still 2-3x faster. I can offload some of those to shared memory but it is debatable whether trying to increase
occupancy that way is better. Some guys who did FFT using cuda claim that lower occupancy with high register usage actually
runs faster. Also the nvcc optimizer is very good so trying to squeeze out a register or two takes a lot of effort!

b) I use a copy / make instead of do / undo basically because there is n't space for a stack. But to test for legality
of pseudo legal moves I needed to save enpassant,castle,fifty,player flags just for one step. So undo is used there.
For the rest of MC playout a copy/make is used. BTW the 8 bitboards for positions representation simplifies do/undo and
my undo_move is basically do_move itself. A little problem I encounted with that board representation is that it is difficult
/slower to know what kind of piece is being captured. I delay that detection until I actually make the move ..

c) 32 bit firstone and bitcounts are ok but still very slow. I need to count psedudo legal moves before generating them.
The bitboards are very sparse and maybe a simpler version of population count would help.

e) Can't use arrays if you want your variable to be stored on register. Could be a pain for coding with many switch staments needed.

Probably some more. Ideas are welcome especially on how to generate legal random moves for chess preferable without using
large tables.

I guess that if I modify the uct code a bit it would be a fast UCT_perft approximator.

---
cheers

Perhaps you should simply allow illegal moves. End of a game will be when the king is captured. I suppose that you'll get tactically stronger engine because UCT will explore more across checks due to more king captures.

Daniel Shawul · Post by **Daniel Shawul** » Tue Mar 13, 2012 1:13 am

Perhaps you should simply allow illegal moves. End of a game will be when the king is captured. I suppose that you'll get tactically stronger engine because UCT will explore more across checks due to more king captures.

That is infact what I do for CPU nebiyu in general game playing mode and when using alpha-beta. I have an option "allow_king_capture" that is set to on by default. The reason is that I get more than 100% slow down if I do legality testing (1Mnps vs 0.5Mnps). I was surprized about that at first. A very slow attacksTo() is cause of that problem. But soon I discovered it is not so straight forward when using UCT (maybe I have not thought it through). The problem is if I allow any move when being in check, then on the next ply the opponent _must_ capture my king and not make a random move. I got all sorts of crashes before I realized that... So if that is required, testing for legality (attacks(opp,king)) right after making the move becomes equivalent to letting it go one more ply.
For alpha-beta I test if the side to play has a king otherwise it is considered as checkmated. This doesn't need attacks() however in monte-carlo play a checkmate could be simply skipped since both basically make a random move without checking if their king is attacked.

Karlo Bala · Post by **Karlo Bala** » Tue Mar 13, 2012 1:52 pm

Daniel Shawul wrote: For alpha-beta I test if the side to play has a king otherwise it is considered as checkmated. This doesn't need attacks() however in monte-carlo play a checkmate could be simply skipped since both basically make a random move without checking if their king is attacked.

I had similar problem when I played with the BPIP-DFISA. If one side is checkmated and the other side fails to capture the king it will do it in the next few moves, because most of the time checkmated side will not find a way out (since both side play random moves). If the rules of chess have been somewhat different, and if the goal was to capture the king, the algorithm that treat check (and out of check) like every other move would be perfectly sound.

However, it is just a theory, and theories are cheap

smatovic · Post by **smatovic** » Tue Mar 13, 2012 9:08 pm

Yep, to achieve a full occupancy of the device is nearly impossible with such an complex chess move generation and such little registers.

With Zeta i use something above 60 registers and can run on my Device 16*2*32 threads, full occupancy would mean running 16*24*32 therads.

I ordered an AMD HD 7750, the new GCN architecture has 32 KB registers for each SIMD Unit, each SIMD can run 10*16 threads. So you got 204,8 Bytes per Thread, great! But unfortunately there are still no drivers for Linux.

Maybe you could switch to an QuadBitboard Board presentation, uses only 32 Bytes.

In Cuda there should be somekind of shared memory, in OpenCL called local memory. It is fast and big enough to hold for each thread the Board. So you could save some registers for computation.

--
Srdja

Daniel Shawul · Post by **Daniel Shawul** » Tue Mar 13, 2012 10:43 pm

Yep, to achieve a full occupancy of the device is nearly impossible with such an complex chess move generation and such little registers.

With Zeta i use something above 60 registers and can run on my Device 16*2*32 threads, full occupancy would mean running 16*24*32 therads.

My GPU is an old one so I can't get full occupancy even for HEX (50% only). Even though I use 60 registers for chess, I do not read from global memory any variables so I am not all that unhappy about it. I can control latency by simply doing more monte-carlo simulations which would be impossible if that too uses global memory.

I ordered an AMD HD 7750, the new GCN architecture has 32 KB registers for each SIMD Unit, each SIMD can run 10*16 threads. So you got 204,8 Bytes per Thread, great! But unfortunately there are still no drivers for Linux.

That is great. Hopefully someday I will test it on a fermi which has 3x more registers.

Maybe you could switch to an QuadBitboard Board presentation, uses only 32 Bytes.

I have in total 9 bitboards (18 registers) so it is not a lot but the calculations to generate one random legal move are intensive and before you know it you hit 60 registers. Also quad bitboards require SIMD operations for decoding if I am not mistaken. GPUs are SIMT and as such do not have SSE instructions.

In Cuda there should be somekind of shared memory, in OpenCL called local memory. It is fast and big enough to hold for each thread the Board. So you could save some registers for computation.

I have infact ample space from shared memory that I can spend. But so far no improvement. I put the Board struct on shared memory , neither the register usage decreased nor it run equally faster. But theoretically that is the only option left. Btw what cuda calls "local memory" is a disguised slow as a tortoise global memory (much like thread local storage in cpu). I can spill some registers to local memory during compilation but it runs slower. Also as I mentioned before manually moving some variables to shared mem didn't improve performance for some people working on fast libraries. But I will keep on digging.
-----
cheers

Daniel Shawul · Post by **Daniel Shawul** » Wed Mar 14, 2012 3:21 am

Here is the interesting article on occupancy which has put me in dilemma. They argue the programming guides of nvidia are misleading by showing an example that achieves 84% of the peak performance at a mere 4% occupancy. They also say shared memory is not as fast as advertised and the gap between that and arithmetic throughput is ever increasing with the newer gpus. And much more ...

uct on gpu

100x speed up

Re: uct on gpu

Re: uct on gpu

uct for chess

Re: uct for chess

Re: uct for chess

Re: uct for chess

Re: uct for chess

Re: uct for chess

Re: uct for chess