I ran about 45 million UCT simulations on the GPU (CUDA) and on the CPU (OpenMP):
CPU = 230953 milliseconds
GPU = 5563 milliseconds
Speedup (CPU time / GPU time) = 41.5
Then the "manipulation" comes in. The CPU runs at 3 GHz and the CUDA processors at 1.25 GHz, so on equal grounds that would be (3/1.25) * 41.5 = 99.6 !!
Since the CPU code grows the tree much more quickly, I reran it so that it generates fewer nodes (it checks the tree less often, about 100x less). I got a slightly smaller speedup (221734 / 5563 ≈ 39.9), but still in the same ballpark.
------------------------------------------------------------------------------
The complete run follows
------------------------------------------------------------------------------
E:\Alltests\hex>nvcc --maxrregcount 64 hex.cu -o hex -arch=sm_11 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler /openmp --ptxas-options=-v
hex.cu
tmpxft_00001a48_00000000-3_hex.cudafe1.gpu
tmpxft_00001a48_00000000-8_hex.cudafe2.gpu
hex.cu
./hex.cu(328): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(332): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(338): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(344): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(422): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(422): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(328): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(332): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(338): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(344): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(429): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(273): Warning: Cannot tell what pointer points to, assuming global memory space
./hex.cu(290): Warning: Cannot tell what pointer points to, assuming global memory space
ptxas info : Compiling entry function '_ZN5TABLE10print_treeEi' for 'sm_11'
ptxas info : Used 13 registers, 4+16 bytes smem, 53 bytes cmem[0], 32 bytes cmem[1], 36 bytes cmem[14]
ptxas info : Compiling entry function '_ZN5TABLE5resetEv' for 'sm_11'
ptxas info : Used 5 registers, 53 bytes cmem[0], 36 bytes cmem[14]
ptxas info : Compiling entry function '_Z7playoutj' for 'sm_11'
ptxas info : Used 18 registers, 552+16 bytes smem, 53 bytes cmem[0], 92 bytes cmem[1], 36 bytes cmem[14]
tmpxft_00001a48_00000000-3_hex.cudafe1.cpp
tmpxft_00001a48_00000000-14_hex.ii
E:\Alltests\hex>hex
--- General Information for device 0 ---
Name: Quadro FX 3700
Compute capability: 1.1
Clock rate: 1250000
Device copy overlap: Enabled
Kernel execition timeout : Enabled
--- Memory Information for device 0 ---
Total global mem: 536543232
Total constant Mem: 65536
Max mem pitch: 2147483647
Texture Alignment: 256
--- MP Information for device 0 ---
Multiprocessor count: 14
Shared mem per mp: 16384
Registers per mp: 8192
Threads in warp: 32
Max threads per block: 512
Max thread dimensions: (512, 512, 64)
Max grid dimensions: (65535, 65535, 1)
nBlocks=42 X nThreads=128
0.0 26051534 45959168 0.566841
1.0 320443 749568 0.427504
1.1 309591 712704 0.434389
1.2 318579 720896 0.441921
1.3 324636 722944 0.449047
1.4 334108 735232 0.454425
1.5 333075 722944 0.460720
1.6 328187 714752 0.459162
1.7 321134 696320 0.461187
1.8 317725 737280 0.430942
1.9 294177 702464 0.418779
1.10 296311 706560 0.419371
1.11 319123 741376 0.430447
1.12 313588 727040 0.431322
1.13 325349 731136 0.444991
1.14 320649 716800 0.447334
1.15 327181 714752 0.457755
1.16 311085 716800 0.433991
1.17 296036 704512 0.420200
1.18 284808 696320 0.409019
1.19 303228 743424 0.407880
1.20 298766 702464 0.425311
1.21 305367 708608 0.430939
1.22 313238 708608 0.442047
1.23 340089 747520 0.454956
1.24 317631 724992 0.438117
1.25 308738 731136 0.422272
1.26 301522 735232 0.410105
1.27 292630 712704 0.410591
1.28 288996 698368 0.413816
1.29 301258 710656 0.423915
1.30 311704 714752 0.436101
1.31 309933 694272 0.446414
1.32 321523 720896 0.446005
1.33 314992 727040 0.433253
1.34 280126 694272 0.403482
1.35 287599 702464 0.409415
1.36 283096 694272 0.407759
1.37 286109 698368 0.409682
1.38 301178 708608 0.425028
1.39 313143 704512 0.444482
1.40 344140 751616 0.457867
1.41 321821 733184 0.438936
1.42 303774 706560 0.429934
1.43 293859 704512 0.417110
1.44 292651 714752 0.409444
1.45 299109 739328 0.404569
1.46 292907 708608 0.413355
1.47 311510 712704 0.437082
1.48 342379 741376 0.461816
1.49 331490 747520 0.443453
1.50 314933 714752 0.440619
1.51 302164 698368 0.432672
1.52 302040 704512 0.428722
1.53 301562 720896 0.418316
1.54 298117 720896 0.413537
1.55 311379 714752 0.435646
1.56 330249 712704 0.463375
1.57 331215 716800 0.462074
1.58 325603 724992 0.449113
1.59 313257 698368 0.448556
1.60 330508 743424 0.444575
1.61 306871 702464 0.436849
1.62 301675 698368 0.431971
1.63 294899 708608 0.416167
Total nodes in tree: 44146
Errors: no error
time 5563
E:\Alltests\hex>mynvcc
E:\Alltests\hex>nvcc --maxrregcount 64 hex.cu -o hex -arch=sm_11 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler /openmp --ptxas-options=-v
hex.cu
tmpxft_00001088_00000000-3_hex.cudafe1.gpu
tmpxft_00001088_00000000-8_hex.cudafe2.gpu
hex.cu
ptxas info : Compiling entry function '__cuda_dummy_entry__' for 'sm_11'
ptxas info : Used 0 registers
tmpxft_00001088_00000000-3_hex.cudafe1.cpp
tmpxft_00001088_00000000-14_hex.ii
E:\Alltests\hex>hex
Block 0 : Thread 0 of 1
0.0 26956049 45875200 0.587595
1.0 285697 716800 0.398573
1.1 296215 716800 0.413246
1.2 302066 716800 0.421409
1.3 304944 716800 0.425424
1.4 308393 716800 0.430236
1.5 311334 716800 0.434339
1.6 312983 716800 0.436639
1.7 316357 716800 0.441346
1.8 295229 716800 0.411871
1.9 282094 716800 0.393546
1.10 286715 716800 0.399993
1.11 291940 716800 0.407282
1.12 296803 716800 0.414067
1.13 301366 716800 0.420432
1.14 307084 716800 0.428410
1.15 312986 716800 0.436643
1.16 298152 716800 0.415949
1.17 284801 716800 0.397323
1.18 281071 716800 0.392119
1.19 282984 716800 0.394788
1.20 287125 716800 0.400565
1.21 293889 716800 0.410001
1.22 301135 716800 0.420110
1.23 309922 716800 0.432369
1.24 302701 716800 0.422295
1.25 289184 716800 0.403438
1.26 281007 716800 0.392030
1.27 278771 716800 0.388910
1.28 281615 716800 0.392878
1.29 286664 716800 0.399922
1.30 295080 716800 0.411663
1.31 305713 716800 0.426497
1.32 305216 716800 0.425804
1.33 295183 716800 0.411807
1.34 285659 716800 0.398520
1.35 280260 716800 0.390988
1.36 278022 716800 0.387866
1.37 282298 716800 0.393831
1.38 288800 716800 0.402902
1.39 303020 716800 0.422740
1.40 309028 716800 0.431122
1.41 300008 716800 0.418538
1.42 292816 716800 0.408504
1.43 285548 716800 0.398365
1.44 281900 716800 0.393276
1.45 280186 716800 0.390884
1.46 284531 716800 0.396946
1.47 299205 716800 0.417418
1.48 313206 716800 0.436950
1.49 305785 716800 0.426597
1.50 300395 716800 0.419078
1.51 296256 716800 0.413304
1.52 290044 716800 0.404637
1.53 284538 716800 0.396956
1.54 281402 716800 0.392581
1.55 294216 716800 0.410458
1.56 315878 716800 0.440678
1.57 312280 716800 0.435658
1.58 310249 716800 0.432825
1.59 306552 716800 0.427667
1.60 304670 716800 0.425042
1.61 301157 716800 0.420141
1.62 295130 716800 0.411733
1.63 283681 716784 0.395769
Total nodes in tree: 4194304
time 230953
E:\Alltests\hex>mynvcc
E:\Alltests\hex>nvcc --maxrregcount 64 hex.cu -o hex -arch=sm_11 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler /openmp --ptxas-options=-v
hex.cu
tmpxft_0000193c_00000000-3_hex.cudafe1.gpu
tmpxft_0000193c_00000000-8_hex.cudafe2.gpu
hex.cu
ptxas info : Compiling entry function '__cuda_dummy_entry__' for 'sm_11'
ptxas info : Used 0 registers
tmpxft_0000193c_00000000-3_hex.cudafe1.cpp
tmpxft_0000193c_00000000-14_hex.ii
E:\Alltests\hex>mynvcc
E:\Alltests\hex>nvcc --maxrregcount 64 hex.cu -o hex -arch=sm_11 -ccbin "C:\Program Files\Microsoft Visual Studio 9.0\VC\bin" -Xcompiler /openmp --ptxas-options=-v
hex.cu
tmpxft_00001a78_00000000-3_hex.cudafe1.gpu
tmpxft_00001a78_00000000-8_hex.cudafe2.gpu
hex.cu
ptxas info : Compiling entry function '__cuda_dummy_entry__' for 'sm_11'
ptxas info : Used 0 registers
tmpxft_00001a78_00000000-3_hex.cudafe1.cpp
tmpxft_00001a78_00000000-14_hex.ii
E:\Alltests\hex>hex
Block 0 : Thread 0 of 1
0.0 25917002 45875200 0.564946
1.0 309973 716800 0.432440
1.1 320298 716800 0.446844
1.2 325234 716800 0.453730
1.3 326913 716800 0.456073
1.4 328267 716800 0.457962
1.5 329356 716800 0.459481
1.6 331198 716800 0.462051
1.7 333611 716800 0.465417
1.8 308014 716800 0.429707
1.9 297346 716800 0.414824
1.10 302469 716800 0.421971
1.11 308482 716800 0.430360
1.12 314012 716800 0.438075
1.13 318265 716800 0.444008
1.14 324154 716800 0.452224
1.15 330633 716800 0.461263
1.16 314286 716800 0.438457
1.17 298969 716800 0.417088
1.18 294477 716800 0.410822
1.19 298057 716800 0.415816
1.20 302509 716800 0.422027
1.21 309711 716800 0.432074
1.22 317887 716800 0.443481
1.23 327544 716800 0.456953
1.24 318512 716800 0.444353
1.25 303714 716800 0.423708
1.26 296283 716800 0.413341
1.27 293188 716800 0.409023
1.28 295658 716800 0.412469
1.29 301671 716800 0.420858
1.30 311441 716800 0.434488
1.31 323817 716800 0.451754
1.32 321556 716800 0.448599
1.33 309537 716800 0.431832
1.34 300712 716800 0.419520
1.35 294625 716800 0.411028
1.36 293905 716800 0.410024
1.37 296831 716800 0.414106
1.38 305926 716800 0.426794
1.39 319627 716800 0.445908
1.40 325168 716800 0.453638
1.41 315082 716800 0.439568
1.42 307522 716800 0.429021
1.43 300571 716800 0.419323
1.44 296205 716800 0.413232
1.45 294149 716800 0.410364
1.46 299416 716800 0.417712
1.47 315161 716800 0.439678
1.48 329004 716800 0.458990
1.49 323159 716800 0.450836
1.50 316023 716800 0.440880
1.51 310920 716800 0.433761
1.52 305969 716800 0.426854
1.53 300227 716800 0.418843
1.54 295512 716800 0.412266
1.55 309296 716800 0.431496
1.56 332796 716800 0.464280
1.57 328419 716800 0.458174
1.58 326375 716800 0.455322
1.59 322991 716800 0.450601
1.60 319545 716800 0.445794
1.61 316442 716800 0.441465
1.62 310132 716800 0.432662
1.63 298636 715200 0.417556
Total nodes in tree: 254081
time 221734
E:\Alltests\hex>