smp experiments on gpu...
...working recently on a parallel alpha-beta search implementation on the GPU,
here are my current results...
SHT
shared hash table with no further adjustments, up to 16 workers and 256 MB
no speedup in time to depth.
ABDADA
up to 4x speedup in time to depth with 16 workers, but a lousy implementation:
the atomic functions for the move hash table locks slow the engine down significantly.
YBWC
seems not to be possible on OpenCL 1.x devices,
maybe on 2.x or CUDA with the kernel recursion feature...
Lazy SMP
- no thread-wise depth increase
- retrieving alpha values and the ttmove from the hash table
- randomized move ordering on the first 1-3 plies for even helper threads,
but the oldest son or ttmove is searched first
- randomized move ordering on the last 1-3 plies for odd helper threads,
but the oldest son or ttmove is searched first
- termination when the root or a helper thread finishes
about 4x speedup in time to depth with 16 workers on the Kaufman positions at depth 9,
and a gain of about 2 to 3 plies in game play.
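To make the setup a bit more concrete, here is a minimal CPU-style sketch of the worker harness (plain C++ threads; Position and search() are just stand-ins for the real engine code, and the randomized ordering lives inside the search, so it is only described in the comments):
Code:
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct Position { /* board state would live here */ };

std::atomic<bool> stop{false};

// Stand-in for the engine's alpha-beta search. A real worker would probe the
// shared hash table for alpha values and the ttmove, and (for helper ids) shuffle
// the move order on the first (even ids) or last (odd ids) 1-3 plies, while still
// trying the oldest son / ttmove first. Here it only burns time proportional to depth.
int search(const Position &, int depth, int alpha, int beta, int worker_id)
{
    (void)alpha; (void)beta; (void)worker_id;
    volatile long work = 0;
    for (long i = 0; i < depth * 1000000L && !stop.load(); ++i) work += i;
    return 0;
}

int lazy_smp(const Position &root, int depth, int n_workers)
{
    std::vector<int> scores(n_workers, 0);
    std::vector<std::thread> pool;
    for (int id = 0; id < n_workers; ++id)
        pool.emplace_back([&, id] {
            scores[id] = search(root, depth, -32000, 32000, id);
            stop.store(true);          // termination: the first finished worker stops the rest
        });
    for (auto &t : pool) t.join();
    return scores[0];                  // the root worker's (id 0) score is the one reported
}

int main()
{
    Position root;
    std::printf("score: %d\n", lazy_smp(root, 9, 16));
    return 0;
}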
--
Srdja
Re: smp experiments on gpu...
Maybe the problem is the nature of warps and threads on GPUs.
All bazillion threads run in lockstep all at the same time, and they also wait for the slowest thread to finish, likely because it had more branches than the other threads did.
Since (for instance) the move generator can be made to run so blazingly fast, as Ankan did:
https://github.com/ankan-ban/perft_gpu
Another approach might be to have a gather/scatter or map/reduce type of metaphor, with each of a zillion threads working on an independent problem and a gather step collecting the results. Perhaps the gather/map step has each of the threads calculate a tiny sphere of search, with some threads working on problems way down the tree, and some right at the root. Way different from the traditional search, and I have no idea how you would prune, except perhaps to make decisions after the gather/reduce step completes.
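Roughly the shape I have in mind, as a toy CPU-side sketch (the per-move scoring is just a stub; nothing here comes from a real GPU engine): scatter independent sub-searches across threads, then gather the best result in one reduction.
Code:
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for "search one small independent subproblem".
int score_subproblem(int move_index)
{
    return (move_index * 37) % 100;    // fake score, just something to reduce over
}

int main()
{
    const int n_moves = 32;
    std::vector<int> scores(n_moves);
    std::vector<std::thread> threads;

    // Scatter/map step: one independent worker per root move, no communication.
    for (int m = 0; m < n_moves; ++m)
        threads.emplace_back([m, &scores] { scores[m] = score_subproblem(m); });
    for (auto &t : threads) t.join();

    // Gather/reduce step: one pass over the collected results picks the best.
    auto best = std::max_element(scores.begin(), scores.end());
    std::printf("best move %ld, score %d\n", (long)(best - scores.begin()), *best);
    return 0;
}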
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Re: smp experiments on gpu...
...consider it more a silly hack than a competitive approach:
my current design used to lack nps throughput per worker,
and now it lacks parallel speedup scaling across workers.
It would be great to see how Ankan makes use of his giga-nps monster in a game-playing engine... but I guess the techniques he needs have yet to be invented...
--
Srdja
Re: smp experiments on gpu...
Sorry, I forgot to mention that here one worker consists of 64 threads,
which indeed run in lockstep on the GPU;
they work in parallel during move generation, move picking and evaluation on the same node.
Like Daniel and Vincent (btw, where is he?) already pointed out,
you don't need to run that many threads to utilize a GPU;
with less memory access you have to hide less memory latency.
But I admit that an engine which could also make use of the high memory bandwidth of
GPUs would make more sense.
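To illustrate the square-per-thread idea with a toy example (the board encoding and piece values are made up for the example, not my engine's real data structures): during evaluation each of the 64 lanes scores its own square and a small reduction combines the results; a plain loop stands in for the lockstep lanes.
Code:
#include <array>
#include <cstdio>
#include <numeric>

int main()
{
    // Made-up encoding: 0 = empty, 1..6 = own P,N,B,R,Q,K, negative = opponent pieces.
    std::array<int, 64> board{};
    board[0] = 4;  board[7] = 4;  board[4] = 6;     // own rooks and king
    board[59] = -5; board[60] = -6;                 // opponent queen and king
    const int value[7] = {0, 100, 320, 330, 500, 900, 0};

    // "Map" phase: one lane per square scores the piece sitting on it.
    // On the GPU the 64 lanes of a worker do this in lockstep; a loop stands in here.
    std::array<int, 64> lane_score{};
    for (int lane = 0; lane < 64; ++lane) {
        int p = board[lane];
        lane_score[lane] = (p > 0) ? value[p] : -value[-p];
    }

    // Reduction phase: combine the 64 lane results into the node's evaluation.
    int eval = std::accumulate(lane_score.begin(), lane_score.end(), 0);
    std::printf("material eval: %d\n", eval);       // prints 100 for this toy position
    return 0;
}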
--
Srdja
Re: smp experiments on gpu...1024 workers
Scaling on the starting position, depth 12, on an Nvidia Titan X...
Code:
### workers #nps #nps speedup #time in s #time to depth speedup ###
### 1 57350 1.000000 136.184000 1.000000
### 2 113864 1.985423 123.176000 1.105605
### 4 226090 3.942284 89.658000 1.518927
### 8 466873 8.140767 47.153000 2.888130
### 16 991879 17.295187 30.398000 4.480032
### 32 2085310 36.361116 21.543000 6.321497
### 64 4245302 74.024446 12.967000 10.502352
### 128 8338703 145.400227 13.982000 9.739951
### 256 12554769 218.914891 12.396000 10.986125
### 512 14254878 248.559337 6.159000 22.111382
### 1024 13575501 236.713182 5.036000 27.042097
--
Srdja
PS: thanks to Ankan for the test runs.
Re: smp experiments on gpu...1024 workers
The scaling with 512 workers shows only about 50% SMP loss in NPS speedup (248.6x with 512 workers).
Unfortunately, the time to depth is suffering for some reason.
Especially surprising is that going from 1 worker to 2 improves time to depth by only about 10%.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Re: smp experiments on gpu...1024 workers
This is probably an effect of the random move order for the helper workers;
the fewer workers are running, the less random the order should be.
Code:
### workers #nps #nps speedup #time in s #time to depth speedup ###
### 64 4245302 74.024446 12.967000 10.502352
### 128 8338703 145.400227 13.982000 9.739951
### 256 12554769 218.914891 12.396000 10.986125
There is obviously room for optimization.
--
Srdja
Re: smp experiments on gpu...fun with ABDADA
Inspired by Tom Kerrigan's ABDADA results,
http://talkchess.com/forum/viewtopic.ph ... ght=abdada
http://www.tckerrigan.com/Chess/Parallel_Search/
I gave another implementation* on the GPU a try,
and the results now look more reasonable:
# Startpos, depth 12, mean of 4 runs, Nvidia GTX 750:
Code:
### workers #nps #nps speedup #time in s #ttd speedup #relative ttd speedup
### 1 45843 1.000000 228.534000 1.000000 1.000000
### 2 99110 2.161944 131.521500 1.737617 1.737617
### 4 210107 4.583186 62.6065000 3.650323 2.100764
### 8 432930 9.443754 38.1875000 5.984523 1.639450
### 16 796515 17.374845 21.9432500 10.41477 1.740284
# Startpos, depth 12, mean of 4 runs, AMD Fury X:
Code:
### workers #nps #nps speedup #time in s #ttd speedup #relative ttd speedup
### 1 31054 1.000000 296.711000 1.000000 1.000000
### 2 62135 2.000869 165.971750 1.787719 1.787719
### 4 124471 4.008212 92.3350000 3.213418 1.797495
### 8 248226 7.993366 47.2362500 6.281425 1.954748
### 16 497841 16.03146 33.6230000 8.824643 1.404879
### 32 996298 32.08275 25.2667500 11.74314 1.330721
### 64 1980535 63.77713 25.1675000 11.78945 1.003943
### 128 3924712 126.3834 14.5877500 20.33973 1.725248
### 256 7598046 244.6720 11.8680000 25.00092 1.229166
*Above 64 workers I have to randomize the move order again to achieve a speedup.
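For reference, the core of the ABDADA move loop looks roughly like this: a CPU-style sketch loosely after the simplified scheme Kerrigan describes, with Position, generate_moves(), make_move() etc. as hypothetical stand-ins, not my actual OpenCL code.
Code:
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical engine hooks, assumed to exist elsewhere:
struct Position { std::uint64_t hash; /* ... */ };
int  evaluate(const Position &pos);
std::vector<int> generate_moves(const Position &pos);   // ordered, ttmove first
void make_move(Position &pos, int move);
void unmake_move(Position &pos, int move);

// Small shared "currently searching" table: a slot holds the key of a
// (position, move) pair that some worker is searching right now.
constexpr int CS_SIZE = 1 << 15;
std::atomic<std::uint64_t> cs_table[CS_SIZE];

std::uint64_t cs_key(std::uint64_t pos_hash, int move)
{
    return pos_hash ^ (std::uint64_t(move) * 0x9E3779B97F4A7C15ULL);
}
bool busy(std::uint64_t key)            { return cs_table[key % CS_SIZE].load() == key; }
void starting_search(std::uint64_t key) { cs_table[key % CS_SIZE].store(key); }
void finished_search(std::uint64_t key)
{
    std::uint64_t expected = key;                        // clear the slot only if still ours
    cs_table[key % CS_SIZE].compare_exchange_strong(expected, 0);
}

int search_abdada(Position &pos, int depth, int alpha, int beta)
{
    if (depth == 0) return evaluate(pos);
    std::vector<int> moves = generate_moves(pos);
    std::vector<int> deferred;

    for (std::size_t i = 0; i < moves.size(); ++i) {
        std::uint64_t key = cs_key(pos.hash, moves[i]);
        if (i > 0 && busy(key)) {          // another worker is on this move: postpone it
            deferred.push_back(moves[i]);
            continue;
        }
        starting_search(key);
        make_move(pos, moves[i]);
        int score = -search_abdada(pos, depth - 1, -beta, -alpha);
        unmake_move(pos, moves[i]);
        finished_search(key);
        if (score >= beta) return score;   // fail-high cutoff
        if (score > alpha) alpha = score;
    }
    for (int m : deferred) {               // second pass: search the postponed moves anyway
        make_move(pos, m);
        int score = -search_abdada(pos, depth - 1, -beta, -alpha);
        unmake_move(pos, m);
        if (score >= beta) return score;
        if (score > alpha) alpha = score;
    }
    return alpha;
}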
--
Srdja
- Full name: Edsel Apostol
Re: smp experiments on gpu...fun with ABDADA
This means ABDADA is the most efficient of them all?
Re: smp experiments on gpu...fun with ABDADA
Yes.
*YBWC not implemented and tested.
** In my implementation plain ABDADA scales only up to 32 or 64 workers.
--
Srdja