smp experiments on gpu...
...working recently on a parallel alpha-beta search implementation on the GPU,
here are my current results...
SHT
shared hash table with no further adjustments, up to 16 workers and 256 MB
no speedup in time to depth.
ABDADA
up to 4x speedup in time to depth with 16 workers, but a lousy implementation:
the atomic functions for the move hash table locks slow the engine down significantly.
YBWC
seems not to be possible on OpenCL 1.x devices,
maybe on 2.x or CUDA with the kernel recursion feature...
Lazy SMP
- no thread-wise depth increase
- retrieving alpha values and the ttmove from the hash table
- randomized move ordering on the first 1-3 plies for even helper threads,
but the oldest son or ttmove is searched first
- randomized move ordering on the last 1-3 plies for odd helper threads,
but the oldest son or ttmove is searched first
- termination when the root or a helper thread finishes
about 4x speedup in time to depth with 16 workers on the Kaufman positions at depth 9,
and a gain of about 2 to 3 plies in game play.
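To make the setup a bit more concrete, here is a minimal CPU-style sketch of the worker harness (plain C++ threads; Position and search() are just stand-ins for the real engine code, and the randomized ordering lives inside the search, so it is only described in the comments):
Code:
#include <atomic>
#include <cstdio>
#include <thread>
#include <vector>

struct Position { /* board state would live here */ };

std::atomic<bool> stop{false};

// Stand-in for the engine's alpha-beta search. A real worker would probe the
// shared hash table for alpha values and the ttmove, and (for helper ids) shuffle
// the move order on the first (even ids) or last (odd ids) 1-3 plies, while still
// trying the oldest son / ttmove first. Here it only burns time proportional to depth.
int search(const Position &, int depth, int alpha, int beta, int worker_id)
{
    (void)alpha; (void)beta; (void)worker_id;
    volatile long work = 0;
    for (long i = 0; i < depth * 1000000L && !stop.load(); ++i) work += i;
    return 0;
}

int lazy_smp(const Position &root, int depth, int n_workers)
{
    std::vector<int> scores(n_workers, 0);
    std::vector<std::thread> pool;
    for (int id = 0; id < n_workers; ++id)
        pool.emplace_back([&, id] {
            scores[id] = search(root, depth, -32000, 32000, id);
            stop.store(true);          // termination: the first finished worker stops the rest
        });
    for (auto &t : pool) t.join();
    return scores[0];                  // the root worker's (id 0) score is the one reported
}

int main()
{
    Position root;
    std::printf("score: %d\n", lazy_smp(root, 9, 16));
    return 0;
}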
--
Srdja
Re: smp experiments on gpu...
Maybe the problem is the nature of warps and threads on GPUs.
All bazillion threads run in lockstep all at the same time, and they also wait for the slowest thread to finish, likely because it had more branches than the other threads did.
Since (for instance) the move generator can be made to run so blazingly fast, as Ankan did:
https://github.com/ankan-ban/perft_gpu
Another approach might be to have a gather/scatter or map/reduce type of metaphor, with each of a zillion threads working on an independent problem and a gather step collecting the results. Perhaps the gather/map step has each of the threads calculate a tiny sphere of search, with some threads working on problems way down the tree, and some right at the root. Way different from the traditional search, and I have no idea how you would prune, except perhaps to make decisions after the gather/reduce step completes.
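Roughly the shape I have in mind, as a toy CPU-side sketch (the per-move scoring is just a stub; nothing here comes from a real GPU engine): scatter independent sub-searches across threads, then gather the best result in one reduction.
Code:
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

// Hypothetical stand-in for "search one small independent subproblem".
int score_subproblem(int move_index)
{
    return (move_index * 37) % 100;    // fake score, just something to reduce over
}

int main()
{
    const int n_moves = 32;
    std::vector<int> scores(n_moves);
    std::vector<std::thread> threads;

    // Scatter/map step: one independent worker per root move, no communication.
    for (int m = 0; m < n_moves; ++m)
        threads.emplace_back([m, &scores] { scores[m] = score_subproblem(m); });
    for (auto &t : threads) t.join();

    // Gather/reduce step: one pass over the collected results picks the best.
    auto best = std::max_element(scores.begin(), scores.end());
    std::printf("best move %ld, score %d\n", (long)(best - scores.begin()), *best);
    return 0;
}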
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Re: smp experiments on gpu...
...consider it more a silly hack than a competitive approach:
my current design used to lack nps throughput per worker,
and now it lacks parallel speedup scaling across workers.
It would be great to see how Ankan makes use of his giga-nps monster in a game-playing engine... but I guess the techniques he needs have yet to be invented...
--
Srdja
Re: smp experiments on gpu...
Sorry, I forgot to mention that here one worker consists of 64 threads,
which indeed run in lockstep on the GPU;
they work in parallel during move generation, move picking and evaluation on the same node.
Like Daniel and Vincent (btw, where is he?) already pointed out,
you don't need to run that many threads to utilize a GPU;
with less memory access you have to hide less memory latency.
But I admit that an engine which could also make use of the high memory bandwidth of
GPUs would make more sense.
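To illustrate the square-per-thread idea with a toy example (the board encoding and piece values are made up for the example, not my engine's real data structures): during evaluation each of the 64 lanes scores its own square and a small reduction combines the results; a plain loop stands in for the lockstep lanes.
Code:
#include <array>
#include <cstdio>
#include <numeric>

int main()
{
    // Made-up encoding: 0 = empty, 1..6 = own P,N,B,R,Q,K, negative = opponent pieces.
    std::array<int, 64> board{};
    board[0] = 4;  board[7] = 4;  board[4] = 6;     // own rooks and king
    board[59] = -5; board[60] = -6;                 // opponent queen and king
    const int value[7] = {0, 100, 320, 330, 500, 900, 0};

    // "Map" phase: one lane per square scores the piece sitting on it.
    // On the GPU the 64 lanes of a worker do this in lockstep; a loop stands in here.
    std::array<int, 64> lane_score{};
    for (int lane = 0; lane < 64; ++lane) {
        int p = board[lane];
        lane_score[lane] = (p > 0) ? value[p] : -value[-p];
    }

    // Reduction phase: combine the 64 lane results into the node's evaluation.
    int eval = std::accumulate(lane_score.begin(), lane_score.end(), 0);
    std::printf("material eval: %d\n", eval);       // prints 100 for this toy position
    return 0;
}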
--
Srdja
Re: smp experiments on gpu...1024 workers
Scaling on the starting position, depth 12, on an Nvidia Titan X...
Code:
### workers #nps #nps speedup #time in s #time to depth speedup ###
### 1 57350 1.000000 136.184000 1.000000
### 2 113864 1.985423 123.176000 1.105605
### 4 226090 3.942284 89.658000 1.518927
### 8 466873 8.140767 47.153000 2.888130
### 16 991879 17.295187 30.398000 4.480032
### 32 2085310 36.361116 21.543000 6.321497
### 64 4245302 74.024446 12.967000 10.502352
### 128 8338703 145.400227 13.982000 9.739951
### 256 12554769 218.914891 12.396000 10.986125
### 512 14254878 248.559337 6.159000 22.111382
### 1024 13575501 236.713182 5.036000 27.042097
--
Srdja
PS: thanks to Ankan for the test runs.
Re: smp experiments on gpu...1024 workers
The scaling with 512 workers shows only about 50% SMP loss in NPS speedup (248.6x with 512 workers).
Unfortunately, the time to depth is suffering for some reason.
Especially surprising is that going from 1 worker to 2 improves time to depth by only about 10%.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Re: smp experiments on gpu...1024 workers
This is probably an effect of the random move order for the helper workers;
the fewer workers are running, the less random the order should be.
Code:
### workers #nps #nps speedup #time in s #time to depth speedup ###
### 64 4245302 74.024446 12.967000 10.502352
### 128 8338703 145.400227 13.982000 9.739951
### 256 12554769 218.914891 12.396000 10.986125
There is obviously room for optimization.
--
Srdja
Re: smp experiments on gpu...fun with ABDADA
Inspired by Tom Kerrigan's ABDADA results,
http://talkchess.com/forum/viewtopic.ph ... ght=abdada
http://www.tckerrigan.com/Chess/Parallel_Search/
I gave another implementation* on the GPU a try,
and the results now look more reasonable:
# Startpos, depth 12, mean of 4 runs, Nvidia GTX 750:
Code:
### workers #nps #nps speedup #time in s #ttd speedup #relative ttd speedup
### 1 45843 1.000000 228.534000 1.000000 1.000000
### 2 99110 2.161944 131.521500 1.737617 1.737617
### 4 210107 4.583186 62.6065000 3.650323 2.100764
### 8 432930 9.443754 38.1875000 5.984523 1.639450
### 16 796515 17.374845 21.9432500 10.41477 1.740284
# Startpos, depth 12, mean of 4 runs, AMD Fury X:
Code:
### workers #nps #nps speedup #time in s #ttd speedup #relative ttd speedup
### 1 31054 1.000000 296.711000 1.000000 1.000000
### 2 62135 2.000869 165.971750 1.787719 1.787719
### 4 124471 4.008212 92.3350000 3.213418 1.797495
### 8 248226 7.993366 47.2362500 6.281425 1.954748
### 16 497841 16.03146 33.6230000 8.824643 1.404879
### 32 996298 32.08275 25.2667500 11.74314 1.330721
### 64 1980535 63.77713 25.1675000 11.78945 1.003943
### 128 3924712 126.3834 14.5877500 20.33973 1.725248
### 256 7598046 244.6720 11.8680000 25.00092 1.229166
*Above 64 workers I have to randomize the move order again to achieve a speedup.
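For reference, the core of the ABDADA move loop looks roughly like this: a CPU-style sketch loosely after the simplified scheme Kerrigan describes, with Position, generate_moves(), make_move() etc. as hypothetical stand-ins, not my actual OpenCL code.
Code:
#include <atomic>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical engine hooks, assumed to exist elsewhere:
struct Position { std::uint64_t hash; /* ... */ };
int  evaluate(const Position &pos);
std::vector<int> generate_moves(const Position &pos);   // ordered, ttmove first
void make_move(Position &pos, int move);
void unmake_move(Position &pos, int move);

// Small shared "currently searching" table: a slot holds the key of a
// (position, move) pair that some worker is searching right now.
constexpr int CS_SIZE = 1 << 15;
std::atomic<std::uint64_t> cs_table[CS_SIZE];

std::uint64_t cs_key(std::uint64_t pos_hash, int move)
{
    return pos_hash ^ (std::uint64_t(move) * 0x9E3779B97F4A7C15ULL);
}
bool busy(std::uint64_t key)            { return cs_table[key % CS_SIZE].load() == key; }
void starting_search(std::uint64_t key) { cs_table[key % CS_SIZE].store(key); }
void finished_search(std::uint64_t key)
{
    std::uint64_t expected = key;                        // clear the slot only if still ours
    cs_table[key % CS_SIZE].compare_exchange_strong(expected, 0);
}

int search_abdada(Position &pos, int depth, int alpha, int beta)
{
    if (depth == 0) return evaluate(pos);
    std::vector<int> moves = generate_moves(pos);
    std::vector<int> deferred;

    for (std::size_t i = 0; i < moves.size(); ++i) {
        std::uint64_t key = cs_key(pos.hash, moves[i]);
        if (i > 0 && busy(key)) {          // another worker is on this move: postpone it
            deferred.push_back(moves[i]);
            continue;
        }
        starting_search(key);
        make_move(pos, moves[i]);
        int score = -search_abdada(pos, depth - 1, -beta, -alpha);
        unmake_move(pos, moves[i]);
        finished_search(key);
        if (score >= beta) return score;   // fail-high cutoff
        if (score > alpha) alpha = score;
    }
    for (int m : deferred) {               // second pass: search the postponed moves anyway
        make_move(pos, m);
        int score = -search_abdada(pos, depth - 1, -beta, -alpha);
        unmake_move(pos, m);
        if (score >= beta) return score;
        if (score > alpha) alpha = score;
    }
    return alpha;
}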
--
Srdja
- Full name: Edsel Apostol
Re: smp experiments on gpu...fun with ABDADA
This means ABDADA is the most efficient of them all?
Re: smp experiments on gpu...fun with ABDADA
Yes.
*YBWC not implemented and tested.
** In my implementation plain ABDADA scales only up to 32 or 64 workers.
--
Srdja