Lazy SMP, part 2

dchoman · Post by **dchoman** » Mon Jan 14, 2013 11:58 am

jd1 wrote:Thanks Dan.

Just to make sure I understand your method correctly, at every iteration you start the even threads at depth+1, and stop them when the iteration finishes.

This is then quite different to the usual shared hash table smp implementation, which launches x threads at the beginning of the search, as http://chessprogramming.wikispaces.com/ ... sh%20Table

I am testing right now.

Jerry

Yes, I am currently starting half the threads with the current iteration depth and the other half with iteration depth + 1. As I said in an earlier post, there may be a more optimum configuration than this, but this is what I am doing now (perhaps starting all extra threads with different depths? +1, +2, +3, etc... but that may be too much difference between them for the shared hash to do its magic)

From reading the wikispace paragraph (thanks for the link!), my implementation is not that different... the only detailed difference is that I am restarting each thread with each new iteration and some of my threads are searching a somewhat deeper iteration depth (+1).

- Dan

Joerg Oster · Post by **Joerg Oster** » Mon Jan 14, 2013 5:43 pm

dchoman wrote:Thanks! This is very interesting! I would have expected a significant gain for 4 processors relative to 1 processor at TCx2 search. Not sure what SMP implementation Stockfish is using, but I assume it is state-of-the-art. Perhaps the simple SMP version I am using at the moment is good enough? Not sure if that would be true for more processors or non-bullet time controls.

- Dan

Indeed, Stockfish uses an implementation of the YBW Concept. When I first compared to your results, I was a bit shocked. OTOH, we must not forget that SF has a limiting parameter, Min Split Depth, not to split the search at too shallow depths. For my experiment this parameter was set to default = 4. My guess is, that with increased tc we might gain more. Maybe I will do the same test with double or triple tc.

Anyways, your results seem to be promising.
If you want me to run some tests with 8 cores, let me know.

dchoman · Post by **dchoman** » Tue Jan 15, 2013 1:29 am

Joerg Oster wrote:
dchoman wrote:Thanks! This is very interesting! I would have expected a significant gain for 4 processors relative to 1 processor at TCx2 search. Not sure what SMP implementation Stockfish is using, but I assume it is state-of-the-art. Perhaps the simple SMP version I am using at the moment is good enough? Not sure if that would be true for more processors or non-bullet time controls.

- Dan
Indeed, Stockfish uses an implementation of the YBW Concept. When I first compared to your results, I was a bit shocked. OTOH, we must not forget that SF has a limiting parameter, Min Split Depth, not to split the search at too shallow depths. For my experiment this parameter was set to default = 4. My guess is, that with increased tc we might gain more. Maybe I will do the same test with double or triple tc.

Anyways, your results seem to be promising.
If you want me to run some tests with 8 cores, let me know.

Thank you, that is a generous offer. I may take you up on it. At the moment, I am still testing some changes to my implementation, but I will send you a message when I have a version that may benefit from tests on more cores.

- Dan

Joerg Oster · Post by **Joerg Oster** » Thu Jan 17, 2013 11:33 am

Here the result for Stockfish with double time, 12"+0.2".

Code: Select all

Rank Name          Elo    +    - games score oppo. draws 
   1 SF-4T         101   18   18   699   68%   -44   33% 
   3 SF-1T-TCx2     59   18   17   699   63%   -44   38% 
   5 SF-1T         -29   17   17   699   52%   -44   33%

As expected, SF simply needs a bit more time to make full use of its SMP-implementation.

@Tord/Marco: Is this in accordance with your test results?

CRoberson · Post by **CRoberson** » Fri Jan 18, 2013 9:02 pm

Hi Dan,

Your results seem reasonable. Here is some math for it. Assume you have an EBF of 2, then parallel search at the root only gives at best (no matter the number of processors) a 2x speed up.

By parallel search at the root only, you avoid lots of overhead that comes in at parallel searches down in the branches of the tree. So, you gain some efficiency. You don't have any of the node type misprediction issues that explode the tree. So, there aren't any trade-offs that are ever mishandled.

Your observation about game performance being better than speed performance is not unusual. There are critical positions in games where where improvements that help those positions the most yield nonlinear gains. Tactical positions gain more from parallel search than quiet positions. In quiet positions, many moves evaluate nearly the same as the best move. In tactical positions, that isn't so.

You should enter the CCT that is comming up in Feb.

bob · Post by **bob** » Fri Jan 18, 2013 9:22 pm

CRoberson wrote:Hi Dan,

Your results seem reasonable. Here is some math for it. Assume you have an EBF of 2, then parallel search at the root only gives at best (no matter the number of processors) a 2x speed up.

By parallel search at the root only, you avoid lots of overhead that comes in at parallel searches down in the branches of the tree. So, you gain some efficiency. You don't have any of the node type misprediction issues that explode the tree. So, there aren't any trade-offs that are ever mishandled.

Your observation about game performance being better than speed performance is not unusual. There are critical positions in games where where improvements that help those positions the most yield nonlinear gains. Tactical positions gain more from parallel search than quiet positions. In quiet positions, many moves evaluate nearly the same as the best move. In tactical positions, that isn't so.

You should enter the CCT that is comming up in Feb.

This is not quite right. You CAN get a >2x speedup splitting only at the root. Consider the case where move ordering is not so good at the root, and you are going to change your mind 3 times to choose among the 4 best moves. 4 cpus will run that 4x faster, searching each root move in the set of 4 in parallel.

On average, however, that does not happen, and the actual speedup is closer to 1.5x or so, and I have test data for that, as that is how our first parallel search worked in 1983 (we had all of 2 weeks to get something, anything, to use on the first Cray XMP.

dchoman · Post by **dchoman** » Sat Jan 19, 2013 1:51 am

CRoberson wrote:Hi Dan,

Your results seem reasonable. Here is some math for it. Assume you have an EBF of 2, then parallel search at the root only gives at best (no matter the number of processors) a 2x speed up.

By parallel search at the root only, you avoid lots of overhead that comes in at parallel searches down in the branches of the tree. So, you gain some efficiency. You don't have any of the node type misprediction issues that explode the tree. So, there aren't any trade-offs that are ever mishandled.

Your observation about game performance being better than speed performance is not unusual. There are critical positions in games where where improvements that help those positions the most yield nonlinear gains. Tactical positions gain more from parallel search than quiet positions. In quiet positions, many moves evaluate nearly the same as the best move. In tactical positions, that isn't so.

You should enter the CCT that is comming up in Feb.

My implementation is actually even worse than you describe because I am not even splitting at the root... I am just replicating the entire search on each processor, so the only speed up comes from the various hash table communications.

Regarding CCT, I just registered. I am excited to play again for the first time since CCT1.

- Dan

mar · Post by **mar** » Sun Jan 20, 2013 5:32 pm

Hi Dan,

I have just finished a test match with lazy smp.
I got less convincing results than you but in my case I only communicated through the main hash table.
I only tested two threads.
The lazy smp version scored 58% with 46% draws (1748/3000).
Elostat says +58 +/- 9, bayeselo +50 +/-10 with 100% LOS.
As it was only self play it's probably even less.
I haven't tested a single threaded version with 2x as much thinking time.
Anyway price/performance ratio is pretty good actually as it was really straightforward to implement.
Plus no idle time and no need for minimum split depth either.
So thanks for sharing your idea, you're a genius!
I checked some top engines and they rarely get over 50 elo on two cores,
less than 50 is common. No need for complicated nonsense.
I guess everyone will start to rewrite their smp implementations now

Martin

dchoman · Post by **dchoman** » Sun Jan 20, 2013 10:10 pm

mar wrote:Hi Dan,

I have just finished a test match with lazy smp.
I got less convincing results than you but in my case I only communicated through the main hash table.
I only tested two threads.
The lazy smp version scored 58% with 46% draws (1748/3000).
Elostat says +58 +/- 9, bayeselo +50 +/-10 with 100% LOS.
As it was only self play it's probably even less.
I haven't tested a single threaded version with 2x as much thinking time.
Anyway price/performance ratio is pretty good actually as it was really straightforward to implement.
Plus no idle time and no need for minimum split depth either.
So thanks for sharing your idea, you're a genius!
I checked some top engines and they rarely get over 50 elo on two cores,
less than 50 is common. No need for complicated nonsense.
I guess everyone will start to rewrite their smp implementations now

Martin

It is not my idea. I remember reading about it years ago, and it is briefly described in the chessprogramming wiki. I had remembered that it was not supposed to work very well, so I was very surprised that it did... hence my original post.

My program has 4 hash tables (main, pawn, score, and combination move), so it may gain a bit more for that reason. My testing also shows that using a larger depth (+1 or so), for at least some of the parallel threads helps speed things up as well.

I am not sure how well it will scale to longer searches, so that is something for me to test further. If the scaling is not so good, I may try the more traditional YBW type, but that is much more complicated and I have to admit to being a little afraid of the bugs

- Dan

bob · Post by **bob** » Mon Jan 21, 2013 1:47 am

dchoman wrote:
mar wrote:Hi Dan,

I have just finished a test match with lazy smp.
I got less convincing results than you but in my case I only communicated through the main hash table.
I only tested two threads.
The lazy smp version scored 58% with 46% draws (1748/3000).
Elostat says +58 +/- 9, bayeselo +50 +/-10 with 100% LOS.
As it was only self play it's probably even less.
I haven't tested a single threaded version with 2x as much thinking time.
Anyway price/performance ratio is pretty good actually as it was really straightforward to implement.
Plus no idle time and no need for minimum split depth either.
So thanks for sharing your idea, you're a genius!
I checked some top engines and they rarely get over 50 elo on two cores,
less than 50 is common. No need for complicated nonsense.
I guess everyone will start to rewrite their smp implementations now

Martin
It is not my idea. I remember reading about it years ago, and it is briefly described in the chessprogramming wiki. I had remembered that it was not supposed to work very well, so I was very surprised that it did... hence my original post.

My program has 4 hash tables (main, pawn, score, and combination move), so it may gain a bit more for that reason. My testing also shows that using a larger depth (+1 or so), for at least some of the parallel threads helps speed things up as well.

I am not sure how well it will scale to longer searches, so that is something for me to test further. If the scaling is not so good, I may try the more traditional YBW type, but that is much more complicated and I have to admit to being a little afraid of the bugs

- Dan

For parallel search measurements, the simplest test is to take a set of positions, search them to a fixed depth D using just one CPU/thread. Then search to the SAME depth D using multiple CPUS/threads. Your speedup = time(1cpu) / time(Ncpus).

That tells the tale. This type of algorithm fares poorly when # cpus > 2, and it doesn't do that well for 2, were a straightforward YBW algorithm can do 1.7x-1.8x using 2 cpus...

Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2

Re: Lazy SMP, part 2