Lazy SMP, part 2

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

dchoman
Posts: 171
Joined: Wed Dec 28, 2011 8:44 pm
Location: United States

Re: Lazy SMP, part 2

Post by dchoman »

jd1 wrote:Thanks Dan.

Just to make sure I understand your method correctly, at every iteration you start the even threads at depth+1, and stop them when the iteration finishes.

This is then quite different to the usual shared hash table smp implementation, which launches x threads at the beginning of the search, as http://chessprogramming.wikispaces.com/ ... sh%20Table

I am testing right now.

Jerry
Yes, I am currently starting half the threads with the current iteration depth and the other half with iteration depth + 1. As I said in an earlier post, there may be a more optimum configuration than this, but this is what I am doing now (perhaps starting all extra threads with different depths? +1, +2, +3, etc... but that may be too much difference between them for the shared hash to do its magic)

From reading the wikispace paragraph (thanks for the link!), my implementation is not that different... the only detailed difference is that I am restarting each thread with each new iteration and some of my threads are searching a somewhat deeper iteration depth (+1).

- Dan
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: Lazy SMP, part 2

Post by Joerg Oster »

dchoman wrote:Thanks! This is very interesting! I would have expected a significant gain for 4 processors relative to 1 processor at TCx2 search. Not sure what SMP implementation Stockfish is using, but I assume it is state-of-the-art. Perhaps the simple SMP version I am using at the moment is good enough? Not sure if that would be true for more processors or non-bullet time controls.

- Dan
Indeed, Stockfish uses an implementation of the YBW Concept. When I first compared to your results, I was a bit shocked. OTOH, we must not forget that SF has a limiting parameter, Min Split Depth, not to split the search at too shallow depths. For my experiment this parameter was set to default = 4. My guess is, that with increased tc we might gain more. Maybe I will do the same test with double or triple tc.

Anyways, your results seem to be promising.
If you want me to run some tests with 8 cores, let me know.
Jörg Oster
dchoman
Posts: 171
Joined: Wed Dec 28, 2011 8:44 pm
Location: United States

Re: Lazy SMP, part 2

Post by dchoman »

Joerg Oster wrote:
dchoman wrote:Thanks! This is very interesting! I would have expected a significant gain for 4 processors relative to 1 processor at TCx2 search. Not sure what SMP implementation Stockfish is using, but I assume it is state-of-the-art. Perhaps the simple SMP version I am using at the moment is good enough? Not sure if that would be true for more processors or non-bullet time controls.

- Dan
Indeed, Stockfish uses an implementation of the YBW Concept. When I first compared to your results, I was a bit shocked. OTOH, we must not forget that SF has a limiting parameter, Min Split Depth, not to split the search at too shallow depths. For my experiment this parameter was set to default = 4. My guess is, that with increased tc we might gain more. Maybe I will do the same test with double or triple tc.

Anyways, your results seem to be promising.
If you want me to run some tests with 8 cores, let me know.
Thank you, that is a generous offer. I may take you up on it. At the moment, I am still testing some changes to my implementation, but I will send you a message when I have a version that may benefit from tests on more cores.

- Dan
Joerg Oster
Posts: 937
Joined: Fri Mar 10, 2006 4:29 pm
Location: Germany

Re: Lazy SMP, part 2

Post by Joerg Oster »

Here the result for Stockfish with double time, 12"+0.2".

Code: Select all

Rank Name          Elo    +    - games score oppo. draws 
   1 SF-4T         101   18   18   699   68%   -44   33% 
   3 SF-1T-TCx2     59   18   17   699   63%   -44   38% 
   5 SF-1T         -29   17   17   699   52%   -44   33% 
As expected, SF simply needs a bit more time to make full use of its SMP-implementation.

@Tord/Marco: Is this in accordance with your test results?
Jörg Oster
CRoberson
Posts: 2055
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: Lazy SMP, part 2

Post by CRoberson »

Hi Dan,

Your results seem reasonable. Here is some math for it. Assume you have an EBF of 2, then parallel search at the root only gives at best (no matter the number of processors) a 2x speed up.

By parallel search at the root only, you avoid lots of overhead that comes in at parallel searches down in the branches of the tree. So, you gain some efficiency. You don't have any of the node type misprediction issues that explode the tree. So, there aren't any trade-offs that are ever mishandled.

Your observation about game performance being better than speed performance is not unusual. There are critical positions in games where where improvements that help those positions the most yield nonlinear gains. Tactical positions gain more from parallel search than quiet positions. In quiet positions, many moves evaluate nearly the same as the best move. In tactical positions, that isn't so.

You should enter the CCT that is comming up in Feb.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP, part 2

Post by bob »

CRoberson wrote:Hi Dan,

Your results seem reasonable. Here is some math for it. Assume you have an EBF of 2, then parallel search at the root only gives at best (no matter the number of processors) a 2x speed up.

By parallel search at the root only, you avoid lots of overhead that comes in at parallel searches down in the branches of the tree. So, you gain some efficiency. You don't have any of the node type misprediction issues that explode the tree. So, there aren't any trade-offs that are ever mishandled.

Your observation about game performance being better than speed performance is not unusual. There are critical positions in games where where improvements that help those positions the most yield nonlinear gains. Tactical positions gain more from parallel search than quiet positions. In quiet positions, many moves evaluate nearly the same as the best move. In tactical positions, that isn't so.

You should enter the CCT that is comming up in Feb.
This is not quite right. You CAN get a >2x speedup splitting only at the root. Consider the case where move ordering is not so good at the root, and you are going to change your mind 3 times to choose among the 4 best moves. 4 cpus will run that 4x faster, searching each root move in the set of 4 in parallel.

On average, however, that does not happen, and the actual speedup is closer to 1.5x or so, and I have test data for that, as that is how our first parallel search worked in 1983 (we had all of 2 weeks to get something, anything, to use on the first Cray XMP.
dchoman
Posts: 171
Joined: Wed Dec 28, 2011 8:44 pm
Location: United States

Re: Lazy SMP, part 2

Post by dchoman »

CRoberson wrote:Hi Dan,

Your results seem reasonable. Here is some math for it. Assume you have an EBF of 2, then parallel search at the root only gives at best (no matter the number of processors) a 2x speed up.

By parallel search at the root only, you avoid lots of overhead that comes in at parallel searches down in the branches of the tree. So, you gain some efficiency. You don't have any of the node type misprediction issues that explode the tree. So, there aren't any trade-offs that are ever mishandled.

Your observation about game performance being better than speed performance is not unusual. There are critical positions in games where where improvements that help those positions the most yield nonlinear gains. Tactical positions gain more from parallel search than quiet positions. In quiet positions, many moves evaluate nearly the same as the best move. In tactical positions, that isn't so.

You should enter the CCT that is comming up in Feb.
My implementation is actually even worse than you describe because I am not even splitting at the root... I am just replicating the entire search on each processor, so the only speed up comes from the various hash table communications.

Regarding CCT, I just registered. I am excited to play again for the first time since CCT1.

- Dan
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Lazy SMP, part 2

Post by mar »

Hi Dan,

I have just finished a test match with lazy smp.
I got less convincing results than you but in my case I only communicated through the main hash table.
I only tested two threads.
The lazy smp version scored 58% with 46% draws (1748/3000).
Elostat says +58 +/- 9, bayeselo +50 +/-10 with 100% LOS.
As it was only self play it's probably even less.
I haven't tested a single threaded version with 2x as much thinking time.
Anyway price/performance ratio is pretty good actually as it was really straightforward to implement.
Plus no idle time and no need for minimum split depth either.
So thanks for sharing your idea, you're a genius!
I checked some top engines and they rarely get over 50 elo on two cores,
less than 50 is common. No need for complicated nonsense.
I guess everyone will start to rewrite their smp implementations now :wink:

Martin
dchoman
Posts: 171
Joined: Wed Dec 28, 2011 8:44 pm
Location: United States

Re: Lazy SMP, part 2

Post by dchoman »

mar wrote:Hi Dan,

I have just finished a test match with lazy smp.
I got less convincing results than you but in my case I only communicated through the main hash table.
I only tested two threads.
The lazy smp version scored 58% with 46% draws (1748/3000).
Elostat says +58 +/- 9, bayeselo +50 +/-10 with 100% LOS.
As it was only self play it's probably even less.
I haven't tested a single threaded version with 2x as much thinking time.
Anyway price/performance ratio is pretty good actually as it was really straightforward to implement.
Plus no idle time and no need for minimum split depth either.
So thanks for sharing your idea, you're a genius!
I checked some top engines and they rarely get over 50 elo on two cores,
less than 50 is common. No need for complicated nonsense.
I guess everyone will start to rewrite their smp implementations now :wink:

Martin
It is not my idea. I remember reading about it years ago, and it is briefly described in the chessprogramming wiki. I had remembered that it was not supposed to work very well, so I was very surprised that it did... hence my original post.

My program has 4 hash tables (main, pawn, score, and combination move), so it may gain a bit more for that reason. My testing also shows that using a larger depth (+1 or so), for at least some of the parallel threads helps speed things up as well.

I am not sure how well it will scale to longer searches, so that is something for me to test further. If the scaling is not so good, I may try the more traditional YBW type, but that is much more complicated and I have to admit to being a little afraid of the bugs :)

- Dan
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP, part 2

Post by bob »

dchoman wrote:
mar wrote:Hi Dan,

I have just finished a test match with lazy smp.
I got less convincing results than you but in my case I only communicated through the main hash table.
I only tested two threads.
The lazy smp version scored 58% with 46% draws (1748/3000).
Elostat says +58 +/- 9, bayeselo +50 +/-10 with 100% LOS.
As it was only self play it's probably even less.
I haven't tested a single threaded version with 2x as much thinking time.
Anyway price/performance ratio is pretty good actually as it was really straightforward to implement.
Plus no idle time and no need for minimum split depth either.
So thanks for sharing your idea, you're a genius!
I checked some top engines and they rarely get over 50 elo on two cores,
less than 50 is common. No need for complicated nonsense.
I guess everyone will start to rewrite their smp implementations now :wink:

Martin
It is not my idea. I remember reading about it years ago, and it is briefly described in the chessprogramming wiki. I had remembered that it was not supposed to work very well, so I was very surprised that it did... hence my original post.

My program has 4 hash tables (main, pawn, score, and combination move), so it may gain a bit more for that reason. My testing also shows that using a larger depth (+1 or so), for at least some of the parallel threads helps speed things up as well.

I am not sure how well it will scale to longer searches, so that is something for me to test further. If the scaling is not so good, I may try the more traditional YBW type, but that is much more complicated and I have to admit to being a little afraid of the bugs :)

- Dan
For parallel search measurements, the simplest test is to take a set of positions, search them to a fixed depth D using just one CPU/thread. Then search to the SAME depth D using multiple CPUS/threads. Your speedup = time(1cpu) / time(Ncpus).

That tells the tale. This type of algorithm fares poorly when # cpus > 2, and it doesn't do that well for 2, were a straightforward YBW algorithm can do 1.7x-1.8x using 2 cpus...