Cluster Rybka

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

Michael Sherwin wrote:
bob wrote:
krazyken wrote:One place I'd expect a cluster to be more useful is in pondering. You should be able to get a 100% ponder hit with enough nodes.
Doesn't work. If you spread nodes across all the moves you ponder, you search less deeply, which hurts rather than helps, since each move will only be pondered by one node, not the entire group.
Using 8 cores to search each of 5 ponder moves, what would be the ponder hit rate then?
High. But if your hit rate is already > 50%, which would you rather do: have each node search a different ponder move for 2 minutes, or have all 5 nodes search the move that is right > 50% of the time, averaging 2.5 nodes searching the right move? 2.5 > 1.0.
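
To make that arithmetic concrete, here is a minimal sketch; p and n are assumed illustrative values matching the ">50%" case above, not measurements.

Code: Select all

#include <stdio.h>

int main(void) {
    double p = 0.50;  /* assumed ponder hit rate (the ">50%" case above) */
    int n = 5;        /* cluster nodes */

    /* Strategy A: all n nodes ponder the single predicted move.
       Expected nodes working on the move actually played: p * n. */
    double all_on_one = p * n;

    /* Strategy B: each node ponders a different move. At most one
       node turns out to be pondering the move actually played. */
    double one_each = 1.0;

    printf("all on one move: %.2f effective nodes\n", all_on_one);
    printf("one move each:   %.2f effective nodes\n", one_each);
    return 0;
}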

Old topic, beat to death already. Pondering a single move is better, unless your prediction rate is horrible.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more data

Post by bob »

jswaff wrote:Without the gimped up search...

Code: Select all

CPUS  Time  Speedup  Nodes  NPS
1     1.00  1.00     1.00   1.00
2     0.81  1.23     1.39   1.75
3     0.82  1.22     1.68   2.18
4     0.81  1.24     1.85   2.44
5     0.83  1.21     2.03   2.59
6     0.85  1.18     2.19   2.75
7     0.84  1.18     2.24   2.84
8     0.86  1.16     2.36   2.98
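
A note on reading the table: each column appears to be a ratio to the 1-CPU run, so Speedup should be roughly 1/Time. That interpretation is an assumption, but it holds for every row up to rounding:

Code: Select all

#include <stdio.h>

int main(void) {
    /* Time ratios from the table; Speedup should be ~1/Time */
    double time[] = {1.00, 0.81, 0.82, 0.81, 0.83, 0.85, 0.84, 0.86};
    for (int cpus = 1; cpus <= 8; cpus++)
        printf("cpus=%d  1/time=%.2f\n", cpus, 1.0 / time[cpus - 1]);
    return 0;
}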

OK, back into the mode of reading everything here that appears relevant. :)

My first question is a result of the above. It would appear that for any N, N>1 (N is the number of processors) your tree size _shrinks_. If this is so, it raises two questions:

(1) is this true over a significant set of positions, or just for a few (or even only one)? If it is true over a significant set, then:

(2) it would appear that you might have either (a) a bug lurking in the background that is allowing you to split somewhere and fail to search all the moves; or (b) your move ordering is so bad that the parallel search lets you search move 3 (or 4 or whatever) before you finish searching move two (I assume some sort of YBW predicate, in that the first move at some branch is always searched serially before any sort of split is done, which is true of classic YBW or DTS when possible).

So my first suggestion is to find out "where are those missing nodes going?" That will lead you to the cause, since this should never happen (you should never search a smaller tree on average, although in some cases you will, because nobody has perfect move ordering). Once you find out where the missing nodes are going, you might be on the way to even better play. I have had that sort of bug in the past and it is often well-hidden, which is a pain.
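
For reference, the YBW predicate mentioned above looks roughly like this. This is a minimal sketch with hypothetical names and types, not Crafty's or Prophet's actual code:

Code: Select all

/* Minimal Young-Brothers-Wait sketch. Position, Move, and the helper
   functions are hypothetical stand-ins for an engine's real types. */
typedef struct Position Position;
typedef int Move;

int  evaluate(Position *pos);                     /* stand-in for qsearch */
int  generate_moves(Position *pos, Move *moves);  /* returns move count   */
void make_move(Position *pos, Move m);
void unmake_move(Position *pos, Move m);
int  idle_threads_available(void);
int  split_point_search(Position *pos, int alpha, int beta,
                        int depth, Move *moves, int next, int count);

#define MIN_SPLIT_DEPTH 4

int search(Position *pos, int alpha, int beta, int depth) {
    if (depth == 0)
        return evaluate(pos);
    Move moves[256];
    int count = generate_moves(pos, moves);
    for (int i = 0; i < count; i++) {
        if (i > 0 && idle_threads_available() && depth >= MIN_SPLIT_DEPTH) {
            /* YBW predicate satisfied: the eldest brother (move 0) was
               searched serially, so alpha is a real bound and the rest
               of the moves can be handed to helper threads. */
            return split_point_search(pos, alpha, beta, depth, moves, i, count);
        }
        make_move(pos, moves[i]);
        int score = -search(pos, -beta, -alpha, depth - 1);
        unmake_move(pos, moves[i]);
        if (score >= beta) return score;   /* fail high: cutoff  */
        if (score > alpha) alpha = score;  /* new best so far    */
    }
    return alpha;
}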

bob wrote:
jswaff wrote:I recently ran a "root splitting only" experiment with Prophet. Each thread would get a root move and search it, and update alpha and the PV when finished if the score > alpha. With two processors I saw a speedup of 1.56; with 3: 1.89; with 4: 2.08; with 5: 2.17; with 6: 2.27; with 7: 2.27; with 8: 2.27. The root move ordering was simple: make sure the first move from the last iteration is first, and sort the rest by node count. So I have to agree, the algorithm sucks from a scalability perspective. Still, it's something.
Is the above without a YBW predicate condition? That is, do you search move 1 by itself, then split the rest, or do you start splitting from the get-go so that you are searching N root moves in parallel with no idea of the proper alpha/beta bound (as would be established by searching the first move before splitting)??



The "unsynchronized depth" issue is sticky. I can't think of a sensible way to handle that.
2.27 is actually remarkable and suggests some real move ordering issues, unless you produced that number on some _tactical_ positions only. There you can get above 2, because tactical positions are those where move ordering is bad and the final "best move" is a "surprise" that rises to the top at the end of the search. Here are some numbers from yesterday's tournament:

Code: Select all

               19     3:44   0.30   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    <HT>
               19->   3:53   0.30   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    <HT>
               20     4:05   0.27   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    27. Nc4 Nxc4 28. Bxc4
               20->   4:19   0.27   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    27. Nc4 Nxc4 28. Bxc4 (s=2)




               18    14.71   0.21   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Bd4 26. Rc1 Rxc1
                                    27. Qxc1
               18->  19.09   0.21   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Bd4 26. Rc1 Rxc1
                                    27. Qxc1
               19    24.20   0.20   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Rc7 26. Rc1 Rac8
                                    27. Rxc7 Rxc7
               19->  30.63   0.20   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Rc7 26. Rc1 Rac8
                                    27. Rxc7 Rxc7
               20     1:01   0.28   18. ... Rfc8 19. Bd2 Bd8 20. Ng3 Qa7
                                    21. Rf1 Nb7 22. b4 a5 23. Qe2 axb4
                                    24. axb4 Qxa1 25. Rxa1 Rxa1+ 26. Be1
                                    Rac1 27. Bxb5 R8c2 28. Qd3 Bxb5 29.
                                    Qxb5
               20->   1:21   0.28   18. ... Rfc8 19. Bd2 Bd8 20. Ng3 Qa7
                                    21. Rf1 Nb7 22. b4 a5 23. Qe2 axb4
                                    24. axb4 Qxa1 25. Rxa1 Rxa1+ 26. Be1
                                    Rac1 27. Bxb5 R8c2 28. Qd3 Bxb5 29.
                                    Qxb5

Let's just take the last search output (this came from the Rybka game, so it is a good example from a real game rather than a test position). The first move at the root took 14.7 seconds (depth=18) and the remainder of the moves took just over 4 seconds total. If you split at the root, the processor that gets that move (the best move, Rfc8) is going to take 14.7 seconds. Let's assume the other moves get searched in zero time (which is not possible, but is an "optimal" case). So rather than taking a total of 19 seconds, you are done in 14.7 seconds. Speedup = 19.09 / 14.7, or about 1.3x, being generous. For depth 19, the first move takes 24.2 seconds and all the rest take 6.6 seconds; speedup = 30.63 / 24.2, about 1.27x. This is easy to understand. If, instead, you take a position where move ordering is bad, you might get this kind of output:

Code: Select all

               17->  59.74   0.28   19. ... Kf8 20. Bd2 Qb6 21. Ng3 Qd8
                                    22. Be3 Nb7 23. b4 Rc3 24. Ne2 Rcc8
                                    25. Rc1 Rxc1 26. Qxc1 Rc8 27. Nc3 a5
                                    28. bxa5 Nxa5 (s=6)
               18     1:12   0.35   19. ... Kf8 20. Ng3 Qb8 21. Be3 Nb7
                                    22. b4 Rc3 23. Rb3 Rc8 24. Qd2 a5 25.
                                    Rc1 axb4 26. axb4 Rxc1+ 27. Qxc1 Ra4
                                    28. Qf1 Qe8 (s=5)
               18     1:16   0.29   19. ... Bd8 20. Ng3 Qb8 21. Bd2 Qa7
                                    22. Qe2 Nb7 23. b4 Bb6 24. Red1 h6
                                    25. Rbc1 Rab8 26. Rf1 Ra8 27. Bc3 Qb8
                                    (s=4)
               18->   1:27   0.29   19. ... Bd8 20. Ng3 Qb8 21. Bd2 Qa7
                                    22. Qe2 Nb7 23. b4 Bb6 24. Red1 h6
                                    25. Rbc1 Rab8 26. Rf1 Ra8 27. Bc3 Qb8
                                    (s=3)

In this case, we finish depth 17 at 60 seconds. At depth 18, it takes 12 seconds to find the score for Kf8, then another 4 seconds to discover that Bd8 is even better, then 11 seconds to search the rest of the moves. We get a larger gain here, since we will search Kf8 and Bd8 at the root in parallel and get a score back for both in a total of 12 seconds, rather than the serial 16 seconds. So we save the 4-second search, since it runs in parallel with the 12-second search, and we are done after 12 seconds rather than after 16 (again, we assume all the others go in zero time, to be as favorable to this algorithm's timing as possible).

This kind of case is where you get the most from splitting at the root, and with today's ultra-low branching factor, what you get is very little.
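
Putting the two worked examples in one place: under the zero-cost assumption for the non-best moves stated above, the root-split speedup bound is just iteration time divided by best-move time. A sketch of the arithmetic, with the numbers taken from the outputs above:

Code: Select all

#include <stdio.h>

/* Best-case root-split speedup: the processor that draws the best
   move still pays its full serial cost; all other root moves are
   assumed to be searched in zero time, as in the text above. */
static double root_split_bound(double iteration_time, double best_move_time) {
    return iteration_time / best_move_time;
}

int main(void) {
    /* good ordering (first output): the best move dominates */
    printf("depth 18: %.2fx\n", root_split_bound(19.09, 14.7));  /* ~1.30 */
    printf("depth 19: %.2fx\n", root_split_bound(30.63, 24.2));  /* ~1.27 */
    /* bad ordering (second output): the iteration took 27s serially
       (12s Kf8 + 4s Bd8 + 11s rest); in parallel the best-move work
       finishes after the 12s Kf8 search */
    printf("depth 18, bad ordering: %.2fx\n", root_split_bound(27.0, 12.0));
    return 0;
}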

You didn't say what you are doing elsewhere: LMR? Null move with R=3? Etc. The lower your branching factor, the worse this looks...

I brought this up because of the various discussions and comments yesterday, particularly people asking "why all the different depths? why is the depth bouncing up and down? how fast is this 5-node, 8-core-per-node cluster?" and so on. The answer is that it is no faster than a normal 8-core, 1-node box. I am finally making preparations to do a _real_ cluster approach. And I know I am not going to get 100 Elo from a 5-node parallel search. You lose too much. No shared hash table. No shared killers. No shared anything whatsoever. So the losses are large. But if you can get a real 2.5x speedup out of 5 nodes, that would be significant, as it would be like searching 50M nodes per second on the hardware I used yesterday, where we were typically searching 20M nodes per second. More importantly, if I can get 35x on the big 70-node, 8-core-per-node cluster, that will be serious numbers. 700M nodes per second is serious computing. I have no idea whether I can get that kind of speedup. I believe I can get 2.5x out of 5, but stretching that ratio to 70 nodes is improbable. Still, anything > 1.0 is good...
bob wrote:I had a chance to watch Rybka's analysis during our PanAM game today, and what I saw was a far cry from what I was expecting.

Several were noticing that Rybka's depth and score were bouncing all over the place. I finally figured out why on a position where Crafty played QxQ and Rybka had one recapture move (RxQ) and several other moves. We were seeing all sorts of -9.0 scores being kibitzed. As I went through them, what I discovered was this: Rybka uses an algorithm first tried (and then discarded because it is useless) back around 1978 or so by Monty Newborn. What is happening is this: Each root move is given to a different cluster node when the first iteration of the search starts. Each move is searched independently until time runs out. The depth is not synchronized in any way, so when you look at Rybka's analysis, you might see this:

Code: Select all

depth=18 time=5.08 node=7028735 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=20 time=6.09 node=8650750 speed=1357325 score=+0.75
pv="Ra1 Kf8 Kf2"

depth=19 time=6.09 node=8650750 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=21 time=7.11 node=10131160 speed=1357325
score=+0.75 pv="Ra1 Kf8 Kf2"

depth=21 time=8.20 node=11654740 speed=1420616
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=16 time=8.20 node=11654740 speed=1420616
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=20 time=10.16 node=15113065 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8" 

depth=17 time=10.16 node=15113065 speed=1520741
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=21 time=11.17 node=16464745 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"

depth=18 time=12.19 node=18357100 speed=1520741
score=+0.52 pv="Rc1 Kf7 Kf2"

depth=18 time=13.20 node=19605795 speed=1520741
score=+0.65 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=14.22 node=20854490 speed=1520741
score=+0.75 pv="Ra1 Kf8 Kf2 Ndc4"

depth=19 time=15.23 node=22244790 speed=1520741
score=+0.56 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=17.38 node=25377725 speed=1460504
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=20 time=17.38 node=25377725 speed=1460504
score=+0.55 pv="g4 Kf7 Ra1 Ke7"

depth=19 time=18.28 node=27213820 speed=1460504
score=+0.52 pv="Rc1 Kf7 Kf2 Ndc4"

depth=22 time=19.30 node=29466620 speed=1527572
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"
If you look at that at first glance, you wonder why the depth is bouncing around, going forward and backward all over the place. But if you look more carefully, and find just the kibitzes for the same move and ignore the others, you will find that the depths and scores look sensible.

The primary fault with this is that the speedup is zero. If you watch a typical program, 90% of the search effort goes into the _first_ move. Here, each node is searching a "first move," which breaks alpha/beta completely. Even worse, there is a depth issue, in that the root moves are often searched to different depths. Which is better, +1 at depth 17 or +0.01 at depth 21? That was the problem Monty discovered, and that is why this is no good, without even considering that the distributed search is not speeding up the process one bit.
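
What that looks like in code terms, roughly. This is a reconstruction of the described behavior with hypothetical names, not Rybka's actual source:

Code: Select all

/* One cluster node per root move, each iterating independently: no
   alpha bound ever crosses node boundaries, so every node pays full
   "first move" cost on its own move, and the reported depths drift
   apart because the nodes finish their iterations at different times. */
typedef struct Position Position;
typedef int Move;

void make_move(Position *pos, Move m);
int  search(Position *pos, int alpha, int beta, int depth);
int  time_is_up(void);
void report_to_master(int rank, Move m, int depth, int score);

#define INF 32000

void cluster_node_main(int my_rank, Position *root, Move my_move) {
    make_move(root, my_move);
    for (int depth = 1; !time_is_up(); depth++) {
        /* always a full window: nothing learned about the other root
           moves can narrow this node's search */
        int score = -search(root, -INF, +INF, depth - 1);
        report_to_master(my_rank, my_move, depth, score);
    }
}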

I have no idea why such an algorithm would be used. There is no upside to doing this, and a single node would be no slower. It would actually be a bit faster, since all root moves get searched to the same depth, so there is no issue in comparing scores from different depths.

At least the hyperbole about the "40-core monster" has been exposed. Yes, Rybka is extremely strong. But the cluster Rybka is absolutely no stronger, and perhaps a bit weaker due to the unsynchronized-depth issue.

live and learn...

BTW, after watching this game it is apparent that the requirement to kibitz node counts, NPS, depth, etc. is worthless. :) The numbers being kibitzed are absolute nonsense with respect to the accepted meaning of those terms... Several of us noticed the NPS numbers matching at several different points. That never happens for the rest of us. :)
krazyken

Re: more data

Post by krazyken »

bob wrote:
jswaff wrote:Without the gimped up search...

Code: Select all

CPUS  Time  Speedup  Nodes  NPS
1     1.00  1.00     1.00   1.00
2     0.81  1.23     1.39   1.75
3     0.82  1.22     1.68   2.18
4     0.81  1.24     1.85   2.44
5     0.83  1.21     2.03   2.59
6     0.85  1.18     2.19   2.75
7     0.84  1.18     2.24   2.84
8     0.86  1.16     2.36   2.98
OK, back into the mode of reading everything here that appears relevant. :)

My first question is a result of the above. It would appear that for any N, N>1 (N is the number of processors) your tree size _shrinks_. If this is so, it raises two questions:

(1) is this true over a significant set of positions, or just for a few (or even only one)? If it is true over a significant set, then:

(2) it would appear that you might have either (a) a bug lurking in the background that is allowing you to split somewhere and fail to search all the moves; or (b) your move ordering is so bad that the parallel search lets you search move 3 (or 4 or whatever) before you finish searching move two (I assume some sort of YBW predicate, in that the first move at some branch is always searched serially before any sort of split is done, which is true of classic YBW or DTS when possible).

So my first suggestion is to find out "where are those missing nodes going?" That will lead you to the cause, since this should never happen (you should never search a smaller tree on average, although in some cases you will, because nobody has perfect move ordering). Once you find out where the missing nodes are going, you might be on the way to even better play. I have had that sort of bug in the past and it is often well-hidden, which is a pain.
I recall when testing Glaurung with up to 8 threads I got a similar distribution of results. Performance was absolutely abysmal with 8 threads; going back to 4 got me the strongest play.
The games for this test are available here.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more data

Post by bob »

krazyken wrote:
bob wrote:
jswaff wrote:Without the gimped up search...

Code: Select all

CPUS  Time  Speedup  Nodes  NPS
1     1.00  1.00     1.00   1.00
2     0.81  1.23     1.39   1.75
3     0.82  1.22     1.68   2.18
4     0.81  1.24     1.85   2.44
5     0.83  1.21     2.03   2.59
6     0.85  1.18     2.19   2.75
7     0.84  1.18     2.24   2.84
8     0.86  1.16     2.36   2.98
OK, back into the mode of reading everything here that appears relevant. :)

My first question is a result of the above. It would appear that for any N, N>1 (N is the number of processors) your tree size _shrinks_. If this is so, it raises two questions:

(1) is this true over a significant set of positions, or just for a few (or even only one)? If it is true over a significant set, then:

(2) it would appear that you might have either (a) a bug lurking in the background that is allowing you to split somewhere and fail to search all the moves; or (b) your move ordering is so bad that the parallel search lets you search move 3 (or 4 or whatever) before you finish searching move two (I assume some sort of YBW predicate, in that the first move at some branch is always searched serially before any sort of split is done, which is true of classic YBW or DTS when possible).

So my first suggestion is to find out "where are those missing nodes going?" That will lead you to the cause, since this should never happen (you should never search a smaller tree on average, although in some cases you will, because nobody has perfect move ordering). Once you find out where the missing nodes are going, you might be on the way to even better play. I have had that sort of bug in the past and it is often well-hidden, which is a pain.
I recall when testing Glaurung with up to 8 threads I got a similar distribution of results. Performance was absolutely abysmal with 8 threads; going back to 4 got me the strongest play.
The games for this test are available here.
I have not looked at how Tord did his parallel search, but 8 processors is not a particularly difficult performance challenge. 16 gets harder and the difficulty increases as you go up. For 8, my code works as well as it does for 4.

But the issue here was that his trees get smaller with a parallel search, independent of the scaling issue. Smaller trees suggest some other sort of issue...
krazyken

Re: more data

Post by krazyken »

bob wrote:
krazyken wrote:
bob wrote:
jswaff wrote:Without the gimped up search...

Code: Select all

CPUS  Time  Speedup  Nodes  NPS
1     1.00  1.00     1.00   1.00
2     0.81  1.23     1.39   1.75
3     0.82  1.22     1.68   2.18
4     0.81  1.24     1.85   2.44
5     0.83  1.21     2.03   2.59
6     0.85  1.18     2.19   2.75
7     0.84  1.18     2.24   2.84
8     0.86  1.16     2.36   2.98
OK, back into the mode of reading everything here that appears relevant. :)

My first question is a result of the above. It would appear that for any N, N>1 (N is the number of processors) your tree size _shrinks_. If this is so, it raises two questions:

(1) is this true over a significant set of positions, or just for a few (or even only one)? If it is true over a significant set, then:

(2) it would appear that you might have either (a) a bug lurking in the background that is allowing you to split somewhere and fail to search all the moves; or (b) your move ordering is so bad that the parallel search lets you search move 3 (or 4 or whatever) before you finish searching move two (I assume some sort of YBW predicate, in that the first move at some branch is always searched serially before any sort of split is done, which is true of classic YBW or DTS when possible).

So my first suggestion is to find out "where are those missing nodes going?" That will lead you to the cause, since this should never happen (you should never search a smaller tree on average, although in some cases you will, because nobody has perfect move ordering). Once you find out where the missing nodes are going, you might be on the way to even better play. I have had that sort of bug in the past and it is often well-hidden, which is a pain.
I recall when testing Glaurung with up to 8 threads I got a similar distribution of results. Performance was absolutely abysmal with 8 threads; going back to 4 got me the strongest play.
The games for this test are available here.
I have not looked at how Tord did his parallel search, but 8 processors is not a particularly difficult performance challenge. 16 gets harder and the difficulty increases as you go up. For 8, my code works as well as it does for 4.

But the issue here was that his trees get smaller with a parallel search, independent of the scaling issue. Smaller trees suggest some other sort of issue...
I guess I'll let my ignorance show here: how does the data suggest a shrinking tree size? I see time fluctuations, but nodes increase with N.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more data

Post by bob »

Actually I think I might have erred. James had sent me a couple of emails and I may well have confused their content with earlier posts in this thread... In fact, after looking, that is what happened. So from what he posted earlier you could not conclude what I stated. He had quizzed me about my search overhead and was finding way lower increases, or even decreases in nodes searched, which is interesting at best and suggests an undiscovered bug at worst...
jswaff

Re: more data

Post by jswaff »

I'm glad to read this... I was wondering how you arrived at that conclusion myself.

My concern was (is) that you've stated roughly a 30% increase in node count with each additional processor, with nearly perfect NPS scaling. I'm seeing a much lower overhead, and much worse NPS scaling. The NPS scaling doesn't concern me that much (not that it doesn't matter, but I can "explain" it). It's the fact that my search overhead seems to be so much lower than yours... it makes me suspect "bug" as well.

That said, on every test I've run Prophet has produced the same answers with eight processors as it has with one, just faster. Also, I've gone as far as splitting min-max style (no cutoffs) and verifying the node counts are correct. So, that makes a bug seem unlikely.
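
The min-max verification described here works because, with cutoffs disabled, the tree size is independent of move ordering and thread count. A sketch of the idea with hypothetical names, not Prophet's actual test code:

Code: Select all

typedef struct Position Position;
typedef int Move;

int  generate_moves(Position *pos, Move *moves);
void make_move(Position *pos, Move m);
void unmake_move(Position *pos, Move m);

/* Pure min-max node count: with no cutoffs the tree is fixed, so a
   serial run and an N-thread run must report exactly the same total,
   regardless of move ordering or split decisions. */
long minimax_count(Position *pos, int depth) {
    if (depth == 0)
        return 1;
    Move moves[256];
    int count = generate_moves(pos, moves);
    long nodes = 1;
    for (int i = 0; i < count; i++) {
        make_move(pos, moves[i]);
        nodes += minimax_count(pos, depth - 1);
        unmake_move(pos, moves[i]);
    }
    return nodes;
}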

What I took away from our chat during the tournament is that I should focus on improving NPS scaling and see what happens from there.

Edit: Just wanted to point out, I don't ever see a decrease in nodes searched, just a "lower increase." And just to be clear: my intent wasn't to say "Bob, look, I have a lower increase than you!" ... it was more like "Bob, something must be wrong here... any ideas?"

--
James

bob wrote:Actually I think I might have erred. James had sent me a couple of emails and I may well have confused their content with earlier posts in this thread... In fact, after looking, that is what happened. So from what he posted earlier you could not conclude what I stated. He had quizzed me about my search overhead and was finding way lower increases, or even decreases in nodes searched, which is interesting at best and suggests an undiscovered bug at worst...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: more data

Post by bob »

jswaff wrote:I'm glad to read this... I was wondering how you arrived at that conclusion myself.

My concern was (is) that you've stated roughly a 30% increase in node count with each additional processor, with nearly perfect NPS scaling. I'm seeing a much lower overhead, and much worse NPS scaling. The NPS scaling doesn't concern me that much (not that it doesn't matter, but I can "explain" it). It's the fact that my search overhead seems to be so much lower than yours... it makes me suspect "bug" as well.

That said, on every test I've run Prophet has produced the same answers with eight processors as it has with one, just faster. Also, I've gone as far as splitting min-max style (no cutoffs) and verifying the node counts are correct. So, that makes a bug seem unlikely.

What I took away from our chat during the tournament is that I should focus on improving NPS scaling and see what happens from there.

Edit: Just wanted to point out, I don't ever see a decrease in nodes searched, just a "lower increase." And just to be clear: my intent wasn't to say "Bob, look, I have a lower increase than you!" ... it was more like "Bob, something must be wrong here... any ideas?"

--
James
I didn't interpret it that way, so no problem. But in one of the tests you sent, I thought I saw a drop in nodes at, say, cpus=2 compared to 1, then a jump back up later, but I might well be remembering wrong. You actually can search fewer nodes with a parallel search at times; that is what produces the super-linear speedups we had discussed elsewhere.
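
A toy illustration of how that can happen (the numbers are invented for the example): a helper thread searching a later root move finds the refutation quickly and aborts the huge first subtree that a serial search would have had to finish.

Code: Select all

#include <stdio.h>

int main(void) {
    /* A serial search must finish subtree A before trying subtree B,
       whose move is the one that actually refutes the position. */
    long a = 1000000;       /* nodes in the fruitless first subtree */
    long b = 1000;          /* nodes to find the refutation in B    */

    long serial   = a + b;  /* A fully searched, then B             */
    long parallel = 2 * b;  /* 2 cpus: B fails high after b nodes
                               and aborts A, which also did ~b      */

    printf("serial:   %ld nodes\n", serial);
    printf("parallel: %ld nodes -> %.0fx speedup on 2 cpus\n",
           parallel, (double)serial / parallel);
    return 0;
}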


bob wrote:Actually I think I might have erred. James had sent me a couple of emails and I may well have confused their content with earlier posts in this thread... In fact, after looking, that is what happened. So from what he posted earlier you could not conclude what I stated. He had quizzed me about my search overhead and was finding way lower increases, or even decreases in nodes searched, which is interesting at best and suggests an undiscovered bug at worst...
jswaff

Re: more data

Post by jswaff »

Just to point out: the results I posted were from an experiment splitting only at the root, so performance is expected to taper off or even decrease with more processors. The data shows that splitting only at the root gave Prophet a max speedup of 1.24, which supports Bob's claim that splitting at the root is a crappy way to do a parallel search.
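
For context, root-only splitting looks roughly like this (a sketch with hypothetical names and types, not Prophet's actual code). Everything below the root stays serial, which is why the speedup flattens out:

Code: Select all

typedef struct Position Position;
typedef int Move;

int  search(Position *pos, int alpha, int beta, int depth);
void make_move(Position *pos, Move m);
void unmake_move(Position *pos, Move m);

#define INF 32000

typedef struct {
    Position *copies;  /* one position copy per thread            */
    Move *moves;       /* root moves, best-first from last iter.  */
    int count;
    int next;          /* would be atomic/locked in real code     */
    int alpha;         /* shared best score, likewise protected   */
} RootSplit;

/* Each thread grabs the next unsearched root move. All splitting
   happens here; below the root the search is plain serial, so once
   the first (huge) move is taken, the leftover work is too small to
   keep the other threads busy. */
void root_worker(RootSplit *rs, int tid, int depth) {
    for (;;) {
        int i = rs->next++;
        if (i >= rs->count) return;
        Position *pos = &rs->copies[tid];
        make_move(pos, rs->moves[i]);
        int score = -search(pos, -INF, -rs->alpha, depth - 1);
        unmake_move(pos, rs->moves[i]);
        if (score > rs->alpha) rs->alpha = score;  /* + PV update */
    }
}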

krazyken wrote:
bob wrote:
jswaff wrote:Without the gimped up search...

Code: Select all

CPUS  Time  Speedup  Nodes  NPS
1     1.00  1.00     1.00   1.00
2     0.81  1.23     1.39   1.75
3     0.82  1.22     1.68   2.18
4     0.81  1.24     1.85   2.44
5     0.83  1.21     2.03   2.59
6     0.85  1.18     2.19   2.75
7     0.84  1.18     2.24   2.84
8     0.86  1.16     2.36   2.98
OK, back into the mode of reading everything here that appears relevant. :)

My first question is a result of the above. It would appear that for any N, N>1 (N is the number of processors) your tree size _shrinks_. If this is so, it raises two questions:

(1) is this true over a significant set of positions, or just for a few (or even only one)? If it is true over a significant set, then:

(2) it would appear that you might have either (a) a bug lurking in the background that is allowing you to split somewhere and fail to search all the moves; or (b) your move ordering is so bad that the parallel search lets you search move 3 (or 4 or whatever) before you finish searching move two (I assume some sort of YBW predicate, in that the first move at some branch is always searched serially before any sort of split is done, which is true of classic YBW or DTS when possible).

So my first suggestion is to find out "where are those missing nodes going?" That will lead you to the cause, since this should never happen (you should never search a smaller tree on average, although in some cases you will, because nobody has perfect move ordering). Once you find out where the missing nodes are going, you might be on the way to even better play. I have had that sort of bug in the past and it is often well-hidden, which is a pain.
I recall when testing Glaurung with up to 8 threads I got a similar distribution of results. Performance was absolutely abysmal with 8 threads; going back to 4 got me the strongest play.
The games for this test are available here.
Dirt
Posts: 2851
Joined: Wed Mar 08, 2006 10:01 pm
Location: Irvine, CA, USA

Re: more data

Post by Dirt »

bob wrote:...But in one of the tests you sent, I thought I saw a drop in nodes at, say, cpus=2 compared to 1, then a jump back up later, but I might well be remembering wrong. You actually can search fewer nodes with a parallel search at times; that is what produces the super-linear speedups we had discussed elsewhere.
Maybe you were thinking about this data that James posted earlier in the thread:

Code: Select all

CPUS  Time    Speedup Nodes   NPS
1     1.00    1.00    1.00    1.00
2     0.49    2.03    0.98    1.94
3     0.37    2.68    1.06    2.77
4     0.32    3.16    1.14    3.58
5     0.28    3.54    1.22    4.30
6     0.27    3.76    1.31    4.95
7     0.26    3.90    1.39    5.50
8     0.25    3.95    1.49    6.00