Cluster Rybka

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Cluster Rybka

Post by bob »

I had a chance to watch Rybka's analysis during our PanAM game today, and what I saw was a far cry from what I was expecting.

Several were noticing that Rybka's depth and score were bouncing all over the place. I finally figured out why on a position where Crafty played QxQ and Rybka had one recapture move (RxQ) and several other moves. We were seeing all sorts of -9.0 scores being kibitzed. As I went through them, what I discovered was this: Rybka uses an algorithm first tried (and then discarded because it is useless) back around 1978 or so by Monty Newborn. What is happening is this: Each root move is given to a different cluster node when the first iteration of the search starts. Each move is searched independently until time runs out. The depth is not synchronized in any way, so when you look at Rybka's analysis, you might see this:

Code: Select all

depth=18 time=5.08 node=7028735 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=20 time=6.09 node=8650750 speed=1357325 score=+0.75
pv="Ra1 Kf8 Kf2"

depth=19 time=6.09 node=8650750 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=21 time=7.11 node=10131160 speed=1357325
score=+0.75 pv="Ra1 Kf8 Kf2"

depth=21 time=8.20 node=11654740 speed=1420616
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=16 time=8.20 node=11654740 speed=1420616
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=20 time=10.16 node=15113065 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8" 

depth=17 time=10.16 node=15113065 speed=1520741
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=21 time=11.17 node=16464745 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"

depth=18 time=12.19 node=18357100 speed=1520741
score=+0.52 pv="Rc1 Kf7 Kf2"

depth=18 time=13.20 node=19605795 speed=1520741
score=+0.65 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=14.22 node=20854490 speed=1520741
score=+0.75 pv="Ra1 Kf8 Kf2 Ndc4"

depth=19 time=15.23 node=22244790 speed=1520741
score=+0.56 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=17.38 node=25377725 speed=1460504
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=20 time=17.38 node=25377725 speed=1460504
score=+0.55 pv="g4 Kf7 Ra1 Ke7"

depth=19 time=18.28 node=27213820 speed=1460504
score=+0.52 pv="Rc1 Kf7 Kf2 Ndc4"

depth=22 time=19.30 node=29466620 speed=1527572
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"
If you look at that at first glance, you wonder why is the depth bouncing around going forward and backward all over the place? But if you look more carefully, and find just the kibitzes for the same move and ignore the others, you will find that the depths and scores now look sensible.

The primary fault with this is that the speedup is zero. If you watch the typical program, 90% of the search effort is on the _first_ move. Here, each node is searching a "first move" which breaks alpha/beta completely. Even worse, there is a depth issue in that each of the root moves are often searched to different depths. Which is better, +1 at depth 17 or +.01 at depth 21? That was the problem Monty discovered and that is why this is no good, without even considering the issue that the distributed search is not speeding up the process one bit.

I have no idea why such an algorithm would be used. There is no up-side to doing this and using a single node would be no slower. And it actually would be a bit faster since all root moves get searched to the same depth so that there is no issue in comparing scores from different depths.

At least the hyperbole about the "40 core monster" has been discovered. Yes, Rybka is extremely strong. But the cluster Rybka is absolutely no stronger and perhaps a bit weaker due to the unsynchronized depth issue.

live and learn...

BTW, after watching this game it is apparent that the requirement for kibitzing node counts, NPS, depth, etc is worthless. :) The numbers being kibitzed are absolute nonsense with respect to the accepted meaning of the terms... Several of us noticed the NPS numbers matching at several different points. That never happens for the rest of us. :)
User avatar
Ovyron
Posts: 4556
Joined: Tue Jul 03, 2007 4:30 am

Re: Cluster Rybka

Post by Ovyron »

bob wrote:I have no idea why such an algorithm would be used.
Because it was easy to implement and [they claimed] it's 100 ELO stronger than an octal?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

Ovyron wrote:
bob wrote:I have no idea why such an algorithm would be used.
Because it was easy to implement and [they claimed] it's 100 ELO stronger than an octal?
and that is utter bullshit.
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: Cluster Rybka

Post by sje »

The memory is hazy, but I seem to recall that back in the early 1980s Fidelity showed up at a no-holds-barred CC event with a custom dedicated unit with thirty-plus processors. Perhaps they used the same algorithm in that unit.
jswaff

Re: Cluster Rybka

Post by jswaff »

I recently ran a "root splitting only" experiment with Prophet. Each thread would get a root move and search it, and update alpha and the PV when finished if the score > alpha. With two processors I saw a speedup of 1.56. With 3 processors 1.89, with 4: 2.08, with 5: 2.17, with 6: 2.27, with 7: 2.27, with 8: 2.27. The root move ordering was simply : make sure the first move from the last iteration is first, and sort the rest by node count. So I have to agree, the algorithm sucks from a scalability perspective. Still, it's something.

The "unsynchronized depth" issue is sticky. I can't think of a sensible way to handle that.

bob wrote:I had a chance to watch Rybka's analysis during our PanAM game today, and what I saw was a far cry from what I was expecting.

Several were noticing that Rybka's depth and score were bouncing all over the place. I finally figured out why on a position where Crafty played QxQ and Rybka had one recapture move (RxQ) and several other moves. We were seeing all sorts of -9.0 scores being kibitzed. As I went through them, what I discovered was this: Rybka uses an algorithm first tried (and then discarded because it is useless) back around 1978 or so by Monty Newborn. What is happening is this: Each root move is given to a different cluster node when the first iteration of the search starts. Each move is searched independently until time runs out. The depth is not synchronized in any way, so when you look at Rybka's analysis, you might see this:

Code: Select all

depth=18 time=5.08 node=7028735 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=20 time=6.09 node=8650750 speed=1357325 score=+0.75
pv="Ra1 Kf8 Kf2"

depth=19 time=6.09 node=8650750 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=21 time=7.11 node=10131160 speed=1357325
score=+0.75 pv="Ra1 Kf8 Kf2"

depth=21 time=8.20 node=11654740 speed=1420616
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=16 time=8.20 node=11654740 speed=1420616
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=20 time=10.16 node=15113065 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8" 

depth=17 time=10.16 node=15113065 speed=1520741
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=21 time=11.17 node=16464745 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"

depth=18 time=12.19 node=18357100 speed=1520741
score=+0.52 pv="Rc1 Kf7 Kf2"

depth=18 time=13.20 node=19605795 speed=1520741
score=+0.65 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=14.22 node=20854490 speed=1520741
score=+0.75 pv="Ra1 Kf8 Kf2 Ndc4"

depth=19 time=15.23 node=22244790 speed=1520741
score=+0.56 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=17.38 node=25377725 speed=1460504
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=20 time=17.38 node=25377725 speed=1460504
score=+0.55 pv="g4 Kf7 Ra1 Ke7"

depth=19 time=18.28 node=27213820 speed=1460504
score=+0.52 pv="Rc1 Kf7 Kf2 Ndc4"

depth=22 time=19.30 node=29466620 speed=1527572
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"
If you look at that at first glance, you wonder why is the depth bouncing around going forward and backward all over the place? But if you look more carefully, and find just the kibitzes for the same move and ignore the others, you will find that the depths and scores now look sensible.

The primary fault with this is that the speedup is zero. If you watch the typical program, 90% of the search effort is on the _first_ move. Here, each node is searching a "first move" which breaks alpha/beta completely. Even worse, there is a depth issue in that each of the root moves are often searched to different depths. Which is better, +1 at depth 17 or +.01 at depth 21? That was the problem Monty discovered and that is why this is no good, without even considering the issue that the distributed search is not speeding up the process one bit.

I have no idea why such an algorithm would be used. There is no up-side to doing this and using a single node would be no slower. And it actually would be a bit faster since all root moves get searched to the same depth so that there is no issue in comparing scores from different depths.

At least the hyperbole about the "40 core monster" has been discovered. Yes, Rybka is extremely strong. But the cluster Rybka is absolutely no stronger and perhaps a bit weaker due to the unsynchronized depth issue.

live and learn...

BTW, after watching this game it is apparent that the requirement for kibitzing node counts, NPS, depth, etc is worthless. :) The numbers being kibitzed are absolute nonsense with respect to the accepted meaning of the terms... Several of us noticed the NPS numbers matching at several different points. That never happens for the rest of us. :)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

sje wrote:The memory is hazy, but I seem to recall that back in the early 1980s Fidelity showed up at a no-holds-barred CC event with a custom dedicated unit with thirty-plus processors. Perhaps they used the same algorithm in that unit.
I don't remember anything with 30+ processors back then. I believe Schaeffer use a 16-node cluster at one point in time. But that algorithm simply doesn't offer anything. Even back in the days where the branching factor was 6, you could (optimally) get a speedup of 1.5X splitting at the root, usually much less...

Claiming 100+ elo would make anyone understanding parallel search fall off their chair laughing... That translates to 2+ plies of search. On a 5-node cluster. A speedup of at least 4.0. Splitting only at the root. No shared alpha/beta values. Pretty funny stuff...
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

jswaff wrote:I recently ran a "root splitting only" experiment with Prophet. Each thread would get a root move and search it, and update alpha and the PV when finished if the score > alpha. With two processors I saw a speedup of 1.56. With 3 processors 1.89, with 4: 2.08, with 5: 2.17, with 6: 2.27, with 7: 2.27, with 8: 2.27. The root move ordering was simply : make sure the first move from the last iteration is first, and sort the rest by node count. So I have to agree, the algorithm sucks from a scalability perspective. Still, it's something.

The "unsynchronized depth" issue is sticky. I can't think of a sensible way to handle that.
2.27 is actually remarkable and suggests some real move ordering issues, unless you produced that number on some _tactical_ positions only. There you can get above 2, because tactical positions are those where move ordering is bad and the final "best move" is a "surprise" that rises to the top at the end of the search. Here are some numbers from yesterday's tournament:

Code: Select all

               19     3:44   0.30   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    <HT>
               19->   3&#58;53   0.30   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    <HT>
               20     4&#58;05   0.27   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    27. Nc4 Nxc4 28. Bxc4
               20->   4&#58;19   0.27   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    27. Nc4 Nxc4 28. Bxc4 &#40;s=2&#41;




               18    14.71   0.21   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Bd4 26. Rc1 Rxc1
                                    27. Qxc1
               18->  19.09   0.21   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Bd4 26. Rc1 Rxc1
                                    27. Qxc1
               19    24.20   0.20   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Rc7 26. Rc1 Rac8
                                    27. Rxc7 Rxc7
               19->  30.63   0.20   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Rc7 26. Rc1 Rac8
                                    27. Rxc7 Rxc7
               20     1&#58;01   0.28   18. ... Rfc8 19. Bd2 Bd8 20. Ng3 Qa7
                                    21. Rf1 Nb7 22. b4 a5 23. Qe2 axb4
                                    24. axb4 Qxa1 25. Rxa1 Rxa1+ 26. Be1
                                    Rac1 27. Bxb5 R8c2 28. Qd3 Bxb5 29.
                                    Qxb5
               20->   1&#58;21   0.28   18. ... Rfc8 19. Bd2 Bd8 20. Ng3 Qa7
                                    21. Rf1 Nb7 22. b4 a5 23. Qe2 axb4
                                    24. axb4 Qxa1 25. Rxa1 Rxa1+ 26. Be1
                                    Rac1 27. Bxb5 R8c2 28. Qd3 Bxb5 29.
                                    Qxb5

Let's just take the last search output (this came from the Rybka game so it is a good example from a real game rather than being a test position. the first move at the root took 14.7 seconds (depth=18) and the remainder of the moves took just over 4 seconds total. If you split at the root, the processor that gets that move (best move, Rfc8) is going to take 14.7 seconds. Let's assume that the other moves get searched in zero time (which is not possible, but is an "optimal" case.) So rather than taking a total of 19 seconds, you are done in 14.7 seconds. Speedup = 19.09 / 14.7, 1.3x being generous. For depth 19, the first move takes 24.2 seconds, all the reas take 6.6 seconds. Speedup = 30.6 / 24.2. This is easy to understand. If, instead, you take a position where move ordering is bad, you might get this kind of output:

Code: Select all

               17->  59.74   0.28   19. ... Kf8 20. Bd2 Qb6 21. Ng3 Qd8
                                    22. Be3 Nb7 23. b4 Rc3 24. Ne2 Rcc8
                                    25. Rc1 Rxc1 26. Qxc1 Rc8 27. Nc3 a5
                                    28. bxa5 Nxa5 &#40;s=6&#41;
               18     1&#58;12   0.35   19. ... Kf8 20. Ng3 Qb8 21. Be3 Nb7
                                    22. b4 Rc3 23. Rb3 Rc8 24. Qd2 a5 25.
                                    Rc1 axb4 26. axb4 Rxc1+ 27. Qxc1 Ra4
                                    28. Qf1 Qe8 &#40;s=5&#41;
               18     1&#58;16   0.29   19. ... Bd8 20. Ng3 Qb8 21. Bd2 Qa7
                                    22. Qe2 Nb7 23. b4 Bb6 24. Red1 h6
                                    25. Rbc1 Rab8 26. Rf1 Ra8 27. Bc3 Qb8
                                    &#40;s=4&#41;
               18->   1&#58;27   0.29   19. ... Bd8 20. Ng3 Qb8 21. Bd2 Qa7
                                    22. Qe2 Nb7 23. b4 Bb6 24. Red1 h6
                                    25. Rbc1 Rab8 26. Rf1 Ra8 27. Bc3 Qb8
                                    &#40;s=3&#41;

In this case, we finish depth 17 at 60 seconds. At depth 18, it takes 12 seconds to find the score for Kf8, then another 4 seconds ti discover that Bd8 is even better, then 11 seconds to search the rest of the moves. We get a larger gain here since we will search Kf8 and Bd8 at the root and get a score back for both in a total of 12 seconds, rather than the serial 16 seconds. So we save the 4 second search time since it is done in parallel with the 12 second search, and we are done after 12 seconds total rather than after 16 (again, all the others we assume go in zero time to be as favorable to this algorithm timing as possible).

This kind of case is where you get the most from splitting at the root, and with today's ultra-low branching factor, what you get is very little.

You didn't say what you are doing elsewhere, LMR? NULL R=3? Etc. The lower your branching factor the worse this looks...

I brought this up because of the various discussions and comments yesterday, particularly people asking about "why all the different depths? why is the depth bouncing up and down? How fast is this 5-node 8-core-per-node cluster? Etc." The answer is it is no faster than a normal 8-core 1-node box. I am finally making preparations to do a _real_ cluster approach. And I know I am not going to get 100 Elo from a 5-node parallel search. You lose too much. No shared hash table. No shared killers. No shared anything whatsoever. So the losses are large. But if you can get a real 2.5 speedup out of 5 nodes, that would be significant, as it would be like searching 50M nodes per second on the hardware I used yesterday when we were typically searching 20M nodes per second. Or more importantly, if I can get 35x on the big 70-node 8-core node, that will be serious numbers. 700M nodes per second is serious computing. I have no idea whether I can get that kind of speedup. I believe I can get 2.5x out of 5. But stretching that ratio to 70 is improbable. But anything > 1.0 is good...
bob wrote:I had a chance to watch Rybka's analysis during our PanAM game today, and what I saw was a far cry from what I was expecting.

Several were noticing that Rybka's depth and score were bouncing all over the place. I finally figured out why on a position where Crafty played QxQ and Rybka had one recapture move (RxQ) and several other moves. We were seeing all sorts of -9.0 scores being kibitzed. As I went through them, what I discovered was this: Rybka uses an algorithm first tried (and then discarded because it is useless) back around 1978 or so by Monty Newborn. What is happening is this: Each root move is given to a different cluster node when the first iteration of the search starts. Each move is searched independently until time runs out. The depth is not synchronized in any way, so when you look at Rybka's analysis, you might see this:

Code: Select all

depth=18 time=5.08 node=7028735 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=20 time=6.09 node=8650750 speed=1357325 score=+0.75
pv="Ra1 Kf8 Kf2"

depth=19 time=6.09 node=8650750 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=21 time=7.11 node=10131160 speed=1357325
score=+0.75 pv="Ra1 Kf8 Kf2"

depth=21 time=8.20 node=11654740 speed=1420616
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=16 time=8.20 node=11654740 speed=1420616
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=20 time=10.16 node=15113065 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8" 

depth=17 time=10.16 node=15113065 speed=1520741
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=21 time=11.17 node=16464745 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"

depth=18 time=12.19 node=18357100 speed=1520741
score=+0.52 pv="Rc1 Kf7 Kf2"

depth=18 time=13.20 node=19605795 speed=1520741
score=+0.65 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=14.22 node=20854490 speed=1520741
score=+0.75 pv="Ra1 Kf8 Kf2 Ndc4"

depth=19 time=15.23 node=22244790 speed=1520741
score=+0.56 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=17.38 node=25377725 speed=1460504
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=20 time=17.38 node=25377725 speed=1460504
score=+0.55 pv="g4 Kf7 Ra1 Ke7"

depth=19 time=18.28 node=27213820 speed=1460504
score=+0.52 pv="Rc1 Kf7 Kf2 Ndc4"

depth=22 time=19.30 node=29466620 speed=1527572
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"
If you look at that at first glance, you wonder why is the depth bouncing around going forward and backward all over the place? But if you look more carefully, and find just the kibitzes for the same move and ignore the others, you will find that the depths and scores now look sensible.

The primary fault with this is that the speedup is zero. If you watch the typical program, 90% of the search effort is on the _first_ move. Here, each node is searching a "first move" which breaks alpha/beta completely. Even worse, there is a depth issue in that each of the root moves are often searched to different depths. Which is better, +1 at depth 17 or +.01 at depth 21? That was the problem Monty discovered and that is why this is no good, without even considering the issue that the distributed search is not speeding up the process one bit.

I have no idea why such an algorithm would be used. There is no up-side to doing this and using a single node would be no slower. And it actually would be a bit faster since all root moves get searched to the same depth so that there is no issue in comparing scores from different depths.

At least the hyperbole about the "40 core monster" has been discovered. Yes, Rybka is extremely strong. But the cluster Rybka is absolutely no stronger and perhaps a bit weaker due to the unsynchronized depth issue.

live and learn...

BTW, after watching this game it is apparent that the requirement for kibitzing node counts, NPS, depth, etc is worthless. :) The numbers being kibitzed are absolute nonsense with respect to the accepted meaning of the terms... Several of us noticed the NPS numbers matching at several different points. That never happens for the rest of us. :)
jswaff

Re: Cluster Rybka

Post by jswaff »

Actually, yes, that test was done with significant portions of the program "disabled." I'll repeat the experiment this evening with "everything on" and post some new results tomorrow morning. I'm sure with improved move ordering the speedups will only get worse, as you say.

Here are some results I had with splitting only at the root using completely random move ordering:

Code: Select all

CPUS	Time	Speedup	Nodes	NPS
1	1.00	1.00	1.00	1.00
2	0.49	2.03	0.98	1.94
3	0.37	2.68	1.06	2.77
4	0.32	3.16	1.14	3.58
5	0.28	3.54	1.22	4.30
6	0.27	3.76	1.31	4.95
7	0.26	3.90	1.39	5.50
8	0.25	3.95	1.49	6.00
Even then I didn't quite get 4X (though NPS scaling is inhibiting that somewhat).

bob wrote:
jswaff wrote:I recently ran a "root splitting only" experiment with Prophet. Each thread would get a root move and search it, and update alpha and the PV when finished if the score > alpha. With two processors I saw a speedup of 1.56. With 3 processors 1.89, with 4: 2.08, with 5: 2.17, with 6: 2.27, with 7: 2.27, with 8: 2.27. The root move ordering was simply : make sure the first move from the last iteration is first, and sort the rest by node count. So I have to agree, the algorithm sucks from a scalability perspective. Still, it's something.

The "unsynchronized depth" issue is sticky. I can't think of a sensible way to handle that.
2.27 is actually remarkable and suggests some real move ordering issues, unless you produced that number on some _tactical_ positions only. There you can get above 2, because tactical positions are those where move ordering is bad and the final "best move" is a "surprise" that rises to the top at the end of the search. Here are some numbers from yesterday's tournament:

Code: Select all

               19     3&#58;44   0.30   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    <HT>
               19->   3&#58;53   0.30   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    <HT>
               20     4&#58;05   0.27   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    27. Nc4 Nxc4 28. Bxc4
               20->   4&#58;19   0.27   18. ... Rfc8 19. Ne3 Qb6 20. b3 Kf8
                                    21. Nh4 g6 22. Nf3 Rab8 23. Rc1 Rxc1
                                    24. Bxc1 b4 25. axb4 Qxb4 26. Bd2 Qa3
                                    27. Nc4 Nxc4 28. Bxc4 &#40;s=2&#41;




               18    14.71   0.21   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Bd4 26. Rc1 Rxc1
                                    27. Qxc1
               18->  19.09   0.21   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Bd4 26. Rc1 Rxc1
                                    27. Qxc1
               19    24.20   0.20   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Rc7 26. Rc1 Rac8
                                    27. Rxc7 Rxc7
               19->  30.63   0.20   18. ... Rfc8 19. Bd2 Bd8 20. Ne3 Qa7
                                    21. g4 h6 22. Nf5 Bxf5 23. gxf5 Bb6
                                    24. Rf1 Nb7 25. b4 Rc7 26. Rc1 Rac8
                                    27. Rxc7 Rxc7
               20     1&#58;01   0.28   18. ... Rfc8 19. Bd2 Bd8 20. Ng3 Qa7
                                    21. Rf1 Nb7 22. b4 a5 23. Qe2 axb4
                                    24. axb4 Qxa1 25. Rxa1 Rxa1+ 26. Be1
                                    Rac1 27. Bxb5 R8c2 28. Qd3 Bxb5 29.
                                    Qxb5
               20->   1&#58;21   0.28   18. ... Rfc8 19. Bd2 Bd8 20. Ng3 Qa7
                                    21. Rf1 Nb7 22. b4 a5 23. Qe2 axb4
                                    24. axb4 Qxa1 25. Rxa1 Rxa1+ 26. Be1
                                    Rac1 27. Bxb5 R8c2 28. Qd3 Bxb5 29.
                                    Qxb5

Let's just take the last search output (this came from the Rybka game so it is a good example from a real game rather than being a test position. the first move at the root took 14.7 seconds (depth=18) and the remainder of the moves took just over 4 seconds total. If you split at the root, the processor that gets that move (best move, Rfc8) is going to take 14.7 seconds. Let's assume that the other moves get searched in zero time (which is not possible, but is an "optimal" case.) So rather than taking a total of 19 seconds, you are done in 14.7 seconds. Speedup = 19.09 / 14.7, 1.3x being generous. For depth 19, the first move takes 24.2 seconds, all the reas take 6.6 seconds. Speedup = 30.6 / 24.2. This is easy to understand. If, instead, you take a position where move ordering is bad, you might get this kind of output:

Code: Select all

               17->  59.74   0.28   19. ... Kf8 20. Bd2 Qb6 21. Ng3 Qd8
                                    22. Be3 Nb7 23. b4 Rc3 24. Ne2 Rcc8
                                    25. Rc1 Rxc1 26. Qxc1 Rc8 27. Nc3 a5
                                    28. bxa5 Nxa5 &#40;s=6&#41;
               18     1&#58;12   0.35   19. ... Kf8 20. Ng3 Qb8 21. Be3 Nb7
                                    22. b4 Rc3 23. Rb3 Rc8 24. Qd2 a5 25.
                                    Rc1 axb4 26. axb4 Rxc1+ 27. Qxc1 Ra4
                                    28. Qf1 Qe8 &#40;s=5&#41;
               18     1&#58;16   0.29   19. ... Bd8 20. Ng3 Qb8 21. Bd2 Qa7
                                    22. Qe2 Nb7 23. b4 Bb6 24. Red1 h6
                                    25. Rbc1 Rab8 26. Rf1 Ra8 27. Bc3 Qb8
                                    &#40;s=4&#41;
               18->   1&#58;27   0.29   19. ... Bd8 20. Ng3 Qb8 21. Bd2 Qa7
                                    22. Qe2 Nb7 23. b4 Bb6 24. Red1 h6
                                    25. Rbc1 Rab8 26. Rf1 Ra8 27. Bc3 Qb8
                                    &#40;s=3&#41;

In this case, we finish depth 17 at 60 seconds. At depth 18, it takes 12 seconds to find the score for Kf8, then another 4 seconds ti discover that Bd8 is even better, then 11 seconds to search the rest of the moves. We get a larger gain here since we will search Kf8 and Bd8 at the root and get a score back for both in a total of 12 seconds, rather than the serial 16 seconds. So we save the 4 second search time since it is done in parallel with the 12 second search, and we are done after 12 seconds total rather than after 16 (again, all the others we assume go in zero time to be as favorable to this algorithm timing as possible).

This kind of case is where you get the most from splitting at the root, and with today's ultra-low branching factor, what you get is very little.

You didn't say what you are doing elsewhere, LMR? NULL R=3? Etc. The lower your branching factor the worse this looks...

I brought this up because of the various discussions and comments yesterday, particularly people asking about "why all the different depths? why is the depth bouncing up and down? How fast is this 5-node 8-core-per-node cluster? Etc." The answer is it is no faster than a normal 8-core 1-node box. I am finally making preparations to do a _real_ cluster approach. And I know I am not going to get 100 Elo from a 5-node parallel search. You lose too much. No shared hash table. No shared killers. No shared anything whatsoever. So the losses are large. But if you can get a real 2.5 speedup out of 5 nodes, that would be significant, as it would be like searching 50M nodes per second on the hardware I used yesterday when we were typically searching 20M nodes per second. Or more importantly, if I can get 35x on the big 70-node 8-core node, that will be serious numbers. 700M nodes per second is serious computing. I have no idea whether I can get that kind of speedup. I believe I can get 2.5x out of 5. But stretching that ratio to 70 is improbable. But anything > 1.0 is good...
bob wrote:I had a chance to watch Rybka's analysis during our PanAM game today, and what I saw was a far cry from what I was expecting.

Several were noticing that Rybka's depth and score were bouncing all over the place. I finally figured out why on a position where Crafty played QxQ and Rybka had one recapture move (RxQ) and several other moves. We were seeing all sorts of -9.0 scores being kibitzed. As I went through them, what I discovered was this: Rybka uses an algorithm first tried (and then discarded because it is useless) back around 1978 or so by Monty Newborn. What is happening is this: Each root move is given to a different cluster node when the first iteration of the search starts. Each move is searched independently until time runs out. The depth is not synchronized in any way, so when you look at Rybka's analysis, you might see this:

Code: Select all

depth=18 time=5.08 node=7028735 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=20 time=6.09 node=8650750 speed=1357325 score=+0.75
pv="Ra1 Kf8 Kf2"

depth=19 time=6.09 node=8650750 speed=1357325 score=+0.82
pv="Kf2 Nec4 Ra1 Kf8"

depth=21 time=7.11 node=10131160 speed=1357325
score=+0.75 pv="Ra1 Kf8 Kf2"

depth=21 time=8.20 node=11654740 speed=1420616
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=16 time=8.20 node=11654740 speed=1420616
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=20 time=10.16 node=15113065 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8" 

depth=17 time=10.16 node=15113065 speed=1520741
score=+0.74 pv="g4 Kf7 Ra1 Ke7"

depth=21 time=11.17 node=16464745 speed=1520741
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"

depth=18 time=12.19 node=18357100 speed=1520741
score=+0.52 pv="Rc1 Kf7 Kf2"

depth=18 time=13.20 node=19605795 speed=1520741
score=+0.65 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=14.22 node=20854490 speed=1520741
score=+0.75 pv="Ra1 Kf8 Kf2 Ndc4"

depth=19 time=15.23 node=22244790 speed=1520741
score=+0.56 pv="g4 Kf7 Ra1 Ke7"

depth=22 time=17.38 node=25377725 speed=1460504
score=+0.79 pv="Ra1 Kf7 Kf2 Ndc4 a4 Nb2 axb5 Bxb5 Ke3 Ke7 Re1 Kd6 Rc1 Kd7"

depth=20 time=17.38 node=25377725 speed=1460504
score=+0.55 pv="g4 Kf7 Ra1 Ke7"

depth=19 time=18.28 node=27213820 speed=1460504
score=+0.52 pv="Rc1 Kf7 Kf2 Ndc4"

depth=22 time=19.30 node=29466620 speed=1527572
score=+0.82 pv="Kf2 Nec4 Ra1 Kf8"
If you look at that at first glance, you wonder why is the depth bouncing around going forward and backward all over the place? But if you look more carefully, and find just the kibitzes for the same move and ignore the others, you will find that the depths and scores now look sensible.

The primary fault with this is that the speedup is zero. If you watch the typical program, 90% of the search effort is on the _first_ move. Here, each node is searching a "first move" which breaks alpha/beta completely. Even worse, there is a depth issue in that each of the root moves are often searched to different depths. Which is better, +1 at depth 17 or +.01 at depth 21? That was the problem Monty discovered and that is why this is no good, without even considering the issue that the distributed search is not speeding up the process one bit.

I have no idea why such an algorithm would be used. There is no up-side to doing this and using a single node would be no slower. And it actually would be a bit faster since all root moves get searched to the same depth so that there is no issue in comparing scores from different depths.

At least the hyperbole about the "40 core monster" has been discovered. Yes, Rybka is extremely strong. But the cluster Rybka is absolutely no stronger and perhaps a bit weaker due to the unsynchronized depth issue.

live and learn...

BTW, after watching this game it is apparent that the requirement for kibitzing node counts, NPS, depth, etc is worthless. :) The numbers being kibitzed are absolute nonsense with respect to the accepted meaning of the terms... Several of us noticed the NPS numbers matching at several different points. That never happens for the rest of us. :)
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Cluster Rybka

Post by bob »

Those numbers sound pretty reasonable. I'm not so happy with "those" that report numbers that are simply fictional, and which anybody that has done any reading or research into parallel search could immediately recognize as bogus.

I still hope that one day someone will post some real numbers on Rybka's parallel speedup on an 8-way box, by running some "normal" positions to a fixed depth using 1, 2, 4 and 8 processors. He claims to scale better than any other program. Somehow I doubt it. Maybe "as good as" if he is lucky. But so far we just have urban legend to go on. Speedups for my program are quite easy to produce and anybody can do it.
jarkkop
Posts: 198
Joined: Thu Mar 09, 2006 2:44 am
Location: Helsinki, Finland

Re: Cluster Rybka

Post by jarkkop »

Hi

Is the ACCA version of Crafty soon in our disposal?