Cluster Rybka

bob · Post by **bob** » Mon Nov 10, 2008 9:15 pm

jarkkop wrote: Hi

Will the ACCA version of Crafty be at our disposal soon?

Yes. I have one more thing I'd like to tune, dealing with passed pawns. I had it scheduled for last week before the ACCA, but our cluster keeps having IBRIX filesystem inconsistencies. IBRIX has been removed and we are going to straight NFS to get the reliability back. The cluster should be back up tonight, and version 22.2 will be released later this week. This version we _know_ is significantly better than all previous versions, with no room for error except what is introduced in an opening book...

M ANSARI · Post by **M ANSARI** » Mon Nov 10, 2008 11:03 pm

I don't think you can make a judgement regarding the cluster by simply presuming that it is using a certain algo. The classic way of thinking how a cluster works ... by sharing memory and acting as one large parallel unit might not be the most practical or best option using PCs. Personally I think there are better ways of using multiple cores in a cluster. It is a well-known fact that while you can add some knowledge that works great in certain positions in an engine, many times that knowledge is not added because it reduces the overall strength of the engine. So instead of using an active cluster as you envision, maybe using a passive cluster is more useful and easier to implement. I am not sure Vas is using a passive cluster but I wouldn't be surprised.

In the case of a passive cluster you cannot really make it play weaker than its core Master machine unless you are an idiot. You can set up the other engines to be experts in special fields that get activated only if a certain pre-determined change-of-evaluation trigger is reached. I can think of a scenario where such an option would immediately add a few ELO points to a master engine if you use the Master without EGTBs and have the EGTBs handled by an independent slave computer. Not having to search EGTBs alone would make the master computer stronger ... and of course it will still have the safety net of being able to solve any resulting 6 piece EGTB by the slave. So an active cluster sharing memory does not necessarily have to be the only way to use additional cores, passive might be the way to go forward.

The strongest chess ever played has probably been in Freestyle chess ... a good Freestyle chess player usually has several motherboards running independently ... and usually runs several flavors of engines to make sure all moves get a sanity check ... and will sometimes probe a position to see what it will reveal. In certain situations that person will switch from one engine to another depending on the dynamics of the game. He will know which engine performs better where. Rybka is blessed because in its arsenal it has Rybka Human, Rybka Default and Rybka Dynamic. Since Vas is one of the top Freestyle players, I am sure he understands best how to try to mimic a Freestyle player, and the 100 ELO points would have resulted from actual testing against a simple 8 core Rybka 3. The Cluster Rybka by the way has played quite a few tournaments on Playchess where it faced other Octa Rybkas and apparently came out ahead quite easily. So the 100 ELO gain does seem to be reasonable.

diep · Post by **diep** » Mon Nov 10, 2008 11:13 pm

Huber algorithm at 40 logical core nehalem which gets released within 6 months or so anyway:

Such giant shared hashtable, you never know which PV comes first

Big problem of cluster is how to do the hashtable? Fastest manner of getting something from a remote node is about 4-6 us. Say most expensive network now is roughly 2 us to do that, that is if every core idles and you got a special core just to do MPI communication / windows MPI communication.

Losing a core is not so clever.

What's Rybka getting. About 4 million nps a core times 8 cores = 32 mln nps or so a node.

How to communicate that if on average it'll be a microsecond or 5 to communicate hashtable remote?

You can do that maybe 200k times a second.
So 1 read from hashtable and 1 store, that's 100k nps.

That's all communication from and to together.

Realize that communication speed is for all nodes together; not only do you want to get hash entries from remote nodes, they also want to get hash entries from you.

Not storing hashtable last 6 plies or so loses you a ply or 3 in search depth

Vincent
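The arithmetic in the post above can be checked with a few lines of Python. This is only a back-of-envelope sketch using the post's own rough figures (about 5 us per remote hash access, one probe plus one store per searched node, 4 million nps per core on 8 cores); the function names are made up for illustration, and "machine" is used for a cluster node to avoid confusion with search nodes.

```python
# Back-of-envelope check of the remote-hashtable budget described above.
# All figures are the post's rough estimates, not measurements.

REMOTE_LATENCY_US = 5.0      # assumed average remote hash access, in microseconds
ACCESSES_PER_NODE = 2        # one probe plus one store per searched node
NPS_PER_CORE = 4_000_000     # the post's estimate for Rybka, per core
CORES_PER_MACHINE = 8


def remote_accesses_per_second(latency_us: float) -> float:
    """Serialized remote accesses that fit into one second at this latency."""
    return 1_000_000 / latency_us


def hashed_nodes_per_second(latency_us: float, accesses_per_node: int) -> float:
    """Search nodes per second that can get a full remote probe plus store."""
    return remote_accesses_per_second(latency_us) / accesses_per_node


machine_nps = NPS_PER_CORE * CORES_PER_MACHINE
budget = hashed_nodes_per_second(REMOTE_LATENCY_US, ACCESSES_PER_NODE)

print(f"remote accesses/s:           {remote_accesses_per_second(REMOTE_LATENCY_US):,.0f}")
print(f"nodes/s with remote hashing: {budget:,.0f}")
print(f"search rate per machine:     {machine_nps:,}")
print(f"search outruns the network by a factor of {machine_nps / budget:,.0f}")
```

Under these assumptions one machine searches a few hundred times more nodes per second than the network can serve with remote hash traffic, which is why the post talks about not hashing the last plies remotely.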

bob · Post by **bob** » Mon Nov 10, 2008 11:23 pm

M ANSARI wrote:I don't think you can make a judgement regarding the cluster by simply presuming that it is using a certain algo.

Note that I did not "presume" anything. I saw the actual output which clearly shows the underlying algorithm. And I doubt using 5 nodes is worse than using 1 although it is absolutely possible. But as I said, I am certain it is no better than using one. This is old stuff with 30 years of history behind it.

The classic way of thinking how a cluster works ... by sharing memory and acting as one large parallel unit might not be the most practical or best option using PCs. Personally I think there are better ways of using multiple cores in a cluster. It is a well-known fact that while you can add some knowledge that works great in certain positions in an engine, many times that knowledge is not added because it reduces the overall strength of the engine. So instead of using an active cluster as you envision, maybe using a passive cluster is more useful and easier to implement. I am not sure Vas is using a passive cluster but I wouldn't be surprised.

I am absolutely certain of what he is doing, nothing else explains the output we saw yesterday and the data was obvious once I began to look at the output carefully. It was producing so much "noise" that at first I paid no attention until someone asked "why is its depth bouncing up, down, back up, and why are the scores all over the place?" A quick and careful look explained what was going on.

In the case of a passive cluster you cannot really make it play weaker than its core Master machine unless you are an idiot.

It is actually not hard at all to be quite good and still play worse. Message-passing overhead can creep in without your knowing. It's happened many times here as people use MPI/OpenMP/etc. to write parallel programs that get worse as more nodes are added.

You can set up the other engines to be experts in special fields that get activated only if a certain pre-determined change-of-evaluation trigger is reached. I can think of a scenario where such an option would immediately add a few ELO points to a master engine if you use the Master without EGTBs and have the EGTBs handled by an independent slave computer. Not having to search EGTBs alone would make the master computer stronger ... and of course it will still have the safety net of being able to solve any resulting 6 piece EGTB by the slave. So an active cluster sharing memory does not necessarily have to be the only way to use additional cores, passive might be the way to go forward.

The strongest chess ever played has probably been in Freestyle chess ... a good Freestyle chess player usually has several motherboards running independently ... and usually runs several flavors of engines to make sure all moves get a sanity check ... and will sometimes probe a position to see what it will reveal. In certain situations that person will switch from one engine to another depending on the dynamics of the game. He will know which engine performs better where. Rybka is blessed because in its arsenal it has Rybka Human, Rybka Default and Rybka Dynamic. Since Vas is one of the top Freestyle players, I am sure he understands best how to try to mimic a Freestyle player, and the 100 ELO points would have resulted from actual testing against a simple 8 core Rybka 3. The Cluster Rybka by the way has played quite a few tournaments on Playchess where it faced other Octa Rybkas and apparently came out ahead quite easily. So the 100 ELO gain does seem to be reasonable.

Freestyle is a completely different animal. Totally unrelated to playing a single game against a single opponent, and playing the best move you can play using all available hardware. For oddball things, different solutions are possible. But for playing a single game, this is not a new topic. There is almost 30 years of work on distributed chess available, with probably every idea one might ever think of already tried and tested.

100 elo against a single opponent in normal chess is not going to come from this approach. You need a speedup of a factor of 4 to reach that. 5 nodes splitting at the root will be hard-pressed to produce a speedup of even 1.5x...
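The figures in the last paragraph (a 4x speedup for 100 Elo) work out to roughly 50 Elo per doubling of speed. The sketch below assumes the gain grows with the base-2 logarithm of the speedup, which is a common rule of thumb and an assumption here, not something the post states; the constant and function names are illustrative only.

```python
import math

# Assumption read off the post: 4x speedup (two doublings) is worth about 100 Elo.
ELO_PER_DOUBLING = 50.0


def elo_gain(speedup: float) -> float:
    """Estimated Elo gain for a given search speedup under the log2 rule of thumb."""
    return ELO_PER_DOUBLING * math.log2(speedup)


def speedup_needed(elo: float) -> float:
    """Speedup required for a given Elo gain under the same rule of thumb."""
    return 2.0 ** (elo / ELO_PER_DOUBLING)


for s in (1.5, 2.0, 4.0):
    print(f"speedup {s:.1f}x -> about {elo_gain(s):.0f} Elo")
print(f"100 Elo needs about {speedup_needed(100.0):.1f}x")
```

With this model a 1.5x speedup is worth somewhere under 30 Elo, far short of the 100 Elo being claimed.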

bob · Post by **bob** » Mon Nov 10, 2008 11:29 pm

diep wrote:Huber algorithm at 40 logical core nehalem which gets released within 6 months or so anyway:

Such giant shared hashtable, you never know which PV comes first

Big problem of cluster is how to do the hashtable? Fastest manner of getting something from a remote node is about 4-6 us. Say most expensive network now is roughly 2 us to do that, that is if every core idles and you got a special core just to do MPI communication / windows MPI communication.

Losing a core is not so clever.

What's Rybka getting. About 4 million nps a core times 8 cores = 32 mln nps or so a node.

Most of the time yesterday Rybka was reporting about 1.6M nodes per second. I consider that a laughable number. From several different perspectives. Anybody that does a parallel search knows that you won't get the _exact_ same nodes per second on two different positions. By exact, I mean 1602399 as one example. Yet we saw several of these yesterday.

40 cores and 1.6M nodes per second? 40K nodes per second per core?

There is obfuscation and there is obfuscation...

How to communicate that if on average it'll be a microsecond or 5 to communicate hashtable remote?

You can do that maybe 200k times a second.
So 1 read from hashtable and 1 store, that's 100k nps.

That's all communication from and to together.

Realize that communication speed is for all nodes together; not only do you want to get hash entries from remote nodes, they also want to get hash entries from you.

Not storing hashtable last 6 plies or so loses you a ply or 3 in search depth

Vincent

Based on yesterday, Rybka is simply splitting at the root. We were seeing multiple PVs from different processors for each depth. And they were not "synchronized", so that often when he ran out of time, he had 18, 19, 20, 21 and 22 ply search results with different "best moves and scores." In one case, a depth 19 search was +.3 and a depth 21 search was -.2... which one do you play in a real game since the depths are not the same? Looked silly to me and dates back to Monty's "Unsynchronized parallel search" paper of around 1978-1979 or so...
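To make the "which one do you play" problem concrete, here is a toy sketch of root splitting in general. It is not a reconstruction of Rybka's cluster code: the move list, the round-robin dealing and both selection policies are assumptions made for illustration. Only the depth 19 / +0.3 and depth 21 / -0.2 pair comes from the post; the other reports are invented to fill out the 18 to 22 ply range it mentions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Report:
    worker: int
    depth: int     # last iteration this worker completed before time ran out
    score: float   # in pawns, from the side to move
    move: str      # best root move within this worker's share


def split_root_moves(moves: list[str], workers: int) -> list[list[str]]:
    """Deal the root moves round-robin so each worker searches its own subset."""
    return [moves[i::workers] for i in range(workers)]


def pick_deepest(reports: list[Report]) -> Report:
    """Policy A: trust the deepest search, break ties by score."""
    return max(reports, key=lambda r: (r.depth, r.score))


def pick_highest_score(reports: list[Report]) -> Report:
    """Policy B: trust the best score, break ties by depth."""
    return max(reports, key=lambda r: (r.score, r.depth))


root_moves = ["e4", "d4", "Nf3", "c4", "g3", "b3", "f4"]  # hypothetical position
shares = split_root_moves(root_moves, workers=5)

# Unsynchronized workers: each one stopped at a different depth.
reports = [
    Report(worker=0, depth=19, score=+0.3, move="e4"),   # from the post
    Report(worker=1, depth=21, score=-0.2, move="d4"),   # from the post
    Report(worker=2, depth=18, score=+0.1, move="Nf3"),  # invented
    Report(worker=3, depth=20, score=0.0, move="c4"),    # invented
    Report(worker=4, depth=22, score=-0.1, move="g3"),   # invented
]

print("shares:", shares)
print("deepest-first plays:", pick_deepest(reports).move)
print("best-score plays:   ", pick_highest_score(reports).move)
```

The two policies pick different moves from the same set of reports, and neither is obviously right, because scores from different depths are not directly comparable.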

diep · Post by **diep** » Tue Nov 11, 2008 12:34 am

What Cozzie & co are doing in Rybka who cares?

Deceiving others is their job, maybe. Not new. To quote Frans Morsch:

"In 80s/90s the big secret from Fritz and Genius was that no one realized how many nodes per second we got"

Tiger also didn't print the nps correctly as we count it Bob, it got millions of nps long before others did.

I tend to recall that a few years ago Tiger reported 1 mln nps where it was in fact several millions of nps.

I feel Shay is doing things more clever there with Junior, from marketing viewpoint seen. It claims more nps than it gets

Vincent

diep · Post by **diep** » Tue Nov 11, 2008 12:48 am

Bob, the real bottom line is, shared hashtables with some dozens of cores is real important. Not only the transposition effect is there, the transposition effect also aborts really a lot of useless searches.

That makes having a shared hashtable of some sort real important.

Getting that to work on a cluster with just a few nodes with the total nps far bigger than the communication latency possible, is really complicated.

Your idea of delayed hash lookup has been tried already. It's speedup 2 out of 30.

With YBW another effect seems to be that having a hashtable in the last few plies shortens the time needed to put other CPUs to work. What I tried back in 2003 on a supercomputer was to split closer to the leaves in PV nodes than in non-PV nodes. That didn't help much either. A really significant measurement was of course not possible. That will soon be easier to do with 32 core Nehalem, 64 thread machines (4 sockets).

Seeing any chance to obtain such a box for your department?

Vincent

P.S. I would argue that a cluster where the hash pressure is some 1000 times bigger than the communication speed to the remote hash is only interesting if you have really a lot of nodes in it.

Zach Wegner · Post by **Zach Wegner** » Tue Nov 11, 2008 1:00 am

diep wrote:What Cozzie & co are doing in Rybka who cares?

Deceiving others is their job, maybe. Not new.

Cozzie in Rybka? You must be confused...

Uri Blass · Post by **Uri Blass** » Tue Nov 11, 2008 1:50 am

bob wrote:Those numbers sound pretty reasonable. I'm not so happy with "those" that report numbers that are simply fictional, and which anybody that has done any reading or research into parallel search could immediately recognize as bogus.

I still hope that one day someone will post some real numbers on Rybka's parallel speedup on an 8-way box, by running some "normal" positions to a fixed depth using 1, 2, 4 and 8 processors. He claims to scale better than any other program. Somehow I doubt it. Maybe "as good as" if he is lucky. But so far we just have urban legend to go on. Speedups for my program are quite easy to produce and anybody can do it.

I think that fixed depth may be misleading because Rybka may play better at fixed depth with more cores thanks to doing less pruning at the same depth.

It is possible to test it simply by playing a fixed depth match between Rybka single core and Rybka 4 cores.

Uri
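The result of such a fixed depth match can be turned into an Elo difference with the standard logistic formula, diff = 400 * log10(s / (1 - s)), where s is the score fraction. The sketch below is a minimal helper for that; the match result in it is hypothetical and only shows how the numbers would be plugged in.

```python
import math


def match_score(wins: int, draws: int, losses: int) -> float:
    """Score fraction of a match, counting a draw as half a point."""
    games = wins + draws + losses
    if games == 0:
        raise ValueError("no games played")
    return (wins + 0.5 * draws) / games


def elo_difference(score: float) -> float:
    """Elo difference implied by a score fraction under the logistic Elo model."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


# Hypothetical result, for illustration only: 4 cores vs 1 core at fixed depth.
score = match_score(wins=50, draws=28, losses=22)
print(f"score {score:.2f}, about {elo_difference(score):+.0f} Elo for the 4-core side")
```

If the 4-core side scores clearly above 50% at the same nominal depth, the two configurations are not searching the same tree, and a fixed depth timing comparison would understate or misstate the parallel speedup.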

bob · Post by **bob** » Tue Nov 11, 2008 2:02 am

diep wrote:What Cozzie & co are doing in Rybka who cares?

Deceiving others is their job, maybe. Not new. To quote Frans Morsch:

"In 80s/90s the big secret from Fritz and Genius was that no one realized how many nodes per second we got"

Tiger also didn't print the nps correctly as we count it Bob, it got millions of nps long before others did.

I tend to recall that a few years ago Tiger reported 1 mln nps where it was in fact several millions of nps.

I feel Shay is doing things more clever there with Junior, from marketing viewpoint seen. It claims more nps than it gets

Vincent

Nothing surprises me. My numbers are always real and anyone can try to reproduce them on identical hardware for confirmation.
