asynchronous search

Daniel Shawul · Post by **Daniel Shawul** » Tue Apr 06, 2010 11:45 pm

After a fairly disappointing week with implementing YBW to work on a cluster, I am beginning to question if alpha-beta is the right approach for distributed search. My scaling for 16 or 32 processors doesn't equal what i have for shared memory search on quad !! And that is only nps scaling ! Granted there are some bugs in the code and probably not optimized (uses MPI and some STL). But the scaling is so frustrating , I am tempted to use one of those asynchronus search methods (root split or APHID) ?? Because with that atleast my nps scaling would be 100% despite a huge search overhead. Do you think they are worth the trouble because I could save a lot of time that can be used elsewhere

Cluster is connected with a fast ethernet interface with an option to use the gigabit switches. I did the test only on the former but I am not hopeful of big improvements for the gigabit interface. I tested various cluster split depths and found d=8 to be an ok value. What are your split depths ? The distributed code is written on top of the old scorpio SMP code so only one process is started for each node. Under each node enough threads will be created and SMP search with shared tables will be used. This has an obvious advantage but gives some problems regarding load balancing and split depth. What are your thoughts about this ? I could go on and on with my adventure with cluster search but it is better to let others speak first...

Daniel

bob · Post by **bob** » Wed Apr 07, 2010 2:27 am

Daniel Shawul wrote:After a fairly disappointing week with implementing YBW to work on a cluster, I am beginning to question if alpha-beta is the right approach for distributed search. My scaling for 16 or 32 processors doesn't equal what i have for shared memory search on quad !! And that is only nps scaling ! Granted there are some bugs in the code and probably not optimized (uses MPI and some STL). But the scaling is so frustrating , I am tempted to use one of those asynchronus search methods (root split or APHID) ?? Because with that atleast my nps scaling would be 100% despite a huge search overhead. Do you think they are worth the trouble because I could save a lot of time that can be used elsewhere Cluster is connected with a fast ethernet interface with an option to use the gigabit switches. I did the test only on the former but I am not hopeful of big improvements for the gigabit interface. I tested various cluster split depths and found d=8 to be an ok value. What are your split depths ? The distributed code is written on top of the old scorpio SMP code so only one process is started for each node. Under each node enough threads will be created and SMP search with shared tables will be used. This has an obvious advantage but gives some problems regarding load balancing and split depth. What are your thoughts about this ? I could go on and on with my adventure with cluster search but it is better to let others speak first...

Daniel

This is a challenge. As far as split depth goes, it is a trade-off between work and balancing. You'd like to give a node a big hunk of work, as that minimizes the traffic to distribute the search. But that runs into serious load-balancing issues. You'd like to give a node a tiny hunk of work, as that simplifies load-balancing, which is a direct product of the difference between the largest chunk of work and the smallest. As everything gets smaller, the difference shrinks as well. But communication overhead goes up.

I suspect that a "static split depth" is not going to work well at all, any more than it does in a normal SMP search. You should be able to split anywhere from the root (and splitting at the root is a good thing, once you have searched the first move to establish alpha) to some point limited by communication overhead. Probably this limit needs to be "soft" and dynamically adjusted as the search progresses.

Another idea is that as you split, keep up with where you send the work (you can use the hash signature for this) and always try to send the same position to the same node as you deepen the search or encounter transpositions, since this will be more efficient because of hash contents.

First thing to do is to get something working _realiably_. Efficiency comes later once the thing works reliably.

Daniel Shawul · Post by **Daniel Shawul** » Wed Apr 07, 2010 4:45 am

I suspect that a "static split depth" is not going to work well at all, any more than it does in a normal SMP search. You should be able to split anywhere from the root (and splitting at the root is a good thing, once you have searched the first move to establish alpha) to some point limited by communication overhead. Probably this limit needs to be "soft" and dynamically adjusted as the search progresses.

Yes I plan to use variable split depth for different nodes. The d=8 that I mentioned was when
every processor commincated with message passing. Later when I tested by adding SMP search on
processors at the same node, I noticed I had to increase the split depth. Because now the search
power at a given node had doubled. Another issue is communication costs. One can use larger split
depths for compute nodes with larger latency. That is not an issue for me right now but i could
see it as an obstacle.

I have two problems with synchronization. One is that processors have to wait for the first
move to be searched before splitting. Another sync point is when all moves are searched
except the last one which is being searched by a different host. What to do there ?
Currently the owner waits until it is finished. This is only if there is another host helping
at the split. For the SMP search threads will ofcousre help out down the tree sometimes completely
yielding ownership of that node. But for the distributed search I don't know how to do it?
I thought of some ways to solve the issue but I expect that the code can get really messy.
I believe these synchronization issues and communication costs are what are killing the performance.

Splitting at the root the "YBW way" may help some but I do not expect it to improve my situation much.
If the second or third move were the best, the speculative root splitting would result in more effort
being spent on the non-relevant part of the tree. I understand that the chances of that happening are
slim most of the time. I would definately give it a try though.

Related to this is asynchronous iterative search on the root moves. What is your experience with this ?
I plan to do APHID or something similar for a server-client application to be used on a loosely connected
network like on a wifi network. Asynchronous search for distributed search really starts to make sense for
me. When you have a lot of processors at your disposal, it gets harder to keep them busy. With split depth
of 12, ussualy I don't see splits occuring before iteration 15.

Another idea is that as you split, keep up with where you send the work (you can use the hash signature for this) and always try to send the same position to the same node as you deepen the search or encounter transpositions, since this will be more efficient because of hash contents.

Currently each nodes sends a "HELP" message to a randomly selected node, and the node that recieved the help either cancels the request or accepts the node as a helper. That is what Fieldmann and co suggested but they use distributed transpostion tables and I don't, so trying to keep who searched what could be a good idea for me.

First thing to do is to get something working _realiably_. Efficiency comes later once the thing works reliably.

That is top priority now. MPI really helps there hiding the "gorry" details of process communication (sockets, managing system buffer etc..). I am just glad I don't have to do those from scratch.

bob · Post by **bob** » Wed Apr 07, 2010 3:41 pm

Daniel Shawul wrote:
I suspect that a "static split depth" is not going to work well at all, any more than it does in a normal SMP search. You should be able to split anywhere from the root (and splitting at the root is a good thing, once you have searched the first move to establish alpha) to some point limited by communication overhead. Probably this limit needs to be "soft" and dynamically adjusted as the search progresses.
Yes I plan to use variable split depth for different nodes. The d=8 that I mentioned was when
every processor commincated with message passing. Later when I tested by adding SMP search on
processors at the same node, I noticed I had to increase the split depth. Because now the search
power at a given node had doubled. Another issue is communication costs. One can use larger split
depths for compute nodes with larger latency. That is not an issue for me right now but i could
see it as an obstacle.

I have two problems with synchronization. One is that processors have to wait for the first
move to be searched before splitting. Another sync point is when all moves are searched
except the last one which is being searched by a different host. What to do there ?

Don't "sit". I never have a processor "sitting" in Crafty for any significant amount of time, that processor that would be "sitting" will help others that are not.

Currently the owner waits until it is finished. This is only if there is another host helping
at the split. For the SMP search threads will ofcousre help out down the tree sometimes completely
yielding ownership of that node. But for the distributed search I don't know how to do it?

smp vs distributed search should not have any significant differences in how splits and such are done, only the depth limits might be different because of the different cost for doing a split on shared memory vs over a distributed network.

I thought of some ways to solve the issue but I expect that the code can get really messy.
I believe these synchronization issues and communication costs are what are killing the performance.

Splitting at the root the "YBW way" may help some but I do not expect it to improve my situation much.

It is a big win, because the root node is the _only_ node where you are guaranteed to need to search every last branch, so you incur no extra overhead unless you end up changing your mind at the root, where this approach can be better or worse, depending on circumstances.

If the second or third move were the best, the speculative root splitting would result in more effort
being spent on the non-relevant part of the tree. I understand that the chances of that happening are
slim most of the time. I would definately give it a try though.

Related to this is asynchronous iterative search on the root moves. What is your experience with this ?

The idea is simply flawed. How can you compare a move X1, with score Y1 at depth Z1, with a move X2, with score Y2, at depth Z2? Which is more important, depth or score? The deeper score is worse, do you play the shallower move? Could it be it will fail low and be even worse when you get to the depth the other move was searched to? In short, it has been tried (Monty Newborn first did this back in the late 70's) and it is simply a bad idea when compared to doing a more normal search.

I plan to do APHID or something similar for a server-client application to be used on a loosely connected
network like on a wifi network. Asynchronous search for distributed search really starts to make sense for
me. When you have a lot of processors at your disposal, it gets harder to keep them busy. With split depth
of 12, ussualy I don't see splits occuring before iteration 15.

Another idea is that as you split, keep up with where you send the work (you can use the hash signature for this) and always try to send the same position to the same node as you deepen the search or encounter transpositions, since this will be more efficient because of hash contents.

Currently each nodes sends a "HELP" message to a randomly selected node, and the node that recieved the help either cancels the request or accepts the node as a helper. That is what Fieldmann and co suggested but they use distributed transpostion tables and I don't, so trying to keep who searched what could be a good idea for me.

First thing to do is to get something working _realiably_. Efficiency comes later once the thing works reliably.
That is top priority now. MPI really helps there hiding the "gorry" details of process communication (sockets, managing system buffer etc..). I am just glad I don't have to do those from scratch.

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Wed Apr 07, 2010 5:58 pm

bob wrote: The idea is simply flawed. How can you compare a move X1, with score Y1 at depth Z1, with a move X2, with score Y2, at depth Z2? Which is more important, depth or score?

For the love of God not this discussion again! We already went over this extensively.

I completely and utterly disagree with you. The idea is perfectly sound, used successfully in other programs and related algorithms and even inside Crafty itself even though you still fail to understand and admit it.

Highest score wins no matter the depth. If this did not hold, reductions/LMR wouldn't work, but they do.

If there is a problem with this approach, THIS is NOT it.

And that's the end of this part of the discussion for me

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Wed Apr 07, 2010 6:07 pm

Daniel Shawul wrote: I have two problems with synchronization. One is that processors have to wait for the first
move to be searched before splitting.

How can this suddenly be a problem? The exact same thing happens with SMP!

The wait time is the time to complete a single 'splitdepth' search on the PV, so even if it now takes a bit more (because splitdepth is like 8 instead of 6) it's still tiny in the grand scheme of things.

Ok, I am lying a bit: it's that time plus the time to send the splitblock over the network.

Another sync point is when all moves are searched
except the last one which is being searched by a different host. What to do there ?

Split in the remote so the rest of the cluster can help.

Daniel Shawul · Post by **Daniel Shawul** » Wed Apr 07, 2010 8:55 pm

How can this suddenly be a problem? The exact same thing happens with SMP!

The wait time is the time to complete a single 'splitdepth' search on the PV, so even if it now takes a bit more (because splitdepth is like 8 instead of 6) it's still tiny in the grand scheme of things.

Ok, I am lying a bit: it's that time plus the time to send the splitblock over the network.

It is a problem for me because I belive the normal YBW sync point and the other sync overahead
are major contibutors to performance loss (on top of communication overhead). I admit this all a
result of slow communcitaion which forces me to use large cluster split depths.
That is why I asked about asynchoronus search to begin with (no idle time gaurantees 100% nps scaling).

Split in the remote so the rest of the cluster can help.

It is not as easy as you might think. When master node finishes its share of
work at a split point, it could send the node that is still working to split as you suggested
_BUT_ that node could have finished working and already sent a "MERGE" message for the
owner node (eventhough for the master node it looks like it is still working).
FOR SMP it is easy because every split block is in one place. There you can even be speculative and
work one ply above the current split. For message passing it seems difficult to synchronize
this unless I am missing something ??
BTW worker threads can also become masters so that "still working" node at the split point could have
other workers by itself. But the problem is how to change the master node into a slave to the still working
node.

bob · Post by **bob** » Wed Apr 07, 2010 9:10 pm

Gian-Carlo Pascutto wrote:
bob wrote: The idea is simply flawed. How can you compare a move X1, with score Y1 at depth Z1, with a move X2, with score Y2, at depth Z2? Which is more important, depth or score?
For the love of God not this discussion again! We already went over this extensively.

I completely and utterly disagree with you. The idea is perfectly sound, used successfully in other programs and related algorithms and even inside Crafty itself even though you still fail to understand and admit it.

Highest score wins no matter the depth. If this did not hold, reductions/LMR wouldn't work, but they do.

If there is a problem with this approach, THIS is NOT it.

And that's the end of this part of the discussion for me

then we will just have to agree to disagree. You can't safely compare move(depth) where A(16) and B(17) have different scores. Which do you trust? If you think it works, use it. I experimented with it long ago and tossed it completely, being _cetain_ it is better to focus all the processors on advancing the tree uniformly.

I'd be happy to show you positions where A(16) has a higher score than B(17), yet when you stop and choose A and then ponder it fails to below the score for B. Happens every time you flip-flop on two successive iterations.

Do what you want, but don't try to convince me it is correct.

Daniel Shawul · Post by **Daniel Shawul** » Wed Apr 07, 2010 9:27 pm

Don't "sit". I never have a processor "sitting" in Crafty for any significant amount of time, that processor that would be "sitting" will help others that are not

I also don't "sit" for the SMP part. But with message passing it looks very difficult to synchronize master_to_slave converstion at the particular split point. More on my reply to GCP. Do you have a suggestion on that ?

The idea is simply flawed. How can you compare a move X1, with score Y1 at depth Z1, with a move X2, with score Y2, at depth Z2? Which is more important, depth or score? The deeper score is worse, do you play the shallower move? Could it be it will fail low and be even worse when you get to the depth the other move was searched to? In short, it has been tried (Monty Newborn first did this back in the late 70's) and it is simply a bad idea when compared to doing a more normal search.

How about APHID ? It looks like it has a way to correct such discrepancies coming from depth differences. I read that it was used in Beowulf cluster with comparable speedup as YBW.

Aaron Becker · Post by **Aaron Becker** » Wed Apr 07, 2010 9:33 pm

bob wrote: I'd be happy to show you positions where A(16) has a higher score than B(17), yet when you stop and choose A and then ponder it fails to below the score for B. Happens every time you flip-flop on two successive iterations.

Do what you want, but don't try to convince me it is correct.

Isn't this analogous to comparing the score of two moves searched with nominally equal depth when the extensions and reductions applied to each move are not the same? In any node which doesn't experience a cutoff, when we apply LMR we are implicitly committing ourselves to comparing searches with unequal depths just by looking at the resulting scores. If you're arguing that this is an unprincipled way of searching, then formally I agree, but in practice the results are compelling despite the lack of a rigorous theory to back the technique.

If there is a fundamental difference between GCP's suggested approach and score comparisons between reduced and non-reduced nodes, could you clarify what that difference is? I think that reductions are safer because we only make them when there is reason to believe that the move being reduced is not best, but in principle they seem very similar to me.

asynchronous search

asynchronous search

Re: asynchronous search

Re: asynchronous search

Re: asynchronous search

Re: asynchronous search

Re: asynchronous search

Re: asynchronous search

Re: asynchronous search

Re: asynchronous search

Re: asynchronous search