asynchronous search

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: asynchronous search

Post by bob »

Aaron Becker wrote:
bob wrote: I'd be happy to show you positions where A(16) has a higher score than B(17), yet when you stop and choose A and then ponder, it fails low to below the score for B. Happens every time you flip-flop on two successive iterations.

Do what you want, but don't try to convince me it is correct.
Isn't this analogous to comparing the score of two moves searched with nominally equal depth when the extensions and reductions applied to each move are not the same? In any node which doesn't experience a cutoff, when we apply LMR we are implicitly committing ourselves to comparing searches with unequal depths just by looking at the resulting scores. If you're arguing that this is an unprincipled way of searching, then formally I agree, but in practice the results are compelling despite the lack of a rigorous theory to back the technique.

If there is a fundamental difference between GCP's suggested approach and score comparisons between reduced and non-reduced nodes, could you clarify what that difference is? I think that reductions are safer because we only make them when there is reason to believe that the move being reduced is not best, but in principle they seem very similar to me.
It is not the same at all.

In normal search, you reduce that which you do not expect to do well. In this form of search, you are randomly searching some moves deeper and some shallower, and then comparing the scores. You will note that you do not play a PV where reductions were done; at least in Crafty's PV, the moves shown add up to the proper number of plies. That's a different issue. Some moves are simply easier to search, and you can go deeper on them if you choose to spend equal effort on each move. But why spend the same amount of effort on moves that are easy to search and dismiss as you spend on the move you are actually going to end up playing?
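
(For illustration only: a rough sketch of the kind of reduction being described, where moves we heuristically expect to be inferior get a cheaper scout search first. The trigger conditions, the reduction amount and the names Position, Move, generate_moves() and search() are placeholders, not Crafty's code.)

Code:

int search_moves(Position &pos, int depth, int alpha, int beta) {
    int move_count = 0;
    for (Move m : generate_moves(pos)) {             // placeholder move generator
        pos.make(m);
        int score;
        bool reduce = move_count >= 4 && depth >= 3 &&
                      m.is_quiet() && !pos.in_check();   // "not expected to do well"
        if (reduce) {
            // reduced null-window search for a move we expect to fail low
            score = -search(pos, depth - 2, -alpha - 1, -alpha);
            if (score > alpha)                            // better than expected:
                score = -search(pos, depth - 1, -beta, -alpha);  // full-depth re-search
        } else {
            score = -search(pos, depth - 1, -beta, -alpha);
        }
        pos.unmake(m);
        ++move_count;
        if (score >= beta) return score;                  // fail-high cutoff
        if (score > alpha) alpha = score;
    }
    return alpha;
}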

This unsynchronized stuff is bogus from many directions. Comparing the scores/depths is problematic. Searching two moves independently and unsynchronized is inefficient because it seriously degrades the alpha/beta process.
Dann Corbit
Posts: 12540
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: asynchronous search

Post by Dann Corbit »

Daniel Shawul wrote:
Don't "sit". I never have a processor "sitting" in Crafty for any significant amount of time, that processor that would be "sitting" will help others that are not
I also don't "sit" for the SMP part. But with message passing it looks very difficult to synchronize master_to_slave converstion at the particular split point. More on my reply to GCP. Do you have a suggestion on that ?
The idea is simply flawed. How can you compare a move X1, with score Y1 at depth Z1, with a move X2, with score Y2, at depth Z2? Which is more important, depth or score? The deeper score is worse, do you play the shallower move? Could it be it will fail low and be even worse when you get to the depth the other move was searched to? In short, it has been tried (Monty Newborn first did this back in the late 70's) and it is simply a bad idea when compared to doing a more normal search.
How about APHID? It looks like it has a way to correct such discrepancies coming from depth differences. I read that it was used on a Beowulf cluster with a speedup comparable to YBW.
Mark Brockington did an APHID version of Crafty. He has an APHID game kit he distributes as well:
http://games.cs.ualberta.ca/aphid/index.html

I still have the game kit (the link at the web site seems dead).
Caveat:
It uses PVM instead of mpich2, so there is no 64-bit version for Windows that I am aware of.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: asynchronous search

Post by Daniel Shawul »

Thanks Dann. Yes, all the links seem dead. If I can find that plug-in kit, maybe I will try it quickly. You know where to send it :) Cheers
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: asynchronous search

Post by bob »

Daniel Shawul wrote:
Don't "sit". I never have a processor "sitting" in Crafty for any significant amount of time, that processor that would be "sitting" will help others that are not
I also don't "sit" for the SMP part. But with message passing it looks very difficult to synchronize master_to_slave converstion at the particular split point. More on my reply to GCP. Do you have a suggestion on that ?
In normal parallel search in Crafty, when a thread goes "idle" it increments a counter, which is seen by all other threads. They can then initiate a split when they reach a point where they have satisfied the YBW criterion at a node. Why not do something similar over the net???
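
(A rough sketch of that mechanism, assuming a threaded search: an atomic counter advertises idle threads, and a busy thread that has already searched one move at a node, i.e. satisfied YBW, checks it before handing out the remaining moves. idle_threads, wait_for_work(), split_and_search(), moves_at(), Position and Move are placeholder names, not Crafty's actual code.)

Code:

#include <atomic>

std::atomic<int> idle_threads{0};     // visible to every search thread

void thread_idle_loop() {
    idle_threads.fetch_add(1);        // advertise availability
    wait_for_work();                  // placeholder: block until attached to a split point
    idle_threads.fetch_sub(1);
}

int search_node(Position &pos, int depth, int alpha, int beta) {
    bool first_move_searched = false;
    for (Move m : moves_at(pos)) {
        if (first_move_searched && idle_threads.load() > 0) {
            // YBW satisfied and someone is idle: split here and share the
            // remaining moves instead of searching them alone.
            return split_and_search(pos, depth, alpha, beta);
        }
        pos.make(m);
        int score = -search_node(pos, depth - 1, -beta, -alpha);
        pos.unmake(m);
        first_move_searched = true;
        if (score >= beta) return score;
        if (score > alpha) alpha = score;
    }
    return alpha;
}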


The idea is simply flawed. How can you compare a move X1, with score Y1 at depth Z1, with a move X2, with score Y2, at depth Z2? Which is more important, depth or score? The deeper score is worse, do you play the shallower move? Could it be it will fail low and be even worse when you get to the depth the other move was searched to? In short, it has been tried (Monty Newborn first did this back in the late 70's) and it is simply a bad idea when compared to doing a more normal search.
How about APHID? It looks like it has a way to correct such discrepancies coming from depth differences. I read that it was used on a Beowulf cluster with a speedup comparable to YBW.
You can't "correct' the discrepancy, you can choose to ignore it, or not cause it. GCP seems attracted to the former, while I prefer the latter. What is the benefit of searching move A to depth 20 with a score of +20, and move B to a depth of 25, with a score of -400? Why waste all that time on B when A needs more work.

The first Rybka cluster approach, of splitting only at the root, had _exactly_ this problem. In one position, Crafty captured Rybka's queen, and there was exactly one safe way to recapture. One processor was reporting that score as something like -0.20 at depth 21 while all the others were kibitzing moves that made no sense (obviously) with scores of -900 and up and depths of 25-28 plies. What good were those and why would one waste all that time searching crap moves with a real alpha/beta window rather than with a null-window?

I don't consider the idea workable. You will have to make up your own mind. But don't just take my advice, or _anyone_ else's, without giving the topic some thought. I'm not exactly new to parallel search and have formed my conclusions from testing 'em all. Our very first version of parallel Cray Blitz did _exactly_ this (split at root, unsynchronized), and it was _not_ very effective. It was better than just using one cpu, but not a lot better. True splitting approaches were not just superior, but _far_ superior.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: asynchronous search

Post by Daniel Shawul »

In normal parallel search in Crafty, when a thread goes "idle" it increments a counter, which is seen by all other threads. They can then initiate a split when they reach a point where they have satisfied the YBW criterion at a node. Why not do something similar over the net???
The distributed approach I am using is completely decentralized, so there is no single processor responsible
for allocating work, keeping info on other nodes, etc. In short, no place for counters whatsoever.
A processor doesn't even know whether another processor is idle or not. All it does is send a "HELP" message and
wait for a "CANCEL" message (when that processor is idle) or a "SPLIT" message (when there is work to do).

You are only thinking of SMP here, so I will repeat what I have mentioned before.
Imagine node A (with 2 threads) and node B working at a split point. Node A and all its threads finish their
share of the work while node B is still working. If node A just goes on working somewhere else, who takes
care of things when node B returns with a "MERGE" message? It means A has to keep that old information and process it
accordingly while it is working somewhere else. This looks very inconvenient to me. You also cannot tell
node B to take care of the split point when it has finished working (I do exactly this in SMP, btw),
because by the time our message reaches node B it could have already finished its share of work
and already sent a "MERGE" message (only we don't know about it yet because the message has not arrived here yet).
I don't consider the idea workable. You will have to make up your own mind. But don't just take my advice, or _anyone_ else's, without giving the topic some thought. I'm not exactly new to parallel search and have formed my conclusions from testing 'em all. Our very first version of parallel Cray Blitz did _exactly_ this (split at root, unsynchronized), and it was _not_ very effective. It was better than just using one cpu, but not a lot better. True splitting approaches were not just superior, but _far_ superior.
Sure, I can decide for myself what is good for me. I asked for other people's experience with it. You gave yours, GCP gave his, so life goes on. Things do get overheated easily in this forum though. Cheers
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: asynchronous search

Post by Gian-Carlo Pascutto »

Daniel Shawul wrote: It is a problem for me because I believe the normal YBW sync point and the other sync overhead
are major contributors to performance loss (on top of communication overhead). I admit this is all a
result of slow communication, which forces me to use large cluster split depths.
I would try to figure out what is wrong here in the normal SMP case first. What you describe is not typically a YBW bottleneck AFAIK, so it surprises me that it is for you.

IIRC, neither Stockfish nor Crafty, which are pure YBW-ers, suffer greatly from large splitdepths. It should only become a problem when splitdepth gets close to searchdepth.

When I say suffer greatly, I mean that performance completely plummets. Of course there will be some minor loss.
That is why I asked about asynchronous search to begin with (no idle time guarantees 100% nps scaling).
Running separate, noncommunicating processes also guarantees 100% nps scaling, but that doesn't make it good :)
It is not as easy as you might think.
Oh, I know exactly how "easy" it is :)

I have to admit that I so far have failed in making my own cluster design simple and elegant.
When the master node finishes its share of
work at a split point, it could tell the node that is still working to split, as you suggested,
_BUT_ that node could have finished working and already sent a "MERGE" message to the
owner node (even though to the master node it looks like it is still working).
What's the problem here? You can protect against the spurious message and ignore it in the slave. This kind of asynchronicity (let's call it "in transit") is quite manageable and won't kill speedups.

If the master gets the result from the slave, it *will* know whether it already sent the HELP to the slave or not, and adjust accordingly.
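
(A minimal sketch of that guard, assuming every split carries an id and the owner keeps a table of splits it has not yet resolved; the types and field names are illustrative, not anyone's actual code.)

Code:

#include <cstdint>
#include <unordered_map>

struct PendingSplit { int best_score; int helpers_outstanding; };
struct MergeMsg     { std::uint64_t split_id; int score; };

std::unordered_map<std::uint64_t, PendingSplit> pending;   // splits this node still owns

void handle_merge(const MergeMsg &m) {
    auto it = pending.find(m.split_id);
    if (it == pending.end())
        return;        // stale "in transit" MERGE for a split already resolved: ignore it
    PendingSplit &sp = it->second;
    if (m.score > sp.best_score) sp.best_score = m.score;
    if (--sp.helpers_outstanding == 0) {
        // all helpers have reported: back the score up into the local search
        // and forget the split point
        pending.erase(it);
    }
}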
bhlangonijr
Posts: 482
Joined: Thu Oct 16, 2008 4:23 am
Location: Milky Way

Re: asynchronous search

Post by bhlangonijr »

Gian-Carlo Pascutto wrote: Highest score wins no matter the depth. If this did not hold, reductions/LMR wouldn't work, but they do.

If there is a problem with this approach, THIS is NOT it.

And that's the end of this part of the discussion for me :)
I know you said this is the end of the discussion - for you. Just want to add some points to your statements:
* We only reduce what we think (heuristically) is not going to affect the outcome of the search, for instance "futile" moves, late moves, etc;
* Conversely, we extend what we think is going to affect the outcome of the search, for instance pawn pushes, king threats, etc;

If the depth comparison doesn't matter, why search deeper? Let's just end the search at the first depth and it's all nice. :)

I suggest a little test:
What if we stop searching in the middle of an iteration in the ID loop and then pick the best move found so far, instead of picking the best move of the prior completed iteration? In complex positions where the PV changes often, this causes my engine to get very bad results in my tests. It proves - at least for me - that it doesn't work to compare results from different depths. Otherwise, getting the best move from an uncompleted iteration in the ID loop would always work...
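
(For reference, a sketch of the conventional iterative-deepening bookkeeping this test is defending: only a completed iteration may overwrite the move that would actually be played. search_root(), SearchResult, Position and Move are placeholder names, and real engines refine this policy in various ways.)

Code:

Move iterative_deepening(Position &pos, int max_depth) {
    Move best{};                                     // move from the last completed depth
    for (int depth = 1; depth <= max_depth; ++depth) {
        SearchResult r = search_root(pos, depth);    // may be cut short by the clock
        if (r.aborted)
            break;                                   // keep the previous iteration's move
        best = r.best_move;                          // commit only fully searched depths
    }
    return best;
}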
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: asynchronous search

Post by Daniel Shawul »

Well, it is a bottleneck from the perspective of asynchronous search, which has none of that.
I am using large split depths for the cluster search (d = 8 - 12) due to huge communication
overhead. For d=12 I see only a few splits in a 5-minute search, and if I don't split enough
it won't scale well. Maybe I am barking up the wrong tree here and the real cause of all this is the slow network.
Synchronization is the price paid for low search overhead. The question is whether this price is paid too early.
Here is what the APHID authors say about it. Not everyone will agree with it, but it was at least
enlightening to have another viewpoint besides the regular serial alpha-beta:
This class of algorithms cannot achieve a linear speedup, primarily due to synchronization overhead;
the search tree may have thousands of synchronization points and there are numerous occasions where
the processes are starved for work. The algorithms have low search overhead, but this is primarily due
to the implementation of a globally shared transposition table.
Oh, I know exactly how "easy" it is :)

I have to admit that I so far have failed in making my own cluster design simple and elegant.

What's the problem here? You can protect against the spurious message and ignore it in the slave. This kind of asynchronicity (let's call it "in transit") is quite manageable and won't kill speedups.

If the master gets the result from the slave, it *will* know whether it already sent the HELP to the slave or not, and adjust accordingly.
I agree it is doable. In fact it is done, but I just can't figure out a simple way to do it.
The problem is that the owner of that node has to keep the old information around while it is searching elsewhere.
Then it has to decide whether to temporarily pause what it is doing now and finish the old split point,
or continue with what it is doing now and go back and finish the split block afterwards.
This is the kind of thing that smells like Italian spaghetti ;)

But anyway, thank you all for sharing your experiences. Now I will have to go back and fix the thing
to get some scaling from it, because in its current state it is just junk code.

cheers
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: asynchronous search

Post by bob »

Daniel Shawul wrote:
In normal parallel search in Crafty, when a thread goes "idle" it increments a counter, which is seen by all other threads. They can then initiate a split when they reach a point where they have satisfied the YBW criterion at a node. Why not do something similar over the net???
The distributed approach I am using is completely decentralized, so there is no single processor responsible
for allocating work, keeping info on other nodes, etc. In short, no place for counters whatsoever.
A processor doesn't even know whether another processor is idle or not. All it does is send a "HELP" message and
wait for a "CANCEL" message (when that processor is idle) or a "SPLIT" message (when there is work to do).
You are talking about design _decisions_, not architectural shortcomings.

When a node goes idle, you could let it broadcast to tell everyone it is idle, and then someone can ask it to "join in" or not. You have to deal with multiple "please join me" requests and accept just one, of course.
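
(A rough sketch of that scheme under the stated assumptions, using plain MPI sends: advertise idleness to every rank, accept the first join request, decline the rest. The tags and the decline handling are placeholders; a real version also has to handle cancellation and offers that arrive after the node is busy again.)

Code:

#include <mpi.h>

enum { TAG_IDLE = 10, TAG_JOIN = 11, TAG_DECLINE = 12 };

int advertise_idle_and_wait(int my_rank, int n_ranks) {
    for (int r = 0; r < n_ranks; ++r)                    // tell everyone we are idle
        if (r != my_rank)
            MPI_Send(&my_rank, 1, MPI_INT, r, TAG_IDLE, MPI_COMM_WORLD);

    MPI_Status st;
    int split_id;                                        // payload of the first offer
    MPI_Recv(&split_id, 1, MPI_INT, MPI_ANY_SOURCE, TAG_JOIN, MPI_COMM_WORLD, &st);
    int owner = st.MPI_SOURCE;                           // accept this offer

    for (;;) {                                           // politely refuse the rest
        int more = 0;
        MPI_Iprobe(MPI_ANY_SOURCE, TAG_JOIN, MPI_COMM_WORLD, &more, &st);
        if (!more) break;
        int other;
        MPI_Recv(&other, 1, MPI_INT, st.MPI_SOURCE, TAG_JOIN, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        MPI_Send(&my_rank, 1, MPI_INT, st.MPI_SOURCE, TAG_DECLINE, MPI_COMM_WORLD);
    }
    return owner;                                        // rank whose split point we join
}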


You are only thinking of SMP here, so I will repeat what I have mentioned before.
Before you go on, I am _not_ "only thinking about SMP here". The topic is clearly about distributed computing. Something that is definitely not new, and certainly not an unknown science.
Imagine node A (with 2 threads) and node B working at a split point. Node A and all its threads finish their
share of the work while node B is still working. If node A just goes on working somewhere else, who takes
care of things when node B returns with a "MERGE" message? It means A has to keep that old information and process it
accordingly while it is working somewhere else. This looks very inconvenient to me. You also cannot tell
node B to take care of the split point when it has finished working (I do exactly this in SMP, btw),
because by the time our message reaches node B it could have already finished its share of work
and already sent a "MERGE" message (only we don't know about it yet because the message has not arrived here yet).
Why is that any different than what I already do in Crafty? Nobody has to "sit" on a node waiting on someone else to finish. If the "sitter" joins in to help someone _below_ that node in the tree, that work will be done before the old split point can be resolved.
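
(One way to express that rule in code, for illustration: the finished master only joins split points that lie below its own unresolved split point in the tree, so that work must complete before the old split point can be resolved. SplitPoint and its parent links are illustrative placeholders.)

Code:

struct SplitPoint { SplitPoint *parent; /* scores, move list, helpers, ... */ };

// Is my_split an ancestor of candidate, i.e. does candidate lie below it?
bool is_below(const SplitPoint *candidate, const SplitPoint *my_split) {
    for (const SplitPoint *sp = candidate; sp != nullptr; sp = sp->parent)
        if (sp == my_split) return true;
    return false;
}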

These are not the kinds of issues I see as problems caused by the interconnect; they are design issues that need fixing before things get set in stone.

The only significant issues are that splitting is slower, the search is less efficient due to the lack of a shared hash table, and that any sort of communication induces latency that should be minimized.

I don't consider the idea workable. You will have to make up your own mind. But don't just take my advice, or _anyone_ else's, without giving the topic some thought. I'm not exactly new to parallel search and have formed my conclusions from testing 'em all. Our very first version of parallel Cray Blitz did _exactly_ this (split at root, unsynchronized), and it was _not_ very effective. It was better than just using one cpu, but not a lot better. True splitting approaches were not just superior, but _far_ superior.
Sure, I can decide for myself what is good for me. I asked for other people's experience with it. You gave yours, GCP gave his, so life goes on. Things do get overheated easily in this forum though. Cheers
Happens all the time, in fact.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: asynchronous search

Post by bob »

Gian-Carlo Pascutto wrote:
Daniel Shawul wrote: It is a problem for me because I believe the normal YBW sync point and the other sync overhead
are major contributors to performance loss (on top of communication overhead). I admit this is all a
result of slow communication, which forces me to use large cluster split depths.
I would try to figure out what is wrong here in the normal SMP case first. What you describe is not typically a YBW bottleneck AFAIK, so it surprises me that it is for you.

IIRC, neither Stockfish nor Crafty, which are pure YBW-ers, suffer greatly from large splitdepths. It should only become a problem when splitdepth gets close to searchdepth.
This is a cost vs payoff issue. The payoff is doing more than one search in parallel; the cost is setting things up so that this can happen. If I start splitting near the tips, nps plummets because the cost of doing the split can become more than the payoff of searching a very tiny sub-tree. I am _still_ working on getting this to a point where it works well on every architecture I test on. I dislike tuning it for dual-core, for quad-core, for 8-core, up to the max I have used on recent hardware, which is 32 cores (Eugene ran and tuned some on 64 cores a long while back, but that was an Itanium box, which is a completely different animal anyway). And at present, the optimal "tuning" for Crafty varies depending on the number of cores, how things are architecturally put together (what is shared, what is local; this applies specifically to cache but also to memory, etc.), how fast the processor and memory are, and so on. I'd like to get rid of that completely.
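
(The cost/payoff gate in its simplest form, for illustration: a split is only attempted when the remaining depth is large enough that the subtree outweighs the fixed cost of setting up the split. The threshold is exactly the number that ends up needing per-machine tuning; the value below is a placeholder, not Crafty's.)

Code:

constexpr int MIN_SPLIT_DEPTH = 5;     // placeholder, machine-dependent in practice

bool split_is_worthwhile(int remaining_depth, int idle_threads) {
    // near the tips the subtree is tiny, so copying the split block, waking
    // helpers and merging results costs more than it saves
    return idle_threads > 0 && remaining_depth >= MIN_SPLIT_DEPTH;
}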


When I say suffer greatly, I mean that performance completely plummets. Of course there will be some minor loss.
That is why I asked about asynchronous search to begin with (no idle time guarantees 100% nps scaling).
Running separate, noncommunicating processes also guarantees 100% nps scaling, but that doesn't make it good :)
It is not as easy as you might think.
Oh, I know exactly how "easy" it is :)

I have to admit that I so far have failed in making my own cluster design simple and elegant.
When the master node finishes its share of
work at a split point, it could tell the node that is still working to split, as you suggested,
_BUT_ that node could have finished working and already sent a "MERGE" message to the
owner node (even though to the master node it looks like it is still working).
What's the problem here? You can protect against the spurious message and ignore it in the slave. This kind of asynchronicity (let's call it "in transit") is quite manageable and won't kill speedups.

If the master gets the result from the slave, it *will* know whether it already sent the HELP to the slave or not, and adjust accordingly.