DTS Structure

Onno Garms · Post by **Onno Garms** » Mon May 31, 2010 7:00 pm

bob wrote:With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in."

That is just your (and maybe most other people's) YBW implementation. The thesis by Feldmann describes YBW the other way round: Idle threads ask for work. And with good implementations of YBWC it is well possible to split close to the root.

mcostalba · Post by **mcostalba** » Mon May 31, 2010 7:10 pm

bob wrote:
zamar wrote:
bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems
This is not quite true. With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in." In DTS the opposite was done, at least in my implementation. The _idle_ threads simply collected state from everyone else and chose which to "join". And it could do the "join" without the "joinee" even realizing this had happened. That is an efficiency improvement since working threads don't do the splitting, the idle threads do.

1) "everyone doing a real parallel search"

I realized today that we are not doing a real parallel search

2) No, at least in SF. Joona made that post because we have just finishing testing a modification in how SMP code works. Namely when a thread finished its work instead of sitting idle waiting for some else thread to pick it up, does a look up across currenlty active split points and picks the best (the highest depth one) , then joins the group without other threads do stop the search. So it is very efficient. But after a deep test on a QUAD machine we got no ELO gain from this patch that we discarded because it was simply adding complexity for no gain.

In the amazing case you would like to _read_ the patch and not just comment on that we can post it.

Nils Magnusson · Post by **Nils Magnusson** » Mon May 31, 2010 8:07 pm

Hi Marco

Perhaps it would be a good idea to post such discarted patches (which ended up testing equal or near equal) together with a small discription of the idea behind it somewhere (perhaps a git). That way anyone can retest or modify and retest to see if they can improve the idea. It could also serve as a place where people can post their patches and test results. Would be glad to set up a git for this purpose.

Nils

Greg Strong · Post by **Greg Strong** » Mon May 31, 2010 9:41 pm

Nils Magnusson wrote:Hi Marco

Perhaps it would be a good idea to post such discarted patches (which ended up testing equal or near equal) together with a small discription of the idea behind it somewhere (perhaps a git). That way anyone can retest or modify and retest to see if they can improve the idea. It could also serve as a place where people can post their patches and test results. Would be glad to set up a git for this purpose.

Nils

I think that's a great idea. Sometimes these engine ideas come back around. This one, for example, might generate an ELO gain with a much larger number of cores.

bob · Post by **bob** » Mon May 31, 2010 11:24 pm

Edmund wrote:
bob wrote:
Edmund wrote:
bob wrote:
Edmund wrote:
bob wrote:
zamar wrote:
bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems
This is not quite true. With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in." In DTS the opposite was done, at least in my implementation. The _idle_ threads simply collected state from everyone else and chose which to "join". And it could do the "join" without the "joinee" even realizing this had happened. That is an efficiency improvement since working threads don't do the splitting, the idle threads do.

However, DTS _can_ be pretty good because you don't just have to take the current node as a split point once YBW is satisfied, you can split _anywhere_ in the tree, with any active thread, without regard to the depth or tree order. Which means you have a much better chance of choosing a good split point (ALL node). And you can pick one as close to the root as possible as well, which can further help as the moves there are better ordered.

But it is definitely messy...
if you store all the data in a global struct why couldn't you let the slaves do the search for good split points in the YBW framework? What is different to DTS in that respect?

How are you going to handle the "helpful master" approach if you do that? The master has to do the splitting so that it has a back door to help others while retaining the call stack for this split point so that it can return back through it when all work is completed.

The only difference I really see is that in DTS master and slave can swap places and in YBW they can't. But the selection of the split points should work in a similar manner.
Think about how you are going to deal with the call stack that you can't modify, and how you are going to get back to this point if other processors just "jump in" at points in the tree that could be above or below this one. It's an extremely complex approach. Anything _can_ be done of course. But I would not try to undertake something 100x more complex than basic iterated search using DTS.

Here's a sample problem. Thread A recursively calls itself to get down to ply 15. When another thread goes idle, and wants to join this thread at ply=6, how can that thread modify my call stack so that as I return I don't just return thru the split point without any synchronization to verify that all work has been completed? This is a major complexity issue.
In my effort to DTS I used the following approach:
I kept my movelists global for anyone to access; each move entry consists of 2 parts: 2byte from/to/promotion, 2byte score. Whenever the score is < 32000 it means the move still has to be searched; 32767 the move has already been searched; 32766-n the move is currently searched with the task number "n". (Note I say task number and not thread number because that task was not statically fixed to a certain thread, anyway in the structure of the task there would be a variable pointing at the thread id of the one currently working on the root node of the task) So an idle thread could walk through the movelists of a working thread and change the score value of a certain move where it wants to split. The master thread which finds itself on a node still waiting on a slave will then immediately know which thread to help, by looking at the score of all moves with a score from 32001 to 32766.
How does that solve the problem of returning? With recursive search, the "master" (or owner in DTS-speak) has to return thru the split point as things are backed up, with DTS this is not the case and _anyone_ can back up through a split point, which is really the last thread that is working at the split point.

Using your approach, nobody else can back up still, which is an issue. This really is designed around an iterated search, and was, from the get-go.
These are two issues:
a) any thread can back up the node
b) split-point on any node possible

DTS: a) and b) are possible.
YBW: only b) is possible - a) isn't

In my last post I just wanted to provide a solution of how to do b) in YBW and not how to solve a)

Yes, but it doesn't solve the complete problem. A has to do the "backing up". Which means that A can't jump over to a different split point (which might be at the root even) because it may take too long to get back to the current split point to do the backing up. With DTS there is _never_ any issue. With YBW we only split _deeper_ than the current position, which completely solves this, since deeper means it is always a part of the current "sub-tree".

bob · Post by **bob** » Mon May 31, 2010 11:25 pm

Onno Garms wrote:
bob wrote:With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in."
That is just your (and maybe most other people's) YBW implementation. The thesis by Feldmann describes YBW the other way round: Idle threads ask for work. And with good implementations of YBWC it is well possible to split close to the root.

Feldman's search was distributed. Different issue.

I don't follow your last statement since I split at the root as well as everywhere else, except for getting too close to the tips.

bob · Post by **bob** » Mon May 31, 2010 11:28 pm

mcostalba wrote:
bob wrote:
zamar wrote:
bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems
This is not quite true. With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in." In DTS the opposite was done, at least in my implementation. The _idle_ threads simply collected state from everyone else and chose which to "join". And it could do the "join" without the "joinee" even realizing this had happened. That is an efficiency improvement since working threads don't do the splitting, the idle threads do.
1) "everyone doing a real parallel search"

I realized today that we are not doing a real parallel search

2) No, at least in SF. Joona made that post because we have just finishing testing a modification in how SMP code works. Namely when a thread finished its work instead of sitting idle waiting for some else thread to pick it up, does a look up across currenlty active split points and picks the best (the highest depth one) , then joins the group without other threads do stop the search. So it is very efficient. But after a deep test on a QUAD machine we got no ELO gain from this patch that we discarded because it was simply adding complexity for no gain.

In the amazing case you would like to _read_ the patch and not just comment on that we can post it.

A) You take yourselves way too seriously. Not _every_ comment someone makes is directed toward you.

B) as for the "amazing case" just keep your snotty attitude to yourself, I know enough about how to do an efficient parallel search without looking at other code...

bob · Post by **bob** » Mon May 31, 2010 11:30 pm

Onno Garms wrote:
bob wrote: This is not a YBW issue, it is an implementation issue.
Ah, I see. I didn't know that there are implementations without this.

I use pure YBW in Crafty, And I don't have the issue you mentioned above at all. Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
What happens when all slaves of split S are ready (all children searched) but the master is still busy with something else? Will it continue its task before it returns the result of S to its parent?

Simple. The search can split anywhere _below_ the lowest active split point. When the master is busy, any idle threads will join in to help it complete whatever it is working on, always at a deeper ply.

mcostalba · Post by **mcostalba** » Tue Jun 01, 2010 6:57 am

bob wrote: B) as for the "amazing case" just keep your snotty attitude to yourself, I know enough about how to do an efficient parallel search without looking at other code...

Joona made a comment, you said he was wrong, I replied that we have tested some code that verify what Joona said, you are saying that you don't care because you already know everything.....who's the snotty one here ?

P.S: I already knew you had answered in that way (I don't know if this is snotty or not), you missed an occasion to surprise me. Please don't reply that you don't care to surprise me....otherwise you'd miss a second one

bob · Post by **bob** » Tue Jun 01, 2010 4:01 pm

mcostalba wrote:
bob wrote: B) as for the "amazing case" just keep your snotty attitude to yourself, I know enough about how to do an efficient parallel search without looking at other code...
Joona made a comment, you said he was wrong, I replied that we have tested some code that verify what Joona said, you are saying that you don't care because you already know everything.....who's the snotty one here ?

P.S: I already knew you had answered in that way (I don't know if this is snotty or not), you missed an occasion to surprise me. Please don't reply that you don't care to surprise me....otherwise you'd miss a second one

If you will simply reread what he wrote, and my follow up, I clearly said "this is not quite true" and explained _exactly_ why this is the case. As far as what you tested, I have no idea. I have tested my code, but it does require long games to test parallel search, as it gets better as the depth increases. How you tested this I don't know. It _does_ work. It might be 10%. It might be less. Depending on how effective your implementation is. 10% _does_ make a difference. Not a big one, and one that takes many games to measure. But it does help.

DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure