DTS Structure

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
Onno Garms
Posts: 224
Joined: Mon Mar 12, 2007 7:31 pm
Location: Bonn, Germany

Re: DTS Structure

Post by Onno Garms »

bob wrote:
Onno Garms wrote:
bob wrote: The problem is that recursive search means you can split "here" or not at all, because of the call stack.
ACK, but that is a problem for both DTS and YBWC.
No, that is what we are talking about when we mention an iterated search vs a recursive search.
Of course. I don't see any contraction between your comments (which I fully acknowledge) and my original comments. So I will go through them to post another reply.

Apart from this one (on "master only helps slaves, does not pick an arbitrary split point"):
First, how is this true?
Of course that is not true automatically, we have to implement that.
And second, why would you want to restrict where an idle processor can join in?
Sure, it is better not to restrict as in DTS. I never doubted that. But restricting is better then accepting the problem that Edmund saw with the YBWC: If we don't restrict, after all slaves at a split point have finished, would have to wait for the owner of the parent to become available again to continue the work.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: DTS Structure

Post by bob »

Onno Garms wrote:
bob wrote:
Onno Garms wrote:
bob wrote: The problem is that recursive search means you can split "here" or not at all, because of the call stack.
ACK, but that is a problem for both DTS and YBWC.
No, that is what we are talking about when we mention an iterated search vs a recursive search.
Of course. I don't see any contraction between your comments (which I fully acknowledge) and my original comments. So I will go through them to post another reply.

Apart from this one (on "master only helps slaves, does not pick an arbitrary split point"):
First, how is this true?
Of course that is not true automatically, we have to implement that.
And second, why would you want to restrict where an idle processor can join in?
Sure, it is better not to restrict as in DTS. I never doubted that. But restricting is better then accepting the problem that Edmund saw with the YBWC: If we don't restrict, after all slaves at a split point have finished, would have to wait for the owner of the parent to become available again to continue the work.
This is not a YBW issue, it is an implementation issue. The only specification for YBW is that you never split at any node until you have completed searching at least one branch out of that node to convince yourself that this is not likely going to be a CUT node. I use pure YBW in Crafty, And I don't have the issue you mentioned above at all. Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
zamar
Posts: 613
Joined: Sun Jan 18, 2009 7:03 am

Re: DTS Structure

Post by zamar »

bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems ;)
Joona Kiiski
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: DTS Structure

Post by bob »

zamar wrote:
bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems ;)
This is not quite true. With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in." In DTS the opposite was done, at least in my implementation. The _idle_ threads simply collected state from everyone else and chose which to "join". And it could do the "join" without the "joinee" even realizing this had happened. That is an efficiency improvement since working threads don't do the splitting, the idle threads do.

However, DTS _can_ be pretty good because you don't just have to take the current node as a split point once YBW is satisfied, you can split _anywhere_ in the tree, with any active thread, without regard to the depth or tree order. Which means you have a much better chance of choosing a good split point (ALL node). And you can pick one as close to the root as possible as well, which can further help as the moves there are better ordered.

But it is definitely messy...
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: DTS Structure

Post by Edmund »

bob wrote:
zamar wrote:
bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems ;)
This is not quite true. With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in." In DTS the opposite was done, at least in my implementation. The _idle_ threads simply collected state from everyone else and chose which to "join". And it could do the "join" without the "joinee" even realizing this had happened. That is an efficiency improvement since working threads don't do the splitting, the idle threads do.

However, DTS _can_ be pretty good because you don't just have to take the current node as a split point once YBW is satisfied, you can split _anywhere_ in the tree, with any active thread, without regard to the depth or tree order. Which means you have a much better chance of choosing a good split point (ALL node). And you can pick one as close to the root as possible as well, which can further help as the moves there are better ordered.

But it is definitely messy...
if you store all the data in a global struct why couldn't you let the slaves do the search for good split points in the YBW framework? What is different to DTS in that respect?

The only difference I really see is that in DTS master and slave can swap places and in YBW they can't. But the selection of the split points should work in a similar manner.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: DTS Structure

Post by bob »

Edmund wrote:
bob wrote:
zamar wrote:
bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems ;)
This is not quite true. With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in." In DTS the opposite was done, at least in my implementation. The _idle_ threads simply collected state from everyone else and chose which to "join". And it could do the "join" without the "joinee" even realizing this had happened. That is an efficiency improvement since working threads don't do the splitting, the idle threads do.

However, DTS _can_ be pretty good because you don't just have to take the current node as a split point once YBW is satisfied, you can split _anywhere_ in the tree, with any active thread, without regard to the depth or tree order. Which means you have a much better chance of choosing a good split point (ALL node). And you can pick one as close to the root as possible as well, which can further help as the moves there are better ordered.

But it is definitely messy...
if you store all the data in a global struct why couldn't you let the slaves do the search for good split points in the YBW framework? What is different to DTS in that respect?

How are you going to handle the "helpful master" approach if you do that? The master has to do the splitting so that it has a back door to help others while retaining the call stack for this split point so that it can return back through it when all work is completed.

The only difference I really see is that in DTS master and slave can swap places and in YBW they can't. But the selection of the split points should work in a similar manner.
Think about how you are going to deal with the call stack that you can't modify, and how you are going to get back to this point if other processors just "jump in" at points in the tree that could be above or below this one. It's an extremely complex approach. Anything _can_ be done of course. But I would not try to undertake something 100x more complex than basic iterated search using DTS.

Here's a sample problem. Thread A recursively calls itself to get down to ply 15. When another thread goes idle, and wants to join this thread at ply=6, how can that thread modify my call stack so that as I return I don't just return thru the split point without any synchronization to verify that all work has been completed? This is a major complexity issue.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: DTS Structure

Post by Edmund »

bob wrote:
Edmund wrote:
bob wrote:
zamar wrote:
bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems ;)
This is not quite true. With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in." In DTS the opposite was done, at least in my implementation. The _idle_ threads simply collected state from everyone else and chose which to "join". And it could do the "join" without the "joinee" even realizing this had happened. That is an efficiency improvement since working threads don't do the splitting, the idle threads do.

However, DTS _can_ be pretty good because you don't just have to take the current node as a split point once YBW is satisfied, you can split _anywhere_ in the tree, with any active thread, without regard to the depth or tree order. Which means you have a much better chance of choosing a good split point (ALL node). And you can pick one as close to the root as possible as well, which can further help as the moves there are better ordered.

But it is definitely messy...
if you store all the data in a global struct why couldn't you let the slaves do the search for good split points in the YBW framework? What is different to DTS in that respect?

How are you going to handle the "helpful master" approach if you do that? The master has to do the splitting so that it has a back door to help others while retaining the call stack for this split point so that it can return back through it when all work is completed.

The only difference I really see is that in DTS master and slave can swap places and in YBW they can't. But the selection of the split points should work in a similar manner.
Think about how you are going to deal with the call stack that you can't modify, and how you are going to get back to this point if other processors just "jump in" at points in the tree that could be above or below this one. It's an extremely complex approach. Anything _can_ be done of course. But I would not try to undertake something 100x more complex than basic iterated search using DTS.

Here's a sample problem. Thread A recursively calls itself to get down to ply 15. When another thread goes idle, and wants to join this thread at ply=6, how can that thread modify my call stack so that as I return I don't just return thru the split point without any synchronization to verify that all work has been completed? This is a major complexity issue.
In my effort to DTS I used the following approach:
I kept my movelists global for anyone to access; each move entry consists of 2 parts: 2byte from/to/promotion, 2byte score. Whenever the score is < 32000 it means the move still has to be searched; 32767 the move has already been searched; 32766-n the move is currently searched with the task number "n". (Note I say task number and not thread number because that task was not statically fixed to a certain thread, anyway in the structure of the task there would be a variable pointing at the thread id of the one currently working on the root node of the task) So an idle thread could walk through the movelists of a working thread and change the score value of a certain move where it wants to split. The master thread which finds itself on a node still waiting on a slave will then immediately know which thread to help, by looking at the score of all moves with a score from 32001 to 32766.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: DTS Structure

Post by bob »

Edmund wrote:
bob wrote:
Edmund wrote:
bob wrote:
zamar wrote:
bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems ;)
This is not quite true. With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in." In DTS the opposite was done, at least in my implementation. The _idle_ threads simply collected state from everyone else and chose which to "join". And it could do the "join" without the "joinee" even realizing this had happened. That is an efficiency improvement since working threads don't do the splitting, the idle threads do.

However, DTS _can_ be pretty good because you don't just have to take the current node as a split point once YBW is satisfied, you can split _anywhere_ in the tree, with any active thread, without regard to the depth or tree order. Which means you have a much better chance of choosing a good split point (ALL node). And you can pick one as close to the root as possible as well, which can further help as the moves there are better ordered.

But it is definitely messy...
if you store all the data in a global struct why couldn't you let the slaves do the search for good split points in the YBW framework? What is different to DTS in that respect?

How are you going to handle the "helpful master" approach if you do that? The master has to do the splitting so that it has a back door to help others while retaining the call stack for this split point so that it can return back through it when all work is completed.

The only difference I really see is that in DTS master and slave can swap places and in YBW they can't. But the selection of the split points should work in a similar manner.
Think about how you are going to deal with the call stack that you can't modify, and how you are going to get back to this point if other processors just "jump in" at points in the tree that could be above or below this one. It's an extremely complex approach. Anything _can_ be done of course. But I would not try to undertake something 100x more complex than basic iterated search using DTS.

Here's a sample problem. Thread A recursively calls itself to get down to ply 15. When another thread goes idle, and wants to join this thread at ply=6, how can that thread modify my call stack so that as I return I don't just return thru the split point without any synchronization to verify that all work has been completed? This is a major complexity issue.
In my effort to DTS I used the following approach:
I kept my movelists global for anyone to access; each move entry consists of 2 parts: 2byte from/to/promotion, 2byte score. Whenever the score is < 32000 it means the move still has to be searched; 32767 the move has already been searched; 32766-n the move is currently searched with the task number "n". (Note I say task number and not thread number because that task was not statically fixed to a certain thread, anyway in the structure of the task there would be a variable pointing at the thread id of the one currently working on the root node of the task) So an idle thread could walk through the movelists of a working thread and change the score value of a certain move where it wants to split. The master thread which finds itself on a node still waiting on a slave will then immediately know which thread to help, by looking at the score of all moves with a score from 32001 to 32766.
How does that solve the problem of returning? With recursive search, the "master" (or owner in DTS-speak) has to return thru the split point as things are backed up, with DTS this is not the case and _anyone_ can back up through a split point, which is really the last thread that is working at the split point.

Using your approach, nobody else can back up still, which is an issue. This really is designed around an iterated search, and was, from the get-go.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: DTS Structure

Post by Edmund »

bob wrote:
Edmund wrote:
bob wrote:
Edmund wrote:
bob wrote:
zamar wrote:
bob wrote: Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
And when helpful master concept is used, the time master thread ends up being idle is so small that it is almost irrelevant. The real problem is the fact that sometimes split happens at node which fails high and then a lot of work is simply wasted. This is the only area where DTS can _possibly_ outperform YBW. Anyway the expected gain is so small that I'm not going to try this in Stockfish until most people have at least 16 core systems ;)
This is not quite true. With current "crafty-like" YBW algorithms (and it seems everyone doing a real parallel search uses the same idea) a thread that notices idle threads stops and does some work to "invite them in." In DTS the opposite was done, at least in my implementation. The _idle_ threads simply collected state from everyone else and chose which to "join". And it could do the "join" without the "joinee" even realizing this had happened. That is an efficiency improvement since working threads don't do the splitting, the idle threads do.

However, DTS _can_ be pretty good because you don't just have to take the current node as a split point once YBW is satisfied, you can split _anywhere_ in the tree, with any active thread, without regard to the depth or tree order. Which means you have a much better chance of choosing a good split point (ALL node). And you can pick one as close to the root as possible as well, which can further help as the moves there are better ordered.

But it is definitely messy...
if you store all the data in a global struct why couldn't you let the slaves do the search for good split points in the YBW framework? What is different to DTS in that respect?

How are you going to handle the "helpful master" approach if you do that? The master has to do the splitting so that it has a back door to help others while retaining the call stack for this split point so that it can return back through it when all work is completed.

The only difference I really see is that in DTS master and slave can swap places and in YBW they can't. But the selection of the split points should work in a similar manner.
Think about how you are going to deal with the call stack that you can't modify, and how you are going to get back to this point if other processors just "jump in" at points in the tree that could be above or below this one. It's an extremely complex approach. Anything _can_ be done of course. But I would not try to undertake something 100x more complex than basic iterated search using DTS.

Here's a sample problem. Thread A recursively calls itself to get down to ply 15. When another thread goes idle, and wants to join this thread at ply=6, how can that thread modify my call stack so that as I return I don't just return thru the split point without any synchronization to verify that all work has been completed? This is a major complexity issue.
In my effort to DTS I used the following approach:
I kept my movelists global for anyone to access; each move entry consists of 2 parts: 2byte from/to/promotion, 2byte score. Whenever the score is < 32000 it means the move still has to be searched; 32767 the move has already been searched; 32766-n the move is currently searched with the task number "n". (Note I say task number and not thread number because that task was not statically fixed to a certain thread, anyway in the structure of the task there would be a variable pointing at the thread id of the one currently working on the root node of the task) So an idle thread could walk through the movelists of a working thread and change the score value of a certain move where it wants to split. The master thread which finds itself on a node still waiting on a slave will then immediately know which thread to help, by looking at the score of all moves with a score from 32001 to 32766.
How does that solve the problem of returning? With recursive search, the "master" (or owner in DTS-speak) has to return thru the split point as things are backed up, with DTS this is not the case and _anyone_ can back up through a split point, which is really the last thread that is working at the split point.

Using your approach, nobody else can back up still, which is an issue. This really is designed around an iterated search, and was, from the get-go.
These are two issues:
a) any thread can back up the node
b) split-point on any node possible

DTS: a) and b) are possible.
YBW: only b) is possible - a) isn't

In my last post I just wanted to provide a solution of how to do b) in YBW and not how to solve a)
User avatar
Onno Garms
Posts: 224
Joined: Mon Mar 12, 2007 7:31 pm
Location: Bonn, Germany

Re: DTS Structure

Post by Onno Garms »

bob wrote: This is not a YBW issue, it is an implementation issue.
Ah, I see. I didn't know that there are implementations without this.
I use pure YBW in Crafty, And I don't have the issue you mentioned above at all. Yes the original "owner" has to be the one to back up at that node once it has been completed by either itself or "slaves". But that master can go off and work at other points in the tree also, and not sit around holding things up...
What happens when all slaves of split S are ready (all children searched) but the master is still busy with something else? Will it continue its task before it returns the result of S to its parent?