YBWC: Active Reparenting

mcostalba · Post by **mcostalba** » Tue Apr 10, 2012 7:38 pm

In Young Brothers Wait Concept (YBWC) available slaves are booked by the split point master, then start to search below the assigned split point and, once finished, return in idle state waiting to be booked by another master.

I have pushed a patch that introduces what I have called "Active Reparenting" so that when a slave finishes its job on the assigned split point, instead of passively waiting to be booked, searches a suitable active split point and reparents itself to that split point. Then immediately starts to search below the split point in exactly the same way of the others split point's slaves. This reduces to zero the time waiting in idle loop and _could_ increase scalability especially with many (8 or more) cores, at least this is the idea.

Unfortunately I don't have access to any 8-core machine to test the patch, so the only thing I can do on my QUAD (and I have already started) is to test for regressions. So here we come to the reason of this post: in case someone has an 8-core (or better) computer and is willing to test this patch please drop me a pm. Just to give you an idea I was thinking to test along these lines:

Code: Select all

- Computer fully dedicated for this test, no background activity during the whole test
- At least 8 cores CPU
- At least 1 minute per game
- At least 1000 games
- No pondering

Thanks in advance for any help.
Marco

P.S: Patch is below. It is not difficult to understand for people knowing (read, with hands-on coding experience) the subject:
https://github.com/mcostalba/Stockfish/ ... 9b1d125d72

diep · Post by **diep** » Tue Apr 10, 2012 8:23 pm

CPU's that don't cause a cutoff at this node yet do unsubscribe themselves from this search and basically are idle, i called that cpu's calling AbortMe().

CPU's that cause either a research (PV) or fail high at this node, causing all other cpu's to get aborted, that's what i described as an AbortFailHigh().

CPU's that call AbortMe() yet then are idle because they own the specific 'recursionstack' and which activelly go search THEMSELVE higher in the search tree (so not lower which is what Crafty is doing) and/or at other CPU's stealing work there, that's basically a DTS feature as invented by Bob.
From what i understood - long time ago - in DTS cpu's actively search themselves for a job.

It does give a good speedup. So from algorithmic viewpoint it's a good thing.

Note i'm not a fan of this at modern hardware as it doesn't scale IMHO, yet YMMV there as Diep of course is in a bigger luxury position there as it doesn't have a recursive search at all, so it doesn't have this problem.

It's not clear to me whether you can claim a new name for this. Have to ask Bob careful for that - again with Diep i don't need a solution like this as it's in the first place already having a superior design there.

At the cluster though i need creative stuff to solve it all as it has more limitations of course with remote nodes. So i'm already looking there to memory migration, yet again in a manner that idle cpu's do not search themselves for a job

The guy who really has been busy with that is Bob.

mcostalba wrote:In Young Brothers Wait Concept (YBWC) available slaves are booked by the split point master, then start to search below the assigned split point and, once finished, return in idle state waiting to be booked by another master.

I have pushed a patch that introduces what I have called "Active Reparenting" so that when a slave finishes its job on the assigned split point, instead of passively waiting to be booked, searches a suitable active split point and reparents itself to that split point. Then immediately starts to search below the split point in exactly the same way of the others split point's slaves. This reduces to zero the time waiting in idle loop and _could_ increase scalability especially with many (8 or more) cores, at least this is the idea.

Unfortunately I don't have access to any 8-core machine to test the patch, so the only thing I can do on my QUAD (and I have already started) is to test for regressions. So here we come to the reason of this post: in case someone has an 8-core (or better) computer and is willing to test this patch please drop me a pm. Just to give you an idea I was thinking to test along these lines:
Code: Select all
- Computer fully dedicated for this test, no background activity during the whole test
- At least 8 cores CPU
- At least 1 minute per game
- At least 1000 games
- No pondering
Thanks in advance for any help.
Marco

P.S: Patch is below. It is not difficult to understand for people knowing (read, with hands-on coding experience) the subject:
https://github.com/mcostalba/Stockfish/ ... 9b1d125d72

AlvaroBegue · Post by **AlvaroBegue** » Tue Apr 10, 2012 8:27 pm

Why would you need to run 1000 quick games to test a change to the parallelization? It would be much more informative to take a few (say, 500) test positions spanning a variety of situations and measure how long it takes to finish the search to some depth, probably using a couple of minutes per position.

diep · Post by **diep** » Tue Apr 10, 2012 8:44 pm

AlvaroBegue wrote:Why would you need to run 1000 quick games to test a change to the parallelization? It would be much more informative to take a few (say, 500) test positions spanning a variety of situations and measure how long it takes to finish the search to some depth, probably using a couple of minutes per position.

For Diep i use a 16 core barcelona box with ugly slow latencies from cpu to memory (320 ns+ for 8 bytes random reading). If your SMP works well there it makes a good chance to work at any shared memory hardware

I run then Diep at 16 processors with hashtable size Y (all cores together) and compare that with 1 cpu size Y hashtable.

That determines the speedup. Yet the 16 core run i use 213 positions for 7 minutes. That's exactly 24 hours. Then compare that to 1 core's run, which runs for quite some time a position (usually i give it 1 hour)

With a C program i wrote that parses the logfiles (and gets the biggest finished search depth out of the logfiles) i generate output data that i can throw in excel/open office. You then automatically calculate the scaling and the speedup.

Daniel Shawul · Post by **Daniel Shawul** » Tue Apr 10, 2012 9:11 pm

diep wrote:CPU's that don't cause a cutoff at this node yet do unsubscribe themselves from this search and basically are idle, i called that cpu's calling AbortMe().

CPU's that cause either a research (PV) or fail high at this node, causing all other cpu's to get aborted, that's what i described as an AbortFailHigh().

CPU's that call AbortMe() yet then are idle because they own the specific 'recursionstack' and which activelly go search THEMSELVE higher in the search tree (so not lower which is what Crafty is doing) and/or at other CPU's stealing work there, that's basically a DTS feature as invented by Bob.
From what i understood - long time ago - in DTS cpu's actively search themselves for a job.

It does give a good speedup. So from algorithmic viewpoint it's a good thing.

Note i'm not a fan of this at modern hardware as it doesn't scale IMHO, yet YMMV there as Diep of course is in a bigger luxury position there as it doesn't have a recursive search at all, so it doesn't have this problem.

It's not clear to me whether you can claim a new name for this. Have to ask Bob careful for that - again with Diep i don't need a solution like this as it's in the first place already having a superior design there.

At the cluster though i need creative stuff to solve it all as it has more limitations of course with remote nodes. So i'm already looking there to memory migration, yet again in a manner that idle cpu's do not search themselves for a job

The guy who really has been busy with that is Bob.
mcostalba wrote:In Young Brothers Wait Concept (YBWC) available slaves are booked by the split point master, then start to search below the assigned split point and, once finished, return in idle state waiting to be booked by another master.

I have pushed a patch that introduces what I have called "Active Reparenting" so that when a slave finishes its job on the assigned split point, instead of passively waiting to be booked, searches a suitable active split point and reparents itself to that split point. Then immediately starts to search below the split point in exactly the same way of the others split point's slaves. This reduces to zero the time waiting in idle loop and _could_ increase scalability especially with many (8 or more) cores, at least this is the idea.

Unfortunately I don't have access to any 8-core machine to test the patch, so the only thing I can do on my QUAD (and I have already started) is to test for regressions. So here we come to the reason of this post: in case someone has an 8-core (or better) computer and is willing to test this patch please drop me a pm. Just to give you an idea I was thinking to test along these lines:
Code: Select all
- Computer fully dedicated for this test, no background activity during the whole test
- At least 8 cores CPU
- At least 1 minute per game
- At least 1000 games
- No pondering
Thanks in advance for any help.
Marco

P.S: Patch is below. It is not difficult to understand for people knowing (read, with hands-on coding experience) the subject:
https://github.com/mcostalba/Stockfish/ ... 9b1d125d72

Hello Vincent

Yes DTS keeps a list of active split points from which slaves can assign themselves work. I did not like this approach because I wanted to a decentralized (cluster) search eventually. There I just send an "HELP" message to a randomly selected processor, then that processor can either "CANCEL" or "ACCEPT" it in which case they form master-slave relation. This is straightforward but most SMP tricks like keeping a centralized pool of work are complicated and probably not worth the effort as you pointed out. Also not scalable , it is way harder to keep track of all the work available in a cluster search.

Another feature of DTS you mentioned is that you can back up and do speculative parallel search before search completes at the current node. I didn't try this but maybe it is worth it. YBW generates many split points ( even though at shallower depth ) so the processor will most likely be busy anyway, whether they search for work themselves or get assigned IMO. But the fact that the tree is splitted higher up in the tree should result in lower number of splits.

Yet another feature of DTS that I actually tried is that ownership of node can change (last worker at a node returns the score). This is impossible to do with recursive search. I do this now but I am not sure of what I gain from it if at all.

Anywho most tricks of centralized approaches can't be used in clusters and there is the question of whether it is worth it.

-----
cheers

diep · Post by **diep** » Tue Apr 10, 2012 9:54 pm

Daniel Shawul wrote:
diep wrote:CPU's that don't cause a cutoff at this node yet do unsubscribe themselves from this search and basically are idle, i called that cpu's calling AbortMe().

CPU's that cause either a research (PV) or fail high at this node, causing all other cpu's to get aborted, that's what i described as an AbortFailHigh().

CPU's that call AbortMe() yet then are idle because they own the specific 'recursionstack' and which activelly go search THEMSELVE higher in the search tree (so not lower which is what Crafty is doing) and/or at other CPU's stealing work there, that's basically a DTS feature as invented by Bob.
From what i understood - long time ago - in DTS cpu's actively search themselves for a job.

It does give a good speedup. So from algorithmic viewpoint it's a good thing.

Note i'm not a fan of this at modern hardware as it doesn't scale IMHO, yet YMMV there as Diep of course is in a bigger luxury position there as it doesn't have a recursive search at all, so it doesn't have this problem.

It's not clear to me whether you can claim a new name for this. Have to ask Bob careful for that - again with Diep i don't need a solution like this as it's in the first place already having a superior design there.

At the cluster though i need creative stuff to solve it all as it has more limitations of course with remote nodes. So i'm already looking there to memory migration, yet again in a manner that idle cpu's do not search themselves for a job

The guy who really has been busy with that is Bob.
mcostalba wrote:In Young Brothers Wait Concept (YBWC) available slaves are booked by the split point master, then start to search below the assigned split point and, once finished, return in idle state waiting to be booked by another master.

I have pushed a patch that introduces what I have called "Active Reparenting" so that when a slave finishes its job on the assigned split point, instead of passively waiting to be booked, searches a suitable active split point and reparents itself to that split point. Then immediately starts to search below the split point in exactly the same way of the others split point's slaves. This reduces to zero the time waiting in idle loop and _could_ increase scalability especially with many (8 or more) cores, at least this is the idea.

Unfortunately I don't have access to any 8-core machine to test the patch, so the only thing I can do on my QUAD (and I have already started) is to test for regressions. So here we come to the reason of this post: in case someone has an 8-core (or better) computer and is willing to test this patch please drop me a pm. Just to give you an idea I was thinking to test along these lines:
Code: Select all
- Computer fully dedicated for this test, no background activity during the whole test
- At least 8 cores CPU
- At least 1 minute per game
- At least 1000 games
- No pondering
Thanks in advance for any help.
Marco

P.S: Patch is below. It is not difficult to understand for people knowing (read, with hands-on coding experience) the subject:
https://github.com/mcostalba/Stockfish/ ... 9b1d125d72
Hello Vincent

Yes DTS keeps a list of active split points from which slaves can assign themselves work. I did not like this approach because I wanted to a decentralized (cluster) search eventually. There I just send an "HELP" message to a randomly selected processor, then that processor can either "CANCEL" or "ACCEPT" it in which case they form master-slave relation. This is straightforward but most SMP tricks like keeping a centralized pool of work are complicated and probably not worth the effort as you pointed out. Also not scalable , it is way harder to keep track of all the work available in a cluster search.

Another feature of DTS you mentioned is that you can back up and do speculative parallel search before search completes at the current node. I didn't try this but maybe it is worth it. YBW generates many split points ( even though at shallower depth ) so the processor will most likely be busy anyway, whether they search for work themselves or get assigned IMO. But the fact that the tree is splitted higher up in the tree should result in lower number of splits.

Yet another feature of DTS that I actually tried is that ownership of node can change (last worker at a node returns the score). This is impossible to do with recursive search. I do this now but I am not sure of what I gain from it if at all.

Anywho most tricks of centralized approaches can't be used in clusters and there is the question of whether it is worth it.

-----
cheers

There is no question about it that if your hardware would be having no penalty for your cpu's to bother other cpu's and/or can freely look in remote memory, that the DTS principle of a CPU finding ITSELF a spot where it can split will give a great speedup.

I don't think the most important result is smaller amount of splitpoints. In Diep for example i'm doing massive amounts of splits the first few seconds of the search. Simply to the limit it can handle.

I tried to implement DTS in 1998 and succeeded (with one bugfix in 2000) and up to a cpu or 4 it had a genius speedup, yet right from the start i combined it with YBW, later more on that as it's a crucial thing. Note DTS scaled bad for Diep, but i didn't have an ultra efficient implementation and things were done centralized. Losing about 0.5 from each 4 cpu's back then roughly.

What i do now splits when it is convenient for searching processors and cpu's not active aren't stealing work as that has problems scaling.

Note that in all cases i assumed following the YBW conditions. So if you would equip DTS with YBW basically you can prove that the method of
basically splitting from the searching processors when it is convenient for them, that this performs similar, besides saving out massive waste of internal bandwidth:

Let's count now from the root our plies. So root is 0, after 1 move we are at 1, then 2, 3 plies away from root.

We start with the simple and most important case. Namely start of the search. We start searching with 1 core.

Suppose we are at a depth of n+1 in the PV.
We backtrack to n with processor P0

We can prove now that all other cpu's must be idle, because only 1 cpu can basically return in a YBW concept from a move. If there is idle cpu's now we already split them in at depth n (if it would give a cutoff we would split them in at n-1).

Now we split in at depth n in this model. All moves get parallel distributed at this position P.

I want to drop next lemma now. When following YBW It's very difficult to find a position Q higher in the tree near the root than position P, which holds true that it follows YBW and no parallel search is currently getting carried out in Q.

Basically we can prove that if we already previously visited position Q in our current iteration, that we already had the opportunity to split there. And we had this opportunity while we can prove for the PV walkdown that all other cpu's were idle, and therefore could effectively get put to work.

In other words the situation that there is a position where we might be able to add a cpu occurs basically if in a position we already have a bunch of cores search in parallel AND there is more moves to search in that position Q left.

That's the only situation where DTS would in theory have a benefit, assuming your hardware would allow it, over the 'split in as we happen to touch this position' type strategy.

The bottomline is that basically our parallel search is already going great at the moment that such event would occur.

So DTS doesn't solve some sort of 'worst case problem'. In fact it creates a new worstcase, as it's going to blow all other cpu's bandwidth when just 1 core for a fraction of a second is busy following the YBW property. That doesn't need to be during start of the search only. At this crucial moment that you want to split in other cpu's as quick as possible, which is possible after your single core gets back in a position where it has searched the first move; at that crucial moment with DTS all your machines core are busy hammering and jamming your single core.

So it CAN have an effect that exponentially slows down your YBW search, meanwhile it contributes positively in efficiency (avoiding a few splits) at a moment in time that your parallel search already is going great at a bunch of cores and basically is about to finish an iteration or subbranch.

The benefit versus worstcase is not positive for DTS at modern hardware, when following the YBW property.

p.s. DTS doesn't need to be implemented in a centralized manner, neither does YBW. We just speak about a fraction of a second that the search is busy at 1 core. The trick is to be able to quickly split in of course - i never made a secret out of that.

Diep is not recursive. So basically to split in another cpu is just setting a few pointers in theory. I didn't even ultra optimize it - that cheap it already is.

mcostalba · Post by **mcostalba** » Tue Apr 10, 2012 10:48 pm

diep wrote: In other words the situation that there is a position where we might be able to add a cpu occurs basically if in a position we already have a bunch of cores search in parallel AND there is more moves to search in that position Q left.

Yes, this is the condition I am looking for in my patch. And I also limit the search to the 'oldest' split points that it means, for each thread, master of a set of split points, I check only the split point highest in the tree so to possibly reparent near the root: this should help getting a longer average life time for the slave.

bob · Post by **bob** » Tue Apr 10, 2012 11:39 pm

mcostalba wrote:In Young Brothers Wait Concept (YBWC) available slaves are booked by the split point master, then start to search below the assigned split point and, once finished, return in idle state waiting to be booked by another master.

I have pushed a patch that introduces what I have called "Active Reparenting" so that when a slave finishes its job on the assigned split point, instead of passively waiting to be booked, searches a suitable active split point and reparents itself to that split point. Then immediately starts to search below the split point in exactly the same way of the others split point's slaves. This reduces to zero the time waiting in idle loop and _could_ increase scalability especially with many (8 or more) cores, at least this is the idea.

Unfortunately I don't have access to any 8-core machine to test the patch, so the only thing I can do on my QUAD (and I have already started) is to test for regressions. So here we come to the reason of this post: in case someone has an 8-core (or better) computer and is willing to test this patch please drop me a pm. Just to give you an idea I was thinking to test along these lines:
Code: Select all
- Computer fully dedicated for this test, no background activity during the whole test
- At least 8 cores CPU
- At least 1 minute per game
- At least 1000 games
- No pondering
Thanks in advance for any help.
Marco

P.S: Patch is below. It is not difficult to understand for people knowing (read, with hands-on coding experience) the subject:
https://github.com/mcostalba/Stockfish/ ... 9b1d125d72

There were a few versions of Crafty that did this. I did not find it helpful in terms of strength. With larger numbers of processors it is actually harmful as on a 64 core box, you do not want every processor to go jump in and help somewhere else. You end up with too many at one split point which is not efficient either... If you honor my "max threads per split point" variable, it has a better chance but it still did not search as fast as just simply splitting when possible. And I did the obvious things such as not jumping into an active split point where there are no moves remaining... or just a few moves remaining... etc...

There are a few ugly race conditions to handle as well. One runs out of work to do, finds an active split point, verifies there is something to do, and then does the usual "copy/work" stuff. But it then can split at a point that is no longer active, because between the time when you check to see if the split point is active and there is work to do, and you copy the necessary data to start the parallel search, one of the other processors already searching could have completed everything and cleaned up...

Cray Blitz (DTS) did this far better, because when a processor ran out of work to do, it could create a new split point by itself, and never had to wait for anything... but I did not feel like going back to an interative search and throw out the cleanliness of the recursive negamax.

Daniel Shawul · Post by **Daniel Shawul** » Wed Apr 11, 2012 1:11 am

Thank you for your thorough reply.

There is no question about it that if your hardware would be having no penalty for your cpu's to bother other cpu's and/or can freely look in remote memory, that the DTS principle of a CPU finding ITSELF a spot where it can split will give a great speedup.

I don't think the most important result is smaller amount of splitpoints. In Diep for example i'm doing massive amounts of splits the first few seconds of the search. Simply to the limit it can handle.

I tried to implement DTS in 1998 and succeeded (with one bugfix in 2000) and up to a cpu or 4 it had a genius speedup, yet right from the start i combined it with YBW, later more on that as it's a crucial thing. Note DTS scaled bad for Diep, but i didn't have an ultra efficient implementation and things were done centralized. Losing about 0.5 from each 4 cpu's back then roughly.

What i do now splits when it is convenient for searching processors and cpu's not active aren't stealing work as that has problems scaling.

Note that in all cases i assumed following the YBW conditions. So if you would equip DTS with YBW basically you can prove that the method of
basically splitting from the searching processors when it is convenient for them, that this performs similar, besides saving out massive waste of internal bandwidth:

I agree. Bandwidth is expensive especially when you start using more than 4 threads. The paper also mentions about using a "thrashing counter" before sending a help request.
Splitting at larger depths (fewer splits) help to avoid time wasted by splitting un-necessarily. I know you are talking on a larger scale (256 processor f.i) in which keeping processors busy is more important than selecting the best split point.

Let's count now from the root our plies. So root is 0, after 1 move we are at 1, then 2, 3 plies away from root.

We start with the simple and most important case. Namely start of the search. We start searching with 1 core.

Suppose we are at a depth of n+1 in the PV.
We backtrack to n with processor P0

We can prove now that all other cpu's must be idle, because only 1 cpu can basically return in a YBW concept from a move. If there is idle cpu's now we already split them in at depth n (if it would give a cutoff we would split them in at n-1).

I think some limit the number of processors used at a split point (say max of 8 ). It would be unwise to split with all of them when a few would finish the job quickly. Also with 256 or more processors available some will undoubtedly be idle. The same thing also happens in endgames where wer can have few moves at each ply. But I understand your point.

Now we split in at depth n in this model. All moves get parallel distributed at this position P.

I want to drop next lemma now. When following YBW It's very difficult to find a position Q higher in the tree near the root than position P, which holds true that it follows YBW and no parallel search is currently getting carried out in Q.

Basically we can prove that if we already previously visited position Q in our current iteration, that we already had the opportunity to split there. And we had this opportunity while we can prove for the PV walkdown that all other cpu's were idle, and therefore could effectively get put to work.

I agree. For cases where we had the opportunity to split but not enough processors to search with, slave processors can check for idle processors at the split point
each time they come back to get a move.

In other words the situation that there is a position where we might be able to add a cpu occurs basically if in a position we already have a bunch of cores search in parallel AND there is more moves to search in that position Q left.

That's the only situation where DTS would in theory have a benefit, assuming your hardware would allow it, over the 'split in as we happen to touch this position' type strategy.

The bottomline is that basically our parallel search is already going great at the moment that such event would occur.

So DTS doesn't solve some sort of 'worst case problem'. In fact it creates a new worstcase, as it's going to blow all other cpu's bandwidth when just 1 core for a fraction of a second is busy following the YBW property. That doesn't need to be during start of the search only. At this crucial moment that you want to split in other cpu's as quick as possible, which is possible after your single core gets back in a position where it has searched the first move; at that crucial moment with DTS all your machines core are busy hammering and jamming your single core.

One worst case scenario mentioned in the paper is re-splitting at deeper depths again and again may lead to overall slow down in some hypothetical situations.
Then to avoid this problem, it is suggested to start a speculative parallel search at N-1 once enough moves are searched at ply N that gaurantee the alpha bound.
It could be a waste of work if the bound gets improved once the rest of the moves are searched at ply N. Well this speculation is one thing that I would hope DTS could give
improvement... But one can also do speculative YBW by starting parallel search at ALL nodes without searching any moves. It gave a little improvement fro Feldmann and co.

So it CAN have an effect that exponentially slows down your YBW search, meanwhile it contributes positively in efficiency (avoiding a few splits) at a moment in time that your parallel search already is going great at a bunch of cores and basically is about to finish an iteration or subbranch.

The benefit versus worstcase is not positive for DTS at modern hardware, when following the YBW property.

It is mentioned that DTS was developed on cray (high bandwidth) and that this affected decisions of the algorithm. I am not sure which parts of the algorithm are affected most, but it is probably the work stealing as you suggested.

p.s. DTS doesn't need to be implemented in a centralized manner, neither does YBW. We just speak about a fraction of a second that the search is busy at 1 core. The trick is to be able to quickly split in of course - i never made a secret out of that.

Diep is not recursive. So basically to split in another cpu is just setting a few pointers in theory. I didn't even ultra optimize it - that cheap it already is.

Thanks it really is good to hear of your experience.

Daniel Shawul · Post by **Daniel Shawul** » Wed Apr 11, 2012 1:33 am

I just checked the paper. It says :

To process the HELP command, a simple test was added to the sequential search code. After any node is processed, HELP(i) is checked, and if set, the tree state is copied to shared memory, the flag is cleared, and the search continues normally. This reduces the idle time for a processor to the time required for a processor to search one node and return to the top of Search() and then store the "tree state" for the idle processor to examine. (About 30 microseconds on a Cray C90.) After this time, an idle processor will have at least one tree state to examine for split points, making this reasonably efficient.

A processor which gets a HELP request always does a copy of its state to shared memory. So the use of the Cray is indeed important. It is a bit different than what the OP is doing since that is simply keeping a list of available split points...

YBWC: Active Reparenting

YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting

Re: YBWC: Active Reparenting