Daniel Shawul wrote: diep wrote:
CPUs that don't cause a cutoff at this node but unsubscribe themselves from this search and go basically idle: I called those CPUs calling AbortMe().
CPUs that cause either a re-search (PV) or a fail high at this node, causing all other CPUs to get aborted: that's what I described as AbortFailHigh().
CPUs that call AbortMe() yet are then idle because they own the specific 'recursion stack', and which actively go search THEMSELVES for work higher in the search tree (so not lower, which is what Crafty is doing) and/or steal work at other CPUs there: that's basically a DTS feature as invented by Bob.
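A minimal sketch of the two abort situations described above; the names AbortMe and AbortFailHigh follow the post, but the data layout and everything else here is invented for illustration:

```python
# Hypothetical model of a split point and the two abort cases from the post.

class SplitPoint:
    def __init__(self, workers):
        self.workers = set(workers)   # CPUs currently searching below this node
        self.cut_off = False          # set when a fail high aborts everyone

    def abort_me(self, cpu):
        # A CPU that ran out of work here unsubscribes itself and goes idle.
        self.workers.discard(cpu)

    def abort_fail_high(self):
        # A fail high (or PV re-search) at this node aborts all other CPUs.
        self.cut_off = True
        aborted = set(self.workers)
        self.workers.clear()
        return aborted

sp = SplitPoint(workers={0, 1, 2, 3})
sp.abort_me(2)                  # CPU 2 goes idle
stopped = sp.abort_fail_high()  # the remaining CPUs get aborted
```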
From what I understood, a long time ago, in DTS the CPUs actively search for a job themselves.
It does give a good speedup. So from an algorithmic viewpoint it's a good thing.
Note that I'm not a fan of this on modern hardware, as IMHO it doesn't scale, yet YMMV; Diep of course is in a more luxurious position there, as it doesn't have a recursive search at all, so it doesn't have this problem.
It's not clear to me whether you can claim a new name for this. You'd have to ask Bob carefully about that. Again, with Diep I don't need a solution like this, as it already has a superior design there in the first place.
On the cluster, though, I need creative stuff to solve it all, as it of course has more limitations with remote nodes. So I'm already looking there at memory migration, yet again in a manner where idle CPUs do not search for a job themselves.
The guy who really has been busy with that is Bob.
In the Young Brothers Wait Concept (YBWC), available slaves are booked by the split-point master, then start to search below the assigned split point and, once finished, return to an idle state waiting to be booked by another master.
I have pushed a patch that introduces what I have called "Active Reparenting": when a slave finishes its job on the assigned split point, instead of passively waiting to be booked, it searches for a suitable active split point and reparents itself to it. It then immediately starts to search below the split point in exactly the same way as the other slaves of that split point. This reduces the time spent waiting in the idle loop to zero and _could_ increase scalability, especially with many (8 or more) cores; at least, this is the idea.
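The reparenting idea might be sketched roughly like this. This is a toy model, not the actual Stockfish patch; the selection rule (pick the split point with the most remaining moves) and all names are invented for illustration:

```python
# Toy model of "Active Reparenting": a slave that finishes its split point
# joins another active split point instead of idling.

class SplitPoint:
    def __init__(self, name, moves_left):
        self.name = name
        self.moves_left = moves_left  # moves not yet handed out at this node
        self.slaves = set()

def reparent(slave_id, active_split_points):
    # Candidate split points still have unsearched moves to hand out.
    candidates = [sp for sp in active_split_points if sp.moves_left > 0]
    if not candidates:
        return None                      # nothing to help with: stay idle
    # Invented heuristic: join the split point with the most work left.
    target = max(candidates, key=lambda sp: sp.moves_left)
    target.slaves.add(slave_id)          # join as a regular slave
    return target

sps = [SplitPoint("a", 0), SplitPoint("b", 5), SplitPoint("c", 2)]
joined = reparent(7, sps)                # slave 7 reparents itself to "b"
```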
Unfortunately I don't have access to any 8-core machine to test the patch, so the only thing I can do on my quad (and I have already started) is to test for regressions. So here we come to the reason for this post: in case someone has an 8-core (or better) computer and is willing to test this patch, please drop me a PM. Just to give you an idea, I was thinking of testing along these lines:
- Computer fully dedicated for this test, no background activity during the whole test
- At least an 8-core CPU
- At least 1 minute per game
- At least 1000 games
- No pondering
Thanks in advance for any help.
P.S.: The patch is below. It is not difficult to understand for people who know the subject (read: with hands-on coding experience):
https://github.com/mcostalba/Stockfish/ ... 9b1d125d72
Yes, DTS keeps a list of active split points from which slaves can assign themselves work. I did not like this approach because I eventually wanted a decentralized (cluster) search. There I just send a "HELP" message to a randomly selected processor; that processor can then either "CANCEL" or "ACCEPT" it, in which case the two form a master-slave relation. This is straightforward, but most SMP tricks, like keeping a centralized pool of work, are complicated and probably not worth the effort, as you pointed out. They are also not scalable; it is much harder to keep track of all the available work in a cluster search.
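The HELP/CANCEL/ACCEPT exchange described above might look roughly like this in a toy model. The message names are from the post; everything else is an invented illustration:

```python
# Toy model of the decentralized scheme: an idle processor sends HELP to a
# randomly chosen peer, which answers ACCEPT (forming a master-slave pair)
# or CANCEL.
import random

def ask_for_work(idle_proc, peers, has_split_work):
    # Pick a random peer (not ourselves) and "send" it a HELP message.
    peer = random.choice([p for p in peers if p != idle_proc])
    # The peer answers ACCEPT if it has an open split point, else CANCEL.
    if has_split_work(peer):
        return "ACCEPT", peer    # master-slave relation formed
    return "CANCEL", peer        # try another peer later

# Only processor 2 has work to hand out in this example.
reply, peer = ask_for_work(0, peers=[0, 1, 2, 3],
                           has_split_work=lambda p: p == 2)
```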
Another feature of DTS you mentioned is that you can back up and do a speculative parallel search before the search completes at the current node. I didn't try this, but maybe it is worth it. YBW generates many split points (even though at shallower depth), so the processors will most likely be busy anyway, whether they search for work themselves or get it assigned, IMO. But the fact that the tree is split higher up should result in a lower number of splits.
Yet another feature of DTS that I actually tried is that the ownership of a node can change (the last worker at a node returns the score). This is impossible to do with a recursive search. I do this now, but I am not sure what I gain from it, if anything.
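The ownership handover can be sketched like this, assuming the node state lives in a shared structure rather than on the creating CPU's recursion stack (which is why a plain recursive search can't do it). All names here are illustrative:

```python
# Sketch of split-point ownership transfer: the last worker to leave the
# node reports its score, regardless of which CPU created the split point.

class SharedNode:
    def __init__(self, owner, workers):
        self.owner = owner            # CPU that created the split point
        self.workers = set(workers)   # CPUs still searching moves here
        self.best = -10_000

    def worker_done(self, cpu, score):
        self.best = max(self.best, score)
        self.workers.discard(cpu)
        if not self.workers:          # last one out returns the score
            self.owner = cpu          # ownership migrated to the finisher
            return self.best          # backed up to the parent node
        return None                   # node is still being searched

node = SharedNode(owner=0, workers={0, 1})
assert node.worker_done(0, 15) is None  # the original owner leaves first...
result = node.worker_done(1, 30)        # ...so slave 1 finishes the node
```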
Anywho, most tricks of centralized approaches can't be used in clusters, and there is the question of whether it is worth it.
There is no question that if your hardware had no penalty for your CPUs bothering other CPUs and/or could freely look into remote memory, the DTS principle of a CPU finding ITSELF a spot where it can split would give a great speedup.
I don't think the most important result is a smaller number of split points. In Diep, for example, I'm doing massive amounts of splits during the first few seconds of the search, simply up to the limit it can handle.
I tried to implement DTS in 1998 and succeeded (with one bugfix in 2000), and up to 4 CPUs or so it had a genius speedup, yet right from the start I combined it with YBW; more on that later, as it's a crucial thing. Note that DTS scaled badly for Diep, but I didn't have an ultra-efficient implementation and things were done centralized. I was losing roughly 0.5 out of each 4 CPUs back then.
What I do now splits when it is convenient for the searching processors, and CPUs that aren't active don't steal work, as that has scaling problems.
Note that in all cases I assumed following the YBW conditions. So if you equip DTS with YBW, you can basically prove that the method of splitting from the searching processors when it is convenient for them performs similarly, besides saving a massive waste of internal bandwidth:
Let's count our plies from the root. So the root is 0; after 1 move we are at 1, then 2, 3 plies away from the root.
We start with the simplest and most important case, namely the start of the search. We start searching with 1 core.
Suppose we are at depth n+1 in the PV.
We backtrack to n with processor P0.
We can now prove that all other CPUs must be idle, because only 1 CPU can return from a move under the YBW concept. If there were idle CPUs now, we would already have split them in at depth n (if it would give a cutoff, we would have split them in at n-1).
Now we split in at depth n in this model. All moves get distributed in parallel at this position P.
I want to state the following lemma now: when following YBW, it is very difficult to find a position Q, higher in the tree (nearer to the root) than position P, for which it holds that it follows YBW and no parallel search is currently being carried out in Q.
Basically, we can prove that if we already visited position Q earlier in our current iteration, we already had the opportunity to split there. And we had this opportunity while, as we can prove for the PV walkdown, all other CPUs were idle and therefore could effectively have been put to work.
In other words, a position where we might be able to add a CPU basically only occurs if we already have a bunch of cores searching some position Q in parallel AND there are more moves left to search in that position.
That's the only situation where DTS would, in theory, have a benefit over the 'split in as we happen to touch this position' type of strategy, assuming your hardware allows it.
The bottom line is that our parallel search is basically already going great at the moment such an event would occur.
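The YBW condition assumed throughout the argument above, namely that a node may only hand out its remaining moves after its first move has been fully searched, can be sketched as follows (the move and score handling here is invented for illustration):

```python
# Minimal sketch of the Young Brothers Wait condition at a single node:
# the first (eldest brother) move is searched sequentially; only then may
# the remaining (younger brother) moves be distributed to idle CPUs.

def search_node_ybw(moves, search_move, can_split, split):
    best = -10_000
    for i, move in enumerate(moves):
        if i == 1 and can_split():
            # Eldest brother is done: younger brothers may now be
            # handed out to idle CPUs via a split point.
            return split(moves[1:], best)
        best = max(best, search_move(move))
    return best

# With no idle CPUs the node is simply searched sequentially:
score = search_node_ybw([1, 2, 3],
                        search_move=lambda m: m * 10,
                        can_split=lambda: False,
                        split=None)
```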
So DTS doesn't solve some sort of 'worst case problem'. In fact it creates a new worst case, as it's going to blow away the bandwidth of all the other CPUs when just 1 core is, for a fraction of a second, busy following the YBW property. That doesn't only happen at the start of the search. At the crucial moment when you want to split in other CPUs as quickly as possible, which becomes possible once your single core gets back to a position where it has searched the first move, at that crucial moment with DTS all the cores of your machine are busy hammering and jamming your single core.
So it CAN have an effect that exponentially slows down your YBW search, while it contributes positively to efficiency (avoiding a few splits) at a moment in time when your parallel search is already going great on a bunch of cores and is basically about to finish an iteration or subbranch.
The benefit versus the worst case is not positive for DTS on modern hardware when following the YBW property.
P.S. DTS doesn't need to be implemented in a centralized manner, and neither does YBW. We are just talking about a fraction of a second during which the search is busy on 1 core. The trick is of course being able to split in quickly; I never made a secret of that.
Diep is not recursive, so splitting in another CPU is in theory just setting a few pointers. I didn't even ultra-optimize it; it's already that cheap.