ChessUSA.com TalkChess.com
Hosted by Your Move Chess & Games
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

YBWC: Active Reparenting
Post new topic    TalkChess.com Forum Index -> Computer Chess Club: Programming and Technical Discussions Flat
View previous topic :: View next topic  
Author Message
Robert Hyatt



Joined: 27 Feb 2006
Posts: 15816
Location: Birmingham, AL

PostPost subject: Re: YBWC: Active Reparenting    Posted: Tue Apr 17, 2012 2:54 pm Reply to topic Reply with quote

diep wrote:
bob wrote:
Daniel Shawul wrote:
Thank you for your thorough reply.
Quote:

There is no question about it that if your hardware would be having no penalty for your cpu's to bother other cpu's and/or can freely look in remote memory, that the DTS principle of a CPU finding ITSELF a spot where it can split will give a great speedup.

I don't think the most important result is smaller amount of splitpoints. In Diep for example i'm doing massive amounts of splits the first few seconds of the search. Simply to the limit it can handle.

I tried to implement DTS in 1998 and succeeded (with one bugfix in 2000) and up to a cpu or 4 it had a genius speedup, yet right from the start i combined it with YBW, later more on that as it's a crucial thing. Note DTS scaled bad for Diep, but i didn't have an ultra efficient implementation and things were done centralized. Losing about 0.5 from each 4 cpu's back then roughly.

What i do now splits when it is convenient for searching processors and cpu's not active aren't stealing work as that has problems scaling.

Note that in all cases i assumed following the YBW conditions. So if you would equip DTS with YBW basically you can prove that the method of
basically splitting from the searching processors when it is convenient for them, that this performs similar, besides saving out massive waste of internal bandwidth:

I agree. Bandwidth is expensive especially when you start using more than 4 threads. The paper also mentions about using a "thrashing counter" before sending a help request.
Splitting at larger depths (fewer splits) help to avoid time wasted by splitting un-necessarily. I know you are talking on a larger scale (256 processor f.i) in which keeping processors busy is more important than selecting the best split point.


I would NEVER agree with that. If you select a bad split point, you might as well not split as you get zero gain for searching nodes that don't need to be searched. The more nodes you have, the more important selecting a good split point becomes, or your speedup flattens out very quickly.

Quote:

Quote:

Let's count now from the root our plies. So root is 0, after 1 move we are at 1, then 2, 3 plies away from root.

We start with the simple and most important case. Namely start of the search. We start searching with 1 core.

Suppose we are at a depth of n+1 in the PV.
We backtrack to n with processor P0

We can prove now that all other cpu's must be idle, because only 1 cpu can basically return in a YBW concept from a move. If there is idle cpu's now we already split them in at depth n (if it would give a cutoff we would split them in at n-1).

I think some limit the number of processors used at a split point (say max of 8 ). It would be unwise to split with all of them when a few would finish the job quickly. Also with 256 or more processors available some will undoubtedly be idle. The same thing also happens in endgames where wer can have few moves at each ply. But I understand your point.
Quote:

Now we split in at depth n in this model. All moves get parallel distributed at this position P.

I want to drop next lemma now. When following YBW It's very difficult to find a position Q higher in the tree near the root than position P, which holds true that it follows YBW and no parallel search is currently getting carried out in Q.


It is actually more common than you think. Suppose you split at the root, and now you have two threads searching there. You can still split in either of the two simultaneous sub-trees you are searching if other threads become idle. And you can re-split, and re-split. I posted some data the other day where on an 8 core box, in normal games, I saw over 100 active split points max. That is at one instant in time. Since I don't split in the q-search at all, and since I limit myself to 4 threads per split point, max, there is a LOT of re-splitting going on. And for over 100 split blocks to be active, there must be a large number of different split points scattered around the tree...


Quote:


Basically we can prove that if we already previously visited position Q in our current iteration, that we already had the opportunity to split there.


You might have had the opportunity, but not the ability. No idle threads. Where after searching a couple of other moves there, a thread working somewhere else finally finishes and now you can YBW split there at any time.


[quote\ And we had this opportunity while we can prove for the PV walkdown that all other cpu's were idle, and therefore could effectively get put to work.

I agree. For cases where we had the opportunity to split but not enough processors to search with, slave processors can check for idle processors at the split point
each time they come back to get a move.

Quote:

In other words the situation that there is a position where we might be able to add a cpu occurs basically if in a position we already have a bunch of cores search in parallel AND there is more moves to search in that position Q left.


Correct as it was implemented in Cray Blitz and Crafty.

Quote:


That's the only situation where DTS would in theory have a benefit, assuming your hardware would allow it, over the 'split in as we happen to touch this position' type strategy.



Your statement is way too general. The benefit of DTS is that the instant a thread goes idle, you do not have to split at the current node being searched in any thread, assuming the YBW criteria has been satisfied. With DTS you can split anywhere, no matter where any thread is currently searching. First thing that solves is zero idle time, because you do not have to wait for an active thread to reach a node where YBW criteria has been satisfied, you can split at other nodes, any other node in fact, where either the YBW criteria has been satisfied, or if you find none, you can split at one where you believe it will be satisfied, speculatively.

Quote:


The bottomline is that basically our parallel search is already going great at the moment that such event would occur.

So DTS doesn't solve some sort of 'worst case problem'. In fact it creates a new worstcase, as it's going to blow all other cpu's bandwidth when just 1 core for a fraction of a second is busy following the YBW property. That doesn't need to be during start of the search only. At this crucial moment that you want to split in other cpu's as quick as possible, which is possible after your single core gets back in a position where it has searched the first move; at that crucial moment with DTS all your machines core are busy hammering and jamming your single core.


How? This is an implementation issue, not a necessary property of DTS...


Quote:


One worst case scenario mentioned in the paper is re-splitting at deeper depths again and again may lead to overall slow down in some hypothetical situations.
Then to avoid this problem, it is suggested to start a speculative parallel search at N-1 once enough moves are searched at ply N that gaurantee the alpha bound.
It could be a waste of work if the bound gets improved once the rest of the moves are searched at ply N. Well this speculation is one thing that I would hope DTS could give
improvement... But one can also do speculative YBW by starting parallel search at ALL nodes without searching any moves. It gave a little improvement fro Feldmann and co.

Quote:

So it CAN have an effect that exponentially slows down your YBW search, meanwhile it contributes positively in efficiency (avoiding a few splits) at a moment in time that your parallel search already is going great at a bunch of cores and basically is about to finish an iteration or subbranch.

The benefit versus worstcase is not positive for DTS at modern hardware, when following the YBW property.


I think you have developed some sort of basic misunderstanding about DTS. The basic principle IS "YBW". That is / was the primary criteria for choosing a split point. But DTS is better than pure YBW because in pure YBW, you can only split at a node where some instance of the search is currently about to choose the next move to search and it notices that (a) a thread is idle and (b) the YBW criteria has been satisfied. With DTS you can split at ANY node in ANY active tree that is being searched. Most likely you ALWAYS find a YBW-satisfying node to split. If not, THEN, and only then, does DTS suggest "speculation" rather than sitting idle.

Quote:


It is mentioned that DTS was developed on cray (high bandwidth) and that this affected decisions of the algorithm. I am not sure which parts of the algorithm are affected most, but it is probably the work stealing as you suggested.

Quote:

p.s. DTS doesn't need to be implemented in a centralized manner, neither does YBW. We just speak about a fraction of a second that the search is busy at 1 core. The trick is to be able to quickly split in of course - i never made a secret out of that.

Diep is not recursive. So basically to split in another cpu is just setting a few pointers in theory. I didn't even ultra optimize it - that cheap it already is.

Thanks it really is good to hear of your experience.


Bob discussing DTS is not interesting here. Basically we are on 2 tracks here.

Daniel was thinking of GPUs when writing it - other architecture. You should see it in that light. AMD 7970 gpu's start at 2048 cores. Nvidia 680 starts at 1536 cores.

Soon both 'big brothers' will release. That's double the number of cores. For Nvidia it will be easiest to release it.

It's not easy to have gpu cores pick their own splitpoints. Everything goes massive parallel there.

As for DTS. When asked exactly 10 years ago how you splitted, as you said you didn't follow YBW with DTS back then. You answerred that you used some chessknowledge to guess whether to split.

Furthermore you copied 64KB for each splitpoint (the stack).

Now that might not have been a problem at Cray Smile

It sure was a worst case of Crafty some years ago which copied a kilobyte or 3. Nalimov then hacked in your code until it got a lot cheaper. I'd suggest you ship Nalimov a 'thank you' for that, as it took care crafty could scale to a lot more cores Smile

He really got it more efficient that crafty SMP code.

Now i don't want to get into a math discussion too much. But if i say the word 'prove' that means a mathematical proof. If i say "Proof by empirical finding", that isn't a 100% proof. If i say : "it will", that's not any of the 2.

Assuming the YBW property we can prove a few things.

We can PROVE that if you have generated splitpoints closer to the root in DTS, that when following the YBW property we already visited that position before and had our CHANCE to split there.

we can PROVE that if this position happens to be in a PV node that we should have splitted already in that position (of course an implementation can spoil it).

We can prove that if this situation occured from a state where just 1 core was left searching that when returning to a position P, with an average there of 40 semi legal moves, that we could already split there. Usually that means we had room for a CPU or 40. Note in Diep i limit it to 32 cpu's a splitpoint max.

So we can PROVE that in 99.9% of the cases we already did do genius splitting.

now i don't know how your datastructure works, but in Diep it is the case that if a bunch of processors split in at position P, that these cores will first use up all moves of P prior to calling AbortMe().

So we can prove that we already have searched quite efficient the vaste majority of the search tree with that 'random splitting'. Giving an exact % is difficult there, but it'll be around a 95%+.

Now it happens to be the case he has a bug in his search algorithm in StockFish, namely for some programming reason they selfintroduced SF can't call AbortMe().

I know a zillion ways how to fix that, but then designing a new datastructure to again split in cpu's that own this position P is not a very clever thing to do from SMP viewpoint, IMHO.

Better spend your effort that you can really abort yourself there and have the others search further.

We can SHOW that not doing this, can create a worst case in which you nonstop keep splitting in at small search depths - which is what you refer to earlier on. Yet that happens because of an implementation shortcoming in StockFish - not because it is a problem in a proper YBW implementation.

Now i'm fully aware you know all this, as in 1998 that's exactly what you told me - so i fixed that for Diep in 1998 Smile

The radical solution for Diep back then was to make the program non-recursive, but that doesn't need to be the solution. Also in a recursive program you can easily build this.

Now all this said a few words on DTS. As i posted before at shared memory systems with 1 or 2 sockets, it's not a problem to still implement a DTS in the manner how i solved it back in 1999 (note i fixed a few nasty bugs around october 2000 in implementation).

This should scale ok at 1 or 2 sockets.

Latencies of 4 socket machines is a lot tougher to deal with and also the amount of cores suddenly then is far above 16 making it a lot tougher to get it working well - yet maybe you'll get away there a tad as well.

But back to proving.

We can prove that what some call 'random splitting' in fact optimizes itself very quickly towards the root. It's moving there very quickly.

We can make it plausible that the 'random splitting' needs to do less work to get say a cpu or 32 to work in YBW.

If we have our alfa = -inf, beta = +inf PVS window and get in position P,
then YBW will search its first move with 1 core p0.

If p0 returns in P, then with the direct splitting i get from the idle list in 1 dang 31 cpu's. It directly unlocks the idle list. It ships then the cpu's their 'start' command in P and a move to search and that's basically it.

Now DTS at the same big box.
Let's do simple about the first thing. p0 describes its position P and puts it in the splitpoint list as splittable.

The hardware will very cheap then update to each individual cpu that's idle an invalidate and the cpu's load that. Still we are busy dirt cheap.

some dozens of cpu's overwhelm each other then at this splitpoint as they all at nearly the same atomic time find this splitpoint. They all try to lock it. Just 1 of them gets the spinlock.

Again - this is a big machine not a tiny one.

Spinlock usually works a LITTLE DIFFERENT there.

It will do a few attempts and then the KERNEL will kick all those cpu's from the 'runqueue' and force them to idle. So a forced idle by kernel.

About 10 miliseconds later another one is allowed to unidle and get the split.

In meantime of course p0 already has searched the entire tree below its move and also wants to lock into this splitpoint, but it also gets overwhelmed by all the idle cpu's that try to get in.

The other cpu's are all fighting still for this splitpoint and will suffer another 10 milliseconds penalty. It will take real long until p0 can lock in.

So some splitpoint that should've taken maybe a 100 microseconds, it effectively takes some dozens of milliseconds to search.

Slowdown garantueed of factor 100.

Soon that exponential explodes and the program searches at a 1000 nodes a second or so a core effectively.

The real problem of this 'slowdown of a garantueed factor 100' is that it is already happening when basically 1 or 2 cores search a position P. So at the very start of the iteration it has the biggest impact, and at the very start of an iteration you 'd prefer quickly putting to work all cores...

Now i gave the extreme example of what factual happens at bigger machines. Also at tinier machines you don't want all that fuzz around 1 splitpoint as p0 also has to update it. It just won't be able to quickly lock back in. Idle cpu's are so eager to split in, and they all detected that splitpoint, so they all try to lock in and they all manage to lock in and make your life a mess.

It's what i call a worst case at a criticial spot of your search, namely start of the iteration and each time that just 1 core is searching, this behaviour is there that exponential will slow down your search.

You don't notice this problem much at a machine with fast latencies to the RAM, but it will be a nightmare at bigger machines.[/quote]

I wasn't thinking GPU at all because the original post was not about a massively-parallel GPU search but was, instead, about the stockfish search.

GPUs have enough bottlenecks that I have simply not been that interested, other than in casually keeping up with what is being done. But they have a very defined target, and are being optimized toward that target, as opposed to true general-purpose computing.
Back to top
View user's profile Send private message
Display posts from previous:   
Subject Author Date/Time
YBWC: Active Reparenting Marco Costalba Tue Apr 10, 2012 5:38 pm
      Re: YBWC: Active Reparenting Vincent Diepeveen Tue Apr 10, 2012 6:23 pm
            Re: YBWC: Active Reparenting Daniel Shawul Tue Apr 10, 2012 7:11 pm
                  Re: YBWC: Active Reparenting Vincent Diepeveen Tue Apr 10, 2012 7:54 pm
                        Re: YBWC: Active Reparenting Marco Costalba Tue Apr 10, 2012 8:48 pm
                              Re: YBWC: Active Reparenting Vincent Diepeveen Wed Apr 11, 2012 10:09 am
                                    Re: YBWC: Active Reparenting Robert Hyatt Mon Apr 16, 2012 6:53 pm
                        Re: YBWC: Active Reparenting Daniel Shawul Tue Apr 10, 2012 11:11 pm
                              Re: YBWC: Active Reparenting Daniel Shawul Tue Apr 10, 2012 11:33 pm
                              Re: YBWC: Active Reparenting Robert Hyatt Mon Apr 16, 2012 7:07 pm
                                    Re: YBWC: Active Reparenting Vincent Diepeveen Mon Apr 16, 2012 8:55 pm
                                          Re: YBWC: Active Reparenting Robert Hyatt Tue Apr 17, 2012 2:54 pm
                                                Re: YBWC: Active Reparenting Daniel Shawul Tue Apr 17, 2012 5:42 pm
                                                      Re: YBWC: Active Reparenting Robert Hyatt Tue Apr 17, 2012 8:01 pm
                                                            Re: YBWC: Active Reparenting Daniel Shawul Tue Apr 17, 2012 9:34 pm
                                                                  Re: YBWC: Active Reparenting Robert Hyatt Tue Apr 17, 2012 9:46 pm
                                                                        Re: YBWC: Active Reparenting Daniel Shawul Tue Apr 17, 2012 10:30 pm
                                                                              Re: YBWC: Active Reparenting Vincent Diepeveen Tue Apr 17, 2012 11:43 pm
                                                                                    Re: YBWC: Active Reparenting Daniel Shawul Wed Apr 18, 2012 1:05 am
                                                                                          Re: YBWC: Active Reparenting Vincent Diepeveen Wed Apr 18, 2012 11:24 am
                                                                                          Re: YBWC: Active Reparenting Daniel Shawul Wed Apr 18, 2012 1:01 pm
                                                                              Re: YBWC: Active Reparenting Robert Hyatt Wed Apr 18, 2012 7:34 pm
                                                                                    Re: YBWC: Active Reparenting Daniel Shawul Wed Apr 18, 2012 11:18 pm
                                                                                          Re: YBWC: Active Reparenting Vincent Diepeveen Thu Apr 19, 2012 7:01 am
                                                                                          Re: YBWC: Active Reparenting Daniel Shawul Thu Apr 19, 2012 12:46 pm
                                                                                          Re: YBWC: Active Reparenting Vincent Diepeveen Thu Apr 19, 2012 2:40 pm
                                                                                          Re: YBWC: Active Reparenting Daniel Shawul Thu Apr 19, 2012 9:32 pm
                                                                                          Re: YBWC: Active Reparenting Robert Hyatt Thu Apr 19, 2012 7:08 pm
                                                                                          Re: YBWC: Active Reparenting Daniel Shawul Thu Apr 19, 2012 9:37 pm
                                                                                          Re: YBWC: Active Reparenting Robert Hyatt Thu Apr 19, 2012 9:48 pm
                                                                                          Re: YBWC: Active Reparenting Robert Hyatt Thu Apr 19, 2012 6:59 pm
                                                                                          Re: YBWC: Active Reparenting Robert Hyatt Thu Apr 19, 2012 5:30 pm
                                                                        Re: YBWC: Active Reparenting Vincent Diepeveen Tue Apr 17, 2012 11:29 pm
                                                                              Re: YBWC: Active Reparenting Daniel Shawul Wed Apr 18, 2012 12:58 am
                                                                  Re: YBWC: Active Reparenting Vincent Diepeveen Tue Apr 17, 2012 11:18 pm
                                    Re: YBWC: Active Reparenting Daniel Shawul Tue Apr 17, 2012 5:37 pm
      Re: YBWC: Active Reparenting Álvaro Begué Tue Apr 10, 2012 6:27 pm
            Re: YBWC: Active Reparenting Vincent Diepeveen Tue Apr 10, 2012 6:44 pm
      Re: YBWC: Active Reparenting Robert Hyatt Tue Apr 10, 2012 9:39 pm
            Re: YBWC: Active Reparenting Marco Costalba Wed Apr 11, 2012 6:06 am
                  Re: YBWC: Active Reparenting Robert Hyatt Thu Apr 12, 2012 1:20 am
                        Re: YBWC: Active Reparenting Vincent Diepeveen Sun Apr 15, 2012 10:04 am
                              Re: YBWC: Active Reparenting Marco Costalba Sun Apr 15, 2012 10:22 am
                                    Re: YBWC: Active Reparenting Vincent Diepeveen Sun Apr 15, 2012 2:39 pm
                                          Re: YBWC: Active Reparenting Marco Costalba Mon Apr 16, 2012 5:29 am
                                                Re: YBWC: Active Reparenting Vincent Diepeveen Mon Apr 16, 2012 5:54 pm
                                                      Re: YBWC: Active Reparenting Marco Costalba Tue Apr 17, 2012 11:30 am
                                                Re: YBWC: Active Reparenting Marco Costalba Wed Apr 18, 2012 6:31 am
      Re: YBWC: Active Reparenting Rein Halbersma Wed Apr 11, 2012 7:23 am
            Re: YBWC: Active Reparenting Vincent Diepeveen Wed Apr 11, 2012 9:07 am
            Re: YBWC: Active Reparenting Vincent Diepeveen Wed Apr 11, 2012 9:31 am
            Re: YBWC: Active Reparenting Daniel Shawul Wed Apr 11, 2012 12:14 pm
                  Re: YBWC: Active Reparenting Vincent Diepeveen Wed Apr 11, 2012 1:54 pm
                  Re: YBWC: Active Reparenting Robert Hyatt Thu Apr 12, 2012 1:28 am
                        Re: YBWC: Active Reparenting Daniel Shawul Tue Apr 17, 2012 5:49 pm
                              Re: YBWC: Active Reparenting Vincent Diepeveen Tue Apr 17, 2012 11:10 pm
                                    Re: YBWC: Active Reparenting Daniel Shawul Wed Apr 18, 2012 12:50 am
                              Re: YBWC: Active Reparenting Robert Hyatt Wed Apr 18, 2012 7:38 pm
                                    Re: YBWC: Active Reparenting Daniel Shawul Wed Apr 18, 2012 11:30 pm
                                          Re: YBWC: Active Reparenting Vincent Diepeveen Thu Apr 19, 2012 7:08 am
                                                Re: YBWC: Active Reparenting Daniel Shawul Thu Apr 19, 2012 12:15 pm
                                                      Re: YBWC: Active Reparenting Vincent Diepeveen Thu Apr 19, 2012 1:53 pm
                                          Re: YBWC: Active Reparenting Robert Hyatt Thu Apr 19, 2012 7:11 pm
      Re: GPUs Srdja Matovic Mon Jun 04, 2012 5:08 pm
Post new topic    TalkChess.com Forum Index -> Computer Chess Club: Programming and Technical Discussions

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum




Powered by phpBB © 2001, 2005 phpBB Group
Enhanced with Moby Threads