Explanation for non-expert?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Explanation for non-expert?

Post by lucasart »

bob wrote:
lucasart wrote:
bob wrote:
lucasart wrote:
bob wrote: I didn't understand it because spLevel was not in the Stockfish6 code I looked at. That change is not going to be worth 50 Elo or so however. But I will let YOU figure out why. And yes, if you look at my old DTS paper this kind of thing was factored in. There's very little new stuff in normal PVS/YBW type parallel searching.
If you've understood all this since the 80's, how come Crafty scales so badly from 8 to 16 cores?
It doesn't on my machine. It depends on the box. The PC is not exactly the perfect platform. A 16-core machine typically has two chips, although it could be a single chip now. 16 cores on a single chip, with a single path to memory, has a problem.

Additionally, I chose to write Crafty in a simpler way than the DTS algorithm used in Cray Blitz. I may well go back and rewrite it that way one day, because DTS is significantly better. It just hasn't been a high priority. But the issues for SMP scaling are not new. SMP search has been around for almost 40 years now... Alpha/Beta hasn't changed.

On the 12-core box I use occasionally (dual 6 core chips) the scaling is almost perfect. 5.5M NPS on one core, 60M NPS on 12 cores. You do have to know what you are doing when testing, and make certain turbo is disabled, or the answers are useless. Most miss that.
Nobody cares about NPS scaling. It's the wrong measure. Searching more nodes is not a goal in itself. Winning more games is. It's ELO we care about. Here's the evidence that there's a big problem with Crafty's scaling from 8 to 16 cores:
http://www.talkchess.com/forum/viewtopi ... ght=crafty
Stick to topics you understand.

If your NPS scaling is bad, EVERYTHING parallel is bad. If your NPS scaling is good, then you have a chance to do decently on the usual SMP speedup measurements. But if your NPS is 1/2 of what it should be, your speedup is guaranteed to be 1/2 of what it could be as well.

Here's one 16-core run I made about a month ago that shows that for this specific 2x8-core box, scaling was within reason and can be improved a bit:

Code:

        1-core  =  nps= 6.1M     ---
        4-cores =  nps=23.7M    4.0x
        8-cores =  nps=45.0M    7.4x
       16-cores =  nps=86.2M    14.1x
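For readers following along, the arithmetic behind that table (and behind the "half the NPS means half the speedup" argument above) is just a ratio to the single-core baseline. A minimal sketch; `npsSpeedup` is an illustrative name, not from any engine's code:

```cpp
// Minimal sketch of the NPS-scaling arithmetic above: speedup is nodes per
// second relative to the single-core baseline, so a factor lost in NPS
// scaling caps the achievable parallel speedup by the same factor.
double npsSpeedup(double npsN, double nps1) {
    return npsN / nps1;   // e.g. 86.2e6 / 6.1e6 on the 16-core run above
}
```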
Those are not that bad. Different machines behave differently and require some tuning. If that represents a "big problem" I will keep it. Every time I move to something new it takes some tweaking... ho hum.
Did you look at the link I sent? I will repost it here for people who are too lazy to click on it:
[image: chart of Crafty's Elo as a function of core count]
Going from 8 to 16 cores, Crafty loses 50 elo. But it almost doubles NPS, so I guess it's fine by your standard :lol:

And you keep trying to lecture everyone about Cray Blitz or Crafty's way of doing SMP, as if it were the reference of correctness? A few years ago, you lectured Marco when he came up with the "Active Reparenting" patch. Later, Joona retried it (thanks to fishtest we could test with many cores), and it was a huge success. Now Joona has improved it dramatically, gaining almost 50 Elo on 16 cores, and you try to lecture him that his reasoning is flawed and his idea cannot work. Nobody in the SF team cares about your opinion (or anyone's opinion). We only care about reality, which means ELO. And when reality disagrees with the great Professor Hyatt, it means that the great Professor Hyatt is wrong. PERIOD.

Even Cheng's lazy SMP scales better than Crafty on 16 cores (using the correct measure, which is ELO). Doesn't that tell you that your DTS algorithm (or the way you implemented it) has some issues? Or do you prefer to bury your head in the sand and keep believing that the world is flat?

Do you ever question yourself? Or do you prefer to question reality (= ELO) when it doesn't agree with you?
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

All well and good. I can't answer what is going on on THAT box. I can answer that MY 16 core tests look perfectly normal. I can't/won't say anything else about things well beyond my control.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: Explanation for non-expert?

Post by petero2 »

bob wrote:
petero2 wrote:
zamar wrote:
#2 is also not quite so simple. Better to have 8 threads at a sure ALL node than having 4 at that node and the other 4 somewhere else. One condition I didn't see being checked was one of the conditions I used in the DTS algorithm for Cray Blitz, which was "work left to be done at a split point." Even in Crafty I have been measuring "futile splits" which can happen in multiple ways.
Well, there is no such thing as a sure ALL node, but in principle I agree. An ideal splitting algorithm would take the node type and the number of already-searched moves into account, as this affects the probability of a cutoff.
Texel already does that. It collects cut off statistics during search and uses the information to estimate the probability that a move at a potential split point would be searched by a corresponding single threaded search. It also collects "overhead" statistics during search, which estimates how much time a thread needs to spend on starting/stopping the search, as a function of the depth of the subtree to search. For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
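Peter's efficiency estimate, quoted above, can be sketched in a few lines. This is an illustrative reconstruction from the description, not Texel's actual code; `splitEfficiency` and `searchProbs` are made-up names:

```cpp
#include <vector>

// efficiency = p1 * p2 * ... * pN * (1 - overhead), where p1..pN are the
// estimated probabilities that the moves at the current and all parent
// split points would be searched by an equivalent single-threaded search,
// and overhead is the estimated fraction of time spent starting/stopping
// the parallel search for a subtree of the given depth.
double splitEfficiency(const std::vector<double>& searchProbs, double overhead) {
    double eff = 1.0 - overhead;          // fraction of time doing useful work
    for (double p : searchProbs)          // current and all parent split points
        eff *= p;
    return eff;
}
```

The splitting code would evaluate this for each candidate move at each potential split point and pick the move with the largest value.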
I looked at what you were doing. One thing you might not have noticed was that in DTS there is no "master" thread. Since this was an iterative rather than recursive search, ANY thread could complete the search at a node and back up to the previous ply. When any thread hits a fail high, it returns to the previous ply while stopping any siblings working there since that work is unnecessary.
I did notice that and in my old algorithm description I wrote this:
petero2 wrote:If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the helper thread fails high, the master thread still finishes searching move one before it finds the fail high result for move two in the transposition table and propagates the fail high result up in the search tree. In this case there is an opportunity for the parallel search to search fewer nodes than the serial search would search, by immediately aborting the master thread. I don't know if the DTS algorithm handles this case but I would guess that it does.
I did not mean to imply that DTS has a master thread. I meant that I guessed DTS handled this case because it doesn't have a master thread. I also had an idea how to implement something similar in the context of my algorithm:
petero2 wrote:It should be possible to fix the first disadvantage above by having the master thread periodically check its SplitPoint objects to see if there has been a beta cutoff (or alpha increase). If this is the case, an exception could be thrown and caught at the correct level in the recursion. There the search result from the helper thread could be picked up and returned.
I have made some measurements and in my engine a helper thread fails high in around 0.3% - 1.5% of the searches. A value of around 0.4% - 0.5% is the most common. Even though this seems like a small value, the sub-tree that the master thread can avoid searching when the fail high happens could be quite big, so I decided to implement this idea.

A helper thread sets a flag when it fails high. There is one flag per thread and the master threads check their flag at the top of the negaScout function. This way the master threads don't have to scan their SplitPoint objects until a fail high has actually happened.

In the current implementation I don't handle the "alpha increase" case at PV nodes. This could potentially be handled by interrupting the master thread and restarting it with the updated alpha value. It is unclear if that would be a win though. It could be faster to continue searching with the old alpha value if the master thread is almost finished searching its sub-tree.

Thanks to C++ exceptions and RAII the implementation is quite straightforward. Thanks to "the table approach" (see TR18015.pdf section 5.4.1.2) to implement exception handling, there is no code slowdown when exceptions are not thrown.
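A rough sketch of that mechanism, with illustrative names (these are not Texel's actual identifiers): a helper sets a per-master-thread flag when it fails high, the master checks the flag cheaply at the top of negaScout, and an exception unwinds the recursion back to the split point that cut off.

```cpp
#include <atomic>

// Thrown by the master thread to unwind to the split point that failed high.
struct FailHighException {
    int splitLevel;   // recursion level of the split point that cut off
};

struct ThreadState {
    std::atomic<bool> helperFailedHigh{false}; // set by a helper thread
    std::atomic<int>  failHighLevel{-1};       // which split point cut off
};

int negaScout(ThreadState& ts, int alpha, int beta, int depth) {
    // Cheap per-node check: only when a helper has actually failed high does
    // the master need to work out which of its SplitPoint objects cut off.
    if (ts.helperFailedHigh.load(std::memory_order_acquire))
        throw FailHighException{ts.failHighLevel.load()};
    // ... normal recursive search; the exception is caught at the recursion
    // level of the failing split point, where the helper's result is picked
    // up and returned (RAII releases locks etc. during unwinding) ...
    (void)alpha; (void)beta; (void)depth;
    return 0;
}
```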

I played 4000 games against the previous texel development version using 4 cores and time control 120s + 1s/move. The result was +5.8 elo, LOS=92.4%. Not quite as statistically significant as I would like, but I have no reason to believe this change does not work, so I plan to keep it.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Explanation for non-expert?

Post by bob »

petero2 wrote:
bob wrote:
petero2 wrote:
zamar wrote:
#2 is also not quite so simple. Better to have 8 threads at a sure ALL node than having 4 at that node and the other 4 somewhere else. One condition I didn't see being checked was one of the conditions I used in the DTS algorithm for Cray Blitz, which was "work left to be done at a split point." Even in Crafty I have been measuring "futile splits" which can happen in multiple ways.
Well, there is no such thing as a sure ALL node, but in principle I agree. An ideal splitting algorithm would take the node type and the number of already-searched moves into account, as this affects the probability of a cutoff.
Texel already does that. It collects cut off statistics during search and uses the information to estimate the probability that a move at a potential split point would be searched by a corresponding single threaded search. It also collects "overhead" statistics during search, which estimates how much time a thread needs to spend on starting/stopping the search, as a function of the depth of the subtree to search. For a given move at a potential split point it then computes the expected "efficiency", defined as p1 * p2 * ... * pN * (1 - overhead), where p1, ..., pN are the estimated probabilities that the moves for the current and all "parent" split points need to be searched. Finally it picks the move with the largest estimated efficiency.

An early version of the algorithm is described here.
I looked at what you were doing. One thing you might not have noticed was that in DTS there is no "master" thread. Since this was an iterative rather than recursive search, ANY thread could complete the search at a node and back up to the previous ply. When any thread hits a fail high, it returns to the previous ply while stopping any siblings working there since that work is unnecessary.
I did notice that and in my old algorithm description I wrote this:
petero2 wrote:If the master thread is searching move one at a SplitPoint, and a helper thread is searching move two, and the helper thread fails high, the master thread still finishes searching move one before it finds the fail high result for move two in the transposition table and propagates the fail high result up in the search tree. In this case there is an opportunity for the parallel search to search fewer nodes than the serial search would search, by immediately aborting the master thread. I don't know if the DTS algorithm handles this case but I would guess that it does.
I did not mean to imply that DTS has a master thread. I meant that I guessed DTS handled this case because it doesn't have a master thread. I also had an idea how to implement something similar in the context of my algorithm:
petero2 wrote:It should be possible to fix the first disadvantage above by having the master thread periodically check its SplitPoint objects to see if there has been a beta cutoff (or alpha increase). If this is the case, an exception could be thrown and caught at the correct level in the recursion. There the search result from the helper thread could be picked up and returned.
I have made some measurements and in my engine a helper thread fails high in around 0.3% - 1.5% of the searches. A value of around 0.4% - 0.5% is the most common. Even though this seems like a small value, the sub-tree that the master thread can avoid searching when the fail high happens could be quite big, so I decided to implement this idea.

A helper thread sets a flag when it fails high. There is one flag per thread and the master threads check their flag at the top of the negaScout function. This way the master threads don't have to scan their SplitPoint objects until a fail high has actually happened.

In the current implementation I don't handle the "alpha increase" case at PV nodes. This could potentially be handled by interrupting the master thread and restarting it with the updated alpha value. It is unclear if that would be a win though. It could be faster to continue searching with the old alpha value if the master thread is almost finished searching its sub-tree.

Thanks to C++ exceptions and RAII the implementation is quite straightforward. Thanks to "the table approach" (see TR18015.pdf section 5.4.1.2) to implement exception handling, there is no code slowdown when exceptions are not thrown.

I played 4000 games against the previous texel development version using 4 cores and time control 120s + 1s/move. The result was +5.8 elo, LOS=92.4%. Not quite as statistically significant as I would like, but I have no reason to believe this change does not work, so I plan to keep it.
An alternative is to set such a flag in the shared split block for that thread. Then any child (or the original parent) can set the flag when it fails high, and the thread that failed high can return instantly without waiting, knowing that the other threads will not try to back a result up through that split point once the fail-high flag is set. No waiting at all.
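A sketch of that alternative as I read it (names are illustrative, not Crafty's actual code): the fail-high flag lives in the shared split block rather than per thread, so any thread at the split point can set it and every other thread can bail out without synchronizing.

```cpp
#include <atomic>

// Shared per-split-point state: one fail-high flag visible to every thread
// working at this split point, instead of one flag per thread.
struct SplitBlock {
    std::atomic<bool> failHigh{false};

    // Called by whichever thread (parent or child) gets the cutoff. That
    // thread can return immediately, knowing no sibling will try to back a
    // result up through this split point once the flag is set.
    void reportFailHigh() { failHigh.store(true, std::memory_order_release); }

    // Polled by each thread inside its search loop; true means abandon the
    // subtree right away, with no waiting on the other threads.
    bool shouldAbort() const { return failHigh.load(std::memory_order_acquire); }
};
```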