Explanation for non-expert?

Dann Corbit · Post by **Dann Corbit** » Tue Feb 17, 2015 8:56 pm

bob wrote:
mcostalba wrote:
gladius wrote:
zullil wrote:Can someone explain in general terms the following code from the recent smp patch for SF:

<snip>

It seems that the idea is to order split-points lexicographically based on spLevel, slavesCount and depth, in order to allocate threads "intelligently".

Correction/elaboration welcome. Thanks.
I'll repost my comment from the fishcooking forum:

Really nice idea too. It's a great refinement of the late-join idea. Instead of just joining the first split point we find, score all the potential ones, and pick the one with the best score. The scoring metric checks how many recursive splits have been done first, then the number of slaves already on the split point, and finally the depth. So, it will distribute the slaves out more across the tree, and bias towards ones near the root of the tree.
It is not clear why it succeeds because the idea of late join was already tested without success, also the idea to join to near the root was already tested without success.

So here the new thing is this score formula that is made so to bias toward ones with small number of split points and only when the number of split point is equal, toward near the root of the tree. But near the root is not the key of the success, because when tested alone it failed.

IMO the key is to bias toward small number of split points (the parameter that Bob did not understand). Of course once we explain to him he will promptly say that he already did it in the '80, so, to avoid this pity, I won't tell him what spLevel means
I didn't understand it because spLevel was not in the Stockfish6 code I looked at. That change is not going to be worth 50 Elo or so however. But I will let YOU figure out why. And yes, if you look at my old DTS paper this kind of thing was factored in. There's very little new stuff in normal PVS/YBW type parallel searching.

Considering the direction of compute power (massively parallel computation appears to be the wave of the future) any improvement in efficiency (even 5%) would seem to be terribly important to me.

And with stockfish being open source, everyone can benefit from the results of their experiments. It's a win/win/win situation if there ever was one.

syzygy · Post by **syzygy** » Tue Feb 17, 2015 9:19 pm

zullil wrote:The first result is for Stockfish from just before Joona's patch. The second result includes the patch and nothing more. The unpatched binary used 45% more time to complete the bench test.

But nps goes down. The big problem that SF had when running on many cores is low nps. If that has not improved, then I am very sceptical that there is any Elo gain at normal time controls.

I don't know how the tests were run that showed a big improvement, but if they were run at like 100ms per move, it might well be that the measured gains are due entirely to side-effects that go away at normal time controls (at which real games are played).

For example, I do not exclude that SF running on 16 cores performs worse than SF on 8 cores during the first 100ms of a search. If that is the case, then somehow making sure than only 8 out of 16 cores have work to do (and somehow preventing the other 8 from causing synchronisation overhead) will show an improvement.

I haven't studied the patch but again, I am extremely suspicious if it does not very clearly improve nps. It is absolutely clear that low nps on many cores is the problem that should be addressed.

zamar · Post by **zamar** » Tue Feb 17, 2015 11:47 pm

There is no magic here. This stuff has been known since '80s. Two key factors to consider when/how to split:

1) Each time you split there is a price to pay. 2 threads working on the same split point, one of them produces cut-off, the work of the other thread is wasted. We could say that you are getting only 90% of your money back. If there are many, let's say, four split points above the working thread, the risk is four times higher, so you would be getting only 90%^4 = 64% of your money back.

Moral of the story: Don't create splitpoints under splitpoints unless you really have to.

2) The more threads you have working on the same split point (let's say 8), the higher the price to pay if one of them produces cut-off. If one of the 8 threads produces cut-off, the work of 7 others is wasted.

Moral of the story: Don't put too many threads under the same split point. Balance them as equally between split points as you can.

In the end it's just all about finding the balance between creating a minimum number of split points (recursively) and at the same time putting a minimum number of threads under the same split point.

I'm quite certain that SF is still not in the optimum...

bob · Post by **bob** » Wed Feb 18, 2015 3:09 am

zamar wrote:There is no magic here. This stuff has been known since '80s. Two key factors to consider when/how to split:

1) Each time you split there is a price to pay. 2 threads working on the same split point, one of them produces cut-off, the work of the other thread is wasted. We could say that you are getting only 90% of your money back. If there are many, let's say, four split points above the working thread, the risk is four times higher, so you would be getting only 90%^4 = 64% of your money back.

Moral of the story: Don't create splitpoints under splitpoints unless you really have to.

2) The more threads you have working on the same split point (let's say 8), the higher the price to pay if one of them produces cut-off. If one of the 8 threads produces cut-off, the work of 7 others is wasted.

Moral of the story: Don't put too many threads under the same split point. Balance them as equally between split points as you can.

In the end it's just all about finding the balance between creating a minimum number of split points (recursively) and at the same time putting a minimum number of threads under the same split point.

I'm quite certain that SF is still not in the optimum...

#1 is not quite so simple. If you split below a split point, you still incur about the same amount of overhead. This is a sort of similarity one might find with speculative execution in a modern CPU. Remember that the only overhead you incur is the overhead searched by this new thread, if the split turns out to be wrong. It really doesn't matter where it happens. An existing split point exists for a reason. Choosing a different place to split incurs its own risk.

#2 is also not quite so simple. Better to have 8 threads at a sure ALL node than having 4 at that node and the other 4 somewhere else. One condition I didn't see being checked was one of the conditions I used in the DTS algorithm for Cray Blitz, which was "work left to be done at a split point." Even in Crafty I have been measuring "futile splits" which can happen in multiple ways.

(a) a "late joiner" joins but does nothing because all work has already been completed or scheduled by other threads and is in progress;

(b) N threads split but only N-m have work to do due to a small branching factor at that node (this is particularly bad in endgames and is why I still use a minimum thread group as I did in Cray Blitz to avoid having too many threads at one point.)

Either of the above is a pain, I have been working on reducing both the last few months.

The best thing about the Cray Blitz DTS algorithm is that what you or I now call a "late join" is MUCH more flexible. In Cray Blitz, a thread can join ANY thread, either at an existing split point or by creating a new split point, with equal (almost nil) overhead. With this you can examine each tree being searched, look at all the potential split points, and select the one with the greatest probability of being a true ALL node with significant work remaining.

All in all this will remain an interesting problem for a long time, and as cores go up, it becomes more interesting... And there is also the problem that not all hardware is created equal, which makes it harder to tune this stuff. By hand is hard enough. Automatically is painful.

And then there is multiple-socket NUMA...

lucasart · Post by **lucasart** » Wed Feb 18, 2015 4:08 am

bob wrote: I didn't understand it because spLevel was not in the Stockfish6 code I looked at. That change is not going to be worth 50 Elo or so however. But I will let YOU figure out why. And yes, if you look at my old DTS paper this kind of thing was factored in. There's very little new stuff in normal PVS/YBW type parallel searching.

If you've understood all this since.the 80's, how come Crafty scales so badly from 8 to 16 cores ?

zamar · Post by **zamar** » Wed Feb 18, 2015 9:10 am

Hi Bob,

bob wrote: #1 is not quite so simple. If you split below a split point, you still incur about the same amount of overhead. This is a sort of similarity one might find with speculative execution in a modern CPU. Remember that the only overhead you incur is the overhead searched by this new thread, if the split turns out to be wrong. It really doesn't matter where it happens.

You missed the point. If the thread is searching under a splitpoint which is under existing splitpoint which is under existing split point, and so on (called spLevel in SF), you are speculating over speculations over speculations. If any of those speculations in the chain turn out to be wrong, the work of the thread is lost. Just don't do that.

From a real life:
- If an idiot helps a wise man, something good can still result from it.
- If an idiot helps an idiot who helps an idiot who helps an idiot who helps a wice man, nothin good comes out of it.

#2 is also not quite so simple. Better to have 8 threads at a sure ALL node than having 4 at that node and the other 4 somewhere else. One condition I didn't see being checked was one of the conditions I used in the DTS algorithm for Cray Blitz, which was "work left to be done at a split point." Even in Crafty I have been measuring "futile splits" which can happen in multiple ways.

Well, there is no such a thing as a sure ALL node, but in in principle I agree. An ideal splitting algorithm would take the node type and the count of already searched moves into account as this affects the probability of cut-off.

(a) a "late joiner" joins but does nothing because all work has already been completed or scheduled by other threads and is in progress;

SF sets a flag when there is no more work to be done.

(b) N threads split but only N-m have work to do due to a small branching factor at that node (this is particularly bad in endgames and is why I still use a minimum thread group as I did in Cray Blitz to avoid having too many threads at one point.)

In SF this is controlled by minimum split depth and max slaves per split point parameters.

zullil · Post by **zullil** » Wed Feb 18, 2015 12:51 pm

Question from non-expert:

Would it be possible to include a UCI option "Write SMP log" (false by default) that would dump possibly useful information about activity/inactivity of threads (or whatever), so users with high-core machines might be able to test/contribute feedback?

I understand that testing multi-threaded searching is difficult, because the searches themselves are highly indeterminate. Testing with very short time control matches seems problematic to me.

Thanks.

bob · Post by **bob** » Wed Feb 18, 2015 2:59 pm

lucasart wrote:
bob wrote: I didn't understand it because spLevel was not in the Stockfish6 code I looked at. That change is not going to be worth 50 Elo or so however. But I will let YOU figure out why. And yes, if you look at my old DTS paper this kind of thing was factored in. There's very little new stuff in normal PVS/YBW type parallel searching.
If you've understood all this since.the 80's, how come Crafty scales so badly from 8 to 16 cores ?

It doesn't on my machine. It depends on the box. The PC is not exactly the perfect platform. 16 cores typically has two chips although it could be a single now. 16 cores on a single chip, with a single path to memory has a problem.

Additionally, I chose to write Crafty in a simpler way that the DTS algorithm used in Cray Blitz. I may well go back and rewrite that one day, because it is significantly better. It just hasn't been a high priority. But the issues for SMP scaling are not new. SMP search has been around for almost 40 years now... Alpha/Beta hasn't changed.

On the 12-core box I use occasionally (dual 6 core chips) the scaling is almost perfect. 5.5M NPS on one core, 60M NPS on 12 cores. You do have to know what you are doing when testing, and make certain turbo is disabled, or the answers are useless. Most miss that.

bob · Post by **bob** » Wed Feb 18, 2015 4:29 pm

zamar wrote:Hi Bob,

bob wrote: #1 is not quite so simple. If you split below a split point, you still incur about the same amount of overhead. This is a sort of similarity one might find with speculative execution in a modern CPU. Remember that the only overhead you incur is the overhead searched by this new thread, if the split turns out to be wrong. It really doesn't matter where it happens.
You missed the point. If the thread is searching under a splitpoint which is under existing splitpoint which is under existing split point, and so on (called spLevel in SF), you are speculating over speculations over speculations. If any of those speculations in the chain turn out to be wrong, the work of the thread is lost. Just don't do that.

From a real life:
- If an idiot helps a wise man, something good can still result from it.
- If an idiot helps an idiot who helps an idiot who helps an idiot who helps a wice man, nothin good comes out of it.

#2 is also not quite so simple. Better to have 8 threads at a sure ALL node than having 4 at that node and the other 4 somewhere else. One condition I didn't see being checked was one of the conditions I used in the DTS algorithm for Cray Blitz, which was "work left to be done at a split point." Even in Crafty I have been measuring "futile splits" which can happen in multiple ways.
Well, there is no such a thing as a sure ALL node, but in in principle I agree. An ideal splitting algorithm would take the node type and the count of already searched moves into account as this affects the probability of cut-off.

(a) a "late joiner" joins but does nothing because all work has already been completed or scheduled by other threads and is in progress;
SF sets a flag when there is no more work to be done.

(b) N threads split but only N-m have work to do due to a small branching factor at that node (this is particularly bad in endgames and is why I still use a minimum thread group as I did in Cray Blitz to avoid having too many threads at one point.)
In SF this is controlled by minimum split depth and max slaves per split point parameters.

I understood the spLevel idea. You missed what I wrote. We are talking about a single thread. Wherever you stick it, if you are wrong its effort will be wasted. If you stick it below one or more split points, yes if one of those is wrong then this will waste effort. But the probability of one of those split points being bad is no worse than the probability of a brand new split point being bad. So no matter where you stick this thread, all you risk is making its effort wasted.

When I worked on my dissertation, using up to 30 CPUs at the time, I didn't have much luck trying to take advantage of unpredictable behavior. The obvious idea of trying to split near the root is problematic, because any mis-ordering can turn that ALL node into a CUT node in unpredictable ways. The only really good split point is the root since that is a guaranteed ALL node (PV nodes are also ALL).

But there are so many issues to deal with and so many solutions, that doing things automatically is hard. You have to be very careful with cache management and how you share things to avoid all sorts of cache invalidation/forwarding traffic, memory bandwidth on a single chip is limited, you don't want to have a thread bounce to a different CPU so processor affinity is important, and then there are other things starting with NUMA/memory allocation.

I've tried repeatedly to do something automatic in regard to setting things up optimally. No luck so far, because there is too much variability in the hardware. I tested on a dual 8 core a few weeks back and it really ran well. I tested on a dual 12 core a little later and the plain NPS scaling was bad and took a lot of tweaking to get right. I see this almost every time I run on something newer...

bob · Post by **bob** » Wed Feb 18, 2015 4:37 pm

zullil wrote:Question from non-expert:

Would it be possible to include a UCI option "Write SMP log" (false by default) that would dump possibly useful information about activity/inactivity of threads (or whatever), so users with high-core machines might be able to test/contribute feedback?

I understand that testing multi-threaded searching is difficult, because the searches themselves are highly indeterminate. Testing with very short time control matches seems problematic to me.

Thanks.

This is not easy. First, such a log could be quite big. As a humorous note, Cray once asked me to do this when I was testing on the T90 with 32 cpus. I asked "are you sure you want to see that?" I ran the test, told them where to find the log, Dennis called me back and said "7+ terabytes is a bit large, eh?"

There are several issues to deal with. IF NPS scaling is poor, that is almost always a result of either (a) threads sitting idle waiting for work too long or (b) cache/memory bottlenecks. Discovering which is not so easy. If you watch Crafty run, you will see it display time as "time=3:14(97%)". What that means is that 97% of the time ALL threads were busy, or the inverse, 3% of the total performance power of the machine (N cores) was lost due to waiting for a place to split. If NPS is low but that percentage is high, then we are looking at cache/memory issues which are difficult for a user to deal with. Intel changes their cache mechanism regularly, particularly with automatic pre-fetching. This makes figuring out where the NPS bottleneck is quite difficult at times. Intel / AMD have some internal tools I have used in the past to help me discover unknown cache hot-spots and the like. But every time they make an improvement, it bites some program until the programmer fixes that.

Easy-to-tune is a goal. But a VERY difficult one. I'm interested in how the dual 18-core boxes might run and what quirks they are going to have...

Explanation for non-expert?

Re: Explanation for non-expert?

Re: Explanation for non-expert?

Re: Explanation for non-expert?

Re: Explanation for non-expert?

Re: Explanation for non-expert?

Re: Explanation for non-expert?

Re: Explanation for non-expert?

Re: Explanation for non-expert?

Re: Explanation for non-expert?

Re: Explanation for non-expert?