Lazy SMP Better than YBWC?

Discussion of chess software programming and technical issues.

Moderator: Ras

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP Better than YBWC?

Post by bob »

Joost Buijs wrote:
mcostalba wrote: Regarding std::set performance, you are aiming at the wrong target; the real target here is reducing lock contention: if the lock is already held, the thread is suspended and will be resumed at a random, possibly long, time depending on the thread scheduler, and this is much worse than the time spent to look up a std::set of just a few moves.
Yes, I know what lock contention is and that is why you want to keep the duration of the lock as short as possible.

In my engine I do the following:

Try the hash move (if available) before any move generation at all.
Then I generate all captures and sort them with an insertion sort; this happens before I call the SMP search.
Now I can simply pick the captures from a list, which goes very quickly.
The only time I really get a problem with lock contention is when I run out of captures and have to generate the non-captures and sort them.
That is why I want to try generating all moves at once; it also makes the code a lot cleaner and probably faster.
This is simply a "pseudo-lock" when you think about it. You generate captures before you do the SMP search. So idle threads get to sit and wait while you do this. As opposed to sitting and waiting at a lock while you do this. No difference in terms of speed or performance.

I don't do a "sort". I select a move when one is needed. I've tried sorting many times but it has always slowed me down too much, particularly when you don't have a hash move or a good capture and you generate all moves and sort. And at every other ply only one move is needed to fail high, wasting the sort effort... I also try killers, counter-moves, etc., before generating the non-captures, which has always been faster for me.
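
For illustration, here is a minimal sketch of the "select a move when one is needed" idea (hypothetical names, not Crafty's actual code): each call does one selection pass over the unsearched part of the list instead of sorting everything up front, so an early fail high never pays for a full sort.

Code: Select all

  #include <algorithm>
  #include <utility>

  typedef int Move;
  const Move MOVE_NONE = 0;

  struct ScoredMove { Move move; int score; };

  // One selection pass per call: find the best remaining move in [cur, end),
  // swap it to the front of the unsearched range and hand it out.
  Move PickNext(ScoredMove* cur, ScoredMove* end)
  {
      if (cur == end)
          return MOVE_NONE;

      ScoredMove* best = std::max_element(cur, end,
          [](const ScoredMove& a, const ScoredMove& b) { return a.score < b.score; });

      std::swap(*cur, *best);
      return cur->move;
  }

The caller advances cur after each searched move; the worst case is the same O(n^2) as sorting the whole list, but the common fail-high case touches the list only once.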
Joost Buijs
Posts: 1624
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Lazy SMP Better than YBWC?

Post by Joost Buijs »

bob wrote: This is simply a "pseudo-lock" when you think about it. You generate captures before you do the SMP search. So idle threads get to sit and wait while you do this. As opposed to sitting and waiting at a lock while you do this. No difference in terms of speed or performance.
It is very difficult to tell exactly what happens with idle threads when the SMP search is fully up and running.
When a thread is at the stage of generating moves, it is very likely that idle threads will be assigned to a split-point in a different thread.
bob wrote: I don't do a "sort". I select a move when one is needed. I've tried sorting many times but it has always slowed me down too much, particularly when you don't have a hash move or a good capture and you generate all moves and sort. And at every other ply only one move is needed to fail high, wasting the sort effort... I also try killers, counter-moves, etc., before generating the non-captures, which has always been faster for me.
Probably you are right that selecting a move when you need it is faster than sorting in the case of an early cutoff.
It is just a matter of laziness that I sort the moves before using them; it makes a difference of a few percent at most, which means a few Elo points.
It seems wiser to spend the effort on the evaluation function, which can make a difference of hundreds of points.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Lazy SMP Better than YBWC?

Post by mcostalba »

Joost Buijs wrote: Yes, I know what lock contention is and that is why you want to keep the duration of the lock as short as possible.
This is the typical naive approach to optimization. Optimization is an engineering discipline and should be approached with an engineering method, i.e. by measuring. The only way to optimize is with a profiler; many times what you think is sensible turns out to be absolutely harmless. In this case it is far from clear that looking up a move in a set is a performance bottleneck(!).

BTW the hot path (the first lookup) is outside lock contention, and the second lookup is already fully cached and the branches well predicted because it happens just a few instructions later, so any assumption about the performance impact is purely speculative.
Joost Buijs
Posts: 1624
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Lazy SMP Better than YBWC?

Post by Joost Buijs »

mcostalba wrote:
Joost Buijs wrote: Yes, I know what lock contention is and that is why you want to keep the duration of the lock as short as possible.
This is the typical naive approach to optimization. Optimization is an engineering discipline and should be approached with an engineering method, i.e. by measuring. The only way to optimize is with a profiler; many times what you think is sensible turns out to be absolutely harmless. In this case it is far from clear that looking up a move in a set is a performance bottleneck(!).

BTW the hot path (the first lookup) is outside lock contention, and the second lookup is already fully cached and the branches well predicted because it happens just a few instructions later, so any assumption about the performance impact is purely speculative.
Keeping the duration of the locks as short as possible is in my view not naive.
I agree that when you really want to know where the bottlenecks are, you have to use a profiler, or something like Intel VTune, which is more than a profiler.
Here I use Intel Parallel Studio, so I'm fully aware of what it has to offer.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: Lazy SMP Better than YBWC?

Post by mcostalba »

mcostalba wrote: many times what you think is sensible turns out to be absolutely harmless.
For instance, without profiling, one worried about the time the lookup takes could write something like this:

Code: Select all

  while (move = NextMove(), move != MOVE_NONE)
  {

    auto size = sp->searchedMoves.size();

    if (sp->searchedMoves.count(move))
        continue; // No contention on fast path

    sp->spinlock.acquire();

    if (   sp->searchedMoves.size() != size  // False in the common case
        && sp->searchedMoves.count(move))
    {
        sp->spinlock.release();
        continue; // Race
    }

    sp->searchedMoves.insert(move);

    sp->spinlock.release();

    ... 
This _could_ seem like a nice trick to avoid the second lookup in most cases, but from an engineering point of view it is a mistake: without profiling, this just adds complexity for no reason.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP Better than YBWC?

Post by bob »

Joost Buijs wrote:
bob wrote: This is simply a "pseudo-lock" when you think about it. You generate captures before you do the SMP search. So idle threads get to sit and wait while you do this. As opposed to sitting and waiting at a lock while you do this. No difference in terms of speed or performance.
It is very difficult to tell exactly what happens with idle threads when the SMP search is fully up and running.
When a thread is at the stage of generating moves, it is very likely that idle threads will be assigned to a split-point in a different thread.

Correct, but this misses my point. There are two cases. Either a thread joins an existing split point after the moves have been generated, in which case locks are irrelevant since the moves are done, or it joins or tries to join before the moves are generated. With a lock it has to wait until they are generated, without a lock it has to wait for the "invitation" that comes only after the moves have been generated. You wait the same amount of time either way, if you wait...

There is one point that is relevant, however. If you do a separate move selection procedure for parallel search, you can make it faster and ignore efficiency since it isn't done very often. For example, I very carefully do NOT search killers, hash move, counter-moves, etc. a second time. That is a list of "already searched prior to quiet move generation" moves that has to be kept, and that is a serial bottleneck. But with just one thread using that code it works fine, and the small cost more than offsets the overhead of searching a move a second time, even if it ends up in a hash hit at the next ply. But for a parallel move selection procedure, anything done under a lock has to be looked at carefully, as it feeds Amdahl's law. My original approach was to acquire a lock, then choose the next move for this thread, then release the lock. I am working on a rewrite of this to see if it can be improved significantly, although with my current NPS scaling, the improvement will necessarily be fairly small since there is not much room left. But then again, 20 or 24 cores is not the end of the "core curve" either.
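
A rough sketch of that "acquire a lock, choose the next move, release the lock" step (hypothetical names and layout, not Crafty's actual split-point code) might look like the following; the index bump and the move fetch both happen inside the lock, so two threads can never be handed the same move.

Code: Select all

  #include <atomic>

  typedef int Move;
  const Move MOVE_NONE = 0;

  // Minimal spinlock so waiting threads never go through the OS scheduler.
  struct Spinlock {
      std::atomic<bool> locked{false};
      void acquire() { while (locked.exchange(true, std::memory_order_acquire)) {} }
      void release() { locked.store(false, std::memory_order_release); }
  };

  struct SplitPoint {
      Spinlock spinlock;
      Move     moves[256];    // generated once by the thread that owns the split point
      int      moveCount = 0;
      int      next = 0;      // index of the next unsearched move
  };

  Move NextParallelMove(SplitPoint* sp)
  {
      Move m = MOVE_NONE;
      sp->spinlock.acquire();
      if (sp->next < sp->moveCount)
          m = sp->moves[sp->next++];
      sp->spinlock.release();
      return m;               // MOVE_NONE: nothing left at this split point
  }

Everything outside that tiny critical section runs in parallel, which is exactly the Amdahl's law concern: the less work done while holding the lock, the better the scaling.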

I still think DTS is THE way to do this, but I am not yet ready to give up my recursive negamax search code.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP Better than YBWC?

Post by bob »

mcostalba wrote:
Joost Buijs wrote: Yes, I know what lock contention is and that is why you want to keep the duration of the lock as short as possible.
This is the typical naive approach to optimization. Optimization is an engineering discipline and should be approached with an engineering method, i.e. by measuring. The only way to optimize is with a profiler; many times what you think is sensible turns out to be absolutely harmless. In this case it is far from clear that looking up a move in a set is a performance bottleneck(!).

BTW the hot path (the first lookup) is outside lock contention, and the second lookup is already fully cached and the branches well predicted because it happens just a few instructions later, so any assumption about the performance impact is purely speculative.
As a note, branch prediction might not quite be what you expect. I noticed with vtune that there were lots of "flagged" branches with a higher than expected misprediction rate. I ran SF6 just to compare and the same is true there. The best prediction examples deal with loops, which we don't exactly use in the traditional sense.

I'll try to post some Crafty and SF6 prediction data later this week. I can't get to it from home as I have not studied the command-line vtune interface and have only used the X11-based GUI. X11 over even my 21mbit download bandwidth is unusable for vtune.
Joost Buijs
Posts: 1624
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Lazy SMP Better than YBWC?

Post by Joost Buijs »

bob wrote: Correct, but this misses my point. There are two cases. Either a thread joins an existing split point after the moves have been generated, in which case locks are irrelevant since the moves are done, or it joins or tries to join before the moves are generated. With a lock it has to wait until they are generated, without a lock it has to wait for the "invitation" that comes only after the moves have been generated. You wait the same amount of time either way, if you wait...
I have to think about this.

One way or the other, you will always have to wait for the move generation to finish; I don't see how you can circumvent this.
bob wrote: There is one point that is relevant, however. If you do a separate move selection procedure for parallel search, you can make it faster and ignore efficiency since it isn't done very often. For example, I very carefully do NOT search killers, hash move, counter-moves, etc. a second time. That is a list of "already searched prior to quiet move generation" moves that has to be kept, and that is a serial bottleneck. But with just one thread using that code it works fine, and the small cost more than offsets the overhead of searching a move a second time, even if it ends up in a hash hit at the next ply. But for a parallel move selection procedure, anything done under a lock has to be looked at carefully, as it feeds Amdahl's law. My original approach was to acquire a lock, then choose the next move for this thread, then release the lock. I am working on a rewrite of this to see if it can be improved significantly, although with my current NPS scaling, the improvement will necessarily be fairly small since there is not much room left. But then again, 20 or 24 cores is not the end of the "core curve" either.

I still think DTS is THE way to do this, but I am not yet ready to give up my recursive negamax search code.


On my 8 core machine I see a typical NPS speedup of 6.5 and a time to depth speedup of 4 to 5.

There are several bottlenecks in my parallel code; for instance, I use condition variables (which probably go through the scheduler) to kick off my threads.
On many occasions I see that by the time a thread kicks off, another one has already had a beta cutoff.
I can solve this by not using the scheduler, but then my CPU will run at 100% load indefinitely, which I don't like.

Another problem is that when a beta cutoff occurs I set an abort flag in the split-point, and it takes a long time for the other slave threads to respond.
There is no way to solve this; I check the abort flag as much as possible, but it always costs time.

In a new version of the engine I'm working on, when the abort flag is set I jump right back with a long-jump to the point where I called the search; maybe this will help a little.
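
As a sketch of that long-jump idea (purely illustrative names, not Joost's actual code; note that longjmp bypasses C++ destructors, so it only works cleanly if nothing on the search path needs unwinding):

Code: Select all

  #include <csetjmp>
  #include <atomic>

  static thread_local std::jmp_buf abortPoint;  // set where this thread entered the search
  std::atomic<bool> abortFlag(false);           // set by the thread that got the beta cutoff
                                                // (a single global flag here; per split point in a real engine)

  // Must be called somewhere inside SearchRoot's call tree.
  void CheckAbort()
  {
      if (abortFlag.load(std::memory_order_relaxed))
          std::longjmp(abortPoint, 1);          // unwind straight back to SearchRoot
  }

  int SearchRoot(/* position, depth, ... */)
  {
      if (setjmp(abortPoint) != 0)
          return 0;                             // we long-jumped here: search was aborted

      // ... recursive search; call CheckAbort() at every node ...
      return 0;
  }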

I guess you will always have some dead time whatever you do.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Lazy SMP Better than YBWC?

Post by bob »

Joost Buijs wrote:
bob wrote: Correct, but this misses my point. There are two cases. Either a thread joins an existing split point after the moves have been generated, in which case locks are irrelevant since the moves are done, or it joins or tries to join before the moves are generated. With a lock it has to wait until they are generated, without a lock it has to wait for the "invitation" that comes only after the moves have been generated. You wait the same amount of time either way, if you wait...
I have to think about this.

One way or the other, you will always have to wait for the move generation to finish; I don't see how you can circumvent this.
bob wrote: There is one point that is relevant, however. If you do a separate move selection procedure for parallel search, you can make it faster and ignore efficiency since it isn't done very often. For example, I very carefully do NOT search killers, hash move, counter-moves, etc. a second time. That is a list of "already searched prior to quiet move generation" moves that has to be kept, and that is a serial bottleneck. But with just one thread using that code it works fine, and the small cost more than offsets the overhead of searching a move a second time, even if it ends up in a hash hit at the next ply. But for a parallel move selection procedure, anything done under a lock has to be looked at carefully, as it feeds Amdahl's law. My original approach was to acquire a lock, then choose the next move for this thread, then release the lock. I am working on a rewrite of this to see if it can be improved significantly, although with my current NPS scaling, the improvement will necessarily be fairly small since there is not much room left. But then again, 20 or 24 cores is not the end of the "core curve" either.

I still think DTS is THE way to do this, but I am not yet ready to give up my recursive negamax search code.


On my 8 core machine I see a typical NPS speedup of 6.5 and a time to depth speedup of 4 to 5.

There are several bottlenecks in my parallel code; for instance, I use condition variables (which probably go through the scheduler) to kick off my threads.
On many occasions I see that by the time a thread kicks off, another one has already had a beta cutoff.
I can solve this by not using the scheduler, but then my CPU will run at 100% load indefinitely, which I don't like.

Another problem is that when a beta cutoff occurs I set an abort flag in the split-point, and it takes a long time for the other slave threads to respond.
There is no way to solve this; I check the abort flag as much as possible, but it always costs time.
Why? This should be a simple cache hit until a thread sets the flag, which invalidates all the other cached copies; they then refresh the line (from the cache where it was modified) and abort. I check this flag after any call to search or quiesce. I can't even find this line of code mentioned in any vtune analysis output... And I check it at LEAST once per node, more if there are things like LMR re-searches, or such.
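
For illustration, the per-node check is nothing more than a read of a shared flag between moves (a fragment with placeholder names, not Crafty's actual code); until some thread writes the flag, that read is an ordinary cache hit.

Code: Select all

  while ((move = NextMove()) != MOVE_NONE)
  {
      MakeMove(pos, move);
      value = -Search(pos, -beta, -alpha, depth - 1);
      UnmakeMove(pos, move);

      if (sp->abort)        // written only by the thread that got the cutoff
          return alpha;     // quick return up the tree; the result is discarded anyway

      // ... normal alpha/beta bookkeeping ...
  }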

The 6.5 NPS speedup sounds like trouble on machines with 20-24 cores and beyond. I am currently hitting 19.5x on a 20-core box and I am trying to get that last .5x presently.


In a new version of the engine I'm working on, when the abort flag is set I jump right back with a long-jump to the point where I called the search; maybe this will help a little.
I assume you are then using copy/make so that you get everything back to the initial setting by just returning to the root "stack" information? Copy/Make was too slow for me. I've converted back to it a couple of times to test but it is always worse speed-wise. But with the almost zero cost of the abort flag, a series of quick returns gets me out of things almost instantly.


I guess you will always have some dead time whatever you do.
BTW my goal is ALWAYS max speed. I don't touch the thread scheduler, I don't use mutex locks since they block, etc. When I start things up, they run at 100% cpu duty cycle. Sometimes annoying, but if I want to use the machine for other things, I just use fewer than the max cores for chess. Letting a thread wait due to the process scheduler is not a good idea anyway, so I simply don't do this. And since I use processor affinity I don't worry about threads bouncing around and screwing up the cache either.
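
On Linux, pinning a thread to a core takes only a few lines; a sketch using the glibc affinity call (not Crafty's actual code):

Code: Select all

  #ifndef _GNU_SOURCE
  #define _GNU_SOURCE
  #endif
  #include <pthread.h>
  #include <sched.h>

  // Pin the calling thread to one core so it never migrates and its
  // cache contents stay warm.
  void PinToCore(int core)
  {
      cpu_set_t set;
      CPU_ZERO(&set);
      CPU_SET(core, &set);
      pthread_setaffinity_np(pthread_self(), sizeof(cpu_set_t), &set);
  }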
Joost Buijs
Posts: 1624
Joined: Thu Jul 16, 2009 10:47 am
Location: Almere, The Netherlands

Re: Lazy SMP Better than YBWC?

Post by Joost Buijs »

bob wrote: Why? This should be a simple cache hit until a thread sets the flag, which invalidates all the other cached copies; they then refresh the line (from the cache where it was modified) and abort. I check this flag after any call to search or quiesce. I can't even find this line of code mentioned in any vtune analysis output... And I check it at LEAST once per node, more if there are things like LMR re-searches, or such.

The 6.5 NPS speedup sounds like trouble on machines with 20-24 cores and beyond. I am currently hitting 19.5x on a 20-core box and I am trying to get that last .5x presently.
I don't know why it takes so long to respond to the abort flag; I check the flag after every do-move and after a return from search at each node.
I'm talking about microseconds here, but when it happens very often it is a measurable loss.

Unfortunately I don't have a 20 or 24 core machine available since I retired from the university.
I tested a few times on a 20-core machine and the NPS speedup that I saw was in the region of 13.5 to 15, so a lot worse than your 19.5.
bob wrote: I assume you are then using copy/make so that you get everything back to the initial setting by just returning to the root "stack" information? Copy/Make was too slow for me. I've converted back to it a couple of times to test but it is always worse speed-wise. But with the almost zero cost of the abort flag, a series of quick returns gets me out of things almost instantly.
No, I don't use copy/make; the only thing I have to do is make an extra copy of the position for the master thread, just like I have to do for the slaves, and this hardly costs extra time.
Copy/make I've never tried; I back up some things that are time-consuming to restore, like hash keys and such, and all the other stuff is recalculated.
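
As a sketch of that scheme (hypothetical names, not the actual engine code): the fields that are expensive to restore are saved in a small undo record before the move is made and copied back on unmake, while everything else is updated incrementally or recomputed.

Code: Select all

  #include <cstdint>

  // Only the fields that are expensive to restore are backed up.
  struct Undo {
      uint64_t hashKey;
      uint64_t pawnKey;
      int      fiftyMoveCounter;
  };

  void MakeMove(Position& pos, Move m, Undo& u)
  {
      u.hashKey          = pos.hashKey;
      u.pawnKey          = pos.pawnKey;
      u.fiftyMoveCounter = pos.fiftyMoveCounter;

      // ... update the board, keys and counters incrementally ...
  }

  void UnmakeMove(Position& pos, Move m, const Undo& u)
  {
      // ... move the pieces back ...
      pos.hashKey          = u.hashKey;
      pos.pawnKey          = u.pawnKey;
      pos.fiftyMoveCounter = u.fiftyMoveCounter;
  }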