Lazy SMP Better than YBWC?

mcostalba · Post by **mcostalba** » Sun Oct 25, 2015 8:11 am

Joost Buijs wrote: This is not so easy as it seams when your move order depends upon information e.g. history which might have been changed in between by another thread.

In case move ordering depends on dynamic information (that indeed is the usual case), you can use a simple set lookup, in this example a split point std::set called searchedMoves

Code: Select all

  while &#40;move = NextMove&#40;), move != MOVE_NONE&#41;
  &#123;

    if &#40;sp->searchedMoves.count&#40;move&#41;)
        continue; // No contention on fast path

    sp->spinlock.acquire&#40;);

    if &#40;sp->searchedMoves.count&#40;move&#41;)
    &#123;
        sp->spinlock.release&#40;);
        continue; // Race
    &#125;

    sp->searchedMoves.insert&#40;move&#41;;

    sp->spinlock.release&#40;);

    ...

Joost Buijs · Post by **Joost Buijs** » Sun Oct 25, 2015 11:42 am

mcostalba wrote:
Joost Buijs wrote: This is not so easy as it seams when your move order depends upon information e.g. history which might have been changed in between by another thread.
In case move ordering depends on dynamic information (that indeed is the usual case), you can use a simple set lookup, in this example a split point std::set called searchedMoves
Code: Select all
  while &#40;move = NextMove&#40;), move != MOVE_NONE&#41;
  &#123;

    if &#40;sp->searchedMoves.count&#40;move&#41;)
        continue; // No contention on fast path

    sp->spinlock.acquire&#40;);

    if &#40;sp->searchedMoves.count&#40;move&#41;)
    &#123;
        sp->spinlock.release&#40;);
        continue; // Race
    &#125;

    sp->searchedMoves.insert&#40;move&#41;;

    sp->spinlock.release&#40;);

    ...

I never used that std:: stuff, probably these sets are represented as a binary tree internally, searching that binary tree for each move will cost you performance as well.

Staged move generation is something with questionable value.
When you generate captures and non captures separately and you have an early beta cutoff on a capture move it helps, when you don't have an early cutoff it actually hurts.
Generating and sorting all moves at once before entering the SMP search is something I want to try, after this it is simply picking a move from a list.

Joerg Oster · Post by **Joerg Oster** » Sun Oct 25, 2015 12:23 pm

bob wrote:
mcostalba wrote:
bob wrote: Your YBWC is NOT "state of the art". Talk to some of your guys that are doing actual measurements. Louis Z. comes to mind For example, my current code is showing a 19.5x NPS speedup on 20 cores. How are you doing? When my NPS speedup was stuck at 16x or so a couple of months ago the SMP speedup was hanging at 13x on 20 cores. How are you doing? Current code is faster. I am running the tests now, but it takes a week or so to run them.
NPS is really a poor metric to measure quality of YBWC implementation: you just raise minimum split depth and you get all the nps that you want.

Indeed a way to measure the efficency and the low overhead of the YBWC implementation is to see at what minimum split depth you have the optimum. In SF our sweet spot is 4 but recently with some scheme we even went down to minimum split depth of just 2 (!) plies without losing ELO (please note that I have written ELO, not NPS, TTD or other indirect stuff). IMO a near optimal ELO at minimum split depth 2 is a testament to the very efficient copying and locking and minimal overhead of the YBWC implementation. How are you doing?

Regarding lazy SMP I only add that is far from trivial to get something that works well: many SF developers (me included) have tried and failed before one of us, known by alias mbootsector, come up with the good one. Then many people improved above it, as is common in our development model. So the fact that you didn't get success with lazy SMP does not yield a lot of info per se.
This is a simple point. NPS scaling is a critical measure. Because it provides an absolute upper bound for parallel search speedup. If you run on 20 cores and your NPS is only 15x. Your smp speedup will never exceed 15x, even if it is perfect. So you can't ignore it.

But it is not the ultimate measure. Speedup is the critical part.

If you have a 10x speedup, what do you do for improvement? If your NPS speedup is only 13x, you start there, not on the splitting and such. Find out why it is so slow. If your speedup is 10x but nps is 19x, then there is an issue in choosing split points that is causing excessive search overhead.

My main criticism of lazy SMP is purely on theoretical grounds. If you do fixed depth searches and you don't see much speedup, that represents a problem. While lazy smp might work pretty well, a traditional parallel search should always have a significant edge, particularly as cores climb.

Me thinks this is only a valid/fair comparison if all helper threads are searching the same depth as the main thread.
Alas, nobody is doing lazy smp in this stupid way ...

mar · Post by **mar** » Sun Oct 25, 2015 12:44 pm

Joost Buijs wrote:I never used that std:: stuff, probably these sets are represented as a binary tree internally, searching that binary tree for each move will cost you performance as well.

Yes, usually std::set and std::map is implemented as red-black trees, requiring 3 pointers per node.
Lookup best should be close to log n but you'll need to rebalance now and then.
Also when using default heap allocator, stl usually does 1 new/delete per insert/erase which means additional overhead.
A better solution might be to use priority queue (std::priority_queue) (assuming you only want to insert/pop best), certainly in terms of space.
Another question is real impact and this would of course have to be measured.
EDIT: for small n, incremental insertion sort might be better, but again this would have to be measured

mcostalba · Post by **mcostalba** » Sun Oct 25, 2015 2:55 pm

Joost Buijs wrote: Generating and sorting all moves at once before entering the SMP search is something I want to try, after this it is simply picking a move from a list.

Yes this is another possible approach, but in this case you don't use the dynamically updated move order info retrieved after searching the captures, this can have an effect when search depth is high, but of course only testing has the last word.

Regarding std::set performance, you are aiming at the wrong target, here is reducing lock contention the real target: in case lock is already held, the thread is suspended and it will be resumed at a random, possibly long time, depending on the thread scheduler, this is much worse than the time spent to look up a std::set of just few moves.

bob · Post by **bob** » Sun Oct 25, 2015 4:58 pm

Joerg Oster wrote:
bob wrote:
mcostalba wrote:
bob wrote: Your YBWC is NOT "state of the art". Talk to some of your guys that are doing actual measurements. Louis Z. comes to mind For example, my current code is showing a 19.5x NPS speedup on 20 cores. How are you doing? When my NPS speedup was stuck at 16x or so a couple of months ago the SMP speedup was hanging at 13x on 20 cores. How are you doing? Current code is faster. I am running the tests now, but it takes a week or so to run them.
NPS is really a poor metric to measure quality of YBWC implementation: you just raise minimum split depth and you get all the nps that you want.

Indeed a way to measure the efficency and the low overhead of the YBWC implementation is to see at what minimum split depth you have the optimum. In SF our sweet spot is 4 but recently with some scheme we even went down to minimum split depth of just 2 (!) plies without losing ELO (please note that I have written ELO, not NPS, TTD or other indirect stuff). IMO a near optimal ELO at minimum split depth 2 is a testament to the very efficient copying and locking and minimal overhead of the YBWC implementation. How are you doing?

Regarding lazy SMP I only add that is far from trivial to get something that works well: many SF developers (me included) have tried and failed before one of us, known by alias mbootsector, come up with the good one. Then many people improved above it, as is common in our development model. So the fact that you didn't get success with lazy SMP does not yield a lot of info per se.
This is a simple point. NPS scaling is a critical measure. Because it provides an absolute upper bound for parallel search speedup. If you run on 20 cores and your NPS is only 15x. Your smp speedup will never exceed 15x, even if it is perfect. So you can't ignore it.

But it is not the ultimate measure. Speedup is the critical part.

If you have a 10x speedup, what do you do for improvement? If your NPS speedup is only 13x, you start there, not on the splitting and such. Find out why it is so slow. If your speedup is 10x but nps is 19x, then there is an issue in choosing split points that is causing excessive search overhead.

My main criticism of lazy SMP is purely on theoretical grounds. If you do fixed depth searches and you don't see much speedup, that represents a problem. While lazy smp might work pretty well, a traditional parallel search should always have a significant edge, particularly as cores climb.
Me thinks this is only a valid/fair comparison if all helper threads are searching the same depth as the main thread.
Alas, nobody is doing lazy smp in this stupid way ...

Actually you are, you have a ton of search consistency issues. How do you compare two different moves searched to different depths? And do you control which moves are searched to which depth or is this controlled by serendipity?

IE there is a reason for the "L" in LMR, so you have control over what gets reduced and by how much. As opposed to using RMR (Random Move Reductions).

If you like the results, fine. If you think this is better than a directed search that tries to do what a single-thread search does, only faster, that's a bit of a stretch.

This is not a new idea. Monty Newborn did a flavor of this in the 70's, I don't remember who first proposed ABDADA, but that was also over 20 years ago. The intent was "easy" as opposed to "optimal". That's exactly what this has produced so far, at least for all the published data over the years.

If you want to claim lazy SMP is better, you have to provide data, not hand-waving. And it has to be data run at reasonable time controls on a reasonable number of cores...

when lazy SF stops changing, I'll see if I can give some data that is real by running some longer games on a cluster of 24 core machines. Having actual data would be nice.

bob · Post by **bob** » Sun Oct 25, 2015 5:04 pm

mcostalba wrote:
bob wrote: I am now looking at reducing the NextMove() lock as it ought to be possible to select a move quicker... and reduce that. And that is not a very big contention point, but it is THE contention point that prevents perfect NPS scaling...
You can let each thread to have a local move generator and use move count to sync them:
Code: Select all
  while &#40;move = NextMove&#40;), move != MOVE_NONE&#41;
  &#123;
    if (++moveCount <= sp->moveCount&#41;
        continue; // No contention on fast path

    sp->spinlock.acquire&#40;);  // sp is a pointer to our split point

    if &#40;moveCount <= sp->moveCount&#41;
    &#123;
        sp->spinlock.release&#40;);
        continue; // Race
    &#125;

    assert&#40;moveCount == sp->moveCount + 1&#41;;

    sp->moveCount = moveCount;

    sp->spinlock.release&#40;);

    ...
In this way the contention is reduced to the bare minimum: a very fast spinlock around a single moveCount check.

Started doing this maybe a year ago or so, already. That's one of several changes that improved the NPS scaling to very close to optimal. I am only losing about 1/2 of one core in raw NPS, which as I said, means the lock over head is VERY low. That was when I spent some effort to guarantee that moves at a parallel node are always searched in exactly the same order as nodes in the serial search, at least so far as things like LMR decisions go since once a thread selects a move, anything can happen beyond that point time-wise. Based on vtune, much of the "missing" nps is TLB-related since I am running my tests with 64gb of trans/ref table memory, and the TLB has no hope of speeding that up.

bob · Post by **bob** » Sun Oct 25, 2015 5:05 pm

Joost Buijs wrote:
mcostalba wrote:
bob wrote: I am now looking at reducing the NextMove() lock as it ought to be possible to select a move quicker... and reduce that. And that is not a very big contention point, but it is THE contention point that prevents perfect NPS scaling...
You can let each thread to have a local move generator and use move count to sync them:
Code: Select all
  while &#40;move = NextMove&#40;), move != MOVE_NONE&#41;
  &#123;
    if (++moveCount <= sp->moveCount&#41;
        continue; // No contention on fast path

    sp->spinlock.acquire&#40;);  // sp is a pointer to our split point

    if &#40;moveCount <= sp->moveCount&#41;
    &#123;
        sp->spinlock.release&#40;);
        continue; // Race
    &#125;

    assert&#40;moveCount == sp->moveCount + 1&#41;;

    sp->moveCount = moveCount;

    sp->spinlock.release&#40;);

    ...
In this way the contention is reduced to the bare minimum: a very fast spinlock around a single moveCount check.
This is not so easy as it seams when your move order depends upon information e.g. history which might have been changed in between by another thread.

That's pretty much irrelevant anyway if you use shared history. I've tried both and shared has been best for me. Now with multiple threads dinging around with the counters, they will never be consistent from run to run anyway...

bob · Post by **bob** » Sun Oct 25, 2015 5:08 pm

mcostalba wrote:
Joost Buijs wrote: Generating and sorting all moves at once before entering the SMP search is something I want to try, after this it is simply picking a move from a list.
Yes this is another possible approach, but in this case you don't use the dynamically updated move order info retrieved after searching the captures, this can have an effect when search depth is high, but of course only testing has the last word.

Regarding std::set performance, you are aiming at the wrong target, here is reducing lock contention the real target: in case lock is already held, the thread is suspended and it will be resumed at a random, possibly long time, depending on the thread scheduler, this is much worse than the time spent to look up a std::set of just few moves.

I've never written a parallel search that has that issue. I've always used spin locks so there is no thread scheduler to deal with. I use affinity to set one thread per physical core, and they run without blocking from that point forward.

Joost Buijs · Post by **Joost Buijs** » Sun Oct 25, 2015 5:13 pm

mcostalba wrote: Regarding std::set performance, you are aiming at the wrong target, here is reducing lock contention the real target: in case lock is already held, the thread is suspended and it will be resumed at a random, possibly long time, depending on the thread scheduler, this is much worse than the time spent to look up a std::set of just few moves.

Yes, I know what lock contention is and that is why you want to keep the duration of the lock as short as possible.

In my engine I do the following:

Try the hash move (if available) before any move generation at all.
Then I generate all captures and sort them with an insertion sort, this happens before I call the SMP search.
Now I can simply pick the captures from a list which goes very quick.
The only time I really get a problem with lock contention is when I run out of captures and have to generate the non captures and sort them.
That is why I want to try to generate all moves at once, it also makes the code a lot cleaner and probably faster.

Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?