Lazy SMP Better than YBWC?

mcostalba · Post by **mcostalba** » Sat Oct 24, 2015 10:30 pm

bob wrote: Your YBWC is NOT "state of the art". Talk to some of your guys that are doing actual measurements. Louis Z. comes to mind For example, my current code is showing a 19.5x NPS speedup on 20 cores. How are you doing? When my NPS speedup was stuck at 16x or so a couple of months ago the SMP speedup was hanging at 13x on 20 cores. How are you doing? Current code is faster. I am running the tests now, but it takes a week or so to run them.

NPS is really a poor metric to measure quality of YBWC implementation: you just raise minimum split depth and you get all the nps that you want.

Indeed a way to measure the efficency and the low overhead of the YBWC implementation is to see at what minimum split depth you have the optimum. In SF our sweet spot is 4 but recently with some scheme we even went down to minimum split depth of just 2 (!) plies without losing ELO (please note that I have written ELO, not NPS, TTD or other indirect stuff). IMO a near optimal ELO at minimum split depth 2 is a testament to the very efficient copying and locking and minimal overhead of the YBWC implementation. How are you doing?

Regarding lazy SMP I only add that is far from trivial to get something that works well: many SF developers (me included) have tried and failed before one of us, known by alias mbootsector, come up with the good one. Then many people improved above it, as is common in our development model. So the fact that you didn't get success with lazy SMP does not yield a lot of info per se.

mcostalba · Post by **mcostalba** » Sat Oct 24, 2015 10:45 pm

bob wrote: An interesting performance metric for current code. There is ZERO lock contention anywhere except at split points where multiple threads are trying to acquire the next move.

Well, this is exactly where the biggest part of lock contention is, especially with a staged move generator (there are tools that can help you verifying this. Intel's "vtune" is one...but there are also many others).

bob · Post by **bob** » Sun Oct 25, 2015 12:04 am

mcostalba wrote:
bob wrote: An interesting performance metric for current code. There is ZERO lock contention anywhere except at split points where multiple threads are trying to acquire the next move.
Well, this is exactly where the biggest part of lock contention is, especially with a staged move generator (there are tools that can help you verifying this. Intel's "vtune" is one...but there are also many others).

Late to the party. I already use vtune. It is _very_ good. Not seen anything else that can give that much data about everything. And it even intelligently figures out what to use the hardware performance monitors to measure as it observes... And note I don't have MUCH contention there as shown by NPS scaling...

bob · Post by **bob** » Sun Oct 25, 2015 12:09 am

mcostalba wrote:
bob wrote: Your YBWC is NOT "state of the art". Talk to some of your guys that are doing actual measurements. Louis Z. comes to mind For example, my current code is showing a 19.5x NPS speedup on 20 cores. How are you doing? When my NPS speedup was stuck at 16x or so a couple of months ago the SMP speedup was hanging at 13x on 20 cores. How are you doing? Current code is faster. I am running the tests now, but it takes a week or so to run them.
NPS is really a poor metric to measure quality of YBWC implementation: you just raise minimum split depth and you get all the nps that you want.

Indeed a way to measure the efficency and the low overhead of the YBWC implementation is to see at what minimum split depth you have the optimum. In SF our sweet spot is 4 but recently with some scheme we even went down to minimum split depth of just 2 (!) plies without losing ELO (please note that I have written ELO, not NPS, TTD or other indirect stuff). IMO a near optimal ELO at minimum split depth 2 is a testament to the very efficient copying and locking and minimal overhead of the YBWC implementation. How are you doing?

Regarding lazy SMP I only add that is far from trivial to get something that works well: many SF developers (me included) have tried and failed before one of us, known by alias mbootsector, come up with the good one. Then many people improved above it, as is common in our development model. So the fact that you didn't get success with lazy SMP does not yield a lot of info per se.

This is a simple point. NPS scaling is a critical measure. Because it provides an absolute upper bound for parallel search speedup. If you run on 20 cores and your NPS is only 15x. Your smp speedup will never exceed 15x, even if it is perfect. So you can't ignore it.

But it is not the ultimate measure. Speedup is the critical part.

If you have a 10x speedup, what do you do for improvement? If your NPS speedup is only 13x, you start there, not on the splitting and such. Find out why it is so slow. If your speedup is 10x but nps is 19x, then there is an issue in choosing split points that is causing excessive search overhead.

My main criticism of lazy SMP is purely on theoretical grounds. If you do fixed depth searches and you don't see much speedup, that represents a problem. While lazy smp might work pretty well, a traditional parallel search should always have a significant edge, particularly as cores climb.

mcostalba · Post by **mcostalba** » Sun Oct 25, 2015 7:12 am

bob wrote: I am now looking at reducing the NextMove() lock as it ought to be possible to select a move quicker... and reduce that. And that is not a very big contention point, but it is THE contention point that prevents perfect NPS scaling...

You can let each thread to have a local move generator and use move count to sync them:

Code: Select all

  while (move = NextMove(), move != MOVE_NONE)
  {
    if (++moveCount <= sp->moveCount)
        continue; // No contention on fast path

    sp->spinlock.acquire();  // sp is a pointer to our split point

    if (moveCount <= sp->moveCount)
    {
        sp->spinlock.release();
        continue; // Race
    }

    assert(moveCount == sp->moveCount + 1);

    sp->moveCount = moveCount;

    sp->spinlock.release();

    ...

In this way the contention is reduced to the bare minimum: a very fast spinlock around a single moveCount check.

Joost Buijs · Post by **Joost Buijs** » Sun Oct 25, 2015 7:59 am

mcostalba wrote:
bob wrote: I am now looking at reducing the NextMove() lock as it ought to be possible to select a move quicker... and reduce that. And that is not a very big contention point, but it is THE contention point that prevents perfect NPS scaling...
You can let each thread to have a local move generator and use move count to sync them:
Code: Select all
  while (move = NextMove(), move != MOVE_NONE)
  {
    if (++moveCount <= sp->moveCount)
        continue; // No contention on fast path

    sp->spinlock.acquire();  // sp is a pointer to our split point

    if (moveCount <= sp->moveCount)
    {
        sp->spinlock.release();
        continue; // Race
    }

    assert(moveCount == sp->moveCount + 1);

    sp->moveCount = moveCount;

    sp->spinlock.release();

    ...
In this way the contention is reduced to the bare minimum: a very fast spinlock around a single moveCount check.

This is not so easy as it seams when your move order depends upon information e.g. history which might have been changed in between by another thread.

mcostalba · Post by **mcostalba** » Sun Oct 25, 2015 8:11 am

Joost Buijs wrote: This is not so easy as it seams when your move order depends upon information e.g. history which might have been changed in between by another thread.

In case move ordering depends on dynamic information (that indeed is the usual case), you can use a simple set lookup, in this example a split point std::set called searchedMoves

Code: Select all

  while (move = NextMove(), move != MOVE_NONE)
  {

    if (sp->searchedMoves.count(move))
        continue; // No contention on fast path

    sp->spinlock.acquire();

    if (sp->searchedMoves.count(move))
    {
        sp->spinlock.release();
        continue; // Race
    }

    sp->searchedMoves.insert(move);

    sp->spinlock.release();

    ...

Joost Buijs · Post by **Joost Buijs** » Sun Oct 25, 2015 11:42 am

mcostalba wrote:
Joost Buijs wrote: This is not so easy as it seams when your move order depends upon information e.g. history which might have been changed in between by another thread.
In case move ordering depends on dynamic information (that indeed is the usual case), you can use a simple set lookup, in this example a split point std::set called searchedMoves
Code: Select all
  while (move = NextMove(), move != MOVE_NONE)
  {

    if (sp->searchedMoves.count(move))
        continue; // No contention on fast path

    sp->spinlock.acquire();

    if (sp->searchedMoves.count(move))
    {
        sp->spinlock.release();
        continue; // Race
    }

    sp->searchedMoves.insert(move);

    sp->spinlock.release();

    ...

I never used that std:: stuff, probably these sets are represented as a binary tree internally, searching that binary tree for each move will cost you performance as well.

Staged move generation is something with questionable value.
When you generate captures and non captures separately and you have an early beta cutoff on a capture move it helps, when you don't have an early cutoff it actually hurts.
Generating and sorting all moves at once before entering the SMP search is something I want to try, after this it is simply picking a move from a list.

Joerg Oster · Post by **Joerg Oster** » Sun Oct 25, 2015 12:23 pm

bob wrote:
mcostalba wrote:
bob wrote: Your YBWC is NOT "state of the art". Talk to some of your guys that are doing actual measurements. Louis Z. comes to mind For example, my current code is showing a 19.5x NPS speedup on 20 cores. How are you doing? When my NPS speedup was stuck at 16x or so a couple of months ago the SMP speedup was hanging at 13x on 20 cores. How are you doing? Current code is faster. I am running the tests now, but it takes a week or so to run them.
NPS is really a poor metric to measure quality of YBWC implementation: you just raise minimum split depth and you get all the nps that you want.

Indeed a way to measure the efficency and the low overhead of the YBWC implementation is to see at what minimum split depth you have the optimum. In SF our sweet spot is 4 but recently with some scheme we even went down to minimum split depth of just 2 (!) plies without losing ELO (please note that I have written ELO, not NPS, TTD or other indirect stuff). IMO a near optimal ELO at minimum split depth 2 is a testament to the very efficient copying and locking and minimal overhead of the YBWC implementation. How are you doing?

Regarding lazy SMP I only add that is far from trivial to get something that works well: many SF developers (me included) have tried and failed before one of us, known by alias mbootsector, come up with the good one. Then many people improved above it, as is common in our development model. So the fact that you didn't get success with lazy SMP does not yield a lot of info per se.
This is a simple point. NPS scaling is a critical measure. Because it provides an absolute upper bound for parallel search speedup. If you run on 20 cores and your NPS is only 15x. Your smp speedup will never exceed 15x, even if it is perfect. So you can't ignore it.

But it is not the ultimate measure. Speedup is the critical part.

If you have a 10x speedup, what do you do for improvement? If your NPS speedup is only 13x, you start there, not on the splitting and such. Find out why it is so slow. If your speedup is 10x but nps is 19x, then there is an issue in choosing split points that is causing excessive search overhead.

My main criticism of lazy SMP is purely on theoretical grounds. If you do fixed depth searches and you don't see much speedup, that represents a problem. While lazy smp might work pretty well, a traditional parallel search should always have a significant edge, particularly as cores climb.

Me thinks this is only a valid/fair comparison if all helper threads are searching the same depth as the main thread.
Alas, nobody is doing lazy smp in this stupid way ...

mar · Post by **mar** » Sun Oct 25, 2015 12:44 pm

Joost Buijs wrote:I never used that std:: stuff, probably these sets are represented as a binary tree internally, searching that binary tree for each move will cost you performance as well.

Yes, usually std::set and std::map is implemented as red-black trees, requiring 3 pointers per node.
Lookup best should be close to log n but you'll need to rebalance now and then.
Also when using default heap allocator, stl usually does 1 new/delete per insert/erase which means additional overhead.
A better solution might be to use priority queue (std::priority_queue) (assuming you only want to insert/pop best), certainly in terms of space.
Another question is real impact and this would of course have to be measured.
EDIT: for small n, incremental insertion sort might be better, but again this would have to be measured

Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?