Lazy SMP Better than YBWC?

sje · Post by **sje** » Sun Oct 25, 2015 11:58 pm

Joost Buijs wrote:In a new version of the engine I'm working on I jump right back with a long-jump to the point where I called the search when the abort flag is set, maybe this will help a little.

I suggest not using setjmp()/longjmp() in any context where it can be avoided. Its usage is a magnet for memory leakage, and this alone is enough condemnation. For a traditional chess search there are two good alternatives:

1) Use a flag to indicate a search stop, and test this flag upon returning from a node to toss the results. The most recently generated PV/score is set as the ultimate search result with everything calculated after the flag is set is thrown away. This is what I used to do in my search code.

2) Re-write the recursive search as a non-recursive state machine with the formerly local variables explicitly stored on a ply-indexed stack. At each state transition, the stop/abort flag can be checked and the next state will be set to the search termination state. Again, the most recently generated PV/score is set as the ultimate search result. This is what all my recent code does.

The second approach is better, although it takes some forethought to specify the state/transition map. There could be up to two or three dozen states needed with at least a pair for each implementation of what would be a single recursive call in a non-state routine. While there is some compute time needed to run the big fat switch statement on each transition, there is a gain from not having routine call/return overhead for each node. If the compiler has a decent optimizer, then there will be a lot of savings from not having a lot of register save/restore code. Further, it's much easier to debug a state machine, particularly when so much of the search information is available from an explicit stack and is not hidden away in activation records.

bob · Post by **bob** » Mon Oct 26, 2015 12:02 am

Joost Buijs wrote:
bob wrote: Why? This should be a simple cache hit, until a thread sets this flag which would invalidate all other caches, they refresh the line (from the cache where it was modified) and then abort. I check this flag after any call to search or quiesce. I can't even find this line of code mentioned in any vtune analysis output... And I check it at LEAST once per node, more if there are things like LMR re-searches, or such.

The 6.5 NPS speedup sounds like trouble on 20-24 and beyond machines. I am currently hitting 19.5x on a 20 core box and I am trying to get that last .5x presently.
I don't know why it takes so long to respond to the abort flag, I check the flag after every do-move and after a return from search at each node.
I'm talking about microseconds here, but when it happens very often it is a measurable loss.

Unfortunately I don't have a 20 or 24 core machine available since I retired from the univ.
I tested a few times on a 20 core machine and the NPS speedup that I saw was in the region of 13.5 to 15, so a lot worse than your 19.5

First thing, you should NOT be stopping lots of times. That suggests poor split point choices or poor move ordering which makes one move down without a fail high not very convincing it is an ALL node.

But in any case, I was talking about the overhead to check the flag. You should not be able to even measure that...

bob wrote: I assume you are then using copy/make so that you get everything back to the initial setting by just returning to the root "stack" information? Copy/Make was too slow for me. I've converted back to it a couple of times to test but it is always worse speed-wise. But with the almost zero cost of the abort flag, a series of quick returns gets me out of things almost instantly.
No, I don't use copy/make, the only thing I have to do is to make an extra copy of the position for the master thread, just like I have to do for the slaves, this hardly costs extra time.
Copy/make I've never tried, I backup some things that are time consuming to restore like hash keys and such, all the other stuff is calculated.

One key point is to throw intuition out the window here. measurement is the key. IE my choice of Intel's vtune software which is not exactly cheap (we bought a two floating license vtune from intel for about $8,000. Several wanted a copy and everybody said "may as well buy one for Bob since he will be playing with this all the time, leaving another license to float between the rest of us."

MikeB · Post by **MikeB** » Mon Oct 26, 2015 4:34 am

bob wrote: One key point is to throw intuition out the window here. measurement is the key. IE my choice of Intel's vtune software which is not exactly cheap (we bought a two floating license vtune from intel for about $8,000. Several wanted a copy and everybody said "may as well buy one for Bob since he will be playing with this all the time, leaving another license to float between the rest of us."

LOL +1

bob · Post by **bob** » Mon Oct 26, 2015 4:48 am

sje wrote:
Joost Buijs wrote:In a new version of the engine I'm working on I jump right back with a long-jump to the point where I called the search when the abort flag is set, maybe this will help a little.
I suggest not using setjmp()/longjmp() in any context where it can be avoided. Its usage is a magnet for memory leakage, and this alone is enough condemnation. For a traditional chess search there are two good alternatives:

1) Use a flag to indicate a search stop, and test this flag upon returning from a node to toss the results. The most recently generated PV/score is set as the ultimate search result with everything calculated after the flag is set is thrown away. This is what I used to do in my search code.

2) Re-write the recursive search as a non-recursive state machine with the formerly local variables explicitly stored on a ply-indexed stack. At each state transition, the stop/abort flag can be checked and the next state will be set to the search termination state. Again, the most recently generated PV/score is set as the ultimate search result. This is what all my recent code does.

The second approach is better, although it takes some forethought to specify the state/transition map. There could be up to two or three dozen states needed with at least a pair for each implementation of what would be a single recursive call in a non-state routine. While there is some compute time needed to run the big fat switch statement on each transition, there is a gain from not having routine call/return overhead for each node. If the compiler has a decent optimizer, then there will be a lot of savings from not having a lot of register save/restore code. Further, it's much easier to debug a state machine, particularly when so much of the search information is available from an explicit stack and is not hidden away in activation records.

Having done this both ways, both have good and bad. Recursive has a nice clean implementation that is easy to understand. But it causes problems for a parallel search. IE you are at ply=N and have not satisfied YBWC, while at ply-1, ply-3, etc you have. So you can't split.

non-recursive is nice in that you can see the entire search state space no matter which ply you are at, so you can split wherever you want (IE DTS) and never have to wait to do a split/join. But the code is definitely messier and is likely going to use the dreaded goto...

I've been debating returning to an iterative type (loop) search as done in CB, because it offers significantly better options for parallel search. But the negamax search is REALLY nice and clean...

I despise setjmp()/longjmp()...

Joost Buijs · Post by **Joost Buijs** » Mon Oct 26, 2015 6:41 am

sje wrote: I suggest not using setjmp()/longjmp() in any context where it can be avoided. Its usage is a magnet for memory leakage, and this alone is enough condemnation.

It is not one of my favorites either, it is just something I am trying.
Unwinding the stack each time especially when you're searching very deeply takes unnecessary time.
Memory leakage can always be solved except when it resides somewhere in the libraries.

sje wrote: 1) Use a flag to indicate a search stop, and test this flag upon returning from a node to toss the results. The most recently generated PV/score is set as the ultimate search result with everything calculated after the flag is set is thrown away. This is what I used to do in my search code.

This is exactly what I'm doing in my current engine.

sje · Post by **sje** » Mon Oct 26, 2015 6:44 am

bob wrote:non-recursive is nice in that you can see the entire search state space no matter which ply you are at, so you can split wherever you want (IE DTS) and never have to wait to do a split/join. But the code is definitely messier and is likely going to use the dreaded goto...

No need for a goto; just a loooong switch statement with one case body per state.

A look at Stockfish shows a routine with a big switch state machine is used for supplying moves, one per call. But the companion search node processor routine is written using old fashioned recursion. I'd bet that the search in Stockfish would be much easier to understand and to modify if it were converted into a state machine.

Symbolic also uses a state machine for its move supplier routine. This routine is a method of the A/B search class, just like the node processor state routine. The two routines work together in part by sharing two instance variables: the ply, and a pointer to the current ply-indexed record element. (Dereferencing is faster than indexing as no extra multiplication mapping is required.)

It all works. And it's easy to code new states or move around existing states.

Joost Buijs · Post by **Joost Buijs** » Mon Oct 26, 2015 6:54 am

bob wrote: First thing, you should NOT be stopping lots of times. That suggests poor split point choices or poor move ordering which makes one move down without a fail high not very convincing it is an ALL node.

But in any case, I was talking about the overhead to check the flag. You should not be able to even measure that...

My move ordering is reasonable but could be better.
I split everywhere, I have no special rules for splitting except a minimum depth, this can also be done better, I know.
It's not the overhead of checking the flag, but unwinding the stack when a cutoff occurs that takes time.

The last time I measured it, I found an average gain of 176 Elo going from 1 to 8 cores, lets say that I am more or less satisfied with it.

bob wrote: One key point is to throw intuition out the window here. measurement is the key. IE my choice of Intel's vtune software which is not exactly cheap (we bought a two floating license vtune from intel for about $8,000. Several wanted a copy and everybody said "may as well buy one for Bob since he will be playing with this all the time, leaving another license to float between the rest of us."

I know what you mean, most people do it this way.

bob · Post by **bob** » Mon Oct 26, 2015 5:48 pm

sje wrote:
bob wrote:non-recursive is nice in that you can see the entire search state space no matter which ply you are at, so you can split wherever you want (IE DTS) and never have to wait to do a split/join. But the code is definitely messier and is likely going to use the dreaded goto...
No need for a goto; just a loooong switch statement with one case body per state.

A look at Stockfish shows a routine with a big switch state machine is used for supplying moves, one per call. But the companion search node processor routine is written using old fashioned recursion. I'd bet that the search in Stockfish would be much easier to understand and to modify if it were converted into a state machine.

Symbolic also uses a state machine for its move supplier routine. This routine is a method of the A/B search class, just like the node processor state routine. The two routines work together in part by sharing two instance variables: the ply, and a pointer to the current ply-indexed record element. (Dereferencing is faster than indexing as no extra multiplication mapping is required.)

It all works. And it's easy to code new states or move around existing states.

My NextMove() function has always been a big switch. Idea came from Cray Blitz in fact. But there, there is no duplicated code. For the search, you either use goto's here and there or you duplicate code. I consider goto's to be the lesser of two evils there, as duplicating code causes so many "oops, forgot to update this in all 10 different places..."

sje · Post by **sje** » Mon Oct 26, 2015 7:55 pm

bob wrote:My NextMove() function has always been a big switch. Idea came from Cray Blitz in fact. But there, there is no duplicated code. For the search, you either use goto's here and there or you duplicate code. I consider goto's to be the lesser of two evils there, as duplicating code causes so many "oops, forgot to update this in all 10 different places..."

I'm not sure I understand about any need for duplicated code in absence of goto statements. In Symbolic's current TABS (Traditional A/B Search), there are eigteen states, none of which duplicate code.

Code: Select all

// Tabs Search State

typedef enum
{
  TssNil = -1,
  TssExit,
  TssIterBegin,
  TssIterEnd,
  TssMoveExecute,
  TssMoveRetract,
  TssNodeBegin,
  TssNodeEnd,
  TssPick,
  TssPickGainTest,
  TssPruning0,
  TssPruning1,
  TssScanBegin,  // Move scan start
  TssScanEnd,    // Move scan finish
  TssSeqBegin,   // Sequence of iterations start
  TssSeqEnd,     // Sequence of iterations finish
  TssTestWindow,
  TssTrackSelect,
  TssTranProbe
} Tss;

Of course, there are no goto statements or continue statements in the entire program. Nor are there any break statements other than case body terminators.

sje · Post by **sje** » Mon Oct 26, 2015 7:58 pm

And the state list for the move picker:

Code: Select all

// Move Picker State

typedef enum
{
  MpsNil = -1,
  MpsBaseInit,
  MpsBaseNext,
  MpsFullInit,
  MpsFullPV,
  MpsFullTran,
  MpsFullSS,
  MpsFullBS,
  MpsFullGainer0,
  MpsFullK0,
  MpsFullK1,
  MpsFullKA,
  MpsFullKC,
  MpsFullHolder,
  MpsFullGainer1,
  MpsGainInit,
  MpsGainPV,
  MpsGainSS,
  MpsGainBS,
  MpsExit
} Mps;

Lazy SMP Better than YBWC?

setjmp()/longjmp() = EVIL

Re: Lazy SMP Better than YBWC?

Re: Lazy SMP Better than YBWC?

Re: setjmp()/longjmp() = EVIL

Re: setjmp()/longjmp() = EVIL

Re: setjmp()/longjmp() = EVIL

Re: Lazy SMP Better than YBWC?

Re: setjmp()/longjmp() = EVIL

State machines

Re: State machines