DTS Structure

Onno Garms · Post by **Onno Garms** » Sat May 29, 2010 12:27 am

Edmund wrote:A main difference moving to the Dynamic Tree Splitting Algorithm is that the whole search has to be laid out iterative instead of recursive. Now my question is, how can this be achieved best?

Iterative search is also useful for YBW. Also, it helps profiling because you don't have cycles in the call tree.

I personally don't like the macro based solutions for iterative search from other programs.

EDIT: because I dislike macros in general and try to minimize their use. I also dislike a huge case switch. The latter was fine for qsearch, but as a case switch cannot jump into the middle of loops, it requires ugly adaptions of the code.

I thought quite a long time, how to write a readable iterative search. I found a macro free solution. The price are - gotos. I hadn't used goto since my youth with C64 BASIC (not counting jumping out of nested loops) but for an iterative search I found gotos really useful.

Search looks like this:

Code: Select all


namespace Label
&#123;
  enum _ &#123;after_scout, after_research, ...&#125;;
&#125;

class Node
&#123;
  int alpha, beta;
  Move move;
  MoveList moves;
  int value;
  Label&#58;&#58;_ ret_addr;
&#125;;  

search ()
&#123;
  Node* node = &d_nodes&#91;0&#93;;
  Node* child = node+1;

recurse&#58;
  node->moves.init&#40;);
  while (&#40;node->move = node->moves.next&#40;)))
  &#123;
    child->alpha = ...
    child->beta = ...
    child->ret_addr = Label&#58;&#58;after_scout;
    ++node;
    ++child;
    goto recurse;
    // when we return here, node and child have their original values
    // and child is evaluated    
after_scout&#58;
    if &#40;child->value < -node->alpha&#41;
    &#123;
      child->alpha = ...
      child->beta = ...
      child->ret_addr = Label&#58;&#58;after_research;
      ++node
      ++child;
      goto recurse;
    after_research&#58;
      ...
    &#125;
    if &#40;cutoff&#41;
    &#123;
      node->value = ...
      goto node_done;
    &#125;
  &#125;
node_done&#58;
  --node;
  --child;
  switch &#40;node->ret_addr&#41;
  &#123;
    case Label after_research&#58;
      goto after_research;
    ...
  &#125;
&#125;

Using gcc enhancements of C++ or using inline assembly in msvc, it is possible to store the address of the lables rather then the enum. This saves the switch at the end, one can directly jump to the label. However I found the inline assembly based solution in msvc slower then with the case switch. Also I could not find a way to port to 64 bit in msvc. So I decided to stay in the C++ standard and have the case switch.

Onno Garms · Post by **Onno Garms** » Sat May 29, 2010 12:49 am

Code: Select all

switch &#40;child->ret_addr&#41;

Edmund · Post by **Edmund** » Sat May 29, 2010 1:05 am

Onno Garms wrote:...
Using gcc enhancements of C++ or using inline assembly in msvc, it is possible to store the address of the lables rather then the enum. This saves the switch at the end, one can directly jump to the label. However I found the inline assembly based solution in msvc slower then with the case switch. Also I could not find a way to port to 64 bit in msvc. So I decided to stay in the C++ standard and have the case switch.

Thats about the algorithm I was thinking of too. I just wonder how the inline assembly should be slower than the additional indirection.

I was thinking of something like:

first a jumptable is initialized holding the addresses of all labels

Code: Select all

	// init Jumptable

	int jmptable&#91;64&#93;;

	#define ADD_JUMPER&#40;jmp&#41; 		 \
		mov edx, DWORD PTR jmp		\
		mov &#91;esi&#93;, edx				  \
		add esi, 4
	
	__asm
	&#123;
		lea esi, jmptable			;Load address of the jmptable into esi
		
		ADD_JUMPER&#40;PV&#41;
		ADD_JUMPER&#40;ZW&#41;
		ADD_JUMPER&#40;QS&#41;
&#91;...&#93;
	&#125;
	
	#undef ADD_JUMPER

&#91;...&#93;

then to jump to a specific label I would have used a macro with an inlined assembly statement.

Dann Corbit · Post by **Dann Corbit** » Sat May 29, 2010 2:17 am

Edmund wrote:
Onno Garms wrote:...
Using gcc enhancements of C++ or using inline assembly in msvc, it is possible to store the address of the lables rather then the enum. This saves the switch at the end, one can directly jump to the label. However I found the inline assembly based solution in msvc slower then with the case switch. Also I could not find a way to port to 64 bit in msvc. So I decided to stay in the C++ standard and have the case switch.
Thats about the algorithm I was thinking of too. I just wonder how the inline assembly should be slower than the additional indirection.

I was thinking of something like:

first a jumptable is initialized holding the addresses of all labels
Code: Select all
	// init Jumptable

	int jmptable&#91;64&#93;;

	#define ADD_JUMPER&#40;jmp&#41; 		 \
		mov edx, DWORD PTR jmp		\
		mov &#91;esi&#93;, edx				  \
		add esi, 4
	
	__asm
	&#123;
		lea esi, jmptable			;Load address of the jmptable into esi
		
		ADD_JUMPER&#40;PV&#41;
		ADD_JUMPER&#40;ZW&#41;
		ADD_JUMPER&#40;QS&#41;
&#91;...&#93;
	&#125;
	
	#undef ADD_JUMPER

&#91;...&#93;
then to jump to a specific label I would have used a macro with an inlined assembly statement.

I sometimes use array(s) of function pointers.

E.g. something like this:
genmoves[piecetype][color](rank, file);

where genmoves is a 6x2 array of function pointers.

It removes all the goto branches (if you don't count calls and returns).
That approach used to be a *lot* faster than a switch if it were in a critical path, but lately the difference has pretty well vanished.

Here is a benchmark program to test your compiler for pointer, array, and switch invocation (a profiler will tell you which approach is fastest):

Code: Select all

#include <math.h>
#include <stdio.h>

typedef double  (*f_t&#41; &#40;double&#41;;
static f_t      f&#91;&#93; = &#123;log, log10, sqrt, cos, cosh, exp, sin, sinh, tan, tanh, 0&#125;;

static double   accum0 = 0;
static double   accum1 = 0;
static double   accum2 = 0;


void            arr&#40;void&#41;
&#123;
    int             i;
    double          d = 0;
    for &#40;i = 0; f&#91;i&#93;; i++) &#123;
        d += f&#91;i&#93; &#40;0.5&#41;;
    &#125;
    accum0 += d;
&#125;

void            poi&#40;void&#41;
&#123;
    f_t            *flist = f;
    double          d = 0;
    while (*flist&#41; &#123;
        f_t             ff = *flist;
        d += ff&#40;0.5&#41;;
        flist++;
    &#125;
    accum1 += d;
&#125;

void            swi&#40;void&#41;
&#123;
    int             i;
    double          d = 0;
    for &#40;i = 0; f&#91;i&#93;; i++) &#123;
        switch &#40;i&#41; &#123;
        case 0&#58;
            d += f&#91;0&#93; &#40;0.5&#41;;
            break;
        case 1&#58;
            d += f&#91;1&#93; &#40;0.5&#41;;
            break;
        case 2&#58;
            d += f&#91;2&#93; &#40;0.5&#41;;
            break;
        case 3&#58;
            d += f&#91;3&#93; &#40;0.5&#41;;
            break;
        case 4&#58;
            d += f&#91;4&#93; &#40;0.5&#41;;
            break;
        case 5&#58;
            d += f&#91;5&#93; &#40;0.5&#41;;
            break;
        case 6&#58;
            d += f&#91;6&#93; &#40;0.5&#41;;
            break;
        case 7&#58;
            d += f&#91;7&#93; &#40;0.5&#41;;
            break;
        case 8&#58;
            d += f&#91;8&#93; &#40;0.5&#41;;
            break;
        case 9&#58;
            d += f&#91;9&#93; &#40;0.5&#41;;
            break;
        default&#58;
            break;
        &#125;
    &#125;
    accum2 += d;
&#125;

int             main&#40;void&#41;
&#123;
    long            i;
    for &#40;i = 0; i < 1000000; i++) &#123;
        arr&#40;);
        poi&#40;);
        swi&#40;);
    &#125;
    printf&#40;"%.20g, %.20g, %.20g\n", accum0, accum1, accum2&#41;;
    return 0;
&#125;

Zach Wegner · Post by **Zach Wegner** » Sat May 29, 2010 3:49 am

Onno Garms wrote:
Edmund wrote:A main difference moving to the Dynamic Tree Splitting Algorithm is that the whole search has to be laid out iterative instead of recursive. Now my question is, how can this be achieved best?
Iterative search is also useful for YBW. Also, it helps profiling because you don't have cycles in the call tree.

I personally don't like the macro based solutions for iterative search from other programs.

EDIT: because I dislike macros in general and try to minimize their use. I also dislike a huge case switch. The latter was fine for qsearch, but as a case switch cannot jump into the middle of loops, it requires ugly adaptions of the code.

I thought quite a long time, how to write a readable iterative search. I found a macro free solution. The price are - gotos. I hadn't used goto since my youth with C64 BASIC (not counting jumping out of nested loops) but for an iterative search I found gotos really useful.

Search looks like this:
Code: Select all
namespace Label
&#123;
  enum _ &#123;after_scout, after_research, ...&#125;;
&#125;

class Node
&#123;
  int alpha, beta;
  Move move;
  MoveList moves;
  int value;
  Label&#58;&#58;_ ret_addr;
&#125;;  

search ()
&#123;
  Node* node = &d_nodes&#91;0&#93;;
  Node* child = node+1;

recurse&#58;
  node->moves.init&#40;);
  while (&#40;node->move = node->moves.next&#40;)))
  &#123;
    child->alpha = ...
    child->beta = ...
    child->ret_addr = Label&#58;&#58;after_scout;
    ++node;
    ++child;
    goto recurse;
    // when we return here, node and child have their original values
    // and child is evaluated    
after_scout&#58;
    if &#40;child->value < -node->alpha&#41;
    &#123;
      child->alpha = ...
      child->beta = ...
      child->ret_addr = Label&#58;&#58;after_research;
      ++node
      ++child;
      goto recurse;
    after_research&#58;
      ...
    &#125;
    if &#40;cutoff&#41;
    &#123;
      node->value = ...
      goto node_done;
    &#125;
  &#125;
node_done&#58;
  --node;
  --child;
  switch &#40;node->ret_addr&#41;
  &#123;
    case Label after_research&#58;
      goto after_research;
    ...
  &#125;
&#125;
Using gcc enhancements of C++ or using inline assembly in msvc, it is possible to store the address of the lables rather then the enum. This saves the switch at the end, one can directly jump to the label. However I found the inline assembly based solution in msvc slower then with the case switch. Also I could not find a way to port to 64 bit in msvc. So I decided to stay in the C++ standard and have the case switch.

That's quite close to the way it used to be in ZCT (apparently before the first public release). I had a big switch at the top, each statement just a goto (or a return in one case). Having it at the top is also marginally more flexible, you can call qsearch directly, and for when children join a node at a split point, they can go straight into the move loop.

Frankly going that far to avoid macros just makes the code ugly. At the very least you can wrap all the child-> stuff into a function, but you still have a goto and a label sitting around.

And of course switches can jump into the middle of a loop. Why couldn't they?

CThinker · Post by **CThinker** » Sat May 29, 2010 6:26 am

My understanding is that DTS is really not that different from YBW, except that instead of "waiting", a thread is made available to search other candidate sub-trees. So, the common recursive search can be easily made to do DTS. Crafty, I believe, is a good example of this.

bob · Post by **bob** » Sat May 29, 2010 7:11 am

CThinker wrote:My understanding is that DTS is really not that different from YBW, except that instead of "waiting", a thread is made available to search other candidate sub-trees. So, the common recursive search can be easily made to do DTS. Crafty, I believe, is a good example of this.

Not so easy. The problem is that the DTS algorithm is based on choosing a split point from all available nodes on _all_ other threads. That is, you choose the most beneficial split point. As an example, in Crafty (as in any YBW algorithm) you will split at a node after one node has been searched without causing a fail-high. In Crafty, about 92% of nodes fail high on the first move, if they fail high. This means that if this is a CUT node, 92% of the time we won't split here since we fail high on the first branch. Unfortunately, 8% of the time we will fail high on the second (or later) move, and doing a parallel split there does unnecessary work.

With DTS, in Cray Blitz, I circumvented this to an extent because rather than having to split at the current node of the current thread that has satisfied YBW, I could look at all nodes on all threads, find the ones that had satisfied YBW (or even better, ones that had searched multiple moves which really reduces the odds of doing a split at a CUT node).

The problem is that recursive search means you can split "here" or not at all, because of the call stack. With DTS there is no call stack, so you can split at a node that is 10 plies back in the tree just as easily as you can split "here". That's the main difference. The stuff about processors not waiting is pretty comparable between Crafty and DTS. As are the data structures and such. But the ability to see all nodes for all threads when a CPU goes idle, rather than having non-idle CPUs ask idle ones to "join in" is a significant efficiency enhancement...

Onno Garms · Post by **Onno Garms** » Sat May 29, 2010 9:33 am

bob wrote: The problem is that recursive search means you can split "here" or not at all, because of the call stack.

ACK, but that is a problem for both DTS and YBWC.

With DTS there is no call stack, so you can split at a node that is 10 plies back in the tree just as easily as you can split "here".

You mean "iterative search" rather than "DTS"?

That's the main difference.

Isnt' the main difference that in DTS the child that finishes last at a split point takes over the parent? This means, we move nodes to another thread. For this reason, an iterative search is essential for DTS (because we cannot move callstacks of recursive search). For YBW an iterative search is just a benefit on top (allowing to split anywhere in the tree).

As are the data structures and such. But the ability to see all nodes for all threads when a CPU goes idle, rather than having non-idle CPUs ask idle ones to "join in" is a significant efficiency enhancement...

ACK

Edmund · Post by **Edmund** » Sat May 29, 2010 9:43 am

CThinker wrote:My understanding is that DTS is really not that different from YBW, except that instead of "waiting", a thread is made available to search other candidate sub-trees. So, the common recursive search can be easily made to do DTS. Crafty, I believe, is a good example of this.

Additionally to Bob's statement that DTS is more flexible in the choice of split points I want to add that DTS also has more flexibility regarding node ownership:

YBW:
Slave takes work on splitpoint (SP)
Master returns to SP, only needs information from Slave
Master takes on other task
Slave finishes search
Master finishes other task
Master propagates the values

DTS:
Slave takes work on splitpoint (SP)
Master returns to SP, only needs information from Slave
Master takes on other task
Slave finishes search
Slave becomes the new owner of the node and propagates the values

This difference is critical if the information gained from propagating the values as soon as they were available would have saved some other threads work.

Onno Garms · Post by **Onno Garms** » Sat May 29, 2010 9:48 am

Zach Wegner wrote: That's quite close to the way it used to be in ZCT (apparently before the first public release).

I think it is more similar to what Edmund wants to implement. The only difference to Eduard's approach is that I do not store the addresses of the jump labels but have a case switch at the the end to translate enums into jumps.

DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure