DTS Structure

Edmund · Post by **Edmund** » Fri May 28, 2010 10:29 am

A main difference moving to the Dynamic Tree Splitting Algorithm is that the whole search has to be laid out iterative instead of recursive. Now my question is, how can this be achieved best?

My first approach was similar to how it is done in ZCT: a switch branch inside of an infinite loop. In the switch there are several stages defined and when jumping to a different part (eg. from pv-search to quies-search) the stage where it should return afterwards is stored in the stack, the target stage (beginning of quies) is stored and then with "continue;" the instruction pointer goes to the beginning of the switch branch where it then jumps to the target stage.

After reading an article yesterday I think there is a much easier maybe nicer solution to this. Its more about emulating the behavior of the usual function call algorithm. Instead of the stages we use jump labels. At the beginning of the function a table is filled with the addresses of these labels (I found a solution where this is done in assembly - or in gcc there is even the option to use the &&label operator) Now when a new stage is called, it directly jumps to it, storing the address of the return label on the stack for when it has to return. To conclude, this approach doesn't do the extra work jumping to the beginning of the switch statement every time it wants to enter another section of the program. Additionally all the statements like continue; break; etc. become obsolete.

Are there any other approaches?

liuzy · Post by **liuzy** » Fri May 28, 2010 1:30 pm

Just wait for your result.
DTS is much complicated than YBW, whose performance is also better than YBW, not much. If we only care about the ELO gain, it's hard to measure the difference between DTS and YBW.

Zach Wegner · Post by **Zach Wegner** » Fri May 28, 2010 4:44 pm

Edmund wrote:A main difference moving to the Dynamic Tree Splitting Algorithm is that the whole search has to be laid out iterative instead of recursive. Now my question is, how can this be achieved best?

My first approach was similar to how it is done in ZCT: a switch branch inside of an infinite loop. In the switch there are several stages defined and when jumping to a different part (eg. from pv-search to quies-search) the stage where it should return afterwards is stored in the stack, the target stage (beginning of quies) is stored and then with "continue;" the instruction pointer goes to the beginning of the switch branch where it then jumps to the target stage.

After reading an article yesterday I think there is a much easier maybe nicer solution to this. Its more about emulating the behavior of the usual function call algorithm. Instead of the stages we use jump labels. At the beginning of the function a table is filled with the addresses of these labels (I found a solution where this is done in assembly - or in gcc there is even the option to use the &&label operator) Now when a new stage is called, it directly jumps to it, storing the address of the return label on the stack for when it has to return. To conclude, this approach doesn't do the extra work jumping to the beginning of the switch statement every time it wants to enter another section of the program. Additionally all the statements like continue; break; etc. become obsolete.

Are there any other approaches?

Hello Edmund,

Your solution might work a bit better, but personally I doubt there would be much speed gain. If you look at the assembler output from search.c in ZCT, you'll see it isn't really that bad--at each RETURN, there's an unconditional jump to the top of the function, where it calls search_maintenance(), and from there, uses the search state to index a jump table. So the net win is quite small, you win one unconditional jump, a bounds check on the search state (which should always be not taken of course) and one memory access to a very small table.

Here's what gcc -O3 gives me:

Code: Select all

.L171&#58;
    movq    %r14, %rsi
    movq    %r13, %rdi
    call    search_maintenance
    testl   %eax, %eax
    jne .L174
    movq    32&#40;%rsp&#41;, %rbx
    movl    120&#40;%rbx&#41;, %esi
    cmpl    $12, %esi
    ja  .L7
    mov %esi, %eax
    jmp *.L21&#40;,%rax,8&#41;
    .section    .rodata
    .align 8
    .align 4
.L21&#58;
    .quad   .L8
    .quad   .L9
    .quad   .L10
    .quad   .L157
    .quad   .L12
    .quad   .L15
    .quad   .L14
    .quad   .L15
    .quad   .L16
    .quad   .L17
    .quad   .L18
    .quad   .L19
    .quad   .L20

OTOH it would probably make the code a little easier to read. It would not be as portable, but if you don't care about that it might be worth it.

And there are actually other ways to formulate it. Teemu Pudas was able to do it with real function calls but without being recursive. It's pretty clean, but it would be even slower than ZCT. His method is here: http://www.open-aurec.com/wbforum/viewt ... 76#p187476

Edmund · Post by **Edmund** » Fri May 28, 2010 9:21 pm

liuzy wrote:Just wait for your result.
DTS is much complicated than YBW, whose performance is also better than YBW, not much. If we only care about the ELO gain, it's hard to measure the difference between DTS and YBW.

Where would be the fun? And its not so much more complicated if you think about it.

Edmund · Post by **Edmund** » Fri May 28, 2010 9:27 pm

Zach Wegner wrote:
Hello Edmund,

Your solution might work a bit better, but personally I doubt there would be much speed gain. If you look at the assembler output from search.c in ZCT, you'll see it isn't really that bad--at each RETURN, there's an unconditional jump to the top of the function, where it calls search_maintenance(), and from there, uses the search state to index a jump table. So the net win is quite small, you win one unconditional jump, a bounds check on the search state (which should always be not taken of course) and one memory access to a very small table.
...

first of all thanks for the link Zach. That was a very interesting read. Teemus approach with overloading the usual function statements is quite neat. Its also interesting that already hgm and vincent were working into similar directions with iterative search via gotos.

regarding the comparison with your code, I wouldn't replace the switch statement for the sake of one additional jump. Its rather for the improved readability, which is very important in such structures.

Onno Garms · Post by **Onno Garms** » Sat May 29, 2010 12:27 am

Edmund wrote:A main difference moving to the Dynamic Tree Splitting Algorithm is that the whole search has to be laid out iterative instead of recursive. Now my question is, how can this be achieved best?

Iterative search is also useful for YBW. Also, it helps profiling because you don't have cycles in the call tree.

I personally don't like the macro based solutions for iterative search from other programs.

EDIT: because I dislike macros in general and try to minimize their use. I also dislike a huge case switch. The latter was fine for qsearch, but as a case switch cannot jump into the middle of loops, it requires ugly adaptions of the code.

I thought quite a long time, how to write a readable iterative search. I found a macro free solution. The price are - gotos. I hadn't used goto since my youth with C64 BASIC (not counting jumping out of nested loops) but for an iterative search I found gotos really useful.

Search looks like this:

Code: Select all


namespace Label
&#123;
  enum _ &#123;after_scout, after_research, ...&#125;;
&#125;

class Node
&#123;
  int alpha, beta;
  Move move;
  MoveList moves;
  int value;
  Label&#58;&#58;_ ret_addr;
&#125;;  

search ()
&#123;
  Node* node = &d_nodes&#91;0&#93;;
  Node* child = node+1;

recurse&#58;
  node->moves.init&#40;);
  while (&#40;node->move = node->moves.next&#40;)))
  &#123;
    child->alpha = ...
    child->beta = ...
    child->ret_addr = Label&#58;&#58;after_scout;
    ++node;
    ++child;
    goto recurse;
    // when we return here, node and child have their original values
    // and child is evaluated    
after_scout&#58;
    if &#40;child->value < -node->alpha&#41;
    &#123;
      child->alpha = ...
      child->beta = ...
      child->ret_addr = Label&#58;&#58;after_research;
      ++node
      ++child;
      goto recurse;
    after_research&#58;
      ...
    &#125;
    if &#40;cutoff&#41;
    &#123;
      node->value = ...
      goto node_done;
    &#125;
  &#125;
node_done&#58;
  --node;
  --child;
  switch &#40;node->ret_addr&#41;
  &#123;
    case Label after_research&#58;
      goto after_research;
    ...
  &#125;
&#125;

Using gcc enhancements of C++ or using inline assembly in msvc, it is possible to store the address of the lables rather then the enum. This saves the switch at the end, one can directly jump to the label. However I found the inline assembly based solution in msvc slower then with the case switch. Also I could not find a way to port to 64 bit in msvc. So I decided to stay in the C++ standard and have the case switch.

Onno Garms · Post by **Onno Garms** » Sat May 29, 2010 12:49 am

Code: Select all

switch &#40;child->ret_addr&#41;

Edmund · Post by **Edmund** » Sat May 29, 2010 1:05 am

Onno Garms wrote:...
Using gcc enhancements of C++ or using inline assembly in msvc, it is possible to store the address of the lables rather then the enum. This saves the switch at the end, one can directly jump to the label. However I found the inline assembly based solution in msvc slower then with the case switch. Also I could not find a way to port to 64 bit in msvc. So I decided to stay in the C++ standard and have the case switch.

Thats about the algorithm I was thinking of too. I just wonder how the inline assembly should be slower than the additional indirection.

I was thinking of something like:

first a jumptable is initialized holding the addresses of all labels

Code: Select all

	// init Jumptable

	int jmptable&#91;64&#93;;

	#define ADD_JUMPER&#40;jmp&#41; 		 \
		mov edx, DWORD PTR jmp		\
		mov &#91;esi&#93;, edx				  \
		add esi, 4
	
	__asm
	&#123;
		lea esi, jmptable			;Load address of the jmptable into esi
		
		ADD_JUMPER&#40;PV&#41;
		ADD_JUMPER&#40;ZW&#41;
		ADD_JUMPER&#40;QS&#41;
&#91;...&#93;
	&#125;
	
	#undef ADD_JUMPER

&#91;...&#93;

then to jump to a specific label I would have used a macro with an inlined assembly statement.

Dann Corbit · Post by **Dann Corbit** » Sat May 29, 2010 2:17 am

Edmund wrote:
Onno Garms wrote:...
Using gcc enhancements of C++ or using inline assembly in msvc, it is possible to store the address of the lables rather then the enum. This saves the switch at the end, one can directly jump to the label. However I found the inline assembly based solution in msvc slower then with the case switch. Also I could not find a way to port to 64 bit in msvc. So I decided to stay in the C++ standard and have the case switch.
Thats about the algorithm I was thinking of too. I just wonder how the inline assembly should be slower than the additional indirection.

I was thinking of something like:

first a jumptable is initialized holding the addresses of all labels
Code: Select all
	// init Jumptable

	int jmptable&#91;64&#93;;

	#define ADD_JUMPER&#40;jmp&#41; 		 \
		mov edx, DWORD PTR jmp		\
		mov &#91;esi&#93;, edx				  \
		add esi, 4
	
	__asm
	&#123;
		lea esi, jmptable			;Load address of the jmptable into esi
		
		ADD_JUMPER&#40;PV&#41;
		ADD_JUMPER&#40;ZW&#41;
		ADD_JUMPER&#40;QS&#41;
&#91;...&#93;
	&#125;
	
	#undef ADD_JUMPER

&#91;...&#93;
then to jump to a specific label I would have used a macro with an inlined assembly statement.

I sometimes use array(s) of function pointers.

E.g. something like this:
genmoves[piecetype][color](rank, file);

where genmoves is a 6x2 array of function pointers.

It removes all the goto branches (if you don't count calls and returns).
That approach used to be a *lot* faster than a switch if it were in a critical path, but lately the difference has pretty well vanished.

Here is a benchmark program to test your compiler for pointer, array, and switch invocation (a profiler will tell you which approach is fastest):

Code: Select all

#include <math.h>
#include <stdio.h>

typedef double  (*f_t&#41; &#40;double&#41;;
static f_t      f&#91;&#93; = &#123;log, log10, sqrt, cos, cosh, exp, sin, sinh, tan, tanh, 0&#125;;

static double   accum0 = 0;
static double   accum1 = 0;
static double   accum2 = 0;


void            arr&#40;void&#41;
&#123;
    int             i;
    double          d = 0;
    for &#40;i = 0; f&#91;i&#93;; i++) &#123;
        d += f&#91;i&#93; &#40;0.5&#41;;
    &#125;
    accum0 += d;
&#125;

void            poi&#40;void&#41;
&#123;
    f_t            *flist = f;
    double          d = 0;
    while (*flist&#41; &#123;
        f_t             ff = *flist;
        d += ff&#40;0.5&#41;;
        flist++;
    &#125;
    accum1 += d;
&#125;

void            swi&#40;void&#41;
&#123;
    int             i;
    double          d = 0;
    for &#40;i = 0; f&#91;i&#93;; i++) &#123;
        switch &#40;i&#41; &#123;
        case 0&#58;
            d += f&#91;0&#93; &#40;0.5&#41;;
            break;
        case 1&#58;
            d += f&#91;1&#93; &#40;0.5&#41;;
            break;
        case 2&#58;
            d += f&#91;2&#93; &#40;0.5&#41;;
            break;
        case 3&#58;
            d += f&#91;3&#93; &#40;0.5&#41;;
            break;
        case 4&#58;
            d += f&#91;4&#93; &#40;0.5&#41;;
            break;
        case 5&#58;
            d += f&#91;5&#93; &#40;0.5&#41;;
            break;
        case 6&#58;
            d += f&#91;6&#93; &#40;0.5&#41;;
            break;
        case 7&#58;
            d += f&#91;7&#93; &#40;0.5&#41;;
            break;
        case 8&#58;
            d += f&#91;8&#93; &#40;0.5&#41;;
            break;
        case 9&#58;
            d += f&#91;9&#93; &#40;0.5&#41;;
            break;
        default&#58;
            break;
        &#125;
    &#125;
    accum2 += d;
&#125;

int             main&#40;void&#41;
&#123;
    long            i;
    for &#40;i = 0; i < 1000000; i++) &#123;
        arr&#40;);
        poi&#40;);
        swi&#40;);
    &#125;
    printf&#40;"%.20g, %.20g, %.20g\n", accum0, accum1, accum2&#41;;
    return 0;
&#125;

Zach Wegner · Post by **Zach Wegner** » Sat May 29, 2010 3:49 am

Onno Garms wrote:
Edmund wrote:A main difference moving to the Dynamic Tree Splitting Algorithm is that the whole search has to be laid out iterative instead of recursive. Now my question is, how can this be achieved best?
Iterative search is also useful for YBW. Also, it helps profiling because you don't have cycles in the call tree.

I personally don't like the macro based solutions for iterative search from other programs.

EDIT: because I dislike macros in general and try to minimize their use. I also dislike a huge case switch. The latter was fine for qsearch, but as a case switch cannot jump into the middle of loops, it requires ugly adaptions of the code.

I thought quite a long time, how to write a readable iterative search. I found a macro free solution. The price are - gotos. I hadn't used goto since my youth with C64 BASIC (not counting jumping out of nested loops) but for an iterative search I found gotos really useful.

Search looks like this:
Code: Select all
namespace Label
&#123;
  enum _ &#123;after_scout, after_research, ...&#125;;
&#125;

class Node
&#123;
  int alpha, beta;
  Move move;
  MoveList moves;
  int value;
  Label&#58;&#58;_ ret_addr;
&#125;;  

search ()
&#123;
  Node* node = &d_nodes&#91;0&#93;;
  Node* child = node+1;

recurse&#58;
  node->moves.init&#40;);
  while (&#40;node->move = node->moves.next&#40;)))
  &#123;
    child->alpha = ...
    child->beta = ...
    child->ret_addr = Label&#58;&#58;after_scout;
    ++node;
    ++child;
    goto recurse;
    // when we return here, node and child have their original values
    // and child is evaluated    
after_scout&#58;
    if &#40;child->value < -node->alpha&#41;
    &#123;
      child->alpha = ...
      child->beta = ...
      child->ret_addr = Label&#58;&#58;after_research;
      ++node
      ++child;
      goto recurse;
    after_research&#58;
      ...
    &#125;
    if &#40;cutoff&#41;
    &#123;
      node->value = ...
      goto node_done;
    &#125;
  &#125;
node_done&#58;
  --node;
  --child;
  switch &#40;node->ret_addr&#41;
  &#123;
    case Label after_research&#58;
      goto after_research;
    ...
  &#125;
&#125;
Using gcc enhancements of C++ or using inline assembly in msvc, it is possible to store the address of the lables rather then the enum. This saves the switch at the end, one can directly jump to the label. However I found the inline assembly based solution in msvc slower then with the case switch. Also I could not find a way to port to 64 bit in msvc. So I decided to stay in the C++ standard and have the case switch.

That's quite close to the way it used to be in ZCT (apparently before the first public release). I had a big switch at the top, each statement just a goto (or a return in one case). Having it at the top is also marginally more flexible, you can call qsearch directly, and for when children join a node at a split point, they can go straight into the move loop.

Frankly going that far to avoid macros just makes the code ugly. At the very least you can wrap all the child-> stuff into a function, but you still have a goto and a label sitting around.

And of course switches can jump into the middle of a loop. Why couldn't they?

DTS Structure

DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure

Re: DTS Structure