Branch-poor looping

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Branch-poor looping

Post by hgm »

Something that occurred to me: my mailbox engines often have loops with difficult-to-predict branches in them. Like

Code: Select all

for(direction=0; (step = steps[piece][direction]); direction++) {
  int to = from + step;
  if(board[to] == EMPTY) { // sort of 50-50
    GenerateNonCapture(from, to);
  }
}
One of the advantages of bitboards is that the bit extraction automatically skips the iterations where you don't want to do something. But that principle could be used on arbitrary loops:

Code: Select all

int todo = 0;

for(direction=0; (step = steps[piece][direction]); direction++) {
  todo |= &#40;board&#91;from+step&#93; == EMPTY&#41; << direction;
&#125;

while&#40;todo&#41; &#123;
  unsigned int next = todo & -todo;
  todo -= next;
  direction = bit2nr&#91;next*DEBRUIJN >> 27&#93;;
  GenerateNonCapture&#40;from, from + steps&#91;piece&#93;&#91;direction&#93;);
&#125;
The first loop, instead of using the condition to branch on, just accumulates it in a 'todo' word that then marks the iterations where the condition was satisfied, and the second loop selectively picks out these iterations, and performs the required action.

The lookup in bit2nr[] could be optimized away by supplying a copy of steps[piece][direction] where the elements are ordered in 'DEBRUIJN' order, so that next*DEBRUIJN>>27 could be used to index it directly.

Code: Select all

int todo = 0;

for&#40;direction=0; &#40;step = steps&#91;piece&#93;&#91;direction&#93;); direction++) &#123;
  todo |= &#40;board&#91;from+step&#93; == EMPTY&#41; << direction;
&#125;

while&#40;todo&#41; &#123;
  unsigned int next = todo & -todo;
  todo -= next;
  GenerateNonCapture&#40;from, from + shuffledSteps&#91;piece&#93;&#91;next*DEBRUIJN >> 27&#93;);
&#125;
jwes
Posts: 778
Joined: Sat Jul 01, 2006 7:11 am

Re: Branch-poor looping

Post by jwes »

A related idea:
replace

Code: Select all

if &#40;this && that && the_other&#41;
with

Code: Select all

bool test = this && that && the_other;
if &#40;test&#41;
If this, that, and the_other are quick to compute but hard to predict, it may help.
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Branch-poor looping

Post by stegemma »

hgm wrote:[...]
One of the advantages of bitboards is that the bit extraction automatically skips the iterations where you don't want to do something. But that principle could be used on arbitrary loops:

Code: Select all

int todo = 0;

for&#40;direction=0; &#40;step = steps&#91;piece&#93;&#91;direction&#93;); direction++) &#123;
  todo |= &#40;board&#91;from+step&#93; == EMPTY&#41; << direction;
&#125;

while&#40;todo&#41; &#123;
  unsigned int next = todo & -todo;
  todo -= next;
  direction = bit2nr&#91;next*DEBRUIJN >> 27&#93;;
  GenerateNonCapture&#40;from, from + steps&#91;piece&#93;&#91;direction&#93;);
&#125;
The first loop, instead of using the condition to branch on, just accumulates it in a 'todo' word that then marks the iterations where the condition was satisfied, and the second loop selectively picks out these iterations, and performs the required action.[...]
But the == is in fact a branch, so you just have moved the branch in another loop.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Branch-poor looping

Post by hgm »

stegemma wrote:But the == is in fact a branch, so you just have moved the branch in another loop.
Not in Intel architecture, if the compiler is any good. The i386 instruction set has a SETcc instruction which fills AL with the value of a selected condition code (0 or 1) set by a previous ALU operation. On other occasions (like >= compares) it can exploit setting of the carry CC by the comparison, and then use ADC #0 on a zeroed register to create 0 or 1 depending on the outcome.

There also is a more general conditional move, MOVcc, since Pentium II.

A gcc example:

Code: Select all

main&#40;)
&#123;
extern int a;
int b = &#40;a == 10&#41;;
return b;
&#125;
gives

Code: Select all

        .file   "setcc.c"
        .def    ___main;        .scl    2;      .type   32;     .endef
        .text
        .p2align 4,,15
.globl _main
        .def    _main;  .scl    2;      .type   32;     .endef
_main&#58;
        pushl   %ebp
        movl    $16, %eax
        movl    %esp, %ebp
        subl    $8, %esp
        andl    $-16, %esp
        call    __alloca
        call    ___main
        xorl    %eax, %eax
        cmpl    $10, _a
        leave
        sete    %al
        ret
You see it uses the SETE instruction.
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Branch-poor looping

Post by stegemma »

hgm wrote:[...]

You see it uses the SETE instruction.
Yes, you're right. As for the log2 stuff, it should only be tested and compared but it seems that it can help, as you said.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Branch-poor looping

Post by bob »

jwes wrote:A related idea:
replace

Code: Select all

if &#40;this && that && the_other&#41;
with

Code: Select all

bool test = this && that && the_other;
if &#40;test&#41;
If this, that, and the_other are quick to compute but hard to predict, it may help.
That still has branches. Each "&&" requires either a branch or possibly a cmov instruction...
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Branch-poor looping

Post by hgm »

For this reason I often use | and & instead of || and &&. If the first condition would not be able to terminate the expression most of the time, and the second condition is not enormously more complex, using | and & is often faster. A test like

if(timeoutFlag | inputFlag)

would not require more microOps than

if(timeoutFlag || inputFlag)

because in i368 LOAD doea not automatically set condition codes,and a separate test is needed for that. So the first case compiles to something like

Code: Select all

LOAD
LOAD
OR
BEQ L
...
L&#58;
and the second to

Code: Select all

LOAD
TEST
BNE M
LOAD
TEST
BEQ L
M&#58;
...
L&#58;
So the first case uses 4 instructions, and the second 3 or 6, which even in a 50-50 case would be 4.5 on average.

If the BNE M would be taken most of the time, so that the average would be closer to 3 than to 6, any time it would not be taken would incur a mispredict penalty,which can easily be 12 clock cycles, which would be good for 48 instructions at 4 instructions/clock, which does not really help to keep the average low. Branches that are not there cannot be mispredicted.
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Branch-poor looping

Post by stegemma »

hgm wrote:[...]

Code: Select all

LOAD
LOAD
OR
BEQ L
...
L&#58;
[...]
Or maybe just:

Code: Select all

LOAD register, a
OR register, b
BEQ L
...
L&#58;
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
User avatar
hgm
Posts: 27811
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Branch-poor looping

Post by hgm »

stegemma wrote:Or maybe just:

Code: Select all

LOAD register, a
OR register, b
BEQ L
...
L&#58;
That would be the i386 machine code, but I was talking about micro-Ops. ALU instructions like OR register, b are converted by the instruction decoder into a separate LOAD and OR microOp. (And instructions like INC mem even to a triplet, LOAD, INC, STORE.) Using the memory-alu i386 instructions can reduce the load on the instruction fetcher and instruction cache, but it can also backfire, because only the first of the 4 instruction decoders can decode i386 instructions that result in more than a singleuOp. So if it encounters such an instruction 'out of register' it would have to wait for the next clock cycle to be decoded, some of the auxiliary decoders reaming idle in the current cycle. This could create a decoder bottleneck if it happens too often.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Branch-poor looping

Post by bob »

hgm wrote:For this reason I often use | and & instead of || and &&. If the first condition would not be able to terminate the expression most of the time, and the second condition is not enormously more complex, using | and & is often faster. A test like

if(timeoutFlag | inputFlag)

would not require more microOps than

if(timeoutFlag || inputFlag)

because in i368 LOAD doea not automatically set condition codes,and a separate test is needed for that. So the first case compiles to something like

Code: Select all

LOAD
LOAD
OR
BEQ L
...
L&#58;
and the second to

Code: Select all

LOAD
TEST
BNE M
LOAD
TEST
BEQ L
M&#58;
...
L&#58;
So the first case uses 4 instructions, and the second 3 or 6, which even in a 50-50 case would be 4.5 on average.

If the BNE M would be taken most of the time, so that the average would be closer to 3 than to 6, any time it would not be taken would incur a mispredict penalty,which can easily be 12 clock cycles, which would be good for 48 instructions at 4 instructions/clock, which does not really help to keep the average low. Branches that are not there cannot be mispredicted.
I use | a lot. & is more dangerous

a && b means both not zero

a & b fails if a=1 and b=2. Burned myself there more than once.