More uses for magics?

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Aug 08, 2007 11:53 am

Jacob wrote: Another idea I had was a way to generate pawn moves without separate cases for black and white. The basic principle is simple enough; for white pawns, the pawn bitboard is shifted and compared to the occupancy. For black pawns, the occupancy is shifted by the same amount and compared to the pawn bitboard. I used a 17 element array to store my piece bitboards, which allows me to use some simple algebra to generate the pawn moves.

Hi Jacob,

nice idea to handle the white/black pawn cases. You safe separate biscan traversal loops for black pawns - which is of course nice. But there is a price though. Your loop bodies became much more expensive.

Code: Select all

   while (moves) {
      sqr[side] = PopLSB(&moves);
      sqr[xside] = sqr[side] - 8;
      AddMove(sqr[0], sqr[1], pos[sqr[0]]);
   }

versus loop hoisting with much cheaper loop bodies

Code: Select all

   for (wpmoves = ...; wpmoves; wpmoves &= wpmoves-1) {
      to = bitScan((wpmoves);
      fr = to - 8;
      AddMove(fr, to, whitePawn);
   }
   for (bpmoves = ...; bpmoves; bpmoves &= bpmoves-1) {
      to = bitScan(bpmoves);
      fr = to + 8;
      AddMove(fr, to, blackpawn);
   }

Inside the loop body you index sq[2] by side and xside to use sqr[0], sqr[1] in AddMove.
That requires, despite the xside == side ^ 1 variable which takes one register, 2*2 "expensive" memory read/writes with four agu-calculations, much longer code inside the "most critical" inner loop - while disjoint white/black pawn handling has all inside scratch registers.

Code: Select all

   test rdx, rdx        ; set to traverse
   jz   over
loopNext:
   bsf  rax, rdx        ; to
   lea  ecx, [eax-8]    ; from
   ; eg AddMove 16-bit moves
   shl eax, 6             
   lea eax, ...           ; add fr,to,whitepawn together
   mov WORD PTR [r10], ax ; move 2 list
   add r10d, 2            ; next 16-bit index

   lea  rax, [rdx-1]
   and  rdx, rax
   jnz  loopNext
over:

You may post the generated assembly of your loop body for comparison. Another minor point is that disjoint loops for black/white are likely better to predict conditional branches, as long you have enough entries in your branch target buffer. I appreciate central common code for better maintabilty as well - one should be aware of the costs. Actually i am attired by the approach to flip sides at each move - and to therefor to have only white moves with only black capture targets

Cheers,
Gerd

Aleks Peshkov · Post by **Aleks Peshkov** » Wed Aug 08, 2007 3:21 pm

Gerd Isenberg wrote:Actually i am attired by the approach to flip sides at each move - and to therefor to have only white moves with only black capture targets

Oops, rotated bitboards reincarnation.

C++ templates can help to avoid source code duplication in cases:
1) white/black pieces moves;
2) not in check/quiesearch/check evasion move generators;
3) root/principal variation/zero-window variants of alpha-beta search.

Just a point in C++ vs C war.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Aug 08, 2007 4:04 pm

Aleks Peshkov wrote:
Gerd Isenberg wrote:Actually i am attired by the approach to flip sides at each move - and to therefor to have only white moves with only black capture targets
Oops, rotated bitboards reincarnation.

Hehe - yes I flip my quad-bitboard, hashkey etc. by some bswaps during make/unmake.
I always have white to move now - which makes some stuff simpler - or at least avoids multiple template incarnations. I use

Code: Select all

zobrist(blackPiece, sq) == bswap (zobrist(whitePiece, sq^56))

with the drawback that hashkeys of symmetrical (sub)positions are effectiv 32 bit instead of 64, since
byte == byte[i^7] after xoring them

Code: Select all

   {SS,   TT,   UU,   VV,   WW,   XX,   YY,   ZZ   }
 ^ {   ZZ,   YY,   XX,   WW,   VV,   UU,   TT,   SS} 

== {SS^ZZ,TT^YY,UU^XX,VV^WW,VV^WW,UU^XX,TT^YY,SS^ZZ}

Aleks Peshkov wrote: C++ templates can help to avoid source code duplication in cases:
1) white/black pieces moves;
2) not in check/quiesearch/check evasion move generators;
3) root/principal variation/zero-window variants of alpha-beta search.

Just a point in C++ vs C war.

Yes, even consts in inlines are likely to get optimized away...

Jacob · Post by **Jacob** » Fri Aug 10, 2007 1:33 am

Thanks for the responses : ).

Here's the assembly (I think >_<, I'm not used to eclipse, or assembly for that matter)

Code: Select all

0x0000000000403d1b <GenNonCaptures+2690>: jmp    0x403d7e <GenNonCaptures+2789>
0x0000000000403d1d <GenNonCaptures+2692>: mov    2980101(%rip),%ebx        # 0x6db628 <side>
0x0000000000403d23 <GenNonCaptures+2698>: lea    0xffffffffffffffd8(%rbp),%rdi
0x0000000000403d27 <GenNonCaptures+2702>: callq  0x4010e0 <PopLSB>
0x0000000000403d2c <GenNonCaptures+2707>: mov    %eax,%edx
0x0000000000403d2e <GenNonCaptures+2709>: movslq %ebx,%rax
0x0000000000403d31 <GenNonCaptures+2712>: mov    %edx,0xffffffffffffffd0(%rbp,%rax,4)
0x0000000000403d35 <GenNonCaptures+2716>: mov    2980081(%rip),%ecx        # 0x6db62c <xside>
0x0000000000403d3b <GenNonCaptures+2722>: mov    2980071(%rip),%eax        # 0x6db628 <side>
0x0000000000403d41 <GenNonCaptures+2728>: cltq   
0x0000000000403d43 <GenNonCaptures+2730>: mov    0xffffffffffffffd0(%rbp,%rax,4),%eax
0x0000000000403d47 <GenNonCaptures+2734>: lea    0xfffffffffffffff8(%rax),%edx
0x0000000000403d4a <GenNonCaptures+2737>: movslq %ecx,%rax
0x0000000000403d4d <GenNonCaptures+2740>: mov    %edx,0xffffffffffffffd0(%rbp,%rax,4)
0x0000000000403d51 <GenNonCaptures+2744>: mov    0xffffffffffffffd0(%rbp),%edx
0x0000000000403d54 <GenNonCaptures+2747>: mov    0xffffffffffffffd4(%rbp),%eax
0x0000000000403d57 <GenNonCaptures+2750>: shl    $0x6,%eax
0x0000000000403d5a <GenNonCaptures+2753>: mov    %edx,%ecx
0x0000000000403d5c <GenNonCaptures+2755>: or     %eax,%ecx
0x0000000000403d5e <GenNonCaptures+2757>: mov    14456(%rip),%edx        # 0x4075dc <PAWN>
0x0000000000403d64 <GenNonCaptures+2763>: mov    2980030(%rip),%eax        # 0x6db628 <side>
0x0000000000403d6a <GenNonCaptures+2769>: or     %edx,%eax
0x0000000000403d6c <GenNonCaptures+2771>: shl    $0xc,%eax
0x0000000000403d6f <GenNonCaptures+2774>: or     %ecx,%eax
0x0000000000403d71 <GenNonCaptures+2776>: mov    %eax,%edx
0x0000000000403d73 <GenNonCaptures+2778>: mov    0xffffffffffffffc8(%rbp),%rax
0x0000000000403d77 <GenNonCaptures+2782>: mov    %edx,(%rax)
0x0000000000403d79 <GenNonCaptures+2784>: addq   $0x4,0xffffffffffffffc8(%rbp)
0x0000000000403d7e <GenNonCaptures+2789>: mov    0xffffffffffffffd8(%rbp),%rax

Code: Select all

while (moves) {
		sqr[side] = PopLSB(&moves);
		sqr[xside] = sqr[side] - 8;
		AddMove(sqr[0], sqr[1], PAWN|side);
	}

Code: Select all

#define AddMove(from, to, pc) \
	*list++ = ((from) | ((to) << 6) | ((pc) << 12));

I might just keep my old move generator that has disjoint loops,... At the very least, I won't have to rewrite my engine to use the different data structures :-/. It was fun playing with the idea, though.

C++ templates can help to avoid source code duplication in cases:
1) white/black pieces moves;

I'm kind of new to C/C++, and my engine is written in C++, Do you have an example of how you might use a template for this?

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Aug 10, 2007 1:41 pm

Jacob wrote: I might just keep my old move generator that has disjoint loops,... At the very least, I won't have to rewrite my engine to use the different data structures :-/. It was fun playing with the idea, though.

C++ templates can help to avoid source code duplication in cases:
1) white/black pieces moves;
I'm kind of new to C/C++, and my engine is written in C++, Do you have an example of how you might use a template for this?

With integer-templates someting like this is possible for "side" as compile time parameter.

Code: Select all

template <int side> 
void genPawnPushs(u64 &moves)
{
   int pc = PAWN|side;
   while (moves) 
   {
      u32 sqr[2];
      sqr[side]   = PopLSB(moves);
      sqr[side^1] = sqr[side] - 8;
      AddMove(sqr[0], sqr[1], pc);    
   }
}

Since "side" is compile time invariant - either 0 or 1 for each incarnation - the optimizer can make it disappear.
The drawback is with a side-variable, you can only use it with a conditional branch ...

Code: Select all

if ( side == white )
   genPawnPushs<white>(...);
else
   genPawnPushs<black>(...);

... Which might be nice, if good predictable (likely) - because the templates pay off in smaller code of the loop-bodies.
There are other, probably better solutions for your purpose - to hoist the side-variant outside the loop body without introducing too much register pressure and memory accesses.

Your approach is to traverse different set-semantics, to get rid of the side-condition. You traverse either a set of from-squares or to-squares - and then you do a conditional swap of the from/to-squares via memory inside the loop body - which seems quite expensive according to your posted assembly. Ok, expensive is relative here - the memory is L1 and processor (at least AMD K8, K10) will apply store/load forwarding. Anyway, there are more and longer instructions.

You may try the conditional swap via registers instead of memory ...

Code: Select all

void genPawnPushs(u64 moves, int side /* {0,1} */)
{
   int pc = PAWN|side;
   side = -sise; // {0,-1}
   while (moves) 
   {
      from = PopLSB(&moves);
      to   =  from - 8;
      swpm = (from ^ to) & side;   // 0 if side == 0, otherwise from ^ to
      from =  from ^ swpm;         // becomes "to" if side == 1
      to   =  to   ^ swpm;         // becomes "from" if side == 1
      AddMove(from, to, pc);    
   }
}

... but that also looks not that cheap either - four register ops inside the loop body. I suggest the "usual" and imho more consistant semantic of traversing target-sets only - and to calculate from-to delta (eg. +-8) by side as a loop invariant.

Code: Select all

void genPawnPushs(u64 moves, int side /* {0,1} */)
{
   int delta = side ? 8 : -8; // hopefully a branchless cmove
   int pc = PAWN|side;
   while (moves) 
   {
      from = PopLSB(&moves);
      to   =  from + delta;
      AddMove(from, to, pc);    
   }
}

or if the compiler doesn't like cmove

Code: Select all

void genPawnPushs(u64 moves, int side /* {0,1} */)
{
   int delta = side * 16 - 8; // (side << 4) - 8 or even delta = pawnPushDelta[side]
   int pc = PAWN|side;
   while (moves) 
   {
      from = PopLSB(&moves);
      to   =  from + delta;
      AddMove(from, to, pc);    
   }
}

Some further remarks to your assembly - Gnu ATA-Syntax is a bit hard to read for me

You should inline PopLSB - and I also suggest to use a pure bitscanforward and to reset the found bit explicitly inside the loop by
moves &= moves - 1. This usually produces better and faster code, more parallel computation, and don't takes redundant tests for the continue-condition at the end of the loop. It also helps some compilers to keep the traversed set inside a register - but it is a matter of taste, and also the optimization-ability of the compiler. Of course it only works with bitscanforward but not with bitscanreverse that way.

Code: Select all

#ifdef ...
inline int bitscanforward(u64 moves)
{
     long idx;
     _BitScanForward64(&idx, moves);
     return (int) idx;
}
#else
...
#endif

void genPawnPushs(u64 moves, int side /* {0,1} */)
{
   int delta = side * 16 - 8; // (side << 4) - 8
   int pc = PAWN|side;
   while (moves) 
   {
      from = bitscanforward(moves);
      AddMove(from, from + delta, pc);
      moves &= moves - 1;    
   }
}

Cheers,
Gerd

More uses for magics?

Re: Generalized bitboard pawn moves

Re: Generalized bitboard pawn moves

Re: Generalized bitboard pawn moves

Re: Generalized bitboard pawn moves

Re: Generalized bitboard pawn moves