Reflection of a bitboard

hgm · Post by **hgm** » Fri Apr 22, 2016 10:06 pm

What is the fastest way to flip a bitboard along a diagonal? The best I could think of is masking with 0x8080808080808080ull>>N eight times (for N=0-7), and then multiplying with 0x0002040810204081ull<<N. And then mask out the upper byte, shift it right by 8*N, and OR everything together.

Is there a faster way?

Karlo Bala · Post by **Karlo Bala** » Fri Apr 22, 2016 10:38 pm

hgm wrote:What is the fastest way to flip a bitboard along a diagonal? The best I could think of is masking with 0x8080808080808080ull>>N eight times (for N=0-7), and then multiplying with 0x0002040810204081ull<<N. And then mask out the upper byte, shift it right by 8*N, and OR everything together.

Is there a faster way?

Version from CPW!?

Code: Select all

/**
 * Flip a bitboard about the diagonal a1-h8.
 * Square h1 is mapped to a8 and vice versa.
 * @param x any bitboard
 * @return bitboard x flipped about diagonal a1-h8
 */
U64 flipDiagA1H8&#40;U64 x&#41; &#123;
   U64 t;
   const U64 k1 = C64&#40;0x5500550055005500&#41;;
   const U64 k2 = C64&#40;0x3333000033330000&#41;;
   const U64 k4 = C64&#40;0x0f0f0f0f00000000&#41;;
   t  = k4 & &#40;x ^ &#40;x << 28&#41;);
   x ^=       t ^ &#40;t >> 28&#41; ;
   t  = k2 & &#40;x ^ &#40;x << 14&#41;);
   x ^=       t ^ &#40;t >> 14&#41; ;
   t  = k1 & &#40;x ^ &#40;x <<  7&#41;);
   x ^=       t ^ &#40;t >>  7&#41; ;
   return x;
&#125;

hgm · Post by **hgm** » Fri Apr 22, 2016 11:06 pm

Wow, that looks clever! Thanks!

stegemma · Post by **stegemma** » Tue Apr 26, 2016 12:18 am

Another possible solution (maybe less efficient) could be to use 8 arrays of 256 reverted bytes (stored as 64 bit values) for any row:

Code: Select all


U64 k = row1&#91;x & 0xff&#93;
         | row2&#91;&#40;x>>8&#41; & 0xff&#93;
         | row3&#91;&#40;x>>16&#41; & 0xff&#93;
         | row4&#91;&#40;x>>24&#41; & 0xff&#93;
         | row5&#91;&#40;x>>32&#41; & 0xff&#93;
         | row6&#91;&#40;x>>40&#41; & 0xff&#93;
         | row7&#91;&#40;x>>48&#41; & 0xff&#93;
         | row8&#91;&#40;x>>56&#41; & 0xff&#93;;

The same idea can be used to reflect vertically a bitboard.

hgm · Post by **hgm** » Tue Apr 26, 2016 11:14 am

That is also an interesting method. Actually the row1- row8 tables would just be pre-shifted versions of each other. So you could reduce cache pressure by using the same table for all rows, and shift it afterwards. This sounds like a bad deal, but x64 architecture has a single instruction for doing a = b + 2*c (LEA with scaled offset addressing mode).

So

Code: Select all

t = row&#91;x & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>16&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>24&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>32&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>40&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>48&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>56&#41; & 0xff&#93;;

Would just be 7 shifts, 8 ANDs, 7 LEAs and 8 memory LOADs.

The other method has 6 shifts, 9 XORs, 3 ANDs, which is probably faster, even though memory loads can be done in prallel with ALU stuff.

stegemma · Post by **stegemma** » Tue Apr 26, 2016 11:26 am

hgm wrote:That is also an interesting method. Actually the row1- row8 tables would just be pre-shifted versions of each other. So you could reduce cache pressure by using the same table for all rows, and shift it afterwards. This sounds like a bad deal, but x64 architecture has a single instruction for doing a = b + 2*c (LEA with scaled offset addressing mode).

So
Code: Select all
t = row&#91;x & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>16&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>24&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>32&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>40&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>48&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>56&#41; & 0xff&#93;;
Would just be 7 shifts, 8 ANDs, 7 LEAs and 8 memory LOADs.

The other method has 6 shifts, 9 XORs, 3 ANDs, which is probably faster, even though memory loads can be done in prallel with ALU stuff.

Of course the fastest is better but with a table you can easily use the same algorithm to different operation on the bitboard.

If the compiler is not smart-enough, maybe it is better this little change:

Code: Select all

t = row&#91;x & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;

PS: I've not verified the "2*t+row" idea but I suppose that it works, in your case

PS2: I've used a lot of LEA in my assembly engines

hgm · Post by **hgm** » Tue Apr 26, 2016 12:06 pm

Actually one could hope that the compiler is smart enough to translate (x>>8*N) & 0xff by simply loading a byte into a register with zero extend (MOVZX). That eliminates both the shifts and the ANDs. It puts a bit of a heavy load on the CPU's load unit, though.

stegemma · Post by **stegemma** » Tue Apr 26, 2016 12:15 pm

On 64 bit Intel CPU there is no direct access to the highest bytes, so almost one shift is still required. You just have to try.

hgm · Post by **hgm** » Tue Apr 26, 2016 12:18 pm

There is if the source operand is in memory. Any memory byte can be accessed.

stegemma · Post by **stegemma** » Tue Apr 26, 2016 12:25 pm

Maybe I'm wrong, but I've wrote this little schema on the latest Pentium CPUs:

Code: Select all

;
; 8  &#58; al  bl  cl  dl                  r8b r9b r10b r11b r12b r13b r14b r15b
; 8  &#58;  ah  bh  ch  dh
; 16 &#58; ax  bx  cx  dx  si  di  bp  sp  r8w r9w r10w r11w r12w r13w r14w r15w
; 32 &#58; eax ebx ecx edx esi edi ebp esp r8d r9d r10d r11d r12d r13d r14d r15d
; 64 &#58; rax rbx rcx rdx rsi rdi rbp rsp r8  r9  r10  r11  r12  r13  r14  r15
;
; 64 &#58; mm0  mm1  mm2  mm3  mm4  mm5  mm6  mm7
; FPU&#58; st0  st1  st2  st3  st4  st5  st6  st7 
;
; 128&#58; xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
; 256&#58; ymm0 ymm1 ymm2 ymm3 ymm4 ymm5 ymm6 ymm7 ymm8 ymm9 ymm10 ymm11 ymm12 ymm13 ymm14 ymm15

It seems to me that there is no something like al/ah for the whole 64 bit registers.

Reflection of a bitboard

Reflection of a bitboard

Re: Reflection of a bitboard

Re: Reflection of a bitboard

Re: Reflection of a bitboard

Re: Reflection of a bitboard

Re: Reflection of a bitboard

Re: Reflection of a bitboard

Re: Reflection of a bitboard

Re: Reflection of a bitboard

Re: Reflection of a bitboard