Reflection of a bitboard

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

User avatar
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Reflection of a bitboard

Post by hgm »

What is the fastest way to flip a bitboard along a diagonal? The best I could think of is masking with 0x8080808080808080ull>>N eight times (for N=0-7), and then multiplying with 0x0002040810204081ull<<N. And then mask out the upper byte, shift it right by 8*N, and OR everything together.

Is there a faster way?
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: Reflection of a bitboard

Post by Karlo Bala »

hgm wrote:What is the fastest way to flip a bitboard along a diagonal? The best I could think of is masking with 0x8080808080808080ull>>N eight times (for N=0-7), and then multiplying with 0x0002040810204081ull<<N. And then mask out the upper byte, shift it right by 8*N, and OR everything together.

Is there a faster way?
Version from CPW!?

Code: Select all

/**
 * Flip a bitboard about the diagonal a1-h8.
 * Square h1 is mapped to a8 and vice versa.
 * @param x any bitboard
 * @return bitboard x flipped about diagonal a1-h8
 */
U64 flipDiagA1H8&#40;U64 x&#41; &#123;
   U64 t;
   const U64 k1 = C64&#40;0x5500550055005500&#41;;
   const U64 k2 = C64&#40;0x3333000033330000&#41;;
   const U64 k4 = C64&#40;0x0f0f0f0f00000000&#41;;
   t  = k4 & &#40;x ^ &#40;x << 28&#41;);
   x ^=       t ^ &#40;t >> 28&#41; ;
   t  = k2 & &#40;x ^ &#40;x << 14&#41;);
   x ^=       t ^ &#40;t >> 14&#41; ;
   t  = k1 & &#40;x ^ &#40;x <<  7&#41;);
   x ^=       t ^ &#40;t >>  7&#41; ;
   return x;
&#125;
Best Regards,
Karlo Balla Jr.
User avatar
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reflection of a bitboard

Post by hgm »

Wow, that looks clever! Thanks!
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Reflection of a bitboard

Post by stegemma »

Another possible solution (maybe less efficient) could be to use 8 arrays of 256 reverted bytes (stored as 64 bit values) for any row:

Code: Select all


U64 k = row1&#91;x & 0xff&#93;
         | row2&#91;&#40;x>>8&#41; & 0xff&#93;
         | row3&#91;&#40;x>>16&#41; & 0xff&#93;
         | row4&#91;&#40;x>>24&#41; & 0xff&#93;
         | row5&#91;&#40;x>>32&#41; & 0xff&#93;
         | row6&#91;&#40;x>>40&#41; & 0xff&#93;
         | row7&#91;&#40;x>>48&#41; & 0xff&#93;
         | row8&#91;&#40;x>>56&#41; & 0xff&#93;;
The same idea can be used to reflect vertically a bitboard.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
User avatar
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reflection of a bitboard

Post by hgm »

That is also an interesting method. Actually the row1- row8 tables would just be pre-shifted versions of each other. So you could reduce cache pressure by using the same table for all rows, and shift it afterwards. This sounds like a bad deal, but x64 architecture has a single instruction for doing a = b + 2*c (LEA with scaled offset addressing mode).

So

Code: Select all

t = row&#91;x & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>16&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>24&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>32&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>40&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>48&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>56&#41; & 0xff&#93;;
Would just be 7 shifts, 8 ANDs, 7 LEAs and 8 memory LOADs.

The other method has 6 shifts, 9 XORs, 3 ANDs, which is probably faster, even though memory loads can be done in prallel with ALU stuff.
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Reflection of a bitboard

Post by stegemma »

hgm wrote:That is also an interesting method. Actually the row1- row8 tables would just be pre-shifted versions of each other. So you could reduce cache pressure by using the same table for all rows, and shift it afterwards. This sounds like a bad deal, but x64 architecture has a single instruction for doing a = b + 2*c (LEA with scaled offset addressing mode).

So

Code: Select all

t = row&#91;x & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>16&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>24&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>32&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>40&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>48&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>56&#41; & 0xff&#93;;
Would just be 7 shifts, 8 ANDs, 7 LEAs and 8 memory LOADs.

The other method has 6 shifts, 9 XORs, 3 ANDs, which is probably faster, even though memory loads can be done in prallel with ALU stuff.
Of course the fastest is better but with a table you can easily use the same algorithm to different operation on the bitboard.

If the compiler is not smart-enough, maybe it is better this little change:

Code: Select all

t = row&#91;x & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
t = 2*t + row&#91;&#40;x>>=8&#41; & 0xff&#93;;
PS: I've not verified the "2*t+row" idea but I suppose that it works, in your case

PS2: I've used a lot of LEA in my assembly engines ;)
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
User avatar
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reflection of a bitboard

Post by hgm »

Actually one could hope that the compiler is smart enough to translate (x>>8*N) & 0xff by simply loading a byte into a register with zero extend (MOVZX). That eliminates both the shifts and the ANDs. It puts a bit of a heavy load on the CPU's load unit, though.
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Reflection of a bitboard

Post by stegemma »

On 64 bit Intel CPU there is no direct access to the highest bytes, so almost one shift is still required. You just have to try.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
User avatar
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Reflection of a bitboard

Post by hgm »

There is if the source operand is in memory. Any memory byte can be accessed.
User avatar
stegemma
Posts: 859
Joined: Mon Aug 10, 2009 10:05 pm
Location: Italy
Full name: Stefano Gemma

Re: Reflection of a bitboard

Post by stegemma »

Maybe I'm wrong, but I've wrote this little schema on the latest Pentium CPUs:

Code: Select all

;
; 8  &#58; al  bl  cl  dl                  r8b r9b r10b r11b r12b r13b r14b r15b
; 8  &#58;  ah  bh  ch  dh
; 16 &#58; ax  bx  cx  dx  si  di  bp  sp  r8w r9w r10w r11w r12w r13w r14w r15w
; 32 &#58; eax ebx ecx edx esi edi ebp esp r8d r9d r10d r11d r12d r13d r14d r15d
; 64 &#58; rax rbx rcx rdx rsi rdi rbp rsp r8  r9  r10  r11  r12  r13  r14  r15
;
; 64 &#58; mm0  mm1  mm2  mm3  mm4  mm5  mm6  mm7
; FPU&#58; st0  st1  st2  st3  st4  st5  st6  st7 
;
; 128&#58; xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
; 256&#58; ymm0 ymm1 ymm2 ymm3 ymm4 ymm5 ymm6 ymm7 ymm8 ymm9 ymm10 ymm11 ymm12 ymm13 ymm14 ymm15
It seems to me that there is no something like al/ah for the whole 64 bit registers.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com