What is the fastest way to flip a bitboard along a diagonal? The best I could think of is masking with 0x8080808080808080ull>>N eight times (for N=0-7), and then multiplying with 0x0002040810204081ull<<N. And then mask out the upper byte, shift it right by 8*N, and OR everything together.
Is there a faster way?
Reflection of a bitboard
Moderators: hgm, Rebel, chrisw
-
- Posts: 27795
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
-
- Posts: 373
- Joined: Wed Mar 22, 2006 10:17 am
- Location: Novi Sad, Serbia
- Full name: Karlo Balla
Re: Reflection of a bitboard
Version from CPW!?hgm wrote:What is the fastest way to flip a bitboard along a diagonal? The best I could think of is masking with 0x8080808080808080ull>>N eight times (for N=0-7), and then multiplying with 0x0002040810204081ull<<N. And then mask out the upper byte, shift it right by 8*N, and OR everything together.
Is there a faster way?
Code: Select all
/**
* Flip a bitboard about the diagonal a1-h8.
* Square h1 is mapped to a8 and vice versa.
* @param x any bitboard
* @return bitboard x flipped about diagonal a1-h8
*/
U64 flipDiagA1H8(U64 x) {
U64 t;
const U64 k1 = C64(0x5500550055005500);
const U64 k2 = C64(0x3333000033330000);
const U64 k4 = C64(0x0f0f0f0f00000000);
t = k4 & (x ^ (x << 28));
x ^= t ^ (t >> 28) ;
t = k2 & (x ^ (x << 14));
x ^= t ^ (t >> 14) ;
t = k1 & (x ^ (x << 7));
x ^= t ^ (t >> 7) ;
return x;
}
Best Regards,
Karlo Balla Jr.
Karlo Balla Jr.
-
- Posts: 27795
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Reflection of a bitboard
Wow, that looks clever! Thanks!
-
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: Reflection of a bitboard
Another possible solution (maybe less efficient) could be to use 8 arrays of 256 reverted bytes (stored as 64 bit values) for any row:
The same idea can be used to reflect vertically a bitboard.
Code: Select all
U64 k = row1[x & 0xff]
| row2[(x>>8) & 0xff]
| row3[(x>>16) & 0xff]
| row4[(x>>24) & 0xff]
| row5[(x>>32) & 0xff]
| row6[(x>>40) & 0xff]
| row7[(x>>48) & 0xff]
| row8[(x>>56) & 0xff];
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
http://www.linformatica.com
-
- Posts: 27795
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Reflection of a bitboard
That is also an interesting method. Actually the row1- row8 tables would just be pre-shifted versions of each other. So you could reduce cache pressure by using the same table for all rows, and shift it afterwards. This sounds like a bad deal, but x64 architecture has a single instruction for doing a = b + 2*c (LEA with scaled offset addressing mode).
So
Would just be 7 shifts, 8 ANDs, 7 LEAs and 8 memory LOADs.
The other method has 6 shifts, 9 XORs, 3 ANDs, which is probably faster, even though memory loads can be done in prallel with ALU stuff.
So
Code: Select all
t = row[x & 0xff];
t = 2*t + row[(x>>8) & 0xff];
t = 2*t + row[(x>>16) & 0xff];
t = 2*t + row[(x>>24) & 0xff];
t = 2*t + row[(x>>32) & 0xff];
t = 2*t + row[(x>>40) & 0xff];
t = 2*t + row[(x>>48) & 0xff];
t = 2*t + row[(x>>56) & 0xff];
The other method has 6 shifts, 9 XORs, 3 ANDs, which is probably faster, even though memory loads can be done in prallel with ALU stuff.
-
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: Reflection of a bitboard
Of course the fastest is better but with a table you can easily use the same algorithm to different operation on the bitboard.hgm wrote:That is also an interesting method. Actually the row1- row8 tables would just be pre-shifted versions of each other. So you could reduce cache pressure by using the same table for all rows, and shift it afterwards. This sounds like a bad deal, but x64 architecture has a single instruction for doing a = b + 2*c (LEA with scaled offset addressing mode).
So
Would just be 7 shifts, 8 ANDs, 7 LEAs and 8 memory LOADs.Code: Select all
t = row[x & 0xff]; t = 2*t + row[(x>>8) & 0xff]; t = 2*t + row[(x>>16) & 0xff]; t = 2*t + row[(x>>24) & 0xff]; t = 2*t + row[(x>>32) & 0xff]; t = 2*t + row[(x>>40) & 0xff]; t = 2*t + row[(x>>48) & 0xff]; t = 2*t + row[(x>>56) & 0xff];
The other method has 6 shifts, 9 XORs, 3 ANDs, which is probably faster, even though memory loads can be done in prallel with ALU stuff.
If the compiler is not smart-enough, maybe it is better this little change:
Code: Select all
t = row[x & 0xff];
t = 2*t + row[(x>>=8) & 0xff];
t = 2*t + row[(x>>=8) & 0xff];
t = 2*t + row[(x>>=8) & 0xff];
t = 2*t + row[(x>>=8) & 0xff];
t = 2*t + row[(x>>=8) & 0xff];
t = 2*t + row[(x>>=8) & 0xff];
t = 2*t + row[(x>>=8) & 0xff];
PS2: I've used a lot of LEA in my assembly engines
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
http://www.linformatica.com
-
- Posts: 27795
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Reflection of a bitboard
Actually one could hope that the compiler is smart enough to translate (x>>8*N) & 0xff by simply loading a byte into a register with zero extend (MOVZX). That eliminates both the shifts and the ANDs. It puts a bit of a heavy load on the CPU's load unit, though.
-
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: Reflection of a bitboard
On 64 bit Intel CPU there is no direct access to the highest bytes, so almost one shift is still required. You just have to try.
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
http://www.linformatica.com
-
- Posts: 27795
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Reflection of a bitboard
There is if the source operand is in memory. Any memory byte can be accessed.
-
- Posts: 859
- Joined: Mon Aug 10, 2009 10:05 pm
- Location: Italy
- Full name: Stefano Gemma
Re: Reflection of a bitboard
Maybe I'm wrong, but I've wrote this little schema on the latest Pentium CPUs:
It seems to me that there is no something like al/ah for the whole 64 bit registers.
Code: Select all
;
; 8 : al bl cl dl r8b r9b r10b r11b r12b r13b r14b r15b
; 8 : ah bh ch dh
; 16 : ax bx cx dx si di bp sp r8w r9w r10w r11w r12w r13w r14w r15w
; 32 : eax ebx ecx edx esi edi ebp esp r8d r9d r10d r11d r12d r13d r14d r15d
; 64 : rax rbx rcx rdx rsi rdi rbp rsp r8 r9 r10 r11 r12 r13 r14 r15
;
; 64 : mm0 mm1 mm2 mm3 mm4 mm5 mm6 mm7
; FPU: st0 st1 st2 st3 st4 st5 st6 st7
;
; 128: xmm0 xmm1 xmm2 xmm3 xmm4 xmm5 xmm6 xmm7 xmm8 xmm9 xmm10 xmm11 xmm12 xmm13 xmm14 xmm15
; 256: ymm0 ymm1 ymm2 ymm3 ymm4 ymm5 ymm6 ymm7 ymm8 ymm9 ymm10 ymm11 ymm12 ymm13 ymm14 ymm15
Author of Drago, Raffaela, Freccia, Satana, Sabrina.
http://www.linformatica.com
http://www.linformatica.com