Reversing bits in a 32bit word

Cardoso · Post by **Cardoso** » Sat Mar 22, 2008 3:08 pm

Hi,
what is the fastest way to reverse the bits in a 32bit int?
For example bit 0 goes to bit 31, bit 1 goes to 30, bit 12 goes to 19, etc.
Gerd?

Thanks in advance,
Alvaro

Hart · Post by **Hart** » Sat Mar 22, 2008 4:38 pm

http://www.cs.utk.edu/~vose/c-stuff/bithacks.html

unsigned int v; // reverse the bits in this
unsigned int t = v << 1; // t will have the reversed bits of v
int i;

v >>= 1;
for (i = sizeof(v) * 8 - 2; i; i--)
{
t |= v & 1;
t <<= 1;
v >>= 1;
}
t |= v;

Reverse an N-bit quantity in parallel in 5 * lg(N) operations:

unsigned int v; // 32 bit word to reverse bit order

int const S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers
int const B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

v = ((v >> S[0]) & B[0]) | ((v << S[0]) & ~B[0]); // swap odd and even bits
v = ((v >> S[1]) & B[1]) | ((v << S[1]) & ~B[1]); // swap consecutive pairs
v = ((v >> S[2]) & B[2]) | ((v << S[2]) & ~B[2]); // swap nibbles ...
v = ((v >> S[3]) & B[3]) | ((v << S[3]) & ~B[3]);
v = ((v >> S[4]) & B[4]) | ((v << S[4]) & ~B[4]);

-Sean Eron Anderson

Helpful?

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Mar 22, 2008 4:43 pm

Cardoso wrote:Hi,
what is the fastest way to reverse the bits in a 32bit int?
For example bit 0 goes to bit 31, bit 1 goes to 30, bit 12 goes to 19, etc.
Gerd?

Thanks in advance,
Alvaro

Hi Alvaro,

Ok, 32 bit is not exactly mentioned in
http://chessprogramming.wikispaces.com/ ... d+Rotating,
but the principle is the same. Five delta swaps, swapping bits, duos, nibbles, bytes and words:

Code: Select all

U32 reverse (U32 x) {
   x = ((x >> 1) & 0x55555555) | ((x & 0x55555555) << 1);
   x = ((x >> 2) & 0x33333333) | ((x & 0x33333333) << 2);
   x = ((x >> 4) & 0x0f0f0f0f) | ((x & 0x0f0f0f0f) << 4);
   x = ((x >> 8) & 0x00ff00ff) | ((x & 0x00ff00ff) << 8);
   return (x >> 16) | (x << 16);
}

With x86 32-bit bswap/rotate intrinsics this one is probably fastest with 14 dependent ops:

Code: Select all

U32 reverse (U32 x) {
   x ^= 0x0f0f0f0f & (x ^ _rotl(x, 8));
   x ^= 0x33333333 & (x ^ _rotl(x, 4));
   x ^= 0x55555555 & (x ^ _rotl(x, 2));
   return _byteswap_ulong(_rotr(x, 7));
}

Depending on your memory or L1-footprint you may also try 4 byte- or two word (128KByte) lookups:

Code: Select all

x = reverseWord[x>>16] | (reverseWord[x & 0xffff] << 16);

Cheers,
Gerd

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Mar 22, 2008 5:15 pm

Gerd Isenberg wrote: With x86 32-bit bswap/rotate intrinsics this one is probably fastest with 14 dependent ops:
Code: Select all
U32 reverse (U32 x) {
   x ^= 0x0f0f0f0f & (x ^ _rotl(x, 8));
   x ^= 0x33333333 & (x ^ _rotl(x, 4));
   x ^= 0x55555555 & (x ^ _rotl(x, 2));
   return _byteswap_ulong(_rotr(x, 7));
}

Probably this is even faster, due to more parallelism and replacing 'or' by 'add', taking advantage of x86 lea for mul 2,4:

Code: Select all

U32 reverse (U32 x) {
   x = ((x >> 1) & 0x55555555) +  2*(x & 0x55555555);
   x = ((x >> 2) & 0x33333333) +  4*(x & 0x33333333);
   x = ((x >> 4) & 0x0f0f0f0f) + 16*(x & 0x0f0f0f0f);
   return _byteswap_ulong(x);
}

17 instructions, 3 moves, 2 leas:

Code: Select all

   mov   eax, ecx
   shr   eax, 1
   and   ecx, 0x55555555
   and   eax, 0x55555555
   lea   eax, [eax + 2*ecx]
   mov   ecx, eax
   shr   eax, 2
   and   ecx, 0x33333333
   and   eax, 0x33333333
   lea   eax, [eax + 4*ecx]
   mov   ecx, eax
   shr   eax, 4
   and   ecx, 0x0f0f0f0f
   and   eax, 0x0f0f0f0f
   shl   ecx, 4
   add   eax, ecx
   bswap eax

Cardoso · Post by **Cardoso** » Sun Mar 23, 2008 9:55 am

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Mar 25, 2008 10:12 pm

Hart wrote: Reverse an N-bit quantity in parallel in 5 * lg(N) operations:

unsigned int v; // 32 bit word to reverse bit order

int const S[] = {1, 2, 4, 8, 16}; // Magic Binary Numbers
int const B[] = {0x55555555, 0x33333333, 0x0F0F0F0F, 0x00FF00FF, 0x0000FFFF};

v = ((v >> S[0]) & B[0]) | ((v << S[0]) & ~B[0]); // swap odd and even bits
v = ((v >> S[1]) & B[1]) | ((v << S[1]) & ~B[1]); // swap consecutive pairs
v = ((v >> S[2]) & B[2]) | ((v << S[2]) & ~B[2]); // swap nibbles ...
v = ((v >> S[3]) & B[3]) | ((v << S[3]) & ~B[3]);
v = ((v >> S[4]) & B[4]) | ((v << S[4]) & ~B[4]);

The const-array declaration is fine. If declared local, as shift, factor and const mask by two-adic fractions of -1 (see macro M(a)).
The loop is unrolled and all constants are resolved by the (vc2005) compiler. The generated 64-bit assembly looks optimal. Of course "reverse" can be done faster by mirrorHorizontal and flipvertical aka 64-bit bswap.

Code: Select all

#define S(a) ( C64( 1) << (a)    )
#define M(a) ( C64(-1) / ((a)+1) )

U64 mirrorHorizontal(U64 b) {
   const int s[]= {S(  0 ), S(  1 ), S(  2 )};
   const U64 f[]= {S(s[0]), S(s[1]), S(s[2])};
   const U64 k[]= {M(f[0]), M(f[1]), M(f[2])};
   for (U32 i = 0; i < sizeof(s)/sizeof(s[0]); i++)
      b = ((b >> s[i]) & k[i]) + f[i]*(b & k[i]);
   return b;
}

U64 reverse(U64 b) {
   const int s[]= {S(  0 ), S(  1 ), S(  2 ), S(  3 ), S(  4 ), S(  5 )};
   const U64 f[]= {S(s[0]), S(s[1]), S(s[2]), S(s[3]), S(s[4]), S(s[5])};
   const U64 k[]= {M(f[0]), M(f[1]), M(f[2]), M(f[3]), M(f[4]), M(f[5])};
   for (U32 i = 0; i < sizeof(s)/sizeof(s[0]); i++)
      b = ((b >> s[i]) & k[i]) + f[i]*(b & k[i]);
   return b;
}

How can we define some clever text-concatination macros to make the above array-initializations dynamic on the passed number of elements?

Code: Select all

#define InitS(3)
=> {S(  0 ), S(  1 ), S(  2 )}

#define InitS(6)
=> {S(  0 ), S(  1 ), S(  2 ), S(  3 ), S(  4 ), S(  5 )}

The vc2005 generated assembly of both routines:

Code: Select all

b$ = 8
?mirrorHorizontal@@ PROC
  00000	48 8b c1	 mov	 rax, rcx
  00003	48 ba 55 55 55
	55 55 55 55 55	 mov	 rdx, 5555555555555555H
  0000d	48 d1 e8	 shr	 rax, 1
  00010	48 23 ca	 and	 rcx, rdx
  00013	48 23 c2	 and	 rax, rdx
  00016	48 ba 33 33 33
	33 33 33 33 33	 mov	 rdx, 3333333333333333H
  00020	48 8d 0c 48	 lea	 rcx, QWORD PTR [rax+rcx*2]
  00024	48 8b c1	 mov	 rax, rcx
  00027	48 23 ca	 and	 rcx, rdx
  0002a	48 c1 e8 02	 shr	 rax, 2
  0002e	48 23 c2	 and	 rax, rdx
  00031	48 8d 14 88	 lea	 rdx, QWORD PTR [rax+rcx*4]
  00035	48 b9 0f 0f 0f
	0f 0f 0f 0f 0f	 mov	 rcx, 0f0f0f0f0f0f0f0fH
  0003f	48 8b c2	 mov	 rax, rdx
  00042	48 c1 ea 04	 shr	 rdx, 4
  00046	48 23 c1	 and	 rax, rcx
  00049	48 23 d1	 and	 rdx, rcx
  0004c	48 c1 e0 04	 shl	 rax, 4
  00050	48 03 c2	 add	 rax, rdx
  00053	c3		 ret	 0
?mirrorHorizontal@@ ENDP


b$ = 8
?reverse@@ PROC
  00000	48 8b c1	 mov	 rax, rcx
  00003	48 ba 55 55 55
	55 55 55 55 55	 mov	 rdx, 5555555555555555H
  0000d	48 23 ca	 and	 rcx, rdx
  00010	48 d1 e8	 shr	 rax, 1
  00013	48 23 c2	 and	 rax, rdx
  00016	48 ba 33 33 33
	33 33 33 33 33	 mov	 rdx, 3333333333333333H
  00020	48 8d 0c 48	 lea	 rcx, QWORD PTR [rax+rcx*2]
  00024	48 8b c1	 mov	 rax, rcx
  00027	48 23 ca	 and	 rcx, rdx
  0002a	48 c1 e8 02	 shr	 rax, 2
  0002e	48 23 c2	 and	 rax, rdx
  00031	48 8d 14 88	 lea	 rdx, QWORD PTR [rax+rcx*4]
  00035	48 b9 0f 0f 0f
	0f 0f 0f 0f 0f	 mov	 rcx, 0f0f0f0f0f0f0f0fH
  0003f	48 8b c2	 mov	 rax, rdx
  00042	48 c1 ea 04	 shr	 rdx, 4
  00046	48 23 c1	 and	 rax, rcx
  00049	48 23 d1	 and	 rdx, rcx
  0004c	48 b9 ff 00 ff
	00 ff 00 ff 00	 mov	 rcx, 00ff00ff00ff00ffH
  00056	48 c1 e0 04	 shl	 rax, 4
  0005a	48 03 c2	 add	 rax, rdx
  0005d	48 8b d0	 mov	 rdx, rax
  00060	48 c1 e8 08	 shr	 rax, 8
  00064	48 23 c1	 and	 rax, rcx
  00067	48 23 d1	 and	 rdx, rcx
  0006a	48 c1 e2 08	 shl	 rdx, 8
  0006e	48 03 d0	 add	 rdx, rax
  00071	48 b8 ff ff 00
	00 ff ff 00 00	 mov	 rax, 0000ffff0000ffffH
  0007b	48 8b ca	 mov	 rcx, rdx
  0007e	48 23 d0	 and	 rdx, rax
  00081	48 c1 e9 10	 shr	 rcx, 16
  00085	48 c1 e2 10	 shl	 rdx, 16
  00089	48 23 c8	 and	 rcx, rax
  0008c	48 03 ca	 add	 rcx, rdx
  0008f	8b c1		 mov	 eax, ecx
  00091	48 c1 e9 20	 shr	 rcx, 32
  00095	48 c1 e0 20	 shl	 rax, 32
  00099	48 03 c1	 add	 rax, rcx
  0009c	c3		 ret	 0
?reverse@@ ENDP

Gerd

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Mar 26, 2008 12:37 am

Gerd Isenberg wrote: How can we define some clever text-concatination macros to make the above array-initializations dynamic on the passed number of elements?
Code: Select all
#define InitS(3)
=> {S(  0 ), S(  1 ), S(  2 )}

#define InitS(6)
=> {S(  0 ), S(  1 ), S(  2 ), S(  3 ), S(  4 ), S(  5 )}

Nonsense, no need for ugly macros. All constants are calculated at compile time while unrolling the loops - same assembly as mentioned. One may even pass integer template parameter.

Code: Select all

U64 mirrorHorizontal(U64 b) {
   for (U32 i = 0; i < 3; i++) {
      const int s(      1  << i);
      const U64 f( C64( 1) << s);
      const U64 k( C64(-1) / (f+1) );
      b = ((b >> s) & k) + f*(b & k);
   }
   return b;
}

U64 flipVertical(U64 b) {
   for (U32 i = 3; i < 6; i++) {
      const int s(      1  << i);
      const U64 f( C64( 1) << s);
      const U64 k( C64(-1) / (f+1) );
      b = ((b >> s) & k) + f*(b & k);
   }
   return b;
}

U64 reverse(U64 b) {
   for (U32 i = 0; i < 6; i++) {
      const int s(      1  << i);
      const U64 f( C64( 1) << s);
      const U64 k( C64(-1) / (f+1) );
      b = ((b >> s) & k) + f*(b & k);
   }
   return b;
}

bob · Post by **bob** » Wed Mar 26, 2008 2:57 am

Gerd Isenberg wrote:
Gerd Isenberg wrote: How can we define some clever text-concatination macros to make the above array-initializations dynamic on the passed number of elements?
Code: Select all
#define InitS(3)
=> {S(  0 ), S(  1 ), S(  2 )}

#define InitS(6)
=> {S(  0 ), S(  1 ), S(  2 ), S(  3 ), S(  4 ), S(  5 )}
Nonsense, no need for ugly macros. All constants are calculated at compile time while unrolling the loops - same assembly as mentioned. One may even pass integer template parameter.
Code: Select all
U64 mirrorHorizontal(U64 b) {
   for (U32 i = 0; i < 3; i++) {
      const int s(      1  << i);
      const U64 f( C64( 1) << s);
      const U64 k( C64(-1) / (f+1) );
      b = ((b >> s) & k) + f*(b & k);
   }
   return b;
}

U64 flipVertical(U64 b) {
   for (U32 i = 3; i < 6; i++) {
      const int s(      1  << i);
      const U64 f( C64( 1) << s);
      const U64 k( C64(-1) / (f+1) );
      b = ((b >> s) & k) + f*(b & k);
   }
   return b;
}

U64 reverse(U64 b) {
   for (U32 i = 0; i < 6; i++) {
      const int s(      1  << i);
      const U64 f( C64( 1) << s);
      const U64 k( C64(-1) / (f+1) );
      b = ((b >> s) & k) + f*(b & k);
   }
   return b;
}
 

Do you ever worry that your head is going to explode or something???

sje · Post by **sje** » Wed Mar 26, 2008 4:48 am

If you can spare 128 KB of memory, try a lookup table:

Code: Select all

unsigned short int RevBitsVec[65536];  // Needs initialization

unsigned int Rev32(const unsigned int UnRev32)
{
  return
    RevBitsVec[UnRev32 & 0xffff] << 16 |
    RevBitsVec[(UnRev32 >> 16) & 0xffff];
}

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Mar 26, 2008 8:09 am

bob wrote:Do you ever worry that your head is going to explode or something???

No need to worry

Just happy to understand some simple math. The compile time division of -1 by power of two plus one, to get all the masks for a general delta-swap formula, to swap bits, pairs of bits, nibbles, bytes, words and dwords. I was quite amazed by the compiler abilities to unroll the loop to get that optimal assembly to mirror, flip or reverse bitboards (or any integer).

Code: Select all

s[i] := 1<<i    
f[i] := 1<<s[i]
k[i] := (unsigned)-1 / (f[i] + 1)

i    = 0
s[i] = 1
f[i] = 2
k[i] = 0xffffffffffffffff / 3
k[i] = 0x5555555555555555

i    = 1
s[i] = 2
f[i] = 4
k[i] = 0xffffffffffffffff / 5
k[i] = 0x3333333333333333

i    = 2
s[i] = 4
f[i] = 16
k[i] = 0xffffffffffffffff / 0x11
k[i] = 0x0f0f0f0f0f0f0f0f

i    = 3
s[i] = 8
f[i] = 256
k[i] = 0xffffffffffffffff / 0x101
k[i] = 0x00ff00ff00ff00ff

i    = 4
s[i] = 16
f[i] = 65536
k[i] = 0xffffffffffffffff / 0x10001
k[i] = 0x0000ffff0000ffff

i    = 5
s[i] = 32
f[i] = 4294967296
k[i] = 0xffffffffffffffff / 0x100000001
k[i] = 0x00000000ffffffff

Thus ....

Code: Select all

U64 mirrorHorizontal(U64 b) {
   for (U32 i = 0; i < 3; i++) {
      const int s(      1  << i);
      const U64 f( C64( 1) << s);
      const U64 k( C64(-1) / (f+1) );
      b = ((b >> s) & k) + f*(b & k);
   }
   return b;
}

... is equivalent to

Code: Select all

U64 mirrorHorizontal(U64 b) {
   b = ((b >> 1) & 0x5555555555555555) +  2*(b & 0x5555555555555555);
   b = ((b >> 2) & 0x3333333333333333) +  4*(b & 0x3333333333333333);
   b = ((b >> 4) & 0x0f0f0f0f0f0f0f0f) + 16*(b & 0x0f0f0f0f0f0f0f0f);
   return b;
}

For practical relevance: I actually flip my board vertically after each move made. That is I bswap each of my four board bitboards and also the zobrist based hashkey and castle rights etc. with

Code: Select all

zobrist[whitepiece][sq] := bswap( zobrist[blackpiece][sq^56].

With proper initialization the engine has always white to move, and I only need to generate white moves and to eval or recognize egtbs with white to move positions, which simplifies code. Obviously I generate moves in same order with vertically flipped (and color changed) positions and get same search results.

As an intended side effect, I'll gain more hash-hits in white/black symmetrical positions

I was also thinking to conditionally mirror the board horizontally, to keep the white king always on e-h files. But that seems actually to expensive until future x86 will have a PPERM instruction (introduced by AMDs SSE5 instruction set) to reverse vectors of two or four bitboards in one instruction.

Reversing bits in a 32bit word

Reversing bits in a 32bit word

Re: Reversing bits in a 32bit word

Re: Reversing bits in a 32bit word

Re: Reversing bits in a 32bit word

Ok, very helpfull, manky thanks to you all (NT)

Re: Reversing bits in a 32bit word

Re: Reversing bits in a 32bit word

Re: Reversing bits in a 32bit word

Re: Reversing bits in a 32bit word

Re: Reversing bits in a 32bit word