bitcount

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Guetti

bitcount

Post by Guetti »

Just stumbled over this while googling, if you need some variations of bitcount to make your engine unique:

http://infolab.stanford.edu/~manku/bitc ... count.html
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: bitcount

Post by Gerd Isenberg »

In nifty one may safe one 'and' instruction each in counting duo-bits and bytes. Also, imul kf >> 56 should produce no slower code than modulo 255 - infact both do the same to add the final bytecounts of a 64-bit word:

Code: Select all

const U64 k1 = C64(0x5555555555555555); // -1/3
const U64 k2 = C64(0x3333333333333333); // -1/5
const U64 k4 = C64(0x0f0f0f0f0f0f0f0f); // -1/17
const U64 kf = C64(0x0101010101010101); // -1/255

int niftyBitcount (U64 x)
{
   x = (x & k1) + ((x >> 1) & k1); 
   x = (x & k2) + ((x >> 2) & k2); 
   x = (x & k4) + ((x >> 4) & k4); 
   return (int)    (x % 255);
}
versus the one from
http://chessprogramming.wikispaces.com/Population+Count

Code: Select all

int popCount (U64 x)
{
    x =  x       - ((x >> 1)  & k1); 
    x = (x & k2) + ((x >> 2)  & k2); 
    x = (x       +  (x >> 4)) & k4 ; 
    return (int)   ((x * kf) >> 56); 
}
Well, if I don't use L1 elsewhere I may waste 64KByte for word lookups - still we need four 16-bit lookups for 64 bits. And shortly we will use popcnt64 intrinsics anyway:
http://msdn2.microsoft.com/en-us/library/bb385231.aspx

Gerd
User avatar
sje
Posts: 4675
Joined: Mon Mar 13, 2006 7:43 pm

Re: bitcount

Post by sje »

Symbolic's C++ toolkit has two bit counter methods for its CTBB (bitboard) class. Both are inline methods; neither uses assembly language.

The first is Card() (from "cardinality") which sums four probes of a 16 bit wide lookup table and this is used mostly in the positional evaluation code where many one bits may be present.

The second is CardFew() and it uses the compliment/and loop that counts one bit at a time. It's used where the expectation is that only a few at most one bits are present. Example: the check response code determines single vs double check status by running CardFew() on the bitboard containing the squares having enemy men attacking the square of the king of interest.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: bitcount

Post by Gerd Isenberg »

Another alternative to gain more parallelism - last recent intel procs and K10 have 128 bit alus and perform up to three simd instructions per cycle, but probably the popcnt instruction is not yet available by the cpu, operating system or compiler intrinsic.
Expanding or sign extending bits to bytes into four 128-bit xmm-registers, adding all bytes together with one final sum of absolute byte difference. The routine might likely be modified for a weighted popcount aka dot-product bit[64]*byte_rot90[64].

Code: Select all

U32 popCount(U64 bb) {
   static const U64 CACHE_ALIGN masks[8] = {
      0x0101010101010101, 0x0202020202020202,
      0x0404040404040404, 0x0808080808080808,
      0x1010101010101010, 0x2020202020202020,
      0x4040404040404040, 0x8080808080808080,
   };
   __m128i x0, x1, x2, x3, zr; U32 cnt;
   __m128i * pM = (__m128i*) masks;
   x0  =  _mm_cvtsi64x_si128 ( bb );
   x0  =  _mm_unpacklo_epi64 ( x0, x0 );
   zr  =  _mm_setzero_si128();
   x3  =  _mm_andnot_si128 ( x0, pM[3] );
   x2  =  _mm_andnot_si128 ( x0, pM[2] );
   x1  =  _mm_andnot_si128 ( x0, pM[1] );
   x0  =  _mm_andnot_si128 ( x0, pM[0] );
   x3  =  _mm_cmpeq_epi8   ( x3, zr );
   x2  =  _mm_cmpeq_epi8   ( x2, zr );
   x1  =  _mm_cmpeq_epi8   ( x1, zr );
   x0  =  _mm_cmpeq_epi8   ( x0, zr );
   // perform 4 times 'and' with weight vector here
   x2  =  _mm_add_epi8     ( x2, x3 );
   x0  =  _mm_add_epi8     ( x0, x1 );
   x0  =  _mm_add_epi8     ( x0, x2 );
   x0  =  _mm_sad_epu8     ( x0, zr );
   cnt = -_mm_cvtsi128_si32( x0 )
         -_mm_extract_epi16( x0, 4 );
   return cnt & 255;
}

bb$ = 8
?popCount@@YAI_K@Z PROC
  00000	66 0f ef db	 pxor	 xmm3, xmm3
  00004	66 48 0f 6e d1	 movd	 xmm2, rcx
  00009	66 0f 6c d2	 punpcklqdq xmm2, xmm2
  0000d	66 0f 6f e2	 movdqa	 xmm4, xmm2
  00011	66 0f 6f c2	 movdqa	 xmm0, xmm2
  00015	66 0f 6f ca	 movdqa	 xmm1, xmm2
  00019	66 0f df 15 30 00 00 00	 pandn	 xmm2, XMMWORD PTR ?masks+48
  00021	66 0f df 25 00 00 00 00	 pandn	 xmm4, XMMWORD PTR ?masks
  00029	66 0f df 0d 20 00 00 00	 pandn	 xmm1, XMMWORD PTR ?masks+32
  00031	66 0f df 05 10 00 00 00	 pandn	 xmm0, XMMWORD PTR ?masks+16
  00039	66 0f 74 e3	 pcmpeqb xmm4, xmm3
  0003d	66 0f 74 c3	 pcmpeqb xmm0, xmm3
  00041	66 0f 74 cb	 pcmpeqb xmm1, xmm3
  00045	66 0f fc e0	 paddb	 xmm4, xmm0
  00049	66 0f 74 d3	 pcmpeqb xmm2, xmm3
  0004d	66 0f fc ca	 paddb	 xmm1, xmm2
  00051	66 0f fc e1	 paddb	 xmm4, xmm1
  00055	66 0f f6 e3	 psadbw	 xmm4, xmm3
  00059	66 0f 7e e0	 movd	 eax, xmm4
  0005d	66 0f c5 cc 04	 pextrw	 ecx, xmm4, 4
  00062	03 c8		 add	 ecx, eax
  00064	f7 d9		 neg	 ecx
  00066	0f b6 c1	 movzx	 eax, cl
  00069	c3		 ret	 0
?popCount@@YAI_K@Z ENDP