Just stumbled over this while googling, if you need some variations of bitcount to make your engine unique:
http://infolab.stanford.edu/~manku/bitc ... count.html
bitcount
Moderators: hgm, Rebel, chrisw
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: bitcount
In nifty one may safe one 'and' instruction each in counting duo-bits and bytes. Also, imul kf >> 56 should produce no slower code than modulo 255 - infact both do the same to add the final bytecounts of a 64-bit word:
versus the one from
http://chessprogramming.wikispaces.com/Population+Count
Well, if I don't use L1 elsewhere I may waste 64KByte for word lookups - still we need four 16-bit lookups for 64 bits. And shortly we will use popcnt64 intrinsics anyway:
http://msdn2.microsoft.com/en-us/library/bb385231.aspx
Gerd
Code: Select all
const U64 k1 = C64(0x5555555555555555); // -1/3
const U64 k2 = C64(0x3333333333333333); // -1/5
const U64 k4 = C64(0x0f0f0f0f0f0f0f0f); // -1/17
const U64 kf = C64(0x0101010101010101); // -1/255
int niftyBitcount (U64 x)
{
x = (x & k1) + ((x >> 1) & k1);
x = (x & k2) + ((x >> 2) & k2);
x = (x & k4) + ((x >> 4) & k4);
return (int) (x % 255);
}
http://chessprogramming.wikispaces.com/Population+Count
Code: Select all
int popCount (U64 x)
{
x = x - ((x >> 1) & k1);
x = (x & k2) + ((x >> 2) & k2);
x = (x + (x >> 4)) & k4 ;
return (int) ((x * kf) >> 56);
}
http://msdn2.microsoft.com/en-us/library/bb385231.aspx
Gerd
-
- Posts: 4675
- Joined: Mon Mar 13, 2006 7:43 pm
Re: bitcount
Symbolic's C++ toolkit has two bit counter methods for its CTBB (bitboard) class. Both are inline methods; neither uses assembly language.
The first is Card() (from "cardinality") which sums four probes of a 16 bit wide lookup table and this is used mostly in the positional evaluation code where many one bits may be present.
The second is CardFew() and it uses the compliment/and loop that counts one bit at a time. It's used where the expectation is that only a few at most one bits are present. Example: the check response code determines single vs double check status by running CardFew() on the bitboard containing the squares having enemy men attacking the square of the king of interest.
The first is Card() (from "cardinality") which sums four probes of a 16 bit wide lookup table and this is used mostly in the positional evaluation code where many one bits may be present.
The second is CardFew() and it uses the compliment/and loop that counts one bit at a time. It's used where the expectation is that only a few at most one bits are present. Example: the check response code determines single vs double check status by running CardFew() on the bitboard containing the squares having enemy men attacking the square of the king of interest.
-
- Posts: 2250
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: bitcount
Another alternative to gain more parallelism - last recent intel procs and K10 have 128 bit alus and perform up to three simd instructions per cycle, but probably the popcnt instruction is not yet available by the cpu, operating system or compiler intrinsic.
Expanding or sign extending bits to bytes into four 128-bit xmm-registers, adding all bytes together with one final sum of absolute byte difference. The routine might likely be modified for a weighted popcount aka dot-product bit[64]*byte_rot90[64].
Expanding or sign extending bits to bytes into four 128-bit xmm-registers, adding all bytes together with one final sum of absolute byte difference. The routine might likely be modified for a weighted popcount aka dot-product bit[64]*byte_rot90[64].
Code: Select all
U32 popCount(U64 bb) {
static const U64 CACHE_ALIGN masks[8] = {
0x0101010101010101, 0x0202020202020202,
0x0404040404040404, 0x0808080808080808,
0x1010101010101010, 0x2020202020202020,
0x4040404040404040, 0x8080808080808080,
};
__m128i x0, x1, x2, x3, zr; U32 cnt;
__m128i * pM = (__m128i*) masks;
x0 = _mm_cvtsi64x_si128 ( bb );
x0 = _mm_unpacklo_epi64 ( x0, x0 );
zr = _mm_setzero_si128();
x3 = _mm_andnot_si128 ( x0, pM[3] );
x2 = _mm_andnot_si128 ( x0, pM[2] );
x1 = _mm_andnot_si128 ( x0, pM[1] );
x0 = _mm_andnot_si128 ( x0, pM[0] );
x3 = _mm_cmpeq_epi8 ( x3, zr );
x2 = _mm_cmpeq_epi8 ( x2, zr );
x1 = _mm_cmpeq_epi8 ( x1, zr );
x0 = _mm_cmpeq_epi8 ( x0, zr );
// perform 4 times 'and' with weight vector here
x2 = _mm_add_epi8 ( x2, x3 );
x0 = _mm_add_epi8 ( x0, x1 );
x0 = _mm_add_epi8 ( x0, x2 );
x0 = _mm_sad_epu8 ( x0, zr );
cnt = -_mm_cvtsi128_si32( x0 )
-_mm_extract_epi16( x0, 4 );
return cnt & 255;
}
bb$ = 8
?popCount@@YAI_K@Z PROC
00000 66 0f ef db pxor xmm3, xmm3
00004 66 48 0f 6e d1 movd xmm2, rcx
00009 66 0f 6c d2 punpcklqdq xmm2, xmm2
0000d 66 0f 6f e2 movdqa xmm4, xmm2
00011 66 0f 6f c2 movdqa xmm0, xmm2
00015 66 0f 6f ca movdqa xmm1, xmm2
00019 66 0f df 15 30 00 00 00 pandn xmm2, XMMWORD PTR ?masks+48
00021 66 0f df 25 00 00 00 00 pandn xmm4, XMMWORD PTR ?masks
00029 66 0f df 0d 20 00 00 00 pandn xmm1, XMMWORD PTR ?masks+32
00031 66 0f df 05 10 00 00 00 pandn xmm0, XMMWORD PTR ?masks+16
00039 66 0f 74 e3 pcmpeqb xmm4, xmm3
0003d 66 0f 74 c3 pcmpeqb xmm0, xmm3
00041 66 0f 74 cb pcmpeqb xmm1, xmm3
00045 66 0f fc e0 paddb xmm4, xmm0
00049 66 0f 74 d3 pcmpeqb xmm2, xmm3
0004d 66 0f fc ca paddb xmm1, xmm2
00051 66 0f fc e1 paddb xmm4, xmm1
00055 66 0f f6 e3 psadbw xmm4, xmm3
00059 66 0f 7e e0 movd eax, xmm4
0005d 66 0f c5 cc 04 pextrw ecx, xmm4, 4
00062 03 c8 add ecx, eax
00064 f7 d9 neg ecx
00066 0f b6 c1 movzx eax, cl
00069 c3 ret 0
?popCount@@YAI_K@Z ENDP