Fastest bitboard compress routine when you can't use ASM

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

yoshiharu
Posts: 56
Joined: Sat Nov 11, 2006 11:14 pm

Re: Fastest bitboard compress routine when you can't use ASM

Post by yoshiharu »

Pradu wrote:
mambofish wrote:

Code: Select all

tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, first come up with by some guys at MIT ... I've forgotten their names already:

Code: Select all

X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
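For readers who want to try the quoted macro: it relies on a 64-entry lookup table that isn't shown above. A minimal, self-contained sketch in C, with the table filled at startup (initToIndexDB is just an illustrative name; the toIndexDB table, the macro and the constant come from the quote):

Code: Select all

#include <stdint.h>

#define C64(x) x##ULL

static int toIndexDB[64];   /* maps the hashed product to a square index */

/* Fill the table once: hash every single-bit board exactly as the macro
   does and remember which square produced which slot.  The constant is a
   De Bruijn-style multiplier, so all 64 slots come out distinct. */
static void initToIndexDB(void) {
    for (int s = 0; s < 64; s++)
        toIndexDB[(C64(1) << s) * C64(0x07EDD5E59A4E28C2) >> 58] = s;
}

/* X must be a bitboard with exactly one bit set */
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
Typical usage is toIndex(bb & -bb) to get the index of the lowest set bit of a non-empty bitboard bb.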
So, it seems I didn't find anything new...:?
:-D
Anyway, an "isomorphic" method proved fast enough for me, but at some point further improvements in these kinds of routines no longer translated into a faster search; using gprof I actually found out I was working to save only a few percent of the total time... maybe it's because I am currently evaluating at every node (ohh, yes, veeery slow ;-) ).
What gain in NPS did you guys see (in a "real-life" search) when you switched to your current favourite bit-twiddling routines?

Cheers, Mauro
mambofish

Re: Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

Hi Gerd

Thanks for the details. I don't know where you find this stuff, but it is very interesting!

Here's the Robert Harley reference I used for my code.

http://www.df.lth.se/~john_e/gems/gem0039.html

I found a reference to the table-based technique in "Hacker's Delight" which put me on to the idea. A Google search did the rest :-)

I am a little surprised at some of the timings presented. For example, I wouldn't use numberOfTrailingZeros. It relies on this binary chop code to isolate the LSB.

Code: Select all

    public static int numberOfTrailingZeros(long i) {
        // HD, Figure 5-14
        int x, y;
        if (i == 0) return 64;

        int n = 63;
        y = (int)i;  if (y != 0) { n = n - 32; x = y; } else x = (int)(i >>> 32);
        y = x << 16; if (y != 0) { n = n - 16; x = y; }
        y = x << 8;  if (y != 0) { n = n - 8;  x = y; }
        y = x << 4;  if (y != 0) { n = n - 4;  x = y; }
        y = x << 2;  if (y != 0) { n = n - 2;  x = y; }
        return n - ((x << 1) >>> 31);
    }
Each 1-bit therefore requires 5 tests to identify. I did try this technique and found it very much slower than the magic number method, so I am a little surprised at the timings you report. Nevertheless, I'll set up some tests again and confirm.
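For reference, the "magic number method" in question is Matt Taylor's folded 32-bit multiply quoted at the top of the thread. A self-contained C sketch following the commonly published form of that routine (the 64-entry table is computed at startup here so its exact contents are not assumed, and Vince's actual setup of lo and high may differ in detail):

Code: Select all

#include <stdint.h>

static unsigned char tzc64[64];   /* filled once at startup */

/* Build the table: hash each single-bit board exactly as the scan below
   does and record which square lands in which slot (the hash is perfect
   for Matt Taylor's constant 0x78291ACF). */
static void initTzc64(void) {
    for (int s = 0; s < 64; s++) {
        uint64_t bb = (uint64_t)1 << s;
        bb ^= bb - 1;                                  /* LSB and all bits below it */
        uint32_t folded = (uint32_t)bb ^ (uint32_t)(bb >> 32);
        tzc64[(folded * 0x78291ACFu) >> 26] = (unsigned char)s;
    }
}

/* 32-bit friendly: one 32x32 multiply, no 64-bit multiply needed */
static int bitScanForward(uint64_t bb) {               /* bb must be nonzero */
    bb ^= bb - 1;
    uint32_t folded = (uint32_t)bb ^ (uint32_t)(bb >> 32);
    return tzc64[(folded * 0x78291ACFu) >> 26];
}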

Regards,
Vince
jesper_nielsen

Re: Fastest bitboard compress routine when you can't use ASM

Post by jesper_nielsen »

yoshiharu wrote:
Pradu wrote:
mambofish wrote:

Code: Select all

tzc64[(0x3f & ((lo ^ high) * 0x78291ACF) >> 26)];
I believe this is called Matt Taylor's folded bitscan or something. I have never used it because I only cared about 64 bits. Here's the 64-bit version of this bitscan, first come up with by some guys at MIT ... I've forgotten their names already:

Code: Select all

X - a bitboard with a single bit active
#define toIndex(X) toIndexDB[(X)*C64(0x07EDD5E59A4E28C2)>>58]
So, it seems I didn't find anything new...:?
:-D
Anyway, an "isomorphic" method proved fast enough for me, but at some point further improvements in these kinds of routines no longer translated into a faster search; using gprof I actually found out I was working to save only a few percent of the total time... maybe it's because I am currently evaluating at every node (ohh, yes, veeery slow ;-) ).
What gain in NPS did you guys see (in a "real-life" search) when you switched to your current favourite bit-twiddling routines?

Cheers, Mauro
A story from the world of C#

When I first started out programming my engine "Pupsi", I quickly made a simple, straightforward algorithm, basically shifting the bits off to the left and updating indexes.

I thought "This is going to fly, once i get this code optimized with some really smart stuff!"

I have tried every trick I could find on the net, but nothing gave any improvement. :-(

I implemented the algorithm suggested by Vince, and I got roughly a 2.5-3% improvement in pure perft. Perft including evaluation gives about a 0.5% speedup. (This is after very preliminary tests; more testing is definitely needed!)

Here is the algorithm I wrote.
Please note: I index the bits from the opposite side compared to the example given by Vince, so the high-order bits are in mask_low, meaning the lowest indexes.

Code: Select all

    public static void getSquaresOld(ulong mask, int[] squares, ref int index) {
      index = 0;
      int rank = 0;
      int file;
      uint mask_low = (uint)((mask >> 32) & 0xFFFFFFFF);
      uint mask_high = (uint)( mask & 0xFFFFFFFF);
      uint rank_mask;
      while (mask_low != 0) {
        if ((rank_mask = (mask_low & 0xFF000000)) != 0) {
          file = 0;
          while (rank_mask != 0) {
            if ((rank_mask & 0x80000000) != 0) {
              squares[index++] = rank + file;
            }
            rank_mask <<= 1;
            file += 1;
          }
        }
        mask_low <<= 8;
        rank += 8;
      }

      rank = 32;
      while (mask_high != 0) {
        if ((rank_mask = (mask_high & 0xFF000000)) != 0) {
          file = 0;
          while (rank_mask != 0) {
            if ((rank_mask & 0x80000000) != 0) {
              squares[index++] = rank + file;
            }
            rank_mask <<= 1;
            file += 1;
          }
        }
        mask_high <<= 8;
        rank += 8;
      }
    }
So even though the improvement is very, very small, perhaps I can finally put the old algorithm to rest! :-)
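For comparison, the bitscan-driven serialization that replaces this kind of rank-by-rank loop generally looks like the sketch below, written in C to match the other snippets in this thread. __builtin_ctzll is the GCC/Clang intrinsic and merely stands in for whichever LSB routine is preferred; note that it indexes from bit 0, i.e. the opposite side from getSquaresOld above.

Code: Select all

#include <stdint.h>

/* Pull out one set bit per iteration instead of scanning ranks and files. */
static int getSquares(uint64_t mask, int squares[64]) {
    int count = 0;
    while (mask) {
        squares[count++] = __builtin_ctzll(mask);   /* index of the lowest set bit */
        mask &= mask - 1;                           /* clear that bit */
    }
    return count;                                   /* number of squares written */
}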

Thanks!

- Jesper
mambofish

Re: Fastest bitboard compress routine when you can't use ASM

Post by mambofish »

Hi Jesper,

Having now seen the posts elsewhere by Gerd and Pradu among others, it seems this algorithm is much more widely known than I had appreciated. :oops:

Gerd's 64-bit magic number code is neater, but slower than the code I posted. I imagine this is because I'm running on 32-bit hardware, where a 64-bit multiplication takes appreciably longer than a 32-bit one. On 64-bit hardware, Gerd's code might be faster.

However, in your case I don't think a 2.5-3% speedup will make much difference, because your program probably isn't spending more than 10-15% of its time in move generation anyway. (As you noticed, once you include the evaluation effort, the effect was reduced to 0.5% in your engine.)

At these tiny differences, I think code size and elegance become more important considerations. In fact I am tempted to try Gerd's algorithm anyway simply on these criteria, even though it is around 25% slower:

Code: Select all

test value: 805318656 (4 bits on)
bsf64_58: elapsed= 875 ms
bsf64_26: elapsed= 672 ms
test value: 805318656
(I tested over a range of values of course, but this is an indicative result. In fact I suspect it's quite important to base any decision on test values that are roughly representative of the average number of target squares a piece might have per move during the game, rather than on an outlying value such as 1 or 19.)
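A crude harness along those lines, sketched in C (bitScanForward is only a placeholder for whichever routine is under test; the GCC/Clang builtin is used here so the example compiles on its own, and the test value is the one from the run above):

Code: Select all

#include <stdint.h>
#include <stdio.h>
#include <time.h>

/* Placeholder - swap in the routine under test (bsf64_58, bsf64_26, ...). */
static int bitScanForward(uint64_t bb) {
    return __builtin_ctzll(bb);            /* bb must be nonzero */
}

int main(void) {
    const uint64_t testValue = 805318656ULL;   /* 4 bits set, as in the run above */
    const long iterations = 100000000L;
    volatile int sink = 0;                     /* keeps the loop from being optimized away */

    clock_t start = clock();
    for (long i = 0; i < iterations; i++) {
        uint64_t bb = testValue;
        while (bb) {                           /* serialize all set bits */
            sink += bitScanForward(bb);
            bb &= bb - 1;
        }
    }
    clock_t end = clock();

    printf("elapsed = %.0f ms (checksum %d)\n",
           (double)(end - start) * 1000.0 / CLOCKS_PER_SEC, sink);
    return 0;
}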

Kind regards
Vince
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Fastest bitboard compress routine when you can't use ASM

Post by Gerd Isenberg »

mambofish wrote:Hi Gerd

Thanks for the details. I don't know where you find this stuff, but it is very interesting!

Here's the Robert Harley reference I used for my code.

http://www.df.lth.se/~john_e/gems/gem0039.html

I found a reference to the table-based technique in "Hacker's Delight" which put me on to the idea. A Google search did the rest :-)
Hi Vince,

Thanks for the reference. But this is only a 32-bit bsf replacement, similar to a 32-bit De Bruijn multiplication but with no perfect hashing - that is, a 64-byte table of which only 32 entries are used.
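For contrast, a 32-bit De Bruijn multiplication with perfect hashing uses a table of exactly 32 entries, all of them occupied. A minimal sketch (the constant 0x077CB531 is the classic 32-bit De Bruijn sequence, not something taken from Harley's or Gerd's code):

Code: Select all

#include <stdint.h>

static int index32[32];              /* 32 entries, every one of them used */

/* 0x077CB531 is a 32-bit De Bruijn sequence: multiplying an isolated bit
   by it places a unique 5-bit pattern in the top bits, so the hash is
   perfect and the table needs exactly 32 slots. */
static void initIndex32(void) {
    for (int s = 0; s < 32; s++)
        index32[((uint32_t)1 << s) * 0x077CB531u >> 27] = s;
}

static int bitScanForward32(uint32_t v) {      /* v must be nonzero */
    return index32[(v & (0u - v)) * 0x077CB531u >> 27];
}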

Bit-scanning was an old hobby of mine and there are zillions of approaches; the archives, also in the Winboard programming forum, are saturated with bitscans.

As a "historical" reference two C-versions from my bitscan repository, Walter Faxon's folding trick with none perfect hashing and a direct calculation by test,setnz,lea chains, I came up with. This versions also clear the found bit:

Code: Select all

extern const BYTE LSB_64_table[154];  // bit number table
#define LSB_64_adj  -51               // offset to table base
#define LSB_64_magic  ( (UINT32)0x01C5FC81 )  // magic constant

// Author: Walter Faxon (in the 80ies, constant found by brute force)
int bitScanAndReset(BitBoard &bb)
{
    BitBoard t64;
    UINT32 t32;
    t64 = bb - 1;
    bb &= t64;         // omit this line to retain current LSB
    t64 ^= bb;
    t32 = (UINT32)t64 ^ (UINT32)(t64 >> 32);
    t32 ^= LSB_64_magic;
    t32 += t32 >> 16;
    t32 -= t32 >> 8;
    return LSB_64_table[LSB_64_adj + (BYTE)t32];
}

const BYTE LSB_64_table[154] =
{
#define __  0
    23,__,__,__,31,__,__,39,19,__, 17,16,18,__,47,10,20, 9, 8,11,
     1, 0, 2,57,56,58, 3,12,__,59, __,__,21,__, 4,__,__,60,__,__,
    __,__,__,13,__,__,__,__,__,__,  5,__,__,61,__,__,__,__,__,__,
    __,__,__,__,22,__,__,__,30,__, __,38,__,__,__,14,__,__,46,__,
    __,__, 6,__,__,62,__,__,__,54, __,__,__,__,__,__,__,__,__,__,
    29,__,__,37,__,__,__,__,__,__, 45,__,__,__,__,__,28,__,__,36,
    __,53,__,__,27,__,44,35,26,24, 25,34,32,33,43,40,41,52,42,15,
    __,50,48,49,__,51, 7,__,__,63, __,__,__,55
#undef __
};

Code: Select all

// Author: Gerd Isenberg (around 1995)
// Builds the 6-bit square index one binary digit at a time: each mask test
// contributes one bit, combined by shifting (*2) and xoring.
int bitScanAndReset(BitBoard &bb)
{
    BitBoard lsbb = bb & (-(__int64)bb);   // isolate the least significant one bit
    bb ^= lsbb;                            // clear it in the original board
    UINT32 lsb = LOWBOARD(lsbb) | HIGHBOARD(lsbb);
    return ((((((((((HIGHBOARD(lsbb)!=0) *2)
                ^((lsb & 0xffff0000)!=0))*2)
                ^((lsb & 0xff00ff00)!=0))*2)
                ^((lsb & 0xf0f0f0f0)!=0))*2)
                ^((lsb & 0xcccccccc)!=0))*2)
                ^((lsb & 0xaaaaaaaa)!=0);
}
Intel P4 and AMD K8, even in 64-bit mode, have very slow implementations of the bsf/bsr instructions that were fast on the PIII. The portable De Bruijn multiplication plus shift and lookup was (and, for K8 and others, still is) competitive, thanks to the fast multiplication on K8. For 32-bit systems Matt's routine was the favourite approach.

In the meantime, on the Core 2 Duo bsf/bsr have become fast again - two cycles latency and one cycle reciprocal throughput. K10 (and Intel as well) will have a fast lzcnt instruction, and x86-64 C compilers will have intrinsic functions to use it without assembly.
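For concreteness, this is roughly what such intrinsic-based scans look like - a sketch using MSVC's _BitScanForward64 and GCC/Clang's __builtin_ctzll (neither is named in the post above; they are simply the usual way to get a hardware bitscan without inline assembly):

Code: Select all

#include <stdint.h>

/* bb must be nonzero in both variants */
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
static int bitScanForward(uint64_t bb) {
    unsigned long idx;
    _BitScanForward64(&idx, bb);      /* compiles to a single bsf */
    return (int)idx;
}
#elif defined(__GNUC__)
static int bitScanForward(uint64_t bb) {
    return __builtin_ctzll(bb);       /* count trailing zeros = index of the LS1B */
}
#endif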
mambofish wrote: I am a little surprised at some of the timings presented. For example, I wouldn't use numberOfTrailingZeros. It relies on this binary chop code to isolate the LSB.

Code: Select all

    public static int numberOfTrailingZeros(long i) {
        // HD, Figure 5-14
        int x, y;
        if (i == 0) return 64;

        int n = 63;
        y = (int)i;  if (y != 0) { n = n - 32; x = y; } else x = (int)(i >>> 32);
        y = x << 16; if (y != 0) { n = n - 16; x = y; }
        y = x << 8;  if (y != 0) { n = n - 8;  x = y; }
        y = x << 4;  if (y != 0) { n = n - 4;  x = y; }
        y = x << 2;  if (y != 0) { n = n - 2;  x = y; }
        return n - ((x << 1) >>> 31);
    }
Each 1-bit therefore requires 5 tests to identify. I did try this technique and found it very much slower than the magic number method, so I am a little surprised at the timings you report. Nevertheless, I'll set up some tests again and confirm.

Regards,
Vince
Of course, in Java Long.numberOfTrailingZeros is slow on 32-bit machines - there are faster implementations for 64-bit systems - and it will become the fastest option with an appropriate 64-bit just-in-time compiler on a Core 2 Duo or K10, using the new instructions.

Cheers,
Gerd