32 bit versions for bitscan64

Desperado · Post by **Desperado** » Fri Aug 21, 2009 12:54 pm

Hi everyone,

I know http://chessprogramming.wikispaces.com/BitScan, where all sorts of bitscans are described, but I missed the following
approaches there (using _32-bit intrinsics_), and I couldn't find them elsewhere either.

So I thought I would post them.

Code:


//****************************************************************************
//* DESCRIPTION : 32Bit Version for _BitScanForward64                        *
//****************************************************************************

ULONG bsf64(BTB_T bb)
    {
     UI_32 *const ptr = (UI_32*)&bb;
     ULONG id=64;

     // Relies on id being left unchanged when the scanned dword is zero.
     _BitScanForward(&id,*(ptr+1));
     id+=32;
     _BitScanForward(&id,*ptr);

     return(id);
    }

//****************************************************************************
//* DESCRIPTION : 32Bit Version for _BitScanReverse64                        *
//****************************************************************************

ULONG bsr64(BTB_T bb)
    {
     UI_32 *const ptr = (UI_32*)&bb;
     ULONG id=64;

     _BitScanReverse(&id,*(ptr+1)) ? id+=32 : _BitScanReverse(&id,*ptr);

     return(id);
    }

These functions do (much) better than De Bruijn multiplication or double conversion (and some others), _even_ on my AMD K8 (where bitscans seem to be a horror).

Cheers

Desperado · Post by **Desperado** » Fri Aug 21, 2009 1:52 pm

Or an even faster forward scan...

Code:


//****************************************************************************
//* DESCRIPTION : 32Bit Version for _BitScanForward64                        *
//****************************************************************************

ULONG bsf64(BTB_T bb)
    {
     UI_32 *const ptr = (UI_32*)&bb;
     ULONG id=64;

     // C++ only: parses as cond ? id : (id += ...). The += needs the second
     // scan to have written id before id is read.
     _BitScanForward(&id,*ptr) ? id : id += 31 + _BitScanForward(&id,*(ptr+1));

     return(id);
    }

This can avoid the second scan...

with ASSERT(bb!=0) in all cases.
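For readers without the MSVC intrinsics, the split-into-halves idea above can be sketched portably. This is only an illustration under stated assumptions: `scan_forward32` and `scan_reverse32` are hypothetical stand-ins that mimic the return-flag-plus-index shape of `_BitScanForward` and `_BitScanReverse` with a plain loop, and the halves are taken with shifts rather than a pointer cast, so byte order does not matter. It also uses ordinary `if` statements, which sidesteps the question of whether `id` is read before or after the scan in an expression like `id += 31 + scan(&id, ...)`.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for _BitScanForward: writes the index of the lowest
// set bit to *index and returns 1, or returns 0 (leaving *index untouched)
// when mask is zero.
inline unsigned char scan_forward32(unsigned long* index, std::uint32_t mask) {
    if (mask == 0) return 0;
    unsigned long i = 0;
    while (((mask >> i) & 1u) == 0) ++i;
    *index = i;
    return 1;
}

// Hypothetical stand-in for _BitScanReverse (highest set bit).
inline unsigned char scan_reverse32(unsigned long* index, std::uint32_t mask) {
    if (mask == 0) return 0;
    unsigned long i = 31;
    while (((mask >> i) & 1u) == 0) --i;
    *index = i;
    return 1;
}

// Lowest set bit of a non-zero 64-bit value: try the low half first and
// scan the high half only when the low half is empty.
inline unsigned long bsf64_sketch(std::uint64_t bb) {
    assert(bb != 0);
    unsigned long id = 0;
    if (!scan_forward32(&id, static_cast<std::uint32_t>(bb))) {
        scan_forward32(&id, static_cast<std::uint32_t>(bb >> 32));
        id += 32;
    }
    return id;
}

// Highest set bit: try the high half first.
inline unsigned long bsr64_sketch(std::uint64_t bb) {
    assert(bb != 0);
    unsigned long id = 0;
    if (scan_reverse32(&id, static_cast<std::uint32_t>(bb >> 32))) {
        id += 32;
    } else {
        scan_reverse32(&id, static_cast<std::uint32_t>(bb));
    }
    return id;
}
```

The loop-based helpers are far slower than a hardware scan; they are here only so the control flow can be compiled and checked anywhere.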

mcostalba · Post by **mcostalba** » Fri Aug 21, 2009 2:46 pm

For 32 bit we use the following:

Code:

Square first_1(Bitboard b) {
  b ^= (b - 1);
  uint32_t fold = int(b) ^ int(b >> 32);
  return Square(BitTable[(fold * 0x783a9b23) >> 26]);
}

When you need to pop the first bit instead, i.e. scan and reset, then:

Code:

// Use type-punning
union b_union {

    Bitboard b;
    struct {
        uint32_t l;
        uint32_t h;
    } dw;
};

// WARNING: Needs -fno-strict-aliasing compiler option
Square pop_1st_bit(Bitboard* bb) {

  b_union u;
  uint32_t b;

  u.b = *bb;

  if (u.dw.l)
  {
      b = u.dw.l;
      *((uint32_t*)bb) = b & (b - 1);
      b ^= (b - 1);
  }
  else
  {
      b = u.dw.h;
      *((uint32_t*)bb+1) = b & (b - 1); // Little endian only?
      b = ~(b ^ (b - 1));
  }
  return Square(BitTable[(b * 0x783a9b23) >> 26]);
}

Do you think yours are faster?
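The scan-and-reset routine above can also be written without the union and pointer casts, which removes both the strict-aliasing caveat and the little-endian question. The following is a rough sketch, not the engine's actual code: the post does not show `BitTable`, so the hypothetical `FoldTable` here derives an equivalent lookup by hashing every single-bit board with the quoted multiplier, and `pop_1st_bit_sketch`, `fold_index`, and `kFoldMagic` are names made up for the illustration.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t Bitboard;

const std::uint32_t kFoldMagic = 0x783a9b23u;  // multiplier quoted in the post

// Table index for an isolated-and-folded 32-bit pattern.
inline unsigned fold_index(std::uint32_t fold) {
    return (fold * kFoldMagic) >> 26;
}

// Assumption: stands in for BitTable, which the post does not list.
// Filled by hashing each of the 64 single-bit boards.
struct FoldTable {
    int square[64];
    FoldTable() {
        for (int sq = 0; sq < 64; ++sq) {
            Bitboard b = Bitboard(1) << sq;
            b ^= b - 1;
            std::uint32_t fold = std::uint32_t(b) ^ std::uint32_t(b >> 32);
            square[fold_index(fold)] = sq;
        }
    }
};
static const FoldTable kFoldTable;

// Scan and reset the lowest set bit, using shifts instead of pointer casts.
inline int pop_1st_bit_sketch(Bitboard* bb) {
    assert(*bb != 0);
    std::uint32_t lo = std::uint32_t(*bb);
    std::uint32_t hi = std::uint32_t(*bb >> 32);
    std::uint32_t fold;
    if (lo) {
        fold = lo ^ (lo - 1);     // same value the 64-bit fold would give
        lo &= lo - 1;             // clear the lowest bit of the low half
    } else {
        fold = ~(hi ^ (hi - 1));  // low half of b ^ (b - 1) is all ones here
        hi &= hi - 1;             // clear the lowest bit of the high half
    }
    *bb = (Bitboard(hi) << 32) | lo;
    return kFoldTable.square[fold_index(fold)];
}
```

Whether a compiler turns the shift-and-recombine into the same two 32-bit stores as the cast version is something to measure rather than assume.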

Desperado · Post by **Desperado** » Fri Aug 21, 2009 3:24 pm

Hi Marco,

nice to meet you again.

Code:

//****************************************************************************
//* PROCEDURE   : bsf64                                                      *
//* DESCRIPTION : 32 bit version (Matt Taylor's Folding trick)               *
//****************************************************************************

ULONG bsf64_MattTaylor(BTB_T bb)
    {
     const int lsb_64_table[64] =
        {
         63, 30,  3, 32, 59, 14, 11, 33,
         60, 24, 50,  9, 55, 19, 21, 34,
         61, 29,  2, 53, 51, 23, 41, 18,
         56, 28,  1, 43, 46, 27,  0, 35,
         62, 31, 58,  4,  5, 49, 54,  6,
         15, 52, 12, 40,  7, 42, 45, 16,
         25, 57, 48, 13, 10, 39,  8, 44,
         20, 47, 38, 22, 17, 37, 36, 26
        };

     bb ^= (bb- 1);
     unsigned int folded = (int) bb ^ (bb >> 32);

     return lsb_64_table[folded * 0x78291ACF >> 26];
    }

Code:


//****************************************************************************
//* PROCEDURE   : bsf64                                                      *
//* DESCRIPTION : 32 bit version                                             *
//****************************************************************************

ULONG bsf64(BTB_T bb)
    {
     UI_32 *const ptr = (UI_32*)&bb;
     ULONG id=64;

     _BitScanForward(&id,*ptr) ? id : id += 31 + (bool)_BitScanForward(&id,*(ptr+1));

     return(id);
    }

It is at least 3x faster than Matt Taylor's version; I would say almost 4x.
And that on my _anti_-bitscan machine! (I hope I didn't do something totally wrong, but I don't think so.)

As you can see, I put a bool typecast in front of the bitscan, because the MS reference says that _BitScanForward returns 0 or a _nonzero_ value, and here it has to be exactly _1_, of course.

...

mcostalba · Post by **mcostalba** » Fri Aug 21, 2009 3:32 pm

Hi Michael,

Actually, the bitscan alone is not so important; it is seldom called.

Much more important is pop_1st_bit(), which is the _real_ routine used in 99.9% of cases.

How would you write that in this case? I can speed test mine and yours if you post it.

Thanks
Marco

Desperado · Post by **Desperado** » Fri Aug 21, 2009 3:41 pm

Hi Marco,

First I have to correct my last post!

With the table moved to global scope, everything changes!

The results now seem to be about equal (for the _bitscan alone_),
BUT that is on my _anti_-bitscan machine.

I will have a closer look at the pop_1st_bit routine and post my results.

cheers
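To illustrate the correction about table placement: a non-static local array can be rebuilt on the stack at every call, whereas a table at namespace scope is initialised once. A minimal sketch of the folding routine with the same table and multiplier as posted, moved out of the function, might look like this (`bsf64_folded` is a name chosen for the example, and fixed-width standard types replace `BTB_T` and `ULONG`):

```cpp
#include <cassert>
#include <cstdint>

// Table from the post, at namespace scope so it is set up once rather than
// on every call.
static const int lsb_64_table[64] = {
    63, 30,  3, 32, 59, 14, 11, 33,
    60, 24, 50,  9, 55, 19, 21, 34,
    61, 29,  2, 53, 51, 23, 41, 18,
    56, 28,  1, 43, 46, 27,  0, 35,
    62, 31, 58,  4,  5, 49, 54,  6,
    15, 52, 12, 40,  7, 42, 45, 16,
    25, 57, 48, 13, 10, 39,  8, 44,
    20, 47, 38, 22, 17, 37, 36, 26
};

// Folding trick: isolate the bits up to and including the lowest set bit,
// xor the two halves together, then hash the 32-bit result into the table.
inline int bsf64_folded(std::uint64_t bb) {
    assert(bb != 0);
    bb ^= bb - 1;
    std::uint32_t folded = std::uint32_t(bb) ^ std::uint32_t(bb >> 32);
    return lsb_64_table[(folded * 0x78291ACFu) >> 26];
}
```

Declaring the in-function table `static const` should have a similar effect; which variant a given compiler handles best is worth checking in the generated code.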

Aleks Peshkov · Post by **Aleks Peshkov** » Fri Aug 21, 2009 4:09 pm

Code:

    inline U32 lo(U64 b) { return small_cast<U32>(b); }
    inline U32 hi(U64 b) { return static_cast<U32>(b >> 32); }

    inline index_t bsf(U64 value) {
        U32 lo = ::lo(value);
        return bsf(lo ? lo : hi(value)) + (lo ? 0 : 32);
    }

Sorry for the C++-isms; a pure C version is trivial.
A smart compiler can generate branchless x86 code.

mcostalba · Post by **mcostalba** » Fri Aug 21, 2009 4:18 pm

Aleks Peshkov wrote:

Code:

    inline U32 lo(U64 b) { return small_cast<U32>(b); }
    inline U32 hi(U64 b) { return static_cast<U32>(b >> 32); }

    inline index_t bsf(U64 value) {
        U32 lo = ::lo(value);
        return bsf(lo ? lo : hi(value)) + (lo ? 0 : 32);
    }

Sorry for the C++-isms; a pure C version is trivial.
A smart compiler can generate branchless x86 code.

Thanks for your version.

I am very interested in a pop_first_1() function that is competitive with the one I posted above. Do you have any suggestions in this regard?

Thanks in advance
Marco

BTW, why didn't you write it directly like this?

Code:

    inline U32 lo(U64 b) { return small_cast<U32>(b); }
    inline U32 hi(U64 b) { return static_cast<U32>(b >> 32); }

    inline index_t bsf(U64 value) {
        U32 lo = ::lo(value);
        return lo ? bsf(lo) : 32 + bsf(hi(value));
    }
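The two formulations above rely on a 32-bit `bsf` overload that neither post shows. A self-contained sketch of both, with a plain loop standing in for the hardware scan, could look like the following. The typedefs, the loop-based `bsf(U32)`, and the name `bsf_branchy` are assumptions made for the example, and `static_cast` is used where the original has its own `small_cast`.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint32_t U32;
typedef std::uint64_t U64;
typedef unsigned index_t;

inline U32 lo(U64 b) { return static_cast<U32>(b); }
inline U32 hi(U64 b) { return static_cast<U32>(b >> 32); }

// 32-bit overload: a simple loop standing in for a hardware bit scan.
inline index_t bsf(U32 value) {
    assert(value != 0);
    index_t i = 0;
    while (((value >> i) & 1u) == 0) ++i;
    return i;
}

// 64-bit overload, select-then-scan form: exactly one scan plus two selects,
// which a compiler may turn into conditional moves.
inline index_t bsf(U64 value) {
    U32 l = lo(value);
    return bsf(l ? l : hi(value)) + (l ? 0u : 32u);
}

// Branching form from the reply: scan whichever half holds the lowest bit.
inline index_t bsf_branchy(U64 value) {
    U32 l = lo(value);
    return l ? bsf(l) : 32u + bsf(hi(value));
}
```

The inner `bsf` call is not recursion: its argument has type `U32`, so overload resolution picks the 32-bit version. Callers should pass values typed exactly as `U32` or `U64`, since an argument of a third integer type can make the call ambiguous.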

Aleks Peshkov · Post by **Aleks Peshkov** » Fri Aug 21, 2009 4:32 pm

I am using C++ streams-like operator overloading:

Code:

    Self& operator += (SelfArg a) { assert ((*this & a).none()); return *this ^= a; }
    Self& operator -= (SelfArg a) { assert ((*this & a) == a);   return *this ^= a; }
    Self& operator /= (SelfArg a) { *this |= a; return *this -= a; } //b = b & ~a

    friend bool operator >> (Self& self, Index& i) {
        if (self.none()) { return false; }
        i = self.first();
        self -= i;
        return true;
    }

Generated code is far from optimal, but I like to manage compact STL-like code:

Code:

    PieceSet pieces = op.nonPawns();
    while (pieces >> moved_piece) {
        move_from = op[moved_piece];
        BitBoard nonCaptures = op.movesOf(moved_piece) / occupied; //my and-not operator
        while (nonCaptures >> move_to) {
            CUT(makeMove());
        }
    }
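The extraction-operator idiom can be tried in isolation with a stripped-down set type. The sketch below is not the poster's class: `SquareSet`, its members, and `sum_of_indices` are hypothetical names, and the lowest bit is found with a plain loop instead of a bit scan.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical minimal set-of-squares type for illustration only.
class SquareSet {
    std::uint64_t bits_;
public:
    explicit SquareSet(std::uint64_t bits) : bits_(bits) {}
    bool none() const { return bits_ == 0; }

    // Index of the lowest set bit (simple loop for portability).
    unsigned first() const {
        assert(!none());
        unsigned i = 0;
        while (((bits_ >> i) & 1u) == 0) ++i;
        return i;
    }

    // Stream-style extraction: pops the lowest element into i and reports
    // whether there was one, so it can drive a while loop.
    friend bool operator>>(SquareSet& self, unsigned& i) {
        if (self.none()) return false;
        i = self.first();
        self.bits_ &= self.bits_ - 1;  // clear the bit just reported
        return true;
    }
};

// Example driver: sums the indices popped from the set.
inline unsigned sum_of_indices(SquareSet s) {
    unsigned sum = 0, sq = 0;
    while (s >> sq) sum += sq;
    return sum;
}
```

The loop condition doubles as the emptiness test, which is what keeps the calling code compact.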

Desperado · Post by **Desperado** » Fri Aug 21, 2009 4:49 pm

Aleks Peshkov wrote:

Code:

    inline U32 lo(U64 b) { return small_cast<U32>(b); }
    inline U32 hi(U64 b) { return static_cast<U32>(b >> 32); }

    inline index_t bsf(U64 value) {
        U32 lo = ::lo(value);
        return bsf(lo ? lo : hi(value)) + (lo ? 0 : 32);
    }

Sorry for the C++-isms; a pure C version is trivial.
A smart compiler can generate branchless x86 code.

Hi Alex,

maybe my mind is totally blocked, but I have some questions about
the above code.

1. Why does bsf call itself?
2. I cannot find a small_cast operator (although I know what it should do).
3. What is the difference between the _trivial_ C version and the C++ version? In other words, what makes the C++ version more complex than
the pure C version?
4. I thought that explicitly putting the (smaller) type in front of a value of a wider type is enough for a narrowing type conversion.

I hope this text is understandable.

Thx

EDIT: I just saw your last post, so I don't need an answer to 3.
