PEXT Bitboards

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

PEXT Bitboards

Post by Gerd Isenberg »

Haswell BMI2 PEXT replacement for Fancy Magic Bitboards with variable shift comes real now. PEXT r64, r64, r64 has latency of 3 and reciprocal throughput of 1 cycle (like imul r64,r64):

Code: Select all

U64 arrAttacks  [...]; // ~840 KByte all rook and bishop attacks
U64 arrRookBase  [64]; // arrAttacks base offset per rook square
U64 arrRookMask  [64]; // 10..12 relevant occupancy bits per rook square
U64 arrBishopBase[64]; // arrAttacks base offset per bishop square
U64 arrBishopMask[64]; //  5.. 9 relevant occupancy bits per bishop square

U64 rookAttack(U64 occ, enumSquare sq) {
  return arrAttacks[arrRookBase[sq] + _pext_u64(occ, arrRookMask[sq])];
}

U64 bishopAttack(U64 occ, enumSquare sq) {
  return arrAttacks[arrBishopBase[sq] + _pext_u64(occ, arrBishopMask[sq])];
}
http://users.atw.hu/instlatx64/GenuineI ... LatX64.txt
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: PEXT Bitboards

Post by syzygy »

Gerd Isenberg wrote:Haswell BMI2 PEXT replacement for Fancy Magic Bitboards with variable shift comes real now. PEXT r64, r64, r64 has latency of 3 and reciprocal throughput of 1 cycle (like imul r64,r64):

Code: Select all

U64 arrAttacks  [...]; // ~840 KByte all rook and bishop attacks
U64 arrRookBase  [64]; // arrAttacks base offset per rook square
U64 arrRookMask  [64]; // 10..12 relevant occupancy bits per rook square
U64 arrBishopBase[64]; // arrAttacks base offset per bishop square
U64 arrBishopMask[64]; //  5.. 9 relevant occupancy bits per bishop square

U64 rookAttack(U64 occ, enumSquare sq) {
  return arrAttacks[arrRookBase[sq] + _pext_u64(occ, arrRookMask[sq])];
}

U64 bishopAttack(U64 occ, enumSquare sq) {
  return arrAttacks[arrBishopBase[sq] + _pext_u64(occ, arrBishopMask[sq])];
}
http://users.atw.hu/instlatx64/GenuineI ... LatX64.txt
My tablebase generator (when compiled with -DBMI2) in addition uses pdep to reduce the table to 210.25 KB (a bit smaller is possible by reusing entries, but I haven't yet looked at that very well).

I haven't tested on a Haswell processor, but it works correctly with my own implementation of pext and pdep:

Code: Select all

uint64 bit&#91;64&#93;; // bit&#91;i&#93; = 1ULL << i;

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41;
&#123;
  uint64 res = 0;
  int i = 0;

  val &= mask;
  while &#40;val&#41; &#123;
    uint64 p = val & -val;
    uint64 q = mask & -mask;
    while &#40;q != p&#41; &#123;
      i++;
      mask &= mask - 1;
      q = mask & -mask;
    &#125;
    mask &= mask - 1;
    res |= bit&#91;i++&#93;;
    val &= val - 1;
  &#125;

  return res;
&#125;

uint64 _pdep_u64&#40;long64 val, long64 mask&#41;
&#123;
  uint64 res = 0;
  int i = 0;

  while &#40;mask&#41; &#123;
    if &#40;val & bit&#91;i++&#93;)
      res |= mask & -mask;
    mask &= mask - 1;
  &#125;

  return res;
&#125;
I couldn't easily get it faster (on my i7 in combination with my generator), but maybe someone has a better implementation.
Taking bit and incrementing i surprisingly was faster than uint64 bb = 1 and bb <<= 1 at each iteration.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: PEXT Bitboards

Post by Gerd Isenberg »

syzygy wrote: My tablebase generator (when compiled with -DBMI2) in addition uses pdep to reduce the table to 210.25 KB (a bit smaller is possible by reusing entries, but I haven't yet looked at that very well).

I haven't tested on a Haswell processor, but it works correctly with my own implementation of pext and pdep:

Code: Select all

uint64 bit&#91;64&#93;; // bit&#91;i&#93; = 1ULL << i;

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41;
&#123;
  uint64 res = 0;
  int i = 0;

  val &= mask;
  while &#40;val&#41; &#123;
    uint64 p = val & -val;
    uint64 q = mask & -mask;
    while &#40;q != p&#41; &#123;
      i++;
      mask &= mask - 1;
      q = mask & -mask;
    &#125;
    mask &= mask - 1;
    res |= bit&#91;i++&#93;;
    val &= val - 1;
  &#125;

  return res;
&#125;

uint64 _pdep_u64&#40;long64 val, long64 mask&#41;
&#123;
  uint64 res = 0;
  int i = 0;

  while &#40;mask&#41; &#123;
    if &#40;val & bit&#91;i++&#93;)
      res |= mask & -mask;
    mask &= mask - 1;
  &#125;

  return res;
&#125;
I couldn't easily get it faster (on my i7 in combination with my generator), but maybe someone has a better implementation.
Taking bit and incrementing i surprisingly was faster than uint64 bb = 1 and bb <<= 1 at each iteration.

Yes, funny that i++ and lookup is faster than bb+=bb (lea). May be number of code cachelines and alignment.

Your pext is effective with most zeros to extract. Otherwise, the loop body of the obvious approach is smaller and takes less registers.

Code: Select all

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41; 
&#123;
  uint64 res = 0; 
  for &#40;int i=0; mask; ++i&#41;
  &#123;
    if ( val & mask & -mask )
      res |= bit&#91;i&#93;;
    mask &= mask - 1; 
  &#125;
  return res; 
&#125;
syzygy
Posts: 5557
Joined: Tue Feb 28, 2012 11:56 pm

Re: PEXT Bitboards

Post by syzygy »

Gerd Isenberg wrote:Your pext is effective with most zeros to extract. Otherwise, the loop body of the obvious approach is smaller and takes less registers.

Code: Select all

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41; 
&#123;
  uint64 res = 0; 
  for &#40;int i=0; mask; ++i&#41;
  &#123;
    if ( val & mask & -mask )
      res |= bit&#91;i&#93;;
    mask &= mask - 1; 
  &#125;
  return res; 
&#125;
Oops, this is much faster :-)

I think I started with a simple but naive approach using bsf etc., then found some ways to speed it up, then realised I did not need bsf, but forgot to go back to the simple approach.
Jhoravi
Posts: 291
Joined: Wed May 08, 2013 6:49 am

Re: PEXT Bitboards

Post by Jhoravi »

Hi all,

Is BMI2 part of AVX2 extensions?
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: PEXT Bitboards

Post by Gerd Isenberg »

Jhoravi wrote:Hi all,

Is BMI2 part of AVX2 extensions?
It comes along with AVX2. It is related due to three-operand VEX prefix, but is more disjoint instruction set working on GP-registers rather than SIMD ymm registers.

https://chessprogramming.wikispaces.com/BMI2
Jhoravi
Posts: 291
Joined: Wed May 08, 2013 6:49 am

Re: PEXT Bitboards

Post by Jhoravi »

Hi,

It's stated at chessprogrammingwiki that the current Intel Haswell has no BMI2 yet. Does it still hold true today as of March 2014? Please recommend me a simple utilitiy to determine if the CPU has BMI2 to aid my planned purchase. And i just read that Gull 2.8 supports BMI2. whats your thoughts about it?

thanks
phenri
Posts: 284
Joined: Tue Aug 13, 2013 9:44 am

Re: PEXT Bitboards

Post by phenri »

Jhoravi wrote: It's stated at chessprogrammingwiki that the current Intel Haswell has no BMI2 yet. Does it still hold true today as of March 2014?
This is may be true for the celerons, Pentium and the i3 i5 i7 mobile serie

But not for Desktop i3, i5, i7 serie, and Xeon, because these references claim the contrary.

http://en.wikipedia.org/wiki/Bit_Manipu ... rting_CPUs
http://www.cpu-world.com/info/Intel/Intel_Core_i3.html
http://www.cpu-world.com/info/Intel/Intel_Core_i5.html
http://www.cpu-world.com/info/Intel/Intel_Core_i7.html
http://www.cpu-world.com/info/Intel/Intel_Xeon.html
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: PEXT Bitboards

Post by Gerd Isenberg »

Jhoravi wrote:Hi,

It's stated at chessprogrammingwiki that the current Intel Haswell has no BMI2 yet. Does it still hold true today as of March 2014? Please recommend me a simple utilitiy to determine if the CPU has BMI2 to aid my planned purchase. And i just read that Gull 2.8 supports BMI2. whats your thoughts about it?

thanks
Not current Haswell, but an early 2013 version as referred in the Stiller article of the German c't magazine 14/2013. From Intel's dev zone:
http://software.intel.com/en-us/article ... sor-family
In order to correctly use the new instructions and avoid runtime crashes, applications must properly detect hardware support for the new instructions using CPUID checks. It is important to understand that a new instruction is supported on a particular processor only if the corresponding CPUID feature flag is set. Applications must not assume support of any instruction set extension simply based on, for example, checking a CPU model or family and must instead always check for _all_ the feature CPUID bits of the instructions being used.

Code: Select all

CPUID.&#40;EAX=07H, ECX=0H&#41;&#58;EBX.BMI2&#91;bit 8&#93;==1
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: PEXT Bitboards

Post by Rein Halbersma »

Gerd Isenberg wrote:

Code: Select all

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41; 
&#123;
  uint64 res = 0; 
  for &#40;int i=0; mask; ++i&#41;
  &#123;
    if ( val & mask & -mask )
      res |= bit&#91;i&#93;;
    mask &= mask - 1; 
  &#125;
  return res; 
&#125;
For machines without BMI2 but with SSE 4.2, it should be faster to only loop over the bits to be extracted, and for each such bit count the number of mask bits behind it (performance is untested, but should be correct at least, hopefully)

Code: Select all

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41; 
&#123;
  uint64 res = 0; 
  for &#40;uint64 src = val & mask; src; src &= src - 1&#41;
  &#123;
    res |= bit&#91;popcnt&#40;(&#40;src & -src&#41; - 1&#41; & mask&#41;&#93;;
  &#125;
  return res; 
&#125;
The corresponding PDEP emulation using popcnt is left as an exercise for the reader :D (I am not near a machine to test it for the moment, and it's slightly more involved than the PEXT emulation, will post something next week)