PEXT Bitboards

Discussion of chess software programming and technical issues.

Moderators: hgm, Harvey Williamson, bob

syzygy
Posts: 4221
Joined: Tue Feb 28, 2012 10:56 pm

Re: PEXT Bitboards

Post by syzygy » Fri Jun 07, 2013 9:57 pm

Gerd Isenberg wrote:Haswell BMI2 PEXT replacement for Fancy Magic Bitboards with variable shift comes real now. PEXT r64, r64, r64 has latency of 3 and reciprocal throughput of 1 cycle (like imul r64,r64):

Code: Select all

U64 arrAttacks  [...]; // ~840 KByte all rook and bishop attacks
U64 arrRookBase  [64]; // arrAttacks base offset per rook square
U64 arrRookMask  [64]; // 10..12 relevant occupancy bits per rook square
U64 arrBishopBase[64]; // arrAttacks base offset per bishop square
U64 arrBishopMask[64]; //  5.. 9 relevant occupancy bits per bishop square

U64 rookAttack(U64 occ, enumSquare sq) {
  return arrAttacks[arrRookBase[sq] + _pext_u64(occ, arrRookMask[sq])];
}

U64 bishopAttack(U64 occ, enumSquare sq) {
  return arrAttacks[arrBishopBase[sq] + _pext_u64(occ, arrBishopMask[sq])];
}
http://users.atw.hu/instlatx64/GenuineI ... LatX64.txt
My tablebase generator (when compiled with -DBMI2) in addition uses pdep to reduce the table to 210.25 KB (a bit smaller is possible by reusing entries, but I haven't yet looked at that very well).

I haven't tested on a Haswell processor, but it works correctly with my own implementation of pext and pdep:

Code: Select all

uint64 bit&#91;64&#93;; // bit&#91;i&#93; = 1ULL << i;

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41;
&#123;
  uint64 res = 0;
  int i = 0;

  val &= mask;
  while &#40;val&#41; &#123;
    uint64 p = val & -val;
    uint64 q = mask & -mask;
    while &#40;q != p&#41; &#123;
      i++;
      mask &= mask - 1;
      q = mask & -mask;
    &#125;
    mask &= mask - 1;
    res |= bit&#91;i++&#93;;
    val &= val - 1;
  &#125;

  return res;
&#125;

uint64 _pdep_u64&#40;long64 val, long64 mask&#41;
&#123;
  uint64 res = 0;
  int i = 0;

  while &#40;mask&#41; &#123;
    if &#40;val & bit&#91;i++&#93;)
      res |= mask & -mask;
    mask &= mask - 1;
  &#125;

  return res;
&#125;
I couldn't easily get it faster (on my i7 in combination with my generator), but maybe someone has a better implementation.
Taking bit and incrementing i surprisingly was faster than uint64 bb = 1 and bb <<= 1 at each iteration.

Gerd Isenberg
Posts: 2104
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: PEXT Bitboards

Post by Gerd Isenberg » Fri Jun 07, 2013 10:52 pm

syzygy wrote: My tablebase generator (when compiled with -DBMI2) in addition uses pdep to reduce the table to 210.25 KB (a bit smaller is possible by reusing entries, but I haven't yet looked at that very well).

I haven't tested on a Haswell processor, but it works correctly with my own implementation of pext and pdep:

Code: Select all

uint64 bit&#91;64&#93;; // bit&#91;i&#93; = 1ULL << i;

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41;
&#123;
  uint64 res = 0;
  int i = 0;

  val &= mask;
  while &#40;val&#41; &#123;
    uint64 p = val & -val;
    uint64 q = mask & -mask;
    while &#40;q != p&#41; &#123;
      i++;
      mask &= mask - 1;
      q = mask & -mask;
    &#125;
    mask &= mask - 1;
    res |= bit&#91;i++&#93;;
    val &= val - 1;
  &#125;

  return res;
&#125;

uint64 _pdep_u64&#40;long64 val, long64 mask&#41;
&#123;
  uint64 res = 0;
  int i = 0;

  while &#40;mask&#41; &#123;
    if &#40;val & bit&#91;i++&#93;)
      res |= mask & -mask;
    mask &= mask - 1;
  &#125;

  return res;
&#125;
I couldn't easily get it faster (on my i7 in combination with my generator), but maybe someone has a better implementation.
Taking bit and incrementing i surprisingly was faster than uint64 bb = 1 and bb <<= 1 at each iteration.

Yes, funny that i++ and lookup is faster than bb+=bb (lea). May be number of code cachelines and alignment.

Your pext is effective with most zeros to extract. Otherwise, the loop body of the obvious approach is smaller and takes less registers.

Code: Select all

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41; 
&#123;
  uint64 res = 0; 
  for &#40;int i=0; mask; ++i&#41;
  &#123;
    if ( val & mask & -mask )
      res |= bit&#91;i&#93;;
    mask &= mask - 1; 
  &#125;
  return res; 
&#125;

syzygy
Posts: 4221
Joined: Tue Feb 28, 2012 10:56 pm

Re: PEXT Bitboards

Post by syzygy » Fri Jun 07, 2013 11:49 pm

Gerd Isenberg wrote:Your pext is effective with most zeros to extract. Otherwise, the loop body of the obvious approach is smaller and takes less registers.

Code: Select all

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41; 
&#123;
  uint64 res = 0; 
  for &#40;int i=0; mask; ++i&#41;
  &#123;
    if ( val & mask & -mask )
      res |= bit&#91;i&#93;;
    mask &= mask - 1; 
  &#125;
  return res; 
&#125;
Oops, this is much faster :-)

I think I started with a simple but naive approach using bsf etc., then found some ways to speed it up, then realised I did not need bsf, but forgot to go back to the simple approach.

Jhoravi
Posts: 238
Joined: Wed May 08, 2013 4:49 am

Re: PEXT Bitboards

Post by Jhoravi » Thu Jun 13, 2013 12:12 pm

Hi all,

Is BMI2 part of AVX2 extensions?

Gerd Isenberg
Posts: 2104
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: PEXT Bitboards

Post by Gerd Isenberg » Thu Jun 13, 2013 1:53 pm

Jhoravi wrote:Hi all,

Is BMI2 part of AVX2 extensions?
It comes along with AVX2. It is related due to three-operand VEX prefix, but is more disjoint instruction set working on GP-registers rather than SIMD ymm registers.

https://chessprogramming.wikispaces.com/BMI2

Jhoravi
Posts: 238
Joined: Wed May 08, 2013 4:49 am

Re: PEXT Bitboards

Post by Jhoravi » Mon Mar 24, 2014 5:37 am

Hi,

It's stated at chessprogrammingwiki that the current Intel Haswell has no BMI2 yet. Does it still hold true today as of March 2014? Please recommend me a simple utilitiy to determine if the CPU has BMI2 to aid my planned purchase. And i just read that Gull 2.8 supports BMI2. whats your thoughts about it?

thanks

phenri
Posts: 284
Joined: Tue Aug 13, 2013 7:44 am

Re: PEXT Bitboards

Post by phenri » Mon Mar 24, 2014 6:10 am

Jhoravi wrote: It's stated at chessprogrammingwiki that the current Intel Haswell has no BMI2 yet. Does it still hold true today as of March 2014?
This is may be true for the celerons, Pentium and the i3 i5 i7 mobile serie

But not for Desktop i3, i5, i7 serie, and Xeon, because these references claim the contrary.

http://en.wikipedia.org/wiki/Bit_Manipu ... rting_CPUs
http://www.cpu-world.com/info/Intel/Intel_Core_i3.html
http://www.cpu-world.com/info/Intel/Intel_Core_i5.html
http://www.cpu-world.com/info/Intel/Intel_Core_i7.html
http://www.cpu-world.com/info/Intel/Intel_Xeon.html

Gerd Isenberg
Posts: 2104
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: PEXT Bitboards

Post by Gerd Isenberg » Mon Mar 24, 2014 4:07 pm

Jhoravi wrote:Hi,

It's stated at chessprogrammingwiki that the current Intel Haswell has no BMI2 yet. Does it still hold true today as of March 2014? Please recommend me a simple utilitiy to determine if the CPU has BMI2 to aid my planned purchase. And i just read that Gull 2.8 supports BMI2. whats your thoughts about it?

thanks
Not current Haswell, but an early 2013 version as referred in the Stiller article of the German c't magazine 14/2013. From Intel's dev zone:
http://software.intel.com/en-us/article ... sor-family
In order to correctly use the new instructions and avoid runtime crashes, applications must properly detect hardware support for the new instructions using CPUID checks. It is important to understand that a new instruction is supported on a particular processor only if the corresponding CPUID feature flag is set. Applications must not assume support of any instruction set extension simply based on, for example, checking a CPU model or family and must instead always check for _all_ the feature CPUID bits of the instructions being used.

Code: Select all

CPUID.&#40;EAX=07H, ECX=0H&#41;&#58;EBX.BMI2&#91;bit 8&#93;==1

Rein Halbersma
Posts: 672
Joined: Tue May 22, 2007 9:13 am

Re: PEXT Bitboards

Post by Rein Halbersma » Sun Jul 16, 2017 8:00 pm

Gerd Isenberg wrote:

Code: Select all

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41; 
&#123;
  uint64 res = 0; 
  for &#40;int i=0; mask; ++i&#41;
  &#123;
    if ( val & mask & -mask )
      res |= bit&#91;i&#93;;
    mask &= mask - 1; 
  &#125;
  return res; 
&#125;
For machines without BMI2 but with SSE 4.2, it should be faster to only loop over the bits to be extracted, and for each such bit count the number of mask bits behind it (performance is untested, but should be correct at least, hopefully)

Code: Select all

uint64 _pext_u64&#40;uint64 val, uint64 mask&#41; 
&#123;
  uint64 res = 0; 
  for &#40;uint64 src = val & mask; src; src &= src - 1&#41;
  &#123;
    res |= bit&#91;popcnt&#40;(&#40;src & -src&#41; - 1&#41; & mask&#41;&#93;;
  &#125;
  return res; 
&#125;
The corresponding PDEP emulation using popcnt is left as an exercise for the reader :D (I am not near a machine to test it for the moment, and it's slightly more involved than the PEXT emulation, will post something next week)

User avatar
lucasart
Posts: 2957
Joined: Mon May 31, 2010 11:29 am
Contact:

Re: PEXT Bitboards

Post by lucasart » Mon Jul 17, 2017 5:02 am

To make software PEXT useful for chess, it needs to be on par with magic multiplication. And I really don't think the above will be.

So, the choice is still between hardware PEXT and Magic.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.

Post Reply