Gerd Isenberg wrote: Haswell's BMI2 PEXT replacement for Fancy Magic Bitboards with variable shift is becoming real now. PEXT r64, r64, r64 has a latency of 3 cycles and a reciprocal throughput of 1 cycle (like imul r64, r64):
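To make the replacement concrete, here is a minimal sketch of a PEXT-indexed rook lookup. The table layout, names, and sizes are illustrative only (not from any particular engine), and a portable software pext stands in so it compiles without BMI2; on Haswell the lookup would call _pext_u64 instead:

```c
#include <stdint.h>

/* Portable stand-in for PEXT so the sketch compiles anywhere;
   on Haswell this whole function is a single _pext_u64 call. */
static uint64_t pext_sw(uint64_t val, uint64_t mask)
{
  uint64_t res = 0;
  for (int i = 0; mask; ++i) {
    if (val & mask & -mask)
      res |= 1ULL << i;
    mask &= mask - 1;
  }
  return res;
}

/* Hypothetical per-square entry: unlike fancy magics there is no
   magic multiplier and no variable shift -- the extracted occupancy
   bits are the table index directly. */
typedef struct {
  uint64_t mask;    /* relevant occupancy bits for this square */
  uint32_t offset;  /* base of this square's slice in the shared table */
} PextEntry;

static uint64_t rook_table[4096];   /* size illustrative only */
static PextEntry rook_entry[64];

static uint64_t rook_attacks(int sq, uint64_t occ)
{
  const PextEntry *e = &rook_entry[sq];
  return rook_table[e->offset + pext_sw(occ, e->mask)];
}
```

The win over fancy magics is that the per-square multiplier and shift disappear; only the mask and a base offset remain.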
My tablebase generator (when compiled with -DBMI2) also uses pdep to reduce the table to 210.25 KB (a bit smaller is possible by reusing entries, but I haven't looked into that very closely yet).
I haven't tested on a Haswell processor, but it works correctly with my own implementation of pext and pdep:
uint64 bit[64]; // bit[i] = 1ULL << i;

uint64 _pext_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  int i = 0;
  val &= mask;
  while (val) {
    uint64 p = val & -val;
    uint64 q = mask & -mask;
    while (q != p) {
      i++;
      mask &= mask - 1;
      q = mask & -mask;
    }
    mask &= mask - 1;
    res |= bit[i++];
    val &= val - 1;
  }
  return res;
}

uint64 _pdep_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  int i = 0;
  while (mask) {
    if (val & bit[i++])
      res |= mask & -mask;
    mask &= mask - 1;
  }
  return res;
}
I couldn't easily get it faster (on my i7 in combination with my generator), but maybe someone has a better implementation.
Using the bit[] lookup and incrementing i was surprisingly faster than uint64 bb = 1 with bb <<= 1 at each iteration.
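As a quick sanity check (my addition, not part of the post above), pext and pdep with the same mask should be mutually inverse on the low popcount(mask) bits. The harness below copies the two emulations in self-contained form, with bit[i] replaced by 1ULL << i so no table setup is needed:

```c
#include <stdint.h>

typedef uint64_t uint64;

/* Self-contained copy of the pext emulation above. */
static uint64 pext_emul(uint64 val, uint64 mask)
{
  uint64 res = 0;
  int i = 0;
  val &= mask;
  while (val) {
    uint64 p = val & -val;
    uint64 q = mask & -mask;
    while (q != p) {
      i++;
      mask &= mask - 1;
      q = mask & -mask;
    }
    mask &= mask - 1;
    res |= 1ULL << i++;
    val &= val - 1;
  }
  return res;
}

/* Self-contained copy of the pdep emulation above. */
static uint64 pdep_emul(uint64 val, uint64 mask)
{
  uint64 res = 0;
  int i = 0;
  while (mask) {
    if (val & (1ULL << i++))
      res |= mask & -mask;
    mask &= mask - 1;
  }
  return res;
}

/* pdep scatters the low bits of val into the mask's set positions;
   pext gathers them back, so pext(pdep(v, m), m) == v whenever
   v < 2^popcount(m). */
static int roundtrip_ok(uint64 mask, uint64 limit)
{
  for (uint64 v = 0; v < limit; ++v)
    if (pext_emul(pdep_emul(v, mask), mask) != v)
      return 0;
  return 1;
}
```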
syzygy wrote:
I couldn't easily get it faster (on my i7 in combination with my generator), but maybe someone has a better implementation.
Using the bit[] lookup and incrementing i was surprisingly faster than uint64 bb = 1 with bb <<= 1 at each iteration.
Yes, funny that i++ and a lookup are faster than bb += bb (lea). Maybe it's the number of code cache lines and alignment.
Your pext is effective when there are mostly zeros to extract. Otherwise, the loop body of the obvious approach is smaller and uses fewer registers.
Gerd Isenberg wrote:Your pext is effective with most zeros to extract. Otherwise, the loop body of the obvious approach is smaller and takes less registers.
uint64 _pext_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  for (int i = 0; mask; ++i)
  {
    if (val & mask & -mask)
      res |= bit[i];
    mask &= mask - 1;
  }
  return res;
}
Oops, this is much faster
I think I started with a simple but naive approach using bsf etc., then found some ways to speed it up, then realised I did not need bsf, but forgot to go back to the simple approach.
It comes along with AVX2. It is related through the three-operand VEX prefix, but it is a largely disjoint instruction set working on general-purpose registers rather than on SIMD ymm registers.
It's stated at the chessprogramming wiki that the current Intel Haswell has no BMI2 yet. Does this still hold true today, as of March 2014? Please recommend a simple utility to determine whether a CPU has BMI2, to aid my planned purchase. And I just read that Gull 2.8 supports BMI2. What are your thoughts about it?
In order to correctly use the new instructions and avoid runtime crashes, applications must properly detect hardware support for the new instructions using CPUID checks. It is important to understand that a new instruction is supported on a particular processor only if the corresponding CPUID feature flag is set. Applications must not assume support of any instruction set extension simply based on, for example, checking a CPU model or family and must instead always check for _all_ the feature CPUID bits of the instructions being used.
uint64 _pext_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  for (int i = 0; mask; ++i)
  {
    if (val & mask & -mask)
      res |= bit[i];
    mask &= mask - 1;
  }
  return res;
}
For machines without BMI2 but with SSE 4.2, it should be faster to loop only over the bits to be extracted, and for each such bit count the number of mask bits below it (performance is untested, but it should at least be correct, hopefully):
uint64 _pext_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  for (uint64 src = val & mask; src; src &= src - 1)
  {
    res |= bit[popcnt(((src & -src) - 1) & mask)];
  }
  return res;
}
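Since that version is untested, here is a self-contained cross-check (my addition, not from the post) against the straightforward bit-by-bit loop, with __builtin_popcountll standing in for the hardware popcnt:

```c
#include <stdint.h>

/* Stand-in for the SSE 4.2 POPCNT instruction; under -mpopcnt this
   builtin compiles to the single instruction. */
static int popcnt(uint64_t x) { return __builtin_popcountll(x); }

/* Straightforward reference: walk every mask bit. */
static uint64_t pext_ref(uint64_t val, uint64_t mask)
{
  uint64_t res = 0;
  for (int i = 0; mask; ++i) {
    if (val & mask & -mask)
      res |= 1ULL << i;
    mask &= mask - 1;
  }
  return res;
}

/* The popcnt variant: walk only the bits to extract; the output slot
   of a source bit is the number of mask bits below it. */
static uint64_t pext_popcnt(uint64_t val, uint64_t mask)
{
  uint64_t res = 0;
  for (uint64_t src = val & mask; src; src &= src - 1)
    res |= 1ULL << popcnt(((src & -src) - 1) & mask);
  return res;
}

/* Compare the two on a stream of pseudo-random val/mask pairs. */
static int agree(int rounds)
{
  uint64_t x = 0x9E3779B97F4A7C15ULL;  /* arbitrary seed */
  while (rounds--) {
    uint64_t val  = (x = x * 6364136223846793005ULL + 1442695040888963407ULL);
    uint64_t mask = (x = x * 6364136223846793005ULL + 1442695040888963407ULL);
    if (pext_ref(val, mask) != pext_popcnt(val, mask))
      return 0;
  }
  return 1;
}
```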
The corresponding PDEP emulation using popcnt is left as an exercise for the reader. (I am not near a machine to test it at the moment, and it's slightly more involved than the PEXT emulation; I will post something next week.)