Gerd Isenberg wrote: Haswell's BMI2 PEXT replacement for Fancy Magic Bitboards with variable shift is becoming real now. PEXT r64, r64, r64 has a latency of 3 cycles and a reciprocal throughput of 1 cycle (like imul r64, r64):
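To make the replacement concrete, here is a minimal sketch of a PEXT-indexed rook lookup. The table layout, names, and sizes are illustrative only (not from any particular engine), and a portable software pext stands in so it compiles without BMI2; on Haswell the lookup would call _pext_u64 instead:

```c
#include <stdint.h>

/* Portable stand-in for PEXT so the sketch compiles anywhere;
   on Haswell this whole function is a single _pext_u64 call. */
static uint64_t pext_sw(uint64_t val, uint64_t mask)
{
  uint64_t res = 0;
  for (int i = 0; mask; ++i) {
    if (val & mask & -mask)
      res |= 1ULL << i;
    mask &= mask - 1;
  }
  return res;
}

/* Hypothetical per-square entry: unlike fancy magics there is no
   magic multiplier and no variable shift -- the extracted occupancy
   bits are the table index directly. */
typedef struct {
  uint64_t mask;    /* relevant occupancy bits for this square */
  uint32_t offset;  /* base of this square's slice in the shared table */
} PextEntry;

static uint64_t rook_table[4096];   /* size illustrative only */
static PextEntry rook_entry[64];

static uint64_t rook_attacks(int sq, uint64_t occ)
{
  const PextEntry *e = &rook_entry[sq];
  return rook_table[e->offset + pext_sw(occ, e->mask)];
}
```

The win over fancy magics is that the per-square multiplier and shift disappear; only the mask and a base offset remain.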
My tablebase generator (when compiled with -DBMI2) also uses pdep to reduce the table to 210.25 KB (a bit smaller is possible by reusing entries, but I haven't looked into that very closely yet).
I haven't tested on a Haswell processor, but it works correctly with my own implementation of pext and pdep:
uint64 bit[64]; // bit[i] = 1ULL << i;

uint64 _pext_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  int i = 0;
  val &= mask;
  while (val) {
    uint64 p = val & -val;
    uint64 q = mask & -mask;
    while (q != p) {
      i++;
      mask &= mask - 1;
      q = mask & -mask;
    }
    mask &= mask - 1;
    res |= bit[i++];
    val &= val - 1;
  }
  return res;
}

uint64 _pdep_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  int i = 0;
  while (mask) {
    if (val & bit[i++])
      res |= mask & -mask;
    mask &= mask - 1;
  }
  return res;
}
I couldn't easily get it faster (on my i7 in combination with my generator), but maybe someone has a better implementation.
Using the bit[] lookup and incrementing i was surprisingly faster than uint64 bb = 1 with bb <<= 1 at each iteration.
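As a quick sanity check (my addition, not part of the post above), pext and pdep with the same mask should be mutually inverse on the low popcount(mask) bits. The harness below copies the two emulations in self-contained form, with bit[i] replaced by 1ULL << i so no table setup is needed:

```c
#include <stdint.h>

typedef uint64_t uint64;

/* Self-contained copy of the pext emulation above. */
static uint64 pext_emul(uint64 val, uint64 mask)
{
  uint64 res = 0;
  int i = 0;
  val &= mask;
  while (val) {
    uint64 p = val & -val;
    uint64 q = mask & -mask;
    while (q != p) {
      i++;
      mask &= mask - 1;
      q = mask & -mask;
    }
    mask &= mask - 1;
    res |= 1ULL << i++;
    val &= val - 1;
  }
  return res;
}

/* Self-contained copy of the pdep emulation above. */
static uint64 pdep_emul(uint64 val, uint64 mask)
{
  uint64 res = 0;
  int i = 0;
  while (mask) {
    if (val & (1ULL << i++))
      res |= mask & -mask;
    mask &= mask - 1;
  }
  return res;
}

/* pdep scatters the low bits of val into the mask's set positions;
   pext gathers them back, so pext(pdep(v, m), m) == v whenever
   v < 2^popcount(m). */
static int roundtrip_ok(uint64 mask, uint64 limit)
{
  for (uint64 v = 0; v < limit; ++v)
    if (pext_emul(pdep_emul(v, mask), mask) != v)
      return 0;
  return 1;
}
```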
syzygy wrote:
I couldn't easily get it faster (on my i7 in combination with my generator), but maybe someone has a better implementation.
Using the bit[] lookup and incrementing i was surprisingly faster than uint64 bb = 1 with bb <<= 1 at each iteration.
Yes, funny that i++ and a lookup are faster than bb += bb (lea). Maybe it's the number of code cache lines and alignment.
Your pext is effective when there are mostly zeros to extract. Otherwise, the loop body of the obvious approach is smaller and uses fewer registers.
Gerd Isenberg wrote:Your pext is effective with most zeros to extract. Otherwise, the loop body of the obvious approach is smaller and takes less registers.
uint64 _pext_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  for (int i = 0; mask; ++i)
  {
    if (val & mask & -mask)
      res |= bit[i];
    mask &= mask - 1;
  }
  return res;
}
Oops, this is much faster
I think I started with a simple but naive approach using bsf etc., then found some ways to speed it up, then realised I did not need bsf, but forgot to go back to the simple approach.
It comes along with AVX2. It is related through the three-operand VEX prefix, but it is a largely disjoint instruction set working on general-purpose registers rather than on SIMD ymm registers.
It's stated at the chessprogramming wiki that the current Intel Haswell has no BMI2 yet. Does this still hold true today, as of March 2014? Please recommend a simple utility to determine whether a CPU has BMI2, to aid my planned purchase. And I just read that Gull 2.8 supports BMI2. What are your thoughts about it?
In order to correctly use the new instructions and avoid runtime crashes, applications must properly detect hardware support for the new instructions using CPUID checks. It is important to understand that a new instruction is supported on a particular processor only if the corresponding CPUID feature flag is set. Applications must not assume support of any instruction set extension simply based on, for example, checking a CPU model or family and must instead always check for _all_ the feature CPUID bits of the instructions being used.
uint64 _pext_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  for (int i = 0; mask; ++i)
  {
    if (val & mask & -mask)
      res |= bit[i];
    mask &= mask - 1;
  }
  return res;
}
For machines without BMI2 but with SSE 4.2, it should be faster to loop only over the bits to be extracted, and for each such bit count the number of mask bits below it (performance is untested, but it should at least be correct, hopefully):
uint64 _pext_u64(uint64 val, uint64 mask)
{
  uint64 res = 0;
  for (uint64 src = val & mask; src; src &= src - 1)
  {
    res |= bit[popcnt(((src & -src) - 1) & mask)];
  }
  return res;
}
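Since that version is untested, here is a self-contained cross-check (my addition, not from the post) against the straightforward bit-by-bit loop, with __builtin_popcountll standing in for the hardware popcnt:

```c
#include <stdint.h>

/* Stand-in for the SSE 4.2 POPCNT instruction; under -mpopcnt this
   builtin compiles to the single instruction. */
static int popcnt(uint64_t x) { return __builtin_popcountll(x); }

/* Straightforward reference: walk every mask bit. */
static uint64_t pext_ref(uint64_t val, uint64_t mask)
{
  uint64_t res = 0;
  for (int i = 0; mask; ++i) {
    if (val & mask & -mask)
      res |= 1ULL << i;
    mask &= mask - 1;
  }
  return res;
}

/* The popcnt variant: walk only the bits to extract; the output slot
   of a source bit is the number of mask bits below it. */
static uint64_t pext_popcnt(uint64_t val, uint64_t mask)
{
  uint64_t res = 0;
  for (uint64_t src = val & mask; src; src &= src - 1)
    res |= 1ULL << popcnt(((src & -src) - 1) & mask);
  return res;
}

/* Compare the two on a stream of pseudo-random val/mask pairs. */
static int agree(int rounds)
{
  uint64_t x = 0x9E3779B97F4A7C15ULL;  /* arbitrary seed */
  while (rounds--) {
    uint64_t val  = (x = x * 6364136223846793005ULL + 1442695040888963407ULL);
    uint64_t mask = (x = x * 6364136223846793005ULL + 1442695040888963407ULL);
    if (pext_ref(val, mask) != pext_popcnt(val, mask))
      return 0;
  }
  return 1;
}
```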
The corresponding PDEP emulation using popcnt is left as an exercise for the reader. (I am not near a machine to test it at the moment, and it's slightly more involved than the PEXT emulation; I will post something next week.)