pdep/pext for 128-bit integers and bit-arrays

Rein Halbersma · Post by **Rein Halbersma** » Fri Dec 03, 2021 8:52 am

How can one write efficient versions of pdep_u128 and pext_u128? I'm aware of the serial versions for pdep_u64 and pext_u64:
https://www.chessprogramming.org/BMI2#S ... ementation
https://www.chessprogramming.org/BMI2#S ... entation_2

On platforms where there are 128-bit integers, all the arithmetic operations are available so the serial implementations should extend to 128-bit integers as well. But this isn't as straightforward for u64[2] or the more general version for bit-arrays u64[N].

So I wonder if it's possible to combine the 64-bit intrinsics into combined 128-bit versions (and suitable generalizations for bit-arrays). For things like popcount/countl_zero/countr_zero it's relatively easy to write bit-array versions as simple loops over the bit-arrays. But pdep/pext seem to have "carry" properties much like multiplication.

Did anyone ever experiment with this?

Sopel · Post by **Sopel** » Fri Dec 03, 2021 12:32 pm

I'm not certain, but these should work. Untested.

Code: Select all

void pext128(int128_t a, int128_t b)
{
    int128_t out0 = pext(a.low64, b.low64);
    int128_t out1 = pext(a.high64, b.high64);
    return out0 | (out1 << popcount(b.low64));
}

Code: Select all

void pdep128(int128_t a, int128_t b)
{
    int128_t out0 = pdep(a.low64, b.low64);
    int128_t out1 = pdep((a >> popcount(b.low64)).low64, b.high64);
    return out0 | (out1 << 64);
}

Rein Halbersma · Post by **Rein Halbersma** » Fri Dec 03, 2021 1:55 pm

Sopel wrote: ↑Fri Dec 03, 2021 12:32 pm I'm not certain, but these should work. Untested.

Code: Select all

void pext128(int128_t a, int128_t b)
{
    int128_t out0 = pext(a.low64, b.low64);
    int128_t out1 = pext(a.high64, b.high64);
    return out0 | (out1 << popcount(b.low64));
}

Code: Select all

void pdep128(int128_t a, int128_t b)
{
    int128_t out0 = pdep(a.low64, b.low64);
    int128_t out1 = pdep((a >> popcount(b.low64)).low64, b.high64);
    return out0 | (out1 << 64);
}

Interesting ideas! However, suppose a.low64 has more bits than b.low64 can accomodate, then the remaining bits in a.low64 should have priority to fill b.high64 over the bits in a.high64, no?

Sopel · Post by **Sopel** » Fri Dec 03, 2021 2:09 pm

Rein Halbersma wrote: ↑Fri Dec 03, 2021 1:55 pm
Sopel wrote: ↑Fri Dec 03, 2021 12:32 pm I'm not certain, but these should work. Untested.
Code: Select all
void pext128(int128_t a, int128_t b)
{
    int128_t out0 = pext(a.low64, b.low64);
    int128_t out1 = pext(a.high64, b.high64);
    return out0 | (out1 << popcount(b.low64));
}
Code: Select all
void pdep128(int128_t a, int128_t b)
{
    int128_t out0 = pdep(a.low64, b.low64);
    int128_t out1 = pdep((a >> popcount(b.low64)).low64, b.high64);
    return out0 | (out1 << 64);
}
Interesting ideas! However, suppose a.low64 has more bits than b.low64 can accomodate, then the remaining bits in a.low64 should have priority to fill b.high64 over the bits in a.high64, no?

I assume you mean in pdep? That's I shift the a by the amount of bits taken according to b.low64, such that the next bits to take start from the least significant bit. To clarify, b is the mask in this code.

dangi12012 · Post by **dangi12012** » Fri Dec 03, 2021 2:13 pm

Rein Halbersma wrote: ↑Fri Dec 03, 2021 1:55 pm
Sopel wrote: ↑Fri Dec 03, 2021 12:32 pm I'm not certain, but these should work. Untested.
Code: Select all
void pext128(int128_t a, int128_t b)
{
    int128_t out0 = pext(a.low64, b.low64);
    int128_t out1 = pext(a.high64, b.high64);
    return out0 | (out1 << popcount(b.low64));
}
Code: Select all
void pdep128(int128_t a, int128_t b)
{
    int128_t out0 = pdep(a.low64, b.low64);
    int128_t out1 = pdep((a >> popcount(b.low64)).low64, b.high64);
    return out0 | (out1 << 64);
}
Interesting ideas! However, suppose a.low64 has more bits than b.low64 can accomodate, then the remaining bits in a.low64 should have priority to fill b.high64 over the bits in a.high64, no?

Sopels solution is sound. You have 2x 64 bit masks and the solution is to pext/pedp the lower part - and the higher part needs to be shifted by the bits you already extracted/deposited. It would be clearer if inedexes are used and not the names.
Its the perfect solution.

I personally would use __m128 registers unioned with uint64_t generally as you also get access to other cool instructions there - but i am biased towards desktop x86-x64 environments.

Code: Select all

typedef union __declspec(intrin_type) __declspec(align(16)) __m128 {
     float               m128_f32[4];
     unsigned __int64    m128_u64[2];
     __int8              m128_i8[16];
     __int16             m128_i16[8];
     __int32             m128_i32[4];
     __int64             m128_i64[2];
     unsigned __int8     m128_u8[16];
     unsigned __int16    m128_u16[8];
     unsigned __int32    m128_u32[4];
 } __m128;

Rein Halbersma · Post by **Rein Halbersma** » Fri Dec 03, 2021 3:08 pm

Sopel wrote: ↑Fri Dec 03, 2021 2:09 pm
Rein Halbersma wrote: ↑Fri Dec 03, 2021 1:55 pm
Interesting ideas! However, suppose a.low64 has more bits than b.low64 can accomodate, then the remaining bits in a.low64 should have priority to fill b.high64 over the bits in a.high64, no?
I assume you mean in pdep? That's I shift the a by the amount of bits taken according to b.low64, such that the next bits to take start from the least significant bit. To clarify, b is the mask in this code.

I think you're right. I'll have time to test it later this weekend hopefully.

pdep/pext for 128-bit integers and bit-arrays

pdep/pext for 128-bit integers and bit-arrays

Re: pdep/pext for 128-bit integers and bit-arrays

Re: pdep/pext for 128-bit integers and bit-arrays

Re: pdep/pext for 128-bit integers and bit-arrays

Re: pdep/pext for 128-bit integers and bit-arrays

Re: pdep/pext for 128-bit integers and bit-arrays