How can one write efficient versions of pdep_u128 and pext_u128? I'm aware of the serial versions for pdep_u64 and pext_u64:
https://www.chessprogramming.org/BMI2#S ... ementation
https://www.chessprogramming.org/BMI2#S ... entation_2
On platforms where there are 128-bit integers, all the arithmetic operations are available so the serial implementations should extend to 128-bit integers as well. But this isn't as straightforward for u64[2] or the more general version for bit-arrays u64[N].
So I wonder if it's possible to combine the 64-bit intrinsics into combined 128-bit versions (and suitable generalizations for bit-arrays). For things like popcount/countl_zero/countr_zero it's relatively easy to write bit-array versions as simple loops over the bit-arrays. But pdep/pext seem to have "carry" properties much like multiplication.
Did anyone ever experiment with this?
pdep/pext for 128-bit integers and bit-arrays
Moderators: hgm, Rebel, chrisw
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
-
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: pdep/pext for 128-bit integers and bit-arrays
I'm not certain, but these should work. Untested.
Code: Select all
void pext128(int128_t a, int128_t b)
{
int128_t out0 = pext(a.low64, b.low64);
int128_t out1 = pext(a.high64, b.high64);
return out0 | (out1 << popcount(b.low64));
}
Code: Select all
void pdep128(int128_t a, int128_t b)
{
int128_t out0 = pdep(a.low64, b.low64);
int128_t out1 = pdep((a >> popcount(b.low64)).low64, b.high64);
return out0 | (out1 << 64);
}
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
Re: pdep/pext for 128-bit integers and bit-arrays
Interesting ideas! However, suppose a.low64 has more bits than b.low64 can accomodate, then the remaining bits in a.low64 should have priority to fill b.high64 over the bits in a.high64, no?Sopel wrote: ↑Fri Dec 03, 2021 12:32 pm I'm not certain, but these should work. Untested.
Code: Select all
void pext128(int128_t a, int128_t b) { int128_t out0 = pext(a.low64, b.low64); int128_t out1 = pext(a.high64, b.high64); return out0 | (out1 << popcount(b.low64)); }
Code: Select all
void pdep128(int128_t a, int128_t b) { int128_t out0 = pdep(a.low64, b.low64); int128_t out1 = pdep((a >> popcount(b.low64)).low64, b.high64); return out0 | (out1 << 64); }
-
- Posts: 389
- Joined: Tue Oct 08, 2019 11:39 pm
- Full name: Tomasz Sobczyk
Re: pdep/pext for 128-bit integers and bit-arrays
I assume you mean in pdep? That's I shift the a by the amount of bits taken according to b.low64, such that the next bits to take start from the least significant bit. To clarify, b is the mask in this code.Rein Halbersma wrote: ↑Fri Dec 03, 2021 1:55 pmInteresting ideas! However, suppose a.low64 has more bits than b.low64 can accomodate, then the remaining bits in a.low64 should have priority to fill b.high64 over the bits in a.high64, no?Sopel wrote: ↑Fri Dec 03, 2021 12:32 pm I'm not certain, but these should work. Untested.
Code: Select all
void pext128(int128_t a, int128_t b) { int128_t out0 = pext(a.low64, b.low64); int128_t out1 = pext(a.high64, b.high64); return out0 | (out1 << popcount(b.low64)); }
Code: Select all
void pdep128(int128_t a, int128_t b) { int128_t out0 = pdep(a.low64, b.low64); int128_t out1 = pdep((a >> popcount(b.low64)).low64, b.high64); return out0 | (out1 << 64); }
dangi12012 wrote:No one wants to touch anything you have posted. That proves you now have negative reputations since everyone knows already you are a forum troll.
Maybe you copied your stockfish commits from someone else too?
I will look into that.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: pdep/pext for 128-bit integers and bit-arrays
Sopels solution is sound. You have 2x 64 bit masks and the solution is to pext/pedp the lower part - and the higher part needs to be shifted by the bits you already extracted/deposited. It would be clearer if inedexes are used and not the names.Rein Halbersma wrote: ↑Fri Dec 03, 2021 1:55 pmInteresting ideas! However, suppose a.low64 has more bits than b.low64 can accomodate, then the remaining bits in a.low64 should have priority to fill b.high64 over the bits in a.high64, no?Sopel wrote: ↑Fri Dec 03, 2021 12:32 pm I'm not certain, but these should work. Untested.
Code: Select all
void pext128(int128_t a, int128_t b) { int128_t out0 = pext(a.low64, b.low64); int128_t out1 = pext(a.high64, b.high64); return out0 | (out1 << popcount(b.low64)); }
Code: Select all
void pdep128(int128_t a, int128_t b) { int128_t out0 = pdep(a.low64, b.low64); int128_t out1 = pdep((a >> popcount(b.low64)).low64, b.high64); return out0 | (out1 << 64); }
Its the perfect solution.
I personally would use __m128 registers unioned with uint64_t generally as you also get access to other cool instructions there - but i am biased towards desktop x86-x64 environments.
Code: Select all
typedef union __declspec(intrin_type) __declspec(align(16)) __m128 {
float m128_f32[4];
unsigned __int64 m128_u64[2];
__int8 m128_i8[16];
__int16 m128_i16[8];
__int32 m128_i32[4];
__int64 m128_i64[2];
unsigned __int8 m128_u8[16];
unsigned __int16 m128_u16[8];
unsigned __int32 m128_u32[4];
} __m128;
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 741
- Joined: Tue May 22, 2007 11:13 am
Re: pdep/pext for 128-bit integers and bit-arrays
I think you're right. I'll have time to test it later this weekend hopefully.Sopel wrote: ↑Fri Dec 03, 2021 2:09 pmI assume you mean in pdep? That's I shift the a by the amount of bits taken according to b.low64, such that the next bits to take start from the least significant bit. To clarify, b is the mask in this code.Rein Halbersma wrote: ↑Fri Dec 03, 2021 1:55 pm
Interesting ideas! However, suppose a.low64 has more bits than b.low64 can accomodate, then the remaining bits in a.low64 should have priority to fill b.high64 over the bits in a.high64, no?