Why not use PEXT? Since PEXT in one go needs huge lookup tables it might be slowed down due to cache pressure. However, doing PEXT per line would only need two 5/6 bit indices per square. I can't test this idea because I'm stuck with zen2. Will someone please test this idea.
It would look like this in pseudo code I suppose.
horizontal[PEXT()] | vertical{PEXT()]; // for rooks
diagonal[PEXT()] | antidiag[PEXT()]; // for bishops
I proposed both split index and even PEXT (before there was a PEXT) back in 2006.
It is overdue that both ideas are combined.
I repeat--will someone please do a test.
New idea for my split index approach
Moderator: Ras
-
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: New idea for my split index approach
Hey I think we already have this.
Essentially we use PEXT to hash for 4 rays instead of 2 and use split index
I have ZEN3 - we only need 16640 lookup slots:
Thomas Jahn had the same idea here:
https://www.talkchess.com/forum3/viewto ... &start=340
C++ version (disabled in main repo because not correct)
https://github.com/Gigantua/Chess_Moveg ... Sparse.hpp
Essentially we use PEXT to hash for 4 rays instead of 2 and use split index
I have ZEN3 - we only need 16640 lookup slots:
Code: Select all
Pext constexpr 788.439376 107904 [843kb] pext_u64 yes Zach Wegner https://www.chessprogramming.org/BMI2#PEXTBitboards
Sparse Pext 510.107352 16640 [130kb] pext_u64 no Thomas Jahn https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79049&start=340
https://www.talkchess.com/forum3/viewto ... &start=340
C++ version (disabled in main repo because not correct)
https://github.com/Gigantua/Chess_Moveg ... Sparse.hpp
Code: Select all
static inline uint64_t BishopAttacks(int square, uint64_t occupation)
{
int offset = square * 128;
uint64_t dSub = Subset[offset + (int)_pext_u64(occupation, DiagonalMask[square])];
uint64_t aSub = Subset[offset + 64 + (int)_pext_u64(occupation, AntiDiagonalMask[square])];
return dSub | aSub;
}
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: New idea for my split index approach
Okay thanks I remember now. However, using one array with an offset for the other line I have shown to be slower than using two arrays. So the C# version that was faster could be faster still. And the C++ version never got a test because of a bug. Therefore my OP is still valid. So let me rephrase my request. Could someone please do a proper test of this.
Thanks!
Thanks!
-
- Posts: 20
- Joined: Fri Mar 10, 2023 9:33 am
- Full name: Martin Novák
Re: New idea for my split index approach
Hello Mike,Mike Sherwin wrote: ↑Mon Jun 26, 2023 1:08 am Okay thanks I remember now. However, using one array with an offset for the other line I have shown to be slower than using two arrays. So the C# version that was faster could be faster still. And the C++ version never got a test because of a bug. Therefore my OP is still valid. So let me rephrase my request. Could someone please do a proper test of this.
Thanks!
I have already experimented with PEXT for rooks. You can still use there your hSubset and vSubset. But vSubset has to be changed. In your implementation you multiply occ & vmask[sq] with c2h7 diagonal but that then gives you 6 bits of blockers on rows in order: 7, 6, 5, 4, 3, 2 but pext gives it in order 2, 3, 4, 5, 6, 7 so I changed the vSubset to expect that in "pext" order. You can still use this 2d array in classic KGSSB just change the c2h7 diag for c7h2 antidiag. (I know that my explanation is terrible here I started writing something that is hopefully more clear https://github.com/martinnovaak/motor/b ... al-attacks). So for classic rook KGSSB and for Pext KGSSB you can use totally same 2d array.
I think that you had to change vmask and hmask to include all 6 bits. Here is my simple implementation of pext kgssb rook attacks: https://gist.github.com/martinnovaak/d5 ... 2a29f86e36. Actually when I tested it I didn't get any speedup from pext rook horizontal (pext was even slower on my CPU). I think that is because KGSSB horizontal rook is really fast, it is just bit shift and AND with first 6 bits. But again the implementation I sent is using 6 bits and as you said it could be reduced to 5.
Diagonal and antidiagonal pext wouldn't be so simple. I once tried to also implement it but my attempt failed and i didn't try to fix it again.
-
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: New idea for my split index approach
Thanks Martin, That is solid information for anyone willing to make a test for C++ on a PEXT efficient machine in a R.W.C.E.martinn wrote: ↑Mon Jun 26, 2023 2:55 pmHello Mike,Mike Sherwin wrote: ↑Mon Jun 26, 2023 1:08 am Okay thanks I remember now. However, using one array with an offset for the other line I have shown to be slower than using two arrays. So the C# version that was faster could be faster still. And the C++ version never got a test because of a bug. Therefore my OP is still valid. So let me rephrase my request. Could someone please do a proper test of this.
Thanks!
I have already experimented with PEXT for rooks. You can still use there your hSubset and vSubset. But vSubset has to be changed. In your implementation you multiply occ & vmask[sq] with c2h7 diagonal but that then gives you 6 bits of blockers on rows in order: 7, 6, 5, 4, 3, 2 but pext gives it in order 2, 3, 4, 5, 6, 7 so I changed the vSubset to expect that in "pext" order. You can still use this 2d array in classic KGSSB just change the c2h7 diag for c7h2 antidiag. (I know that my explanation is terrible here I started writing something that is hopefully more clear https://github.com/martinnovaak/motor/b ... al-attacks). So for classic rook KGSSB and for Pext KGSSB you can use totally same 2d array.
I think that you had to change vmask and hmask to include all 6 bits. Here is my simple implementation of pext kgssb rook attacks: https://gist.github.com/martinnovaak/d5 ... 2a29f86e36. Actually when I tested it I didn't get any speedup from pext rook horizontal (pext was even slower on my CPU). I think that is because KGSSB horizontal rook is really fast, it is just bit shift and AND with first 6 bits. But again the implementation I sent is using 6 bits and as you said it could be reduced to 5.
Diagonal and antidiagonal pext wouldn't be so simple. I once tried to also implement it but my attempt failed and i didn't try to fix it again.
Use KGSSB original horizontal line as it is the fastest.
Use PEXT on vertical, diagonal and antidiagonal lines.
I think someone can do this test in a few hours. I did months of work over many years on this not just for myself but for everyone. Can't someone spare a few hours work for me and everyone else including themselves to determine the truth. Or send me 2500 US dollars and I'll buy a 7950x system and run the test myself.

-
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: New idea for my split index approach
I do not have an efficient PEXT cpu. I can't afford to buy one. If I did or could I would test this method in a real world chess engine myself using C++. Are there no good guys/gals left? 

-
- Posts: 2695
- Joined: Tue Aug 30, 2016 8:19 pm
- Full name: Rasmus Althoff
Re: New idea for my split index approach
If you can provide C++ code that builds under Linux and has some way to do an A/B comparison test, I can run that on a Zen3 CPU which has proper PEXT.Mike Sherwin wrote: ↑Thu Jun 29, 2023 2:21 amI do not have an efficient PEXT cpu. I can't afford to buy one. If I did or could I would test this method in a real world chess engine myself using C++. Are there no good guys/gals left?
Rasmus Althoff
https://www.ct800.net
https://www.ct800.net
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: New idea for my split index approach
I can do this over the weekend. Martin Novák's implementation looks reasonable.
Also cool that somebody else is using inline constexpr arrays.
To repeat why this is cool:
Many situations like loop unrolling will make it obvious to the compiler that we are talking about for example Square = 5 at a certain point in the code.
So when we just have static arrays the contents are only known at runtime, but with constexpr the compiler can already have masks and precalculate some results for some invocations of movegen.
Code size is unaffected since we often times can do the old constexpr lambda instant invocation trick to define constexpr arrays too.
The real killer feature in computer chess will be Zen4 tho imo. Lets hope AVX512 gains some traction again since Intel removed it since 12th gen in consumer chips.
Over the weekend Split Index Pext promised. (calling it SIP)
Also cool that somebody else is using inline constexpr arrays.
To repeat why this is cool:
Many situations like loop unrolling will make it obvious to the compiler that we are talking about for example Square = 5 at a certain point in the code.
So when we just have static arrays the contents are only known at runtime, but with constexpr the compiler can already have masks and precalculate some results for some invocations of movegen.
Code size is unaffected since we often times can do the old constexpr lambda instant invocation trick to define constexpr arrays too.
The real killer feature in computer chess will be Zen4 tho imo. Lets hope AVX512 gains some traction again since Intel removed it since 12th gen in consumer chips.
Over the weekend Split Index Pext promised. (calling it SIP)
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: New idea for my split index approach
There is complete C# code in the links above and C++ code for the rook as well. Martin's C++ rook code is 27% faster than black magic on Daniel's test platform. But that is not a RWCE and does not (yet) simulate a tt. The rook uses less memory than black magic so in a RWCE I expect it to outperform BM rook by more than 27 percent. In Daniel's test engine there is complete source code for my Kindergarten Math version. It just needs translated to use PEXT. The problem with translation if I understand correctly is that Kindergarten reverses the bit order from bcdefg to gfedcb so that needs to be changed in the initialization to use PEXT as PEXT does not reverse the bit order.Ras wrote: ↑Thu Jun 29, 2023 10:24 amIf you can provide C++ code that builds under Linux and has some way to do an A/B comparison test, I can run that on a Zen3 CPU which has proper PEXT.Mike Sherwin wrote: ↑Thu Jun 29, 2023 2:21 amI do not have an efficient PEXT cpu. I can't afford to buy one. If I did or could I would test this method in a real world chess engine myself using C++. Are there no good guys/gals left?
Here is the complete non PEXT code.
Below your post Daniel has promised the C++ PEXT code by this weekend!
Code: Select all
// Datum : 21.01.2023; updated code as of 16.03.2023 (martinn)
// Author : Michael J Sherwin 2023
// Content: The name is Kindergarten Super SISSY Bitboards - https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81234&start=30
// The code can be further "minimized" according to the author
// C++20 constexpr port by Daniel Infuehr
#include <stdint.h>
namespace Chess_Lookup::KGSSB
{
static uint64_t vMask[64];
static uint64_t hMask[64];
static uint64_t dMask[64];
static uint64_t aMask[64];
static uint64_t vSubset[64][64];
static uint64_t hSubset[64][64];
static uint64_t dSubset[64][64];
static uint64_t aSubset[64][64];
// new lookup table for shifts in calculation of hIndex - martinn
static unsigned int shift_horizontal_table[64]; // new lookup table for shifts in calculation of hIndex
static constexpr uint64_t Size = (64 * 64 * 4 + 64 * 4) * sizeof(uint64_t);
enum { FILEa, FILEb, FILEc, FILEd, FILEe, FILEf, FILEg, FILEh };
enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };
static void InitSquare(int sq)
{
int ts, dx, dy;
uint64_t occ, index;
int x = sq % 8;
int y = sq / 8;
// Initialize Kindergarten Super SISSY Bitboards
// diagonals
for (ts = sq + 9, dx = x + 1, dy = y + 1; dx < FILEh && dy < RANK8; dMask[sq] |= 1ull << ts, ts += 9, dx++, dy++);
for (ts = sq - 9, dx = x - 1, dy = y - 1; dx > FILEa && dy > RANK1; dMask[sq] |= 1ull << ts, ts -= 9, dx--, dy--);
// anti diagonals
for (ts = sq + 7, dx = x - 1, dy = y + 1; dx > FILEa && dy < RANK8; aMask[sq] |= 1ull << ts, ts += 7, dx--, dy++);
for (ts = sq - 7, dx = x + 1, dy = y - 1; dx < FILEh && dy > RANK1; aMask[sq] |= 1ull << ts, ts -= 7, dx++, dy--);
// diagonal indexes
for (index = 0; index < 64; index++) {
dSubset[sq][index] = 0;
occ = index << 1;
if ((sq & 7) != FILEh && (sq >> 3) != RANK8) {
for (ts = sq + 9; ; ts += 9) {
dSubset[sq][index] |= (1ull << ts);
if (occ & (1ull << (ts & 7))) break;
if ((ts & 7) == FILEh || (ts >> 3) == RANK8) break;
}
}
if ((sq & 7) != FILEa && (sq >> 3) != RANK1) {
for (ts = sq - 9; ; ts -= 9) {
dSubset[sq][index] |= (1ull << ts);
if (occ & (1ull << (ts & 7))) break;
if ((ts & 7) == FILEa || (ts >> 3) == RANK1) break;
}
}
}
// anti diagonal indexes
for (index = 0; index < 64; index++) {
aSubset[sq][index] = 0;
occ = index << 1;
if ((sq & 7) != FILEa && (sq >> 3) != RANK8) {
for (ts = sq + 7; ; ts += 7) {
aSubset[sq][index] |= (1ull << ts);
if (occ & (1ull << (ts & 7))) break;
if ((ts & 7) == FILEa || (ts >> 3) == RANK8) break;
}
}
if ((sq & 7) != FILEh && (sq >> 3) != RANK1) {
for (ts = sq - 7; ; ts -= 7) {
aSubset[sq][index] |= (1ull << ts);
if (occ & (1ull << (ts & 7))) break;
if ((ts & 7) == FILEh || (ts >> 3) == RANK1) break;
}
}
}
// horizontal lines
for (ts = sq + 1, dx = x + 1; dx < FILEh; hMask[sq] |= 1ull << ts, ts += 1, dx++);
for (ts = sq - 1, dx = x - 1; dx > FILEa; hMask[sq] |= 1ull << ts, ts -= 1, dx--);
// vertical indexes
for (index = 0; index < 64; index++) {
vSubset[sq][index] = 0;
uint64_t blockers = 0;
for (int i = 0; i <= 5; i++) {
if (index & (1ull << i)) {
blockers |= (1ull << (((5 - i) << 3) + 8));
}
}
if ((sq >> 3) != RANK8) {
for (ts = sq + 8; ; ts += 8) {
vSubset[sq][index] |= (1ull << ts);
if (blockers & (1ull << (ts - (ts & 7)))) break;
if ((ts >> 3) == RANK8) break;
}
}
if ((sq >> 3) != RANK1) {
for (ts = sq - 8; ; ts -= 8) {
vSubset[sq][index] |= (1ull << ts);
if (blockers & (1ull << (ts - (ts & 7)))) break;
if ((ts >> 3) == RANK1) break;
}
}
}
// horizontal indexes
for (index = 0; index < 64; index++) {
hSubset[sq][index] = 0;
occ = index << 1;
if ((sq & 7) != FILEh) {
for (ts = sq + 1; ; ts += 1) {
hSubset[sq][index] |= (1ull << ts);
if (occ & (1ull << (ts & 7))) break;
if ((ts & 7) == FILEh) break;
}
}
if ((sq & 7) != FILEa) {
for (ts = sq - 1; ; ts -= 1) {
hSubset[sq][index] |= (1ull << ts);
if (occ & (1ull << (ts & 7))) break;
if ((ts & 7) == FILEa) break;
}
}
}
}
static void Init()
{
for (int i = 0; i < 64; i++) {
InitSquare(i);
}
for (int i = 0; i < 64; i++) {
shift_horizontal_table[i] = (i & 56) + 1;
}
}
static constexpr uint64_t file_b2_b7 = 0x0002020202020200ull;
static constexpr uint64_t file_a2_a7 = 0x0001010101010100ull;
static constexpr uint64_t diag_c2h7 = 0x0080402010080400ull;
static uint64_t Bishop(int sq, uint64_t occ)
{
return
dSubset[sq][(((occ & dMask[sq]) * file_b2_b7) >> 58)] |
aSubset[sq][(((occ & aMask[sq]) * file_b2_b7) >> 58)];
}
static uint64_t Rook(int sq, uint64_t occ)
{
return
hSubset[sq][(occ >> shift_horizontal_table[sq]) & 63] |
vSubset[sq][((((occ >> (sq & 7)) & file_a2_a7) * diag_c2h7) >> 58)];
}
static uint64_t Queen(int sq, uint64_t occ)
{
return Rook(sq, occ) | Bishop(sq, occ);
}
}
-
- Posts: 965
- Joined: Fri Aug 21, 2020 1:25 am
- Location: Planet Earth, Sol system
- Full name: Michael J Sherwin
Re: New idea for my split index approach
Thanks Daniel, that has got me excited! Not as excited as I used to get wooing a pretty girl but the next best thing.dangi12012 wrote: ↑Thu Jun 29, 2023 5:34 pm I can do this over the weekend. Martin Novák's implementation looks reasonable.
Also cool that somebody else is using inline constexpr arrays.
To repeat why this is cool:
Many situations like loop unrolling will make it obvious to the compiler that we are talking about for example Square = 5 at a certain point in the code.
So when we just have static arrays the contents are only known at runtime, but with constexpr the compiler can already have masks and precalculate some results for some invocations of movegen.
Code size is unaffected since we often times can do the old constexpr lambda instant invocation trick to define constexpr arrays too.
The real killer feature in computer chess will be Zen4 tho imo. Lets hope AVX512 gains some traction again since Intel removed it since 12th gen in consumer chips.
Over the weekend Split Index Pext promised. (calling it SIP)
