New idea for my split index approach

Mike Sherwin · Post by **Mike Sherwin** » Sun Jun 25, 2023 10:27 pm

Why not use PEXT? Since PEXT in one go needs huge lookup tables it might be slowed down due to cache pressure. However, doing PEXT per line would only need two 5/6 bit indices per square. I can't test this idea because I'm stuck with zen2. Will someone please test this idea.

It would look like this in pseudo code I suppose.

horizontal[PEXT()] | vertical{PEXT()]; // for rooks

diagonal[PEXT()] | antidiag[PEXT()]; // for bishops

I proposed both split index and even PEXT (before there was a PEXT) back in 2006.
It is overdue that both ideas are combined.

I repeat--will someone please do a test.

dangi12012 · Post by **dangi12012** » Mon Jun 26, 2023 12:35 am

Hey I think we already have this.
Essentially we use PEXT to hash for 4 rays instead of 2 and use split index

I have ZEN3 - we only need 16640 lookup slots:

Code: Select all

Pext constexpr                     788.439376                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
Sparse Pext                        510.107352                    16640   [130kb]     pext_u64                 no        Thomas Jahn                                  https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79049&start=340

Thomas Jahn had the same idea here:
https://www.talkchess.com/forum3/viewto ... &start=340

C++ version (disabled in main repo because not correct)
https://github.com/Gigantua/Chess_Moveg ... Sparse.hpp

Code: Select all

    static inline uint64_t BishopAttacks(int square, uint64_t occupation)
    {
        int offset = square * 128;
        uint64_t dSub = Subset[offset + (int)_pext_u64(occupation, DiagonalMask[square])];
        uint64_t aSub = Subset[offset + 64 + (int)_pext_u64(occupation, AntiDiagonalMask[square])];
        return dSub | aSub;
    }

Mike Sherwin · Post by **Mike Sherwin** » Mon Jun 26, 2023 1:08 am

Okay thanks I remember now. However, using one array with an offset for the other line I have shown to be slower than using two arrays. So the C# version that was faster could be faster still. And the C++ version never got a test because of a bug. Therefore my OP is still valid. So let me rephrase my request. Could someone please do a proper test of this.

Thanks!

martinn · Post by **martinn** » Mon Jun 26, 2023 2:55 pm

Mike Sherwin wrote: ↑Mon Jun 26, 2023 1:08 am Okay thanks I remember now. However, using one array with an offset for the other line I have shown to be slower than using two arrays. So the C# version that was faster could be faster still. And the C++ version never got a test because of a bug. Therefore my OP is still valid. So let me rephrase my request. Could someone please do a proper test of this.

Thanks!

Hello Mike,
I have already experimented with PEXT for rooks. You can still use there your hSubset and vSubset. But vSubset has to be changed. In your implementation you multiply occ & vmask[sq] with c2h7 diagonal but that then gives you 6 bits of blockers on rows in order: 7, 6, 5, 4, 3, 2 but pext gives it in order 2, 3, 4, 5, 6, 7 so I changed the vSubset to expect that in "pext" order. You can still use this 2d array in classic KGSSB just change the c2h7 diag for c7h2 antidiag. (I know that my explanation is terrible here I started writing something that is hopefully more clear https://github.com/martinnovaak/motor/b ... al-attacks). So for classic rook KGSSB and for Pext KGSSB you can use totally same 2d array.

I think that you had to change vmask and hmask to include all 6 bits. Here is my simple implementation of pext kgssb rook attacks: https://gist.github.com/martinnovaak/d5 ... 2a29f86e36. Actually when I tested it I didn't get any speedup from pext rook horizontal (pext was even slower on my CPU). I think that is because KGSSB horizontal rook is really fast, it is just bit shift and AND with first 6 bits. But again the implementation I sent is using 6 bits and as you said it could be reduced to 5.

Diagonal and antidiagonal pext wouldn't be so simple. I once tried to also implement it but my attempt failed and i didn't try to fix it again.

Mike Sherwin · Post by **Mike Sherwin** » Mon Jun 26, 2023 5:39 pm

martinn wrote: ↑Mon Jun 26, 2023 2:55 pm
Mike Sherwin wrote: ↑Mon Jun 26, 2023 1:08 am Okay thanks I remember now. However, using one array with an offset for the other line I have shown to be slower than using two arrays. So the C# version that was faster could be faster still. And the C++ version never got a test because of a bug. Therefore my OP is still valid. So let me rephrase my request. Could someone please do a proper test of this.

Thanks!
Hello Mike,
I have already experimented with PEXT for rooks. You can still use there your hSubset and vSubset. But vSubset has to be changed. In your implementation you multiply occ & vmask[sq] with c2h7 diagonal but that then gives you 6 bits of blockers on rows in order: 7, 6, 5, 4, 3, 2 but pext gives it in order 2, 3, 4, 5, 6, 7 so I changed the vSubset to expect that in "pext" order. You can still use this 2d array in classic KGSSB just change the c2h7 diag for c7h2 antidiag. (I know that my explanation is terrible here I started writing something that is hopefully more clear https://github.com/martinnovaak/motor/b ... al-attacks). So for classic rook KGSSB and for Pext KGSSB you can use totally same 2d array.

I think that you had to change vmask and hmask to include all 6 bits. Here is my simple implementation of pext kgssb rook attacks: https://gist.github.com/martinnovaak/d5 ... 2a29f86e36. Actually when I tested it I didn't get any speedup from pext rook horizontal (pext was even slower on my CPU). I think that is because KGSSB horizontal rook is really fast, it is just bit shift and AND with first 6 bits. But again the implementation I sent is using 6 bits and as you said it could be reduced to 5.

Diagonal and antidiagonal pext wouldn't be so simple. I once tried to also implement it but my attempt failed and i didn't try to fix it again.

Thanks Martin, That is solid information for anyone willing to make a test for C++ on a PEXT efficient machine in a R.W.C.E.

Use KGSSB original horizontal line as it is the fastest.
Use PEXT on vertical, diagonal and antidiagonal lines.

I think someone can do this test in a few hours. I did months of work over many years on this not just for myself but for everyone. Can't someone spare a few hours work for me and everyone else including themselves to determine the truth. Or send me 2500 US dollars and I'll buy a 7950x system and run the test myself.

Mike Sherwin · Post by **Mike Sherwin** » Thu Jun 29, 2023 2:21 am

I do not have an efficient PEXT cpu. I can't afford to buy one. If I did or could I would test this method in a real world chess engine myself using C++. Are there no good guys/gals left?

Ras · Post by **Ras** » Thu Jun 29, 2023 10:24 am

Mike Sherwin wrote: ↑Thu Jun 29, 2023 2:21 amI do not have an efficient PEXT cpu. I can't afford to buy one. If I did or could I would test this method in a real world chess engine myself using C++. Are there no good guys/gals left?

If you can provide C++ code that builds under Linux and has some way to do an A/B comparison test, I can run that on a Zen3 CPU which has proper PEXT.

dangi12012 · Post by **dangi12012** » Thu Jun 29, 2023 5:34 pm

I can do this over the weekend. Martin Novák's implementation looks reasonable.

Also cool that somebody else is using inline constexpr arrays.

To repeat why this is cool:
Many situations like loop unrolling will make it obvious to the compiler that we are talking about for example Square = 5 at a certain point in the code.
So when we just have static arrays the contents are only known at runtime, but with constexpr the compiler can already have masks and precalculate some results for some invocations of movegen.

Code size is unaffected since we often times can do the old constexpr lambda instant invocation trick to define constexpr arrays too.

The real killer feature in computer chess will be Zen4 tho imo. Lets hope AVX512 gains some traction again since Intel removed it since 12th gen in consumer chips.

Over the weekend Split Index Pext promised. (calling it SIP)

Mike Sherwin · Post by **Mike Sherwin** » Thu Jun 29, 2023 7:01 pm

Ras wrote: ↑Thu Jun 29, 2023 10:24 am
Mike Sherwin wrote: ↑Thu Jun 29, 2023 2:21 amI do not have an efficient PEXT cpu. I can't afford to buy one. If I did or could I would test this method in a real world chess engine myself using C++. Are there no good guys/gals left?
If you can provide C++ code that builds under Linux and has some way to do an A/B comparison test, I can run that on a Zen3 CPU which has proper PEXT.

There is complete C# code in the links above and C++ code for the rook as well. Martin's C++ rook code is 27% faster than black magic on Daniel's test platform. But that is not a RWCE and does not (yet) simulate a tt. The rook uses less memory than black magic so in a RWCE I expect it to outperform BM rook by more than 27 percent. In Daniel's test engine there is complete source code for my Kindergarten Math version. It just needs translated to use PEXT. The problem with translation if I understand correctly is that Kindergarten reverses the bit order from bcdefg to gfedcb so that needs to be changed in the initialization to use PEXT as PEXT does not reverse the bit order.

Here is the complete non PEXT code.
Below your post Daniel has promised the C++ PEXT code by this weekend!

Code: Select all

// Datum  : 21.01.2023; updated code as of 16.03.2023 (martinn)
// Author : Michael J Sherwin 2023
// Content: The name is Kindergarten Super SISSY Bitboards - https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81234&start=30
//          The code can be further "minimized" according to the author
//			C++20 constexpr port by Daniel Infuehr

#include <stdint.h>

namespace Chess_Lookup::KGSSB
{
	static uint64_t vMask[64];
	static uint64_t hMask[64];
	static uint64_t dMask[64];
	static uint64_t aMask[64];
	static uint64_t vSubset[64][64];
	static uint64_t hSubset[64][64];
	static uint64_t dSubset[64][64];
	static uint64_t aSubset[64][64];

	// new lookup table for shifts in calculation of hIndex - martinn
	static unsigned int shift_horizontal_table[64]; // new lookup table for shifts in calculation of hIndex


	static constexpr uint64_t Size = (64 * 64 * 4 + 64 * 4) * sizeof(uint64_t);

	enum { FILEa, FILEb, FILEc, FILEd, FILEe, FILEf, FILEg, FILEh };

	enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };


	static void InitSquare(int sq)
	{
		int ts, dx, dy;
		uint64_t occ, index;
		int x = sq % 8;
		int y = sq / 8;

		// Initialize Kindergarten Super SISSY Bitboards
		// diagonals
		for (ts = sq + 9, dx = x + 1, dy = y + 1; dx < FILEh && dy < RANK8; dMask[sq] |= 1ull << ts, ts += 9, dx++, dy++);
		for (ts = sq - 9, dx = x - 1, dy = y - 1; dx > FILEa && dy > RANK1; dMask[sq] |= 1ull << ts, ts -= 9, dx--, dy--);

		// anti diagonals
		for (ts = sq + 7, dx = x - 1, dy = y + 1; dx > FILEa && dy < RANK8; aMask[sq] |= 1ull << ts, ts += 7, dx--, dy++);
		for (ts = sq - 7, dx = x + 1, dy = y - 1; dx < FILEh && dy > RANK1; aMask[sq] |= 1ull << ts, ts -= 7, dx++, dy--);

		// diagonal indexes
		for (index = 0; index < 64; index++) {
			dSubset[sq][index] = 0;
			occ = index << 1;
			if ((sq & 7) != FILEh && (sq >> 3) != RANK8) {
				for (ts = sq + 9; ; ts += 9) {
					dSubset[sq][index] |= (1ull << ts);
					if (occ & (1ull << (ts & 7))) break;
					if ((ts & 7) == FILEh || (ts >> 3) == RANK8) break;
				}
			}
			if ((sq & 7) != FILEa && (sq >> 3) != RANK1) {
				for (ts = sq - 9; ; ts -= 9) {
					dSubset[sq][index] |= (1ull << ts);
					if (occ & (1ull << (ts & 7))) break;
					if ((ts & 7) == FILEa || (ts >> 3) == RANK1) break;
				}
			}
		}

		// anti diagonal indexes
		for (index = 0; index < 64; index++) {
			aSubset[sq][index] = 0;
			occ = index << 1;
			if ((sq & 7) != FILEa && (sq >> 3) != RANK8) {
				for (ts = sq + 7; ; ts += 7) {
					aSubset[sq][index] |= (1ull << ts);
					if (occ & (1ull << (ts & 7))) break;
					if ((ts & 7) == FILEa || (ts >> 3) == RANK8) break;
				}
			}
			if ((sq & 7) != FILEh && (sq >> 3) != RANK1) {
				for (ts = sq - 7; ; ts -= 7) {
					aSubset[sq][index] |= (1ull << ts);
					if (occ & (1ull << (ts & 7))) break;
					if ((ts & 7) == FILEh || (ts >> 3) == RANK1) break;
				}
			}
		}

		// horizontal lines
		for (ts = sq + 1, dx = x + 1; dx < FILEh; hMask[sq] |= 1ull << ts, ts += 1, dx++);
		for (ts = sq - 1, dx = x - 1; dx > FILEa; hMask[sq] |= 1ull << ts, ts -= 1, dx--);

		// vertical indexes
		for (index = 0; index < 64; index++) {
			vSubset[sq][index] = 0;
			uint64_t blockers = 0;
			for (int i = 0; i <= 5; i++) {
				if (index & (1ull << i)) {
					blockers |= (1ull << (((5 - i) << 3) + 8));
				}
			}
			if ((sq >> 3) != RANK8) {
				for (ts = sq + 8; ; ts += 8) {
					vSubset[sq][index] |= (1ull << ts);
					if (blockers & (1ull << (ts - (ts & 7)))) break;
					if ((ts >> 3) == RANK8) break;
				}
			}
			if ((sq >> 3) != RANK1) {
				for (ts = sq - 8; ; ts -= 8) {
					vSubset[sq][index] |= (1ull << ts);
					if (blockers & (1ull << (ts - (ts & 7)))) break;
					if ((ts >> 3) == RANK1) break;
				}
			}
		}

		// horizontal indexes
		for (index = 0; index < 64; index++) {
			hSubset[sq][index] = 0;
			occ = index << 1;
			if ((sq & 7) != FILEh) {
				for (ts = sq + 1; ; ts += 1) {
					hSubset[sq][index] |= (1ull << ts);
					if (occ & (1ull << (ts & 7))) break;
					if ((ts & 7) == FILEh) break;
				}
			}
			if ((sq & 7) != FILEa) {
				for (ts = sq - 1; ; ts -= 1) {
					hSubset[sq][index] |= (1ull << ts);
					if (occ & (1ull << (ts & 7))) break;
					if ((ts & 7) == FILEa) break;
				}
			}
		}
	}

	static void Init()
	{
		for (int i = 0; i < 64; i++) {
			InitSquare(i);
		}

		for (int i = 0; i < 64; i++) {
		  shift_horizontal_table[i] = (i & 56) + 1;
		}
	}

	static constexpr uint64_t file_b2_b7 = 0x0002020202020200ull;
	static constexpr uint64_t file_a2_a7 = 0x0001010101010100ull;
	static constexpr uint64_t diag_c2h7 = 0x0080402010080400ull;

	static uint64_t Bishop(int sq, uint64_t occ)
	{
	  return
		dSubset[sq][(((occ & dMask[sq]) * file_b2_b7) >> 58)] |
		aSubset[sq][(((occ & aMask[sq]) * file_b2_b7) >> 58)];
	}

	static uint64_t Rook(int sq, uint64_t occ)
	{
	  return
		hSubset[sq][(occ >> shift_horizontal_table[sq]) & 63] |
		vSubset[sq][((((occ >> (sq & 7)) & file_a2_a7) * diag_c2h7) >> 58)];
	}


	static uint64_t Queen(int sq, uint64_t occ)
	{
		return Rook(sq, occ) | Bishop(sq, occ);
	}
	
}

Mike Sherwin · Post by **Mike Sherwin** » Thu Jun 29, 2023 7:21 pm

dangi12012 wrote: ↑Thu Jun 29, 2023 5:34 pm I can do this over the weekend. Martin Novák's implementation looks reasonable.

Also cool that somebody else is using inline constexpr arrays.

To repeat why this is cool:
Many situations like loop unrolling will make it obvious to the compiler that we are talking about for example Square = 5 at a certain point in the code.
So when we just have static arrays the contents are only known at runtime, but with constexpr the compiler can already have masks and precalculate some results for some invocations of movegen.

Code size is unaffected since we often times can do the old constexpr lambda instant invocation trick to define constexpr arrays too.

The real killer feature in computer chess will be Zen4 tho imo. Lets hope AVX512 gains some traction again since Intel removed it since 12th gen in consumer chips.

Over the weekend Split Index Pext promised. (calling it SIP)

Thanks Daniel, that has got me excited! Not as excited as I used to get wooing a pretty girl but the next best thing.

New idea for my split index approach

New idea for my split index approach

Re: New idea for my split index approach

Re: New idea for my split index approach

Re: New idea for my split index approach

Re: New idea for my split index approach

Re: New idea for my split index approach

Re: New idea for my split index approach

Re: New idea for my split index approach

Re: New idea for my split index approach

Re: New idea for my split index approach