Combining two of Bob's classic bitboard attack getters

dangi12012 · Post by **dangi12012** » Mon Dec 13, 2021 9:41 pm

tcusr wrote: ↑Mon Dec 13, 2021 9:03 pm
dangi12012 wrote: ↑Mon Dec 13, 2021 8:32 pm
tcusr wrote: ↑Mon Dec 13, 2021 7:11 pm but the problem is that, as hgm rightfully says, most engines spend most of their time on captures so they won't even generate the rest of quiet moves.
a quick way to generate captures is much more useful, both for qsearch and staged move generation
Nah hgm also said on other threads with bitboards you can select capturing moves with a single "and operation" on the enemy occupation. So you just do atk & enemy and have the moves to search first.

Its that branching 24 times for a queen like you said above is much much much more expensive than looking her atk set up in 2 operations and then with one AND you get the right moves. I tried it because the recurive code in my signature would be perfect for that.

Anyways tcusr I need your help on understanding this line of code:
uint64_t blocked_down = 0x7FFFFFFFFFFFFFFFull >> std::countl_zero(block & mask | 1ull);

I made the inner core for arithmetic much cleaner now because I saw some intrinsics are hidden in there:
For example there was X & (1 << p) - 1 which perfecly maps to BZHI. Not all compilers find that. Now its another 5% faster and the code shrank by 5 lines.

If we find an elegant solution for blocked_down part it would be perfect. I can feel that there is an elegant solution hidden in there somewhere. I dont know where that 0x7FFFFFFFFFFFFFFFull comes from and why to shift by that amount. The lookup for any slider is getting close to 3 lines of code with a lookup table of 2Kb.
So a small table and short code without too many dependencies. I like that a lot.
Code: Select all
	/* Start of code */
	static const inline uint64_t slide_arithmetic(int p, uint64_t block) {
		//BZHI
		//[src & (1 << inx) - 1] ;
		// split the line into upper and lower rays
		uint64_t mask = _bzhi_u64(block, p);

		// for the bottom we use CLZ + a shift to fill in from the top
		uint64_t blocked_down = 0x7FFFFFFFFFFFFFFFull >> std::countl_zero(block & mask | 1ull);

		//_blsmsk_u64 = X^X-1
		// the intersection of the two is the move set after masking with the line
		return (_blsmsk_u64(block & ~mask) ^ blocked_down);
	}

	static const inline uint64_t Queen(uint64_t s, uint64_t occ)
	{
		const uint64_t* r = rank_mask.data() + 4 * s;
		return slide_arithmetic(s, r[0] & occ) & r[0]
			 ^ slide_arithmetic(s, r[1] & occ) & r[1]
			 ^ slide_arithmetic(s, r[2] & occ) & r[2]
			 ^ slide_arithmetic(s, r[3] & occ) & r[3];
	}
tbh i have trouble understanding the code too, but i can try
this is 0x7FFFFFFFFFFFFFFFull
Code: Select all
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X |   | 8
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 7
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 6
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 5
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 4
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 3
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 2
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 1
+---+---+---+---+---+---+---+---+
  a   b   c   d   e   f   g   h
by shifting right it makes sure to only fill the bottom of the first encountered piece (msb), but idk if it's right

So this sounds like BEXTR to me. But only between some bits. I dont understand it -.-
https://en.wikipedia.org/wiki/X86_Bit_m ... uction_set

block is the masked line
block & mask is the masked lower half of the line (positive ray)
block & mask | 1ull i dont understand

Isnt there a trick to get positive rays faster?

tcusr · Post by **tcusr** » Mon Dec 13, 2021 9:48 pm

dangi12012 wrote: ↑Mon Dec 13, 2021 9:41 pm
tcusr wrote: ↑Mon Dec 13, 2021 9:03 pm
dangi12012 wrote: ↑Mon Dec 13, 2021 8:32 pm
tcusr wrote: ↑Mon Dec 13, 2021 7:11 pm but the problem is that, as hgm rightfully says, most engines spend most of their time on captures so they won't even generate the rest of quiet moves.
a quick way to generate captures is much more useful, both for qsearch and staged move generation
Nah hgm also said on other threads with bitboards you can select capturing moves with a single "and operation" on the enemy occupation. So you just do atk & enemy and have the moves to search first.

Its that branching 24 times for a queen like you said above is much much much more expensive than looking her atk set up in 2 operations and then with one AND you get the right moves. I tried it because the recurive code in my signature would be perfect for that.

Anyways tcusr I need your help on understanding this line of code:
uint64_t blocked_down = 0x7FFFFFFFFFFFFFFFull >> std::countl_zero(block & mask | 1ull);

I made the inner core for arithmetic much cleaner now because I saw some intrinsics are hidden in there:
For example there was X & (1 << p) - 1 which perfecly maps to BZHI. Not all compilers find that. Now its another 5% faster and the code shrank by 5 lines.

If we find an elegant solution for blocked_down part it would be perfect. I can feel that there is an elegant solution hidden in there somewhere. I dont know where that 0x7FFFFFFFFFFFFFFFull comes from and why to shift by that amount. The lookup for any slider is getting close to 3 lines of code with a lookup table of 2Kb.
So a small table and short code without too many dependencies. I like that a lot.
Code: Select all
	/* Start of code */
	static const inline uint64_t slide_arithmetic(int p, uint64_t block) {
		//BZHI
		//[src & (1 << inx) - 1] ;
		// split the line into upper and lower rays
		uint64_t mask = _bzhi_u64(block, p);

		// for the bottom we use CLZ + a shift to fill in from the top
		uint64_t blocked_down = 0x7FFFFFFFFFFFFFFFull >> std::countl_zero(block & mask | 1ull);

		//_blsmsk_u64 = X^X-1
		// the intersection of the two is the move set after masking with the line
		return (_blsmsk_u64(block & ~mask) ^ blocked_down);
	}

	static const inline uint64_t Queen(uint64_t s, uint64_t occ)
	{
		const uint64_t* r = rank_mask.data() + 4 * s;
		return slide_arithmetic(s, r[0] & occ) & r[0]
			 ^ slide_arithmetic(s, r[1] & occ) & r[1]
			 ^ slide_arithmetic(s, r[2] & occ) & r[2]
			 ^ slide_arithmetic(s, r[3] & occ) & r[3];
	}
tbh i have trouble understanding the code too, but i can try
this is 0x7FFFFFFFFFFFFFFFull
Code: Select all
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X |   | 8
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 7
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 6
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 5
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 4
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 3
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 2
+---+---+---+---+---+---+---+---+
| X | X | X | X | X | X | X | X | 1
+---+---+---+---+---+---+---+---+
  a   b   c   d   e   f   g   h
by shifting right it makes sure to only fill the bottom of the first encountered piece (msb), but idk if it's right
So this sounds like BEXTR to me. But only between some bits. I dont understand it -.-
https://en.wikipedia.org/wiki/X86_Bit_m ... uction_set

block is the masked line
block & mask is the masked lower half of the line (positive ray)
block & mask | 1ull i dont understand

Isnt there a trick to get positive rays faster?

this is for negative rays, for positive he uses x ^ (x-1)
look at this https://www.chessprogramming.org/Square ... iderations (Little-Endian Rank-File Mapping), left and bottom are negative rays

block & mask is used to isolate the bottom ray, the ' | 1' i think is for the clz instruction to avoid branches if there aren't blockers

dangi12012 · Post by **dangi12012** » Mon Dec 13, 2021 10:03 pm

tcusr wrote: ↑Mon Dec 13, 2021 9:48 pm this is for negative rays, for positive he uses x ^ (x-1)
look at this https://www.chessprogramming.org/Square ... iderations (Little-Endian Rank-File Mapping), left and bottom are negative rays

block & mask is used to isolate the bottom ray, the ' | 1' i think is for the clz instruction to avoid branches if there aren't blockers

So your thinking is perfect. That | 1 is so that clz never returns 64. The mask is made with 1 bit missing on the top right. (somehow this works)
So - the question is: Do we have a trick that works with negative rays?

Something that would be smaller than 0x7FFFFFFFFFFFFFFFull >> std::countl_zero(block & mask | 1ull):
"Shift by" "Count Zeroes of this expression" is not so nice in terms of dependencies and complexity of instructions.

tcusr · Post by **tcusr** » Mon Dec 13, 2021 10:14 pm

dangi12012 wrote: ↑Mon Dec 13, 2021 10:03 pm
tcusr wrote: ↑Mon Dec 13, 2021 9:48 pm this is for negative rays, for positive he uses x ^ (x-1)
look at this https://www.chessprogramming.org/Square ... iderations (Little-Endian Rank-File Mapping), left and bottom are negative rays

block & mask is used to isolate the bottom ray, the ' | 1' i think is for the clz instruction to avoid branches if there aren't blockers
So your thinking is perfect. That | 1 is so that clz never returns 64. The mask is made with 1 bit missing on the top right. (somehow this works)
So - the question is: Do we have a trick that works with negative rays?

Something that would be smaller than 0x7FFFFFFFFFFFFFFFull >> std::countl_zero(block & mask | 1ull):
"Shift by" "Count Zeroes of this expression" is not so nice in terms of dependencies and complexity of instructions.

i think the missing bit is used to include the msb since the '| 1' is always done, i'm not sure

i have not found no such trick for negative rays in the wiki, hyperbola quintessence byteswaps (a vertical flip) the occupancy to generate them but this doesn't work for ranks because they don't change position.

dangi12012 · Post by **dangi12012** » Mon Dec 13, 2021 10:21 pm

tcusr wrote: ↑Mon Dec 13, 2021 10:14 pm
dangi12012 wrote: ↑Mon Dec 13, 2021 10:03 pm
tcusr wrote: ↑Mon Dec 13, 2021 9:48 pm this is for negative rays, for positive he uses x ^ (x-1)
look at this https://www.chessprogramming.org/Square ... iderations (Little-Endian Rank-File Mapping), left and bottom are negative rays

block & mask is used to isolate the bottom ray, the ' | 1' i think is for the clz instruction to avoid branches if there aren't blockers
So your thinking is perfect. That | 1 is so that clz never returns 64. The mask is made with 1 bit missing on the top right. (somehow this works)
So - the question is: Do we have a trick that works with negative rays?

Something that would be smaller than 0x7FFFFFFFFFFFFFFFull >> std::countl_zero(block & mask | 1ull):
"Shift by" "Count Zeroes of this expression" is not so nice in terms of dependencies and complexity of instructions.
i think the missing bit is used to include the msb since the '| 1' is always done, i'm not sure

i have not found no such trick for negative rays in the wiki, hyperbola quintessence byteswaps (a vertical flip) the occupancy to generate them but this doesn't work for ranks because they don't change position.

Well thats a shame. I am sure that a pext + table lookup is slower. That table would have a little bit more than 2^7 entries.

dangi12012 · Post by **dangi12012** » Wed Dec 22, 2021 6:19 pm

Comparsion code for the dozen or so algos will be released soon.
For bob lookup I ended up with this code:
Could someone point me to the code for rook/bishop only?
(since queen is really just rook | bishop)

Code: Select all

struct Rays {
		uint64_t rayNW;
		uint64_t rayNN;
		uint64_t rayNE;
		uint64_t rayEE;
		uint64_t raySE;
		uint64_t raySS;
		uint64_t raySW;
		uint64_t rayWW;
		uint64_t rwsNW;
		uint64_t rwsNN;
		uint64_t rwsNE;
		uint64_t rwsEE;
		uint64_t rwsSE;
		uint64_t rwsSS;
		uint64_t rwsSW;
		uint64_t rwsWW;
	};

	consteval std::array<Rays,64> Initialize() {
		enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };
		enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };

		int sq, ts, file, rank, c;
		std::array<Rays, 64> ray{};
		for (sq = 0; sq <= 63; sq++) {
			file = sq & 7;
			rank = sq >> 3;

			// Northwest
			ray[sq].rayNW = 0;
			for (c = 1, ts = sq + 7; file - c >= FILE1 && rank + c <= RANK8; c++, ts += 7) ray[sq].rayNW |= 1ull << ts;
			ray[sq].rwsNW = ray[sq].rayNW | 0x8000000000000000;

			// Northeast
			ray[sq].rayNE = 0;
			for (c = 1, ts = sq + 9; file + c <= FILE8 && rank + c <= RANK8; c++, ts += 9) ray[sq].rayNE |= 1ull << ts;
			ray[sq].rwsNE = ray[sq].rayNE | 0x8000000000000000;

			// Southeast
			ray[sq].raySE = 0;
			for (c = 1, ts = sq - 7; file + c <= FILE8 && rank - c >= RANK1; c++, ts -= 7) ray[sq].raySE |= 1ull << ts;
			ray[sq].rwsSE = ray[sq].raySE | 0x0000000000000001;

			// Southwest
			ray[sq].raySW = 0;
			for (c = 1, ts = sq - 9; file - c >= FILE1 && rank - c >= RANK1; c++, ts -= 9) ray[sq].raySW |= 1ull << ts;
			ray[sq].rwsSW = ray[sq].raySW | 0x0000000000000001;

			// North
			ray[sq].rayNN = 0;
			for (c = 1, ts = sq + 8; rank + c <= RANK8; c++, ts += 8) ray[sq].rayNN |= 1ull << ts;
			ray[sq].rwsNN = ray[sq].rayNN | 0x8000000000000000;

			// East
			ray[sq].rayEE = 0;
			for (c = 1, ts = sq + 1; file + c <= FILE8; c++, ts += 1) ray[sq].rayEE |= 1ull << ts;
			ray[sq].rwsEE = ray[sq].rayEE | 0x8000000000000000;

			// South
			ray[sq].raySS = 0;
			for (c = 1, ts = sq - 8; rank - c >= RANK1; c++, ts -= 8) ray[sq].raySS |= 1ull << ts;
			ray[sq].rwsSS = ray[sq].raySS | 0x0000000000000001;

			// West
			ray[sq].rayWW = 0;
			for (c = 1, ts = sq - 1; file - c >= FILE1; c++, ts -= 1) ray[sq].rayWW |= 1ull << ts;
			ray[sq].rwsWW = ray[sq].rayWW | 0x0000000000000001;
		}
		return ray;
	}
	constexpr std::array<Rays, 64> ray = Initialize();

	constexpr auto Size = sizeof(ray);

	uint64_t Queen(int sq, uint64_t occ) {
		uint64_t r1 = 0;
		uint64_t r2 = 0;
		occ |= 0x8000000000000001;

		r1 |= ray[sq].rayNW ^ ray[std::countr_zero(ray[sq].rwsNW & occ)].rayNW;
		r1 |= ray[sq].rayNN ^ ray[std::countr_zero(ray[sq].rwsNN & occ)].rayNN;
		r1 |= ray[sq].rayNE ^ ray[std::countr_zero(ray[sq].rwsNE & occ)].rayNE;
		r1 |= ray[sq].rayEE ^ ray[std::countr_zero(ray[sq].rwsEE & occ)].rayEE;

		r2 |= ray[sq].raySE ^ ray[63 - std::countl_zero(ray[sq].rwsSE & occ)].raySE;
		r2 |= ray[sq].raySS ^ ray[63 - std::countl_zero(ray[sq].rwsSS & occ)].raySS;
		r2 |= ray[sq].raySW ^ ray[63 - std::countl_zero(ray[sq].rwsSW & occ)].raySW;
		r2 |= ray[sq].rayWW ^ ray[63 - std::countl_zero(ray[sq].rwsWW & occ)].rayWW;

		return r1 | r2;
	}

It seems to look trivial to seperate - but i have no testcode for rooks or bishops seperately.
Bob is one of my favourite algos. The code just looks nice to the eye.

Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters