Combining two of Bob's classic bitboard attack getters

Discussion of chess software programming and technical issues.

Moderator: Ras

Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Combining two of Bob's classic bitboard attack getters

Post by Mike Sherwin »

dangi12012 wrote: Fri Dec 10, 2021 1:09 pm
Mike Sherwin wrote: Fri Dec 10, 2021 3:06 am
The code is still buggy - i use this to print a bitboard and can compare reference vs output:

Code: Select all

static std::string _map(uint64_t value)
{
	static std::string str(64 + 8, 'o');
	for (uint64_t i = 0, c = 0; i < 64; i++)
	{
		uint64_t bitmask = (1ull) << i;

		if ((bitmask & value) != 0) str[c++] = 'X';
		else str[c++] = '.';

		if ((i + 1) % 8 == 0) str[c++] = '\n';
	}
	return str;
}
But there seems to be a bug: (after initialize was called!)
OCC = 6822289156618217916, pos = 2

Code: Select all

Compare_Engines:
..XXXX.X
X..X.XX.
.X.XXX..
XXX....X
XXX.X.XX
XX...X.X
X.XX.X.X
.XXXX.X.

Reference
XX.X....
.XXX....
X.X.....
..X.....
........
........
........
........

Bob:
XXX.....
XXXX...X
X.X.X.X.
X.XX.X.X
X.X.X.X.
X.XX.X.X
X.X.X.X.
XXXX...X
also think of using _BitScanForward64 -> std::countr_zero. So its compatible with gcc and clang.
Greetings!
I'm buggy. About a week ago I came down with a bad case of the shingles on my back and stomach. The pain is excruciating. Again I used & instead of | this time in the initialization. :oops: More later.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Combining two of Bob's classic bitboard attack getters

Post by Mike Sherwin »

I think that the initialization is working now. I have to rest awhile I'll try do do more latter.

Code: Select all

// Bob.Mike
// Example of Robert Hyatt's and Michael Sherwin's classical bitboard approach
// to generate moves for the sliding pieces.

#include <cstdint>
#include <intrin.h>

enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };

enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };

struct Ray {
	uint64_t rayNW;
	uint64_t rayNN;
	uint64_t rayNE;
	uint64_t rayEE;
	uint64_t raySE;
	uint64_t raySS;
	uint64_t raySW;
	uint64_t rayWW;
	uint64_t rwsNW;
	uint64_t rwsNN;
	uint64_t rwsNE;
	uint64_t rwsEE;
	uint64_t rwsSE;
	uint64_t rwsSS;
	uint64_t rwsSW;
	uint64_t rwsWW;
};

Ray ray[64];

uint64_t Queen(int sq, uint64_t occ) {
	unsigned long iNW, iNN, iNE, iEE, iSE, iSS, iSW, iWW;
	uint64_t r1, r2, r3, r4, r5, r6;
	r1 = ray[sq].rwsNW & occ;
	r2 = ray[sq].rwsNN & occ;
	r3 = ray[sq].rwsNE & occ;
	r4 = ray[sq].rwsEE & occ;
	_BitScanForward64(&iNW, r1);
	_BitScanForward64(&iNN, r2);
	_BitScanForward64(&iNE, r3);
	_BitScanForward64(&iEE, r4);
	r1 = ray[sq].rayNW ^ ray[iNW].rayNW;
	r2 = ray[sq].rayNN ^ ray[iNN].rayNN;
	r3 = ray[sq].rayNE ^ ray[iNE].rayNE;
	r4 = ray[sq].rayEE ^ ray[iEE].rayEE;
	r5 = r1 | r2;
	r6 = r3 | r4;
	r1 = ray[sq].rwsSE & occ;
	r2 = ray[sq].rwsSS & occ;
	r3 = ray[sq].rwsSW & occ;
	r4 = ray[sq].rwsWW & occ;
	_BitScanReverse64(&iSE, r1);
	_BitScanReverse64(&iSS, r2);
	_BitScanReverse64(&iSW, r3);
	_BitScanReverse64(&iWW, r4);
	r1 = ray[sq].raySE ^ ray[iSE].raySE;
	r2 = ray[sq].raySS ^ ray[iSS].raySS;
	r3 = ray[sq].raySW ^ ray[iSW].raySW;
	r4 = ray[sq].rayWW ^ ray[iWW].rayWW;
	r1 = r1 | r3;
	r2 = r2 | r4;
	r3 = r5 | r6;
	return r1 | r2 | r3;
}

void Initialize() {
	int sq, ts, file, rank, c;

	for (sq = 0; sq <= 63; sq++) {
		file = sq & 7;
		rank = sq >> 3;

		// Northwest
		ray[sq].rayNW = 0;
		for (c = 1, ts = sq + 7; file - c >= FILE1 && rank + c <= RANK8; c++, ts += 7) ray[sq].rayNW |= 1ull << ts;
		ray[sq].rwsNW = ray[sq].rayNW | 0x8000000000000000;

		// Northeast
		ray[sq].rayNE = 0;
		for (c = 1, ts = sq + 9; file + c <= FILE8 && rank + c <= RANK8; c++, ts += 9) ray[sq].rayNE |= 1ull << ts;
		ray[sq].rwsNE = ray[sq].rayNE | 0x8000000000000000;

		// Southeast
		ray[sq].raySE = 0;
		for (c = 1, ts = sq - 7; file + c <= FILE8 && rank - c >= RANK1; c++, ts -= 7) ray[sq].raySE |= 1ull << ts;
		ray[sq].rwsSE = ray[sq].raySE | 0x0000000000000001;

		// Southwest
		ray[sq].raySW = 0;
		for (c = 1, ts = sq - 9; file - c >= FILE1 && rank - c >= RANK1; c++, ts -= 9) ray[sq].raySW |= 1ull << ts;
		ray[sq].rwsSW = ray[sq].raySW | 0x0000000000000001;

		// North
		ray[sq].rayNN = 0;
		for (c = 1, ts = sq + 8; rank + c <= RANK8; c++, ts += 8) ray[sq].rayNN |= 1ull << ts;
		ray[sq].rwsNN = ray[sq].rayNN | 0x8000000000000000;

		// East
		ray[sq].rayEE = 0;
		for (c = 1, ts = sq + 1; file + c <= FILE8; c++, ts += 1) ray[sq].rayEE |= 1ull << ts;
		ray[sq].rwsEE = ray[sq].rayEE | 0x8000000000000000;

		// South
		ray[sq].raySS = 0;
		for (c = 1, ts = sq - 8; rank - c >= RANK1; c++, ts -= 8) ray[sq].raySS |= 1ull << ts;
		ray[sq].rwsSS = ray[sq].raySS | 0x0000000000000001;

		// West
		ray[sq].rayWW = 0;
		for (c = 1, ts = sq - 1; file - c >= FILE1; c++, ts -= 1) ray[sq].rayWW |= 1ull << ts;
		ray[sq].rwsWW = ray[sq].rayWW | 0x0000000000000001;
	}
}

int main() {
	Initialize();
}
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Combining two of Bob's classic bitboard attack getters

Post by dangi12012 »

Mike Sherwin wrote: Fri Dec 10, 2021 5:36 pm I think that the initialization is working now. I have to rest awhile I'll try do do more latter.
Do the resting first. Its important for a clear mind. Still a bug somewhere:

Code: Select all

BLOCKERS:
..XXXX.X
X..X.XX.
.X.XXX..
XXX....X
XXX.X.XX
XX...X.X
X.XX.X.X
.XXXX.X.

Reference queen:
.....XX.
......XX
.......X
.......X
........
........
........
........

Bob queen:
.XXXXXXX
.X....XX
..X....X
...X...X
....X...
.....X..
......X.
.......X
I get an access violation on more accurate debugging somewhere here:

Code: Select all

r1 = ray[sq].raySE ^ ray[iSE].raySE;
r2 = ray[sq].raySS ^ ray[iSS].raySS;
r3 = ray[sq].raySW ^ ray[iSW].raySW;
r4 = ray[sq].rayWW ^ ray[iWW].rayWW
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Combining two of Bob's classic bitboard attack getters

Post by dangi12012 »

The code can also be improved a little but by removing the RANKS from the loop and directly calculating "how many steps you can take in all 8 directions and looping for this many times"
which is dirc[x] in this snippet. idx0 = east and idx1 = northeast idx2 = ...

Directly calculate dircount:

Code: Select all

void atk(int sq)	
	int x = sq & 7; int y = sq >> 3; 
	int dirc[8]; int o[8] = { 1, -7, -8, -9, -1, 7, 8, 9 };
	int EA = dirc[0] = std::max(7 - x, 0); int WE = dirc[4] = x;
	int SO = dirc[6] = std::max(7 - y, 0); int NO = dirc[2] = y;
	int SE = dirc[7] = std::min(EA, SO);   int NW = dirc[3] = std::min(WE, NO);
	int SW = dirc[5] = std::min(WE, SO);   int NE = dirc[1] = std::min(EA, NO); 
Usage like that:

Code: Select all

for (int off = 0; off < dirc[dir]; off++) //dir 0...8
{
     int index = sq + (off + 1) * o[dir]);
}
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Combining two of Bob's classic bitboard attack getters

Post by Mike Sherwin »

dangi12012 wrote: Fri Dec 10, 2021 9:52 pm The code can also be improved a little but by removing the RANKS from the loop and directly calculating "how many steps you can take in all 8 directions and looping for this many times"
which is dirc[x] in this snippet. idx0 = east and idx1 = northeast idx2 = ...

Directly calculate dircount:

Code: Select all

void atk(int sq)	
	int x = sq & 7; int y = sq >> 3; 
	int dirc[8]; int o[8] = { 1, -7, -8, -9, -1, 7, 8, 9 };
	int EA = dirc[0] = std::max(7 - x, 0); int WE = dirc[4] = x;
	int SO = dirc[6] = std::max(7 - y, 0); int NO = dirc[2] = y;
	int SE = dirc[7] = std::min(EA, SO);   int NW = dirc[3] = std::min(WE, NO);
	int SW = dirc[5] = std::min(WE, SO);   int NE = dirc[1] = std::min(EA, NO); 
Usage like that:

Code: Select all

for (int off = 0; off < dirc[dir]; off++) //dir 0...8
{
     int index = sq + (off + 1) * o[dir]);
}
It will take me some time to understand that code. I slept in today and I have no idea when I got up. Anyway it seems to be working perfectly. The demo uses Raylib 4 (nuget package) to put a graphics board on the screen. The instructions are simple. Left click to add/remove the queen. Right click to add/remove blockers. Here is the MSVC 2022 code.

Code: Select all

// Bob.Mike
// Example of Robert Hyatt's and Michael Sherwin's classical bitboard approach
// to generate moves for the sliding pieces.

#include <cstdint>
#include <intrin.h>
#include "raylib.h"

enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };

enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };

struct Rays {
	uint64_t rayNW;
	uint64_t rayNN;
	uint64_t rayNE;
	uint64_t rayEE;
	uint64_t raySE;
	uint64_t raySS;
	uint64_t raySW;
	uint64_t rayWW;
	uint64_t rwsNW;
	uint64_t rwsNN;
	uint64_t rwsNE;
	uint64_t rwsEE;
	uint64_t rwsSE;
	uint64_t rwsSS;
	uint64_t rwsSW;
	uint64_t rwsWW;
};

Rays ray[64];

Vector2 mousePoint = { 0.0f, 0.0f };

Rectangle chessBoard[64] = { 0 };

int queen = -1, blockers[64] = { 0 }, attacks[64] = { 0 };

Color color[2] = {
	RAYWHITE,
	DARKGREEN
};

uint64_t Queen(int sq, uint64_t occ) {
	unsigned long iNW, iNN, iNE, iEE, iSE, iSS, iSW, iWW;
	uint64_t r1, r2, r3, r4, r5, r6;
	occ |= 0x8000000000000001;
	r1 = ray[sq].rwsNW & occ;
	r2 = ray[sq].rwsNN & occ;
	r3 = ray[sq].rwsNE & occ;
	r4 = ray[sq].rwsEE & occ;
	_BitScanForward64(&iNW, r1);
	_BitScanForward64(&iNN, r2);
	_BitScanForward64(&iNE, r3);
	_BitScanForward64(&iEE, r4);
	r1 = ray[sq].rayNW ^ ray[iNW].rayNW;
	r2 = ray[sq].rayNN ^ ray[iNN].rayNN;
	r3 = ray[sq].rayNE ^ ray[iNE].rayNE;
	r4 = ray[sq].rayEE ^ ray[iEE].rayEE;
	r5 = r1 | r2;
	r6 = r3 | r4;
	r1 = ray[sq].rwsSE & occ;
	r2 = ray[sq].rwsSS & occ;
	r3 = ray[sq].rwsSW & occ;
	r4 = ray[sq].rwsWW & occ;
	_BitScanReverse64(&iSE, r1);
	_BitScanReverse64(&iSS, r2);
	_BitScanReverse64(&iSW, r3);
	_BitScanReverse64(&iWW, r4);
	r1 = ray[sq].raySE ^ ray[iSE].raySE;
	r2 = ray[sq].raySS ^ ray[iSS].raySS;
	r3 = ray[sq].raySW ^ ray[iSW].raySW;
	r4 = ray[sq].rayWW ^ ray[iWW].rayWW;
	r1 = r1 | r3;
	r2 = r2 | r4;
	r3 = r5 | r6;
	return r1 | r2 | r3;
}

void Initialize() {
	int sq, ts, file, rank, c;

	for (sq = 0; sq <= 63; sq++) {
		file = sq & 7;
		rank = sq >> 3;

		// Northwest
		ray[sq].rayNW = 0;
		for (c = 1, ts = sq + 7; file - c >= FILE1 && rank + c <= RANK8; c++, ts += 7) ray[sq].rayNW |= 1ull << ts;
		ray[sq].rwsNW = ray[sq].rayNW | 0x8000000000000000;

		// Northeast
		ray[sq].rayNE = 0;
		for (c = 1, ts = sq + 9; file + c <= FILE8 && rank + c <= RANK8; c++, ts += 9) ray[sq].rayNE |= 1ull << ts;
		ray[sq].rwsNE = ray[sq].rayNE | 0x8000000000000000;

		// Southeast
		ray[sq].raySE = 0;
		for (c = 1, ts = sq - 7; file + c <= FILE8 && rank - c >= RANK1; c++, ts -= 7) ray[sq].raySE |= 1ull << ts;
		ray[sq].rwsSE = ray[sq].raySE | 0x0000000000000001;

		// Southwest
		ray[sq].raySW = 0;
		for (c = 1, ts = sq - 9; file - c >= FILE1 && rank - c >= RANK1; c++, ts -= 9) ray[sq].raySW |= 1ull << ts;
		ray[sq].rwsSW = ray[sq].raySW | 0x0000000000000001;

		// North
		ray[sq].rayNN = 0;
		for (c = 1, ts = sq + 8; rank + c <= RANK8; c++, ts += 8) ray[sq].rayNN |= 1ull << ts;
		ray[sq].rwsNN = ray[sq].rayNN | 0x8000000000000000;

		// East
		ray[sq].rayEE = 0;
		for (c = 1, ts = sq + 1; file + c <= FILE8; c++, ts += 1) ray[sq].rayEE |= 1ull << ts;
		ray[sq].rwsEE = ray[sq].rayEE | 0x8000000000000000;

		// South
		ray[sq].raySS = 0;
		for (c = 1, ts = sq - 8; rank - c >= RANK1; c++, ts -= 8) ray[sq].raySS |= 1ull << ts;
		ray[sq].rwsSS = ray[sq].raySS | 0x0000000000000001;

		// West
		ray[sq].rayWW = 0;
		for (c = 1, ts = sq - 1; file - c >= FILE1; c++, ts -= 1) ray[sq].rayWW |= 1ull << ts;
		ray[sq].rwsWW = ray[sq].rayWW | 0x0000000000000001;
	}
}

void InitGraphics() {
	InitWindow(600, 600, "Queen Move Generator Test");
	SetTargetFPS(60);
	for (int i = 0; i <= 63; i++) {
		chessBoard[i].x = 60.0f + 60.0f * (i % 8);
		chessBoard[i].y = 60.0f + 60.0f * (i / 8);
		chessBoard[i].width = 60.0f;
		chessBoard[i].height = 60.0f;
	}
}

int main() {
	int i, x, y, f, r, sq, c, get;
	uint64_t b, occ;
	unsigned long index;
	Initialize();
	InitGraphics();
	while (!WindowShouldClose()) {
		mousePoint = GetMousePosition();
		sq = -1;
		get = 0;
		for (i = 0; i <= 63; i++) {
			if (CheckCollisionPointRec(mousePoint, chessBoard[i])) sq = i;
		}
		if (sq >= 0) {
			if (IsMouseButtonReleased(0) && !blockers[sq]) {
				if (queen == -1) {
					queen = sq;
					get = 1;
				}
				else if (queen == sq) {
					queen = -1;
					for (i = 0; i <= 63; i++) {
						attacks[i] = 0;
						blockers[i] = 0;
					}
				}
			}
			if (IsMouseButtonReleased(1) && queen != -1 && sq != queen) {
				if (!blockers[sq]) blockers[sq] = 1;
				else blockers[sq] = 0;
				get = 1;
			}
		}
		if (get) {
			occ = 0;
			for (i = 0; i <= 63; i++) {
				attacks[i] = 0;
				if (blockers[i])
					occ |= 1ull << i;
			}
			b = Queen(queen, occ);
			while (b) {
				_BitScanForward64(&index, b);
				b ^= 1ull << index;
				attacks[index] = 1;
			}
		}
		BeginDrawing();
		ClearBackground(BLACK);
		for (i = 0; i <= 63; i++) {
			f = i & 7;
			r = i >> 3;
			c = (r & 1) ^ !(f & 1);
			DrawRectangleRec(chessBoard[i], color[c]);
			x = (int)chessBoard[i].x;
			y = (int)chessBoard[i].y;
			if (queen == i) DrawRectangle(x + 20, y + 20, 20, 20, BEIGE);
			if (blockers[i]) DrawRectangle(x + 20, y + 20, 20, 20, RED);
			if (attacks[i]) DrawCircle(x + 30, y + 30, 6.0f, GREEN);
		}
		EndDrawing();
	}
}

Here is just the executable. https://www.mediafire.com/file/d91vn75q ... e.exe/file
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Combining two of Bob's classic bitboard attack getters

Post by dangi12012 »

Mike Sherwin wrote: Sat Dec 11, 2021 4:51 am }[/code]
Here is just the executable. https://www.mediafire.com/file/d91vn75q ... e.exe/file
There we go! - Now its 100% correct and compatible with all sliding lookups:

Code: Select all

Megalookups/s:
Reference:      117.90MOps
KoggeStone:     112.60MOps
BobMike:        120.94MOps
FancyHash:      500.44MOps
Pext  :         863.54MOps
HyperLookup:    911.47MOps
These are still with all other lookups initialized at the same time so take that with a grain of salt.
I have the templates ready - need to compile one executable per Algorithm.

Also I will add all the algos that Harald provided!
Also bob assembly is of interest to me.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Combining two of Bob's classic bitboard attack getters

Post by dangi12012 »

Update on Bobs lookup method
the non portable _bitscanforward64 made it much slower than the algo really is. I made the initializsation consteval and now it compatible with gcc/clang.
Is 2x than before which is really good because it only takes 8Kib of lookup and thus fits inside L1. Also its 2x faster than Kogge-Stone Maybe we can optimize it more?

Code: Select all

#include <cstdint>
#include <bit>
#include <array>
 
struct Rays {
		uint64_t rayNW;
		uint64_t rayNN;
		uint64_t rayNE;
		uint64_t rayEE;
		uint64_t raySE;
		uint64_t raySS;
		uint64_t raySW;
		uint64_t rayWW;
		uint64_t rwsNW;
		uint64_t rwsNN;
		uint64_t rwsNE;
		uint64_t rwsEE;
		uint64_t rwsSE;
		uint64_t rwsSS;
		uint64_t rwsSW;
		uint64_t rwsWW;
	};

	consteval std::array<Rays,64> Initialize() {
		enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };
		enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };

		int sq, ts, file, rank, c;
		std::array<Rays, 64> ray{};
		for (sq = 0; sq <= 63; sq++) {
			file = sq & 7;
			rank = sq >> 3;

			// Northwest
			ray[sq].rayNW = 0;
			for (c = 1, ts = sq + 7; file - c >= FILE1 && rank + c <= RANK8; c++, ts += 7) ray[sq].rayNW |= 1ull << ts;
			ray[sq].rwsNW = ray[sq].rayNW | 0x8000000000000000;

			// Northeast
			ray[sq].rayNE = 0;
			for (c = 1, ts = sq + 9; file + c <= FILE8 && rank + c <= RANK8; c++, ts += 9) ray[sq].rayNE |= 1ull << ts;
			ray[sq].rwsNE = ray[sq].rayNE | 0x8000000000000000;

			// Southeast
			ray[sq].raySE = 0;
			for (c = 1, ts = sq - 7; file + c <= FILE8 && rank - c >= RANK1; c++, ts -= 7) ray[sq].raySE |= 1ull << ts;
			ray[sq].rwsSE = ray[sq].raySE | 0x0000000000000001;

			// Southwest
			ray[sq].raySW = 0;
			for (c = 1, ts = sq - 9; file - c >= FILE1 && rank - c >= RANK1; c++, ts -= 9) ray[sq].raySW |= 1ull << ts;
			ray[sq].rwsSW = ray[sq].raySW | 0x0000000000000001;

			// North
			ray[sq].rayNN = 0;
			for (c = 1, ts = sq + 8; rank + c <= RANK8; c++, ts += 8) ray[sq].rayNN |= 1ull << ts;
			ray[sq].rwsNN = ray[sq].rayNN | 0x8000000000000000;

			// East
			ray[sq].rayEE = 0;
			for (c = 1, ts = sq + 1; file + c <= FILE8; c++, ts += 1) ray[sq].rayEE |= 1ull << ts;
			ray[sq].rwsEE = ray[sq].rayEE | 0x8000000000000000;

			// South
			ray[sq].raySS = 0;
			for (c = 1, ts = sq - 8; rank - c >= RANK1; c++, ts -= 8) ray[sq].raySS |= 1ull << ts;
			ray[sq].rwsSS = ray[sq].raySS | 0x0000000000000001;

			// West
			ray[sq].rayWW = 0;
			for (c = 1, ts = sq - 1; file - c >= FILE1; c++, ts -= 1) ray[sq].rayWW |= 1ull << ts;
			ray[sq].rwsWW = ray[sq].rayWW | 0x0000000000000001;
		}
		return ray;
	}
	constexpr std::array<Rays, 64> ray = Initialize();
	constexpr auto size = sizeof(ray);

	uint64_t Queen(int sq, uint64_t occ) {
		unsigned long iNW, iNN, iNE, iEE, iSE, iSS, iSW, iWW = 0;
		uint64_t r1, r2, r3, r4, r5, r6;
		occ |= 0x8000000000000001;
		r1 = ray[sq].rwsNW & occ;
		r2 = ray[sq].rwsNN & occ;
		r3 = ray[sq].rwsNE & occ;
		r4 = ray[sq].rwsEE & occ;
		iNW = std::countr_zero(r1);
		iNN = std::countr_zero(r2);
		iNE = std::countr_zero(r3);
		iEE = std::countr_zero(r4);
		r1 = ray[sq].rayNW ^ ray[iNW].rayNW;
		r2 = ray[sq].rayNN ^ ray[iNN].rayNN;
		r3 = ray[sq].rayNE ^ ray[iNE].rayNE;
		r4 = ray[sq].rayEE ^ ray[iEE].rayEE;
		r5 = r1 | r2;
		r6 = r3 | r4;
		r1 = ray[sq].rwsSE & occ;
		r2 = ray[sq].rwsSS & occ;
		r3 = ray[sq].rwsSW & occ;
		r4 = ray[sq].rwsWW & occ;
		iSE = 63 - std::countl_zero(r1);
		iSS = 63 - std::countl_zero(r2);
		iSW = 63 - std::countl_zero(r3);
		iWW = 63 - std::countl_zero(r4);
		r1 = ray[sq].raySE ^ ray[iSE].raySE;
		r2 = ray[sq].raySS ^ ray[iSS].raySS;
		r3 = ray[sq].raySW ^ ray[iSW].raySW;
		r4 = ray[sq].rayWW ^ ray[iWW].rayWW;
		r1 = r1 | r3;
		r2 = r2 | r4;
		r3 = r5 | r6;
		return r1 | r2 | r3;
	}
Produces 244.82MOps performance on queen lookups:

Code: Select all

Megalookups/s:
Reference:      118.25MOps
KoggeStone:     112.38MOps
BobMike:        244.82MOps
FancyHash:      506.22MOps
Pext  :         852.53MOps
HyperLookup:    926.42MOps

With the assembly here:

https://godbolt.org/z/qdbsYdnWY

Code: Select all

Queen(int, unsigned long):                             # @Queen(int, unsigned long)
        push    r14
        push    rbx
        movsxd  rax, edi
        movabs  rcx, -9223372036854775807
        shl     rax, 7
        or      rcx, rsi
        mov     rsi, qword ptr [rax + ray+72]
        mov     rdi, qword ptr [rax + ray+80]
        mov     r8, qword ptr [rax + ray+64]
        mov     rdx, qword ptr [rax + ray+88]
        mov     rbx, qword ptr [rax + ray+96]
        and     rsi, rcx
        and     rdi, rcx
        and     r8, rcx
        and     rdx, rcx
        and     rbx, rcx
        tzcnt   r9, rsi
        tzcnt   r10, rdi
        mov     rsi, qword ptr [rax + ray+104]
        mov     rdi, qword ptr [rax + ray+112]
        tzcnt   r11, rdx
        lzcnt   r14, rbx
        mov     edx, 63
        mov     ebx, 63
        tzcnt   r8, r8
        sub     rdx, r14
        shl     rdx, 7
        shl     r10, 7
        shl     r11, 7
        shl     r8, 7
        shl     r9, 7
        vmovq   xmm2, qword ptr [rdx + ray+32] # xmm2 = mem[0],zero
        vmovq   xmm3, qword ptr [r8 + ray]  # xmm3 = mem[0],zero
        and     rsi, rcx
        and     rdi, rcx
        and     rcx, qword ptr [rax + ray+120]
        lzcnt   r14, rsi
        mov     esi, 63
        sub     rsi, r14
        lzcnt   r14, rdi
        mov     edi, 63
        sub     rdi, r14
        shl     rsi, 7
        shl     rdi, 7
        lzcnt   rcx, rcx
        vmovq   xmm1, qword ptr [rdi + ray+48] # xmm1 = mem[0],zero
        sub     rbx, rcx
        shl     rbx, 7
        vmovq   xmm0, qword ptr [rbx + ray+56] # xmm0 = mem[0],zero
        vpunpcklqdq     xmm0, xmm1, xmm0        # xmm0 = xmm1[0],xmm0[0]
        vmovq   xmm1, qword ptr [rsi + ray+40] # xmm1 = mem[0],zero
        vpunpcklqdq     xmm1, xmm2, xmm1        # xmm1 = xmm2[0],xmm1[0]
        vmovq   xmm2, qword ptr [r10 + ray+16] # xmm2 = mem[0],zero
        vinserti128     ymm0, ymm1, xmm0, 1
        vmovq   xmm1, qword ptr [r11 + ray+24] # xmm1 = mem[0],zero
        vpxor   ymm0, ymm0, ymmword ptr [rax + ray+32]
        vpunpcklqdq     xmm1, xmm2, xmm1        # xmm1 = xmm2[0],xmm1[0]
        vmovq   xmm2, qword ptr [r9 + ray+8] # xmm2 = mem[0],zero
        vpunpcklqdq     xmm2, xmm3, xmm2        # xmm2 = xmm3[0],xmm2[0]
        vinserti128     ymm1, ymm2, xmm1, 1
        vpxor   ymm1, ymm1, ymmword ptr [rax + ray]
        vpor    ymm0, ymm1, ymm0
        vextracti128    xmm1, ymm0, 1
        vpor    xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        vpor    xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        pop     rbx
        pop     r14
        vzeroupper
        ret
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
tcusr
Posts: 325
Joined: Tue Aug 31, 2021 10:32 pm
Full name: Mateo

Re: Combining two of Bob's classic bitboard attack getters

Post by tcusr »

why is it faster if the table available at compile? what kind of optimization does the compiler do?

thank you for your tests! would my mind adding hyperbola quintessence?
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Combining two of Bob's classic bitboard attack getters

Post by dangi12012 »

tcusr wrote: Sat Dec 11, 2021 6:08 pm why is it faster if the table available at compile? what kind of optimization does the compiler do?

thank you for your tests! would my mind adding hyperbola quintessence?
I dont mind adding it. Do you have a repo where it is implemented?
Ideally i would have 1 file with this interface:

Code: Select all

void Init()
uint64_t Queen(int sq, uint64_t occ)
So please send me an URL!

it was faster because I replaced old msvc intrinsic the std::countlz.
constexpr does not make it faster when its unknown. BUT: when you know square and submit 0 instead of x the compiler will remove almost all code. So in general case its same speed. But when you have an engine that loops over the squares from 0 to 64 it will be much much faster!

When i say sq = 2 the assembly is much smaller and the compiler knows what the content is!

Code: Select all

Queen(unsigned long):                              # @Queen(unsigned long)
        movabs  rax, -9223372036854775807
        movabs  rdx, -8644650654150162432
        movabs  rsi, -9223371486023118848
        movabs  r9, 578721382704613376
        or      rdi, rax
        lea     rcx, [rax + 16909311]
        add     rax, 239
        and     rcx, rdi
        and     rdx, rdi
        and     rax, rdi
        and     rsi, rdi
        and     edi, 7
        tzcnt   rcx, rcx
        tzcnt   rdx, rdx
        tzcnt   r8, rax
        mov     eax, 16909312
        tzcnt   rsi, rsi
        shl     rcx, 7
        shl     rdx, 7
        shl     rsi, 7
        shl     r8, 7
        xor     r9, qword ptr [rdx + ray+8]
        xor     rax, qword ptr [rcx + ray]
        mov     edx, 240
        xor     rdx, qword ptr [r8 + ray+24]
        or      r9, rax
        movabs  rax, 550831656960
        xor     rax, qword ptr [rsi + ray+16]
        or      rdx, rax
        lzcnt   rax, rdi
        xor     rax, 63
        or      rdx, r9
        shl     rax, 7
        mov     rax, qword ptr [rax + ray+56]
        xor     rax, 7
        or      rax, rdx
        ret
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Combining two of Bob's classic bitboard attack getters

Post by dangi12012 »

Update:
Sometimes I wonder how verbose the code is from other people. It hides the simple lookup code which is really just that:
Robert Hyatt's and Michael Sherwin's classical bitboard approach is this:

Code: Select all

uint64_t Queen(int sq, uint64_t occ) {
		uint64_t result = 0;
		occ |= 0x8000000000000001;

		result |= ray[sq].rayNW ^ ray[std::countr_zero(ray[sq].rwsNW & occ)].rayNW;
		result |= ray[sq].rayNN ^ ray[std::countr_zero(ray[sq].rwsNN & occ)].rayNN;
		result |= ray[sq].rayNE ^ ray[std::countr_zero(ray[sq].rwsNE & occ)].rayNE;
		result |= ray[sq].rayEE ^ ray[std::countr_zero(ray[sq].rwsEE & occ)].rayEE;

		result |= ray[sq].raySE ^ ray[63 - std::countl_zero(ray[sq].rwsSE & occ)].raySE;
		result |= ray[sq].raySS ^ ray[63 - std::countl_zero(ray[sq].rwsSS & occ)].raySS;
		result |= ray[sq].raySW ^ ray[63 - std::countl_zero(ray[sq].rwsSW & occ)].raySW;
		result |= ray[sq].rayWW ^ ray[63 - std::countl_zero(ray[sq].rwsWW & occ)].rayWW;

		return result;
	}
I see why this was done. Because 15 years ago the compilers didnt resolve the dependency chain in a good way.

This lookup is nice and clean. Just how it should be done. This can be improved by defining the rwsSE in a different way so that the inner 63 - will go away.
But that should be done by the original authors! - are they still active here or on other boards?
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer