Combining two of Bob's classic bitboard attack getters

Mike Sherwin · Post by **Mike Sherwin** » Fri Dec 10, 2021 3:08 pm

dangi12012 wrote: ↑Fri Dec 10, 2021 1:09 pm
Mike Sherwin wrote: ↑Fri Dec 10, 2021 3:06 am
The code is still buggy - i use this to print a bitboard and can compare reference vs output:
Code: Select all
static std::string _map(uint64_t value)
{
	static std::string str(64 + 8, 'o');
	for (uint64_t i = 0, c = 0; i < 64; i++)
	{
		uint64_t bitmask = (1ull) << i;

		if ((bitmask & value) != 0) str[c++] = 'X';
		else str[c++] = '.';

		if ((i + 1) % 8 == 0) str[c++] = '\n';
	}
	return str;
}
But there seems to be a bug: (after initialize was called!)
OCC = 6822289156618217916, pos = 2
Code: Select all
Compare_Engines:
..XXXX.X
X..X.XX.
.X.XXX..
XXX....X
XXX.X.XX
XX...X.X
X.XX.X.X
.XXXX.X.

Reference
XX.X....
.XXX....
X.X.....
..X.....
........
........
........
........

Bob:
XXX.....
XXXX...X
X.X.X.X.
X.XX.X.X
X.X.X.X.
X.XX.X.X
X.X.X.X.
XXXX...X
also think of using _BitScanForward64 -> std::countr_zero. So its compatible with gcc and clang.
Greetings!

I'm buggy. About a week ago I came down with a bad case of the shingles on my back and stomach. The pain is excruciating. Again I used & instead of | this time in the initialization.

More later.

Mike Sherwin · Post by **Mike Sherwin** » Fri Dec 10, 2021 5:36 pm

I think that the initialization is working now. I have to rest awhile I'll try do do more latter.

Code: Select all

// Bob.Mike
// Example of Robert Hyatt's and Michael Sherwin's classical bitboard approach
// to generate moves for the sliding pieces.

#include <cstdint>
#include <intrin.h>

enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };

enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };

struct Ray {
	uint64_t rayNW;
	uint64_t rayNN;
	uint64_t rayNE;
	uint64_t rayEE;
	uint64_t raySE;
	uint64_t raySS;
	uint64_t raySW;
	uint64_t rayWW;
	uint64_t rwsNW;
	uint64_t rwsNN;
	uint64_t rwsNE;
	uint64_t rwsEE;
	uint64_t rwsSE;
	uint64_t rwsSS;
	uint64_t rwsSW;
	uint64_t rwsWW;
};

Ray ray[64];

uint64_t Queen(int sq, uint64_t occ) {
	unsigned long iNW, iNN, iNE, iEE, iSE, iSS, iSW, iWW;
	uint64_t r1, r2, r3, r4, r5, r6;
	r1 = ray[sq].rwsNW & occ;
	r2 = ray[sq].rwsNN & occ;
	r3 = ray[sq].rwsNE & occ;
	r4 = ray[sq].rwsEE & occ;
	_BitScanForward64(&iNW, r1);
	_BitScanForward64(&iNN, r2);
	_BitScanForward64(&iNE, r3);
	_BitScanForward64(&iEE, r4);
	r1 = ray[sq].rayNW ^ ray[iNW].rayNW;
	r2 = ray[sq].rayNN ^ ray[iNN].rayNN;
	r3 = ray[sq].rayNE ^ ray[iNE].rayNE;
	r4 = ray[sq].rayEE ^ ray[iEE].rayEE;
	r5 = r1 | r2;
	r6 = r3 | r4;
	r1 = ray[sq].rwsSE & occ;
	r2 = ray[sq].rwsSS & occ;
	r3 = ray[sq].rwsSW & occ;
	r4 = ray[sq].rwsWW & occ;
	_BitScanReverse64(&iSE, r1);
	_BitScanReverse64(&iSS, r2);
	_BitScanReverse64(&iSW, r3);
	_BitScanReverse64(&iWW, r4);
	r1 = ray[sq].raySE ^ ray[iSE].raySE;
	r2 = ray[sq].raySS ^ ray[iSS].raySS;
	r3 = ray[sq].raySW ^ ray[iSW].raySW;
	r4 = ray[sq].rayWW ^ ray[iWW].rayWW;
	r1 = r1 | r3;
	r2 = r2 | r4;
	r3 = r5 | r6;
	return r1 | r2 | r3;
}

void Initialize() {
	int sq, ts, file, rank, c;

	for (sq = 0; sq <= 63; sq++) {
		file = sq & 7;
		rank = sq >> 3;

		// Northwest
		ray[sq].rayNW = 0;
		for (c = 1, ts = sq + 7; file - c >= FILE1 && rank + c <= RANK8; c++, ts += 7) ray[sq].rayNW |= 1ull << ts;
		ray[sq].rwsNW = ray[sq].rayNW | 0x8000000000000000;

		// Northeast
		ray[sq].rayNE = 0;
		for (c = 1, ts = sq + 9; file + c <= FILE8 && rank + c <= RANK8; c++, ts += 9) ray[sq].rayNE |= 1ull << ts;
		ray[sq].rwsNE = ray[sq].rayNE | 0x8000000000000000;

		// Southeast
		ray[sq].raySE = 0;
		for (c = 1, ts = sq - 7; file + c <= FILE8 && rank - c >= RANK1; c++, ts -= 7) ray[sq].raySE |= 1ull << ts;
		ray[sq].rwsSE = ray[sq].raySE | 0x0000000000000001;

		// Southwest
		ray[sq].raySW = 0;
		for (c = 1, ts = sq - 9; file - c >= FILE1 && rank - c >= RANK1; c++, ts -= 9) ray[sq].raySW |= 1ull << ts;
		ray[sq].rwsSW = ray[sq].raySW | 0x0000000000000001;

		// North
		ray[sq].rayNN = 0;
		for (c = 1, ts = sq + 8; rank + c <= RANK8; c++, ts += 8) ray[sq].rayNN |= 1ull << ts;
		ray[sq].rwsNN = ray[sq].rayNN | 0x8000000000000000;

		// East
		ray[sq].rayEE = 0;
		for (c = 1, ts = sq + 1; file + c <= FILE8; c++, ts += 1) ray[sq].rayEE |= 1ull << ts;
		ray[sq].rwsEE = ray[sq].rayEE | 0x8000000000000000;

		// South
		ray[sq].raySS = 0;
		for (c = 1, ts = sq - 8; rank - c >= RANK1; c++, ts -= 8) ray[sq].raySS |= 1ull << ts;
		ray[sq].rwsSS = ray[sq].raySS | 0x0000000000000001;

		// West
		ray[sq].rayWW = 0;
		for (c = 1, ts = sq - 1; file - c >= FILE1; c++, ts -= 1) ray[sq].rayWW |= 1ull << ts;
		ray[sq].rwsWW = ray[sq].rayWW | 0x0000000000000001;
	}
}

int main() {
	Initialize();
}

dangi12012 · Post by **dangi12012** » Fri Dec 10, 2021 9:47 pm

Mike Sherwin wrote: ↑Fri Dec 10, 2021 5:36 pm I think that the initialization is working now. I have to rest awhile I'll try do do more latter.

Do the resting first. Its important for a clear mind. Still a bug somewhere:

Code: Select all

BLOCKERS:
..XXXX.X
X..X.XX.
.X.XXX..
XXX....X
XXX.X.XX
XX...X.X
X.XX.X.X
.XXXX.X.

Reference queen:
.....XX.
......XX
.......X
.......X
........
........
........
........

Bob queen:
.XXXXXXX
.X....XX
..X....X
...X...X
....X...
.....X..
......X.
.......X

I get an access violation on more accurate debugging somewhere here:

Code: Select all

r1 = ray[sq].raySE ^ ray[iSE].raySE;
r2 = ray[sq].raySS ^ ray[iSS].raySS;
r3 = ray[sq].raySW ^ ray[iSW].raySW;
r4 = ray[sq].rayWW ^ ray[iWW].rayWW

dangi12012 · Post by **dangi12012** » Fri Dec 10, 2021 9:52 pm

The code can also be improved a little but by removing the RANKS from the loop and directly calculating "how many steps you can take in all 8 directions and looping for this many times"
which is dirc[x] in this snippet. idx0 = east and idx1 = northeast idx2 = ...

Directly calculate dircount:

Code: Select all

void atk(int sq)	
	int x = sq & 7; int y = sq >> 3; 
	int dirc[8]; int o[8] = { 1, -7, -8, -9, -1, 7, 8, 9 };
	int EA = dirc[0] = std::max(7 - x, 0); int WE = dirc[4] = x;
	int SO = dirc[6] = std::max(7 - y, 0); int NO = dirc[2] = y;
	int SE = dirc[7] = std::min(EA, SO);   int NW = dirc[3] = std::min(WE, NO);
	int SW = dirc[5] = std::min(WE, SO);   int NE = dirc[1] = std::min(EA, NO);

Usage like that:

Code: Select all

for (int off = 0; off < dirc[dir]; off++) //dir 0...8
{
     int index = sq + (off + 1) * o[dir]);
}

Mike Sherwin · Post by **Mike Sherwin** » Sat Dec 11, 2021 4:51 am

dangi12012 wrote: ↑Fri Dec 10, 2021 9:52 pm The code can also be improved a little but by removing the RANKS from the loop and directly calculating "how many steps you can take in all 8 directions and looping for this many times"
which is dirc[x] in this snippet. idx0 = east and idx1 = northeast idx2 = ...

Directly calculate dircount:
Code: Select all
void atk(int sq)	
	int x = sq & 7; int y = sq >> 3; 
	int dirc[8]; int o[8] = { 1, -7, -8, -9, -1, 7, 8, 9 };
	int EA = dirc[0] = std::max(7 - x, 0); int WE = dirc[4] = x;
	int SO = dirc[6] = std::max(7 - y, 0); int NO = dirc[2] = y;
	int SE = dirc[7] = std::min(EA, SO);   int NW = dirc[3] = std::min(WE, NO);
	int SW = dirc[5] = std::min(WE, SO);   int NE = dirc[1] = std::min(EA, NO); 
Usage like that:
Code: Select all
for (int off = 0; off < dirc[dir]; off++) //dir 0...8
{
     int index = sq + (off + 1) * o[dir]);
}

It will take me some time to understand that code. I slept in today and I have no idea when I got up. Anyway it seems to be working perfectly. The demo uses Raylib 4 (nuget package) to put a graphics board on the screen. The instructions are simple. Left click to add/remove the queen. Right click to add/remove blockers. Here is the MSVC 2022 code.

Code: Select all

// Bob.Mike
// Example of Robert Hyatt's and Michael Sherwin's classical bitboard approach
// to generate moves for the sliding pieces.

#include <cstdint>
#include <intrin.h>
#include "raylib.h"

enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };

enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };

struct Rays {
	uint64_t rayNW;
	uint64_t rayNN;
	uint64_t rayNE;
	uint64_t rayEE;
	uint64_t raySE;
	uint64_t raySS;
	uint64_t raySW;
	uint64_t rayWW;
	uint64_t rwsNW;
	uint64_t rwsNN;
	uint64_t rwsNE;
	uint64_t rwsEE;
	uint64_t rwsSE;
	uint64_t rwsSS;
	uint64_t rwsSW;
	uint64_t rwsWW;
};

Rays ray[64];

Vector2 mousePoint = { 0.0f, 0.0f };

Rectangle chessBoard[64] = { 0 };

int queen = -1, blockers[64] = { 0 }, attacks[64] = { 0 };

Color color[2] = {
	RAYWHITE,
	DARKGREEN
};

uint64_t Queen(int sq, uint64_t occ) {
	unsigned long iNW, iNN, iNE, iEE, iSE, iSS, iSW, iWW;
	uint64_t r1, r2, r3, r4, r5, r6;
	occ |= 0x8000000000000001;
	r1 = ray[sq].rwsNW & occ;
	r2 = ray[sq].rwsNN & occ;
	r3 = ray[sq].rwsNE & occ;
	r4 = ray[sq].rwsEE & occ;
	_BitScanForward64(&iNW, r1);
	_BitScanForward64(&iNN, r2);
	_BitScanForward64(&iNE, r3);
	_BitScanForward64(&iEE, r4);
	r1 = ray[sq].rayNW ^ ray[iNW].rayNW;
	r2 = ray[sq].rayNN ^ ray[iNN].rayNN;
	r3 = ray[sq].rayNE ^ ray[iNE].rayNE;
	r4 = ray[sq].rayEE ^ ray[iEE].rayEE;
	r5 = r1 | r2;
	r6 = r3 | r4;
	r1 = ray[sq].rwsSE & occ;
	r2 = ray[sq].rwsSS & occ;
	r3 = ray[sq].rwsSW & occ;
	r4 = ray[sq].rwsWW & occ;
	_BitScanReverse64(&iSE, r1);
	_BitScanReverse64(&iSS, r2);
	_BitScanReverse64(&iSW, r3);
	_BitScanReverse64(&iWW, r4);
	r1 = ray[sq].raySE ^ ray[iSE].raySE;
	r2 = ray[sq].raySS ^ ray[iSS].raySS;
	r3 = ray[sq].raySW ^ ray[iSW].raySW;
	r4 = ray[sq].rayWW ^ ray[iWW].rayWW;
	r1 = r1 | r3;
	r2 = r2 | r4;
	r3 = r5 | r6;
	return r1 | r2 | r3;
}

void Initialize() {
	int sq, ts, file, rank, c;

	for (sq = 0; sq <= 63; sq++) {
		file = sq & 7;
		rank = sq >> 3;

		// Northwest
		ray[sq].rayNW = 0;
		for (c = 1, ts = sq + 7; file - c >= FILE1 && rank + c <= RANK8; c++, ts += 7) ray[sq].rayNW |= 1ull << ts;
		ray[sq].rwsNW = ray[sq].rayNW | 0x8000000000000000;

		// Northeast
		ray[sq].rayNE = 0;
		for (c = 1, ts = sq + 9; file + c <= FILE8 && rank + c <= RANK8; c++, ts += 9) ray[sq].rayNE |= 1ull << ts;
		ray[sq].rwsNE = ray[sq].rayNE | 0x8000000000000000;

		// Southeast
		ray[sq].raySE = 0;
		for (c = 1, ts = sq - 7; file + c <= FILE8 && rank - c >= RANK1; c++, ts -= 7) ray[sq].raySE |= 1ull << ts;
		ray[sq].rwsSE = ray[sq].raySE | 0x0000000000000001;

		// Southwest
		ray[sq].raySW = 0;
		for (c = 1, ts = sq - 9; file - c >= FILE1 && rank - c >= RANK1; c++, ts -= 9) ray[sq].raySW |= 1ull << ts;
		ray[sq].rwsSW = ray[sq].raySW | 0x0000000000000001;

		// North
		ray[sq].rayNN = 0;
		for (c = 1, ts = sq + 8; rank + c <= RANK8; c++, ts += 8) ray[sq].rayNN |= 1ull << ts;
		ray[sq].rwsNN = ray[sq].rayNN | 0x8000000000000000;

		// East
		ray[sq].rayEE = 0;
		for (c = 1, ts = sq + 1; file + c <= FILE8; c++, ts += 1) ray[sq].rayEE |= 1ull << ts;
		ray[sq].rwsEE = ray[sq].rayEE | 0x8000000000000000;

		// South
		ray[sq].raySS = 0;
		for (c = 1, ts = sq - 8; rank - c >= RANK1; c++, ts -= 8) ray[sq].raySS |= 1ull << ts;
		ray[sq].rwsSS = ray[sq].raySS | 0x0000000000000001;

		// West
		ray[sq].rayWW = 0;
		for (c = 1, ts = sq - 1; file - c >= FILE1; c++, ts -= 1) ray[sq].rayWW |= 1ull << ts;
		ray[sq].rwsWW = ray[sq].rayWW | 0x0000000000000001;
	}
}

void InitGraphics() {
	InitWindow(600, 600, "Queen Move Generator Test");
	SetTargetFPS(60);
	for (int i = 0; i <= 63; i++) {
		chessBoard[i].x = 60.0f + 60.0f * (i % 8);
		chessBoard[i].y = 60.0f + 60.0f * (i / 8);
		chessBoard[i].width = 60.0f;
		chessBoard[i].height = 60.0f;
	}
}

int main() {
	int i, x, y, f, r, sq, c, get;
	uint64_t b, occ;
	unsigned long index;
	Initialize();
	InitGraphics();
	while (!WindowShouldClose()) {
		mousePoint = GetMousePosition();
		sq = -1;
		get = 0;
		for (i = 0; i <= 63; i++) {
			if (CheckCollisionPointRec(mousePoint, chessBoard[i])) sq = i;
		}
		if (sq >= 0) {
			if (IsMouseButtonReleased(0) && !blockers[sq]) {
				if (queen == -1) {
					queen = sq;
					get = 1;
				}
				else if (queen == sq) {
					queen = -1;
					for (i = 0; i <= 63; i++) {
						attacks[i] = 0;
						blockers[i] = 0;
					}
				}
			}
			if (IsMouseButtonReleased(1) && queen != -1 && sq != queen) {
				if (!blockers[sq]) blockers[sq] = 1;
				else blockers[sq] = 0;
				get = 1;
			}
		}
		if (get) {
			occ = 0;
			for (i = 0; i <= 63; i++) {
				attacks[i] = 0;
				if (blockers[i])
					occ |= 1ull << i;
			}
			b = Queen(queen, occ);
			while (b) {
				_BitScanForward64(&index, b);
				b ^= 1ull << index;
				attacks[index] = 1;
			}
		}
		BeginDrawing();
		ClearBackground(BLACK);
		for (i = 0; i <= 63; i++) {
			f = i & 7;
			r = i >> 3;
			c = (r & 1) ^ !(f & 1);
			DrawRectangleRec(chessBoard[i], color[c]);
			x = (int)chessBoard[i].x;
			y = (int)chessBoard[i].y;
			if (queen == i) DrawRectangle(x + 20, y + 20, 20, 20, BEIGE);
			if (blockers[i]) DrawRectangle(x + 20, y + 20, 20, 20, RED);
			if (attacks[i]) DrawCircle(x + 30, y + 30, 6.0f, GREEN);
		}
		EndDrawing();
	}
}

Here is just the executable. https://www.mediafire.com/file/d91vn75q ... e.exe/file

dangi12012 · Post by **dangi12012** » Sat Dec 11, 2021 3:35 pm

Mike Sherwin wrote: ↑Sat Dec 11, 2021 4:51 am }[/code]
Here is just the executable. https://www.mediafire.com/file/d91vn75q ... e.exe/file

There we go! - Now its 100% correct and compatible with all sliding lookups:

Code: Select all

Megalookups/s:
Reference:      117.90MOps
KoggeStone:     112.60MOps
BobMike:        120.94MOps
FancyHash:      500.44MOps
Pext  :         863.54MOps
HyperLookup:    911.47MOps

These are still with all other lookups initialized at the same time so take that with a grain of salt.
I have the templates ready - need to compile one executable per Algorithm.

Also I will add all the algos that Harald provided!
Also bob assembly is of interest to me.

dangi12012 · Post by **dangi12012** » Sat Dec 11, 2021 5:44 pm

Update on Bobs lookup method
the non portable _bitscanforward64 made it much slower than the algo really is. I made the initializsation consteval and now it compatible with gcc/clang.
Is 2x than before which is really good because it only takes 8Kib of lookup and thus fits inside L1. Also its 2x faster than Kogge-Stone Maybe we can optimize it more?

Code: Select all

#include <cstdint>
#include <bit>
#include <array>
 
struct Rays {
		uint64_t rayNW;
		uint64_t rayNN;
		uint64_t rayNE;
		uint64_t rayEE;
		uint64_t raySE;
		uint64_t raySS;
		uint64_t raySW;
		uint64_t rayWW;
		uint64_t rwsNW;
		uint64_t rwsNN;
		uint64_t rwsNE;
		uint64_t rwsEE;
		uint64_t rwsSE;
		uint64_t rwsSS;
		uint64_t rwsSW;
		uint64_t rwsWW;
	};

	consteval std::array<Rays,64> Initialize() {
		enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };
		enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };

		int sq, ts, file, rank, c;
		std::array<Rays, 64> ray{};
		for (sq = 0; sq <= 63; sq++) {
			file = sq & 7;
			rank = sq >> 3;

			// Northwest
			ray[sq].rayNW = 0;
			for (c = 1, ts = sq + 7; file - c >= FILE1 && rank + c <= RANK8; c++, ts += 7) ray[sq].rayNW |= 1ull << ts;
			ray[sq].rwsNW = ray[sq].rayNW | 0x8000000000000000;

			// Northeast
			ray[sq].rayNE = 0;
			for (c = 1, ts = sq + 9; file + c <= FILE8 && rank + c <= RANK8; c++, ts += 9) ray[sq].rayNE |= 1ull << ts;
			ray[sq].rwsNE = ray[sq].rayNE | 0x8000000000000000;

			// Southeast
			ray[sq].raySE = 0;
			for (c = 1, ts = sq - 7; file + c <= FILE8 && rank - c >= RANK1; c++, ts -= 7) ray[sq].raySE |= 1ull << ts;
			ray[sq].rwsSE = ray[sq].raySE | 0x0000000000000001;

			// Southwest
			ray[sq].raySW = 0;
			for (c = 1, ts = sq - 9; file - c >= FILE1 && rank - c >= RANK1; c++, ts -= 9) ray[sq].raySW |= 1ull << ts;
			ray[sq].rwsSW = ray[sq].raySW | 0x0000000000000001;

			// North
			ray[sq].rayNN = 0;
			for (c = 1, ts = sq + 8; rank + c <= RANK8; c++, ts += 8) ray[sq].rayNN |= 1ull << ts;
			ray[sq].rwsNN = ray[sq].rayNN | 0x8000000000000000;

			// East
			ray[sq].rayEE = 0;
			for (c = 1, ts = sq + 1; file + c <= FILE8; c++, ts += 1) ray[sq].rayEE |= 1ull << ts;
			ray[sq].rwsEE = ray[sq].rayEE | 0x8000000000000000;

			// South
			ray[sq].raySS = 0;
			for (c = 1, ts = sq - 8; rank - c >= RANK1; c++, ts -= 8) ray[sq].raySS |= 1ull << ts;
			ray[sq].rwsSS = ray[sq].raySS | 0x0000000000000001;

			// West
			ray[sq].rayWW = 0;
			for (c = 1, ts = sq - 1; file - c >= FILE1; c++, ts -= 1) ray[sq].rayWW |= 1ull << ts;
			ray[sq].rwsWW = ray[sq].rayWW | 0x0000000000000001;
		}
		return ray;
	}
	constexpr std::array<Rays, 64> ray = Initialize();
	constexpr auto size = sizeof(ray);

	uint64_t Queen(int sq, uint64_t occ) {
		unsigned long iNW, iNN, iNE, iEE, iSE, iSS, iSW, iWW = 0;
		uint64_t r1, r2, r3, r4, r5, r6;
		occ |= 0x8000000000000001;
		r1 = ray[sq].rwsNW & occ;
		r2 = ray[sq].rwsNN & occ;
		r3 = ray[sq].rwsNE & occ;
		r4 = ray[sq].rwsEE & occ;
		iNW = std::countr_zero(r1);
		iNN = std::countr_zero(r2);
		iNE = std::countr_zero(r3);
		iEE = std::countr_zero(r4);
		r1 = ray[sq].rayNW ^ ray[iNW].rayNW;
		r2 = ray[sq].rayNN ^ ray[iNN].rayNN;
		r3 = ray[sq].rayNE ^ ray[iNE].rayNE;
		r4 = ray[sq].rayEE ^ ray[iEE].rayEE;
		r5 = r1 | r2;
		r6 = r3 | r4;
		r1 = ray[sq].rwsSE & occ;
		r2 = ray[sq].rwsSS & occ;
		r3 = ray[sq].rwsSW & occ;
		r4 = ray[sq].rwsWW & occ;
		iSE = 63 - std::countl_zero(r1);
		iSS = 63 - std::countl_zero(r2);
		iSW = 63 - std::countl_zero(r3);
		iWW = 63 - std::countl_zero(r4);
		r1 = ray[sq].raySE ^ ray[iSE].raySE;
		r2 = ray[sq].raySS ^ ray[iSS].raySS;
		r3 = ray[sq].raySW ^ ray[iSW].raySW;
		r4 = ray[sq].rayWW ^ ray[iWW].rayWW;
		r1 = r1 | r3;
		r2 = r2 | r4;
		r3 = r5 | r6;
		return r1 | r2 | r3;
	}

Produces 244.82MOps performance on queen lookups:

Code: Select all

Megalookups/s:
Reference:      118.25MOps
KoggeStone:     112.38MOps
BobMike:        244.82MOps
FancyHash:      506.22MOps
Pext  :         852.53MOps
HyperLookup:    926.42MOps

With the assembly here:
https://godbolt.org/z/qdbsYdnWY

Code: Select all

Queen(int, unsigned long):                             # @Queen(int, unsigned long)
        push    r14
        push    rbx
        movsxd  rax, edi
        movabs  rcx, -9223372036854775807
        shl     rax, 7
        or      rcx, rsi
        mov     rsi, qword ptr [rax + ray+72]
        mov     rdi, qword ptr [rax + ray+80]
        mov     r8, qword ptr [rax + ray+64]
        mov     rdx, qword ptr [rax + ray+88]
        mov     rbx, qword ptr [rax + ray+96]
        and     rsi, rcx
        and     rdi, rcx
        and     r8, rcx
        and     rdx, rcx
        and     rbx, rcx
        tzcnt   r9, rsi
        tzcnt   r10, rdi
        mov     rsi, qword ptr [rax + ray+104]
        mov     rdi, qword ptr [rax + ray+112]
        tzcnt   r11, rdx
        lzcnt   r14, rbx
        mov     edx, 63
        mov     ebx, 63
        tzcnt   r8, r8
        sub     rdx, r14
        shl     rdx, 7
        shl     r10, 7
        shl     r11, 7
        shl     r8, 7
        shl     r9, 7
        vmovq   xmm2, qword ptr [rdx + ray+32] # xmm2 = mem[0],zero
        vmovq   xmm3, qword ptr [r8 + ray]  # xmm3 = mem[0],zero
        and     rsi, rcx
        and     rdi, rcx
        and     rcx, qword ptr [rax + ray+120]
        lzcnt   r14, rsi
        mov     esi, 63
        sub     rsi, r14
        lzcnt   r14, rdi
        mov     edi, 63
        sub     rdi, r14
        shl     rsi, 7
        shl     rdi, 7
        lzcnt   rcx, rcx
        vmovq   xmm1, qword ptr [rdi + ray+48] # xmm1 = mem[0],zero
        sub     rbx, rcx
        shl     rbx, 7
        vmovq   xmm0, qword ptr [rbx + ray+56] # xmm0 = mem[0],zero
        vpunpcklqdq     xmm0, xmm1, xmm0        # xmm0 = xmm1[0],xmm0[0]
        vmovq   xmm1, qword ptr [rsi + ray+40] # xmm1 = mem[0],zero
        vpunpcklqdq     xmm1, xmm2, xmm1        # xmm1 = xmm2[0],xmm1[0]
        vmovq   xmm2, qword ptr [r10 + ray+16] # xmm2 = mem[0],zero
        vinserti128     ymm0, ymm1, xmm0, 1
        vmovq   xmm1, qword ptr [r11 + ray+24] # xmm1 = mem[0],zero
        vpxor   ymm0, ymm0, ymmword ptr [rax + ray+32]
        vpunpcklqdq     xmm1, xmm2, xmm1        # xmm1 = xmm2[0],xmm1[0]
        vmovq   xmm2, qword ptr [r9 + ray+8] # xmm2 = mem[0],zero
        vpunpcklqdq     xmm2, xmm3, xmm2        # xmm2 = xmm3[0],xmm2[0]
        vinserti128     ymm1, ymm2, xmm1, 1
        vpxor   ymm1, ymm1, ymmword ptr [rax + ray]
        vpor    ymm0, ymm1, ymm0
        vextracti128    xmm1, ymm0, 1
        vpor    xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        vpor    xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        pop     rbx
        pop     r14
        vzeroupper
        ret

tcusr · Post by **tcusr** » Sat Dec 11, 2021 6:08 pm

why is it faster if the table available at compile? what kind of optimization does the compiler do?

thank you for your tests! would my mind adding hyperbola quintessence?

dangi12012 · Post by **dangi12012** » Sat Dec 11, 2021 6:17 pm

tcusr wrote: ↑Sat Dec 11, 2021 6:08 pm why is it faster if the table available at compile? what kind of optimization does the compiler do?

thank you for your tests! would my mind adding hyperbola quintessence?

I dont mind adding it. Do you have a repo where it is implemented?
Ideally i would have 1 file with this interface:

Code: Select all

void Init()
uint64_t Queen(int sq, uint64_t occ)

So please send me an URL!

it was faster because I replaced old msvc intrinsic the std::countlz.
constexpr does not make it faster when its unknown. BUT: when you know square and submit 0 instead of x the compiler will remove almost all code. So in general case its same speed. But when you have an engine that loops over the squares from 0 to 64 it will be much much faster!

When i say sq = 2 the assembly is much smaller and the compiler knows what the content is!

Code: Select all

Queen(unsigned long):                              # @Queen(unsigned long)
        movabs  rax, -9223372036854775807
        movabs  rdx, -8644650654150162432
        movabs  rsi, -9223371486023118848
        movabs  r9, 578721382704613376
        or      rdi, rax
        lea     rcx, [rax + 16909311]
        add     rax, 239
        and     rcx, rdi
        and     rdx, rdi
        and     rax, rdi
        and     rsi, rdi
        and     edi, 7
        tzcnt   rcx, rcx
        tzcnt   rdx, rdx
        tzcnt   r8, rax
        mov     eax, 16909312
        tzcnt   rsi, rsi
        shl     rcx, 7
        shl     rdx, 7
        shl     rsi, 7
        shl     r8, 7
        xor     r9, qword ptr [rdx + ray+8]
        xor     rax, qword ptr [rcx + ray]
        mov     edx, 240
        xor     rdx, qword ptr [r8 + ray+24]
        or      r9, rax
        movabs  rax, 550831656960
        xor     rax, qword ptr [rsi + ray+16]
        or      rdx, rax
        lzcnt   rax, rdi
        xor     rax, 63
        or      rdx, r9
        shl     rax, 7
        mov     rax, qword ptr [rax + ray+56]
        xor     rax, 7
        or      rax, rdx
        ret

dangi12012 · Post by **dangi12012** » Sat Dec 11, 2021 6:21 pm

Update:
Sometimes I wonder how verbose the code is from other people. It hides the simple lookup code which is really just that:
Robert Hyatt's and Michael Sherwin's classical bitboard approach is this:

Code: Select all

uint64_t Queen(int sq, uint64_t occ) {
		uint64_t result = 0;
		occ |= 0x8000000000000001;

		result |= ray[sq].rayNW ^ ray[std::countr_zero(ray[sq].rwsNW & occ)].rayNW;
		result |= ray[sq].rayNN ^ ray[std::countr_zero(ray[sq].rwsNN & occ)].rayNN;
		result |= ray[sq].rayNE ^ ray[std::countr_zero(ray[sq].rwsNE & occ)].rayNE;
		result |= ray[sq].rayEE ^ ray[std::countr_zero(ray[sq].rwsEE & occ)].rayEE;

		result |= ray[sq].raySE ^ ray[63 - std::countl_zero(ray[sq].rwsSE & occ)].raySE;
		result |= ray[sq].raySS ^ ray[63 - std::countl_zero(ray[sq].rwsSS & occ)].raySS;
		result |= ray[sq].raySW ^ ray[63 - std::countl_zero(ray[sq].rwsSW & occ)].raySW;
		result |= ray[sq].rayWW ^ ray[63 - std::countl_zero(ray[sq].rwsWW & occ)].rayWW;

		return result;
	}

I see why this was done. Because 15 years ago the compilers didnt resolve the dependency chain in a good way.

This lookup is nice and clean. Just how it should be done. This can be improved by defining the rwsSE in a different way so that the inner 63 - will go away.
But that should be done by the original authors! - are they still active here or on other boards?

Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters