Combining two of Bob's classic bitboard attack getters

Mike Sherwin · Post by **Mike Sherwin** » Fri Nov 19, 2021 7:46 am

THIS ONE

# define AttacksBishop(square, occ) \
(plus7dir[square] ^ plus7dir[LSB(plus7dir[square] & (occ))] | \
plus9dir[square] ^ plus9dir[LSB(plus9dir[square] & (occ))] | \
minus7dir[square] ^ minus7dir[MSB(minus7dir[square] & (occ))] | \
minus9dir[square] ^ minus9dir[MSB(minus9dir[square] & (occ))])

AND THIS ONE

Code: Select all

/*As mentioned by Robert Hyatt [2], instead of fetching four ray-attacks on the otherwise empty board, one may already use the rook- or bishop attacks to reset outer squares from that union set.

A further improvement was suggested by Michael Sherwin [3], to union the occupancy with the outer bits 0 and 63. Together with appropriate bits set in separate ray-masks, this yields to an efficient branchless solution with 13 64-bit operations in total and 4.5 KByte for the lookup tables for both rooks and bishops each.*/

struct {
  U64 bitsN;  // bits North, including MSB (bit 63)
  U64 bitsE;  // bits East, including MSB
  U64 bitsS;  // bits South, including LSB (bit 0 == 1)
  U64 bitsW;  // bits West, including LSB 
} CACHE_ALIGN rayWstop[64]; 

U64 attacksEmpty[64];
U64 rayN[64];
U64 rayE[64];
U64 rayS[64];
U64 rayW[64];

U64 rookAttacks(U64 occ, unsigned int sq) {
   unsigned long ulN, ulE, ulS, ulW;
   occ |= C64(0x8000000000000001);
   _BitScanForward64(&ulN, occ & rayWstop[sq].bitsN);
   _BitScanForward64(&ulE, occ & rayWstop[sq].bitsE);
   _BitScanReverse64(&ulS, occ & rayWstop[sq].bitsS);
   _BitScanReverse64(&ulW, occ & rayWstop[sq].bitsW);
   return attacksEmpty[sq]^rayN[ulN]^rayE[ulE]^rayS[ulS]^rayW[ulW];
}

Actually I also have independently developed both of these classic styles of sliding piece move generators and posted about them. Mine of course were somewhat different. Today if I have not made any mistakes I have combined Bob's two with my two to make a slightly better one, I hope. Here is the code for a bishop.

Code: Select all

u64 allPieces = pieces[WHITE] | pieces[BLACK] | 0x8000000000000001; 
u32 NW, NN, NE, EE, SE, SS, SW, WW;    

	  case WB:
		_BitScanForward64(&NW, ray[fs].NW & allPieces);
		_BitScanForward64(&NE, ray[fs].NE & allPieces);
		_BitScanReverse64(&SE, ray[fs].SE & allPieces);
		_BitScanReverse64(&SW, ray[fs].SW & allPieces);
		bb[ply][fs] = (
			(ray[fs].NW ^ ray[NW].NW) |
			(ray[fs].NE ^ ray[NE].NE) |
			(ray[fs].SE ^ ray[SE].SE) |
			(ray[fs].SW ^ ray[SW].SW))
			& ~pieces[WHITE];
                        captures |= bb[ply][fs];
			break;

u64 allPieces has 0x8000000000000001 ored with it. And the appropriate stop bits are also in the ray structure. The stop bit simply assures that the source for BitScan is not empty. For increasing directions if there is no blocker then 63 will be placed in NW etc and ray[63].NW will be all zero bits and won't change anything when xored. The advantage over standard classic bitboards is that no 'if' statement is needed to check for an empty bb. It looks as though the dependency chains are short and should run in two pipelines efficiently. And the memory use is not bad either.

I'm just doing this to compare with my SISSY bitboards. Whichever one is faster will be good enough for me!

Mike Sherwin · Post by **Mike Sherwin** » Fri Nov 19, 2021 2:08 pm

There is something said wrong that needs to be said right.

For increasing directions if there is no blocker then 63 will be placed in NW etc and ray[63].NW will be all zero bits and won't change anything when xored.

They are not all all zero bits. Both have the stop bit set but the stop bits cancel each other out in the operation, (ray[fs].NW ^ ray[NW].NW). So if there is no mistake it is a lucky break.

EDIT: It is not a lucky break. It is a mistake. A bishop on a1 moving NE if there are no blockers should have bit 63 set. So the table with stop bits is just for the bit scans and a table without the stops is used for the final combining thus doubling the memory. I wonder if I have it right now?

dangi12012 · Post by **dangi12012** » Fri Nov 19, 2021 2:58 pm

Mike Sherwin wrote: ↑Fri Nov 19, 2021 7:46 am u64 allPieces has 0x8000000000000001 ored with it. And the appropriate stop bits are also in the ray structure. The stop bit simply assures that the source for BitScan is not empty. For increasing directions if there is no blocker then 63 will be placed in NW etc and ray[63].NW will be all zero bits and won't change anything when xored. The advantage over standard classic bitboards is that no 'if' statement is needed to check for an empty bb. It looks as though the dependency chains are short and should run in two pipelines efficiently. And the memory use is not bad either.

I'm just doing this to compare with my SISSY bitboards. Whichever one is faster will be good enough for me!

Put them both into this website:
https://quick-bench.com/

There you also get the button to view in compiler explorer.
So you have a benchmark + assembly on modern processors under linux and windows compilers.

Keep in mind that MSVC - GCC - CLANG get different results so better benchmark them all.

Mike Sherwin · Post by **Mike Sherwin** » Fri Nov 19, 2021 4:23 pm

dangi12012 wrote: ↑Fri Nov 19, 2021 2:58 pm
Mike Sherwin wrote: ↑Fri Nov 19, 2021 7:46 am u64 allPieces has 0x8000000000000001 ored with it. And the appropriate stop bits are also in the ray structure. The stop bit simply assures that the source for BitScan is not empty. For increasing directions if there is no blocker then 63 will be placed in NW etc and ray[63].NW will be all zero bits and won't change anything when xored. The advantage over standard classic bitboards is that no 'if' statement is needed to check for an empty bb. It looks as though the dependency chains are short and should run in two pipelines efficiently. And the memory use is not bad either.

I'm just doing this to compare with my SISSY bitboards. Whichever one is faster will be good enough for me!
Put them both into this website:
https://quick-bench.com/

There you also get the button to view in compiler explorer.
So you have a benchmark + assembly on modern processors under linux and windows compilers.

Keep in mind that MSVC - GCC - CLANG get different results so better benchmark them all.

Whichever one is faster will be good enough for me!

I'm going to try to get this working today in my chess engine Bricabrac. Bricabrac searches over 10 million moves per second on my 3950x running at 4.4Ghz using SISSY bitboards. I would just use PEXT if I had a cpu where PEXT is fast. But I don't. And I know a bit about PEXT because I suggested to INTEL and AMD that they should implement such an instruction in December of 2006! http://www.open-aurec.com/wbforum/viewt ... 962#p28725

dangi12012 · Post by **dangi12012** » Fri Nov 19, 2021 7:06 pm

Mike Sherwin wrote: ↑Fri Nov 19, 2021 4:23 pm
dangi12012 wrote: ↑Fri Nov 19, 2021 2:58 pm
Mike Sherwin wrote: ↑Fri Nov 19, 2021 7:46 am u64 allPieces has 0x8000000000000001 ored with it. And the appropriate stop bits are also in the ray structure. The stop bit simply assures that the source for BitScan is not empty. For increasing directions if there is no blocker then 63 will be placed in NW etc and ray[63].NW will be all zero bits and won't change anything when xored. The advantage over standard classic bitboards is that no 'if' statement is needed to check for an empty bb. It looks as though the dependency chains are short and should run in two pipelines efficiently. And the memory use is not bad either.

I'm just doing this to compare with my SISSY bitboards. Whichever one is faster will be good enough for me!
Put them both into this website:
https://quick-bench.com/

There you also get the button to view in compiler explorer.
So you have a benchmark + assembly on modern processors under linux and windows compilers.

Keep in mind that MSVC - GCC - CLANG get different results so better benchmark them all.

Whichever one is faster will be good enough for me!
I'm going to try to get this working today in my chess engine Bricabrac. Bricabrac searches over 10 million moves per second on my 3950x running at 4.4Ghz using SISSY bitboards. I would just use PEXT if I had a cpu where PEXT is fast. But I don't. And I know a bit about PEXT because I suggested to INTEL and AMD that they should implement such an instruction in December of 2006! http://www.open-aurec.com/wbforum/viewt ... 962#p28725

Cool stuff! Its also a perfect instruction for the LZMA compression algorithm.
Well then go for perfect hashing? It is only 2 instruction more than PEXT.

Code: Select all

Attacks[PEXT(mask, occ)] //PEXT
Attacks[((occ | squaremask) * squareseed) >> 52] //fancy magic

Mike Sherwin · Post by **Mike Sherwin** » Fri Dec 10, 2021 12:47 am

I rewrote it mirroring how I'd write it in assembler to take maximum advantage of superscalar processors. On a cpu that can run 4 alu in parallel it effectively reduces to 5 lines of code execution (because there are no dependency chains) to get an attack bitboard. This may be for someone looking for something simple with small data that is not worried about the fastest possible.

Code Snippets

Code: Select all

// DEFINITIONS
using s32 = int;
using u32 = unsigned long;
using u64 = unsigned long long;
enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };
enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };
struct Ray {
	u64 NW;
	u64 NN;
	u64 NE;
	u64 EE;
	u64 SE;
	u64 SS;
	u64 SW;
	u64 WW;
};

// DATA
Ray ray[64];
Ray rws[64]; // ray with stop

// INITIALIZATION
// sq = 0 to 63
		x = sq & 7;
		y = sq >> 3;
		// Northwest
		ray[sq].NW = 0;  
		for (c = 1, ts = sq + 7; x - c >= FILE1 && y + c <= RANK8; c++, ts += 7) ray[sq].NW |= 1ull << ts;
		rws[sq].NW = ray[sq].NW & 0x8000000000000000;

		// Northeast
		ray[sq].NE = 0;
		for (c = 1, ts = sq + 9; x + c <= FILE8 && y + c <= RANK8; c++, ts += 9) ray[sq].NE |= 1ull << ts;
		rws[sq].NE = ray[sq].NE & 0x8000000000000000;

		// Southeast
		ray[sq].SE = 0;
		for (c = 1, ts = sq - 7; x + c <= FILE8 && y - c >= RANK1; c++, ts -= 7) ray[sq].SE |= 1ull << ts;
		rws[sq].SE = ray[sq].SE & 0x0000000000000001;

		// Southwest
		ray[sq].SW = 0;
		for (c = 1, ts = sq - 9; x - c >= FILE1 && y - c >= RANK1; c++, ts -= 9) ray[sq].SW |= 1ull << ts;
		rws[sq].SE = ray[sq].SE & 0x0000000000000001;
		
// Gen
	u64 allPieces, bb, r1, r2, r3, r4; // r# for a cpu register
	u32 fs, NW, NN, NE, EE, SE, SS, SW, WW;
        allPieces = pieces[WHITE] | pieces[BLACK] | 0x8000000000000001; // occupied with stops
        
		case WB:
			r1 = rws[fs].NW & allPieces;
			r2 = rws[fs].NE & allPieces;
			r3 = rws[fs].SE & allPieces;
			r4 = rws[fs].SW & allPieces;
			_BitScanForward64(&NW, r1);
			_BitScanForward64(&NE, r2);
			_BitScanReverse64(&SE, r3);
			_BitScanReverse64(&SW, r4);
			r1 = ray[fs].NW ^ ray[NW].NW;
			r2 = ray[fs].NE ^ ray[NE].NE;
			r3 = ray[fs].SE ^ ray[SE].SE;
			r4 = ray[fs].SW ^ ray[SW].SW;
			r1 = r1 | r3;
			r2 = r2 | r4;
			bb = r1 | r2;
			break;

dangi12012 · Post by **dangi12012** » Fri Dec 10, 2021 1:04 am

Mike Sherwin wrote: ↑Fri Dec 10, 2021 12:47 am I rewrote it mirroring how I'd write it in assembler to take maximum advantage of superscalar processors. On a cpu that can run 4 alu in parallel it effectively reduces to 5 lines of code execution (because there are no dependency chains) to get an attack bitboard. This may be for someone looking for something simple with small data that is not worried about the fastest possible.

Code Snippets

Code: Select all

// DEFINITIONS
using s32 = int;
using u32 = unsigned long;
using u64 = unsigned long long;
enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };
enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };
struct Ray {
	u64 NW;
	u64 NN;
	u64 NE;
	u64 EE;
	u64 SE;
	u64 SS;
	u64 SW;
	u64 WW;
};

// DATA
Ray ray[64];
Ray rws[64]; // ray with stop

// INITIALIZATION
// sq = 0 to 63

		// Northwest
		ray[sq].NW = 0;  
		for (c = 1, ts = sq + 7; x - c >= FILE1 && y + c <= RANK8; c++, ts += 7) ray[sq].NW |= 1ull << ts;
		rws[sq].NW = ray[sq].NW & 0x8000000000000000;

		// Northeast
		ray[sq].NE = 0;
		for (c = 1, ts = sq + 9; x + c <= FILE8 && y + c <= RANK8; c++, ts += 9) ray[sq].NE |= 1ull << ts;
		rws[sq].NE = ray[sq].NE & 0x8000000000000000;

		// Southeast
		ray[sq].SE = 0;
		for (c = 1, ts = sq - 7; x + c <= FILE8 && y - c >= RANK1; c++, ts -= 7) ray[sq].SE |= 1ull << ts;
		rws[sq].SE = ray[sq].SE & 0x0000000000000001;

		// Southwest
		ray[sq].SW = 0;
		for (c = 1, ts = sq - 9; x - c >= FILE1 && y - c >= RANK1; c++, ts -= 9) ray[sq].SW |= 1ull << ts;
		rws[sq].SE = ray[sq].SE & 0x0000000000000001;
		
// Gen
	u64 allPieces, bb, r1, r2, r3, r4; // r# for a cpu register
	u32 fs, NW, NN, NE, EE, SE, SS, SW, WW;
        allPieces = pieces[WHITE] | pieces[BLACK] | 0x8000000000000001; // occupied with stops
        
		case WB:
			r1 = rws[fs].NW & allPieces;
			r2 = rws[fs].NE & allPieces;
			r3 = rws[fs].SE & allPieces;
			r4 = rws[fs].SW & allPieces;
			_BitScanForward64(&NW, r1);
			_BitScanForward64(&NE, r2);
			_BitScanReverse64(&SE, r3);
			_BitScanReverse64(&SW, r4);
			r1 = ray[fs].NW ^ ray[NW].NW;
			r2 = ray[fs].NE ^ ray[NE].NE;
			r3 = ray[fs].SE ^ ray[SE].SE;
			r4 = ray[fs].SW ^ ray[SW].SW;
			r1 = r1 & r3;
			r2 = r2 & r4;
			bb = r1 & r2;
			break;

Could you send a full self contained Code sample that has the interface:
uint64_t Queen(int square, uint64_t occupy)

I would love to add this to the benchmark comparison of hypercube before release

Mike Sherwin · Post by **Mike Sherwin** » Fri Dec 10, 2021 1:09 am

dangi12012 wrote: ↑Fri Dec 10, 2021 1:04 am

Mike Sherwin wrote: ↑Fri Dec 10, 2021 12:47 am I rewrote it mirroring how I'd write it in assembler to take maximum advantage of superscalar processors. On a cpu that can run 4 alu in parallel it effectively reduces to 5 lines of code execution (because there are no dependency chains) to get an attack bitboard. This may be for someone looking for something simple with small data that is not worried about the fastest possible.

Code Snippets

Code: Select all

// DEFINITIONS
using s32 = int;
using u32 = unsigned long;
using u64 = unsigned long long;
enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };
enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };
struct Ray {
	u64 NW;
	u64 NN;
	u64 NE;
	u64 EE;
	u64 SE;
	u64 SS;
	u64 SW;
	u64 WW;
};

// DATA
Ray ray[64];
Ray rws[64]; // ray with stop

// INITIALIZATION
// sq = 0 to 63

		// Northwest
		ray[sq].NW = 0;  
		for (c = 1, ts = sq + 7; x - c >= FILE1 && y + c <= RANK8; c++, ts += 7) ray[sq].NW |= 1ull << ts;
		rws[sq].NW = ray[sq].NW & 0x8000000000000000;

		// Northeast
		ray[sq].NE = 0;
		for (c = 1, ts = sq + 9; x + c <= FILE8 && y + c <= RANK8; c++, ts += 9) ray[sq].NE |= 1ull << ts;
		rws[sq].NE = ray[sq].NE & 0x8000000000000000;

		// Southeast
		ray[sq].SE = 0;
		for (c = 1, ts = sq - 7; x + c <= FILE8 && y - c >= RANK1; c++, ts -= 7) ray[sq].SE |= 1ull << ts;
		rws[sq].SE = ray[sq].SE & 0x0000000000000001;

		// Southwest
		ray[sq].SW = 0;
		for (c = 1, ts = sq - 9; x - c >= FILE1 && y - c >= RANK1; c++, ts -= 9) ray[sq].SW |= 1ull << ts;
		rws[sq].SE = ray[sq].SE & 0x0000000000000001;
		
// Gen
	u64 allPieces, bb, r1, r2, r3, r4; // r# for a cpu register
	u32 fs, NW, NN, NE, EE, SE, SS, SW, WW;
        allPieces = pieces[WHITE] | pieces[BLACK] | 0x8000000000000001; // occupied with stops
        
		case WB:
			r1 = rws[fs].NW & allPieces;
			r2 = rws[fs].NE & allPieces;
			r3 = rws[fs].SE & allPieces;
			r4 = rws[fs].SW & allPieces;
			_BitScanForward64(&NW, r1);
			_BitScanForward64(&NE, r2);
			_BitScanReverse64(&SE, r3);
			_BitScanReverse64(&SW, r4);
			r1 = ray[fs].NW ^ ray[NW].NW;
			r2 = ray[fs].NE ^ ray[NE].NE;
			r3 = ray[fs].SE ^ ray[SE].SE;
			r4 = ray[fs].SW ^ ray[SW].SW;
			r1 = r1 & r3;
			r2 = r2 & r4;
			bb = r1 & r2;
			break;

Could you send a full self contained Code sample that has the interface:
uint64_t Queen(int square, uint64_t occupy)

I would love to add this to the benchmark comparison of hypercube before release

I'll try. I'm not very fast though. I already made a mistake using & in the final 3 lines instead of |.

Mike Sherwin · Post by **Mike Sherwin** » Fri Dec 10, 2021 3:06 am

dangi12012 wrote: ↑Fri Dec 10, 2021 1:04 am

Mike Sherwin wrote: ↑Fri Dec 10, 2021 12:47 am I rewrote it mirroring how I'd write it in assembler to take maximum advantage of superscalar processors. On a cpu that can run 4 alu in parallel it effectively reduces to 5 lines of code execution (because there are no dependency chains) to get an attack bitboard. This may be for someone looking for something simple with small data that is not worried about the fastest possible.

Code Snippets

Code: Select all

// DEFINITIONS
using s32 = int;
using u32 = unsigned long;
using u64 = unsigned long long;
enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };
enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };
struct Ray {
	u64 NW;
	u64 NN;
	u64 NE;
	u64 EE;
	u64 SE;
	u64 SS;
	u64 SW;
	u64 WW;
};

// DATA
Ray ray[64];
Ray rws[64]; // ray with stop

// INITIALIZATION
// sq = 0 to 63

		// Northwest
		ray[sq].NW = 0;  
		for (c = 1, ts = sq + 7; x - c >= FILE1 && y + c <= RANK8; c++, ts += 7) ray[sq].NW |= 1ull << ts;
		rws[sq].NW = ray[sq].NW & 0x8000000000000000;

		// Northeast
		ray[sq].NE = 0;
		for (c = 1, ts = sq + 9; x + c <= FILE8 && y + c <= RANK8; c++, ts += 9) ray[sq].NE |= 1ull << ts;
		rws[sq].NE = ray[sq].NE & 0x8000000000000000;

		// Southeast
		ray[sq].SE = 0;
		for (c = 1, ts = sq - 7; x + c <= FILE8 && y - c >= RANK1; c++, ts -= 7) ray[sq].SE |= 1ull << ts;
		rws[sq].SE = ray[sq].SE & 0x0000000000000001;

		// Southwest
		ray[sq].SW = 0;
		for (c = 1, ts = sq - 9; x - c >= FILE1 && y - c >= RANK1; c++, ts -= 9) ray[sq].SW |= 1ull << ts;
		rws[sq].SE = ray[sq].SE & 0x0000000000000001;
		
// Gen
	u64 allPieces, bb, r1, r2, r3, r4; // r# for a cpu register
	u32 fs, NW, NN, NE, EE, SE, SS, SW, WW;
        allPieces = pieces[WHITE] | pieces[BLACK] | 0x8000000000000001; // occupied with stops
        
		case WB:
			r1 = rws[fs].NW & allPieces;
			r2 = rws[fs].NE & allPieces;
			r3 = rws[fs].SE & allPieces;
			r4 = rws[fs].SW & allPieces;
			_BitScanForward64(&NW, r1);
			_BitScanForward64(&NE, r2);
			_BitScanReverse64(&SE, r3);
			_BitScanReverse64(&SW, r4);
			r1 = ray[fs].NW ^ ray[NW].NW;
			r2 = ray[fs].NE ^ ray[NE].NE;
			r3 = ray[fs].SE ^ ray[SE].SE;
			r4 = ray[fs].SW ^ ray[SW].SW;
			r1 = r1 & r3;
			r2 = r2 & r4;
			bb = r1 & r2;
			break;

Could you send a full self contained Code sample that has the interface:
uint64_t Queen(int square, uint64_t occupy)

I would love to add this to the benchmark comparison of hypercube before release

Here is a start. I'll try to add some test code tomorrow.

Code: Select all

// Bob.Mike
// Example of Robert Hyatt's and Michael Sherwin's classical bitboard approach
// to generate moves for the sliding pieces.

#include <cstdint>
#include <intrin.h>

enum { FILE1, FILE2, FILE3, FILE4, FILE5, FILE6, FILE7, FILE8 };

enum { RANK1, RANK2, RANK3, RANK4, RANK5, RANK6, RANK7, RANK8 };

struct Ray {
	uint64_t rayNW;
	uint64_t rayNN;
	uint64_t rayNE;
	uint64_t rayEE;
	uint64_t raySE;
	uint64_t raySS;
	uint64_t raySW;
	uint64_t rayWW;
	uint64_t rwsNW;
	uint64_t rwsNN;
	uint64_t rwsNE;
	uint64_t rwsEE;
	uint64_t rwsSE;
	uint64_t rwsSS;
	uint64_t rwsSW;
	uint64_t rwsWW;
};

Ray ray[64];

uint64_t Queen(int sq, uint64_t occ) {
	unsigned long iNW, iNN, iNE, iEE, iSE, iSS, iSW, iWW;
	uint64_t r1, r2, r3, r4, r5, r6;
	r1 = ray[sq].rwsNW & occ;
	r2 = ray[sq].rwsNN & occ;
	r3 = ray[sq].rwsNE & occ;
	r4 = ray[sq].rwsEE & occ;
	_BitScanForward64(&iNW, r1);
	_BitScanForward64(&iNN, r2);
	_BitScanForward64(&iNE, r3);
	_BitScanForward64(&iEE, r4);
	r1 = ray[sq].rayNW ^ ray[iNW].rayNW;
	r2 = ray[sq].rayNN ^ ray[iNN].rayNN;
	r3 = ray[sq].rayNE ^ ray[iNE].rayNE;
	r4 = ray[sq].rayEE ^ ray[iEE].rayEE;	
	r5 = r1 | r2;
	r6 = r3 | r4;
	r1 = ray[sq].rwsSE & occ;
	r2 = ray[sq].rwsSS & occ;
	r3 = ray[sq].rwsSW & occ;
	r4 = ray[sq].rwsWW & occ;
	_BitScanReverse64(&iSE, r1);
	_BitScanReverse64(&iSS, r2);
	_BitScanReverse64(&iSW, r3);
	_BitScanReverse64(&iWW, r4);
	r1 = ray[sq].raySE ^ ray[iSE].raySE;
	r2 = ray[sq].raySS ^ ray[iSS].raySS;
	r3 = ray[sq].raySW ^ ray[iSW].raySW;
	r4 = ray[sq].rayWW ^ ray[iWW].rayWW;	
	r1 = r1 | r3;
	r2 = r2 | r4;
	r3 = r5 | r6;
	return r1 | r2 | r3;
}

void Initialize() {
	int sq, ts, file, rank, c;

	for (sq = 0; sq <= 63; sq++) {
		file = sq & 7;
		rank = sq >> 3;

		// Northwest
		ray[sq].rayNW = 0;
		for (c = 1, ts = sq + 7; file - c >= FILE1 && rank + c <= RANK8; c++, ts += 7) ray[sq].rayNW |= 1ull << ts;
		ray[sq].rwsNW = ray[sq].rayNW & 0x8000000000000000;

		// Northeast
		ray[sq].rayNE = 0;
		for (c = 1, ts = sq + 9; file + c <= FILE8 && rank + c <= RANK8; c++, ts += 9) ray[sq].rayNE |= 1ull << ts;
		ray[sq].rwsNE = ray[sq].rayNE & 0x8000000000000000;

		// Southeast
		ray[sq].raySE = 0;
		for (c = 1, ts = sq - 7; file + c <= FILE8 && rank - c <= RANK8; c++, ts -= 7) ray[sq].raySE |= 1ull << ts;
		ray[sq].rwsSE = ray[sq].raySE & 0x0000000000000001;

		// Southwest
		ray[sq].raySW = 0;
		for (c = 1, ts = sq - 9; file - c >= FILE1 && rank - c <= RANK8; c++, ts -= 9) ray[sq].raySW |= 1ull << ts;
		ray[sq].rwsSW = ray[sq].raySW & 0x0000000000000001;

		// North
		ray[sq].rayNN = 0;
		for (c = 1, ts = sq + 8; rank + c <= RANK8; c++, ts += 8) ray[sq].rayNN |= 1ull << ts;
		ray[sq].rwsNN = ray[sq].rayNN & 0x8000000000000000;

		// East
		ray[sq].rayEE = 0;
		for (c = 1, ts = sq + 1; file + c <= FILE8; c++, ts += 1) ray[sq].rayEE |= 1ull << ts;
		ray[sq].rwsEE = ray[sq].rayEE & 0x8000000000000000;

		// South
		ray[sq].raySS = 0;
		for (c = 1, ts = sq - 8; rank - c >= RANK1; c++, ts -= 8) ray[sq].raySS |= 1ull << ts;
		ray[sq].rwsSS = ray[sq].raySS & 0x0000000000000001;

		// West
		ray[sq].rayWW = 0;
		for (c = 1, ts = sq - 1; file - c >= FILE1; c++, ts -= 1) ray[sq].rayWW |= 1ull << ts;
		ray[sq].rwsEE = ray[sq].raySS & 0x0000000000000001;
	}
}

int main() {
	Initialize();
}

dangi12012 · Post by **dangi12012** » Fri Dec 10, 2021 1:09 pm

Mike Sherwin wrote: ↑Fri Dec 10, 2021 3:06 am

The code is still buggy - i use this to print a bitboard and can compare reference vs output:

Code: Select all

static std::string _map(uint64_t value)
{
	static std::string str(64 + 8, 'o');
	for (uint64_t i = 0, c = 0; i < 64; i++)
	{
		uint64_t bitmask = (1ull) << i;

		if ((bitmask & value) != 0) str[c++] = 'X';
		else str[c++] = '.';

		if ((i + 1) % 8 == 0) str[c++] = '\n';
	}
	return str;
}

But there seems to be a bug: (after initialize was called!)
OCC = 6822289156618217916, pos = 2

Code: Select all

Compare_Engines:
..XXXX.X
X..X.XX.
.X.XXX..
XXX....X
XXX.X.XX
XX...X.X
X.XX.X.X
.XXXX.X.

Reference
XX.X....
.XXX....
X.X.....
..X.....
........
........
........
........

Bob:
XXX.....
XXXX...X
X.X.X.X.
X.XX.X.X
X.X.X.X.
X.XX.X.X
X.X.X.X.
XXXX...X

also think of using _BitScanForward64 -> std::countr_zero. So its compatible with gcc and clang.
Greetings!

Combining two of Bob's classic bitboard attack getters

Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters

Re: Combining two of Bob's classic bitboard attack getters