Michael Sherwin wrote:If memory reads are bad for performance and so is too much memory use and even multiplication to some extent then maybe my most up to date folding rook attacks might be interesting.
It minimizes memory reads, uses far less memory than magic, no multiplication and should run in two pipes very well. And I hope that it can be further optimized. Although it should already be quite fast!
code snipped
For those unfamiliar with supersets think of them like this. If a single blocker is present then there is an attack set for it. Likewise for a second blocker. Now, if both blockers are present then each set above is a superset of the true set. By anding both supersets together we get the true set. Each blocker that is part of an index has a superset and all those supersets are anded together at each index location.
Breaking the large 12 bit index into 2 6 bit index is what saves the huge amount of memory. The two superset tables are only 2 * 64 * 64 * 8 = 65536 bytes! And that is cheap for a rook!!
Interesting code and freaky superset lookup idea, real bit-twiddling
Yes, there are zillions of variations on that matter to map the occupied bits and to index tables by square/occindex or more denser with file-rank/occindex but slight more trailing computation. Intersecting supersets, or more common, the union of subsets of disjoint rank- and file attacks. f.i. derived from
Occupancy of any Line without kindergarten mul and same 2 * 32KByte rotated like memory layout and number of operations in 64-bit and 32-bit mode:
Code: Select all
u64 RookAttacks(u32 sq, u64 occ) { // 64 (32)-bit mode
u32 i = inner6RankIndex(sq, occ); // 4 ( 5)
u32 j = inner6FileIndex(sq, occ); // 11 (13)
return rankAttacks[sq][i] | fileAttacks[sq][j]; // 1 ( 2)
} // 16 (20)
// 4 operations (5 32-bits)
int inner6RankIndex(u32 sq, u64 occ)
{
u32 rx8p1 = (sq & 56) + 1;
return (occ >> rx8p1) & 63;
}
#ifdef _64_BITS
// 11 operations
int inner6FileIndex(u32 sq, u64 occ)
{
u32 file = sq & 7;
occ = occ >> file;
occ &= C64(0x0001010101010100):
occ |= occ >> 28;
occ |= occ >> 14;
occ |= occ >> 7;
return(occ >> 1) & 63;
}
#else
// 13 32-bit operations
int inner6FileIndex(u32 sq, u64 occ)
{
u32 file = sq & 7;
u32 l = occ;
u32 h = occ >> 32; // nop
l >>= file;
h >>= file;
l &= 0x01010100;
h &= 0x00010101;
l = (l >> 8) | (h << 3);
l |= l >> 14;
l |= l >> 7;
return l & 63;
}
#endif
Your 32-bit optimized, slightly modified (common expressions and >> 32, which is nop in 32-bit mode) routine seems a bit more expensive in 64-bit mode, but the same 20 operations in 32-bit mode, if I count operations correctly.
Code: Select all
u64 RookAttacks(u32 sq, u64 occ) { // 64 (32)-bit mode
u32 file = sq & 7; // 1 ( 1)
u32 l = ((u32) occ >> (file + 7)) & 0x00020202; // 3 ( 3)
u32 h = ((u32)( occ >> 32) >> file) & 0x00010101; // 3 ( 2)
u32 r = ((u32)((occ << 1) >> (sq - file))) & 0x000000f6; // 4 ( 5)
l = l | h; // 1 ( 1)
r = r | l; // 1 ( 1)
u32 i = r & 0x0000003f; // 1 ( 1)
u32 j = ((r >> 6) | (l >> 12)) & 0x0000003f; // 4 ( 4)
return rookSuperSet1[sq][i] & rookSuperSet2[sq][j]; // 1 ( 2)
} // 19 (20)
19/20 operations for 64-Kbyte rook tables seems a bit too much compared to other approaches.
Gerd