Modified old 64 bit attack getter

Michael Sherwin · Post by **Michael Sherwin** » Sun Dec 06, 2009 8:46 pm

I did not even remember having written it in the first place, however on recollection I remember Gerd helping me with it.

Code: Select all

typedef struct { 
  u16 idx[64]; 
  u64 bitsN; 
  u64 bitsE; 
  u64 bitsS; 
  u64 bitsW; 
} square; 

u64 rookAttacks(u64 occ, square *sq) { 
    unsigned long ln, le, ls, lw; 
    occ |= fourCorners; 
    _BitScanForward64(&ln, occ & sq->bitsN); 
    _BitScanForward64(&le, occ & sq->bitsE); 
    _BitScanReverse64(&ls, occ & sq->bitsS); 
    _BitScanReverse64(&lw, occ & sq->bitsW); 
    u32 idx = sq->idx[ln] | sq->idx[le] | sq->idx[ls] | sq->idx[lw]; 
    return rookLookup[idx];
}

This modification greatly reduces the memory requirement. I wonder why I did not think of it before;

Code: Select all

typedef struct {
  u64 bitsN[64];    // bits North, no stop bit
  u64 bitsNs;       // bits North, including a stop bit
  u64 bitsE[64];
  u64 bitsEs;
  u64 bitsS[64];
  u64 bitsSs;
  u64 bitsW[64];
  u64 bitsWs;
} square;

square squares[64];

/*The 0 bit and 63 bit are used as stops so a bit is always found. They are set in occ and the correct bit is set in the directional bits (i.e. bitsNs). Also the stop bits used as index will return the bits for an empty ray for the given direction.*/

u64 rookAttacks(u64 occ, square *sq) { 
    u32 iN, iE, iS, iW; 
    occ |= 0x8000000000000001; 
    _BitScanForward64(&iN, occ & sq->bitsNs); 
    _BitScanForward64(&iE, occ & sq->bitsEs); 
    _BitScanReverse64(&iS, occ & sq->bitsSs); 
    _BitScanReverse64(&iW, occ & sq->bitsWs); 
    return (sq->bitsN[iN] | sq->bitsE[iE] | sq->bitsS[iS] | sq->bitsW[iW]); 
}

Edit1: An hallucination made me think that it was broken so I deleted it. However, it seems correct! so, for example, sq->bitsNs has all the possible attack bits set, while sq->bitsN just have the bits that can be reached given an index.

Edit2: Okay, now I see my mistake. sq->bitsN etc. need to be an array.

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Dec 07, 2009 7:33 am

Michael Sherwin wrote:I did not even remember having written it in the first place, however on recollection I remember Gerd helping me with it.
Code: Select all
typedef struct { 
  u16 idx[64]; 
  u64 bitsN; 
  u64 bitsE; 
  u64 bitsS; 
  u64 bitsW; 
} square; 

u64 rookAttacks(u64 occ, square *sq) { 
    unsigned long ln, le, ls, lw; 
    occ |= fourCorners; 
    _BitScanForward64(&ln, occ & sq->bitsN); 
    _BitScanForward64(&le, occ & sq->bitsE); 
    _BitScanReverse64(&ls, occ & sq->bitsS); 
    _BitScanReverse64(&lw, occ & sq->bitsW); 
    u32 idx = sq->idx[ln] | sq->idx[le] | sq->idx[ls] | sq->idx[lw]; 
    return rookLookup[idx];
}
This modification greatly reduces the memory requirement. I wonder why I did not think of it before;
Code: Select all
typedef struct {
  u64 bitsN[64];    // bits North, no stop bit
  u64 bitsNs;       // bits North, including a stop bit
  u64 bitsE[64];
  u64 bitsEs;
  u64 bitsS[64];
  u64 bitsSs;
  u64 bitsW[64];
  u64 bitsWs;
} square;

square squares[64];

/*The 0 bit and 63 bit are used as stops so a bit is always found. They are set in occ and the correct bit is set in the directional bits (i.e. bitsNs). Also the stop bits used as index will return the bits for an empty ray for the given direction.*/

u64 rookAttacks(u64 occ, square *sq) { 
    u32 iN, iE, iS, iW; 
    occ |= 0x8000000000000001; 
    _BitScanForward64(&iN, occ & sq->bitsNs); 
    _BitScanForward64(&iE, occ & sq->bitsEs); 
    _BitScanReverse64(&iS, occ & sq->bitsSs); 
    _BitScanReverse64(&iW, occ & sq->bitsWs); 
    return (sq->bitsN[iN] | sq->bitsE[iE] | sq->bitsS[iS] | sq->bitsW[iW]); 
}
Edit1: An hallucination made me think that it was broken so I deleted it. However, it seems correct! so, for example, sq->bitsNs has all the possible attack bits set, while sq->bitsN just have the bits that can be reached given an index.

Edit2: Okay, now I see my mistake. sq->bitsN etc. need to be an array.

For my taste > 64*64 bitboards == 32KByte per ray is too much for that calculations. As variation of the Classical Approach, I suggest to exclude the outer rays rather than to include the inner squares.

Code: Select all

typedef struct {
  u64 bitsN;   // bits North, no stop bit
  u64 bitsNs;  // bits North, including a stop bit
  u64 bitsE;
  u64 bitsEs;
  u64 bitsS;
  u64 bitsSs;
  u64 bitsW;
  u64 bitsWs;
} square;

square squares[64];
u64 rookAttacksOnEmptyBoard[64];

/*The 0 bit and 63 bit are used as stops so a bit is always found. They are set in occ and the correct bit is set in the directional bits (i.e. bitsNs). Also the stop bits used as index will return the bits for an empty ray for the given direction.*/

u64 rookAttacks(u64 occ, int sq) {
    u32 iN, iE, iS, iW;
    occ |= 0x8000000000000001;
    _BitScanForward64(&iN, occ & squares[sq].bitsNs);
    _BitScanForward64(&iE, occ & squares[sq].bitsEs);
    _BitScanReverse64(&iS, occ & squares[sq].bitsSs);
    _BitScanReverse64(&iW, occ & squares[sq].bitsWs);
    return rookAttacksOnEmptyBoard[sq]
         ^ squares[iN].bitsN
         ^ squares[iE].bitsE
         ^ squares[iS].bitsS
         ^ squares[iW].bitsW;
}

Anyway, still 13 operations including four bitscans and 9 further memory reads per rook/bishop. On the other hand there is potential for core2 with fast bsf/bsr to perform each ray independently.

Gerd

Michael Sherwin · Post by **Michael Sherwin** » Tue Dec 08, 2009 8:39 am

If memory reads are bad for performance and so is too much memory use and even multiplication to some extent then maybe my most up to date folding rook attacks might be interesting.

It minimizes memory reads, uses far less memory than magic, no multiplication and should run in two pipes very well. And I hope that it can be further optimized. Although it should already be quite fast!

Code: Select all

u64 RookAttacks(u32 sq, u64 occ) {
  u32 l = ((u32)( occ >> ((sq & 7) +  7))) & 0x00020202;
  u32 h = ((u32)( occ >> ((sq & 7) + 32))) & 0x00010101;
  u32 r = ((u32)((occ << 1) >> (sq - (sq & 7)))) & 0x000000f6;
  u32 i = ( l | h | r) & 0x0000003f;
  u32 j = (((l | h | r) >> 6) | ((l | h) >> 12)) & 0x0000003f;
  return rookSuperSet1[sq][i] & rookSuperSet2[sq][j];
}

Hopefully: :D

u32 l = oooooooo
        o3oooooo
        o2oooooo
        o1oooooo

u32 h = oooooooo
        6ooooooo
        5ooooooo
        4ooooooo

u32 r = oooooooo
        oooooooo
        oooooooo
        oo789abc

u32 i = oooooooo
        oooooooo
        oooooooo
        41789aoo

u32 j = oooooooo
        oooooooo
        oooooooo
        bc5263oo

For those unfamiliar with supersets think of them like this. If a single blocker is present then there is an attack set for it. Likewise for a second blocker. Now, if both blockers are present then each set above is a superset of the true set. By anding both supersets together we get the true set. Each blocker that is part of an index has a superset and all those supersets are anded together at each index location.

Breaking the large 12 bit index into 2 6 bit index is what saves the huge amount of memory. The two superset tables are only 2 * 64 * 64 * 8 = 65536 bytes! And that is cheap for a rook!!

wgarvin · Post by **wgarvin** » Tue Dec 08, 2009 5:47 pm

Edit: okay, I was looking at the wrong code snippet.. sorry

Even kindergarten bitboards can be done with a 4 KB table, though. 64 KB still seems too large to me.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Dec 08, 2009 8:15 pm

Michael Sherwin wrote:If memory reads are bad for performance and so is too much memory use and even multiplication to some extent then maybe my most up to date folding rook attacks might be interesting.

It minimizes memory reads, uses far less memory than magic, no multiplication and should run in two pipes very well. And I hope that it can be further optimized. Although it should already be quite fast!

code snipped

For those unfamiliar with supersets think of them like this. If a single blocker is present then there is an attack set for it. Likewise for a second blocker. Now, if both blockers are present then each set above is a superset of the true set. By anding both supersets together we get the true set. Each blocker that is part of an index has a superset and all those supersets are anded together at each index location.

Breaking the large 12 bit index into 2 6 bit index is what saves the huge amount of memory. The two superset tables are only 2 * 64 * 64 * 8 = 65536 bytes! And that is cheap for a rook!!

Interesting code and freaky superset lookup idea, real bit-twiddling

Yes, there are zillions of variations on that matter to map the occupied bits and to index tables by square/occindex or more denser with file-rank/occindex but slight more trailing computation. Intersecting supersets, or more common, the union of subsets of disjoint rank- and file attacks. f.i. derived from Occupancy of any Line without kindergarten mul and same 2 * 32KByte rotated like memory layout and number of operations in 64-bit and 32-bit mode:

Code: Select all

u64 RookAttacks(u32 sq, u64 occ) {            //     64 (32)-bit mode
  u32 i = inner6RankIndex(sq, occ);               //  4 ( 5)
  u32 j = inner6FileIndex(sq, occ);               // 11 (13)  
  return rankAttacks[sq][i] | fileAttacks[sq][j]; //  1 ( 2)
}                                                 // 16 (20) 

// 4 operations (5 32-bits)
int inner6RankIndex(u32 sq, u64 occ)
{
    u32 rx8p1 = (sq & 56) + 1;
    return (occ >> rx8p1) & 63;
}

#ifdef _64_BITS
// 11 operations
int inner6FileIndex(u32 sq, u64 occ)
{
   u32 file = sq & 7;
   occ = occ >> file;
   occ &= C64(0x0001010101010100):
   occ |= occ >> 28;
   occ |= occ >> 14;
   occ |= occ >>  7;
   return(occ >>  1) & 63;
}
#else
// 13 32-bit operations
int inner6FileIndex(u32 sq, u64 occ)
{
   u32 file = sq & 7;
   u32 l = occ;
   u32 h = occ >> 32; // nop
   l >>= file;
   h >>= file;
   l &= 0x01010100;
   h &= 0x00010101;
   l = (l >>  8) | (h << 3);
   l |= l >> 14;
   l |= l >>  7;
   return l & 63;
}
#endif

Your 32-bit optimized, slightly modified (common expressions and >> 32, which is nop in 32-bit mode) routine seems a bit more expensive in 64-bit mode, but the same 20 operations in 32-bit mode, if I count operations correctly.

Code: Select all

u64 RookAttacks(u32 sq, u64 occ) {                   //       64 (32)-bit mode
  u32 file = sq & 7;                                       //  1 ( 1)
  u32 l = ((u32)  occ >> (file +  7)) & 0x00020202;        //  3 ( 3)
  u32 h = ((u32)( occ >> 32) >> file) & 0x00010101;        //  3 ( 2)
  u32 r = ((u32)((occ << 1) >> (sq - file))) & 0x000000f6; //  4 ( 5)
  l = l | h;                                               //  1 ( 1)
  r = r | l;                                               //  1 ( 1)
  u32 i = r & 0x0000003f;                                  //  1 ( 1)
  u32 j = ((r >> 6) | (l >> 12)) & 0x0000003f;             //  4 ( 4)
  return rookSuperSet1[sq][i] & rookSuperSet2[sq][j];      //  1 ( 2)
}                                                          // 19 (20)

19/20 operations for 64-Kbyte rook tables seems a bit too much compared to other approaches.

Gerd

Michael Sherwin · Post by **Michael Sherwin** » Wed Dec 09, 2009 1:05 am

In that case I would hope that lack of dependency chains might make a big difference. Only testing will tell for sure. Anyway, bit twiddling is fun and relaxing!

Modified old 64 bit attack getter

Modified old 64 bit attack getter

Re: Modified old 64 bit attack getter

Re: Modified old 64 bit attack getter

Re: Modified old 64 bit attack getter

Re: Modified old 64 bit attack getter

Re: Modified old 64 bit attack getter