Comparison of bitboard attack-getter variants

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Jan 05, 2016 11:01 pm

hgm wrote:OK, you mean that table[fromSqr][index] on a 64-bit architecture could be more expensive than t[index + offset[fromSqr]], because int offset[] takes only 4 bytes per entry while BitBoard *table[] takes 8 byte per entry for the pointers. It does seem to take an extra addition, though, because you need to scale the index by 8 (the size of the BitBoard elements), and only one of the two index registers can do that in scaled-indexed mode. You could of course store the offset in bytes, and write something like (BitBoard*)(((char*) t)[8*index + offset[fromSqr]]) to prevent that.

The reason is saving space. table[fromSqr][index] is used as well and called Plain Magic Bitboards, Lasse Hansen's initial approach with homogeneous two-dimensional precalculated attack arrays indexed by square and occupancy. Each square takes the the same space, which is a kind of waste, since most squares require much less than worst case. You have 12 relevant occupancy bits for the rook on corner squares, 11 on the other edge squares and 10 for the 36 inner squares. This is why Pradu Kannan proposed the offset or pointer method of Fancy Magic Bitboards for a Java like "two dimensional" array with different sizes per square, und thus, the overhead of variable shifts if calculating the index, but using only ~800K instead of 2MB for rooks.

hgm · Post by **hgm** » Tue Jan 05, 2016 11:15 pm

Gerd Isenberg wrote:The reason is saving space. table[fromSqr][index] is used as well and called Plain Magic Bitboards, Lasse Hansen's initial approach with homogeneous two-dimensional precalculated attack arrays indexed by square and occupancy.

But table[fromSqr][index] saves you the same space, with BitBoard *table[64]. It just depends on how you fill that 1-dimensional array of pointers, and of course you would not do that homogeneously.

kbhearn · Post by **kbhearn** » Tue Jan 05, 2016 11:38 pm

hgm wrote:
Gerd Isenberg wrote:The reason is saving space. table[fromSqr][index] is used as well and called Plain Magic Bitboards, Lasse Hansen's initial approach with homogeneous two-dimensional precalculated attack arrays indexed by square and occupancy.
But table[fromSqr][index] saves you the same space, with BitBoard *table[64]. It just depends on how you fill that 1-dimensional array of pointers, and of course you would not do that homogeneously.

that however costs you an extra indirection on lookup where adding an offset is free as part of memory indexing as far as i understand.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Jan 05, 2016 11:55 pm

hgm wrote:
Gerd Isenberg wrote:The reason is saving space. table[fromSqr][index] is used as well and called Plain Magic Bitboards, Lasse Hansen's initial approach with homogeneous two-dimensional precalculated attack arrays indexed by square and occupancy.
But table[fromSqr][index] saves you the same space, with BitBoard *table[64]. It just depends on how you fill that 1-dimensional array of pointers, and of course you would not do that homogeneously.

Sorry I don't get it. I was referring

Code: Select all

U64 attack_table&#91;...&#93;; // ~840 KByte all rook and bishop attacks, less with constructive collisions optimization
 
struct SMagic &#123;
   U64* ptr;  // pointer to attack_table for each particular square
   U64 mask;  // to mask relevant squares of both lines &#40;no outer squares&#41;
   U64 magic; // magic 64-bit factor
   int shift; // shift right
&#125;;
 
SMagic mBishopTbl&#91;64&#93;;
SMagic mRookTbl&#91;64&#93;;

versus

Code: Select all

U64 mBishopAttacks&#91;64&#93; &#91;512&#93;; // 256 K
U64 mRookAttacks  &#91;64&#93;&#91;4096&#93;; // 2048K
 
struct SMagic &#123;
   U64 mask;  // to mask relevant squares of both lines &#40;no outer squares&#41;
   U64 magic; // magic 64-bit factor
&#125;;
 
SMagic mBishopTbl&#91;64&#93;;
SMagic mRookTbl  &#91;64&#93;;

What is your exact declaration of table[fromSqr][index]?

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Jan 06, 2016 12:22 am

hgm wrote:
Gerd Isenberg wrote:The reason is saving space. table[fromSqr][index] is used as well and called Plain Magic Bitboards, Lasse Hansen's initial approach with homogeneous two-dimensional precalculated attack arrays indexed by square and occupancy.
But table[fromSqr][index] saves you the same space, with BitBoard *table[64]. It just depends on how you fill that 1-dimensional array of pointers, and of course you would not do that homogeneously.

Oups, got it. Pointer and arrays in C ...

wgarvin · Post by **wgarvin** » Wed Jan 06, 2016 3:32 am

I wonder if BMI2 - PEXT is any faster than plain magic, on the chips that support it?

(My guess would be about the same)

hgm · Post by **hgm** » Wed Jan 06, 2016 10:33 am

kbhearn wrote:that however costs you an extra indirection on lookup where adding an offset is free as part of memory indexing as far as i understand.

But the point of one of my previous postings was that it isn't: it needs an extra addition. Memory indexing can at most do A+8*B, and what you need here is 8*A+8*B or 8*(A+B), and there do not exist addressing modes for either of those in i386/x64 assembler. So t[index + rookOffset[fromSqr]] needs one (32-bit) memory LOAD for offset[fromSqr] with an addressing mode that multiplies fromSqr by 4, and ADD instruction to add it to the calculated index, and then a 64-bit memory LOAD with an adressing mode that multiplies the result of that ADD by 8. The 'extra indirection' can do with just the two LOADs: load the pointer table[fromSqr] with an addressing mode that multiplies fromSqr by 8, and use it to access t[] with an addressing mode that also multiplies by 8.

So the proper way to do it is the way I wrote it; adding an offset to the index is just an inefficient way to do it. It would be very hard for an optimizing compiler to realize that it can optimize the redundant ADD away by pre-multiplying the tabulated offsets by 8: these tables are likely in a global array, which could be used in other parts of the program outside the scope of the optimizer, so it cannot touch them. You could only achieve that by telling the compiler you want to store pre-multiplied offsets, and that it doesn't have to multiply them during the access, by casting the access to an array of characters:

#define K 1024
int rookOffset[64]; // pre-multiplied by 8 during initialization
BitBoard t[100*K];

attacks = *((BitBoard*) ((char*)t)[8*index + rookOffset[sqr]]);

Then it could use the A+8*B addressing mode after calculating index in B and loading rookOffset into A, to load the bitboard. This would have the advantage over the method with

BitBoard *rookTable[64]; // pointers into t[]
BitBoard t[100*K];

attacks = rookTable[sqr][index];

that rookOffset[] needs only 4 bytes per element (so 256 bytes in total) while rookTable[] would needs twice that with 8-byte pointers. But the price is awful code.

@Gerd: I hope that the above also answers your question. The table[pieceType] that I wrote before was just intended to symbolize hard-coded arrays rookTable and bishopTable, as of course in orthodox Chess those are the only two piece types for which it works this way, and Queens would have to be sythesized from them, while the leapers are just a lookup and don't have to bother with 'occupied'. But when you have more slider types you might actually want to do it that way. Although for slider-leaper compounds such as Chancellor, Archbishop, Dragon King or Dragon Horse you would probably just want to OR the Rook and Bishop attack-getter with the tabulated Knight or King moves to save space. But for Nightriders or asymmetric sliders like the Tori-Shogi Left and Right Quail you might want to have separate tables.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Jan 06, 2016 1:35 pm

hgm wrote: @Gerd: I hope that the above also answers your question.

Yep, perfectly. Was initially confused with the two- and one-dimensional arrays and thought you made a typo, until I got it with the pointer array treated as two-dimentional array of inhomogeneous items. Also, in my first reply mentioning the space issue, I should have better carefully read your previous posts in this thread, to recognize that you are very well aware of that.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Jan 06, 2016 2:05 pm

wgarvin wrote:I wonder if BMI2 - PEXT is any faster than plain magic, on the chips that support it?

(My guess would be about the same)

Yes, a comparison of plain magics vs. fancy magics vs. fancy PEXT BB in a concrete chess program would be interesting. My guess in Sven's Attack64 program (no hashtables) ...

1. PEXT BB (pext, 840K lookup)
2. plain magics (and, mul, fixed shift, 2MB lookup) close behind
3. fancy magics (and, mul, variable shift, 840K lookup)

Also measuring cycles of the attack getter with cpuid, rdtsc or the newer rdtscp and storing it in a histogram might be insightful.

Code: Select all

startcycle = rdtscp&#40;);
attackgetter
stopcycle = rdtscp&#40;);
duration = stopcycle - startcycle;
histogram &#91;duration % tblsize&#93;++;

Sven · Post by **Sven** » Wed Jan 06, 2016 2:05 pm

The usual implementation of the approach that uses different table sizes for each square can be found on the CPW page for magic bitboards under "Implementations ... Fancy". I repeat it here for convenience:

Code: Select all

U64 attack_table&#91;...&#93;; // ~840 KByte all rook and bishop attacks, less with constructive collisions optimization
 
struct SMagic &#123;
   U64* ptr;  // pointer to attack_table for each particular square
   U64 mask;  // to mask relevant squares of both lines &#40;no outer squares&#41;
   U64 magic; // magic 64-bit factor
   int shift; // shift right
&#125;;
 
SMagic mBishopTbl&#91;64&#93;;
SMagic mRookTbl&#91;64&#93;;
 
U64 bishopAttacks&#40;U64 occ, enumSquare sq&#41; &#123;
   U64* aptr = mBishopTbl&#91;sq&#93;.ptr;
   occ      &= mBishopTbl&#91;sq&#93;.mask;
   occ      *= mBishopTbl&#91;sq&#93;.magic;
   occ     >>= mBishopTbl&#91;sq&#93;.shift;
   return aptr&#91;occ&#93;;
&#125;
 
U64 rookAttacks&#40;U64 occ, enumSquare sq&#41; &#123;
   U64* aptr = mRookTbl&#91;sq&#93;.ptr;
   occ      &= mRookTbl&#91;sq&#93;.mask;
   occ      *= mRookTbl&#91;sq&#93;.magic;
   occ     >>= mRookTbl&#91;sq&#93;.shift;
   return aptr&#91;occ&#93;;
&#125;

On the same page there is also a remark that the performance of the "Plain" approach that uses the same (full) size for all squares were not distinguishable from "Fancy". In "Plain" the square is actually used as an array index. None of these implementations uses HGM's proposal of a pointer array, though.

I put it on my test list for later ("Plain" as well as the pointer array for "Fancy").

Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants

Re: Comparison of bitboard attack-getter variants