Plain and fancy magic on modern hardware


micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Plain and fancy magic on modern hardware

Post by micron »

In a 'plain' magic attack generator, the precomputed attack tables occupy 2304 KB (256 KB for bishops; 2048 KB for rooks). By eliminating some redundancy, 'fancy' magic reduces the table storage to about 840 KB.
http://chessprogramming.wikispaces.com/Magic+Bitboards
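Those figures follow directly from the array dimensions; here is a quick compile-and-print check (gRMagicAttacks is just an illustrative name for the rook table, not taken from any particular engine).

Code: Select all

#include <stdio.h>

typedef unsigned long long BitBoard;

static BitBoard gBMagicAttacks[64][512];    // bishops:  9-bit index
static BitBoard gRMagicAttacks[64][4096];   // rooks:   12-bit index (illustrative name)

int main( void ) {
   printf( "bishops: %4zu KB\n", sizeof gBMagicAttacks / 1024 );                           //  256 KB
   printf( "rooks:   %4zu KB\n", sizeof gRMagicAttacks / 1024 );                           // 2048 KB
   printf( "total:   %4zu KB\n", (sizeof gBMagicAttacks + sizeof gRMagicAttacks) / 1024 ); // 2304 KB
   return 0;
}
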
I implemented both methods in my infant engine Spandrel and measured the single-threaded performance on Core 2 Duo and Core i5.

Code: Select all

Speed advantage of fancy magic over plain magic
          perft   search
Core 2    2.6%    1.3%
Core i5   1.6%    0.7%
Fancy magic is said to be faster because the more compact storage reduces cache misses. The effect, already small on Core 2 Duo, is even smaller on Core i5, perhaps owing to the different cache design (Core 2 Duo: 6 MB L2; Core i5: 256 KB L2 per core + 8 MB L3).

A programmer implementing fancy magic in his chess engine seems to be faced with unappealing choices: either borrowing/stealing complicated initialisation code (and perhaps the magic multipliers and other subsidiary magic numbers) from someone else's source, or working absurdly hard. By contrast, plain magic can be initialised straightforwardly (shown here for bishop attacks).

Code: Select all

const BitBoard  gBMagicNum[64] = { 0xWhateverULL, ... };
BitBoard        gBMagicMask[64];
int             gBMagicShift[64];
BitBoard        gBMagicAttacks[64][512];


BitBoard BAttacks_Init( int sq, BitBoard occupied ) {
  ...
  return attacks;
}


void InitBMagicAttacksForSquare( int sq ) {
  int       index; // 0--511
  BitBoard  subset = 0ULL;
  // iterate over all subsets of gBMagicMask[sq]
  // chessprogramming.wikispaces.com/Traversing+Subsets+of+a+Set
  do {
    index = (subset * gBMagicNum[sq]) >> gBMagicShift[sq];
    gBMagicAttacks[sq][index] = BAttacks_Init( sq, subset );
    subset = (subset - gBMagicMask[sq]) & gBMagicMask[sq];
  } while ( subset );
}  


/*
 gBMagicNum[] is already initialised.
 Initialise gBMagicMask[], gBMagicShift[] and gBMagicAttacks[][].
 */
void InitBMagicAttacks() {
  for ( int j = 0; j < 64; j++ ) {
    // each gBMagicMask[] is 4 diagonal rays that stop short of the edge
    gBMagicMask[j] = BAttacks_Init( j, 0ULL ) & gNonEdgeSquares;
    gBMagicShift[j] = 64 - PopCount( gBMagicMask[j] );
    InitBMagicAttacksForSquare( j );
  }
}
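For completeness, the runtime lookup that goes with these tables is the usual mask, multiply, shift and index sequence (a sketch with the same array names; BAttacks is just an illustrative name here).

Code: Select all

BitBoard BAttacks( int sq, BitBoard occupied ) {
  occupied &= gBMagicMask[sq];        // keep only the relevant blockers
  occupied *= gBMagicNum[sq];         // magic multiplication
  occupied >>= gBMagicShift[sq];      // per-square shift (an extra table load)
  return gBMagicAttacks[sq][occupied];
}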
I find myself preferring beauty of initialisation to a 1% or so speedup. Others who are in the process of implementing magic in their engines may make a different choice. And of course on older hardware the advantage of 'fancy' is likely to be bigger than what I measured. Does anyone have comparable measurements?

Robert P.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Plain and fancy magic on modern hardware

Post by Gerd Isenberg »

micron wrote:In a 'plain' magic attack generator, the precomputed attack tables occupy 2304 KB (256 KB for bishops; 2048 KB for rooks). By eliminating some redundancy, 'fancy' magic reduces the table storage to about 840 KB.
[...]
I find myself preferring beauty of initialisation to a 1% or so speedup. Others who are in the process of implementing magic in their engines may make a different choice. And of course on older hardware the advantage of 'fancy' is likely to be bigger than what I measured. Does anyone have comparable measurements?
Hi Robert,

How do you initialize gBMagicNum? How do you ensure that different attack occupancies do not result in the same index? I think that is the main initialization issue; once you have the magics right, without bad collisions, initializing gBMagicAttacks is easy, no matter whether plain or fancy. And why don't you use a fixed shift of 64-9 or 64-12 with plain?
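For reference, the usual trial-and-error search looks roughly like this. Only a sketch, not Tord's or anyone else's actual code: random64() is assumed to be a reasonable 64-bit PRNG, and U64, BAttacks_Init and PopCount are as in your listing.

Code: Select all

#include <string.h>

U64 findMagic( int sq, U64 mask, int bits ) {      // bits = 9 for bishops, 12 for rooks
   static U64 subsets[4096], attacks[4096], used[4096];
   int n = 0;

   // enumerate every subset of the mask together with its true attack set
   U64 subset = 0ULL;
   do {
      subsets[n] = subset;
      attacks[n] = BAttacks_Init( sq, subset );    // slow ray-walking generator
      n++;
      subset = (subset - mask) & mask;
   } while ( subset );

   for (;;) {
      U64 magic = random64() & random64() & random64();       // sparse random candidate
      if ( PopCount( (mask * magic) >> 56 ) < 6 ) continue;   // cheap rejection heuristic
      memset( used, 0, sizeof used );                         // 0 is a safe sentinel: attack sets are never empty
      int fail = 0;
      for ( int i = 0; i < n && !fail; i++ ) {
         int index = (int)( (subsets[i] * magic) >> (64 - bits) );
         if ( used[index] == 0ULL )
            used[index] = attacks[i];              // first occupant of this slot
         else if ( used[index] != attacks[i] )
            fail = 1;                              // destructive collision: reject this magic
      }
      if ( !fail ) return magic;                   // only harmless (constructive) collisions remain
   }
}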

I am not that familiar with what has happened in recent processors. Roughly 2.25 MB versus 0.84 MB sounds huge, but with a 6 MB L2 cache and probably huge pages, plain might be fine or even faster in the future. Since the source square and the occupancy of the relevant squares do not change that randomly inside a real search, and since each distinct square and occupancy most often hits a different cache line of a 4 KByte page anyway, the number of L2 misses is likely not that different between the two memory layouts. And yes, the plain two-dimensional array requires shorter code and has no variable shift by cl.

Fancy:

Code: Select all

U64 attack_table[...]; // ~840K Byte all rook and bishop attacks
 
struct SMagic {
   U64* ptr;  // pointer to attack_table for each particular square
   U64 mask;  // to mask relevant squares of both lines (no outer squares)
   U64 magic; // magic 64-bit factor
   int shift; // shift right
};

SMagic mBishopTbl[64];
SMagic mRookTbl  [64];

U64 bishopAttacks(U64 occ, enumSquare sq) {
   U64* aptr = mBishopTbl[sq].ptr;
   occ      &= mBishopTbl[sq].mask;
   occ      *= mBishopTbl[sq].magic;
   occ     >>= mBishopTbl[sq].shift;
   return aptr[occ];
}
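The per-square pointers are typically wired up at initialisation by handing each square a slice of attack_table sized according to its shift. A sketch (initBishopPointers is just an illustrative name; mask, magic and shift are assumed to be filled in already):

Code: Select all

void initBishopPointers() {
   U64 *next = attack_table;                        // carve the big table into per-square slices
   for (int sq = 0; sq < 64; sq++) {
      mBishopTbl[sq].ptr = next;
      next += 1ULL << (64 - mBishopTbl[sq].shift);  // 2^(relevant bits) entries for this square
   }
}
The attack entries themselves are then written through ptr, square by square, just as in the plain initialisation.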
Plain:

Code: Select all

U64 mBishopAttacks[64] [512]; // 256 K
U64 mRookAttacks  [64][4096]; // 2048K
 
struct SMagic {
   U64 mask;  // to mask relevant squares of both lines (no outer squares)
   U64 magic; // magic 64-bit factor
};

SMagic mBishopTbl[64];
SMagic mRookTbl  [64];

U64 bishopAttacks(U64 occ, enumSquare sq) {
   occ &= mBishopTbl[sq].mask;
   occ *= mBishopTbl[sq].magic;
   occ >>= 64-9;
   return mBishopAttacks[sq][occ];
}
Gerd
micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Re: Plain and fancy magic on modern hardware

Post by micron »

Gerd Isenberg wrote:How do you initialize gBMagicNum? How do you ensure that different attack occupancies do not result in the same index?
To obtain the plain magic multipliers I used Tord Romstad's code http://chessprogramming.wikispaces.com/ ... for+Magics
For fancy, I temporarily borrowed code and numbers, everything, from Crafty.
I think that is the main initialization issue; once you have the magics right, without bad collisions, initializing gBMagicAttacks is easy, no matter whether plain or fancy.
As a newbie to magic, I found the implementation hard going until I separated plain from fancy. There's a lot of discussion of 'good' magic numbers, and I was slow to understand that these are irrelevant to a plain magic attack generator.
And why don't you use a fixed shift of 64-9 or 64-12 with plain?
Er, because I thought that a square-dependent shift is essential to the algorithm. Certainly CPW gives that impression. Following your hint, I made the obvious changes to Tord Romstad's code for the magic multipliers, and to my (plain) attack generators. Not only does the fixed-shift method work, but with one less lookup it's 1.6% faster (in perft). That's very nearly indistinguishable from fancy!
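In case anyone else wants to try it, the change on the lookup side is tiny (a sketch with my array names from the first post; the gBMagicShift table simply disappears):

Code: Select all

BitBoard BAttacks( int sq, BitBoard occupied ) {
  occupied &= gBMagicMask[sq];
  occupied *= gBMagicNum[sq];      // magics found for the full 9-bit index
  occupied >>= 64 - 9;             // fixed shift: no gBMagicShift[sq] load
  return gBMagicAttacks[sq][occupied];
}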

Where is fixed-shift documented?

Robert P.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Plain and fancy magic on modern hardware

Post by Edmund »

micron wrote:Not only does the fixed-shift method work, but with one less lookup it's 1.6% faster (in perft). That's very nearly indistinguishable from fancy!
Better not to test this solely in perft, but also in a normal chess-engine environment. Memory footprint plays an important role: in perft all the memory is available for move generation, whereas in a real search the hash tables etc. compete for these resources. Especially once you add a dedicated move-generation cache, lower memory requirements become increasingly relevant, as the computation spent on move generation decreases. So we are now using shared tables in Glass, with a total memory requirement (R+B) of 472,064 bytes. Hyperbola Quintessence also sounds really interesting to me, if only byteswap were not so slow on my processor.

regards,
Edmund
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Plain and fancy magic on modern hardware

Post by Gerd Isenberg »

micron wrote:Where is fixed-shift documented?
No, it is implicit knowledge ;-)

If all squares have the same sub-array size anyway, i.e. 512 for bishops, a shift by 64-9 is sufficient to get exactly those 9 bits for the index, even if fewer bits are relevant. If, for instance, only 5 bits are relevant (a bishop on b1, say), the lower 4 index bits are redundant or garbage, and there are larger gaps between the relevant entries in the 512-sized sub-array, possibly producing more L1 misses now and then for bishops near the corners. But the constant shift will likely pay off.
micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Re: Plain and fancy magic on modern hardware

Post by micron »

Edmund wrote: Hyperbola Quintessence also sounds really interesting to me, if only byteswap were not so slow on my processor.
Your note prompted me to try the Hyperbola Quintessence bishop attacks generator.
http://chessprogramming.wikispaces.com/ ... SE3Version

The code is complete except that it lacks
#include <tmmintrin.h>
and has a comment that still needs translating into code:
__m128i swapMaskXMM; // needs to be initialized to swap the bytes in both quad-words
Cursory examination shows (1) that swapMaskXMM is to be stuffed with the byte values 0..15 in a particular order, and (2) that in order to proceed with (1) you have to know, or learn hastily, something about SIMD. Restriction (2) must have been a deliberate decision by the author. Without giving the game away, I can reveal that it's mainly a matter of finding the appropriate _mm_intrinsic_function_with_supposedly_logical_but_confusing_name.
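For anyone who would rather just see it, one way to fill the mask (an assumption about the intended layout; InitSwapMask is merely an illustrative name):

Code: Select all

#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8 (pshufb)

__m128i swapMaskXMM;

void InitSwapMask() {
   // pshufb control mask that reverses the byte order within each 64-bit half:
   // byte i of the low quad-word takes source byte 7-i,
   // byte i of the high quad-word takes source byte 23-i.
   swapMaskXMM = _mm_set_epi8(  8,  9, 10, 11, 12, 13, 14, 15,    // elements 15..8
                                0,  1,  2,  3,  4,  5,  6,  7 );  // elements  7..0
}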

The HQ method has some attractions:
- like the classic 4-ray bitboard method, it has no need of a huge initialised attacks db, and thus causes little cache pressure.
- the underlying theory makes for a gripping read.
- it's good to make use of those expensive SSE transistors for chess purposes, and be reminded of the amusing mapping between SIMD opcodes and their intrinsics, so that pshufb becomes _mm_shuffle_epi8.

Alas, when tested on Core i5 against my plain magic bishop attacks generator, HQ proves 3% slower in perft and 2% slower in search.

Robert P.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Plain and fancy magic on modern hardware

Post by Gerd Isenberg »

micron wrote:Alas, when tested on Core i5 against my plain magic bishop attacks generator, HQ proves 3% slower in perft and 2% slower in search.
Thanks for testing. I wrote that SSE3 routine without the ability to actually compile or test it, since my old AMD K8 only supports SSE2. I am not sure about the prologue/epilogue overhead of that short routine, whether it is inlined, and how many cycles it actually takes. One would probably need to inspect the assembly to see what the compiler makes of it, and how many caller-save xmm registers are used.

On my AMD computer, with its 1/3-cycle bswap throughput, the general-purpose HQ was very close to magics. Anyway, with caches that huge, it seems the three ALU operations "and, mul, shr" plus the lookup of magic bitboards is unbeatable.
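For reference, the general-purpose (non-SSE) variant for a single line looks roughly like this. Only a sketch: bswap64 stands for a byte-swap intrinsic such as __builtin_bswap64, lineMask is the file, diagonal or anti-diagonal through the slider excluding the slider's own square, and a bishop ORs the diagonal and anti-diagonal results. It works for those lines because each rank holds at most one bit of the line, so bswap acts as the required reversal.

Code: Select all

U64 lineAttacks( U64 occ, U64 sliderBB, U64 lineMask ) {
   U64 o       = occ & lineMask;
   U64 forward = o - 2*sliderBB;                               // subtraction trick towards the MSB
   U64 reverse = bswap64( bswap64(o) - 2*bswap64(sliderBB) );  // same trick on the byte-reversed board
   return (forward ^ reverse) & lineMask;                      // XOR cancels the o terms, leaving both rays
}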

Gerd
micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Re: Plain and fancy magic on modern hardware

Post by micron »

Gerd Isenberg wrote: Thanks for testing. I wrote that SSE3 routine without the ability to actually compile or test it, since my old AMD K8 only supports SSE2. I am not sure about the prologue/epilogue overhead of that short routine, whether it is inlined, and how many cycles it actually takes. One would probably need to inspect the assembly to see what the compiler makes of it, and how many caller-save xmm registers are used.

Code: Select all

_bishopAttacks:
 pushq      %rbp
 movq       %rsp, %rbp
 movq       %rdi, -8(%rbp)
 movslq     %esi,%rsi
 salq       $4, %rsi
 movq       _diaAntiMaskXMM@GOTPCREL(%rip), %rax
 movdqa     (%rsi,%rax), %xmm4
 movq       _singleBitsXMM@GOTPCREL(%rip), %rax
 movdqa     (%rsi,%rax), %xmm1
 movq       _swapMaskXMM@GOTPCREL(%rip), %rax
 movdqa     (%rax), %xmm3
 movq       -8(%rbp), %xmm0
 punpcklqdq %xmm0, %xmm0
 pand       %xmm4, %xmm0
 movdqa     %xmm0, %xmm2
 pshufb     %xmm3, %xmm2
 psubq      %xmm1, %xmm0
 pshufb     %xmm3, %xmm1
 psubq      %xmm1, %xmm2
 movdqa     %xmm2, %xmm1
 pshufb     %xmm3, %xmm1
 pxor       %xmm1, %xmm0
 pand       %xmm4, %xmm0
 movdqa     %xmm0, %xmm1
 punpckhqdq %xmm0, %xmm1
 paddq      %xmm1, %xmm0
 movq       %xmm0, -16(%rbp)
 movq       -16(%rbp), %rax
 leave
 ret
Robert P.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Plain and fancy magic on modern hardware

Post by Gerd Isenberg »

micron wrote:[the _bishopAttacks assembly listing in the previous post]
OK, that does not look all that optimal. Is that GCC?

All memory->xmm transfers go via a general-purpose register; that may be a matter of 16-byte alignment of the data. The final transfer from the xmm0 low qword to rax also goes via the stack. And the function should be inlined.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: Plain and fancy magic on modern hardware

Post by wgarvin »

IIRC, the magic tables for bishops are much smaller than the tables for rooks. Something like 38 KB versus 800 KB?

So if you care about small table size (for efficient use of L1/L2 cache, etc.) then you might want to use magic for bishops, but something else, such as kindergarten bitboards, for rooks.