Plain and fancy magic on modern hardware

micron · Post by **micron** » Tue Aug 24, 2010 12:01 am

Gerd Isenberg wrote: OK, that looks not that optimal. Is that GCC?
All memory->xmm transfer goes via general purpose reg, may be a matter of 16 byte alignment of the data. Also the final transfer from xmm0 low qword to rax goes via stack. Also the function should be inlined.

Yes, gcc-4.2. Presumably compilers know to align __m128i data correctly. I showed the out-of-line code, but the function is indeed inlined.
llvm-gcc-4.2 gives this:

Code: Select all

_bishopAttacks&#58;
 pushq    %rbp
 movq     %rsp, %rbp
 movslq   %esi, %rax
 shlq     $4, %rax
 movq     _singleBitsXMM@GOTPCREL&#40;%rip&#41;, %rcx
 movaps   (%rcx,%rax&#41;, %xmm0
 movq     _diaAntiMaskXMM@GOTPCREL&#40;%rip&#41;, %rcx
 movaps   (%rcx,%rax&#41;, %xmm1
 movd     %rdi, %xmm2
 movlhps  %xmm2, %xmm2
 andps    %xmm1, %xmm2
 movaps   %xmm2, %xmm3
 psubq    %xmm0, %xmm3
 movq     _swapMaskXMM@GOTPCREL&#40;%rip&#41;, %rax
 movaps   (%rax&#41;, %xmm4
 pshufb   %xmm4, %xmm0
 pshufb   %xmm4, %xmm2
 psubq    %xmm0, %xmm2
 pshufb   %xmm4, %xmm2
 xorps    %xmm3, %xmm2
 andps    %xmm1, %xmm2
 movaps   %xmm2, %xmm0
 shufpd   $3, %xmm0, %xmm0
 paddq    %xmm2, %xmm0
 movd     %xmm0, %rax
 popq     %rbp
 ret

which is similar speed, i.e. slower than plain magic.

Robert P.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Aug 24, 2010 10:30 pm

micron wrote:
Gerd Isenberg wrote: OK, that looks not that optimal. Is that GCC?
All memory->xmm transfer goes via general purpose reg, may be a matter of 16 byte alignment of the data. Also the final transfer from xmm0 low qword to rax goes via stack. Also the function should be inlined.
Yes, gcc-4.2. Presumably compilers know to align __m128i data correctly. I showed the out-of-line code, but the function is indeed inlined.
llvm-gcc-4.2 gives this:
Code: Select all
_bishopAttacks&#58;
 pushq    %rbp
 movq     %rsp, %rbp
 movslq   %esi, %rax
 shlq     $4, %rax
 movq     _singleBitsXMM@GOTPCREL&#40;%rip&#41;, %rcx
 movaps   (%rcx,%rax&#41;, %xmm0
 movq     _diaAntiMaskXMM@GOTPCREL&#40;%rip&#41;, %rcx
 movaps   (%rcx,%rax&#41;, %xmm1
 movd     %rdi, %xmm2
 movlhps  %xmm2, %xmm2
 andps    %xmm1, %xmm2
 movaps   %xmm2, %xmm3
 psubq    %xmm0, %xmm3
 movq     _swapMaskXMM@GOTPCREL&#40;%rip&#41;, %rax
 movaps   (%rax&#41;, %xmm4
 pshufb   %xmm4, %xmm0
 pshufb   %xmm4, %xmm2
 psubq    %xmm0, %xmm2
 pshufb   %xmm4, %xmm2
 xorps    %xmm3, %xmm2
 andps    %xmm1, %xmm2
 movaps   %xmm2, %xmm0
 shufpd   $3, %xmm0, %xmm0
 paddq    %xmm2, %xmm0
 movd     %xmm0, %rax
 popq     %rbp
 ret
which is similar speed, i.e. slower than plain magic.

Robert P.

Compiler seems too weak with this code. I am not that familiar with rip-relative addressing, but the sequence to load globals into xmm looks quite expensive, also there is no need for a local stackframe and mixing float with integer instructions. I would probably embed it inside a class for cheaper access via "this" rather than accessing globals...

Code: Select all

class Hyper &#123;
   struct &#123;
     __m128i diaAntiMaskXMM; //  antidiag &#58; diagonal, excluding square
     __m128i singleBitsXMM;  //  1<<sq  &#58; 1<<sq
   &#125; h&#91;64&#93;; // 2 KByte
   U64 swapMask;

public&#58;
   U64 bishopAttacks&#40;U64 occ, U64 sq&#41; &#123; ... &#125;
   void init&#40;) &#123;..&#125;
&#125;;

to get something like (not that familiar with gas syntax)

Code: Select all

_bishopAttacks&#58; 
; rdi - this, rsi - occ, rdx - square
 shlq       $4, %rdx         ; scale index for 16-byte access
 movdqa     (%rdi+%rdx+$X&#41;, %xmm0    ; dia&#58;anti - mask
 movdqa     (%rdi+%rdx+$X+16&#41;, %xmm1 ; bishop bits
 movd       %rsi, %xmm2      ; occ
 punpcklqdq %xmm2, %xmm2     ; occ&#58;occ
 movd       (%rdi+$Y&#41;, %xmm4 ; swapMask
 pand       %xmm1, %xmm2     ; o &#58;= occ & mask
 movdqa     %xmm2, %xmm3     ; o    
 psubq      %xmm0, %xmm3     ; o-r
 pshufb     %xmm4, %xmm0     ; r'
 pshufb     %xmm4, %xmm2     ; o'
 psubq      %xmm0, %xmm2     ; o'-r'
 pshufb     %xmm4, %xmm2     ; &#40;o'-r')'
 pxor       %xmm3, %xmm2     ; &#40;o'-r')' ^ ( o-r&#41;
 pand       %xmm1, %xmm2     ; & mask
 movdqa     %xmm2, %xmm0
 punpckhqdq %xmm0, %xmm0     ; high -> low
 pxor       %xmm2, %xmm0     ; due to disjoint sets or,add,xor are same
 movd       %xmm0, %rax
 ret

How expensive is an L1-Miss but L2-Hit on that i5 with 6MB L2?

micron · Post by **micron** » Wed Aug 25, 2010 12:02 pm

wgarvin wrote:IIRC, the magic tables for bishops are much smaller than the tables for rooks. Something like 38 KB versus 800 KB ?

So if you care about small table size (for efficient use of L1/L2 cache, etc.) then you might want to use magic for bishops, but something else such as kindergarten, for rooks.

Maybe, but the amount of global data accessed for slider attacks seems unimportant for speed. The table below shows times for fixed-depth search by my engine running on Core i5. Slider-attack-related data is competing for the cache with all the usual memory suspects: TT, Zobrist key tables and so on. For comparison, I included the classic 4-ray BitBoard method and a branch-free variant; these access minimal global data.

Code: Select all

Times in ms for nominal 11 ply search with different BitBoard slider attack generators.
In each case a preliminary "warmup" run0 was discarded.

                            KB   run1 run2 run3    minimum
plain magic &#40;fixed-shift&#41; 2304   3755 3756 3755    3755
fancy magic*               840   3764 3764 3762    3762
plain magic &#40;var-shift&#41;   2304   3810 3811 3810    3810
classic BB &#40;branch-free&#41;     4   4381 4381 4381    4381
classic BB                   4   4871 4867 4867    4867

*from Crafty

The classic BB methods are significantly slower than all the magic methods. I haven't tried kindergarten, but would expect its performance to be between classic and magic.

What I thought was an embarrassingly naive implementation of magic turns out to be respectably fast. With Gerd's delightful fixed-shift tweak (64-9 in the code below for bishop), as revealed earlier in this thread, plain magic managed to squeak into top place.

Code: Select all

U64 mBishopAttacks&#91;64&#93; &#91;512&#93;; // 256 K 
U64 mRookAttacks  &#91;64&#93;&#91;4096&#93;; // 2048K 
  
struct SMagic &#123; 
   U64 mask;  // to mask relevant squares of both lines &#40;no outer squares&#41; 
   U64 magic; // magic 64-bit factor 
&#125;; 

SMagic mBishopTbl&#91;64&#93;; 
SMagic mRookTbl  &#91;64&#93;; 

U64 bishopAttacks&#40;U64 occ, enumSquare sq&#41; &#123; 
   occ &= mBishopTbl&#91;sq&#93;.mask; 
   occ *= mBishopTbl&#91;sq&#93;.magic; 
   occ >>= 64-9; 
   return mBishopAttack&#91;sq&#93;&#91;occ&#93;; 
&#125;

Robert P.

Houdini · Post by **Houdini** » Thu Aug 26, 2010 12:09 pm

If memory footprint is an issue, one can always apply a compacting scheme to the rook and bishop tables.

In the "fancy" magic scheme the rook table contains about 100k 64-bit entries (around 800 kB), but in reality there are only 4900 different attacking bitboards possible for a rook, so there is an important redundancy.

One can easily construct the look-up table of 4900 different attacking bitboards (about 40 kB), and then replace the 100 k rook table entries with a 16-bit index in the look-up table. This leads to a memory footprint of 40 kB + 2 * 100 kB or about 240 kB instead of the initial 800 kB.

The next step is to realize that there are maximum 144 rook attacking bitboards for a given square. This means that one can replace the 16-bit index with an 8-bit index combined with a fixed offset per square. This leads to a memory footprint of 40 kB + 100 kB or about 140 kB instead of the initial 800 kB.

In the "plain" magic scheme the savings would be even more important. The 2048 kB rook table becomes 40 kB + 256 kB or about 300 kB.
The plain magic code shown above would simply become:

Code: Select all

byte mBishopAttacks&#91;64&#93; &#91;512&#93;; // 32 K 
byte mRookAttacks  &#91;64&#93;&#91;4096&#93;; // 256 K 

U64 mRookLookup&#91;4900&#93;;  // 39 K
U64 mBishopLookup&#91;1428&#93;;  // 11 K

struct SMagic &#123; 
   U64 mask;  // to mask relevant squares of both lines &#40;no outer squares&#41; 
   U64 magic; // magic 64-bit factor 
   int offset;   // offset into lookup table
&#125;; 

SMagic mBishopTbl&#91;64&#93;; 
SMagic mRookTbl  &#91;64&#93;; 

U64 bishopAttacks&#40;U64 occ, enumSquare sq&#41; &#123; 
   occ &= mBishopTbl&#91;sq&#93;.mask; 
   occ *= mBishopTbl&#91;sq&#93;.magic; 
   occ >>= 64-9; 
   return mBishopLookup&#91;mBishopTbl&#91;sq&#93;.offset + mBishopAttack&#91;sq&#93;&#91;occ&#93;&#93;;
&#125;

For the bishop table there are 1428 attacking bitboards, so the 40 kB (fancy) bishop table compacts to a 11 kB look-up table and a 5 kB index table = 16 kB. Probably less useful.

I've tested this compacting scheme successfully in Houdini, but I'm still not sure whether it's really beneficial or not.

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Aug 26, 2010 4:08 pm

Houdini wrote:If memory footprint is an issue, one can always apply a compacting scheme to the rook and bishop tables.

In the "fancy" magic scheme the rook table contains about 100k 64-bit entries (around 800 kB), but in reality there are only 4900 different attacking bitboards possible for a rook, so there is an important redundancy.

One can easily construct the look-up table of 4900 different attacking bitboards (about 40 kB), and then replace the 100 k rook table entries with a 16-bit index in the look-up table. This leads to a memory footprint of 40 kB + 2 * 100 kB or about 240 kB instead of the initial 800 kB.

The next step is to realize that there are maximum 144 rook attacking bitboards for a given square. This means that one can replace the 16-bit index with an 8-bit index combined with a fixed offset per square. This leads to a memory footprint of 40 kB + 100 kB or about 140 kB instead of the initial 800 kB.

In the "plain" magic scheme the savings would be even more important. The 2048 kB rook table becomes 40 kB + 256 kB or about 300 kB.
The plain magic code shown above would simply become:
Code: Select all
byte mBishopAttacks&#91;64&#93; &#91;512&#93;; // 32 K 
byte mRookAttacks  &#91;64&#93;&#91;4096&#93;; // 256 K 

U64 mRookLookup&#91;4900&#93;;  // 39 K
U64 mBishopLookup&#91;1428&#93;;  // 11 K

struct SMagic &#123; 
   U64 mask;  // to mask relevant squares of both lines &#40;no outer squares&#41; 
   U64 magic; // magic 64-bit factor 
   int offset;   // offset into lookup table
&#125;; 

SMagic mBishopTbl&#91;64&#93;; 
SMagic mRookTbl  &#91;64&#93;; 

U64 bishopAttacks&#40;U64 occ, enumSquare sq&#41; &#123; 
   occ &= mBishopTbl&#91;sq&#93;.mask; 
   occ *= mBishopTbl&#91;sq&#93;.magic; 
   occ >>= 64-9; 
   return mBishopLookup&#91;mBishopTbl&#91;sq&#93;.offset + mBishopAttack&#91;sq&#93;&#91;occ&#93;&#93;;
&#125;
For the bishop table there are 1428 attacking bitboards, so the 40 kB (fancy) bishop table compacts to a 11 kB look-up table and a 5 kB index table = 16 kB. Probably less useful.

I've tested this compacting scheme successfully in Houdini, but I'm still not sure whether it's really beneficial or not.

Another variation of the memory size versus computation theme. The 16-bit indirection was tested worse by Pradu and others, maybe things changed with recent processors. Your byte lookup idea with offset is interesting, 4 KByte per rook square instead of 32K, but two dependent "semi-huge" lookups.

I am not that memory and cache expert with nehalem and its huge smart caches of 6 MByte and more. And what are the typical hit/miss ratios and miss costs of various caches - if we lookup sparse entries from 2MB (32K per square) versus 256 KByte ( 4K per square) during a "real" search. And what benefits (if any) huge pages make here.

Houdini · Post by **Houdini** » Thu Aug 26, 2010 4:27 pm

Gerd Isenberg wrote:The 16-bit indirection was tested worse by Pradu and others

Any reference to their tests?
Thanks.

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Aug 26, 2010 4:56 pm

Houdini wrote:
Gerd Isenberg wrote:The 16-bit indirection was tested worse by Pradu and others
Any reference to their tests?
Thanks.

I can not found the original CCC post by Pradu with his results any longer (seems somehow lost), it was about to define PERFECT_MAGIC_HASH in Pradu's source (Download Magic Move Bitboard Generator). It was discussed and was slightly slower. Here is another thread on that topic.

Houdini · Post by **Houdini** » Thu Aug 26, 2010 5:26 pm

Gerd Isenberg wrote:I can not found the original CCC post by Pradu with his results any longer (seems somehow lost), it was about to define PERFECT_MAGIC_HASH in Pradu's source (Download Magic Move Bitboard Generator).

Thanks!
You're correct, inspection of Pradu's code shows that the 16-bit index scheme I suggested is identical to Pradu's PERFECT_MAGIC_HASH option.
The 8-bit index with offset per square is the logical next step, as it further halves the rook tables from 200 kB to 100 kB at a very minimal cost.

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Aug 26, 2010 5:46 pm

Houdini wrote:
Gerd Isenberg wrote:I can not found the original CCC post by Pradu with his results any longer (seems somehow lost), it was about to define PERFECT_MAGIC_HASH in Pradu's source (Download Magic Move Bitboard Generator).
Thanks!
You're correct, inspection of Pradu's code shows that the 16-bit index scheme I suggested is identical to Pradu's PERFECT_MAGIC_HASH option.
The 8-bit index with offset per square is the logical next step, as it further halves the rook tables from 200 kB to 100 kB at a very minimal cost.

Yes, that could be very competitive. With post-mask one can even half the array of distinct rook attacks, even shared by bishop attacks with a resulting mRookBishopLookup of less than 20 KByte.

micron · Post by **micron** » Fri Aug 27, 2010 4:11 am

Houdini wrote:
Gerd Isenberg wrote:I can not found the original CCC post by Pradu with his results any longer (seems somehow lost), it was about to define PERFECT_MAGIC_HASH in Pradu's source (Download Magic Move Bitboard Generator).
Thanks!
You're correct, inspection of Pradu's code shows that the 16-bit index scheme I suggested is identical to Pradu's PERFECT_MAGIC_HASH option.
The 8-bit index with offset per square is the logical next step, as it further halves the rook tables from 200 kB to 100 kB at a very minimal cost.

The "plain with fixed-shift tweak" attack generator as discussed earlier in this thread is obtained in Pradu's magicmoves.h with three macros commented out:
//#define MINIMIZE_MAGIC
//#define PERFECT_MAGIC_HASH unsigned short
//#define VARIABLE_SHIFT

Robert P.

Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware

Re: Plain and fancy magic on modern hardware