Magic bitboards: Cache issues and state of the art?

SamuelS · Post by **SamuelS** » Fri Mar 22, 2013 4:47 pm

I have read that many strong chess engines use magic bitboards for sliding piece move generation. After getting familiar with the subject, I wrote such a move generator this afternoon. Currently, I use about 840 kB for the attack tables (bishop and rook combined). Based on a few simple test cases, the code seems to work fine.

My first question is: Is it worth the effort to try to pack the attack tables in fewer kilobytes? I guess processor cache sizes still limit the performance?

And another question is: What is the state of the art with magic bitboards today? What techniques are used in the top engines?

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Mar 22, 2013 7:33 pm

SamuelS wrote:My first question is: Is it worth the effort to try to pack the attack tables in fewer kilobytes? I guess processor cache sizes still limit the performance?

Whether reducing from 800K to let say 400K reduces the number of L1 and L2 misses significantly inside a search of a concrete chess program, where occupancy and squares don't change that randomly - depends on the individual cache- and memory footprint of the program and hardware architecture. I guess with todays processors, that plain fixed shift magics may pay off due to less register pressure, freeing x64 rcx aka cl shift register. One may pack fixed shift arrays from ~2MB below 800K as well. However, the effort - Elo gain ratio is likely better in other areas

SamuelS wrote:And another question is: What is the state of the art with magic bitboards today? What techniques are used in the top engines?

Seems most use Pradu's Fancy. Don' know for commercial top engines, but see Robert Houdart's post:
http://www.talkchess.com/forum/viewtopi ... 26&t=35858

syzygy · Post by **syzygy** » Sat Mar 23, 2013 12:24 am

Gerd Isenberg wrote:One may pack fixed shift arrays from ~2MB below 800K as well. However, the effort - Elo gain ratio is likely better in other areas

If you borrow from Volker Annuss the effort is negligible.

AlvaroBegue · Post by **AlvaroBegue** » Sat Mar 23, 2013 1:45 am

You can compute the column and the row separately for rook attacks, and then the tables are probably tiny. I should probably just try it.

Similarly for bishops, although bishops only use 41KB in my fairly naive implementation.

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Mar 23, 2013 9:04 am

syzygy wrote:
Gerd Isenberg wrote:One may pack fixed shift arrays from ~2MB below 800K as well. However, the effort - Elo gain ratio is likely better in other areas
If you borrow from Volker Annuss the effort is negligible.

Yes, Thank you for mentioning Volker's approach.
Let say effort if you start looking for own, denser magics

SamuelS · Post by **SamuelS** » Sat Mar 23, 2013 11:22 am

I think half the fun is writing my own code and possibly finding good magic numbers.

It seems that the difficult squares for the rook are a1 and h1. There are about 1.8e19 different 64-bit numbers, so going through all of them is not very likely to happen soon. I wonder if there is a better way than just random sampling the numbers.

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Mar 23, 2013 12:51 pm

SamuelS wrote:I think half the fun is writing my own code and possibly finding good magic numbers.

It seems that the difficult squares for the rook are a1 and h1. There are about 1.8e19 different 64-bit numbers, so going through all of them is not very likely to happen soon. I wonder if there is a better way than just random sampling the numbers.

A proposal. Let say you want to map the a0 rook occupancies (1-6, 8-48) to 11 upper bits (53-63)

Code: Select all

0x000101010101017E    0xFFE0000000000000
 . . . . . . . .      1 1 1 1 1 1 1 1
 1 . . . . . . .      . . . . . 1 1 1
 1 . . . . . . .      . . . . . . . .
 1 . . . . . . .      . . . . . . . .
 1 . . . . . . .      . . . . . . . .
 1 . . . . . . .      . . . . . . . .
 1 . . . . . . .      . . . . . . . .
 . 1 1 1 1 1 1 .      . . . . . . . .

One may try factors with 12 bits set for an individual shift of each bit in the occupancy.
To map bit 1 to 63 requires bit 62 set as most possible highest bit in the factor.
To map bit 1 to 53 requires bit 52 set as least possible highest bit in the factor.
To map bit 48 to 53 requires bit 5 set (ignoring overflows from lower bits).
To map bit 48 to 63 requires bit 15 set, So an LSB > 2^15 would make bit 48 disappear
So snoob over all n-bit factors greater than 2^52 and less 2^63, with LSB greater 2^4.

Following (untested) approach, no idea how long it takes, and if it is possible to find 11 bit rook magics for 0 and 7 at all.

Code: Select all

for &#40;int lsb =  5; lsb <= 15; ++lsb&#41; 
for &#40;int msb = 52; msb <= 62; ++msb&#41; &#123;
	U64 set = (-1ULL << lsb&#41; ^ (-2ULL << msb&#41;;
	U64 sub = &#40;0x7ffULL << lsb&#41; ^ &#40;2ULL << 52&#41;; // 12 bit set
	U64 s;
	while ( &#40;s = snoob&#40;sub, set&#41;) > sub )
	&#123;
		sub = s;
		if ( testMagic&#40; sub, 0x000101010101017E, 11, a0 ) )
			print&#40; sub );
	&#125;
&#125;

// get next greater subset of set with same number of one bits
U64 snoob &#40;U64 sub, U64 set&#41; &#123;
   U64 tmp = sub-1;
   U64 rip = set & &#40;tmp + &#40;sub & &#40;0-sub&#41;) - set&#41;;
   for&#40;sub = &#40;tmp & sub&#41; ^ rip; sub &= sub-1; rip ^= tmp, set ^= tmp&#41;
       tmp = set & &#40;0-set&#41;;
   return rip;
&#125;

bool testMagic&#40; u64 magic2try, u64 occUniverse, u32 numberOfIndexBits, u32 sq&#41; &#123;
   U64 table&#91;1 << 11&#93;; 
   memset&#40;table, 0, sizeof&#40;table&#41;);     

   U64 occ = 0;
   do &#123; // for all occupancies starting with empty set
      u32 index = occ * magic2try >> &#40;64 - numberOfBits&#41;;
      u64 attacks = applySomeSlowAttackGetter&#40;sq, occ&#41;;
      if ( table&#91;index&#93; == 0 ) &#123;
        table&#91;index&#93; = attacks; // no collision, store attack set
      &#125; else &#123;
        // collision
        if ( table&#91;index&#93; != attacks ) &#123;
           return false; // magic does not work
        &#125;
      &#125;
      occ = &#40;occ - occUniverse&#41; & occUniverse; // next occupancy
   &#125; while ( occ );
   return true;
&#125;

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Mar 23, 2013 1:35 pm

nonsense snoob. Better something like this:

Code: Select all

int lsb&#91;6&#93;, msb&#91;6&#93;;
for &#40;lsb&#91;0&#93; =        5; lsb&#91;0&#93; <      16; ++lsb&#91;0&#93;)
for &#40;msb&#91;0&#93; =       62; msb&#91;0&#93; >      51; --msb&#91;0&#93;)
for &#40;lsb&#91;1&#93; = lsb&#91;0&#93;+1; lsb&#91;1&#93; <  msb&#91;0&#93;; ++lsb&#91;1&#93;)
for &#40;msb&#91;1&#93; = msb&#91;0&#93;-1; msb&#91;1&#93; >  lsb&#91;1&#93;; --msb&#91;1&#93;)
for &#40;lsb&#91;2&#93; = lsb&#91;1&#93;+1; lsb&#91;2&#93; <  msb&#91;1&#93;; ++lsb&#91;2&#93;)
for &#40;msb&#91;2&#93; = msb&#91;1&#93;-1; msb&#91;2&#93; >  lsb&#91;2&#93;; --msb&#91;2&#93;)
for &#40;lsb&#91;3&#93; = lsb&#91;2&#93;+1; lsb&#91;3&#93; <  msb&#91;2&#93;; ++lsb&#91;3&#93;)
for &#40;msb&#91;3&#93; = msb&#91;2&#93;-1; msb&#91;3&#93; >  lsb&#91;3&#93;; --msb&#91;3&#93;)
for &#40;lsb&#91;4&#93; = lsb&#91;3&#93;+1; lsb&#91;4&#93; <  msb&#91;3&#93;; ++lsb&#91;4&#93;)
for &#40;msb&#91;4&#93; = msb&#91;3&#93;-1; msb&#91;4&#93; >  lsb&#91;4&#93;; --msb&#91;4&#93;)
for &#40;lsb&#91;5&#93; = lsb&#91;4&#93;+1; lsb&#91;5&#93; <  msb&#91;4&#93;; ++lsb&#91;5&#93;)
for &#40;msb&#91;5&#93; = msb&#91;4&#93;-1; msb&#91;5&#93; >  lsb&#91;5&#93;; --msb&#91;5&#93;)
&#123;
  magic = &#40;1ULL << lsb&#91;0&#93;) | &#40;1ULL << lsb&#91;1&#93;) ... &#40;1ULL << lsb&#91;5&#93;)
        | &#40;1ULL << msb&#91;0&#93;) | &#40;1ULL << msb&#91;1&#93;) ... &#40;1ULL << msb&#91;5&#93;);
  if ( testMagic&#40; magic, 0x000101010101017E, 11, a0 ) )
     print&#40; magic );
&#125;

C contest. Un-unrolling the loops

SamuelS · Post by **SamuelS** » Sat Mar 23, 2013 7:57 pm

Gerd Isenberg wrote:nonsense snoob. Better something like this:

Code: Select all

int lsb&#91;6&#93;, msb&#91;6&#93;;
for &#40;lsb&#91;0&#93; =        5; lsb&#91;0&#93; <      16; ++lsb&#91;0&#93;)
for &#40;msb&#91;0&#93; =       62; msb&#91;0&#93; >      51; --msb&#91;0&#93;)
for &#40;lsb&#91;1&#93; = lsb&#91;0&#93;+1; lsb&#91;1&#93; <  msb&#91;0&#93;; ++lsb&#91;1&#93;)
for &#40;msb&#91;1&#93; = msb&#91;0&#93;-1; msb&#91;1&#93; >  lsb&#91;1&#93;; --msb&#91;1&#93;)
for &#40;lsb&#91;2&#93; = lsb&#91;1&#93;+1; lsb&#91;2&#93; <  msb&#91;1&#93;; ++lsb&#91;2&#93;)
for &#40;msb&#91;2&#93; = msb&#91;1&#93;-1; msb&#91;2&#93; >  lsb&#91;2&#93;; --msb&#91;2&#93;)
for &#40;lsb&#91;3&#93; = lsb&#91;2&#93;+1; lsb&#91;3&#93; <  msb&#91;2&#93;; ++lsb&#91;3&#93;)
for &#40;msb&#91;3&#93; = msb&#91;2&#93;-1; msb&#91;3&#93; >  lsb&#91;3&#93;; --msb&#91;3&#93;)
for &#40;lsb&#91;4&#93; = lsb&#91;3&#93;+1; lsb&#91;4&#93; <  msb&#91;3&#93;; ++lsb&#91;4&#93;)
for &#40;msb&#91;4&#93; = msb&#91;3&#93;-1; msb&#91;4&#93; >  lsb&#91;4&#93;; --msb&#91;4&#93;)
for &#40;lsb&#91;5&#93; = lsb&#91;4&#93;+1; lsb&#91;5&#93; <  msb&#91;4&#93;; ++lsb&#91;5&#93;)
for &#40;msb&#91;5&#93; = msb&#91;4&#93;-1; msb&#91;5&#93; >  lsb&#91;5&#93;; --msb&#91;5&#93;)
&#123;
  magic = &#40;1ULL << lsb&#91;0&#93;) | &#40;1ULL << lsb&#91;1&#93;) ... &#40;1ULL << lsb&#91;5&#93;)
        | &#40;1ULL << msb&#91;0&#93;) | &#40;1ULL << msb&#91;1&#93;) ... &#40;1ULL << msb&#91;5&#93;);
  if ( testMagic&#40; magic, 0x000101010101017E, 11, a0 ) )
     print&#40; magic );
&#125;

C contest. Un-unrolling the loops

LOL. First time I see 12 nested for-loops.

I am currently running a test where the upper bits (45-62) are tested systematically while the lower bits are generated at random. I think the higher bits are more critical, because they map the first row (b1-g1) to the bits that matter. But I might be wrong. Let's see.

Codesquid · Post by **Codesquid** » Sun Mar 24, 2013 9:26 am

SamuelS wrote:My first question is: Is it worth the effort to try to pack the attack tables in fewer kilobytes? I guess processor cache sizes still limit the performance?

Since the outcome of the magic move generation is symmetric with respect to the square index and occupation mask, I tried to half memory requirements by rotating the square index and occupation mask if the square index is larger than 31.

While memory usage for the tables was halved, performance was slightly worse. Even making use of inline assembly to utilize the processor's bswap instruction didn't help.

Magic bitboards: Cache issues and state of the art?

Magic bitboards: Cache issues and state of the art?

Re: Magic bitboards: Cache issues and state of the art?

Re: Magic bitboards: Cache issues and state of the art?

Re: Magic bitboards: Cache issues and state of the art?

Re: Magic bitboards: Cache issues and state of the art?

Re: Magic bitboards: Cache issues and state of the art?

Re: Magic bitboards: Cache issues and state of the art?

Re: Magic bitboards: Cache issues and state of the art?

Re: Magic bitboards: Cache issues and state of the art?

Re: Magic bitboards: Cache issues and state of the art?