Nice Math - Strange Conclusions

Discussion of chess software programming and technical issues.

Moderator: Ras

Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Nice Math - Strange Conclusions

Post by Gerd Isenberg »

bob wrote:
michiguel wrote:Gerd, what is not published does not exist, and there are good reasons for academics to behave that way. What is published passed a reviewing process; what is not didn't. What I really encourage you to do is to publish your work and include whoever contributed to it significantly as a co-author. It should not be difficult after all the fantastic work you did.

Miguel
I completely agree. It is easy for us to criticize. I am not a member of every discussion board related to chess, but I am a member of the ICGA and read the journal as it comes out. If something is not published, how would one know it exists? Although I must add that I have had more than one NSF proposal get shot down for this very reason, because someone on the review panel knew of similar research that was not public, which makes it impossible to cite or use.

That's the only reason I wrote the rotated bitboard paper, and why I am also going to do a parallel search paper on how Crafty works, just so it becomes well-documented and nobody will have to reinvent the same wheel again.

The magic stuff ought to be documented in a journal article as well. I'd be more than willing to help if needed, but the two leaders of the development should really document it...
We already started with the wiki ;-)

Yes, it is easy to criticize. Their math and proof with congruent modulus are enlightening and instructive. In the meantime I have also become aware of the distinction between perfect hashing and minimal perfect hashing, which I initially confused. Thus their claim of perfect hashing is fine. Files and anti-diagonals are nearly minimal; diagonals have a huge gap.

Imho it is justified to criticize their conclusion and their decision not to test their approach against magic bitboards - where a paper plus source code by Pradu Kannan exists, as mentioned and quoted in the article.

64-bit modulus is not that cheap - maybe the authors were not aware of that. Even with reciprocal multiplication, one line takes more operations than magic bitboards take for both lines (ignoring memory issues). We already had similar discussions concerning bitscan: isolated bit mod 65 versus De Bruijn multiplication plus shift right 58.

For (a mod constant):

Code: Select all

reciconst = 2^(64+k) div const ; // compile-time 64-bit constant
product   = a * reciconst      ; // 64*64 = 128 bit
quotient  = product >> (64+k)  ;
modulus   = a - const*quotient ; // 0..const-1
occustate = bytelookup[modulus]; // for a perfect hash, e.g. const = 514
On writing a paper on sliding piece attack generation - I am willing to try it, at least for the stuff I completely understand, such as kindergarten bitboards for single lines, where you multiply diagonals by files without carries - all intermediate sums are disjoint.

The nature of magic bitboards - mapping the relevant occupancies of two lines at once - is far over my head. There we have to deal with carries. So far there is only empirical trial and error: feeding in certain subset traversals, or using sparsely populated random numbers. Is there a systematic way to maximize constructive collisions, where different redundant outer occupied bits share common attack sets?

More knowledgeable math guys like Trevor Fenner and Mark Levene should take a closer look at magic bitboards.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Nice Math - Strange Conclusions

Post by bob »

Gerd Isenberg wrote:
bob wrote:
michiguel wrote:Gerd, what is not published does not exist, and there are good reasons for academics to behave that way. What is published passed a reviewing process; what is not didn't. What I really encourage you to do is to publish your work and include whoever contributed to it significantly as a co-author. It should not be difficult after all the fantastic work you did.

Miguel
I completely agree. It is easy for us to criticize. I am not a member of every discussion board related to chess, but I am a member of the ICGA and read the journal as it comes out. If something is not published, how would one know it exists? Although I must add that I have had more than one NSF proposal get shot down for this very reason, because someone on the review panel knew of similar research that was not public, which makes it impossible to cite or use.

That's the only reason I wrote the rotated bitboard paper, and why I am also going to do a parallel search paper on how Crafty works, just so it becomes well-documented and nobody will have to reinvent the same wheel again.

The magic stuff ought to be documented in a journal article as well. I'd be more than willing to help if needed, but the two leaders of the development should really document it...
We already started with the wiki ;-)

Yes, it is easy to criticize. Their math and proof with congruent modulus are enlightening and instructive. In the meantime I have also become aware of the distinction between perfect hashing and minimal perfect hashing, which I initially confused. Thus their claim of perfect hashing is fine. Files and anti-diagonals are nearly minimal; diagonals have a huge gap.

Imho it is justified to criticize their conclusion and their decision not to test their approach against magic bitboards - where a paper plus source code by Pradu Kannan exists, as mentioned and quoted in the article.

64-bit modulus is not that cheap - maybe the authors were not aware of that. Even with reciprocal multiplication, one line takes more operations than magic bitboards take for both lines (ignoring memory issues). We already had similar discussions concerning bitscan: isolated bit mod 65 versus De Bruijn multiplication plus shift right 58.

For (a mod constant):

Code: Select all

reciconst = 2^(64+k) div const ; // compile-time 64-bit constant
product   = a * reciconst      ; // 64*64 = 128 bit
quotient  = product >> (64+k)  ;
modulus   = a - const*quotient ; // 0..const-1
occustate = bytelookup[modulus]; // for a perfect hash, e.g. const = 514
On writing a paper on sliding piece attack generation - I am willing to try it, at least for the stuff I completely understand, such as kindergarten bitboards for single lines, where you multiply diagonals by files without carries - all intermediate sums are disjoint.

The nature of magic bitboards - mapping the relevant occupancies of two lines at once - is far over my head. There we have to deal with carries. So far there is only empirical trial and error: feeding in certain subset traversals, or using sparsely populated random numbers. Is there a systematic way to maximize constructive collisions, where different redundant outer occupied bits share common attack sets?

More knowledgeable math guys like Trevor Fenner and Mark Levene should take a closer look at magic bitboards.
Every time I read one of your posts, everything goes well until I run across something that makes my head explode. :)

My copy arrived today (I am farther from UM than you, obviously) and a quick read does leave a couple of odd impressions. In one place they claim to be better, in another place equivalent, etc... And it seems they are using a trick in Matlab as well (an associative array, as in SNOBOL, where you let the language hash the key). I don't think a scientific test should be based on that approach when faster ways are already being used...

The explanation is pretty good for "how" but the performance aspects left me more confused than informed, and this is supposedly about performance more than anything else.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Nice Math - Strange Conclusions

Post by Gerd Isenberg »

bob wrote: My copy arrived today (I am farther from UM than you, obviously) and a quick read does leave a couple of odd impressions. In one place they claim to be better, in another place equivalent, etc... And it seems they are using a trick in Matlab as well (an associative array, as in SNOBOL, where you let the language hash the key). I don't think a scientific test should be based on that approach when faster ways are already being used...

The explanation is pretty good for "how" but the performance aspects left me more confused than informed, and this is supposedly about performance more than anything else.
As far as I understand, they test pseudo-legal move generation with some test positions:
1.) without hash - a simple loop over files and diagonals (mailbox?)
2.) with hash - their hashing scheme with modulo, probably something like the following C source:

Code: Select all

U64 diagonalMask[64];
U64 antidiagMask[64];
U64 lookupTableDiagAttacks[514][64]; // ??? or 
U64 lookupTableAntiAttacks[257][64];


U64 diagonalAttacks(U64 occ, int sq) {
   int state = (occ & diagonalMask[sq]) % 514;
   return lookupTableDiagAttacks[state][sq];
}

U64 antiDiagAttacks(U64 occ, int sq) {
   int state = (occ & antidiagMask[sq]) % 257;
   return lookupTableAntiAttacks[state][sq];
}

U64 lookupTableRankAttacks[256][64];
U64 lookupTableFileAttacks[258][64];

// rank attacks as usual 
U64 rankAttacks(U64 occ, int sq) {
   int state = ( occ >> (sq&56) ) % 256; // & 255 ;-)
   return lookupTableRankAttacks[state][sq];
}

U64 fileAttacks(U64 occ, int sq) {
   const U64 aFile = C64(0x0101010101010101);
   int state = ( (occ >> (sq&7)) & aFile) % 258;
   return lookupTableFileAttacks[state][sq];
}
One may mask off the outer bits for denser states.
It doesn't look bad at first glance - if only modulo always behaved like a bitwise AND, as it does in the power-of-two rank case.

The direct lookup stuff refers to Sam Tannous' paper:
Avoiding Rotated Bitboards with Direct Lookup
ICGA Journal Vol.30 No.2 pp. 85-91. (June 2007)
http://chessprogramming.wikispaces.com/ ... ctionaries
The technique involves creating four hash tables using the built in hash arrays from an interpreted, high level language (Python).
Sam claimed his direct lookup was a bit faster than rotated bitboards (in Python, I guess). You reported similar performance between rotated and magic. They assumed themselves to be competitive versus direct lookup - voilà.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Nice Math - Strange Conclusions

Post by Gerd Isenberg »

Gerd Isenberg wrote: 64-bit modulus is not that cheap - maybe the authors were not aware of that. Even with reciprocal multiplication, one line takes more operations than magic bitboards take for both lines (ignoring memory issues). We already had similar discussions concerning bitscan: isolated bit mod 65 versus De Bruijn multiplication plus shift right 58.

For (a mod constant):

Code: Select all

reciconst = 2^(64+k) div const ; // compile-time 64-bit constant
product   = a * reciconst      ; // 64*64 = 128 bit
quotient  = product >> (64+k)  ;
modulus   = a - const*quotient ; // 0..const-1
occustate = bytelookup[modulus]; // for a perfect hash, e.g. const = 514
I was wrong about using mod 65 for bitscan purposes - the prime number 67 is required for the densest modulo lookup!
Here is the De Bruijn bitscan versus the modulo-67 version:

Code: Select all

int bitScanForward(U64 bb) {
   return index64[((bb & -bb) * 0x07EDD5E59A4E28C2) >> 58];
}

int bitScanForward(U64 bb) {
   return lookup67[(bb & -bb) % 67];
}
The three gaps are 0, 17 and 34 - since the empty set also maps to residue 0, the mod-67 version can implement a branchless trailing zero count:

Code: Select all

no Bit 0x0000000000000000 mod 67 =  0

Bit  0 0x0000000000000001 mod 67 =  1
Bit  1 0x0000000000000002 mod 67 =  2
Bit  2 0x0000000000000004 mod 67 =  4
Bit  3 0x0000000000000008 mod 67 =  8
Bit  4 0x0000000000000010 mod 67 = 16
Bit  5 0x0000000000000020 mod 67 = 32
Bit  6 0x0000000000000040 mod 67 = 64
Bit  7 0x0000000000000080 mod 67 = 61
Bit  8 0x0000000000000100 mod 67 = 55
Bit  9 0x0000000000000200 mod 67 = 43
Bit 10 0x0000000000000400 mod 67 = 19
Bit 11 0x0000000000000800 mod 67 = 38
Bit 12 0x0000000000001000 mod 67 =  9
Bit 13 0x0000000000002000 mod 67 = 18
Bit 14 0x0000000000004000 mod 67 = 36
Bit 15 0x0000000000008000 mod 67 =  5
Bit 16 0x0000000000010000 mod 67 = 10
Bit 17 0x0000000000020000 mod 67 = 20
Bit 18 0x0000000000040000 mod 67 = 40
Bit 19 0x0000000000080000 mod 67 = 13
Bit 20 0x0000000000100000 mod 67 = 26
Bit 21 0x0000000000200000 mod 67 = 52
Bit 22 0x0000000000400000 mod 67 = 37
Bit 23 0x0000000000800000 mod 67 =  7
Bit 24 0x0000000001000000 mod 67 = 14
Bit 25 0x0000000002000000 mod 67 = 28
Bit 26 0x0000000004000000 mod 67 = 56
Bit 27 0x0000000008000000 mod 67 = 45
Bit 28 0x0000000010000000 mod 67 = 23
Bit 29 0x0000000020000000 mod 67 = 46
Bit 30 0x0000000040000000 mod 67 = 25
Bit 31 0x0000000080000000 mod 67 = 50
Bit 32 0x0000000100000000 mod 67 = 33
Bit 33 0x0000000200000000 mod 67 = 66
Bit 34 0x0000000400000000 mod 67 = 65
Bit 35 0x0000000800000000 mod 67 = 63
Bit 36 0x0000001000000000 mod 67 = 59
Bit 37 0x0000002000000000 mod 67 = 51
Bit 38 0x0000004000000000 mod 67 = 35
Bit 39 0x0000008000000000 mod 67 =  3
Bit 40 0x0000010000000000 mod 67 =  6
Bit 41 0x0000020000000000 mod 67 = 12
Bit 42 0x0000040000000000 mod 67 = 24
Bit 43 0x0000080000000000 mod 67 = 48
Bit 44 0x0000100000000000 mod 67 = 29
Bit 45 0x0000200000000000 mod 67 = 58
Bit 46 0x0000400000000000 mod 67 = 49
Bit 47 0x0000800000000000 mod 67 = 31
Bit 48 0x0001000000000000 mod 67 = 62
Bit 49 0x0002000000000000 mod 67 = 57
Bit 50 0x0004000000000000 mod 67 = 47
Bit 51 0x0008000000000000 mod 67 = 27
Bit 52 0x0010000000000000 mod 67 = 54
Bit 53 0x0020000000000000 mod 67 = 41
Bit 54 0x0040000000000000 mod 67 = 15
Bit 55 0x0080000000000000 mod 67 = 30
Bit 56 0x0100000000000000 mod 67 = 60
Bit 57 0x0200000000000000 mod 67 = 53
Bit 58 0x0400000000000000 mod 67 = 39
Bit 59 0x0800000000000000 mod 67 = 11
Bit 60 0x1000000000000000 mod 67 = 22
Bit 61 0x2000000000000000 mod 67 = 44
Bit 62 0x4000000000000000 mod 67 = 21
Bit 63 0x8000000000000000 mod 67 = 42
Some wiki links:
http://en.wikipedia.org/wiki/Modulo_operation
http://en.wikipedia.org/wiki/Modular_arithmetic
http://en.wikipedia.org/wiki/Chinese_remainder_theorem
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: Nice Math - Strange Conclusions

Post by Gerd Isenberg »

Gerd Isenberg wrote: I was wrong about using mod 65 for bitscan purposes - the prime number 67 is required for the densest modulo lookup!
Here is the De Bruijn bitscan versus the modulo-67 version:

Code: Select all

int bitScanForward(U64 bb) {
   return index64[((bb & -bb) * 0x07EDD5E59A4E28C2) >> 58];
}

int bitScanForward(U64 bb) {
   return lookup67[(bb & -bb) % 67];
}
On modulo performance:

div (idiv), which yields quotient and remainder in one instruction, has a latency of 66-80 or 56-70 cycles (CPUID 0F_3H, 0F_2H) and a throughput of 30 or 23 cycles on Core 2, according to the Intel 64 and IA-32 Architectures Reference Manual.

This is why optimizing compilers use reciprocal multiplication for modulo by a constant: multiply by the precomputed reciprocal, take the quotient, multiply it by the divisor again, and subtract that product from the dividend to get the remainder. Still, a 64*64 = 128-bit mul is a tad slower than a 64*64 = 64-bit one - not to mention the second imul. So De Bruijn mul/shift is clearly faster. The same applies to the attack getter - kindergarten multiplication versus modulo.

Code: Select all

r = a - N * floor(a div N)      ; exact remainder
r = a - N * floor(a * recipN)   ; recipN = precomputed reciprocal of N