DIRECT BITBOARD MOVEGENERATION

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: DIRECT BITBOARD MOVEGENERATION

Post by Gerd Isenberg »

Desperado wrote:hi everyone again.

today i tested some ideas. Let me start up with what i have done
to compute the msb. i order from slow to fast.

1. fasttest() -> performs -97%
testing bits, starting from sq to msb, obstructed by end of line
(in simple loop)

2. divconq() -> performs -78%
check always upper half of line, splits until msb is found

3. BitScanReverse64 intrinsic -> performs -10%

4. bsr (double conversion, from Gerd). my original one +-0

5. msb-lookup +30%

Code: Select all

//******************************************************************************

	 //...
	 for&#40;BTB k=1;k<256;k++)
		&#123;
		 msk.msb_rank&#91;k&#93; = bsr64&#40;k&#41;; //UI_08 msb_rank&#91;256&#93;; maybe 128 is enough
		 msk.msb_file&#91;k&#93; = bsf64&#40;k&#41;; // ""
		&#125;
	 //...

//******************************************************************************

BTB msb_rank&#40;BTB bb,SQR_ID sq64&#41;
	&#123;
	 UI_08 shift = &#40;sq64>>3&#41; << 3;

	 return&#40;&#40;BTB&#41; 1 << &#40;msk.msb_rank&#91;bb>>shift&#93; + shift&#41;);
	&#125;

//******************************************************************************

BTB msb_file&#40;BTB bb,SQR_ID sq64&#41;
	&#123;
	 //obstruction to rank 1 has to be set...
	 BTB x;

	 x   = bb >> &#40;sq64 & 7&#41;;
	 x  *= diag;
	 x >>= 56;
	 x   = bR8 >> &#40;msk.msb_file&#91;x&#93;*8&#41;;

	 return&#40;x & bb&#41;;
	&#125;

//******************************************************************************
So what conclusion has to be done here, with respect to Kindergarten BB ?

The 2 msb lookups are faster than computing by bitscan-reverse !?
should someone now use KG-BB ? Where are the differences ?
Because i am not (of course not) expert for KG-BB. What s about the memory usage ?
Or which kind of bsr should be used ? (Gerd's double conversion is the fastest i used until now. it s faster than the intrinsic ? or am i doing sth totally wrong ?)

note:

- in the case how the msb is used for the obstruction difference, it
is enough to isolate the corresponding file _OR_ rank, to build the ms1b.
That may lead to some other ideas.

- maybe a fast magic bsr-function without lookup can be created, due
to the simpleness of a masked line ?? any idea for that ? (would be enough to have it for one line,all other lines maybe shifted...)

well, now with the kind of msb-lookup the performance comes closer to
the magic approach, some more 20 % speedup and it will beat it...
I fear there is something seriously broken with your time measurement. Loop tests with small bodies are too chaotic, processor may internally "unroll" the loop. You should use Time Stamp Counter aka x86 RDTSC instruction with leading cpuid to measure how many cycles a routine takes. HGM once posted some nice code sample to index an array of counters indexed by cycle difference from RDTSC, couldn't find it yet. I will soon put some stuff in cpw.

Your msb_file approach is fastest? Double conversion faster than bsr? Hmmm. What processor do you use? AMD K8? Can you eventually post the generated assembly of the routines?

Also compare your msb_file code with kindergerten files.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: DIRECT BITBOARD MOVEGENERATION

Post by Desperado »

Hi Gerd.

1. my processor is amd athlon 64 x2 5200+ (~2.7Ghz)

2. of course i ve overflow the chesswiki for the kindergarten-bb.
And yes, when exctracting the msb by lookup one can easily lookup
the whole target bitboard instead.
I undestand so far, but my head full of magic,obstruction ( :? ) stuff i
thought there may be a difference in its detail (which i dont get by
overflying the article)
But doesn t seem to be so.

3. one more yes. my testing routine seems also to be a flop :(.
The reason for this may be that s my dirty trial modul, where i put
everything (quick and dirty to see what happens)
to test, before writing any clean moduls.

4. So i will do the following.
a: writing clean modul for obstruction difference.
b: try to build a reliable test routine

When all this is done, i hope i ll have some nice results.
Finished the work on 4. i will try to post some assembly additionally.
4b will get hard work because i am not very experienced with building
test routines, and that s why i am thankful for any help i can get for this.

"So i ask everyone who likes to do", compare the "OD" with KG or Magics or any other scheme you have at hand.

And Gerd, i was suprised (maybe similar like you), so i didnt trust it.
So i put so many "?" instead of "!".

So i see once again, there are lot of things where i need a lot of help
Best regards.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: DIRECT BITBOARD MOVEGENERATION

Post by Gerd Isenberg »

Desperado wrote:Hi Gerd.

1. my processor is amd athlon 64 x2 5200+ (~2.7Ghz)

2. of course i ve overflow the chesswiki for the kindergarten-bb.
And yes, when exctracting the msb by lookup one can easily lookup
the whole target bitboard instead.
I undestand so far, but my head full of magic,obstruction ( :? ) stuff i
thought there may be a difference in its detail (which i dont get by
overflying the article)
But doesn t seem to be so.

3. one more yes. my testing routine seems also to be a flop :(.
The reason for this may be that s my dirty trial modul, where i put
everything (quick and dirty to see what happens)
to test, before writing any clean moduls.

4. So i will do the following.
a: writing clean modul for obstruction difference.
b: try to build a reliable test routine

When all this is done, i hope i ll have some nice results.
Finished the work on 4. i will try to post some assembly additionally.
4b will get hard work because i am not very experienced with building
test routines, and that s why i am thankful for any help i can get for this.

"So i ask everyone who likes to do", compare the "OD" with KG or Magics or any other scheme you have at hand.

And Gerd, i was suprised (maybe similar like you), so i didnt trust it.
So i put so many "?" instead of "!".

So i see once again, there are lot of things where i need a lot of help
Best regards.
Hi Michael,
Ok, forget bsr on your K8. 11 cycles vector path. 2*bsr == 22 cycles, where no other instruction can be processed, so it ends up with 40 cycles for rook or so. That is not a fair test for Obstruction Difference ;-) You can easily beat that with mul/shift/lookup, like a DeBruijn mul is faster than K8 bsf. I wonder that even double conversion was faster. On OD, I think 16 cycles should be possible for the whole rook/bishop bsr based routine on a core2, or with lzcnt on K10/Nehalem, only using scratch registers which don't need to be preserved.

Don't hesitate to take a break up and then from that fascinating but insignificant bitboard stuff ;-)
It simply takes some time.

Cheers,
Gerd
diep
Posts: 1822
Joined: Thu Mar 09, 2006 11:54 pm
Location: The Netherlands

Re: DIRECT BITBOARD MOVEGENERATION

Post by diep »

Gerd Isenberg wrote:
Desperado wrote:Hi Gerd.

1. my processor is amd athlon 64 x2 5200+ (~2.7Ghz)

2. of course i ve overflow the chesswiki for the kindergarten-bb.
And yes, when exctracting the msb by lookup one can easily lookup
the whole target bitboard instead.
I undestand so far, but my head full of magic,obstruction ( :? ) stuff i
thought there may be a difference in its detail (which i dont get by
overflying the article)
But doesn t seem to be so.

3. one more yes. my testing routine seems also to be a flop :(.
The reason for this may be that s my dirty trial modul, where i put
everything (quick and dirty to see what happens)
to test, before writing any clean moduls.

4. So i will do the following.
a: writing clean modul for obstruction difference.
b: try to build a reliable test routine

When all this is done, i hope i ll have some nice results.
Finished the work on 4. i will try to post some assembly additionally.
4b will get hard work because i am not very experienced with building
test routines, and that s why i am thankful for any help i can get for this.

"So i ask everyone who likes to do", compare the "OD" with KG or Magics or any other scheme you have at hand.

And Gerd, i was suprised (maybe similar like you), so i didnt trust it.
So i put so many "?" instead of "!".

So i see once again, there are lot of things where i need a lot of help
Best regards.
Hi Michael,
Ok, forget bsr on your K8. 11 cycles vector path. 2*bsr == 22 cycles, where no other instruction can be processed, so it ends up with 40 cycles for rook or so. That is not a fair test for Obstruction Difference ;-) You can easily beat that with mul/shift/lookup, like a DeBruijn mul is faster than K8 bsf. I wonder that even double conversion was faster. On OD, I think 16 cycles should be possible for the whole rook/bishop bsr based routine on a core2, or with lzcnt on K10/Nehalem, only using scratch registers which don't need to be preserved.

Don't hesitate to take a break up and then from that fascinating but insignificant bitboard stuff ;-)
It simply takes some time.

Cheers,
Gerd
You sure if you do 2 after each other it is "only" 22 cycles?

Further isn't it locking out all other execution resources either?

Maybe first and last cycles if you're lucky only.

So no other instructions effectively can get processed while doing a BSF either. That's really making things more ugly i bet than "just 22 cycles".
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: DIRECT BITBOARD MOVEGENERATION

Post by Desperado »

Hi everybody again.
Here is my (beta) version of lineattacks, but some questions still left.

1. Where can i learn more about cycles and how to measure them?
2. What conclusions can i get from my assembly ?
3. Compared to assembly of Gerd , here are only 15 instead of 26
intstructions. What do i have to think about this? (double speed :lol: ?!)
4. instead of hi &=-hi, isnt it better to use also bitscan ? wouldnt be there
less assembler instructions ?

Code: Select all


static BTB_T attack_line&#40;const BTB_T occ,LINEMASK_T *const msk,const SQR_T sq64&#41;
	&#123;
	 ULONG id_msb;
	 BTB_T lo;
	 BTB_T hi;

	 lo = msk->line_lo&#91;sq64&#93; & occ;
	 bsr64&#40;&id_msb,lo|1&#41;;
 	lo = &#40;BTB_T&#41;-1 << id_msb;			

	 hi = msk->line_hi&#91;sq64&#93; & occ;
	 hi&=-hi;					

	 return&#40;msk->line_ex&#91;sq64&#93; & &#40;2*hi+lo&#41;); 
	&#125;

64 Bit compiled
============

Code: Select all

; Function compile flags&#58; /Ogtpy
;	COMDAT ?attack_line@@YA_K_KQEAULINEMASK_T@@E@Z
_TEXT	SEGMENT
occ$ = 8
msk$ = 16
sq64$ = 24
?attack_line@@YA_K_KQEAULINEMASK_T@@E@Z PROC		; attack_line, COMDAT
; File c&#58;\users\michael\documents\visual studio 2008\projects\chessengine\chessengine\btb.cpp
; Line 186
	movzx	r10d, r8b
	mov	r9, rcx
; Line 191
	mov	rax, QWORD PTR &#91;rdx+r10*8+1024&#93;
	and	rax, rcx
; Line 192
	or	rax, 1
	bsr	rcx, rax
; Line 196
	mov	rax, QWORD PTR &#91;rdx+r10*8+512&#93;
	and	rax, r9
; Line 197
	mov	r8, rax
	neg	r8
	and	r8, rax
	or	rax, -1
	shl	rax, cl
; Line 199
	lea	rax, QWORD PTR &#91;rax+r8*2&#93;
	and	rax, QWORD PTR &#91;rdx+r10*8&#93;
; Line 201
	ret	0
?attack_line@@YA_K_KQEAULINEMASK_T@@E@Z ENDP		; attack_line
32 Bit compiled
=============

Code: Select all

; Function compile flags&#58; /Ogtpy
;	COMDAT ?attack_line@@YA_K_KQAULINEMASK_T@@E@Z
_TEXT	SEGMENT
_occ$ = 8						; size = 8
_id_msb$ = 16						; size = 4
_sq64$ = 16						; size = 1
?attack_line@@YA_K_KQAULINEMASK_T@@E@Z PROC		; attack_line, COMDAT
; _msk$ = edi
; File c&#58;\users\michael\documents\visual studio 2008\projects\chessengine\chessengine\btb.cpp

; Line 191
	mov	ecx, DWORD PTR _occ$&#91;esp-4&#93;
	push	ebx
	push	ebp
	push	esi
	movzx	esi, BYTE PTR _sq64$&#91;esp+8&#93;
	mov	eax, DWORD PTR &#91;edi+esi*8+1024&#93;
	and	eax, ecx
; Line 192
	or	eax, 1
	bsr	edx, eax
; Line 196
	mov	eax, DWORD PTR &#91;edi+esi*8+512&#93;
	and	eax, ecx
	mov	ecx, DWORD PTR &#91;edi+esi*8+516&#93;
	and	ecx, DWORD PTR _occ$&#91;esp+12&#93;
	mov	DWORD PTR _id_msb$&#91;esp+8&#93;, edx
; Line 197
	mov	edx, eax
	neg	edx
	mov	ebx, ecx
	adc	ebx, 0
; Line 199
	push	0
	neg	ebx
	push	2
	and	ebx, ecx
	and	edx, eax
	push	ebx
	push	edx
	call	__allmul
	mov	ecx, DWORD PTR _id_msb$&#91;esp+8&#93;
	mov	ebx, eax
	mov	ebp, edx
	or	eax, -1
	or	edx, -1
	call	__allshl
	add	ebx, eax
	adc	ebp, edx
	and	ebp, DWORD PTR &#91;edi+esi*8+4&#93;
	and	ebx, DWORD PTR &#91;edi+esi*8&#93;
; Line 201
	pop	esi
	mov	edx, ebp
	pop	ebp
	mov	eax, ebx
	pop	ebx
	ret	0

?attack_line@@YA_K_KQAULINEMASK_T@@E@Z ENDP		; attack_line
thx
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: DIRECT BITBOARD MOVEGENERATION

Post by Gerd Isenberg »

Desperado wrote:Hi everybody again.
Hi Michael,
Desperado wrote: Here is my (beta) version of lineattacks, but some questions still left.
1. Where can i learn more about cycles and how to measure them?
1. Agner Fog has good papers, and the amd and intel optimization manuals, see bitscan references. RDTSC is the instruction. But it a bit tricky to use. Windows api has QueryPerformanceCounter.
Desperado wrote:2. What conclusions can i get from my assembly ?
32-bit is horror, 64-bit is fine, see below
Desperado wrote:3. Compared to assembly of Gerd , here are only 15 instead of 26
intstructions. What do i have to think about this? (double speed :lol: ?!)
The 64 bit assembly looks fine, but don't compare it with the 26 from by hand made assembly, since it was for two lines (rook or bishop), and the first one I posted was this one with only 12, where I already considered square and line passed by a pointer:

Code: Select all

; rcx - occ
; rdx - pointer to mask structure&#91;sq&#93;&#91;line&#93; &#40;not changed&#41;
mov   rax, rcx
and   rcx, &#91;rdx + LOW&#93;
or    rcx, 1
mov   r8, -1
bsr   rcx, rcx
and   rax, &#91;rdx + HIGH&#93;
shl   r8, cl
mov   rcx, rax
neg   rcx
and   rax, rcx
lea   rax, &#91;2*rax + r8&#93;
and   rax, &#91;rdx + MASKEX&#93; 
Better dont pass square as char type to save the mov zero extension of bytereg to 32-bit register.

Code: Select all

movzx   r10d, r8b 
You may use an array of structs instead of structs of arrays,

Code: Select all

LINEMASK_T mask&#91;64&#93;&#91;4&#93;; // 6KByte
and to call it with

Code: Select all

attacks = attack_line&#40;occ, &mask&#91;sq&#93;&#91;line&#93;);
to save passing square at all.

Code: Select all

static BTB_T attack_line&#40;BTB_T occ, const LINEMASK_T *pmsk&#41;
&#123;
    ULONG id_msb;
    BTB_T lo, hi;

    lo = pmsk->line_lo & occ;
    hi = pmsk->line_hi & occ;
    bsr64&#40;&id_msb,lo|1&#41;;
    lo = &#40;BTB_T&#41;-1 << id_msb;         
    hi&=-hi;               
    return&#40;pmsk->line_ex & &#40;2*hi+lo&#41;);
&#125;
Desperado wrote:4. instead of hi &=-hi, isnt it better to use also bitscan ? wouldnt be there
less assembler instructions ?
Nope, hi & -hi is very cheap. Bitscan and dependent variable shift (constrained to cl) are potentially more expensive, even on core2 extreme edition. You even see the mov reg, -1 was actually done by an or instruction in your assembly.

Cheers,
Gerd
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: DIRECT BITBOARD MOVEGENERATION

Post by Desperado »

First: Thx Gerd once again.

But I forgot one more question, which may be more of general interest.

5. We can easily reduce the 6K table to 4K table because

line_ex = line_lo | line_hi

just a simple additional _or_ ...

So is that a matter of taste because we _only_ save 2K of memory ?
(4K (2^12), looks nice to me). Or is that more a question of needs
(performance vs resources) ?
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: DIRECT BITBOARD MOVEGENERATION

Post by Gerd Isenberg »

Desperado wrote:First: Thx Gerd once again.

But I forgot one more question, which may be more of general interest.

5. We can easily reduce the 6K table to 4K table because

line_ex = line_lo | line_hi

just a simple additional _or_ ...

So is that a matter of taste because we _only_ save 2K of memory ?
(4K (2^12), looks nice to me). Or is that more a question of needs
(performance vs resources) ?
Sure, worth a try, 1/4 versus 3/8 chachelines per entry. With 4 mask structures per cacheline (64 bytes 8 bitboards), no entry crosses a cacheline boarder, for instance all four queen lines in one instead of 1.5. Otoh, it costs an or-instruction, a register per line (or multiple reads from mem before the final 'and') and therefor longer code, which multiplies if heavily inlined.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: DIRECT BITBOARD MOVEGENERATION

Post by Desperado »

hi everyone.

Today i had the time to work hard on OD (the whole day :shock: ).
I had different ideas, which lead to the following result.

Code: Select all

BTB_T attack_B&#40;const BTB_T occ,const BTB_T bsw,const SQR_T sq64&#41;
	&#123;
	 BTB_T lo,hi,diag,anti;

	 //**************************************
	 //* COMPUTING DIAG &#40;SWNE&#41;			     *
	 //**************************************

	 lo = msk&#91;MSK_SENW&#93;&#91;sq64^56&#93;.line_hi & bsw; //reversed direction SWNE->SENW
	 lo|=bH8;									        //instead of set bit 0, set bit 63
	 lo&=-lo;									        //LSB
	 lo = _byteswap_uint64&#40;lo&#41;;					  //lsb becomes msb of lo
	 
	 hi = msk&#91;MSK_SWNE&#93;&#91;sq64&#93;.line_hi & occ;
	 hi&=-hi;

	 diag = msk&#91;MSK_SWNE&#93;&#91;sq64&#93;.line_ex & &#40;2*hi-lo&#41;;

	 //**************************************
	 //* COMPUTING ANTI &#40;SENW&#41;				  *
	 //**************************************

	 lo = msk&#91;MSK_SWNE&#93;&#91;sq64 ^ 56&#93;.line_hi & bsw; //reversed direction SENW->SWNE
	 lo|=bH8;									          //instead of set bit 0, set bit 63
	 lo&=-lo;									          //LSB
	 lo = _byteswap_uint64&#40;lo&#41;;					    //lsb becomes msb of lo

	 hi = msk&#91;MSK_SENW&#93;&#91;sq64&#93;.line_hi & occ;
	 hi&=-hi;

	 anti= msk&#91;MSK_SENW&#93;&#91;sq64&#93;.line_ex & &#40;2*hi-lo&#41;;

	 return&#40;diag|anti&#41;;
	&#125;
Basic Ideas and "improvements"(!?):

1: As you can see there is no bitscan anymore.
The classical idea ---
of using one "reversed (it s more a byteswap-board" bitboard,
updated by move once and then available for the complete stage 1,2,3,...100 attack calculations)
--- is back (at least for me :lol: )
Maybe (i hope so), this time it seems to be very profitable.

2: I thought there is an easier way to reverse a _single bit_(isolated but not index avail.) bitboard.
i couldnt find a faster solution (that takes the whole day...) then the byteswap ( and i think also to
open a new thread where we maybe find some new ideas to do that smarter) (THIS WAS THE INITIATOR for this version.


3: (Some Facts)

- while the bitscan versions having the option between 6K (4K) mask-tables, this version only needs the _hi_masks, so only 4K are necessary!
- at least for machines (like mine, amd k8), this performes more than 2xtimes faster !!! (here i hope too, that byteswap is in general competetive with bitscan (hey Gerd are you there ??? :lol: )
- this version _looks_ very competetive against HQ, because the bishop
has only 2-byteswaps against 6-byteswaps...
- if i could(would) trust my magic-bitboards, now this version is also
competetive with the fast magics.

Cheers. :)
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: DIRECT BITBOARD MOVEGENERATION

Post by Gerd Isenberg »

Desperado wrote:hi everyone.

Today i had the time to work hard on OD (the whole day :shock: ).
I had different ideas, which lead to the following result.


Basic Ideas and "improvements"(!?):

1: As you can see there is no bitscan anymore.
The classical idea ---
of using one "reversed (it s more a byteswap-board" bitboard,
updated by move once and then available for the complete stage 1,2,3,...100 attack calculations)
--- is back (at least for me :lol: )
Maybe (i hope so), this time it seems to be very profitable.

2: I thought there is an easier way to reverse a _single bit_(isolated but not index avail.) bitboard.
i couldnt find a faster solution (that takes the whole day...) then the byteswap ( and i think also to
open a new thread where we maybe find some new ideas to do that smarter) (THIS WAS THE INITIATOR for this version.


3: (Some Facts)

- while the bitscan versions having the option between 6K (4K) mask-tables, this version only needs the _hi_masks, so only 4K are necessary!
- at least for machines (like mine, amd k8), this performes more than 2xtimes faster !!! (here i hope too, that byteswap is in general competetive with bitscan (hey Gerd are you there ??? :lol: )
- this version _looks_ very competetive against HQ, because the bishop
has only 2-byteswaps against 6-byteswaps...
- if i could(would) trust my magic-bitboards, now this version is also
competetive with the fast magics.

Cheers. :)
Very good and creative, Reverse Obstruction Difference ;-)

Like in Hyperbola Quintessence, bswap does not work for ranks, and loses it's generality for all lines. Yes, we need a fast bit reversal operation, like announced by amd (still?) with sse5 (PPERM instruction).

Lets count some ops for one line:

Reverse Obstruction Difference

Code: Select all

    lo = msk&#91;MSK_SENW&#93;&#91;sq64^56&#93;.line_hi & bsw; //reversed direction SWNE->SENW
    lo|=bH8;                                   //instead of set bit 0, set bit 63
    lo&=-lo;                                   //LSB
    lo = _byteswap_uint64&#40;lo&#41;;                 //lsb becomes msb of lo
   
    hi = msk&#91;MSK_SWNE&#93;&#91;sq64&#93;.line_hi & occ;
    hi&=-hi;

    diag = msk&#91;MSK_SWNE&#93;&#91;sq64&#93;.line_ex & &#40;2*hi-lo&#41;; 
xor, 5 and, or, 2*negate, bswap, shl(lea), sub, 12 ops (~3 more than pure OD, but no bsr) plus once to calculate bsw as swapped occupancy.

Two independent instruction chains, register friendly, good balance between register ops and memory reads (as OD). Two lines might be processed in parallel with up to 4 instructions per cycle.

Hyperbola Quintessence
In HQ you also have the option to use sq^56 instead of swapping the single bit, another read from a second cacheline (which is fine btw):

Code: Select all

U64 diagonalAttacks&#40;U64 occ, enumSquare sq&#41; &#123;
   U64 forward, reverse;
   forward = occ & smsk&#91;sq&#93;.diagonalMaskEx;
   reverse  = _byteswap_uint64&#40;forward&#41;;
   forward -= smsk&#91;sq&#93;.bitMask;
   reverse -= smsk&#91;sq^56&#93;.bitMask; // _byteswap_uint64&#40;smsk&#91;sq&#93;.bitMask&#41;
   forward ^= _byteswap_uint64&#40;reverse&#41;;
   forward &= smsk&#91;sq&#93;.diagonalMaskEx;
   return forward;
&#125;
2 and, 2 bswap, 2 sub, 2 xor, 8 operations, but needs an extra register to presave the mask. It needs 2.5 KByte for all four lines, in conjunction with a 512 byte first rank attack getter. More dependencies than ROD, but as well relaxed if processing two directions in parallel.

Bishop attacks with five (not six!) bswaps, which are as cheap as other instructions on K8/K10. VC2008 generated assembly:

Code: Select all

occ$ = 16
sq$ = 24
?bishopAttacks@@YA_K_KI@Z PROC
  00000   40 53                push    rbx
  00002   8b c2                mov    eax, edx
  00004   4c 8d 15 00 00 00 00 lea    r10, OFFSET FLAT&#58;?smsk
  0000b   48 c1 e0 05          shl    rax, 5
  0000f   4a 8b 5c 10 08       mov    rbx, QWORD PTR &#91;rax+r10+8&#93;  ; diagonalMaskEx
  00014   4e 8b 4c 10 10       mov    r9,  QWORD PTR &#91;rax+r10+16&#93; ; antidiagMaskEx
  00019   4e 8b 14 10          mov    r10, QWORD PTR &#91;rax+r10&#93;    ; r &#58;= 1 << sq
  0001d   4c 8b db             mov    r11, rbx                    ; diagonalMaskEx
  00020   49 8b d1             mov    rdx, r9                     ; antidiagMaskEx
  00023   4d 8b c2             mov    r8, r10                     ; r &#58;= 1 << sq
  00026   48 23 d1             and    rdx, rcx                    ; anti & occ
  00029   4c 23 d9             and    r11, rcx                    ; dia  & occ
  0002c   49 0f c8             bswap  r8                          ; r'
  0002f   48 8b c2             mov    rax, rdx                    ; ant
  00032   49 8b cb             mov    rcx, r11                    ; dia
  00035   49 2b d2             sub    rdx, r10                    ; ant - r
  00038   48 0f c8             bswap  rax                         ; ant'
  0003b   48 0f c9             bswap  rcx                         ; dia'
  0003e   4d 2b da             sub    r11, r10                    ; dia - r
  00041   49 2b c0             sub    rax, r8                     ; ant' - r'
  00044   49 2b c8             sub    rcx, r8                     ; dia' - r'
  00047   48 0f c8             bswap  rax                         ;&#40;ant' - r')'
  0004a   48 0f c9             bswap  rcx                         ;&#40;dia' - r')'
  0004d   48 33 c2             xor    rax, rdx                    ; ant &#58;= &#40;ant' - r')' ^ &#40;ant - r&#41;
  00050   49 33 cb             xor    rcx, r11                    ; dia &#58;= &#40;dia' - r')' ^ &#40;dia - r&#41;
  00053   49 23 c1             and    rax, r9                     ; ant &= antidiagMaskEx
  00056   48 23 cb             and    rcx, rbx                    ; dia &= diagonalMaskEx
  00059   48 03 c1             add    rax, rcx                    ; attacks &#58;= dia + ant
  0005c   5b                   pop    rbx
  0005d   c3                   ret    0
?bishopAttacks@@YA_K_KI@Z ENDP
I don't like push/pop rbx that much. How about your assembly?

Cheers,
Gerd