DIRECT BITBOARD MOVEGENERATION

Desperado · Post by **Desperado** » Thu Jul 23, 2009 10:27 pm

Hi.

I am testing bitboard generation methods.
My last experiment was with magic bitboards,
and now I am running some tests on the Hyperbola Quintessence idea.

If you look at the return value of the function below,
can someone explain to me why the time factor rises as noted
in the comments in the function?
Can I do something about it?

Code:

#define BB(SQ64)   ((BTB)1<<(SQ64))
#define LO64(SQ64) (BB(SQ64)-1)
#define HI64(SQ64) ((bH8-BB(SQ64))<<1)

BTB rook_trial(const BTB occ,const SQR_T sq64)
	{
	 
	 static const BTB cfle = (bR1|bR8);
	 static const BTB crnk = (bAF|bHF);

	 static BTB fle,rnk,lo,hi;
	 
	 //FILE

	 (lo = LO64(sq64) & (occ|cfle) & msk.msk_file[sq64]) ? lo : lo = BB(sq64);
	 (hi = HI64(sq64) & (occ|cfle) & msk.msk_file[sq64]) ? hi : hi = BB(sq64);
	
	 hi &= -hi;
	 lo  = BB(bsr64(lo));
	 fle = (((hi<<1)-lo) & msk.msk_file[sq64]);


	 //RANK

	 (lo = LO64(sq64) & (occ|crnk) & msk.msk_rank[sq64]) ? lo : lo = BB(sq64);
	 (hi = HI64(sq64) & (occ|crnk) & msk.msk_rank[sq64]) ? hi : hi = BB(sq64);
	
	 hi &= -hi;
	 lo  = BB(bsr64(lo));
	 rnk = (((hi<<1)-lo) & msk.msk_rank[sq64]);

	 //return(rnk); //performs factor 1
	 //return(fle); //performs factor 1

	 return(fle|rnk); //performs about factor 20...? why not 2 or 3 ???
	}

(all file/rank masks are excluding the square itself)

Anyway, I would be happy if someone made some interesting
annotations on the code, so we can discuss ideas, pros, and cons. Thanks.
I will also try to answer any questions about the code.
(My goal at the moment is an acceptable direct computation method, without any lookups.)

PS: Of course I would be happy if someone tries it out and tells me whether the performance is good or not (independent of the "performance problem" I have described above).
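
For anyone who wants to try the idea without the surrounding engine, here is a minimal, self-contained C sketch of the same border trick. It is only a sketch under assumptions: the names `line_trial`, `rook_trial_sketch`, `rook_naive`, `count_mismatches` and `bsr64_portable` are made up for illustration, the masks are computed on the fly instead of coming from the `msk` tables, the loop-based bit scan stands in for a real `bsr64`, and plain locals with an explicit `if` replace the `static` variables and the conditional operator.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64;

#define BORDER_RANKS 0xFF000000000000FFULL /* rank 1 | rank 8 (cfle in the post) */
#define BORDER_FILES 0x8181818181818181ULL /* file a | file h (crnk in the post) */

/* Index of the most significant set bit (x != 0); stand-in for bsr64. */
static int bsr64_portable(U64 x) {
    int i = 0;
    while (x >>= 1) ++i;
    return i;
}

/* One line (file or rank); mask is the line through sq without sq itself. */
static U64 line_trial(U64 occ, U64 border, U64 mask, int sq) {
    U64 bit = 1ULL << sq;
    U64 lo = (bit - 1) & (occ | border) & mask;          /* blockers below sq */
    U64 hi = ~(bit | (bit - 1)) & (occ | border) & mask; /* blockers above sq */
    if (lo == 0) lo = bit;             /* rook already on the low border  */
    if (hi == 0) hi = bit;             /* rook already on the high border */
    hi &= 0 - hi;                      /* nearest blocker above */
    lo = 1ULL << bsr64_portable(lo);   /* nearest blocker below */
    return ((hi << 1) - lo) & mask;    /* bits lo..hi, cut down to the line */
}

static U64 rook_trial_sketch(U64 occ, int sq) {
    assert(sq >= 0 && sq < 64);
    U64 bit = 1ULL << sq;
    U64 file_mask = (0x0101010101010101ULL << (sq & 7)) ^ bit;
    U64 rank_mask = (0xFFULL << (sq & 56)) ^ bit;
    return line_trial(occ, BORDER_RANKS, file_mask, sq)
         | line_trial(occ, BORDER_FILES, rank_mask, sq);
}

/* Reference: walk each of the four rook rays until a blocker or the edge. */
static U64 rook_naive(U64 occ, int sq) {
    static const int df[4] = {0, 0, 1, -1}, dr[4] = {1, -1, 0, 0};
    U64 att = 0;
    for (int d = 0; d < 4; ++d) {
        int f = (sq & 7) + df[d], r = (sq >> 3) + dr[d];
        while (f >= 0 && f < 8 && r >= 0 && r < 8) {
            U64 b = 1ULL << (r * 8 + f);
            att |= b;
            if (occ & b) break;
            f += df[d]; r += dr[d];
        }
    }
    return att;
}

/* Compares both routines on pseudo-random positions (xorshift64). */
static int count_mismatches(int rounds) {
    U64 x = 0x9E3779B97F4A7C15ULL;
    int bad = 0;
    for (int i = 0; i < rounds; ++i) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        int sq = (int)(x >> 58);
        U64 occ = (x & ((x << 29) | (x >> 35))) | (1ULL << sq);
        bad += rook_trial_sketch(occ, sq) != rook_naive(occ, sq);
    }
    return bad;
}
```

Calling `count_mismatches(100000)` compares the sketch against the ray-walking reference. It says nothing about speed, only that the arithmetic produces the expected attack sets.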

Gerd Isenberg · Post by **Gerd Isenberg** » Thu Jul 23, 2009 11:45 pm

1+1=20?
Maybe the optimizing compiler is as obfuscated as I am after looking at your code.

If you make separate routines for file and rank and "or" them together, what happens then? Non-inlined versus inlined? Some more comments on the code would be appreciated; explain each line in words. I found statements like

Code:

( a = expression) ? a : (a = otherExpression );

rather confusing and would prefer

Code:

a = expression ? expression : otherExpression;

Desperado · Post by **Desperado** » Fri Jul 24, 2009 12:55 am

Thanks for the reply.

OK, the conditional operator isn't my best friend.

I will add some description for the "file" code. The rank code
is analogous, as you can see.

I am surprised too about "1+1 = 20"; that's why I posted it. I cannot
explain it, so I asked for help.

And yes, I tried the things you mentioned. Same effect...

All I can add at the moment is that, independent of inlining/non-inlining or separate routines/one routine,
as soon as the results are put together: 1+1 = 20!

Code:


#define BB(SQ64)   ((BTB)1<<(SQ64))
#define LO64(SQ64) (BB(SQ64)-1)
#define HI64(SQ64) ((bH8-BB(SQ64))<<1)

	 //FILE

	 //******************************************************************************
	 //* STEP_1:
	 //*
	 //* LO64(sq64) -> bitboard with all bits set below the sq index (excluding sq)
	 //* HI64(sq64) -> bitboard with all bits set above the sq index (excluding sq)
	 //* occ|cfle   -> makes sure the file has at least the bits on rank 8 and rank 1 set
	 //* msk_file   -> all bits set on the file of the slider (rook), excluding sq itself
	 //*
	 //* lo         -> all occupancy bits on the file below sq
	 //*            -> if lo == 0 the rook is on the border, so the rook's own square is set
	 //* hi         -> same as for lo, but for the bits above sq
	 //******************************************************************************

	 (lo = LO64(sq64) & (occ|cfle) & msk.msk_file[sq64]) ? lo : lo = BB(sq64);
	 (hi = HI64(sq64) & (occ|cfle) & msk.msk_file[sq64]) ? hi : hi = BB(sq64);
	
	 //******************************************************************************
	 //*
	 //*STEP_2:
	 //*
	 //*hi:  -> extract the lsb-board of all bits above sq
	 //*lo:  -> extract the msb-board of all bits below sq
	 //*fle: -> connect by sub
	 //*fle: -> (hi<<1) is just for including the nn-blocker.
	 //******************************************************************************

	 hi &= -hi;									//lsb of bits above sq
	 lo  = BB(bsr64(lo));						//msb of bits below sq
	 fle = (((hi<<1)-lo) & msk.msk_file[sq64]); //connecting bits

Please, what do you mean by _the optimizing compiler is as obfuscated_?

rvida · Post by **rvida** » Fri Jul 24, 2009 3:29 am

I'm a Pascal guy, I don't use C, but let me guess...

If you return ONLY "rnk", the compiler will optimize away the whole "fle" calculation.
If you return ONLY "fle", the compiler will optimize away the whole "rnk" calculation.

If you return the bitwise OR of both, it must include both calculations.

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Jul 24, 2009 8:14 am

Thanks, I understand your code now. Your conditional assignments confused me, and they may also be hard for the optimizing compiler to turn into reasonable code, with the x86 cmov instruction for instance. The semantics of cfle and crnk were also unclear at first.

I would write it this way (and think about a branchless solution).

Code:

// instead of
(lo = LO64(sq64) & (occ|cfle) & msk.msk_file[sq64]) ? lo : lo = BB(sq64);

// write
lo = LO64(sq64) & (occ|cfle) & msk.msk_file[sq64];
if ( lo == 0 ) lo = BB(sq64);

The idea is fine: subtract the most significant occupied bit of the negative (low) ray from twice the least significant occupied bit of the positive (high) ray. But I fear it is too expensive due to four conditional branches per rook or bishop, which are likely hard to predict if used for multiple white and black rooks. Usually, with reasonably small branchless code, one may improve IPC by processing independent work in parallel, so that in the best case
1 + 1 = 1
However, branches hinder parallel scheduling, and they may become much more expensive if they are too complicated to predict in combination. But still,
1 + 1 = 20
seems a bit strange. How do you measure the timing?
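
As a small illustration of the "fall back to the square's own bit" step discussed in this post, here is a hedged C sketch with the explicit `if` form next to one possible branch-free form. The function names and the `blockers` parameter (standing for `(occ|border) & mask`) are assumptions for illustration. Note also that the original one-liner relies on C++ parsing `c ? a : a = b` as `c ? a : (a = b)`; C's grammar does not accept the unparenthesized assignment there, so the explicit form is the portable one. Whether either version compiles to a branch, a cmov, or something else depends on the compiler.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64;

/* Blockers below sq, or the square's own bit when there are none. */
static U64 low_or_self_branch(U64 blockers, int sq) {
    U64 lo = ((1ULL << sq) - 1) & blockers;
    if (lo == 0) lo = 1ULL << sq;
    return lo;
}

/* Same result without a branch in the source: (lo == 0) is 0 or 1. */
static U64 low_or_self_branchless(U64 blockers, int sq) {
    U64 lo = ((1ULL << sq) - 1) & blockers;
    lo |= (U64)(lo == 0) << sq;
    return lo;
}

/* Compares both forms on pseudo-random inputs, including empty ones. */
static int count_select_mismatches(void) {
    int bad = 0;
    U64 x = 0x2545F4914F6CDD1DULL;
    for (int i = 0; i < 64 * 1000; ++i) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;   /* xorshift64 */
        U64 blockers = (i & 1) ? x : 0;            /* exercise the empty case too */
        int sq = i & 63;
        bad += low_or_self_branch(blockers, sq)
            != low_or_self_branchless(blockers, sq);
    }
    return bad;
}
```

Both functions return the same value for every input; only the shape of the generated code may differ.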

Desperado · Post by **Desperado** » Fri Jul 24, 2009 10:36 am

Hi rvida. I think exactly that happens, but it doesn't explain the jump in time.

Hi Gerd. OK, maybe something about the "branching" code causes the problem.
An indicator for that is that I first coded it the way you suggested!
That was worse than using the conditional operator.

Maybe when the operator is used a second time, the compiler cannot optimize
the same way?! I will try that out, and try a solution avoiding any branches.

For measuring the time I just used the code below.
That's it for now, I have to go to work... until later.

Code:

	 srand((unsigned)time(NULL));

	 for(UI_16 i=0;i<1024;i++) {r[i] = rnd64();s[i]=rand()%64;r[i]|=BB(s[i]);}

	 start = clock();
	 for(UI_64 i=0;i<1500000000;i++)
		{
		 bb = rook_trial(r[i&1023],s[i&1023]);
		}
	 end = clock();

	 print_bb(bb);
	 printf("\nTIME: %ld ",(long)(end-start));getchar();

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Jul 24, 2009 10:35 pm

Such a loop test sucks. The compiler likely unrolls the loop, and there are L1 instruction-cache issues. Better use Agner Fog's test programs for measuring clock cycles and performance monitoring.

On the routine. A branchless, untested proposal:

Code:

U64 fileAttacks(U64 occ, enumSquare sq) {
	U64 high = smsk[sq].fileMaskEx & occ & (-2ULL << sq);      // occupied file squares above sq
	U64 low  = smsk[sq].fileMaskEx & occ & ((1ULL << sq) - 1); // occupied file squares below sq
	high  = high & -high;          // ls1b of high (if any)
	low   = -1ULL << bsr64(low|1); // negated ms1b of low (at least bit zero)
	return smsk[sq].fileMaskEx & (2*high+low); // lea
}

Even if this works, simply count the operations compared to bswap Hyperbola Quintessence:

Code:

U64 fileAttacks(U64 occ, enumSquare sq) {
   U64 forward, reverse;
   forward  = occ & smsk[sq].fileMaskEx;
   reverse  = _byteswap_uint64(forward);
   forward -= 1ULL << sq;
   reverse -= _byteswap_uint64(1ULL << sq); 
   forward ^= _byteswap_uint64(reverse);
   forward &= smsk[sq].fileMaskEx;
   return forward;
}

The advantage of the bsr version is that it works for all lines, while the bswap version does not cover ranks, which are cheap by lookup anyway.
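
To check the two file routines above against each other, a self-contained C sketch along these lines could be used. It is a sketch under assumptions: `bsr64_portable` and `bswap64_portable` are portable stand-ins for the `bsr64` and `_byteswap_uint64` intrinsics, `file_mask_ex` computes what the post looks up as `smsk[sq].fileMaskEx`, and the ray-walking `file_attacks_naive` is only a reference for comparison.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64;

/* Index of the most significant set bit (x != 0); stand-in for bsr64. */
static int bsr64_portable(U64 x) {
    int i = 0;
    while (x >>= 1) ++i;
    return i;
}

/* Reverses the byte order; stand-in for _byteswap_uint64. */
static U64 bswap64_portable(U64 x) {
    x = ((x & 0x00FF00FF00FF00FFULL) << 8)  | ((x >> 8)  & 0x00FF00FF00FF00FFULL);
    x = ((x & 0x0000FFFF0000FFFFULL) << 16) | ((x >> 16) & 0x0000FFFF0000FFFFULL);
    return (x << 32) | (x >> 32);
}

/* File through sq without sq itself (fileMaskEx in the post). */
static U64 file_mask_ex(int sq) {
    return (0x0101010101010101ULL << (sq & 7)) ^ (1ULL << sq);
}

/* Branchless proposal: 2 * ls1b(high) - ms1b(low), masked to the file. */
static U64 file_attacks_bsr(U64 occ, int sq) {
    assert(sq >= 0 && sq < 64);
    U64 mask = file_mask_ex(sq);
    U64 high = mask & occ & (~1ULL << sq);
    U64 low  = mask & occ & ((1ULL << sq) - 1);
    high = high & (0 - high);                  /* ls1b of high, or 0       */
    low  = ~0ULL << bsr64_portable(low | 1);   /* negated ms1b of (low|1)  */
    return mask & (2 * high + low);
}

/* Hyperbola Quintessence with a byte swap as the line reversal. */
static U64 file_attacks_bswap(U64 occ, int sq) {
    assert(sq >= 0 && sq < 64);
    U64 mask = file_mask_ex(sq), bit = 1ULL << sq;
    U64 forward = occ & mask;
    U64 reverse = bswap64_portable(forward);
    forward -= bit;
    reverse -= bswap64_portable(bit);
    forward ^= bswap64_portable(reverse);
    return forward & mask;
}

/* Reference: step along the file until a blocker or the board edge. */
static U64 file_attacks_naive(U64 occ, int sq) {
    U64 att = 0;
    for (int s = sq + 8; s < 64; s += 8) { att |= 1ULL << s; if ((occ >> s) & 1) break; }
    for (int s = sq - 8; s >= 0; s -= 8) { att |= 1ULL << s; if ((occ >> s) & 1) break; }
    return att;
}

/* Counts disagreements with the reference on pseudo-random positions. */
static int count_file_mismatches(int rounds) {
    U64 x = 0x9E3779B97F4A7C15ULL;
    int bad = 0;
    for (int i = 0; i < rounds; ++i) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;   /* xorshift64 */
        int sq = (int)(x >> 58);
        U64 occ = x | (1ULL << sq);
        U64 ref = file_attacks_naive(occ, sq);
        bad += file_attacks_bsr(occ, sq) != ref;
        bad += file_attacks_bswap(occ, sq) != ref;
    }
    return bad;
}
```

This only compares results. Operation counts and timings depend on the real intrinsics and the target CPU, which the portable stand-ins do not reflect.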

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Jul 24, 2009 10:59 pm

A smart compiler could even optimize the whole timing loop away and execute only the last iteration, since bb is overwritten on every pass. Better use bb ^= rook_trial(...).

wgarvin · Post by **wgarvin** » Sat Jul 25, 2009 2:29 am

That might not be a good way to test it either, because it introduces a loop-carried dependence where there isn't really a need for one.

My suggestion is to replace "bb" with a volatile global variable and use an ordinary assignment (not the ^= assignment). The compiler will not be able to optimize away the store instruction, but it should be able to optimize everything else. (At least on 64-bit systems... does the compiler generate slower code for a volatile int64 than for a non-volatile one on 32-bit systems? I have no idea.) Another possibility is to invoke rook_trial a bunch of times within each loop iteration (maybe 8 times?) and store each return value in one of 8 different temporaries. Then there is a loop-carried dependence for each temporary, but there is lots of other calculation to do between def and use, so they won't limit the throughput.

[Edit: actually it doesn't read from bb inside the loop other than the read-modify-write of the ^=, so maybe it would be fine that way. If you want to be on the safe side, you could do maybe 2 or 4 calls to rook_trial and use bb1 ^= rook_trial, bb2 ^= rook_trial, etc.]
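
The volatile-sink suggestion can be sketched in C roughly as follows. This is a minimal sketch under assumptions: `routine_under_test` is a hypothetical stand-in (not the rook_trial from this thread), `g_sink`, `fill_inputs` and `run_benchmark` are names made up here, and `clock()` is kept only because the original loop used it; it measures CPU time coarsely and is no substitute for cycle-accurate tools.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

typedef uint64_t U64;

enum { N_INPUTS = 1024 };   /* power of two, so i & (N_INPUTS - 1) wraps */
static U64 occs[N_INPUTS];
static int sqs[N_INPUTS];

/* Every store to a volatile object has to be performed, so the compiler
   cannot simply discard the results written here. */
static volatile U64 g_sink;

/* Hypothetical stand-in for the routine being measured. */
static U64 routine_under_test(U64 occ, int sq) {
    return occ & (0x0101010101010101ULL << (sq & 7));
}

/* Fills the input tables with pseudo-random positions (xorshift64). */
static void fill_inputs(U64 seed) {
    U64 x = seed ? seed : 1;   /* xorshift needs a nonzero state */
    for (int i = 0; i < N_INPUTS; ++i) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;
        sqs[i]  = (int)(x >> 58);
        occs[i] = x | (1ULL << sqs[i]);
    }
}

/* Returns the elapsed CPU time in seconds for `iterations` calls. */
static double run_benchmark(unsigned long iterations) {
    fill_inputs(0x9E3779B97F4A7C15ULL);
    clock_t start = clock();
    for (unsigned long i = 0; i < iterations; ++i)
        g_sink = routine_under_test(occs[i & (N_INPUTS - 1)],
                                    sqs[i & (N_INPUTS - 1)]);
    clock_t end = clock();
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Dividing by `CLOCKS_PER_SEC` gives seconds instead of raw ticks. A plain assignment to the sink avoids the loop-carried dependence that `^=` on a single accumulator would introduce.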

Gerd Isenberg · Post by **Gerd Isenberg** » Sat Jul 25, 2009 7:23 pm

For Intel CPUs with fast bsr but slower bswap, the branchless bsr approach might not be as bad as I initially thought. Alternatively, one can save 5 ops by using upper/lower lookups from the very same cache line, ending up with 9 ops per line (file, rank, diagonal, anti-diagonal) and two independent instruction chains until the final lea/and:

Code:

U64 lineAttacks(U64 occ, enumLine line, enumSquare sq) {
   U64 low, high;
   low  = smsk[sq][line].lower & occ;
   high = smsk[sq][line].upper & occ;
   low  = -C64(1) << bsr64(low|1); // ms1b of low (at least bit zero) -> -low
   high = high & -high;            // ls1b of high (if any)
   return smsk[sq][line].maskEx & (2*high+low); // lea option due to add -low
}
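
The routine above needs a table with `lower`, `upper` and `maskEx` per square and line, which the post does not show. A minimal C sketch of one way to build it, assuming the square mapping a1 = 0 ... h8 = 63 and made-up names (`walk`, `init_line_masks`, `STEP`, `count_line_mismatches`, `bsr64_portable`), could look like this:

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64;

enum { FILE_LINE, RANK_LINE, DIAG_LINE, ANTI_LINE, NUM_LINES };

typedef struct { U64 lower, upper, maskEx; } LineMask;
static LineMask smsk[64][NUM_LINES];

/* (file delta, rank delta) towards higher square indices for each line. */
static const int STEP[NUM_LINES][2] = { {0, 1}, {1, 0}, {1, 1}, {-1, 1} };

/* Index of the most significant set bit (x != 0); stand-in for bsr64. */
static int bsr64_portable(U64 x) {
    int i = 0;
    while (x >>= 1) ++i;
    return i;
}

/* Squares reached from sq by stepping (df, dr); stops after the first
   blocker in occ (blocker included) or at the board edge. */
static U64 walk(U64 occ, int sq, int df, int dr) {
    U64 bb = 0;
    int f = (sq & 7) + df, r = (sq >> 3) + dr;
    while (f >= 0 && f < 8 && r >= 0 && r < 8) {
        U64 b = 1ULL << (r * 8 + f);
        bb |= b;
        if (occ & b) break;
        f += df; r += dr;
    }
    return bb;
}

/* Empty-board rays give the upper and lower halves of each line. */
static void init_line_masks(void) {
    for (int sq = 0; sq < 64; ++sq)
        for (int l = 0; l < NUM_LINES; ++l) {
            smsk[sq][l].upper  = walk(0, sq,  STEP[l][0],  STEP[l][1]);
            smsk[sq][l].lower  = walk(0, sq, -STEP[l][0], -STEP[l][1]);
            smsk[sq][l].maskEx = smsk[sq][l].upper | smsk[sq][l].lower;
        }
}

static U64 line_attacks(U64 occ, int line, int sq) {
    U64 low  = smsk[sq][line].lower & occ;
    U64 high = smsk[sq][line].upper & occ;
    low  = ~0ULL << bsr64_portable(low | 1);  /* negated ms1b of (low|1) */
    high = high & (0 - high);                 /* ls1b of high, or 0      */
    return smsk[sq][line].maskEx & (2 * high + low);
}

/* Compares all four lines with ray walking on pseudo-random positions. */
static int count_line_mismatches(int rounds) {
    U64 x = 0x9E3779B97F4A7C15ULL;
    int bad = 0;
    init_line_masks();
    for (int i = 0; i < rounds; ++i) {
        x ^= x << 13; x ^= x >> 7; x ^= x << 17;   /* xorshift64 */
        int sq = (int)(x >> 58);
        U64 occ = (x & ((x << 29) | (x >> 35))) | (1ULL << sq);
        for (int l = 0; l < NUM_LINES; ++l) {
            U64 ref = walk(occ, sq,  STEP[l][0],  STEP[l][1])
                    | walk(occ, sq, -STEP[l][0], -STEP[l][1]);
            bad += line_attacks(occ, l, sq) != ref;
        }
    }
    return bad;
}
```

Rook attacks are then the OR of the file and rank lines, bishop attacks the OR of the two diagonal lines. The table layout here is an assumption; a real engine would likely pack the three masks per line so that they share a cache line, as the post suggests.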

This is how the computation works, e.g. on the file of a rook on d4:

Code:

occupancy              fileMaskEx(d4)           occupancy d-file
 . . . 1 . 1 . .       . . . 1 . . . .          . . . 1 . . . . 
 1 . 1 . . 1 1 .       . . . 1 . . . .          . . . . . . . . 
 . 1 . 1 . . . 1       . . . 1 . . . .          . . . 1 . . . . 
 . . . . . 1 . .       . . . 1 . . . .          . . . . . . . . 
 . 1 . 1 . . . .   &   . . . . . . . .    =     . . . R . . . .   
 . . . . . . 1 .       . . . 1 . . . .          . . . . . . . . 
 1 1 . 1 . 1 . 1       . . . 1 . . . .          . . . 1 . . . . 
 . . . 1 1 . 1 .       . . . 1 . . . .          . . . 1 . . . . 
                       
occupancy d-file      occupancy d-file    
 . . . 1 . . . .       . . . 1 . . . .   
 . . . . . . . .       . . . . . . . .    
 . . . 1 . . . .       . . . 1 . . . .    
 . . . . . . . .       . . . . . . . .      
 . . . R . . . .       . . . R . . . .        
 . . . . . . . .       . . . . . . . .      
 . . . 1 . . . .       . . . 1 . . . .      
 . . . 1 . . . .       . . . 1 . . . .         
                                                      
high = -2 << d4       low = (1<<d4)-1  
 1 1 1 1 1 1 1 1       . . . . . . . .    
 1 1 1 1 1 1 1 1       . . . . . . . .          
 1 1 1 1 1 1 1 1       . . . . . . . .      
 1 1 1 1 1 1 1 1       . . . . . . . .       
 . . . . 1 1 1 1       1 1 1 . . . . .      
 . . . . . . . .       1 1 1 1 1 1 1 1         
 . . . . . . . .       1 1 1 1 1 1 1 1         
 . . . . . . . .       1 1 1 1 1 1 1 1          
                                             
occupancy high         occupancy low                   
 . . . 1 . . . .       . . . . . . . .     
 . . . . . . . .       . . . . . . . .        
 . . . 1 . . . .       . . . . . . . .          
 . . . . . . . .       . . . . . . . .      
 . . . . . . . .       . . . . . . . .      
 . . . . . . . .       . . . . . . . .       
 . . . . . . . .       . . . 1 . . . .         
 . . . . . . . .       . . . 1 . . . .  
                                           
LS1B high              MS1B low: 1 << bsr(low|1)       
 . . . . . . . .       . . . . . . . .    
 . . . . . . . .       . . . . . . . .      
 . . . 1 . . . .       . . . . . . . .      
 . . . . . . . .       . . . . . . . .     
 . . . . . . . .       . . . . . . . .      
 . . . . . . . .       . . . . . . . .        
 . . . . . . . .       . . . 1 . . . .       
 . . . . . . . .       . . . . . . . .
                                           
2* LS1B high           MS1B low              
 . . . . . . . .       . . . . . . . .         . . . . . . . .     
 . . . . . . . .       . . . . . . . .         . . . . . . . .    
 . . . . 1 . . .       . . . . . . . .         1 1 1 1 . . . .    
 . . . . . . . .       . . . . . . . .         1 1 1 1 1 1 1 1   
 . . . . . . . .   -   . . . . . . . .    =    1 1 1 1 1 1 1 1       
 . . . . . . . .       . . . . . . . .         1 1 1 1 1 1 1 1    
 . . . . . . . .       . . . 1 . . . .         . . . 1 1 1 1 1 
 . . . . . . . .       . . . . . . . .         . . . . . . . .  
          
                          
2*high-low             fileMaskEx              fileAttacks
 . . . . . . . .       . . . 1 . . . .         . . . . . . . . 
 . . . . . . . .       . . . 1 . . . .         . . . . . . . . 
 1 1 1 1 . . . .       . . . 1 . . . .         . . . 1 . . . . 
 1 1 1 1 1 1 1 1       . . . 1 . . . .         . . . 1 . . . . 
 1 1 1 1 1 1 1 1   &   . . . . . . . .    =    . . . . . . . .   
 1 1 1 1 1 1 1 1       . . . 1 . . . .         . . . 1 . . . . 
 . . . 1 1 1 1 1       . . . 1 . . . .         . . . 1 . . . . 
 . . . . . . . .       . . . 1 . . . .         . . . . . . . .
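
The d4 walkthrough above can be replayed in code. The following C sketch types in the occupancy from the first diagram and returns the intermediate values; the helper names (`square`, `parse_board`, `diagram_occupancy`, `file_steps`, `FileSteps`) are assumptions for illustration, and the rows are transcribed by hand from the diagram with rank 8 first and file a on the left.

```c
#include <assert.h>
#include <stdint.h>

typedef uint64_t U64;

typedef struct { U64 ls1b_high, ms1b_low, attacks; } FileSteps;

/* Index of the most significant set bit (x != 0); stand-in for bsr64. */
static int bsr64_portable(U64 x) {
    int i = 0;
    while (x >>= 1) ++i;
    return i;
}

/* Square index from algebraic coordinates, a1 = 0 ... h8 = 63. */
static int square(char file, char rank) {
    return (rank - '1') * 8 + (file - 'a');
}

/* Bitboard from eight 8-character rows, rank 8 first, file a leftmost. */
static U64 parse_board(const char *const rows[8]) {
    U64 bb = 0;
    for (int r = 0; r < 8; ++r)
        for (int f = 0; f < 8; ++f)
            if (rows[r][f] != '.')
                bb |= 1ULL << ((7 - r) * 8 + f);
    return bb;
}

/* Occupancy of the first diagram (the rook on d4 is part of it). */
static U64 diagram_occupancy(void) {
    static const char *const rows[8] = {
        "...1.1..",
        "1.1..11.",
        ".1.1...1",
        ".....1..",
        ".1.1....",
        "......1.",
        "11.1.1.1",
        "...11.1.",
    };
    return parse_board(rows);
}

/* The steps of the diagram: nearest blockers, then 2*high - low, masked. */
static FileSteps file_steps(U64 occ, int sq) {
    FileSteps s;
    U64 mask = (0x0101010101010101ULL << (sq & 7)) ^ (1ULL << sq);
    U64 high = mask & occ & (~1ULL << sq);
    U64 low  = mask & occ & ((1ULL << sq) - 1);
    s.ls1b_high = high & (0 - high);
    s.ms1b_low  = 1ULL << bsr64_portable(low | 1);
    s.attacks   = mask & (2 * s.ls1b_high - s.ms1b_low);
    return s;
}
```

For the diagram position, `file_steps(diagram_occupancy(), square('d', '4'))` yields d6 as the nearest blocker above, d2 as the nearest blocker below, and the attack set d2, d3, d5, d6, which matches the last board of the walkthrough.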
