Leading Zero Count Question

ZirconiumX · Post by **ZirconiumX** » Sun Sep 16, 2012 2:36 pm

The PowerPC cntlzw asm-instruction is 32-bit, so when we scan a 64 bit value >> 32, do we add 32 then XOR by 63 , or do we XOR by 63 then add 32?

To put it another way, would the following code produce the correct result?

Code: Select all

Square msb(Bitboard b) {

  unsigned b32;
  int result = 0;

  asm ("cntlzw %1,%0" : "=r"(result) : "r"(b >> 32));

  if (result != 32)
  {
      return Square((result + 32) ^ 63);
  }

  result = 0;

  asm ("cntlzw %1,%0" : "=r"(result) : "r"(b & 0xFFFFFFFF));

  if (result != 32)
  {
      return Square(result ^ 63);
  }

  return Square(0);
}

Matthew:out

Daniel White · Post by **Daniel White** » Sun Sep 16, 2012 5:50 pm

Note that N^63 == 63-N. Therefore (N+32)^63 == 31-N and (N^63)+32 == 95-I.

However I think the correct code for the first return should be:

Code: Select all

return Square(result ^ 63); // or 63 - result

and the second return:

Code: Select all

return Square(result ^ 31); // or 31 - result

Also it should not be necessary to AND b with 0xFFFFFFFF for the second test - execution can only reach that part if the upper bits of b are not set.

Hopefully somebody will correct me if any of this is wrong...

Gerd Isenberg · Post by **Gerd Isenberg** » Sun Sep 16, 2012 6:19 pm

Daniel White wrote:Note that N^63 == 63-N. Therefore (N+32)^63 == 31-N and (N^63)+32 == 95-I.

However I think the correct code for the first return should be:
Code: Select all
return Square(result ^ 63); // or 63 - result
and the second return:
Code: Select all
return Square(result ^ 31); // or 31 - result
Also it should not be necessary to AND b with 0xFFFFFFFF for the second test - execution can only reach that part if the upper bits of b are not set.

Hopefully somebody will correct me if any of this is wrong...

Correct to get the MSbitindex 63...0. I would put the conditions before the cntlzw.

Code: Select all

Square msb(Bitboard b) {
  uint32_t b32;
  int result = 0;

  b32 = (uint32_t)(b >> 32)
  if ( b32 ) {
     asm ("cntlzw %1,%0" : "=r"(result) : "r"(b32));
     return Square(result ^ 63); // 0 lz -> sq 63; 31 lz -> sq 32
  }
  b32 = (uint32_t)(b)
  if ( b32 ) {
     asm ("cntlzw %1,%0" : "=r"(result) : "r"(b32));
     return Square(result ^ 31); // 32 lz -> sq 31; 63 lz -> sq 0
  }
  return Square(invalid); 
}

ZirconiumX · Post by **ZirconiumX** » Sun Sep 16, 2012 7:13 pm

Gerd Isenberg wrote:
Daniel White wrote:Note that N^63 == 63-N. Therefore (N+32)^63 == 31-N and (N^63)+32 == 95-I.

However I think the correct code for the first return should be:
Code: Select all
return Square(result ^ 63); // or 63 - result
and the second return:
Code: Select all
return Square(result ^ 31); // or 31 - result
Also it should not be necessary to AND b with 0xFFFFFFFF for the second test - execution can only reach that part if the upper bits of b are not set.

Hopefully somebody will correct me if any of this is wrong...
Correct to get the MSbitindex 63...0. I would put the conditions before the cntlzw.
Code: Select all
Square msb(Bitboard b) {
  uint32_t b32;
  int result = 0;

  b32 = (uint32_t)(b >> 32)
  if ( b32 ) {
     result = __cntlzw(b32);
//     asm ("cntlzw %1,%0" : "=r"(result) : "r"(b32));
     return Square(result ^ 63); // 0 lz -> sq 63; 31 lz -> sq 32
  }
  b32 = (uint32_t)(b)
  if ( b32 ) {
     result = __cntlzw(b32);
//     asm ("cntlzw %1,%0" : "=r"(result) : "r"(b32));
     return Square(result ^ 31); // 32 lz -> sq 31; 63 lz -> sq 0
  }
  return Square(invalid); 
}

After switching to intrinsics (I appear to have been writing to a read-only register

) Gerd's code actually slightly regresses in Stockfish (about 2%).

I do wish PPC had a cnttzw instruction, but nope...

Matthew:out

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Sep 17, 2012 7:31 am

ZirconiumX wrote: After switching to intrinsics (I appear to have been writing to a read-only register ) Gerd's code actually slightly regresses in Stockfish (about 2%).

Compared to what other msb-routine?

ZirconiumX · Post by **ZirconiumX** » Mon Sep 17, 2012 8:02 am

Gerd Isenberg wrote:
ZirconiumX wrote: After switching to intrinsics (I appear to have been writing to a read-only register ) Gerd's code actually slightly regresses in Stockfish (about 2%).
Compared to what other msb-routine?

The default Stockfish Msb routine - like Nalimov's, I think.

Note that PowerPC is big endian - I think we may be scanning them in the wrong order.

Matthew:out

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Sep 17, 2012 8:34 am

ZirconiumX wrote:
Gerd Isenberg wrote:
ZirconiumX wrote: After switching to intrinsics (I appear to have been writing to a read-only register ) Gerd's code actually slightly regresses in Stockfish (about 2%).
Compared to what other msb-routine?
The default Stockfish Msb routine - like Nalimov's, I think.

Note that PowerPC is big endian - I think we may be scanning them in the wrong order.

Matthew:out

Ok, something like this with final byte lookup?

Code: Select all

/**
 * bitScanReverse
 * @author Eugene Nalimov
 * @param bb bitboard to scan
 * @return index (0..63) of most significant one bit
 */
int bitScanReverse(U64 bb)
{
   int result = 0;
   if (bb > 0xFFFFFFFF) {
      bb >>= 32;
      result = 32;
   }
   if (bb > 0xFFFF) {
      bb >>= 16;
      result += 16;
   }
   if (bb > 0xFF) {
      bb >>= 8;
      result += 8;
   }
   return result + ms1bTable[bb];
}

Seems cntlzw is not particular fast on PPC, 11 cycles or something. Big endian should not be an issue when extracting bytes/words/dword from qwords by shifting right, rather than pointer/array casting.

Gerd

ZirconiumX · Post by **ZirconiumX** » Mon Sep 17, 2012 9:11 am

Gerd Isenberg wrote:
ZirconiumX wrote:
Gerd Isenberg wrote:
ZirconiumX wrote: After switching to intrinsics (I appear to have been writing to a read-only register ) Gerd's code actually slightly regresses in Stockfish (about 2%).
Compared to what other msb-routine?
The default Stockfish Msb routine - like Nalimov's, I think.

Note that PowerPC is big endian - I think we may be scanning them in the wrong order.

Matthew:out
Ok, something like this with final byte lookup?
Code: Select all
/**
 * bitScanReverse
 * @author Eugene Nalimov
 * @param bb bitboard to scan
 * @return index (0..63) of most significant one bit
 */
int bitScanReverse(U64 bb)
{
   int result = 0;
   if (bb > 0xFFFFFFFF) {
      bb >>= 32;
      result = 32;
   }
   if (bb > 0xFFFF) {
      bb >>= 16;
      result += 16;
   }
   if (bb > 0xFF) {
      bb >>= 8;
      result += 8;
   }
   return result + ms1bTable[bb];
}
Seems cntlzw is not particular fast on PPC, 11 cycles or something. Big endian should not be an issue when extracting bytes/words/dword from qwords by shifting right, rather than pointer/array casting.

Gerd

Yup, like that.

Cntlzw is actually quite fast, achieving around two cycles on a G5. I think the problem is that you are screwing the branch prediction logic - flushing the instruction pipeline isn't cheap on PPC - around 25 cycles for a 24 instruction pipeline.

Matthew:out

Gerd Isenberg · Post by **Gerd Isenberg** » Mon Sep 17, 2012 11:00 am

ZirconiumX wrote:
Gerd Isenberg wrote:
ZirconiumX wrote:
Gerd Isenberg wrote:
ZirconiumX wrote: After switching to intrinsics (I appear to have been writing to a read-only register ) Gerd's code actually slightly regresses in Stockfish (about 2%).
Compared to what other msb-routine?
The default Stockfish Msb routine - like Nalimov's, I think.

Note that PowerPC is big endian - I think we may be scanning them in the wrong order.

Matthew:out
Ok, something like this with final byte lookup?
Code: Select all
/**
 * bitScanReverse
 * @author Eugene Nalimov
 * @param bb bitboard to scan
 * @return index (0..63) of most significant one bit
 */
int bitScanReverse(U64 bb)
{
   int result = 0;
   if (bb > 0xFFFFFFFF) {
      bb >>= 32;
      result = 32;
   }
   if (bb > 0xFFFF) {
      bb >>= 16;
      result += 16;
   }
   if (bb > 0xFF) {
      bb >>= 8;
      result += 8;
   }
   return result + ms1bTable[bb];
}
Seems cntlzw is not particular fast on PPC, 11 cycles or something. Big endian should not be an issue when extracting bytes/words/dword from qwords by shifting right, rather than pointer/array casting.

Gerd
Yup, like that.

Cntlzw is actually quite fast, achieving around two cycles on a G5. I think the problem is that you are screwing the branch prediction logic - flushing the instruction pipeline isn't cheap on PPC - around 25 cycles for a 24 instruction pipeline.

Matthew:out

Sorry for screwing the branch prediction logic

Some more trials ....

Code: Select all

Square msb(Bitboard b) {
  assert (b != 0);
  if ( b > 0xffffffff )
     return Square( __cntlzw((uint32_t)(b >> 32)) ^ 63);
  return Square __cntlzw((uint32_t)b) ^ 31);
}

Branchless, but I fear it is too much to pay off...

Code: Select all

Square msb64(Bitboard b) {
  int h, l, m;
  assert (b != 0);
  h = __cntlzw(b >> 32);
  l = __cntlzw((uint32_t)b);
  m = ~((h - 32) >> 31); // ~0 (-1) if h >= 32, else 0
  return Square( (h+(m&l)) ^ 63);
}

Otherwise I give up.

Cheers,
Gerd

ZirconiumX · Post by **ZirconiumX** » Mon Sep 17, 2012 7:03 pm

Note: This is PowerPC assembler - I hope it isn't too difficult to decipher - if all else fails, consult http://developer.apple.com/library/mac/ ... tions.html.

Stockfish:

Code: Select all

__Z3msby:
	mflr r0
LBB20:
LBB21:
	cmplwi cr7,r3,0
LBE21:
LBE20:
	mr r10,r4
	bcl 20,31,"L00000000003$pb"
"L00000000003$pb":
LBB22:
LBB23:
	li r11,0
LBE23:
LBE22:
	mflr r8
	mtlr r0
LBB24:
LBB25:
	ble cr7,L9
	li r11,32
	mr r10,r3
L9:
	cmplwi cr7,r10,65535
	mr r0,r10
	ble cr7,L10
	srwi r0,r10,16
	addi r11,r11,16
L10:
	cmplwi cr7,r0,255
	ble cr7,L12
	srwi r0,r0,8
	addi r11,r11,8
L12:
	addis r2,r8,ha16(__ZN17_GLOBAL__N_RMasks9MS1BTableE-"L00000000003$pb")
	slwi r0,r0,2
	la r2,lo16(__ZN17_GLOBAL__N_RMasks9MS1BTableE-"L00000000003$pb")(r2)
	lwzx r3,r2,r0
LBE25:
LBE24:
	add r3,r11,r3
	blr

24 instructions :
5 branch instructions,
3 compare instructions,
4 ADD instructions,
3 rotate left instructions,
4 load instructions,
5 move instructions.

Gerd #1:

Code: Select all

__Z3msby:
LBB20:
LBB21:
	cmpwi cr7,r3,0
	beq cr7,L6
	cntlzw r0,r3
	xori r3,r0,63
	blr
L6:
	cmpwi cr7,r4,0
	li r3,0
	cntlzw r0,r4
	xori r3,r0,31
LBE21:
LBE20:
	blr

11 instructions :
2 compare instructions,
4 branch instructions,
1 load instruction,
2 XOR instructions,
2 Leading Zero Count instructions.

Gerd #2 (I modified it to return 0 if it gets a null bitboard - shouldn't be too costly) :

Code: Select all

__Z3msby:
	mr r9,r3
LBB20:
	li r3,0
	or. r0,r9,r4
	beqlr cr0
	cmplwi cr7,r9,0
	ble cr7,L13
	cntlzw r0,r9
	xori r3,r0,63
	blr
L13:
	cntlzw r0,r4
	xori r3,r0,31
LBE20:
	blr

12 instructions :
4 branch instructions,
2 Count Leading Zeroes instructions,
2 XOR instructions,
1 load instruction,
1 compare instruction,
1 OR instruction,
1 move instruction.

Gerd #3 (same modification as above):

Code: Select all

__Z3msby:
	mr r10,r3
LBB20:
LBB21:
	li r3,0
	or. r0,r10,r4
	beqlr cr0
	cntlzw r9,r10
	cntlzw r11,r4
	addi r0,r9,-32
	srawi r0,r0,31
	andc r0,r11,r0
	add r9,r9,r0
	xori r3,r9,63
LBE21:
LBE20:
	blr

12 instructions:
1 move instruction,
1 load instruction,
1 OR instruction,
2 branch instruction,
2 count leading zeroes instructions,
2 ADD instructions,
1 rotate instruction,
1 AND instruction,
1 XOR instruction.

Benchmark results to be posted shortly.

Matthew:out

Leading Zero Count Question

Leading Zero Count Question

Re: Leading Zero Count Question

Re: Leading Zero Count Question

Re: Leading Zero Count Question

Re: Leading Zero Count Question

Re: Leading Zero Count Question

Re: Leading Zero Count Question

Re: Leading Zero Count Question

Re: Leading Zero Count Question

Re: Leading Zero Count Question