64-bit intrinsic performance

nthom · Post by **nthom** » Tue Oct 27, 2009 12:35 pm

Hi all,

I'm using Visual Studio 2008, and my 64-bit build of LittleThought is far slower than my 32-bit build. The culprit seems to be the intrinsic functions I use in VS. The 32-bit LSB function uses two 32-bit bit scans in inline assembler, while the 64-bit LSB function has to use an intrinsic because the VS 64-bit compiler does not support inline asm. To my surprise, instead of being faster, the intrinsic runs at about half the speed of the inline asm:

Code: Select all

__forceinline unsigned long LSB64(BitBoard b) {
	unsigned long pos;
	ASSERT(b);
	_BitScanForward64(&pos,b);
	return pos;
}


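__forceinline unsigned int LSB(BitBoard b) {
	// Assumes b is not zero. 32-bit only: the result is returned implicitly in eax.
	ASSERT(b);
	__asm
	{
		bsf eax,[b+4]	// scan the high dword
		xor eax,32	// turn a high-dword result 0..31 into 32..63
		bsf eax,[b]	// scan the low dword; relies on bsf leaving eax unchanged when the source is zero
	}
}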

Can anyone shed some light on this and give me some suggestions as to what to do?

Cheers,
- Nathan

Karlo Bala · Post by **Karlo Bala** » Tue Oct 27, 2009 1:12 pm

On AMD (Phenom core and later, I think) there is a faster instruction named LZCNT. It counts leading zeroes, and you can call it through the __lzcnt64 intrinsic function. To be honest, I don't understand why your 64-bit version is slower; it should be faster, especially on Intel processors.
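One caveat on LZCNT: because it counts leading zeroes, it gives the position of the most significant set bit (what BSR returns), not the least significant bit that an LSB function needs. A minimal sketch of the conversion, assuming x64 MSVC or GCC/Clang; `msb_index` is a made-up name for illustration, not something from LittleThought:

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

// Hypothetical helper: index (0..63) of the most significant set bit.
// Precondition: b != 0.
inline unsigned msb_index(std::uint64_t b) {
    assert(b != 0);
#if defined(_MSC_VER) && defined(_M_X64)
    // __lzcnt64 emits LZCNT, so the CPU has to support that instruction.
    return 63u - static_cast<unsigned>(__lzcnt64(b));
#else
    // GCC/Clang builtin; its result is undefined for b == 0.
    return 63u - static_cast<unsigned>(__builtin_clzll(b));
#endif
}
```

On CPUs without LZCNT support the same opcode runs as a plain BSR and returns the bit index instead of the count, so the subtraction above would give wrong answers there. It is probably worth checking the CPU before relying on it.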

P.S.
Even for 32-bit you should always use intrinsics; they are never slower than inline ASM.

rvida · Post by **rvida** » Tue Oct 27, 2009 2:23 pm

Hi, here is my code for bit scans from the new Critter. For me it works with intrinsics as expected (I use MSVC 2008).

from magic32.h:

Code: Select all

inline Square BSF(Bitboard b) {
  unsigned long index;
  if (b & 0xffffffff) {
    _BitScanForward(&index, b & 0xffffffff); 
    return Square(index);
  }
  else {
    _BitScanForward(&index, b >> 32);
    return Square(index + 32);
  }
}

inline Square BSR(Bitboard b) {
  unsigned long index;
  if (b >> 32) {
    _BitScanReverse(&index, b >> 32); 
    return Square(index + 32);
  }
  else {
    _BitScanReverse(&index, b & 0xffffffff);
    return Square(index);
  }
}

from magic64.h:

Code: Select all

inline Square BSF(Bitboard b) {
  unsigned long index;
  _BitScanForward64(&index, b);
  return Square(index);
}

inline Square BSR(Bitboard b) {
  unsigned long index;
  _BitScanReverse64(&index, b);
  return Square(index);
}

Gian-Carlo Pascutto · Post by **Gian-Carlo Pascutto** » Tue Oct 27, 2009 2:25 pm

Enable the option to use inline intrinsic functions in the optimization settings.
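Assuming the setting meant here is "Enable Intrinsic Functions" under C/C++ > Optimization (the /Oi compiler switch), the same request can also be made in the source with `#pragma intrinsic`. A rough sketch, with a GCC/Clang fallback so it builds elsewhere; `lsb_index` is a hypothetical name:

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
// Ask the compiler to expand the scan inline instead of emitting a call.
#pragma intrinsic(_BitScanForward64)
#endif

// Hypothetical helper: index (0..63) of the least significant set bit.
// Precondition: b != 0.
inline unsigned lsb_index(std::uint64_t b) {
    assert(b != 0);
#if defined(_MSC_VER) && defined(_M_X64)
    unsigned long pos;
    _BitScanForward64(&pos, b);
    return static_cast<unsigned>(pos);
#else
    // GCC/Clang builtin; its result is undefined for b == 0.
    return static_cast<unsigned>(__builtin_ctzll(b));
#endif
}
```

It is also worth making sure the timing is done on a release build, since a debug build keeps the ASSERT and does not inline the wrapper.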

nthom · Post by **nthom** » Wed Oct 28, 2009 1:14 pm

Gian-Carlo Pascutto wrote:Enable the option to use inline intrinsic functions in the optimization settings.

Ahh silly me. That helped, but x64 is still a tad slower.

nthom · Post by **nthom** » Wed Oct 28, 2009 1:16 pm

rvida wrote:Hi, here is my code for bit scans from the new Critter, for me it works with intrinsics as expected. (I use MSVC 2008)

I changed my 32-bit asm to intrinsics and it sped up a little, but my 64-bit stuff is still slower. Is yours noticeably faster?

I'm running Vista 64 on a Phenom quad core if that makes a difference.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Oct 28, 2009 1:34 pm

nthom wrote:
Gian-Carlo Pascutto wrote:Enable the option to use inline intrinsic functions in the optimization settings.
Ahh silly me. That helped, but x64 is still a tad slower.

Can you post the generated 64-bit assembly of a typical bitboard serialization loop? What is your processor?

rvida · Post by **rvida** » Wed Oct 28, 2009 2:38 pm

I measured perft(5) speed on 300 WAC positions. 64-bit is 27% faster.
At least on my Athlon64 4200, De Bruijn multiply & lookup is another 7% faster than that.
On Intel processors pure BSF may be faster. YMMV.

Code: Select all

32bit: Total time: 730547, Total nodes: 22931089169, Avg speed: 31388 knps
64bit(BSF): Total time: 573282, Total nodes: 22931089169, Avg speed: 39999 knps
64bit(lookup): Total time: 536157, Total nodes: 22931089169, Avg speed: 42769 knps

De Bruijn lookup code:

Code: Select all

const U64 debruijn = 0x021a2c94c7eae76fULL;

const int bsf_index[64] ={
    0,  1,  2,  7,  3, 15,  8, 34,
    4, 22, 25, 16, 30,  9, 51, 35,
    5, 13, 23, 28, 26, 43, 17, 45,
   31, 19, 10, 56, 47, 52, 59, 36,
   63,  6, 14, 33, 21, 24, 29, 50,
   12, 27, 42, 44, 18, 55, 46, 58,
   62, 32, 20, 49, 11, 41, 54, 57,
   61, 48, 40, 53, 60, 39, 38, 37,
};

inline Square BSF(Bitboard b) {
  return Square(bsf_index[((b & 0-b) * debruijn) >> 58]);
}
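For anyone wondering why the lookup works: `b & (0 - b)` isolates the lowest set bit, so the multiply is really a left shift of the De Bruijn constant by the bit index, and the top 6 bits of the product are different for each of the 64 possible shifts. Below is a small self-check, as a sketch only. The constant and table are copied from the post above; `bsf_debruijn`, `bsf_naive` and `debruijn_table_ok` are names invented for this example.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t U64;

const U64 debruijn = 0x021a2c94c7eae76fULL;

const int bsf_index[64] = {
    0,  1,  2,  7,  3, 15,  8, 34,
    4, 22, 25, 16, 30,  9, 51, 35,
    5, 13, 23, 28, 26, 43, 17, 45,
   31, 19, 10, 56, 47, 52, 59, 36,
   63,  6, 14, 33, 21, 24, 29, 50,
   12, 27, 42, 44, 18, 55, 46, 58,
   62, 32, 20, 49, 11, 41, 54, 57,
   61, 48, 40, 53, 60, 39, 38, 37,
};

// Same expression as in the post, with the precedence made explicit.
inline int bsf_debruijn(U64 b) {
    assert(b != 0);
    return bsf_index[((b & (0 - b)) * debruijn) >> 58];
}

// Slow reference: shift right until bit 0 is set.
inline int bsf_naive(U64 b) {
    assert(b != 0);
    int i = 0;
    while (!(b & 1)) { b >>= 1; ++i; }
    return i;
}

// Checks every single-bit board, and the same bit with bit 63 also set,
// against the naive reference.
inline bool debruijn_table_ok() {
    for (int i = 0; i < 64; ++i) {
        const U64 one = U64(1) << i;
        if (bsf_debruijn(one) != i) return false;
        const U64 noisy = one | (U64(1) << 63);
        if (bsf_debruijn(noisy) != bsf_naive(noisy)) return false;
    }
    return true;
}
```

Calling `debruijn_table_ok()` once at startup in a debug build is a cheap way to catch a typo in the table.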

bob · Post by **bob** » Wed Oct 28, 2009 6:37 pm

nthom wrote:
Gian-Carlo Pascutto wrote:Enable the option to use inline intrinsic functions in the optimization settings.
Ahh silly me. That helped, but x64 is still a tad slower.

Note that in 64-bit builds, pointers are 64 bits wide. That is twice as long as on 32-bit machines. If your code uses a lot of pointers, the larger memory and cache footprint can have a slight negative impact.
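To make the pointer-size effect concrete, here is a sketch with two made-up node layouts (neither is from LittleThought). The first links by pointer and grows in a 64-bit build; the second links by 32-bit array index and stays the same size in both builds.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical node that links to its neighbours by pointer.
// Typically 12 bytes in a 32-bit build and 24 bytes (with padding) on x64.
struct NodeWithPointers {
    NodeWithPointers* next;
    NodeWithPointers* prev;
    std::uint32_t     key;
};

// Same node, but the links are 32-bit indices into an array of nodes.
// 12 bytes in both builds on common compilers.
struct NodeWithIndices {
    std::uint32_t next;
    std::uint32_t prev;
    std::uint32_t key;
};
```

Printing `sizeof` for your own hot structures in both builds shows quickly whether this matters for your engine.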

nthom · Post by **nthom** » Thu Oct 29, 2009 12:16 pm

Gerd Isenberg wrote: Can you post the generated 64-bit assembly of a typical bitboard serialization loop? What is your processor?

What do you mean by a serialization loop? Can you post pseudo code?

I'm testing on a Phenom 9550 quad core.
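For readers with the same question: "serialization" here usually means the loop that turns a bitboard into a list of square indices, one bit scan per set bit. A minimal sketch of such a loop, assuming x64 MSVC or GCC/Clang; `lsb_index` and `serialize` are illustrative names only:

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER)
#include <intrin.h>
#endif

typedef std::uint64_t BitBoard;

// Index (0..63) of the least significant set bit. Precondition: b != 0.
inline unsigned lsb_index(BitBoard b) {
    assert(b != 0);
#if defined(_MSC_VER) && defined(_M_X64)
    unsigned long pos;
    _BitScanForward64(&pos, b);
    return static_cast<unsigned>(pos);
#else
    return static_cast<unsigned>(__builtin_ctzll(b));
#endif
}

// Writes the square of every set bit in b to out, lowest square first,
// and returns how many squares were written. out needs room for 64 entries.
inline int serialize(BitBoard b, int* out) {
    int n = 0;
    while (b) {
        out[n++] = static_cast<int>(lsb_index(b));
        b &= b - 1;  // clear the lowest set bit
    }
    return n;
}
```

In a move generator the body of the loop would emit a move for each square. The generated assembly for this loop is the useful thing to compare between the 32-bit and 64-bit builds.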
