64-bit intrinsic performance

Discussion of chess software programming and technical issues.

Moderator: Ras

nthom
Posts: 112
Joined: Thu Mar 09, 2006 6:15 am
Location: Australia

64-bit intrinsic performance

Post by nthom »

Hi all,

I'm using Visual Studio 2008, and the 64-bit build of LittleThought is far slower than the 32-bit build. The culprit seems to be the intrinsic functions I use in VS. The 32-bit LSB function uses two inline-assembler 32-bit scans, while the 64-bit LSB function has to use an intrinsic because the VS 64-bit compiler does not support inline asm. To my surprise, instead of being faster, the intrinsic is about half the speed of the inline asm:

Code:

__forceinline unsigned long LSB64(BitBoard b) {
	unsigned long pos;
	ASSERT(b);
	_BitScanForward64(&pos,b);
	return pos;
}


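__forceinline unsigned int LSB(BitBoard b) {
	// Assumes b is not zero.
	ASSERT(b);
	// The result is left in eax, which MSVC uses as the return value.
	// Relies on bsf leaving eax unchanged when its source is zero
	// (documented by AMD, formally undefined on Intel).
	__asm
	{
		bsf eax,[b+4]	// scan the high 32 bits
		xor eax,32	// index += 32 (bit 5 is clear after a 32-bit scan)
		bsf eax,[b]	// the low 32 bits take priority when non-zero
	}
}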
Can anyone shed some light on this and give me some suggestions as to what to do?

Cheers,
- Nathan
Karlo Bala
Posts: 373
Joined: Wed Mar 22, 2006 10:17 am
Location: Novi Sad, Serbia
Full name: Karlo Balla

Re: 64-bit intrinsic performance

Post by Karlo Bala »

On AMD (Phenom cores and later, I think) there is a faster instruction named LZCNT. It counts leading zeros, and you can call it through the __lzcnt64 intrinsic. To be honest, I don't understand why your 64-bit version is slower; it should be faster, especially on Intel processors.

P.S.
Even for 32-bit you should always use intrinsics; they are never slower than inline asm.
Best Regards,
Karlo Balla Jr.
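
A note on the LZCNT suggestion: a leading-zero count locates the most significant set bit, so it corresponds to a reverse scan (BSR) rather than the forward scan (BSF) used by an LSB function. The sketch below illustrates that relationship with a portable loop standing in for the hardware instruction; `lzcnt64_portable` and `msb64` are hypothetical names chosen for illustration, not part of any engine in this thread.

```cpp
#include <cassert>
#include <cstdint>

// Portable stand-in for a 64-bit leading-zero count.
// Like the LZCNT instruction, it returns 64 when b is zero.
inline int lzcnt64_portable(std::uint64_t b) {
    int n = 0;
    for (std::uint64_t mask = std::uint64_t(1) << 63;
         mask != 0 && (b & mask) == 0;
         mask >>= 1) {
        ++n;
    }
    return n;
}

// Index of the most significant set bit (what a reverse scan returns).
// b must be non-zero.
inline int msb64(std::uint64_t b) {
    assert(b != 0);
    return 63 - lzcnt64_portable(b);
}
```

With this mapping, `msb64` plays the role of a BSR wrapper; an LSB function would need a trailing-zero count instead.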
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: 64-bit intrinsic performance

Post by rvida »

Hi, here is my bit-scan code from the new Critter. For me the intrinsics work as expected (I use MSVC 2008).


from magic32.h:

Code:

inline Square BSF(Bitboard b) {
  unsigned long index;
  if (b & 0xffffffff) {
    _BitScanForward(&index, b & 0xffffffff); 
    return Square(index);
  }
  else {
    _BitScanForward(&index, b >> 32);
    return Square(index + 32);
  }
}

inline Square BSR(Bitboard b) {
  unsigned long index;
  if (b >> 32) {
    _BitScanReverse(&index, b >> 32); 
    return Square(index + 32);
  }
  else {
    _BitScanReverse(&index, b & 0xffffffff);
    return Square(index);
  }
}
from magic64.h:

Code:

inline Square BSF(Bitboard b) {
  unsigned long index;
  _BitScanForward64(&index, b);
  return Square(index);
}

inline Square BSR(Bitboard b) {
  unsigned long index;
  _BitScanReverse64(&index, b);
  return Square(index);
}
Gian-Carlo Pascutto
Posts: 1260
Joined: Sat Dec 13, 2008 7:00 pm

Re: 64-bit intrinsic performance

Post by Gian-Carlo Pascutto »

Enable the option to use inline intrinsic functions in the optimization settings.
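
For readers following along: in MSVC this setting is, as far as I know, "Enable Intrinsic Functions" under C/C++ > Optimization (`/Oi` on the command line), and `#pragma intrinsic` requests the same thing for a single function. Below is a minimal sketch of a wrapper that uses the intrinsic on 64-bit MSVC and falls back elsewhere. The name `lsb64` and the GCC/Clang and loop fallbacks are assumptions added for illustration, not code from LittleThought.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
#include <intrin.h>
// Ask MSVC to expand the scan inline instead of emitting a function call.
#pragma intrinsic(_BitScanForward64)
#endif

// Hypothetical helper: index of the least significant set bit.
// b must be non-zero.
inline unsigned lsb64(std::uint64_t b) {
    assert(b != 0);
#if defined(_MSC_VER) && (defined(_M_X64) || defined(_M_ARM64))
    unsigned long pos;
    _BitScanForward64(&pos, b);
    return static_cast<unsigned>(pos);
#elif defined(__GNUC__) || defined(__clang__)
    // GCC/Clang builtin: count trailing zeros of a 64-bit value.
    return static_cast<unsigned>(__builtin_ctzll(b));
#else
    // Plain loop for any other compiler.
    unsigned pos = 0;
    while ((b & 1) == 0) { b >>= 1; ++pos; }
    return pos;
#endif
}
```

Checking the generated assembly for a single `bsf` (and no `call`) is a quick way to confirm the intrinsic was actually inlined.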
nthom
Posts: 112
Joined: Thu Mar 09, 2006 6:15 am
Location: Australia

Re: 64-bit intrinsic performance

Post by nthom »

Gian-Carlo Pascutto wrote:Enable the option to use inline intrinsic functions in the optimization settings.
Ahh silly me. That helped, but x64 is still a tad slower.
nthom
Posts: 112
Joined: Thu Mar 09, 2006 6:15 am
Location: Australia

Re: 64-bit intrinsic performance

Post by nthom »

rvida wrote:Hi, here is my code for bit scans from the new Critter, for me it works with intrinsics as expected. (I use MSVC 2008)

from magic64.h:

Code:

inline Square BSF(Bitboard b) {
  unsigned long index;
  _BitScanForward64(&index, b);
  return Square(index);
}

inline Square BSR(Bitboard b) {
  unsigned long index;
  _BitScanReverse64(&index, b);
  return Square(index);
}
I changed my 32-bit asm to intrinsics and it sped up a little, but my 64-bit stuff is still slower. Is yours noticeably faster?

I'm running Vista 64 on a Phenom quad core if that makes a difference.
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 64-bit intrinsic performance

Post by Gerd Isenberg »

nthom wrote:
Gian-Carlo Pascutto wrote:Enable the option to use inline intrinsic functions in the optimization settings.
Ahh silly me. That helped, but x64 is still a tad slower.
Can you post the generated 64-bit assembly of a typical bitboard serialization loop? What is your processor?
rvida
Posts: 481
Joined: Thu Apr 16, 2009 12:00 pm
Location: Slovakia, EU

Re: 64-bit intrinsic performance

Post by rvida »

I measured perft(5) speed on 300 WAC positions; the 64-bit build is 27% faster.
At least on my Athlon64 4200, De Bruijn multiply & lookup is a further 7% faster than that.
On Intel processors a pure BSF may be faster. YMMV

Code:

32bit: Total time: 730547, Total nodes: 22931089169, Avg speed: 31388 knps
64bit(BSF): Total time: 573282, Total nodes: 22931089169, Avg speed: 39999 knps
64bit(lookup): Total time: 536157, Total nodes: 22931089169, Avg speed: 42769 knps
De Bruijn lookup code:

Code:

const U64 debruijn = 0x021a2c94c7eae76fULL;

const int bsf_index[64] ={
    0,  1,  2,  7,  3, 15,  8, 34,
    4, 22, 25, 16, 30,  9, 51, 35,
    5, 13, 23, 28, 26, 43, 17, 45,
   31, 19, 10, 56, 47, 52, 59, 36,
   63,  6, 14, 33, 21, 24, 29, 50,
   12, 27, 42, 44, 18, 55, 46, 58,
   62, 32, 20, 49, 11, 41, 54, 57,
   61, 48, 40, 53, 60, 39, 38, 37,
};

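inline Square BSF(Bitboard b) {
  // b & (0 - b) isolates the lowest set bit; the top 6 bits of the
  // product then select the matching entry in bsf_index.
  return Square(bsf_index[((b & (0 - b)) * debruijn) >> 58]);
}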
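
Why the lookup works: `b & (0 - b)` leaves only the lowest set bit, so multiplying by the constant is the same as shifting the constant left by that bit's index, and for a De Bruijn constant the top six bits of the result are different for each of the 64 possible shifts. The sketch below reuses the constant from the post but rebuilds the table at start-up instead of hard-coding it, then compares against a plain loop. `DeBruijnTable`, `bsf_debruijn` and `bsf_naive` are hypothetical names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

typedef std::uint64_t U64;

// Constant taken from the post above.
const U64 kDebruijn = 0x021a2c94c7eae76fULL;

// Fills the 64-entry table by hashing every single-bit board.
struct DeBruijnTable {
    int index[64];
    DeBruijnTable() {
        for (int sq = 0; sq < 64; ++sq)
            index[((U64(1) << sq) * kDebruijn) >> 58] = sq;
    }
};

// Forward scan via multiply and lookup; b must be non-zero.
inline int bsf_debruijn(U64 b) {
    static const DeBruijnTable table;
    assert(b != 0);
    return table.index[((b & (0 - b)) * kDebruijn) >> 58];
}

// Reference implementation: shift until the lowest bit is set.
inline int bsf_naive(U64 b) {
    assert(b != 0);
    int sq = 0;
    while ((b & 1) == 0) { b >>= 1; ++sq; }
    return sq;
}
```

Whether this beats a hardware `bsf` depends on the processor, as the measurements above suggest, so it is worth timing both on the target machine.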
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: 64-bit intrinsic performance

Post by bob »

nthom wrote:
Gian-Carlo Pascutto wrote:Enable the option to use inline intrinsic functions in the optimization settings.
Ahh silly me. That helped, but x64 is still a tad slower.
Note that on 64-bit processors, pointers are 64 bits wide, twice as long as on 32-bit machines. If your code does a lot of pointer work, that can have a slight negative impact.
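
One way to picture the pointer-size effect: a structure made of pointers doubles in size when pointers grow from 4 to 8 bytes, so fewer of them fit in a cache line. The sketch below is purely illustrative. `PtrNode`, `IdxNode`, `nodes_per_line` and the 64-byte cache line are assumptions, not taken from any engine in this thread.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical pointer-heavy node: its size follows the pointer width.
struct PtrNode {
    PtrNode* next;
    PtrNode* prev;
    PtrNode* parent;
    PtrNode* child;
};

// The same links stored as 32-bit indices into a pool:
// 16 bytes on both 32-bit and 64-bit targets.
struct IdxNode {
    std::uint32_t next, prev, parent, child;
};

// How many nodes fit in one cache line (64 bytes is an assumed line size).
inline std::size_t nodes_per_line(std::size_t node_size,
                                  std::size_t line_size = 64) {
    return line_size / node_size;
}
```

On a typical 64-bit target `nodes_per_line(sizeof(PtrNode))` is half of `nodes_per_line(sizeof(IdxNode))`, which is the kind of overhead that can offset gains elsewhere.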
nthom
Posts: 112
Joined: Thu Mar 09, 2006 6:15 am
Location: Australia

Re: 64-bit intrinsic performance

Post by nthom »

Gerd Isenberg wrote: Can you post the generated 64-bit assembly of a typical bitboard serialization loop? What is your processor?
What do you mean by a serialization loop? Can you post pseudo code?

I'm testing on a Phenom 9550 quad core.
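
For reference, "bitboard serialization" usually means visiting each set bit of a bitboard in turn, for example to turn a target bitboard into a list of squares during move generation. A minimal sketch follows, assuming a plain loop for the bit scan so that it compiles anywhere; `lsb` and `serialize` are illustrative names, and a real engine would use an intrinsic scan and avoid the vector.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

typedef std::uint64_t Bitboard;

// Index of the least significant set bit; b must be non-zero.
inline int lsb(Bitboard b) {
    assert(b != 0);
    int sq = 0;
    while ((b & 1) == 0) { b >>= 1; ++sq; }
    return sq;
}

// Collects the square index of every set bit, lowest first.
inline std::vector<int> serialize(Bitboard b) {
    std::vector<int> squares;
    while (b) {
        squares.push_back(lsb(b));  // find the next square
        b &= b - 1;                 // clear the bit just visited
    }
    return squares;
}
```

The scan and the `b &= b - 1` reset run once per set bit, so the assembly generated for this loop is what shows whether the intrinsic was inlined.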