64-bit intrinsic performance

nthom
Posts: 112
Joined: Thu Mar 09, 2006 5:15 am
Location: Australia

64-bit intrinsic performance

Post by nthom » Tue Oct 27, 2009 11:35 am

Hi all,

I'm using Visual Studio 2008, and the 64-bit build of LittleThought is far slower than the 32-bit build. The culprit seems to be the intrinsic functions I use in VS. The 32-bit LSB function uses two 32-bit bit scans in inline assembler, while the 64-bit LSB function has to use an intrinsic because the VS 64-bit compiler does not support inline asm. To my surprise, instead of being faster, the intrinsic runs at about half the speed of the inline asm:

Code:

__forceinline unsigned long LSB64(BitBoard b) {
	unsigned long pos;
	ASSERT(b);
	_BitScanForward64(&pos,b);
	return pos;
}


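__forceinline unsigned int LSB(BitBoard b) {
	// Assumes b is not zero.
	ASSERT(b);
	// No return statement: the result is left in eax (MSVC x86 inline asm).
	// Relies on bsf leaving eax unchanged when its source is zero, which
	// Intel documents as undefined behaviour for the destination.
	__asm
	{
		bsf eax,[b+4]	// index within the high dword, if any bit is set there
		xor eax,32	// turn it into 32..63
		bsf eax,[b]	// replaced by the low-dword index when the low dword is nonzero
	}
}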
Can anyone shed some light on this and give me some suggestions as to what to do?

Cheers,
- Nathan

Karlo Bala
Posts: 296
Joined: Wed Mar 22, 2006 9:17 am
Location: Novi Sad, Serbia

Re: 64-bit intrinsic performance

Post by Karlo Bala » Tue Oct 27, 2009 12:12 pm

On AMD (I think the Phenom core and later) there is a faster instruction named LZCNT. It counts leading zeroes, and you can call it through the __lzcnt64 intrinsic. To be honest, I don't understand why your 64-bit version is slower; it should be faster, especially on Intel processors.

P.S.
Even for 32-bit you should always use intrinsics; they are never slower than inline ASM.
Best Regards,
Karlo Balla Jr.
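
One caveat on LZCNT, sketched here under a few assumptions: the instruction counts zeros from the top of the word, so on its own it yields the most significant set bit (63 minus the count), not the least significant one. To get an LSB out of it, the lowest set bit has to be isolated first with b & -b. The names clz64, msb64 and lsb64_via_lzcnt are made up for illustration, and the non-MSVC branches are additions so the sketch compiles elsewhere. Per the MSVC documentation, results of __lzcnt64 are unpredictable on CPUs that lack the instruction.

```cpp
#include <cassert>
#include <cstdint>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#endif

// Hypothetical helper: number of leading zero bits in a nonzero 64-bit value.
inline unsigned clz64(std::uint64_t b) {
    assert(b != 0);
#if defined(_MSC_VER) && defined(_M_X64)
    return static_cast<unsigned>(__lzcnt64(b));      // requires a CPU with LZCNT
#elif defined(__GNUC__)
    return static_cast<unsigned>(__builtin_clzll(b));
#else
    unsigned n = 0;                                   // plain fallback loop
    while (!(b & 0x8000000000000000ULL)) { b <<= 1; ++n; }
    return n;
#endif
}

// Index (0..63) of the most significant set bit.
inline unsigned msb64(std::uint64_t b) {
    return 63u - clz64(b);
}

// Index (0..63) of the least significant set bit, via a leading-zero count:
// b & -b keeps only the lowest set bit, whose MSB index is the LSB index of b.
inline unsigned lsb64_via_lzcnt(std::uint64_t b) {
    return msb64(b & (0 - b));
}
```

The extra AND and negate mean this is unlikely to beat a plain forward bit scan for LSB extraction; LZCNT is the more natural fit for MSB.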

rvida
Posts: 481
Joined: Thu Apr 16, 2009 10:00 am
Location: Slovakia, EU

Re: 64-bit intrinsic performance

Post by rvida » Tue Oct 27, 2009 1:23 pm

Hi, here is my bit scan code from the new Critter. For me it works with intrinsics as expected (I use MSVC 2008).


from magic32.h:

Code:

inline Square BSF(Bitboard b) {
  unsigned long index;
  if (b & 0xffffffff) {
    _BitScanForward(&index, b & 0xffffffff); 
    return Square(index);
  }
  else {
    _BitScanForward(&index, b >> 32);
    return Square(index + 32);
  }
}

inline Square BSR(Bitboard b) {
  unsigned long index;
  if (b >> 32) {
    _BitScanReverse(&index, b >> 32); 
    return Square(index + 32);
  }
  else {
    _BitScanReverse(&index, b & 0xffffffff);
    return Square(index);
  }
}
from magic64.h:

Code:

inline Square BSF(Bitboard b) {
  unsigned long index;
  _BitScanForward64(&index, b);
  return Square(index);
}

inline Square BSR(Bitboard b) {
  unsigned long index;
  _BitScanReverse64(&index, b);
  return Square(index);
}

Gian-Carlo Pascutto
Posts: 1063
Joined: Sat Dec 13, 2008 6:00 pm

Re: 64-bit intrinsic performance

Post by Gian-Carlo Pascutto » Tue Oct 27, 2009 1:25 pm

Enable the option to use inline intrinsic functions in the optimization settings.
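
For reference, the setting in question is most likely "Enable Intrinsic Functions" under C/C++ > Optimization in the project properties, which corresponds to the /Oi compiler switch. A per-function alternative is #pragma intrinsic. The sketch below assumes MSVC x64 for the first branch; the GCC/Clang branch and the name lsb64 are additions for illustration only.

```cpp
#include <cassert>
#include <cstdint>

#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#pragma intrinsic(_BitScanForward64)   // request the inline form of this function

// Index (0..63) of the lowest set bit; b must be nonzero.
inline unsigned lsb64(std::uint64_t b) {
    assert(b != 0);
    unsigned long pos;
    _BitScanForward64(&pos, b);
    return static_cast<unsigned>(pos);
}
#else
// Same contract, using the GCC/Clang builtin instead of the MSVC intrinsic.
inline unsigned lsb64(std::uint64_t b) {
    assert(b != 0);
    return static_cast<unsigned>(__builtin_ctzll(b));
}
#endif
```

Note that assert, like the ASSERT macro in the first post, should compile to nothing in release builds, otherwise the check itself ends up in the hot path.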

nthom
Posts: 112
Joined: Thu Mar 09, 2006 5:15 am
Location: Australia

Re: 64-bit intrinsic performance

Post by nthom » Wed Oct 28, 2009 12:14 pm

Gian-Carlo Pascutto wrote:Enable the option to use inline intrinsic functions in the optimization settings.
Ahh silly me. That helped, but x64 is still a tad slower.

nthom
Posts: 112
Joined: Thu Mar 09, 2006 5:15 am
Location: Australia

Re: 64-bit intrinsic performance

Post by nthom » Wed Oct 28, 2009 12:16 pm

rvida wrote:Hi, here is my code for bit scans from the new Critter, for me it works with intrinsics as expected. (I use MSVC 2008)

from magic64.h:

Code:

inline Square BSF(Bitboard b) {
  unsigned long index;
  _BitScanForward64(&index, b);
  return Square(index);
}

inline Square BSR(Bitboard b) {
  unsigned long index;
  _BitScanReverse64(&index, b);
  return Square(index);
}
I changed my 32-bit asm to intrinsics and it sped up a little, but my 64-bit stuff is still slower. Is yours noticeably faster?

I'm running Vista 64 on a Phenom quad core if that makes a difference.

Gerd Isenberg
Posts: 2105
Joined: Wed Mar 08, 2006 7:47 pm
Location: Hattingen, Germany

Re: 64-bit intrinsic performance

Post by Gerd Isenberg » Wed Oct 28, 2009 12:34 pm

nthom wrote:
Gian-Carlo Pascutto wrote:Enable the option to use inline intrinsic functions in the optimization settings.
Ahh silly me. That helped, but x64 is still a tad slower.
Can you post the generated 64-bit assembly of a typical bitboard serialization loop? What is your processor?

rvida
Posts: 481
Joined: Thu Apr 16, 2009 10:00 am
Location: Slovakia, EU

Re: 64-bit intrinsic performance

Post by rvida » Wed Oct 28, 2009 1:38 pm

I measured perft(5) speed on 300 WAC positions. The 64-bit build is 27% faster.
At least on my Athlon64 4200, the de Bruijn multiply & lookup is a further 7% faster than that.
On Intel processors, pure BSF may be faster. YMMV.

Code:

Build            Total time (ms)   Total nodes    Avg speed (knps)
32bit            730547            22931089169    31388
64bit (BSF)      573282            22931089169    39999
64bit (lookup)   536157            22931089169    42769
debruijn lookup code:

Code:

const U64 debruijn = 0x021a2c94c7eae76fULL;

const int bsf_index[64] ={
    0,  1,  2,  7,  3, 15,  8, 34,
    4, 22, 25, 16, 30,  9, 51, 35,
    5, 13, 23, 28, 26, 43, 17, 45,
   31, 19, 10, 56, 47, 52, 59, 36,
   63,  6, 14, 33, 21, 24, 29, 50,
   12, 27, 42, 44, 18, 55, 46, 58,
   62, 32, 20, 49, 11, 41, 54, 57,
   61, 48, 40, 53, 60, 39, 38, 37,
};

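inline Square BSF(Bitboard b) {
  // b must be nonzero. (b & 0-b) isolates the lowest set bit; multiplying by
  // the de Bruijn constant moves a unique 6-bit pattern into the top bits.
  return Square(bsf_index[((b & 0-b) * debruijn) >> 58]);
}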
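
A quick way to check that a table like this matches its constant is to regenerate it: for each bit i, the product (1 << i) * debruijn puts a distinct 6-bit pattern in the top bits, so that pattern can serve as the slot for i. The sketch below does that and is only an illustration; BsfTable, kDebruijn and bsf_lookup are hypothetical names, and std::uint64_t stands in for U64/Bitboard.

```cpp
#include <cassert>
#include <cstdint>

// Same constant as in the post above.
const std::uint64_t kDebruijn = 0x021a2c94c7eae76fULL;

// Hypothetical helper type: fills the 64-entry table from the constant.
struct BsfTable {
    int index[64];
    BsfTable() {
        for (int i = 0; i < 64; ++i) {
            // Multiplying by a single set bit is a left shift by i; the top
            // six bits of the product are the slot that maps back to i.
            index[((std::uint64_t(1) << i) * kDebruijn) >> 58] = i;
        }
    }
};

// Same lookup as BSF() above, but against the regenerated table.
inline int bsf_lookup(const BsfTable& t, std::uint64_t b) {
    assert(b != 0);
    return t.index[((b & (0 - b)) * kDebruijn) >> 58];
}
```

Constructing one BsfTable at startup and comparing its entries with bsf_index is a cheap sanity check after retyping or changing either the constant or the table.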

bob
Posts: 20342
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: 64-bit intrinsic performance

Post by bob » Wed Oct 28, 2009 5:37 pm

nthom wrote:
Gian-Carlo Pascutto wrote:Enable the option to use inline intrinsic functions in the optimization settings.
Ahh silly me. That helped, but x64 is still a tad slower.
Note that in 64-bit code, pointers are 64 bits wide, twice as long as on 32-bit machines. If your code does a lot of pointer work, that can have a slight negative impact, since pointer-heavy data structures grow and take up more cache.
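
To make the pointer-size point concrete, here is a small sketch with two made-up node layouts (PtrNode, IdxNode and pointer_overhead are hypothetical, not taken from any engine in this thread). With typical alignment rules, the pointer-based node is 12 bytes in a 32-bit build and 24 bytes on x64, while the index-based one stays at 12 bytes either way.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical tree node linked by raw pointers: its size follows pointer width.
struct PtrNode {
    PtrNode* parent;
    PtrNode* firstChild;
    std::uint32_t move;
};

// Same links stored as 32-bit indices into a node array: size does not change
// between 32-bit and 64-bit builds.
struct IdxNode {
    std::uint32_t parent;
    std::uint32_t firstChild;
    std::uint32_t move;
};

// Extra bytes per node paid for using pointers (0 on a typical 32-bit build).
inline std::size_t pointer_overhead() {
    return sizeof(PtrNode) - sizeof(IdxNode);
}
```

Whether this matters for a given engine depends on how much of its hot data is pointer-based; a profile is the only reliable way to tell.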

nthom
Posts: 112
Joined: Thu Mar 09, 2006 5:15 am
Location: Australia

Re: 64-bit intrinsic performance

Post by nthom » Thu Oct 29, 2009 11:16 am

Gerd Isenberg wrote: Can you post the generated 64-bit assembly of a typical bitboard serialization loop? What is your processor?
What do you mean by a serialization loop? Can you post pseudo code?

I'm testing on a Phenom 9550 quad core.
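
For readers with the same question: "bitboard serialization" usually means walking the set bits of a bitboard one at a time, for example to turn a bitboard of target squares into a move list. A minimal sketch, assuming a helper lowest_bit built on the compiler's bit scan and a hypothetical function name serialize:

```cpp
#include <cstdint>
#include <vector>
#if defined(_MSC_VER) && defined(_M_X64)
#include <intrin.h>
#endif

// Assumed helper: index (0..63) of the lowest set bit of a nonzero value.
inline int lowest_bit(std::uint64_t b) {
#if defined(_MSC_VER) && defined(_M_X64)
    unsigned long pos;
    _BitScanForward64(&pos, b);
    return static_cast<int>(pos);
#else
    return __builtin_ctzll(b);   // GCC/Clang builtin
#endif
}

// Returns the indices of all set bits, lowest first.
inline std::vector<int> serialize(std::uint64_t targets) {
    std::vector<int> squares;
    while (targets) {
        squares.push_back(lowest_bit(targets));  // one bit scan per set bit
        targets &= targets - 1;                  // clear the lowest set bit
    }
    return squares;
}
```

The bit scan runs once per set bit, so in a loop like this its cost shows up directly, which is presumably why the generated assembly for it is of interest.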
