32 bit versions for bitscan64

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

mcostalba wrote:I have tested the below version on my Intel Core 2 Duo but with no results, perhaps a very very slightly increase but well below 1%

Code: Select all

Square pop_1st_bit(Bitboard* bb) {

  uint32_t* ptr = (uint32_t*)bb;
  unsigned long id;

  if (*ptr)
  {
      _BitScanForward(&id, *ptr);
      *ptr ^= &#40;1 << id&#41;;
  &#125;
  else
  &#123;
      ptr += 1;
      _BitScanForward&#40;&id, *ptr&#41;;
      *ptr ^= &#40;1 << id&#41;;
      id += 32;
  &#125;
  return Square&#40;id&#41;;
&#125;

This is the disassembled code (MSVC):

Code: Select all

Square pop_1st_bit&#40;Bitboard* bb&#41; &#123;
0040E760  push        ecx  

  uint32_t* ptr = &#40;uint32_t*&#41;bb;
  unsigned long id;

  if (*ptr&#41;
0040E761  mov         edx,dword ptr &#91;esi&#93;  
0040E763  push        edi  
  &#123;
      _BitScanForward&#40;&id, *ptr&#41;;
      *ptr ^= &#40;1 << id&#41;;
0040E764  mov         edi,1  
0040E769  test        edx,edx  
0040E76B  je          pop_1st_bit+1Bh &#40;40E77Bh&#41;  
0040E76D  bsf         eax,edx  
0040E770  mov         ecx,eax  
0040E772  shl         edi,cl  
0040E774  xor         edi,edx  
0040E776  mov         dword ptr &#91;esi&#93;,edi  
0040E778  pop         edi  
  &#125;
  return Square&#40;id&#41;;
&#125;
0040E779  pop         ecx  
0040E77A  ret  
  &#125;
  else
  &#123;
      ptr += 1;
      _BitScanForward&#40;&id, *ptr&#41;;
0040E77B  mov         edx,dword ptr &#91;esi+4&#93;  
0040E77E  bsf         ecx,edx  
      *ptr ^= &#40;1 << id&#41;;
0040E781  shl         edi,cl  
0040E783  mov         eax,ecx  
0040E785  xor         edi,edx  
0040E787  mov         dword ptr &#91;esi+4&#93;,edi  
      id += 32;
0040E78A  add         eax,20h  
0040E78D  pop         edi  
  &#125;
  return Square&#40;id&#41;;
&#125;
0040E78E  pop         ecx  
0040E78F  ret  
It is a mess to traverse bitboards via pointer or reference aka memory in small traversal loops. Two distinct high-low loops in 32-bit mode may keep all in registers, but that is a mess with respect to keep source code as architecture independent as possible. Another movegen alternative is to hash up to eight bits of a sliding move target line, knight- and king-target set in one run, almost branchless.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

Code: Select all


ULONG bsf64&#40;BTB_T bb&#41;
	&#123;
	 UI_32 *const ptr = &#40;UI_32*)&bb;
	 ULONG id=64;

	 _BitScanForward&#40;&id,*&#40;ptr&#41;) ? id &#58; id += 31 + &#40;bool&#41;_BitScanForward&#40;&id,*&#40;ptr+1&#41;);

	 return&#40;id&#41;;
	&#125;

and its assembly

	push	ecx
	bsf	eax, DWORD PTR _bb$&#91;esp&#93;
	mov	DWORD PTR _id$&#91;esp+4&#93;, 64		; 00000040H
	jne	SHORT $LN4@bsf64
	bsf	eax, DWORD PTR _bb$&#91;esp+4&#93;
	mov	ecx, 0
	setne	cl
	mov	edx, eax
	mov	DWORD PTR _id$&#91;esp+4&#93;, eax
	lea	eax, DWORD PTR &#91;edx+ecx+31&#93;
$LN4@bsf64&#58;
	pop	ecx
	ret	0
Matts Folding Trick

Code: Select all

	 const int lsb_64_table&#91;64&#93; =
		&#123;
		 63, 30,  3, 32, 59, 14, 11, 33,
		 60, 24, 50,  9, 55, 19, 21, 34,
		 61, 29,  2, 53, 51, 23, 41, 18,
		 56, 28,  1, 43, 46, 27,  0, 35,
		 62, 31, 58,  4,  5, 49, 54,  6,
		 15, 52, 12, 40,  7, 42, 45, 16,
		 25, 57, 48, 13, 10, 39,  8, 44,
		 20, 47, 38, 22, 17, 37, 36, 26
		&#125;;

ULONG bsf64_MattTaylor&#40;BTB_T bb&#41;
	&#123;

	 bb ^= &#40;bb- 1&#41;;
     unsigned int folded = &#40;int&#41; bb ^ &#40;bb >> 32&#41;;
   
	 return lsb_64_table&#91;folded * 0x78291ACF >> 26&#93;;
	&#125;

	mov	ecx, DWORD PTR _bb$&#91;esp-4&#93;
	mov	eax, DWORD PTR _bb$&#91;esp&#93;
	push	esi
	mov	esi, eax
	mov	edx, ecx
	add	edx, -1
	adc	esi, -1
	xor	eax, esi
	xor	ecx, edx
	xor	eax, ecx
	imul	eax, 2015959759				; 78291acfH
	shr	eax, 26					; 0000001aH
	mov	eax, DWORD PTR _lsb_64_table&#91;eax*4&#93;
	pop	esi
	ret	0
Hi, just came back and have a look what happend. Without any news
i only want to give you my assembly.Tomorrow i will analyse what s going
on. I am soooooo tired.

til tomorrow...
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

Hello Gerd,
Gerd Isenberg wrote: It is a mess to traverse bitboards via pointer or reference aka memory in small traversal loops. Two distinct high-low loops in 32-bit mode may keep all in registers, but that is a mess with respect to keep source code as architecture independent as possible. Another movegen alternative is to hash up to eight bits of a sliding move target line, knight- and king-target set in one run, almost branchless.
1.

i am a little bit suprised or maybe i dont really see the problem.
What is the matter using a pointer with adding an offset ?
I thought the assembler is doing the same memoryadress + offset, so why
shouldnt the programmer do that explicit?

2.

can it be, that if someone only wants to pop1 (64bit value), that
value64 &= -value64 is the most effectiv way, also in 32 bit mode?

best regards
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

Desperado wrote: 1.

i am a little bit suprised or maybe i dont really see the problem.
What is the matter using a pointer with adding an offset ?
I thought the assembler is doing the same memoryadress + offset, so why
shouldnt the programmer do that explicit?

2.

can it be, that if someone only wants to pop1 (64bit value), that
value64 &= -value64 is the most effectiv way, also in 32 bit mode?

best regards
1. The point is to keep the bitboard to serialize inside a register and to don't read it and write it back each time from/to memory. With small loop bodies, i.e. writing generated moves into a move list, additional memory transfer becomes relative expensive. This is not a big deal, anyway.

2. Yes, I would prefer a loop with value64 &= -value64, which might be done in parallel with the bitscan forward (on core2, i7, not K8) to improve ipc.

Code: Select all

  ; rdx bitboard to serialize
  test rdx, rdx
  jz   over
loop&#58;
  lea  rcx, &#91;rdx-1&#93;
  bsf  rax, rdx
  ...
  and  rdx, rcx ; reset ls1b, sets or resets zero-flag
  ... ; mov, stosq etc
  jnz  loop
over&#58;
Cheers,
Gerd
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

Ok, now a version _WITHOUT_ any pointer tricks.

Code: Select all


//****************************************************************************
//* FUNCTION	&#58; bsf64a													 *
//* DESCRIPTION	&#58; 32 bit version											 *
//****************************************************************************

unsigned long bsf64a&#40;unsigned __int64 bb&#41;
	&#123;
	 unsigned long id;

	 _BitScanForward&#40;&id,&#40;UI_32&#41;bb&#41; ? id &#58; id+=31+_BitScanForward&#40;&id,&#40;UI_32&#41;&#40;bb>>32&#41;);

	 return&#40;id&#41;;
	&#125;


...


	push	ecx
	bsf	eax, DWORD PTR _bb$&#91;esp&#93;
	jne	SHORT $LN4@bsf64a
	bsf	eax, DWORD PTR _bb$&#91;esp+4&#93;
	setne	cl
	movzx	edx, cl
	mov	DWORD PTR _id$&#91;esp+4&#93;, eax
	lea	eax, DWORD PTR &#91;eax+edx+31&#93;
$LN4@bsf64a&#58;
	pop	ecx
	ret	0


This is doing the same performance as the pointer-version. (without _traverse bitboards via pointer_)

Marco (or anybody else), would you be so nice to give me some benchmarks compared to MattsFT ? i am very interested how intel machine is performing.

Additionally i would suggest to do "postoperation" with value64^=value64&-value64 to pop1.

cheers

edit: thx gerd for reply, was preparing my post at the moment...
Last edited by Desperado on Sat Aug 22, 2009 11:41 am, edited 1 time in total.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Gerd Isenberg wrote: It is a mess to traverse bitboards via pointer or reference aka memory in small traversal loops.
I have tried also this version, with less pointers, but it seems no gain on the current one.

Code: Select all

// Use type-punning
union b_union &#123;

    Bitboard b;
    struct &#123;
        uint32_t l;
        uint32_t h;
    &#125; dw;
&#125;;

// WARNING&#58; Needs -fno-strict-aliasing compiler option
Square pop_1st_bit&#40;Bitboard* bb&#41; &#123;

  unsigned long id;
  b_union u;

  u.b = *bb;

  if &#40;u.dw.l&#41;
  &#123;
      _BitScanForward&#40;&id, u.dw.l&#41;;
      *(&#40;uint32_t*&#41;bb&#41; ^= &#40;1 << id&#41;;
  &#125;
  else
  &#123;
      _BitScanForward&#40;&id, u.dw.h&#41;;
      *(&#40;uint32_t*&#41;bb+1&#41; ^= &#40;1 << id&#41;; // Little endian only?
      id += 32;
  &#125;
  return Square&#40;id&#41;;
&#125;

I have an Intel Core 2 Duo so bsf should be fast enough.

Below is the assembly of the above:

Code: Select all

0041E510  push        ecx  
0041E511  mov         eax,dword ptr &#91;edx&#93; 
0041E513  mov         ecx,dword ptr &#91;edx+4&#93; 
0041E516  push        esi  
0041E517  mov         esi,1 
0041E51C  test        eax,eax 
0041E51E  je          pop_1st_bit+20h &#40;41E530h&#41; 
0041E520  bsf         eax,eax 
0041E523  mov         ecx,eax 
0041E525  shl         esi,cl 
0041E527  mov         dword ptr &#91;esp+4&#93;,eax 
0041E52B  xor         dword ptr &#91;edx&#93;,esi 
0041E52D  pop         esi  
0041E52E  pop         ecx  
0041E52F  ret              
0041E530  bsf         ecx,ecx 
0041E533  shl         esi,cl 
0041E535  mov         eax,ecx 
0041E537  mov         dword ptr &#91;esp+4&#93;,ecx 
0041E53B  xor         dword ptr &#91;edx+4&#93;,esi 
0041E53E  add         eax,20h 
0041E541  pop         esi  
0041E542  pop         ecx  
0041E543  ret  
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Desperado wrote:Ok, now a version _WITHOUT_ any pointer tricks.

Code: Select all


//****************************************************************************
//* FUNCTION	&#58; bsf64a													 *
//* DESCRIPTION	&#58; 32 bit version											 *
//****************************************************************************

unsigned long bsf64a&#40;unsigned __int64 bb&#41;
	&#123;
	 unsigned long id;

	 _BitScanForward&#40;&id,&#40;UI_32&#41;bb&#41; ? id &#58; id+=31+_BitScanForward&#40;&id,&#40;UI_32&#41;&#40;bb>>32&#41;);

	 return&#40;id&#41;;
	&#125;
This is doing the same performance as the pointer-version. (without _traverse bitboards via pointer_)

Marco (or anybody else), would you be so nice to give me some benchmarks compared to MattsFT ? i am very interested how intel machine is performing.

Additionally i would suggest to do "postoperation" with value64^=value64&-value64 to pop1.

cheers

edit: thx gerd for reply, was preparing my post at the moment...
But does this version resets the bit ?


I think the two things are different. IMHO you cannot take a fast bitscan only routine insert in an already exsisting pop_1s_bit() and assume you will speed up also the second one.

IMHO to get a fast pop_1s_bit() it is better to _forget_ compeletly about bitscan alone and write from scratch as is the bitscan does not exsist.

I will be very happy to test a pop_1s_bit() that is what is used in chess engines but a bit scan alone IMHO just blurs the problem.
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

ok, here it is.

but i first like to mention that my style seems to be different, because
i most often pop1 in the following way.

Code: Select all

//some code...

while&#40;bitboard&#41;
	&#123;
	 id = bitscan //scan instruction

	 //code

	 bb^=bb&-bb; //pop instruction
	&#125;

//some code...
I think if you only have one operation (without a loop, so to say),
there is no difference, beside i can use the unmodified or the modified bitboard in the _code between_. (There is no additional information gain popping immediatelly,or is there an additional profit ?)

I mean, i can _always_ put the pop-instruction behind (as next instruction)
to the scan-instruction, with the same result.

i would be happy if you can spend some words on that point.

either way, here is the SCAN_AND_POP code.

Code: Select all

unsigned long bsf64_and_pop1&#40;unsigned __int64 &bb&#41;
	&#123;
	 unsigned long id;

	 _BitScanForward&#40;&id,&#40;UI_32&#41;bb&#41; ? id &#58; id+=31+_BitScanForward&#40;&id,&#40;UI_32&#41;&#40;bb>>32&#41;);
	 
	 bb^=bb&-bb;

	 return&#40;id&#41;;
	&#125;
:)
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

Edit to my last post:

So, i am very interest in a comparison between MattFT an bitscan_only
version, because as i descriped above,i do _not_ always pop when i scan!

when i pop1 i _do not_ need _any_ scan before!

(beside i would pop1 with bb^= 1<<id for some reason).

And if it is necessary to scan_and_pop, i just put my instructions together (in a row).

:?
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

edit on edit:

Sorry Marco, the longer i think about what i described i disagree with your opinion. At least for me, and also for practise, it is _absolut relevant_
what the bscan alone is doing.
pop and scan are independent components, which can be simply put together if needed.
if there would be a way, where the combination of both become better(faster,different,profitable in some way) than their basic components itself,
ok then this combination becomes also an _autonomous component_.
(i dont think a scan_and_pop function meets this argument, so seperating
the pop and scan doesnt matter)

with most respect i say this, because i think i am not so experienced.

So if someone else has an opinion on that point, let me know or correct me. I am here to learn sth.

Thx