32 bit versions for bitscan64

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Ok I have tested with:

Code: Select all

Square pop_1st_bit(Bitboard* bb)
{
  unsigned long id;

  _BitScanForward(&id, (uint32_t)(*bb)) ? id : id += 31 + _BitScanForward(&id, (uint32_t)((*bb) >> 32));
  *bb ^= *bb & -(*bb);
  return Square(id);
}
but no speed increase, perhaps just very slightly, anyhow below 1%

Then I have changed my move serialization loop from

Code: Select all

while (b)
    (*mlist++).move = make_move(from, pop_1st_bit(&b))
to

Code: Select all

while (b)
{
    Square id = first_1(b);
    (*mlist++).move = make_move(from, id);
    b ^= &#40;1ULL << id&#41;;
&#125;
With first_1() defined as:

Code: Select all

Square first_1&#40;Bitboard b&#41; &#123;

  unsigned long id;

  _BitScanForward&#40;&id, &#40;uint32_t&#41;b&#41; ? id &#58; id += 31 + _BitScanForward&#40;&id, &#40;uint32_t&#41;&#40;b >> 32&#41;);

  return Square&#40;id&#41;; 
&#125;
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!! :shock:
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

mcostalba wrote:Ok I have tested with:

Code: Select all

while &#40;b&#41;
&#123;
    Square id = first_1&#40;b&#41;;
    (*mlist++).move = make_move&#40;from, id&#41;;
    b ^= &#40;1ULL << id&#41;;
&#125;
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!! :shock:
i cannot explain, but try b^=bb&-b instead of b^ (1ULL << id)

using this makes a _big_ difference on my machine (btw: keeps the pop independent on the scan, whatever this may be good for...)

cheers
Last edited by Desperado on Sat Aug 22, 2009 2:42 pm, edited 1 time in total.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

OK, I have put the definition of id out of the loop:

Code: Select all

Square id;
while &#40;b&#41;
&#123;
    id = first_1&#40;b&#41;;
    (*mlist++).move = make_move&#40;from, id&#41;;
    b ^= &#40;1ULL << id&#41;;
&#125; 
Now it is better at 438.560 nodes/sec but still -1.7% slower then the combined bitscan + pop
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

perhaps this one...

Code: Select all


while &#40;b&#41; 
&#123; 
    (*mlist++).move = make_move&#40;from, first_1&#40;b&#41;); 
    b ^= b&-b;
&#125; 

good luck :)
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Already under pgo compilation..... :-)
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Ok with this.

Code: Select all

while &#40;b&#41;
&#123;
     (*mlist++).move = make_move&#40;from, first_1&#40;b&#41;);
     b ^= b & (-b&#41;;
&#125;
Now we are at 443.655 nodes/sec, very similar to combined bitscan+pop that is at 446.222 nodes/sec.

My impression is that the current version that I posted at the beginning, does not have 64 bits operations that are very costly on a 32 bit machine, while in this one we have 3 operations more on 64 bits (xor, neg, and):

Code: Select all

 b ^= b & (-b&#41;;
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: 32 bit versions for bitscan64

Post by Gerd Isenberg »

Desperado wrote:
mcostalba wrote:Ok I have tested with:

Code: Select all

while &#40;b&#41;
&#123;
    Square id = first_1&#40;b&#41;;
    (*mlist++).move = make_move&#40;from, id&#41;;
    b ^= &#40;1ULL << id&#41;;
&#125;
And to my surprise I experienced a -11% speed decrease from 446.222 nodes/sec to 401.151 nodes/sec !!!! :shock:
i cannot explain, but try b^=bb&-b instead of b^ (1ULL << id)

using this makes a _big_ difference on my machine (btw: keeps the pop independent on the scan, whatever this may be good for...)

cheers
Sure, 1ULL << id in 32-bit mode is not that cheap, and dependent from the bitscan (also in 64-bit mode). May be the generated assembly is the same, but b &= b - 1 to reset LS1B is still one instruction less...
User avatar
Desperado
Posts: 879
Joined: Mon Dec 15, 2008 11:45 am

Re: 32 bit versions for bitscan64

Post by Desperado »

i cannot get it faster than that 8-) :

Code: Select all


ULONG bsf64_and_pop1&#40;UI_64 &bb&#41;
	&#123;
	 ULONG id;

	 _BitScanForward&#40;&id,&#40;UI_32&#41;bb&#41; ? 
	 id &#58; 
	 id += 31 + _BitScanForward&#40;&id,&#40;UI_32&#41;&#40;bb>>32&#41;);

                 bb &= bb - 1;
	 //bb ^= bb&-bb;

	 return&#40;id&#41;;
	&#125;


So once again Marco, would you be so nice to give me an intel-bench
for MattsFT and the bitscan-only?

i would be very (very very very ) thankful!


thx
Last edited by Desperado on Sat Aug 22, 2009 3:22 pm, edited 1 time in total.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Gerd Isenberg wrote:May be the generated assembly is the same, but b &= b - 1 to reset LS1B is still one instruction less...
It is a bit faster indeed at 445.761 nodes/sec very similar to the reference version.

I would think any 64 bit operation in 32 bit mode is to avoid, perhaps that's the reason the reference version still seems to be the best.

Anyhow using the intrinsic _BitScanForward() instead of the Matt multiply doesn't seem to give a good increase even on an Intel Core 2 Duo, on slower CPU it can go only worse and on faster 64 bit CPU in any case you have another version tuned for 64 bits.
mcostalba
Posts: 2684
Joined: Sat Jun 14, 2008 9:17 pm

Re: 32 bit versions for bitscan64

Post by mcostalba »

Desperado wrote: So once again Marco, would you be so nice to give me an intel-bench
for MattsFT and the bitscan-only?

i would be very (very very very ) thankful!
I think I have alreaady gave you, the separated bitscan + pop.

With the last version we have more or less the same node count of the MattsFT version and the last tested version is separated bitscan.

I cannot test _only_ bitscan without pop because otherwise engine doen't work anymore. What I have done is to _separate_ the two and I think we have verified that the bitscan intrinsic version is of same speed of MattsFT version.

Am I missing something ?