An easy bitboard optimization

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

An easy bitboard optimization

Post by micron »

A ubiquitous construct in bitboard programming is

Code: Select all

while ( bits )
{
  square = GetLSBit( bits );
  //...do something with square...
  ClearBit( &bits, square );
}
ClearBit() is typically a macro or inline.

Code: Select all

static inline void ClearBit( BitBoard * b, int bitNum )
{
  *b &= gMaskBitOff[bitNum];
}
Following the hint at https://chessprogramming.wikispaces.com ... ialization I replaced ClearBit() in the above construct by ClearLSBit( &bits ).

Code: Select all

static inline void ClearLSBit( BitBoard * b )
{
   *b &= *b - 1;
}
My program, running on Core 2 Duo, was speeded up by 10% -- not a bad result for a few minutes work. Although other hardware may not benefit to the same extent, Core 2 is very common. Somewhat to my surprise, even a 32-bit build was improved, but only by 3%.
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: An easy bitboard optimization

Post by hgm »

Doesn't everyone do something like that? (I.e. clearing with a simple AND instruction?) I thought the standard way of extracting bits was

Code: Select all

while(bits) {
  u64 lsb = bits & -bits;
  square = BitToNr(lsb);
  ...
  bits &= ~lsb;
}
But all my engines are mailbox or (for the new one I am writing) even better, so I could be wrong.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An easy bitboard optimization

Post by bob »

micron wrote:A ubiquitous construct in bitboard programming is

Code: Select all

while ( bits )
{
  square = GetLSBit( bits );
  //...do something with square...
  ClearBit( &bits, square );
}
ClearBit() is typically a macro or inline.

Code: Select all

static inline void ClearBit( BitBoard * b, int bitNum )
{
  *b &= gMaskBitOff[bitNum];
}
Following the hint at https://chessprogramming.wikispaces.com ... ialization I replaced ClearBit() in the above construct by ClearLSBit( &bits ).

Code: Select all

static inline void ClearLSBit( BitBoard * b )
{
   *b &= *b - 1;
}
My program, running on Core 2 Duo, was speeded up by 10% -- not a bad result for a few minutes work. Although other hardware may not benefit to the same extent, Core 2 is very common. Somewhat to my surprise, even a 32-bit build was improved, but only by 3%.
Only works for LSB, yet there are lots of occasions where you want the 1 bit on the other end. And for a black/white neutral program, this is a further problem.

I have no idea why it was a 10% improvement, however. It should be minimal and more like 1% at best. Unless you stick that in a loop and only test it by running repeated calls for a minute or so. In a real program, that is a tiny part of the total work being done.
Gerd Isenberg
Posts: 2250
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: An easy bitboard optimization

Post by Gerd Isenberg »

bob wrote: Only works for LSB, yet there are lots of occasions where you want the 1 bit on the other end. And for a black/white neutral program, this is a further problem.

I have no idea why it was a 10% improvement, however. It should be minimal and more like 1% at best. Unless you stick that in a loop and only test it by running repeated calls for a minute or so. In a real program, that is a tiny part of the total work being done.
Some programs don't care the scan-order. The advantage is to make bitscan ls1b and reset ls1b independently. Whether this safes 10% depends how much time is spend in serializing bitboards before with bsf result dependent memory read. If it was let say 15..20% before, 10% might be reasonable ;-)

Code: Select all

; input rcx target bitboard to serialize
;       rdi i.e. move destination
   text rcx, rcx
   jz   over
l1:
   lea  rdx, [rcx - 1]
   bsf  rax, rcx
   ... ; eventually "or" from square
   and  rcx, rdx
   ... ; mov or stos
   jnz  l1
over:
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An easy bitboard optimization

Post by bob »

Gerd Isenberg wrote:
bob wrote: Only works for LSB, yet there are lots of occasions where you want the 1 bit on the other end. And for a black/white neutral program, this is a further problem.

I have no idea why it was a 10% improvement, however. It should be minimal and more like 1% at best. Unless you stick that in a loop and only test it by running repeated calls for a minute or so. In a real program, that is a tiny part of the total work being done.
Some programs don't care the scan-order. The advantage is to make bitscan ls1b and reset ls1b independently. Whether this safes 10% depends how much time is spend in serializing bitboards before with bsf result dependent memory read. If it was let say 15..20% before, 10% might be reasonable ;-)

Code: Select all

; input rcx target bitboard to serialize
;       rdi i.e. move destination
   text rcx, rcx
   jz   over
l1:
   lea  rdx, [rcx - 1]
   bsf  rax, rcx
   ... ; eventually "or" from square
   and  rcx, rdx
   ... ; mov or stos
   jnz  l1
over:
I know some don't, but most would probably care for debugging issues unless you are one of those that flip the entire board each ply so it is always the same side to move. I want to search the moves in the same order whether I start from black or white on the initial position, which lets me detect asymmetry issues, not to mention improve move ordering and such.
micron
Posts: 155
Joined: Mon Feb 15, 2010 9:33 am
Location: New Zealand

Re: An easy bitboard optimization

Post by micron »

The measured 10% speedup is for actual search, not some contrived test.
My program uses LSB scans only, and rather a lot of of them. I incrementally update attack boards with scan-rich code, and my positional scoring does up to 12 scans. Thus circumstances are favourable for the optimization.
Disassembly:

Code: Select all

  movq  %rax, %rcx
  testq %rax, %rax
  je    L2
L6:
  movq  %rcx, %rdx
  bsfq  %rdx, %rax     
  ;..do something..
  leaq  -1(%rcx), %rax
  andq  %rax, %rcx
  jne   L6
L2:
Robert P.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: An easy bitboard optimization

Post by bob »

micron wrote:The measured 10% speedup is for actual search, not some contrived test.
My program uses LSB scans only, and rather a lot of of them. I incrementally update attack boards with scan-rich code, and my positional scoring does up to 12 scans. Thus circumstances are favourable for the optimization.
Disassembly:

Code: Select all

  movq  %rax, %rcx
  testq %rax, %rax
  je    L2
L6:
  movq  %rcx, %rdx
  bsfq  %rdx, %rax     
  ;..do something..
  leaq  -1(%rcx), %rax
  andq  %rax, %rcx
  jne   L6
L2:
Robert P.
My only comment is that if this is true, you need to seriously re-evaluate how you are doing things. There's absolutely no reason for the LSB/MSB stuff to take 10% of your time, even if you use a loop / shift to find these bits, rather than the a & a-1 trick. If I run on an I7 and use the hardware popcnt instruction, it is difficult to measure any improvement in speed, as I am simply careful where I used them.

10% is _way_ too high. 1% is too high. And if speeding this up improves overall speed by 10%, that means that MSB/LSB stuff is taking _more_ than 10% of total time. There's a free performance boost waiting to be grabbed...

Don't speed 'em up, get rid of 'em completely, if they are taking that much time. Another way of thinking about this is that if you could make everything but MSB/LSB code run in zero time, you would speed the program up by less than a factor of 10... not good.