My program, running on a Core 2 Duo, was sped up by 10% -- not a bad result for a few minutes' work. Although other hardware may not benefit to the same extent, Core 2 is very common. Somewhat to my surprise, even a 32-bit build was improved, but only by 3%.
That only works for the LSB, yet there are lots of occasions where you want the 1 bit on the other end. And for a black/white-neutral program, this is a further problem.
I have no idea why it was a 10% improvement, however. It should be minimal, more like 1% at best -- unless you stick it in a loop and only test it by running repeated calls for a minute or so. In a real program, that is a tiny part of the total work being done.
bob wrote:
That only works for the LSB, yet there are lots of occasions where you want the 1 bit on the other end. And for a black/white-neutral program, this is a further problem.
I have no idea why it was a 10% improvement, however. It should be minimal, more like 1% at best -- unless you stick it in a loop and only test it by running repeated calls for a minute or so. In a real program, that is a tiny part of the total work being done.
Some programs don't care about the scan order. The advantage is to make the ls1b bitscan and the ls1b reset independent of each other. Whether this saves 10% depends on how much time was spent serializing bitboards before, with memory reads dependent on the bsf result. If it was, let's say, 15..20% before, 10% might be reasonable.
; input: rcx = target bitboard to serialize
;        rdi = i.e. move destination
    test rcx, rcx        ; anything to serialize?
    jz   over
l1:
    lea  rdx, [rcx - 1]  ; prepare ls1b reset mask
    bsf  rax, rcx        ; index of ls1b
    ...                  ; eventually "or" from square
    and  rcx, rdx        ; reset ls1b, independent of the bsf result
    ...                  ; mov or stos
    jnz  l1
over:
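The same loop can be sketched in C. This is an illustrative reconstruction, not code from any engine mentioned in the thread; it assumes a GCC/Clang compiler so that `__builtin_ctzll` maps to bsf. The point is the one Gerd makes: the ls1b reset (`b &= b - 1`) does not depend on the bitscan result, so the two dependency chains can overlap.

```c
#include <stdint.h>

/* Serialize a bitboard into a list of square indices.
   Hypothetical sketch: "moves" and "count" stand in for whatever
   per-move work the real engine does at the "..." lines above. */
static void serialize(uint64_t b, int *moves, int *count)
{
    *count = 0;
    while (b) {
        int sq = __builtin_ctzll(b); /* bsf: index of least significant 1 bit */
        b &= b - 1;                  /* reset ls1b, independent of sq */
        moves[(*count)++] = sq;      /* e.g. store move destination */
    }
}
```

Because the reset no longer reads the bsf output, the loop-carried dependency is just the cheap `and`, while the scan and any scan-dependent memory reads run alongside it.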
I know some don't, but most would probably care, for debugging reasons, unless you are one of those that flip the entire board each ply so it is always the same side to move. I want to search the moves in the same order whether I start from black or white in the initial position, which lets me detect asymmetry issues, not to mention improving move ordering and such.
The measured 10% speedup is for actual search, not some contrived test.
My program uses LSB scans only, and rather a lot of them. I incrementally update attack boards with scan-rich code, and my positional scoring does up to 12 scans. Thus circumstances are favourable for the optimization.
Disassembly:
micron wrote: The measured 10% speedup is for actual search, not some contrived test.
My program uses LSB scans only, and rather a lot of them. I incrementally update attack boards with scan-rich code, and my positional scoring does up to 12 scans. Thus circumstances are favourable for the optimization.
Disassembly:
My only comment is that if this is true, you need to seriously re-evaluate how you are doing things. There's absolutely no reason for the LSB/MSB stuff to take 10% of your time, even if you use a loop / shift to find these bits rather than the a & (a-1) trick. If I run on an i7 and use the hardware popcnt instruction, it is difficult to measure any improvement in speed, as I am simply careful where I use them.
10% is _way_ too high. 1% is too high. And if speeding this up improves overall speed by 10%, that means that MSB/LSB stuff is taking _more_ than 10% of total time. There's a free performance boost waiting to be grabbed...
Don't speed 'em up, get rid of 'em completely, if they are taking that much time. Another way of thinking about this is that if you could make everything but MSB/LSB code run in zero time, you would speed the program up by less than a factor of 10... not good.
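For reference, the tricks bob names can be sketched as below. This is a hedged illustration, assuming GCC/Clang builtins; the function names are mine, not from any engine in the thread. With -mpopcnt, `__builtin_popcountll` compiles to the hardware popcnt instruction he mentions.

```c
#include <stdint.h>

/* The "a & a-1" trick: clear the least significant 1 bit. */
static inline uint64_t clear_lsb(uint64_t a) { return a & (a - 1); }

/* Hardware (or fallback) population count. */
static inline int popcount64(uint64_t a) { return __builtin_popcountll(a); }

/* Counting set bits using only the reset trick, with no bitscan at all;
   one iteration per set bit, regardless of where the bits sit. */
static inline int popcount_loop(uint64_t a)
{
    int n = 0;
    while (a) { a = clear_lsb(a); n++; }
    return n;
}
```

The broader point stands either way: if such primitives really account for 10%+ of runtime, restructuring to call them less often buys more than micro-optimizing each call.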