speed of bitboard operations question

Discussion of chess software programming and technical issues.

Moderator: Ras

Uri Blass
Posts: 10792
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

speed of bitboard operations question

Post by Uri Blass »

Strelka's code includes the following for generating direct checks with captures:

(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC

The idea is to look only at pawns on files c-h, because a pawn on files a-b cannot check the opponent's king in the direction of a leftward capture (b6xa7 can be a direct check, but not in the direction that continues b6-a7, since there is no square beyond the a-file).

It is clear that there are no pawns on rank 1 or rank 8, so you can get the same result with
(Board->mp[WhitePawn]) & 0xFCFCFCFCFCFCFCFC or with
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFC00

The question is whether there is a speed reason to prefer one of them (the first seems simpler if you replace Strelka's magic numbers with named constants).

Uri
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: speed of bitboard operations question

Post by bob »

Uri Blass wrote:Strelka's code includes the following for generating direct checks with captures:

(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC

The idea is to look only at pawns on files c-h, because a pawn on files a-b cannot check the opponent's king in the direction of a leftward capture (b6xa7 can be a direct check, but not in the direction that continues b6-a7, since there is no square beyond the a-file).

It is clear that there are no pawns on rank 1 or rank 8, so you can get the same result with
(Board->mp[WhitePawn]) & 0xFCFCFCFCFCFCFCFC or with
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFC00

The question is whether there is a speed reason to prefer one of them (the first seems simpler if you replace Strelka's magic numbers with named constants).

Uri
Makes absolutely no difference, since there will never be pawns on those ranks. The speed of a 64-bit operation (AND in this case) does not depend on how many bits are set...
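To make the equivalence concrete, here is a sketch with named constants (the names and helper functions are mine, not Strelka's, assuming the a1 = bit 0, b1 = bit 1 square numbering Uri describes):

Code: Select all

#include <stdint.h>

/* Illustrative names -- not from Strelka's source. */
#define FILES_C_H      0xFCFCFCFCFCFCFCFCULL  /* files c-h, all ranks       */
#define NOT_RANK_8     0x00FFFFFFFFFFFFFFULL  /* everything but rank 8      */
#define NOT_RANKS_1_8  0x00FFFFFFFFFFFF00ULL  /* everything but ranks 1 & 8 */

/* For any legal pawn bitboard wp (no pawns on rank 1 or rank 8), all
   three masks select exactly the same pawns: */
uint64_t mask_a(uint64_t wp) { return wp & (FILES_C_H & NOT_RANK_8);    } /* 0x00FCFCFCFCFCFCFC */
uint64_t mask_b(uint64_t wp) { return wp & FILES_C_H;                   } /* 0xFCFCFCFCFCFCFCFC */
uint64_t mask_c(uint64_t wp) { return wp & (FILES_C_H & NOT_RANKS_1_8); } /* 0x00FCFCFCFCFCFC00 */
Each all-constant subexpression is evaluated at compile time, so every variant compiles to a single AND.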
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: speed of bitboard operations question

Post by wgarvin »

(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC
Compilers do something called "constant folding", where an expression involving only constants is evaluated at compile time and replaced with the resulting constant. It's pretty useful, because the generated code is identical to what you get if you write the constant yourself in the source code--you can write it the more natural way and it costs exactly nothing at runtime. (For example, think of all the macros you've ever seen that do math on other constants or macros...)
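For example (a minimal sketch with mask names of my own choosing, again assuming a1 = bit 0), the parenthesized expression below involves only constants, so it folds to the single literal 0x00FCFCFCFCFCFCFC:

Code: Select all

#include <stdint.h>

#define RANK_8  0xFF00000000000000ULL
#define FILE_A  0x0101010101010101ULL
#define FILE_B  0x0202020202020202ULL

uint64_t check_candidates(uint64_t white_pawns) {
    /* All-constant subexpression: the compiler folds it to
       0x00FCFCFCFCFCFCFC and emits a single AND at runtime. */
    return white_pawns & (~(FILE_A | FILE_B) & ~RANK_8);
}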

Now suppose you wanted to decompile some compiled code. For the parts that have to *do something* at runtime, code was generated, and you can carefully examine that code and try to untangle exactly what it does. But for a folded constant expression, nothing was generated except the final value, which is why Strelka contains so many odd-looking constants like this one. Whoever did the reverse-engineering could try to break them back down into sensible expressions (which the compiler would just fold up again when compiling Strelka), but why bother?
Uri Blass
Posts: 10792
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: speed of bitboard operations question

Post by Uri Blass »

wgarvin wrote:
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC
Compilers do something called "constant folding", where an expression involving only constants is evaluated at compile time and replaced with the resulting constant. It's pretty useful, because the generated code is identical to what you get if you write the constant yourself in the source code--you can write it the more natural way and it costs exactly nothing at runtime. (For example, think of all the macros you've ever seen that do math on other constants or macros...)

Now suppose you wanted to decompile some compiled code. For the parts that have to *do something* at runtime, code was generated, and you can carefully examine that code and try to untangle exactly what it does. But for a folded constant expression, nothing was generated except the final value, which is why Strelka contains so many odd-looking constants like this one. Whoever did the reverse-engineering could try to break them back down into sensible expressions (which the compiler would just fold up again when compiling Strelka), but why bother?
1) You are right if the expression involves only constants, but here the expression includes non-constants: the pawns of one side are not a constant, so if the compiler does not know that pawns cannot be on rank 1 or rank 8, it may generate different code for different constants that lead to the same result.

I understood from Bob that the choice of constant is not important, and that WhitepawnsBB&Ranks12BB is the same as WhitepawnsBB&Ranks2BB, but the only-constants situation does not apply here.

2) I think that breaking the constants back into sensible expressions is important, because later I may translate the same sensible expressions into different expressions (I do not plan to use A1<B1<C1<...<H1<A2 but A1<A2<A3<...<A8<B1...).

I may want to use constants with the same meaning, so translating the constant numbers into names can help me not only to understand Strelka better but also to write my own program later (after I have a list of constants for Strelka, I may write a function to translate these constants to the board representation that I plan to use).
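A sketch of such a translation function (my own illustration, not code from Strelka), remapping a bitboard from the rank-major numbering to a file-major one:

Code: Select all

#include <stdint.h>

/* Remap from rank-major numbering (A1=0, B1=1, ..., H1=7, A2=8, ...)
   to file-major numbering (A1=0, A2=1, ..., A8=7, B1=8, ...). */
uint64_t rank_major_to_file_major(uint64_t b) {
    uint64_t out = 0;
    for (int sq = 0; sq < 64; sq++) {
        if (b & (1ULL << sq)) {
            int file = sq & 7;                 /* file index, rank-major */
            int rank = sq >> 3;                /* rank index             */
            out |= 1ULL << (file * 8 + rank);  /* file-major square      */
        }
    }
    return out;
}
Applied to each named constant, the same function translates Strelka's masks into the other board representation.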

Uri
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: speed of bitboard operations question

Post by wgarvin »

Uri Blass wrote:1) You are right if the expression involves only constants, but here the expression includes non-constants: the pawns of one side are not a constant, so if the compiler does not know that pawns cannot be on rank 1 or rank 8, it may generate different code for different constants that lead to the same result.

I understood from Bob that the choice of constant is not important, and that WhitepawnsBB&Ranks12BB is the same as WhitepawnsBB&Ranks2BB, but the only-constants situation does not apply here.
Sorry, even though I quoted the whole line, when I was talking about constant folding I was only trying to explain where the constant 0x00FCFCFCFCFCFCFC might have come from.

In other words, if that line of code came from reverse-engineering some compiled code, then the left part of it would be a bunch of instructions, and the person doing the reverse-engineering could translate them into something sensible:
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC
But the 0x00FCFCFCFCFCFCFC would appear in compiled code as a raw number--the constant folding erases the trail of where it came from.

I guess in this case it was ((~Rank1BB & ~Rank2BB) >> 8) or something, but maybe the expression (or part of it) was wrapped in a macro that was re-used from elsewhere in the program; that might explain why the shift is there. Who knows---whatever the original expression was, it got folded into 0x00FCFCFCFCFCFCFC, and whoever did the reverse-engineering did not try to unfold it again.

If I am going on about something that is totally obvious to you already, then I apologize. :D
Uri Blass wrote:2) I think that breaking the constants back into sensible expressions is important, because later I may translate the same sensible expressions into different expressions (I do not plan to use A1<B1<C1<...<H1<A2 but A1<A2<A3<...<A8<B1...).

I may want to use constants with the same meaning, so translating the constant numbers into names can help me not only to understand Strelka better but also to write my own program later (after I have a list of constants for Strelka, I may write a function to translate these constants to the board representation that I plan to use).

Uri
Sure. In some cases it will be obvious what to replace them with. There's probably a few mysterious ones in there though. :wink:
Uri Blass
Posts: 10792
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: speed of bitboard operations question

Post by Uri Blass »

bob wrote:
Uri Blass wrote:Strelka's code includes the following for generating direct checks with captures:

(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC

The idea is to look only at pawns on files c-h, because a pawn on files a-b cannot check the opponent's king in the direction of a leftward capture (b6xa7 can be a direct check, but not in the direction that continues b6-a7, since there is no square beyond the a-file).

It is clear that there are no pawns on rank 1 or rank 8, so you can get the same result with
(Board->mp[WhitePawn]) & 0xFCFCFCFCFCFCFCFC or with
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFC00

The question is whether there is a speed reason to prefer one of them (the first seems simpler if you replace Strelka's magic numbers with named constants).

Uri
Makes absolutely no difference, since there will never be pawns on those ranks. The speed of a 64-bit operation (AND in this case) does not depend on how many bits are set...
Thanks

Here is another question.

I see code of the type
a=(a1&b1&c)|(a2&b2&c) where all the variables are 64-bit numbers.

I can replace it with
a=c&((a1&b1)|(a2&b2))

I guess the speed is the same because the compiler is smart enough to optimize both into the same code, but I wonder, in case it is not smart enough, which is faster.

I think the second is faster because the first expression has 4 & operations and 1 | operation, while the second has only 3 & operations and 1 | operation.

Uri
Gerd Isenberg
Posts: 2251
Joined: Wed Mar 08, 2006 8:47 pm
Location: Hattingen, Germany

Re: speed of bitboard operations question

Post by Gerd Isenberg »

Uri Blass wrote: Here is another question.

I see code of the type
a=(a1&b1&c)|(a2&b2&c) where all the variables are 64-bit numbers.

I can replace it with
a=c&((a1&b1)|(a2&b2))

I guess the speed is the same because the compiler is smart enough to optimize both into the same code, but I wonder, in case it is not smart enough, which is faster.

I think the second is faster because the first expression has 4 & operations and 1 | operation, while the second has only 3 & operations and 1 | operation.

Uri
Yes, conjunction is distributive over disjunction and vice versa.
I wonder why I put all that stuff in the wiki ;-)
http://chessprogramming.wikispaces.com/ ... erations17
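A quick self-check of the identity (a standalone sketch, not engine code):

Code: Select all

#include <assert.h>
#include <stdint.h>

int main(void) {
    /* arbitrary 64-bit test values */
    uint64_t a1 = 0x00FF00FF00FF00FFULL, b1 = 0x0F0F0F0F0F0F0F0FULL;
    uint64_t a2 = 0x3333333333333333ULL, b2 = 0x5555555555555555ULL;
    uint64_t c  = 0x00FCFCFCFCFCFCFCULL;

    /* AND distributes over OR, so factoring c out is an identity: */
    assert(((a1 & b1 & c) | (a2 & b2 & c)) == (c & ((a1 & b1) | (a2 & b2))));
    return 0;
}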

Gerd
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: speed of bitboard operations question

Post by bob »

Uri Blass wrote:
bob wrote:
Uri Blass wrote:Strelka's code includes the following for generating direct checks with captures:

(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC

The idea is to look only at pawns on files c-h, because a pawn on files a-b cannot check the opponent's king in the direction of a leftward capture (b6xa7 can be a direct check, but not in the direction that continues b6-a7, since there is no square beyond the a-file).

It is clear that there are no pawns on rank 1 or rank 8, so you can get the same result with
(Board->mp[WhitePawn]) & 0xFCFCFCFCFCFCFCFC or with
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFC00

The question is whether there is a speed reason to prefer one of them (the first seems simpler if you replace Strelka's magic numbers with named constants).

Uri
Makes absolutely no difference, since there will never be pawns on those ranks. The speed of a 64-bit operation (AND in this case) does not depend on how many bits are set...
Thanks

Here is another question.

I see code of the type
a=(a1&b1&c)|(a2&b2&c) where all the variables are 64-bit numbers.

I can replace it with
a=c&((a1&b1)|(a2&b2))

I guess the speed is the same because the compiler is smart enough to optimize both into the same code, but I wonder, in case it is not smart enough, which is faster.

I think the second is faster because the first expression has 4 & operations and 1 | operation, while the second has only 3 & operations and 1 | operation.

Uri
The compiler will most likely do it the fastest way. But the general rule I use is: the fewer operations required, the faster it will run. The problem is that it is not always easy to determine the number of operations required, because compilers are pretty clever and end up doing what you meant, not what you wrote, to make things faster.

This means you have some flexibility to write the code so that it is easier to read. So long as you understand how the optimizers work, your easier-to-read code can be optimized into something just as fast as the less clear code you could have written to accomplish the same thing in an apparently faster way.
wgarvin
Posts: 838
Joined: Thu Jul 05, 2007 5:03 pm
Location: British Columbia, Canada

Re: speed of bitboard operations question

Post by wgarvin »

bob wrote:The compiler will most likely do it the fastest way. But the general rule I use is: the fewer operations required, the faster it will run. The problem is that it is not always easy to determine the number of operations required, because compilers are pretty clever and end up doing what you meant, not what you wrote, to make things faster.

This means you have some flexibility to write the code so that it is easier to read. So long as you understand how the optimizers work, your easier-to-read code can be optimized into something just as fast as the less clear code you could have written to accomplish the same thing in an apparently faster way.
That is a good rule of thumb, and most of the time it's all you need.

Another thing I try to keep in mind is the dependencies between expressions. If you write two expressions that the compiler can evaluate separately and then combine, that will likely be faster than a series of operations it has to perform one at a time because each depends on the result of the previous one (a "dependence chain", which is bad for performance if it gets too long).

A fake example: if you write:

Code: Select all

U64 x = (((((a>>2)|b)>>3)|c)>>5)|d;
Then the generated code might end up being equivalent to this:

Code: Select all

U64 t = (a>>2);
t |= b;
t >>= 3;
t |= c;
t >>= 5;
t |= d;  // one dependence chain of 6 instructions, where each depends on the result of the previous one
Whereas if you had approached it differently, you might have written:

Code: Select all

U64 x = (a>>10)|(b>>8)|(c>>5)|d;
And in that case you would probably get better generated code, because the compiler can evaluate subexpressions like (a>>10) and (c>>5) in parallel (separate dependence chains) and combine the results. I.e. it might do something like:

Code: Select all

// these first 3 expressions don't depend on each other
U64 t1 = (a>>10);
U64 t2 = (b>>8);
U64 t3 = (c>>5);
// the next two instructions don't depend on each other
t1 |= t2;
t3 |= d;
// one last instruction to combine the two halves together
t1 |= t3;
Compilers will try to arrange the generated code to space out an instruction that produces a value and a later instruction that needs to use that value ("latency hiding"). They can do this most easily when you have lots of short, independent calculations. When you have one long dependence chain, you create a problem for the compiler, because it might not have enough "other stuff" to mix in with those instructions to hide the latency. Out-of-order execution on x86-32 and x86-64 processors helps a lot here, but short dependence chains are still better.

[Edit: for x86, one reason short dependence chains are better is that if the whole chain is only 3 or 4 instructions long, the compiler doesn't even have to space them out; all of them can be decoded and simply wait their turn to execute in the out-of-order core while other short chains are also being decoded and executed. If you have a really long chain, or a chain with long-latency instructions like divides in it, then it becomes a bottleneck, tying up processor resources or just taking a long time to execute, during which the processor doesn't have enough other work to keep it busy.]

By the way, I wouldn't bother trying to rewrite your code to improve the generated code unless you have already made sure (using a profiler) that the program really does spend a lot of its time there. If you make a small function 30% faster but less than 1% of the total runtime was spent in that function anyway, then your effort was basically wasted. Use the profiler to find the inner loop or key function(s) where you're spending, say, 8% or 15% of the total time, and improve those...
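To put numbers on that (a quick Amdahl's-law check): if a function accounts for 1% of the runtime and you make it 30% faster, the new total is 0.99 + 0.01/1.3 ≈ 0.9977 of the old one--an overall gain of about 0.2%, or roughly 8 seconds off a one-hour run.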
Dann Corbit
Posts: 12777
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: speed of bitboard operations question

Post by Dann Corbit »

bob wrote:
Uri Blass wrote:
bob wrote:
Uri Blass wrote:Strelka's code includes the following for generating direct checks with captures:

(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC

The idea is to look only at pawns on files c-h, because a pawn on files a-b cannot check the opponent's king in the direction of a leftward capture (b6xa7 can be a direct check, but not in the direction that continues b6-a7, since there is no square beyond the a-file).

It is clear that there are no pawns on rank 1 or rank 8, so you can get the same result with
(Board->mp[WhitePawn]) & 0xFCFCFCFCFCFCFCFC or with
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFC00

The question is whether there is a speed reason to prefer one of them (the first seems simpler if you replace Strelka's magic numbers with named constants).

Uri
Makes absolutely no difference, since there will never be pawns on those ranks. The speed of a 64-bit operation (AND in this case) does not depend on how many bits are set...
Thanks

Here is another question.

I see code of the type
a=(a1&b1&c)|(a2&b2&c) where all the variables are 64-bit numbers.

I can replace it with
a=c&((a1&b1)|(a2&b2))

I guess the speed is the same because the compiler is smart enough to optimize both into the same code, but I wonder, in case it is not smart enough, which is faster.

I think the second is faster because the first expression has 4 & operations and 1 | operation, while the second has only 3 & operations and 1 | operation.

Uri
The compiler will most likely do it the fastest way. But the general rule I use is: the fewer operations required, the faster it will run. The problem is that it is not always easy to determine the number of operations required, because compilers are pretty clever and end up doing what you meant, not what you wrote, to make things faster.

This means you have some flexibility to write the code so that it is easier to read. So long as you understand how the optimizers work, your easier-to-read code can be optimized into something just as fast as the less clear code you could have written to accomplish the same thing in an apparently faster way.
My compiler generates identical assembly for both constructs:

Code: Select all

#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long bb;

bb              t0(bb a1, bb b1, bb a2, bb b2, bb c)
{
    bb              a = (a1 & b1 & c) | (a2 & b2 & c);
    return a;
}
/*
; generated assembly:
PUBLIC	_t0
; Function compile flags: /Ogtpy
_TEXT	SEGMENT
_a1$ = 8						; size = 8
_b1$ = 16						; size = 8
_a2$ = 24						; size = 8
_b2$ = 32						; size = 8
_c$ = 40						; size = 8
_t0	PROC

; 5    :     bb              a = (a1 & b1 & c) | (a2 & b2 & c);

  00050	8b 44 24 04	 mov	 eax, DWORD PTR _a1$[esp-4]
  00054	8b 54 24 08	 mov	 edx, DWORD PTR _a1$[esp]
  00058	8b 4c 24 14	 mov	 ecx, DWORD PTR _a2$[esp-4]
  0005c	23 44 24 0c	 and	 eax, DWORD PTR _b1$[esp-4]
  00060	23 54 24 10	 and	 edx, DWORD PTR _b1$[esp]
  00064	23 4c 24 1c	 and	 ecx, DWORD PTR _b2$[esp-4]
  00068	56		 push	 esi
  00069	8b 74 24 1c	 mov	 esi, DWORD PTR _a2$[esp+4]
  0006d	23 74 24 24	 and	 esi, DWORD PTR _b2$[esp+4]
  00071	0b c1		 or	 eax, ecx
  00073	23 44 24 28	 and	 eax, DWORD PTR _c$[esp]
  00077	0b d6		 or	 edx, esi
  00079	23 54 24 2c	 and	 edx, DWORD PTR _c$[esp+4]

; 6    :     return a;
; 7    : }

  0007d	5e		 pop	 esi
  0007e	c3		 ret	 0
_t0	ENDP
*/
bb              t1(bb a1, bb b1, bb a2, bb b2, bb c)
{
    bb              a = c & ((a1 & b1) | (a2 & b2));
    return a;
}
/*
PUBLIC	_t1
; Function compile flags: /Ogtpy
_TEXT	SEGMENT
_a1$ = 8						; size = 8
_b1$ = 16						; size = 8
_a2$ = 24						; size = 8
_b2$ = 32						; size = 8
_c$ = 40						; size = 8
_t1	PROC

; 11   :     bb              a = c & ((a1 & b1) | (a2 & b2));

  00020	8b 44 24 04	 mov	 eax, DWORD PTR _a1$[esp-4]
  00024	8b 54 24 08	 mov	 edx, DWORD PTR _a1$[esp]
  00028	8b 4c 24 14	 mov	 ecx, DWORD PTR _a2$[esp-4]
  0002c	23 44 24 0c	 and	 eax, DWORD PTR _b1$[esp-4]
  00030	23 54 24 10	 and	 edx, DWORD PTR _b1$[esp]
  00034	23 4c 24 1c	 and	 ecx, DWORD PTR _b2$[esp-4]
  00038	56		 push	 esi
  00039	8b 74 24 1c	 mov	 esi, DWORD PTR _a2$[esp+4]
  0003d	23 74 24 24	 and	 esi, DWORD PTR _b2$[esp+4]
  00041	0b c1		 or	 eax, ecx
  00043	23 44 24 28	 and	 eax, DWORD PTR _c$[esp]
  00047	0b d6		 or	 edx, esi
  00049	23 54 24 2c	 and	 edx, DWORD PTR _c$[esp+4]

; 12   :     return a;
; 13   : }

  0004d	5e		 pop	 esi
  0004e	c3		 ret	 0
_t1	ENDP
_TEXT	ENDS
*/
bb              rbb(void)
{
    /* quick-and-dirty 64-bit pseudorandom value from two rand() calls */
    bb              a1 = rand();
    a1 <<= 31;
    a1 |= rand();
    return a1;
}
int             main(void)
{
    bb              a,
                    a1,
                    b1,
                    a2,
                    b2,
                    c;
    a1 = rbb();
    b1 = rbb();
    a2 = rbb();
    b2 = rbb();
    c = rbb();
    a = t0(a1, b1, a2, b2, c);
    printf("%llu\n ", a);
    a1 = rbb();
    b1 = rbb();
    a2 = rbb();
    b2 = rbb();
    c = rbb();
    a = t1(a1, b1, a2, b2, c);
    printf("%llu\n ", a);
    return 0;
}
{General advice, not targeted at Dr. Hyatt, who already knows this well}:
The first rule of optimization is:
1. DON'T DO IT.
And the second rule (for experts only) is:
2. DON'T DO IT YET.

The general idea is that we should write code that is as clear and expressive as possible. Later on, when we want to make the program faster, the right way to do it is to improve the algorithm. For instance, going from something that is O(N*N) to something that is O(N*log(N)) will create an enormous speedup, while bit fiddling will never buy more than a small linear factor.

So, once we have the code as clear and expressive as possible, we profile it. After we profile, we speed up the hot spots. We can spend weeks making a slow routine faster, only to discover that it is hardly ever called, so even though it is ten times faster than before, the program shaves only one second off a one-hour run. A profile will tell you where to look to improve things.