speed of bitboard operations question

- Posts: 10792
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel

Strelka's code includes the following code for generating direct checks with captures:
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC
The idea is to look only at pawns on files c-h, because a pawn on file a or b cannot give check in the same leftward direction as its capture (b6xa7 can be a direct check, but not along the continuation of the b6-a7 diagonal).
It is clear that there are no pawns on rank 1 or rank 8, so you can get the same result with
(Board->mp[WhitePawn]) & 0xFCFCFCFCFCFCFCFC or with
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFC00
The question is whether there is a speed reason to prefer one of them (the first one seems simpler if you replace Strelka's magic numbers with named constants).
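For illustration, here are the three masks written with named constants (the names are illustrative, not Strelka's, and assume rank-major numbering with a1 = bit 0, b1 = bit 1, ..., h8 = bit 63):

Code: Select all
typedef unsigned long long U64;

/* Illustrative file/rank masks (a1 = bit 0, ..., h8 = bit 63). */
#define FileABB 0x0101010101010101ULL
#define FileBBB 0x0202020202020202ULL
#define Rank1BB 0x00000000000000FFULL
#define Rank8BB 0xFF00000000000000ULL

/* All three are folded at compile time; on a legal position they
   select the same pawns, since no pawn stands on rank 1 or rank 8. */
static const U64 m1 = ~FileABB & ~FileBBB & ~Rank8BB;            /* 0x00FCFCFCFCFCFCFC */
static const U64 m2 = ~FileABB & ~FileBBB;                       /* 0xFCFCFCFCFCFCFCFC */
static const U64 m3 = ~FileABB & ~FileBBB & ~Rank1BB & ~Rank8BB; /* 0x00FCFCFCFCFCFC00 */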
Uri
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: speed of bitboard operations question
Uri Blass wrote: Strelka's code includes the following code for generating direct checks with captures:
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC
The idea is to look only at pawns on files c-h, because a pawn on file a or b cannot give check in the same leftward direction as its capture (b6xa7 can be a direct check, but not along the continuation of the b6-a7 diagonal).
It is clear that there are no pawns on rank 1 or rank 8, so you can get the same result with
(Board->mp[WhitePawn]) & 0xFCFCFCFCFCFCFCFC or with
(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFC00
The question is whether there is a speed reason to prefer one of them (the first one seems simpler if you replace Strelka's magic numbers with named constants).
Uri

Makes absolutely no difference, since there will never be pawns on those ranks. The speed of a 64-bit operation (AND in this case) does not depend on how many bits are set.
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: speed of bitboard operations question
Compilers do something called "constant folding", where an expression involving only constants is immediately evaluated and replaced with the resulting constant. It's pretty useful, because the generated code is identical to what you get if you write the constant yourself in the source code: you can write it the more natural way and it costs exactly nothing at runtime. (For example, think of all the macros you've ever seen that do math on other constants or macros.)

(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC
Now suppose you wanted to decompile some compiled code. For the parts that have to *do something* at runtime, code was generated, and you can carefully examine that code and try to untangle exactly what it does. But for a folded constant expression, nothing was generated except the final value, which is why Strelka contains so many odd-looking constants like this. He could try to break them back down into sensible expressions (which the compiler would just fold up again when compiling Strelka), but why bother?
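As a sketch of how such a constant can arise (the mask names here are illustrative, not Strelka's; a1 = bit 0, rank-major numbering):

Code: Select all
#include <stdio.h>

#define FileABB 0x0101010101010101ULL
#define FileBBB 0x0202020202020202ULL
#define Rank8BB 0xFF00000000000000ULL

/* Written this way in source, the expression is folded at compile
   time; the object code only ever contains 0x00FCFCFCFCFCFCFC. */
#define PawnsThatCanCheckLeft (~FileABB & ~FileBBB & ~Rank8BB)

int main(void)
{
    printf("0x%016llX\n", PawnsThatCanCheckLeft); /* 0x00FCFCFCFCFCFCFC */
    return 0;
}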
- Posts: 10792
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: speed of bitboard operations question
wgarvin wrote: Compilers do something called "constant folding", where an expression involving only constants is immediately evaluated and replaced with the resulting constant.

1) You are right if the expression involves only constants, but here the expression includes non-constants: the pawns of one side are not a constant. So if the compiler does not know that pawns can never stand on rank 1 or rank 8, it could do different things for different constants that lead to the same result.
I understood from Bob that the choice of constant is not important, and WhitepawnsBB&Ranks12BB gives the same result as WhitepawnsBB&Ranks2BB, but the only-constants situation does not arise here.
2) I think that breaking the constants back into sensible expressions is important, because later I may translate the same sensible expressions into different expressions (I do not plan to use the square ordering A1<B1<C1<..H1<A2 but A1<A2<A3<...A8<B1...). I may want to use constants with the same meaning, so translating the constant numbers into names can help me not only to understand Strelka better but also to write my own program later (after I have a list of constants for Strelka, I may write a function to translate these constants to the board representation that I plan to use).
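Such a translation function might look like this (a sketch, assuming Strelka's numbering is rank-major, a1=0, b1=1, ..., and the target numbering is file-major, a1=0, a2=1, ...):

Code: Select all
typedef unsigned long long U64;

/* Remap a bitboard constant from rank-major square numbering
   (a1=0, b1=1, ..., h8=63) to file-major numbering
   (a1=0, a2=1, ..., a8=7, b1=8, ...). */
U64 remap_to_file_major(U64 b)
{
    U64 out = 0;
    int sq;
    for (sq = 0; sq < 64; sq++) {
        if (b & (1ULL << sq)) {
            int file = sq & 7;  /* file index in the source numbering */
            int rank = sq >> 3; /* rank index in the source numbering */
            out |= 1ULL << (file * 8 + rank);
        }
    }
    return out;
}

Running every extracted constant through such a function once, at startup or in a generator, keeps the named constants meaningful in both representations.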
Uri
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: speed of bitboard operations question
Uri Blass wrote: 1) You are right if the expression involves only constants, but here the expression includes non-constants: the pawns of one side are not a constant.

Sorry, even though I quoted the whole line, when I was talking about constant folding I was only trying to explain where the constant 0x00FCFCFCFCFCFCFC might have come from.
In other words, if that line of code came from reverse-engineering of some compiled code, then the left part of it would be a bunch of instructions, and the person doing the reverse-engineering could translate them into something sensible:

(Board->mp[WhitePawn]) & 0x00FCFCFCFCFCFCFC

But the 0x00FCFCFCFCFCFCFC would appear in the compiled code as a raw number--the constant folding erases the trail of where it came from.
I guess in this case it was ((~FileABB & ~FileBBB) >> 8) or something like that, but maybe the expression (or part of it?) was wrapped in a macro that was re-used from elsewhere in the program, which might explain why the shift is there. Who knows -- whatever the original thing was, it got folded into 0x00FCFCFCFCFCFCFC, and whoever did the reverse-engineering did not try to unfold it again.
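That guess is easy to verify at compile time (C11; the file-mask values again assume a1 = bit 0, rank-major numbering):

Code: Select all
#define FileABB 0x0101010101010101ULL
#define FileBBB 0x0202020202020202ULL

/* Fails to compile if the unfolded expression does not
   reproduce Strelka's constant. */
_Static_assert(((~FileABB & ~FileBBB) >> 8) == 0x00FCFCFCFCFCFCFCULL,
               "unfolded mask does not match");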
If I am going on about something that is totally obvious to you already, then I apologize.

Uri Blass wrote: 2) I think that breaking the constants back into sensible expressions is important, because later I may translate the same sensible expressions into different expressions...

Sure. In some cases it will be obvious what to replace them with. There are probably a few mysterious ones in there, though.

- Posts: 10792
- Joined: Thu Mar 09, 2006 12:37 am
- Location: Tel-Aviv Israel
Re: speed of bitboard operations question
bob wrote: Makes absolutely no difference, since there will never be pawns on those ranks. The speed of a 64-bit operation (AND in this case) does not depend on how many bits are set.

Thanks.
Here is another question
I see code of the type
a=(a1&b1&c)|(a2&b2&c), where all variables are 64-bit numbers.
I can replace it by
a=c&((a1&b1)|(a2&b2))
I guess that the speed is the same, because the compiler is smart enough to optimize both into the same code, but I wonder which is faster in case it is not smart enough.
I think the second is faster, because the first expression has 4 AND operations and 1 OR operation, while the second has only 3 AND operations and 1 OR operation.
Uri
- Posts: 2251
- Joined: Wed Mar 08, 2006 8:47 pm
- Location: Hattingen, Germany
Re: speed of bitboard operations question
Uri Blass wrote: Here is another question
I see code of the type
a=(a1&b1&c)|(a2&b2&c), where all variables are 64-bit numbers.
I can replace it by
a=c&((a1&b1)|(a2&b2))
I guess that the speed is the same, because the compiler is smart enough to optimize both into the same code, but I wonder which is faster in case it is not smart enough.
I think the second is faster, because the first expression has 4 AND operations and 1 OR operation, while the second has only 3 AND operations and 1 OR operation.
Uri

Yes, conjunction is distributive over disjunction, and vice versa.
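Applied bitwise to each of the 64 bits, the identity is:

$$(a_1 \wedge b_1 \wedge c) \vee (a_2 \wedge b_2 \wedge c) \;=\; \bigl((a_1 \wedge b_1) \vee (a_2 \wedge b_2)\bigr) \wedge c$$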
I wonder why I put all that stuff in the wiki

http://chessprogramming.wikispaces.com/ ... erations17
Gerd
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: speed of bitboard operations question
Uri Blass wrote:
Here is another question
I see code of the type
a=(a1&b1&c)|(a2&b2&c), where all variables are 64-bit numbers.
I can replace it by
a=c&((a1&b1)|(a2&b2))
I guess that the speed is the same, because the compiler is smart enough to optimize both into the same code, but I wonder which is faster in case it is not smart enough.
I think the second is faster, because the first expression has 4 AND operations and 1 OR operation, while the second has only 3 AND operations and 1 OR operation.
Uri

The compiler will most likely do it the fastest way. But the general rule I use is that the fewer operations required, the faster it will run. The problem is that it is not always easy to determine the number of operations required, because compilers are pretty clever and end up doing what you meant, not what you wrote, to make things faster.
This means you have some flexibility to write the code so that it is easier to read. So long as you understand how the optimizer works, your easier-to-read code can be compiled into something just as fast as the less clear code you might write to accomplish the same thing in an apparently faster way.
- Posts: 838
- Joined: Thu Jul 05, 2007 5:03 pm
- Location: British Columbia, Canada
Re: speed of bitboard operations question
bob wrote: The compiler will most likely do it the fastest way. But the general rule I use is that the fewer operations required, the faster it will run. The problem is that it is not always easy to determine the number of operations required, because compilers are pretty clever and end up doing what you meant, not what you wrote, to make things faster.
This means you have some flexibility to write the code so that it is easier to read. So long as you understand how the optimizer works, your easier-to-read code can be compiled into something just as fast as the less clear code you might write to accomplish the same thing in an apparently faster way.

That is a good rule of thumb, and most of the time it's all you need.
Another thing I try to keep in mind is the dependencies between expressions. If you write two expressions that the compiler can evaluate separately and then combine, that will likely be faster than a series of operations it has to do one at a time because each depends on the result of the previous one (called a "dependence chain", which is bad for performance if it gets too long).
A fake example: if you write:
Code: Select all
U64 x = (((((a>>2)|b)>>3)|c)>>5)|d;
then the compiler has to evaluate one long serial chain, something like:
Code: Select all
U64 t = (a>>2);
t |= b;
t >>= 3;
t |= c;
t >>= 5;
t |= d; // one dependence chain of 6 instructions, where each one depends on the result of the previous one
But if you instead write:
Code: Select all
U64 x = (a>>10)|(b>>8)|(c>>5)|d;
then the independent parts can be evaluated in parallel:
Code: Select all
// these first 3 expressions don't depend on each other
U64 t1 = (a>>10);
U64 t2 = (b>>8);
U64 t3 = (c>>5);
// the next two instructions don't depend on each other
t1 |= t2;
t3 |= d;
// one last instruction to combine the two halves together
t1 |= t3;
[Edit: for x86, one reason short dependence chains are better is that if the whole chain is only 3 or 4 instructions long, the compiler doesn't even have to space them out: all of them can be decoded and simply wait their turn to execute in the out-of-order core while other short chains are being decoded and executed. If you have a really long chain, or a chain containing long-latency instructions such as divides, it becomes a bottleneck, tying up processor resources or taking so long to execute that the processor doesn't have enough other work to keep it busy.]
By the way, I wouldn't bother trying to rewrite your code to improve the generated code unless you have already made sure, using a profiler, that the program really does spend a lot of its time there. If you make a small function 30% faster but less than 1% of the total runtime was spent in it, your effort was basically wasted. Use the profiler to find the inner loop or the key functions where you're spending 8% or 15% of the total time, and improve those.
- Posts: 12777
- Joined: Wed Mar 08, 2006 8:57 pm
- Location: Redmond, WA USA
Re: speed of bitboard operations question
Uri Blass wrote: I see code of the type a=(a1&b1&c)|(a2&b2&c), where all variables are 64-bit numbers. I can replace it by a=c&((a1&b1)|(a2&b2)).

My compiler generates identical assembly for both constructs:
Code: Select all
#include <stdio.h>
#include <stdlib.h>

typedef unsigned long long bb;

bb t0(bb a1, bb b1, bb a2, bb b2, bb c)
{
    bb a = (a1 & b1 & c) | (a2 & b2 & c);
    return a;
}
/*
; generated assembly:
PUBLIC _t0
; Function compile flags: /Ogtpy
_TEXT SEGMENT
_a1$ = 8 ; size = 8
_b1$ = 16 ; size = 8
_a2$ = 24 ; size = 8
_b2$ = 32 ; size = 8
_c$ = 40 ; size = 8
_t0 PROC
; 5 : bb a = (a1 & b1 & c) | (a2 & b2 & c);
00050 8b 44 24 04 mov eax, DWORD PTR _a1$[esp-4]
00054 8b 54 24 08 mov edx, DWORD PTR _a1$[esp]
00058 8b 4c 24 14 mov ecx, DWORD PTR _a2$[esp-4]
0005c 23 44 24 0c and eax, DWORD PTR _b1$[esp-4]
00060 23 54 24 10 and edx, DWORD PTR _b1$[esp]
00064 23 4c 24 1c and ecx, DWORD PTR _b2$[esp-4]
00068 56 push esi
00069 8b 74 24 1c mov esi, DWORD PTR _a2$[esp+4]
0006d 23 74 24 24 and esi, DWORD PTR _b2$[esp+4]
00071 0b c1 or eax, ecx
00073 23 44 24 28 and eax, DWORD PTR _c$[esp]
00077 0b d6 or edx, esi
00079 23 54 24 2c and edx, DWORD PTR _c$[esp+4]
; 6 : return a;
; 7 : }
0007d 5e pop esi
0007e c3 ret 0
_t0 ENDP
*/
bb t1(bb a1, bb b1, bb a2, bb b2, bb c)
{
    bb a = c & ((a1 & b1) | (a2 & b2));
    return a;
}
/*
PUBLIC _t1
; Function compile flags: /Ogtpy
_TEXT SEGMENT
_a1$ = 8 ; size = 8
_b1$ = 16 ; size = 8
_a2$ = 24 ; size = 8
_b2$ = 32 ; size = 8
_c$ = 40 ; size = 8
_t1 PROC
; 11 : bb a = c & ((a1 & b1) | (a2 & b2));
00020 8b 44 24 04 mov eax, DWORD PTR _a1$[esp-4]
00024 8b 54 24 08 mov edx, DWORD PTR _a1$[esp]
00028 8b 4c 24 14 mov ecx, DWORD PTR _a2$[esp-4]
0002c 23 44 24 0c and eax, DWORD PTR _b1$[esp-4]
00030 23 54 24 10 and edx, DWORD PTR _b1$[esp]
00034 23 4c 24 1c and ecx, DWORD PTR _b2$[esp-4]
00038 56 push esi
00039 8b 74 24 1c mov esi, DWORD PTR _a2$[esp+4]
0003d 23 74 24 24 and esi, DWORD PTR _b2$[esp+4]
00041 0b c1 or eax, ecx
00043 23 44 24 28 and eax, DWORD PTR _c$[esp]
00047 0b d6 or edx, esi
00049 23 54 24 2c and edx, DWORD PTR _c$[esp+4]
; 12 : return a;
; 13 : }
0004d 5e pop esi
0004e c3 ret 0
_t1 ENDP
_TEXT ENDS
*/
/* Build a 64-bit pseudo-random value from two rand() calls
   (rand() typically supplies only 15 random bits per call). */
bb rbb(void)
{
    bb a1 = rand();
    a1 <<= 31;
    a1 |= rand();
    return a1;
}

int main(void)
{
    bb a, a1, b1, a2, b2, c;

    a1 = rbb();
    b1 = rbb();
    a2 = rbb();
    b2 = rbb();
    c = rbb();
    a = t0(a1, b1, a2, b2, c);
    printf("%llu\n", a);

    a1 = rbb();
    b1 = rbb();
    a2 = rbb();
    b2 = rbb();
    c = rbb();
    a = t1(a1, b1, a2, b2, c);
    printf("%llu\n", a);
    return 0;
}
The first rule of optimization is:
1. DON'T DO IT.
And the second rule (for experts only) is:
2. DON'T DO IT YET.
The general idea is that we should write code that is as clear and expressive as possible. Later, when we want to make the program faster, the right way to do it is to improve the algorithm: going from an O(N*N) approach to an O(N*log(N)) one creates an enormous speedup, while bit fiddling never buys more than a small linear factor.
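As a toy illustration of that point (not code from this thread): detecting a duplicate value in an array, first by comparing every pair, then by sorting first.

Code: Select all
#include <stdio.h>
#include <stdlib.h>

/* O(N*N): test every pair of elements. */
static int has_duplicate_quadratic(const int *v, size_t n)
{
    for (size_t i = 0; i < n; i++)
        for (size_t j = i + 1; j < n; j++)
            if (v[i] == v[j])
                return 1;
    return 0;
}

static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

/* O(N*log N): sort, then one linear scan of adjacent elements. */
static int has_duplicate_sorted(int *v, size_t n)
{
    qsort(v, n, sizeof v[0], cmp_int);
    for (size_t i = 1; i < n; i++)
        if (v[i] == v[i - 1])
            return 1;
    return 0;
}

int main(void)
{
    int v[] = { 7, 3, 9, 3, 5 };
    size_t n = sizeof v / sizeof v[0];
    printf("%d %d\n", has_duplicate_quadratic(v, n),
                      has_duplicate_sorted(v, n)); /* prints "1 1" */
    return 0;
}

For large N, no amount of micro-tuning the inner comparison in the quadratic version will recover the difference.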
So, once the code is as clear and expressive as possible, we profile it, and then we speed up the hot spots. We can spend weeks making a slow routine faster only to discover that it is hardly ever called, so that even though it is ten times faster than before, the program shaves only one second off a one-hour run. A profile will tell you where to look to improve things.