ChessUSA.com TalkChess.com
Hosted by Your Move Chess & Games
 
 FAQFAQ   SearchSearch   MemberlistMemberlist   UsergroupsUsergroups   RegisterRegister 
 ProfileProfile   Log in to check your private messagesLog in to check your private messages   Log inLog in 

Resource for bit twiddlers

 
Post new topic       TalkChess.com Forum Index -> Computer Chess Club: Programming and Technical Discussions Threaded
View previous topic :: View next topic  
Author Message
J. Wesley Cleveland



Joined: 01 Jul 2006
Posts: 622

PostPosted: Fri May 04, 2007 1:55 am    Post subject: Resource for bit twiddlers Reply to topic Reply with quote

I just ran accross this. It is a preprint of part of Knuth's Art of Computer Programming, Vol 4, and covers bit twiddling tricks and techniques.
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz
Back to top
View user's profile Send private message
Gerd Isenberg



Joined: 08 Mar 2006
Posts: 1787
Location: Hattingen, Germany

PostPosted: Mon May 28, 2007 12:40 pm    Post subject: Re: Resource for bit twiddlers Reply to topic Reply with quote

jwes wrote:
I just ran accross this. It is a preprint of part of Knuth's Art of Computer Programming, Vol 4, and covers bit twiddling tricks and techniques.
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz


This is really a great ressource - thanks for providing the link.

"tweaking several bytes at once" (page 19), nicely explains some SWAR (SIMD within a regsiter) tricks suggested by H.G. Dietz.

Code:
SWAR add z = x + y 
    z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)

SWAR sub z = x - y
    z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)

SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
    z = (x & y) + (((x ^ y) & ~L) >> 1)   

For bytewise math, H = 0x8080808080808080 and L = 0x0101010101010101. But one can use arbitrary masks for other sizes such as 6*10 or 5*12 bit as well - or even 64-bit structures with mixed "word"-sizes.

Performing bytewise or rankwise sub using the o ^(o-2r)-trick saves some instructions compared to one Kogge-Stone attack getter, even with general purpose registers:
Code:

u64 eastAttacks(u64 rooks, u64 empty)
{
    empty  = empty &  0xfefefefefefefefe;
    rooks |= empty & (rooks << 1);
    empty  = empty & (empty << 1);
    rooks |= empty & (rooks << 2);
    empty  = empty & (empty << 2);
    rooks |= empty & (rooks << 4);
    return (rooks << 1) & 0xfefefefefefefefe;
}

Code:
u64 eastAttacks(u64 rooks, u64 empty)
{
    const u64 H     = 0x8080808080808080;
    u64 occupied    = rooks | ~empty;
    u64 occNoRook   = rooks ^ occupied;
    u64 difference  = ((occNoRook | H) - (rooks & ~H))
                    ^  (~occupied & H);
    return occupied ^ difference;
}

Gerd
Back to top
View user's profile Send private message
Gerd Isenberg



Joined: 08 Mar 2006
Posts: 1787
Location: Hattingen, Germany

PostPosted: Mon May 28, 2007 3:16 pm    Post subject: Re: Resource for bit twiddlers Reply to topic Reply with quote

Gerd Isenberg wrote:


Code:
u64 eastAttacks(u64 rooks, u64 empty)
{
    const u64 H     = 0x8080808080808080;
    u64 occupied    = rooks | ~empty;
    u64 occNoRook   = rooks ^ occupied;
    u64 difference  = ((occNoRook | H) - (rooks & ~H))
                    ^  (~occupied & H);
    return occupied ^ difference;
}

Gerd


Oups, since

A ^ (~A & B) == A | B

the code becomes a bit cheaper:

Code:
u64 eastAttacks(u64 rooks, u64 empty)
{
    const u64 H     = 0x8080808080808080;
    u64 occInclRook = rooks | ~empty;
    u64 occExclRook = rooks ^ occInclRook;
    u64 attacks     = ((occExclRook | H) - (rooks & ~H))
                    ^  (occInclRook | H);
    return attacks;
}
Back to top
View user's profile Send private message
Gerd Isenberg



Joined: 08 Mar 2006
Posts: 1787
Location: Hattingen, Germany

PostPosted: Mon May 28, 2007 4:35 pm    Post subject: Re: Resource for bit twiddlers Reply to topic Reply with quote

The vc2005 compiler surprisingly was not aware of the
A ^ (~A & B) == A | B simplification.

Code:
?eastAttacks@@YA_K_K0@Z PROC
  00000   49 b9 80 80 80
   80 80 80 80 80    mov    r9, 8080808080808080H
  0000a   48 f7 d2    not    rdx
  0000d   49 b8 7f 7f 7f
   7f 7f 7f 7f 7f    mov    r8, 7f7f7f7f7f7f7f7fH
  00017   48 0b d1    or    rdx, rcx
  0001a   48 8b c2    mov    rax, rdx
  0001d   48 33 c1    xor    rax, rcx
  00020   49 23 c8    and    rcx, r8
  00023   49 0b c1    or    rax, r9
  00026   48 2b c1    sub    rax, rcx
  00029   48 8b ca    mov    rcx, rdx
  0002c   48 f7 d1    not    rcx
  0002f   49 23 c9    and    rcx, r9
  00032   48 33 c1    xor    rax, rcx
  00035   48 33 c2    xor    rax, rdx
  00038   c3       ret    0


Code:
rooks$ = 8
empty$ = 16
?eastAttacks@@YA_K_K0@Z PROC
  00000   48 f7 d2    not    rdx
  00003   49 b9 80 80 80
   80 80 80 80 80    mov    r9, 8080808080808080H
  0000d   49 b8 7f 7f 7f
   7f 7f 7f 7f 7f    mov    r8, 7f7f7f7f7f7f7f7fH
  00017   48 0b d1    or    rdx, rcx
  0001a   48 8b c2    mov    rax, rdx
  0001d   49 0b d1    or    rdx, r9
  00020   48 33 c1    xor    rax, rcx
  00023   49 23 c8    and    rcx, r8
  00026   49 0b c1    or    rax, r9
  00029   48 2b c1    sub    rax, rcx
  0002c   48 33 c2    xor    rax, rdx
  0002f   c3       ret    0


This one with a De Morgan and -i = ~i+1 transformation takes one register and a few code bytes less. The additional +1 is done together with the add by the AGU instaed of the ALU, performing a lea-instruction.
Code:
u64 eastAttacks(u64 rooks, u64 empty)
{
  const u64 H     = 0x8080808080808080;
  u64 occInclRook = rooks | ~empty;
  u64 occExclRook = rooks ^ occInclRook;
  u64 attacks     = ((occExclRook | H) + (~rooks | H) + 1)
                  ^  (occInclRook | H);
  return attacks;
}

Code:
rooks$ = 8
empty$ = 16
?eastAttacks@@YA_K_K0@Z PROC
  00000   49 b8 80 80 80
   80 80 80 80 80    mov    r8, 8080808080808080H
  0000a   48 f7 d2    not    rdx
  0000d   48 0b d1    or    rdx, rcx
  00010   48 8b c2    mov    rax, rdx
  00013   49 0b d0    or    rdx, r8
  00016   48 33 c1    xor    rax, rcx
  00019   48 f7 d1    not    rcx
  0001c   49 0b c0    or    rax, r8
  0001f   49 0b c8    or    rcx, r8
  00022   48 8d 44 08 01    lea    rax, QWORD PTR [rax+rcx+1]
  00027   48 33 c2    xor    rax, rdx
  0002a   c3       ret    0
Back to top
View user's profile Send private message
Gerd Isenberg



Joined: 08 Mar 2006
Posts: 1787
Location: Hattingen, Germany

PostPosted: Wed May 30, 2007 5:55 pm    Post subject: Re: Resource for bit twiddlers Reply to topic Reply with quote

Some further optimization by using common sub-expressions. One operation less, and more importantly, only three scratch registers used. The compiler prefers 10 byte opcode loading rax with an 64-bit immediate, while there is already the one's complement in rax, where a 3-byte "not rax" would result in 7 bytes shorter opcode but higher alu-pressure. With 64-bit immediate moves, the 32-byte prefetch of the K10 makes sense.

Code:
u64 eastAttacks(u64 rooks, u64 empty)
{
  const u64 H  =  0x8080808080808080;
  u64 inclRook =  rooks | H | ~empty;
  u64 exclRook = (rooks   &=  ~H  ) ^ inclRook;
  u64 attacks  = (exclRook - rooks) ^ inclRook;
  return attacks;
}

rooks$ = 8
empty$ = 16
?eastAttacks@@YA_K_K0@Z PROC
  00000   48 f7 d2    not    rdx
  00003   48 b8 80 80 80
   80 80 80 80 80    mov    rax, 8080808080808080H
  0000d   48 0b d1    or    rdx, rcx
  00010   48 0b d0    or    rdx, rax
  00013   48 b8 7f 7f 7f   ; not rax would also possible here
   7f 7f 7f 7f 7f    mov    rax, 7f7f7f7f7f7f7f7fH
  0001d   48 23 c8    and    rcx, rax
  00020   48 8b c2    mov    rax, rdx
  00023   48 33 c1    xor    rax, rcx
  00026   48 2b c1    sub    rax, rcx
  00029   48 33 c2    xor    rax, rdx
  0002c   c3       ret    0

If one passes already the occupied set, and guarantee that rooks are subset of occupied, one safes another two instructions of course, but that loses a bit flexibility and tends to confuse due to the prototype and semantic of the Kogge-Stone getters.

Gerd
Back to top
View user's profile Send private message
Pradu Kannan



Joined: 11 Mar 2006
Posts: 287
Location: Atlanta, GA

PostPosted: Wed May 30, 2007 10:51 pm    Post subject: Re: Resource for bit twiddlers Reply to topic Reply with quote

Gerd Isenberg wrote:
Some further optimization by using common sub-expressions. One operation less, and more importantly, only three scratch registers used. The compiler prefers 10 byte opcode loading rax with an 64-bit immediate, while there is already the one's complement in rax, where a 3-byte "not rax" would result in 7 bytes shorter opcode but higher alu-pressure. With 64-bit immediate moves, the 32-byte prefetch of the K10 makes sense.

Code:
u64 eastAttacks(u64 rooks, u64 empty)
{
  const u64 H  =  0x8080808080808080;
  u64 inclRook =  rooks | H | ~empty;
  u64 exclRook = (rooks   &=  ~H  ) ^ inclRook;
  u64 attacks  = (exclRook - rooks) ^ inclRook;
  return attacks;
}

rooks$ = 8
empty$ = 16
?eastAttacks@@YA_K_K0@Z PROC
  00000   48 f7 d2    not    rdx
  00003   48 b8 80 80 80
   80 80 80 80 80    mov    rax, 8080808080808080H
  0000d   48 0b d1    or    rdx, rcx
  00010   48 0b d0    or    rdx, rax
  00013   48 b8 7f 7f 7f   ; not rax would also possible here
   7f 7f 7f 7f 7f    mov    rax, 7f7f7f7f7f7f7f7fH
  0001d   48 23 c8    and    rcx, rax
  00020   48 8b c2    mov    rax, rdx
  00023   48 33 c1    xor    rax, rcx
  00026   48 2b c1    sub    rax, rcx
  00029   48 33 c2    xor    rax, rdx
  0002c   c3       ret    0

If one passes already the occupied set, and guarantee that rooks are subset of occupied, one safes another two instructions of course, but that loses a bit flexibility and tends to confuse due to the prototype and semantic of the Kogge-Stone getters.

Gerd


From what I can understand this subtraction technique can only be used for right fills and left fills. Here's what I could infer from your filling algorithm:

I'll term rooks as squares.

so basically the concept is to use the fact that
Code:

(LSB at left, looking at only first rank for now)
HFile      "rooks"    Result
00000001 - 00001000 = 00001110
Result     HFile      Fill
00001110 ^ 00000001 = 00001111


Now lets add an occupancy to see if it still works:
Code:

(LSB at left like for the first rank)
HFile|occ  "rooks"    Result
00000011 - 00001000 = 00001101
Result     HFile|occ  Fill
00001101 ^ 00000011 = 00001110


Works perfect! But because of the subtraction, we have to make sure the occupancy and the rook don't coinside eg:
Code:

(LSB at left like for the first rank)
HFile|occ  "rooks"    Result
00001001 - 00001000 = 00000001 (BAD)
Result     HFile|occ  Fill
00000001 ^ 00000011 = 00000010 (WRONG)


So we have to make sure we take the rooks out of the occupancy. Assuming the same occupancy as before:
Code:

((HFile|occ)&~rooks) - "rooks"    Result
00000001             - 00001000 = 00001110
Result ^ ((HFile|occ)&~rooks)   = Fill
00001110 00000001               = 00001110


From the above we can make an easy right fill:

Code:
U64 rightFill(U64 squares, U64 occ)
{
   occ = (occ|FILEHBB)&~squares; //occupancy without squares
   return ((occ - squares)^occ) | squares; //xor to get rid of unused occupancy
}


I hope my iterpretation is correct? Is it also possible to to do fills other than left and right with the subtraction technique?
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Gerd Isenberg



Joined: 08 Mar 2006
Posts: 1787
Location: Hattingen, Germany

PostPosted: Thu May 31, 2007 6:17 am    Post subject: Re: Resource for bit twiddlers Reply to topic Reply with quote

Pradu wrote:

From what I can understand this subtraction technique can only be used for right fills and left fills. Here's what I could infer from your filling algorithm:

I'll term rooks as squares.

so basically the concept is to use the fact that
Code:

(LSB at left, looking at only first rank for now)
HFile      "rooks"    Result
00000001 - 00001000 = 00001110
Result     HFile      Fill
00001110 ^ 00000001 = 00001111


Now lets add an occupancy to see if it still works:
Code:

(LSB at left like for the first rank)
HFile|occ  "rooks"    Result
00000011 - 00001000 = 00001101
Result     HFile|occ  Fill
00001101 ^ 00000011 = 00001110


Works perfect! But because of the subtraction, we have to make sure the occupancy and the rook don't coinside eg:
Code:

(LSB at left like for the first rank)
HFile|occ  "rooks"    Result
00001001 - 00001000 = 00000001 (BAD)
Result     HFile|occ  Fill
00000001 ^ 00000011 = 00000010 (WRONG)


So we have to make sure we take the rooks out of the occupancy. Assuming the same occupancy as before:
Code:

((HFile|occ)&~rooks) - "rooks"    Result
00000001             - 00001000 = 00001110
Result ^ ((HFile|occ)&~rooks)   = Fill
00001110 00000001               = 00001110


From the above we can make an easy right fill:

Code:
U64 rightFill(U64 squares, U64 occ)
{
   occ = (occ|FILEHBB)&~squares; //occupancy without squares
   return ((occ - squares)^occ) | squares; //xor to get rid of unused occupancy
}


I hope my iterpretation is correct? Is it also possible to to do fills other than left and right with the subtraction technique?


Hi Pradu,

no, the o^(o-2r) trick with r is subset of o, unfortunately works only for "positive" directions, in this setwise case for the a->h direction (east, right from white points of view on the chessboard) with mapping a1==0, h8==63. For all other seven directions one has to use Steffan Westcott's Kogge-Stone routines.

The bytewise or rankwise o^(o-2r) works setwise even with multiple rooks per rank. The first substraction of -2r == - r - r can be done by xor or "andnot" to clear the whole rook-bits in the occupancy, while the substration does the carry propagation, and the final xor leaves the attacks. Simd instructions with MMX or SSE2 are nice for that bytewise-substraction, but it turned out that the SWAR-trick is as quite effective as well.

With single rooks/bishops by square metric, one can use this trick for four positive vertical and diagonal directions with pre- and post-masks of the complete file or diagonal. For bishops and file-attacks one can use vertical flips (bswap) to do the negative directions. But a bishop attack-getter based on this technique was a bit slower than kindergarten, not to mention magic:

Code:
u64 bishopAttacks(u64 occ, u32 sq) {
    u64 rvsdia, rvsant, bishop;
    u64 occdia = occ, occant = occ;
    occdia &=  smsk[sq].diagonalMaskEx;
    occant &=  smsk[sq].antidiagMaskEx;
    bishop  =  smsk[sq].bitMask;
    rvsdia  = _byteswap_uint64(occdia);
    rvsant  = _byteswap_uint64(occant);
    occdia ^=  occdia - bishop;
    occant ^=  occant - bishop;
    bishop  = _byteswap_uint64(bishop);
    rvsdia ^=  rvsdia - bishop;
    rvsant ^=  rvsant - bishop;
    occdia ^= _byteswap_uint64(rvsdia);
    occant ^= _byteswap_uint64(rvsant);
    occdia &=  smsk[sq].diagonalMaskEx;
    occant &=  smsk[sq].antidiagMaskEx;
    return occdia ^ occant;
}

Gerd
Back to top
View user's profile Send private message
Pradu Kannan



Joined: 11 Mar 2006
Posts: 287
Location: Atlanta, GA

PostPosted: Fri Jun 01, 2007 12:16 am    Post subject: Re: Resource for bit twiddlers Reply to topic Reply with quote

Gerd Isenberg wrote:


Hi Pradu,

no, the o^(o-2r) trick with r is subset of o, unfortunately works only for "positive" directions, in this setwise case for the a->h direction (east, right from white points of view on the chessboard) with mapping a1==0, h8==63. For all other seven directions one has to use Steffan Westcott's Kogge-Stone routines.
My mistake. I guess also that my code doesn't work well sometimes for multiple rooks on the same rank when one rook coincides with an occupancy. I need to look at your code a bit more closely.
Back to top
View user's profile Send private message Send e-mail Visit poster's website
Gerd Isenberg



Joined: 08 Mar 2006
Posts: 1787
Location: Hattingen, Germany

PostPosted: Fri Jun 01, 2007 8:19 am    Post subject: Re: Resource for bit twiddlers Reply to topic Reply with quote

Pradu wrote:
Gerd Isenberg wrote:


Hi Pradu,

no, the o^(o-2r) trick with r is subset of o, unfortunately works only for "positive" directions, in this setwise case for the a->h direction (east, right from white points of view on the chessboard) with mapping a1==0, h8==63. For all other seven directions one has to use Steffan Westcott's Kogge-Stone routines.
My mistake. I guess also that my code doesn't work well sometimes for multiple rooks on the same rank when one rook coincides with an occupancy. I need to look at your code a bit more closely.

Code:

               hex    bin        bin
            arithmetical order   reversed according to the a1 = 0 mapping
o              0xDB   11011011   11011011   occupancy including rooks
r = subset(o)  0x12   00010010   01001000   rooks
o - r          0xC9   11001001   10010011   occupancy excluding rooks
o - 2r         0xB7   10110111   11101101   subtracting rooks from either next higher occupancy or borrow
o ^ (o - 2r)   0x6C   01101100   00110110   rook attacks of both rooks

You can not do it the other way around - subtracting none-rooks from rooks to get the opposite direction Wink
Back to top
View user's profile Send private message
Gerd Isenberg



Joined: 08 Mar 2006
Posts: 1787
Location: Hattingen, Germany

PostPosted: Mon Jun 04, 2007 2:58 pm    Post subject: Re: Resource for bit twiddlers Reply to topic Reply with quote

Gerd Isenberg wrote:
jwes wrote:
I just ran accross this. It is a preprint of part of Knuth's Art of Computer Programming, Vol 4, and covers bit twiddling tricks and techniques.
http://www-cs-faculty.stanford.edu/~knuth/fasc1a.ps.gz


This is really a great ressource - thanks for providing the link.

"tweaking several bytes at once" (page 19), nicely explains some SWAR (SIMD within a regsiter) tricks suggested by H.G. Dietz.

Code:
SWAR add z = x + y 
    z = ((x &~H) + (y &~H)) ^ ((x ^ y) & H)

SWAR sub z = x - y
    z = ((x | H) - (y &~H)) ^ ((x ^~y) & H)

SWAR average z = (x+y)/2 based on x + y = (x^y) + 2*(x&y)
    z = (x & y) + (((x ^ y) & ~L) >> 1)   

For bytewise math, H = 0x8080808080808080 and L = 0x0101010101010101. But one can use arbitrary masks for other sizes such as 6*10 or 5*12 bit as well - or even 64-bit structures with mixed "word"-sizes.

Performing bytewise or rankwise sub using the o ^(o-2r)-trick saves some instructions compared to one Kogge-Stone attack getter, even with general purpose registers:
Code:

u64 eastAttacks(u64 rooks, u64 empty)
{
    empty  = empty &  0xfefefefefefefefe;
    rooks |= empty & (rooks << 1);
    empty  = empty & (empty << 1);
    rooks |= empty & (rooks << 2);
    empty  = empty & (empty << 2);
    rooks |= empty & (rooks << 4);
    return (rooks << 1) & 0xfefefefefefefefe;
}

Code:
u64 eastAttacks(u64 rooks, u64 empty)
{
    const u64 H     = 0x8080808080808080;
    u64 occupied    = rooks | ~empty;
    u64 occNoRook   = rooks ^ occupied;
    u64 difference  = ((occNoRook | H) - (rooks & ~H))
                    ^  (~occupied & H);
    return occupied ^ difference;
}

Gerd



I'll hope the attack-getter does not violate any United States Patent Wink

see also
http://aggregate.org/MAGIC/#SIMD%20Within%20A%20Register%20(SWAR)%20Operations
by Dr. Henry G. (Hank) Dietz

Quote:
SIMD Within A Register (SWAR) Operations

Before we coined the name SWAR in Fall 1996, we already had defined a complete set of basic operations and described how they could be implemented with good efficiency. On February 4, 1997, we posted this fairly complete overview document and there also are slides from a seminar presentation I gave at Purdue. These methods were used in our SWARC compiler and were detailed in a number of our publications throughout the 1990s. We hadn't posted them on this page because they were so prominently published elsewhere.

However, much to our surprize, United States Patent 7039906, "Compiler for enabling multiple signed independent data elements per register," was filed October 20, 2000 and was issued to IBM on May 2, 2006! By our reading, this patent appears to seriously overlap my and Randy Fisher's earlier published work -- much of which was cited by the patent examiner. I am posting this note here so that people who happen to hear about the IBM patent will not be discouraged from using the prior art technologies developed by us, which, by being cited in the patent, are explicitly not covered by the patent.
Back to top
View user's profile Send private message
Display posts from previous:   
Post new topic       TalkChess.com Forum Index -> Computer Chess Club: Programming and Technical Discussions All times are GMT
Threaded
Page 1 of 1

 
Jump to:  
You cannot post new topics in this forum
You cannot reply to topics in this forum
You cannot edit your posts in this forum
You cannot delete your posts in this forum
You cannot vote in polls in this forum




Powered by phpBB © 2001, 2005 phpBB Group
Enhanced with Moby Threads