Page 1 of 1

Kogge Stone, Vector Based

Posted: Wed Jan 23, 2013 12:30 am
by smatovic
Maybe it is of interest for someone,
the 8 directions for Kogge-Stone are done via an ulong8 vector datatype,
on my gpu it performs as fast as magic bitboards...

Code: Select all


        // generate pseudo legal moves

        while( bbWork )

            bbTemp = 0;
            bbMoves = 0;

            // pop 1st bit
            pos     = pop_1st_bit(bbWork);

            piece   = getPiece(board, pos);


            // all 8 directions in parrallel via vector ulong8
            pro8 = ~bbBlockers;
            gen8 = SetMaskBB[pos];
            r8 = shift8[piece];
            pro8 &= avoidWrap8[piece];

            // do kogge stone
            gen8 |= pro8 & (&#40;gen8 << r8&#41; | &#40;gen8 >> &#40;64-r8&#41;));
            pro8 &=       (&#40;pro8 << r8&#41; | &#40;pro8 >> &#40;64-r8&#41;));
            gen8 |= pro8 & (&#40;gen8 << 2*r8&#41; | &#40;gen8 >> &#40;64-2*r8&#41;));
            pro8 &=       (&#40;pro8 << 2*r8&#41; | &#40;pro8 >> &#40;64-2*r8&#41;));
            gen8 |= pro8 & (&#40;gen8 << 4*r8&#41; | &#40;gen8 >> &#40;64-4*r8&#41;));

            // Shift one for Captures
            gen8 = (&#40;gen8 << r8&#41; | &#40;gen8 >> &#40;64-r8&#41;)) & avoidWrap8&#91;piece&#93;;
            bbTemp = gen8.s0 | gen8.s1 | gen8.s2 | gen8.s3 | gen8.s4 | gen8.s5 | gen8.s6 | gen8.s7;     

            // Verify
            bbTemp &= ( piece == PAWN&#41; ? &#40;PawnAttackTables&#91;som*64+pos&#93; & bbOpposite&#41;  | &#40;PawnAttackTables&#91;&#40;som+2&#41;*64+pos&#93; & ~bbBlockers&#41;  &#58; AttackTables&#91;som*6*64+piece*64+pos&#93; ;

            // Pawn double square
            if ( piece == PAWN && ( ( som == WHITE && &#40;pos>>3&#41; == 1&#41; || &#40;som == BLACK && &#40;pos>>3&#41; == 6 ) ) ) &#123;
                to  = &#40;som == BLACK&#41;? pos-8  &#58; pos+8;
                cpt = &#40;som == BLACK&#41;? pos-16 &#58; pos+16;
                if (   (~bbBlockers & SetMaskBB&#91;to&#93;) && (~bbBlockers & SetMaskBB&#91;cpt&#93;) )
                    bbTemp |= SetMaskBB&#91;cpt&#93;;
            &#125;

            // Captures
            bbMoves  = &#40;bbTemp & bbOpposite&#41;;
            // Non cpatures
            bbMoves |= &#40;qs == 0&#41;? bbTemp & ~bbBlockers &#58; 0 ;


full code:
https://github.com/smatovic/Zeta/tree/z ... koggestone


--
Srdja

Re: Kogge Stone, Vector Based

Posted: Wed Jan 23, 2013 5:10 pm
by Gerd Isenberg
Wow!

You process a vector of 8 bitboards with individial shifts per SIMD instruction, i.e. {9, 1, -7, -8, -9, -1, 7, 8} for a queen - positive amounts shifting left, negative shifting right, but you still need the generalized Kogge-Stone with rotate, since shifting left by negative amount is not shifting right, but shifing left with amount and 63 (mod 64).

Do you have any links to the concrete SIMD instruction set?

Gerd

Re: Kogge Stone, Vector Based

Posted: Wed Jan 23, 2013 7:19 pm
by smatovic
the idea for the generalized shift is taken from the chessprogramming wiki,
Do you have any links to the concrete SIMD instruction set?
For Nvidia there is the PTX instruction set:

www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf

for AMD CAL

http://coachk.cs.ucf.edu/courses/CDA693 ... amming.pdf

--
Srdja

Re: Kogge Stone, Vector Based

Posted: Wed Jan 23, 2013 7:49 pm
by smatovic
Just tested the idea with the AMD OpenCL compiler,
it doesn't work.

Code: Select all

Error&#58; Building Program &#40;clBuildProgram&#41;
, status&#58;-11buildlog&#58; "/tmp/OCLfv1xCw.cl", line 1357&#58; error&#58; vector operation requires identical
          element types
              gen8 |= pro8 & (&#40;gen8 << r8&#41; | &#40;gen8 >> &#40;64-r8&#41;));
                                    ^

--
Srdja

Re: Kogge Stone, Vector Based

Posted: Wed Jan 23, 2013 8:37 pm
by Gerd Isenberg
smatovic wrote: For Nvidia there is the PTX instruction set:

www.nvidia.com/content/CUDA-ptx_isa_1.4.pdf
Thanks. Pity - PTX has no rotate instruction. It is called SIMT (single-instruction, multiple-thread) rather than SIMD. So let say a shl.b64 a, b instruction is executed in 8 treads, each operating on corresponding elements of the vectors?

Re: Kogge Stone, Vector Based

Posted: Wed Jan 23, 2013 8:50 pm
by smatovic
So let say a shl.b64 a, b instruction is executed in 8 treads, each operating on corresponding elements of the vectors?
Afaik, yes.

On Nvidia Fermi devices 32 threads are executed in SIMT fashion in one "Warp" over 4 clock cycles.

--
Srdja

Re: Kogge Stone, Vector Based, 2*4 directions, ulong4

Posted: Mon Jan 28, 2013 6:03 pm
by smatovic
another solution is to split left and right shifts in 2*4 directions via ulong4.....


Code: Select all

__constant ulong4 shift4&#91;&#93; =

&#123;
&#123;     0, 0, 0, 0 &#125;,                      // EMTPY
&#123;     9, 0, 7, 8 &#125;,                     // PAWN
&#123;    17,10, 6,15 &#125;,                    // KNIGTH
&#123;     9, 1, 7, 8 &#125;,                   // KING
&#123;     9, 0, 7, 0 &#125;,                  // BISHOP
&#123;     0, 1, 0, 8 &#125;,                 // ROOK
&#123;     9, 1, 7, 8 &#125;                 // QUEEN
&#125;;    

Code: Select all

     
            // 2*4 directions via ulong4, left and right shifting
            s4 = shift4&#91;&#40;piece>>1&#41;&#93;;

            // 1*4 directions, left shift, in parrallel via vector ulong4
            pro4 = ~bbBlockers;
            gen4 = SetMaskBB&#91;pos&#93;;

            // do kogge stone
            gen4 |= pro4 & &#40;gen4 << s4&#41;;
            pro4 &=        &#40;pro4 << s4&#41;;
            gen4 |= pro4 & &#40;gen4 << 2*s4&#41;;
            pro4 &=        &#40;pro4 << 2*s4&#41;;
            gen4 |= pro4 & &#40;gen4 << 4*s4&#41;;

            // Shift one further
            gen4 =  &#40;gen4 << s4&#41;;
            bbTemp = gen4.s0 | gen4.s1 | gen4.s2 | gen4.s3;     

            // 1*4 directions, right shift, in parrallel via vector ulong4
            pro4 = ~bbBlockers;
            gen4 = SetMaskBB&#91;pos&#93;;

            // do kogge stone
            gen4 |= pro4 & &#40;gen4 >> s4&#41;;
            pro4 &=        &#40;pro4 >> s4&#41;;
            gen4 |= pro4 & &#40;gen4 >> 2*s4&#41;;
            pro4 &=        &#40;pro4 >> 2*s4&#41;;
            gen4 |= pro4 & &#40;gen4 >> 4*s4&#41;;

            // Shift one further
            gen4 =  &#40;gen4 >> s4&#41;;
            bbTemp |= gen4.s0 | gen4.s1 | gen4.s2 | gen4.s3;     

            // Verify
            bbTemp &= ( &#40;piece>>1&#41; == PAWN&#41; ? &#40;PawnAttackTables&#91;som*64+pos&#93; & bbOpposite&#41;  | &#40;PawnAttackTables&#91;&#40;som+2&#41;*64+pos&#93; & ~bbBlockers&#41;  &#58; AttackTables&#91;&#40;som*7*64&#41;+(&#40;piece>>1&#41;*64&#41;+pos&#93; ;

            // Captures
            bbMoves  = &#40;bbTemp & bbOpposite&#41;;
            // Non cpatures
            bbMoves |= &#40;qs == 0&#41;? bbTemp & ~bbBlockers &#58; 0 ;


--
Srdja

Re: Kogge Stone, Vector Based, 2*4 directions, ulong4

Posted: Wed Jan 30, 2013 2:08 pm
by smatovic
argh, i forgot the wraps....but the example shows how the idea works...

--
Srdja

Re: Kogge Stone, Vector Based, 2*4 directions...update

Posted: Mon Jun 06, 2016 7:04 pm
by smatovic
just for the record...

the different shifts did not perform well on the gpu,
so i redesigned the move gen a bit,
the same code is run for all pieces,
except castle, en passant and pawn promotion moves.

OepnCL pseudo code:

Code: Select all

    // ####################################################################
    // ### generate all moves except castle, en passant and pawn promo  ### 
    // ####################################################################
  __private bool qs = false;
  __private Bitboard bbTemp;
  __private Bitboard bbWork;
  __private Bitboard bbMove;
  __private Bitboard bbBlockers;
  __private Bitboard bbMe;
  __private Bitboard bbOpp;
  __private ulong4 bbWrap4;
  __private ulong4 bbPro4;
  __private ulong4 bbGen4;
...

for each $piece on $sq do

    bbWork  = SETMASKBB&#40;sq&#41;;
    // get koggestone wraps
    bbWrap4 = wraps4&#91;0&#93;;
    // generator and propagator &#40;piece and empty squares&#41;
    bbGen4  = &#40;ulong4&#41;bbWork;
    bbPro4  = &#40;ulong4&#41;(~bbBlockers&#41;;
    // kogge stone shift left via ulong4 vector
    bbPro4 &= bbWrap4;
    bbGen4 |= bbPro4  & &#40;bbGen4 << shift4&#41;;
    bbPro4 &=           &#40;bbGen4 << shift4&#41;;
    bbGen4 |= bbPro4  & &#40;bbGen4 << 2*shift4&#41;;
    bbPro4 &=           &#40;bbGen4 << 2*shift4&#41;;
    bbGen4 |= bbPro4  & &#40;bbGen4 << 4*shift4&#41;;
    bbGen4  = bbWrap4 & &#40;bbGen4 << shift4&#41;;
    bbTemp |= bbGen4.s0|bbGen4.s1|bbGen4.s2|bbGen4.s3;
    // get koggestone wraps
    bbWrap4 = wraps4&#91;1&#93;;
    // set generator and propagator &#40;piece and empty squares&#41;
    bbGen4  = &#40;ulong4&#41;bbWork;
    bbPro4  = &#40;ulong4&#41;(~bbBlockers&#41;;
    // kogge stone shift right via ulong4 vector
    bbPro4 &= bbWrap4;
    bbGen4 |= bbPro4  & &#40;bbGen4 >> shift4&#41;;
    bbPro4 &=           &#40;bbGen4 >> shift4&#41;;
    bbGen4 |= bbPro4  & &#40;bbGen4 >> 2*shift4&#41;;
    bbPro4 &=           &#40;bbGen4 >> 2*shift4&#41;;
    bbGen4 |= bbPro4  & &#40;bbGen4 >> 4*shift4&#41;;
    bbGen4  = bbWrap4 & &#40;bbGen4 >> shift4&#41;;
    bbTemp |= bbGen4.s0|bbGen4.s1|bbGen4.s2|bbGen4.s3;
    // consider knights
    bbTemp  = &#40;piece==KNIGHT&#41;?BBFULL&#58;bbTemp;
    // verify captures
    n       = &#40;piece==PAWN&#41;?stm&#58;piece;
    bbWork  = AttackTables&#91;n*64+sq&#93;;
    bbMove  = bbWork&bbTemp&bbOpp;
    // verify non captures
    bbWork  = &#40;piece==PAWN&#41;?&#40;AttackTablesPawnPushes&#91;stm*64+sq&#93;)&#58;bbWork;
    bbMove |= &#40;qs&#41;?BBEMPTY&#58;&#40;bbWork&bbTemp&~bbBlockers&#41;;

end

here the needed consta&#324;t tables

Code: Select all

// move generator costants
// wraps for kogge stone shifts
__constant ulong4 shift4 = &#40;ulong4&#41;( 9, 1, 7, 8&#41;;

__constant ulong4 wraps4&#91;2&#93; =
&#123;
  // wraps shift left
  &#40;ulong4&#41;(
            0xfefefefefefefe00, // <<9
            0xfefefefefefefefe, // <<1 */
            0x7f7f7f7f7f7f7f00, // <<7 */
            0xffffffffffffff00  // <<8 */
          ),

  // wraps shift right
  &#40;ulong4&#41;(
            0x007f7f7f7f7f7f7f, // >>9
            0x7f7f7f7f7f7f7f7f, // >>1 
            0x00fefefefefefefe, // >>7
            0x00ffffffffffffff  // >>8
          )
&#125;;
__constant Bitboard AttackTablesPawnPushes&#91;2*64&#93; = 
&#123;
  /* white pawn pushes */
  0x100,0x200,0x400,0x800,0x1000,0x2000,0x4000,0x8000,0x1010000,0x2020000,0x4040000,0x8080000,0x10100000,0x20200000,0x40400000,0x80800000,0x1000000,0x2000000,0x4000000,0x8000000,0x10000000,0x20000000,0x40000000,0x80000000,0x100000000,0x200000000,0x400000000,0x800000000,0x1000000000,0x2000000000,0x4000000000,0x8000000000,0x10000000000,0x20000000000,0x40000000000,0x80000000000,0x100000000000,0x200000000000,0x400000000000,0x800000000000,0x1000000000000,0x2000000000000,0x4000000000000,0x8000000000000,0x10000000000000,0x20000000000000,0x40000000000000,0x80000000000000,0x100000000000000,0x200000000000000,0x400000000000000,0x800000000000000,0x1000000000000000,0x2000000000000000,0x4000000000000000,0x8000000000000000,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,
  /* black pawn pushes */
  0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x1,0x2,0x4,0x8,0x10,0x20,0x40,0x80,0x100,0x200,0x400,0x800,0x1000,0x2000,0x4000,0x8000,0x10000,0x20000,0x40000,0x80000,0x100000,0x200000,0x400000,0x800000,0x1000000,0x2000000,0x4000000,0x8000000,0x10000000,0x20000000,0x40000000,0x80000000,0x100000000,0x200000000,0x400000000,0x800000000,0x1000000000,0x2000000000,0x4000000000,0x8000000000,0x10100000000,0x20200000000,0x40400000000,0x80800000000,0x101000000000,0x202000000000,0x404000000000,0x808000000000,0x1000000000000,0x2000000000000,0x4000000000000,0x8000000000000,0x10000000000000,0x20000000000000,0x40000000000000,0x80000000000000
&#125;;
/* piece attack tables */
__constant Bitboard AttackTables&#91;7*64&#93; = 
&#123;
  /* white pawn */
  0x200,0x500,0xa00,0x1400,0x2800,0x5000,0xa000,0x4000,0x20000,0x50000,0xa0000,0x140000,0x280000,0x500000,0xa00000,0x400000,0x2000000,0x5000000,0xa000000,0x14000000,0x28000000,0x50000000,0xa0000000,0x40000000,0x200000000,0x500000000,0xa00000000,0x1400000000,0x2800000000,0x5000000000,0xa000000000,0x4000000000,0x20000000000,0x50000000000,0xa0000000000,0x140000000000,0x280000000000,0x500000000000,0xa00000000000,0x400000000000,0x2000000000000,0x5000000000000,0xa000000000000,0x14000000000000,0x28000000000000,0x50000000000000,0xa0000000000000,0x40000000000000,0x200000000000000,0x500000000000000,0xa00000000000000,0x1400000000000000,0x2800000000000000,0x5000000000000000,0xa000000000000000,0x4000000000000000,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,
  /* black pawn */
  0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x0,0x2,0x5,0xa,0x14,0x28,0x50,0xa0,0x40,0x200,0x500,0xa00,0x1400,0x2800,0x5000,0xa000,0x4000,0x20000,0x50000,0xa0000,0x140000,0x280000,0x500000,0xa00000,0x400000,0x2000000,0x5000000,0xa000000,0x14000000,0x28000000,0x50000000,0xa0000000,0x40000000,0x200000000,0x500000000,0xa00000000,0x1400000000,0x2800000000,0x5000000000,0xa000000000,0x4000000000,0x20000000000,0x50000000000,0xa0000000000,0x140000000000,0x280000000000,0x500000000000,0xa00000000000,0x400000000000,0x2000000000000,0x5000000000000,0xa000000000000,0x14000000000000,0x28000000000000,0x50000000000000,0xa0000000000000,0x40000000000000,
  /* knight */
  0x20400,0x50800,0xa1100,0x142200,0x284400,0x508800,0xa01000,0x402000,0x2040004,0x5080008,0xa110011,0x14220022,0x28440044,0x50880088,0xa0100010,0x40200020,0x204000402,0x508000805,0xa1100110a,0x1422002214,0x2844004428,0x5088008850,0xa0100010a0,0x4020002040,0x20400040200,0x50800080500,0xa1100110a00,0x142200221400,0x284400442800,0x508800885000,0xa0100010a000,0x402000204000,0x2040004020000,0x5080008050000,0xa1100110a0000,0x14220022140000,0x28440044280000,0x50880088500000,0xa0100010a00000,0x40200020400000,0x204000402000000,0x508000805000000,0xa1100110a000000,0x1422002214000000,0x2844004428000000,0x5088008850000000,0xa0100010a0000000,0x4020002040000000,0x400040200000000,0x800080500000000,0x1100110a00000000,0x2200221400000000,0x4400442800000000,0x8800885000000000,0x100010a000000000,0x2000204000000000,0x4020000000000,0x8050000000000,0x110a0000000000,0x22140000000000,0x44280000000000,0x88500000000000,0x10a00000000000,0x20400000000000,
  /* king */
  0x302,0x705,0xe0a,0x1c14,0x3828,0x7050,0xe0a0,0xc040,0x30203,0x70507,0xe0a0e,0x1c141c,0x382838,0x705070,0xe0a0e0,0xc040c0,0x3020300,0x7050700,0xe0a0e00,0x1c141c00,0x38283800,0x70507000,0xe0a0e000,0xc040c000,0x302030000,0x705070000,0xe0a0e0000,0x1c141c0000,0x3828380000,0x7050700000,0xe0a0e00000,0xc040c00000,0x30203000000,0x70507000000,0xe0a0e000000,0x1c141c000000,0x382838000000,0x705070000000,0xe0a0e0000000,0xc040c0000000,0x3020300000000,0x7050700000000,0xe0a0e00000000,0x1c141c00000000,0x38283800000000,0x70507000000000,0xe0a0e000000000,0xc040c000000000,0x302030000000000,0x705070000000000,0xe0a0e0000000000,0x1c141c0000000000,0x3828380000000000,0x7050700000000000,0xe0a0e00000000000,0xc040c00000000000,0x203000000000000,0x507000000000000,0xa0e000000000000,0x141c000000000000,0x2838000000000000,0x5070000000000000,0xa0e0000000000000,0x40c0000000000000,
  /* bishop */
  0x8040201008040200,0x80402010080500,0x804020110a00,0x8041221400,0x182442800,0x10204885000,0x102040810a000,0x102040810204000,0x4020100804020002,0x8040201008050005,0x804020110a000a,0x804122140014,0x18244280028,0x1020488500050,0x102040810a000a0,0x204081020400040,0x2010080402000204,0x4020100805000508,0x804020110a000a11,0x80412214001422,0x1824428002844,0x102048850005088,0x2040810a000a010,0x408102040004020,0x1008040200020408,0x2010080500050810,0x4020110a000a1120,0x8041221400142241,0x182442800284482,0x204885000508804,0x40810a000a01008,0x810204000402010,0x804020002040810,0x1008050005081020,0x20110a000a112040,0x4122140014224180,0x8244280028448201,0x488500050880402,0x810a000a0100804,0x1020400040201008,0x402000204081020,0x805000508102040,0x110a000a11204080,0x2214001422418000,0x4428002844820100,0x8850005088040201,0x10a000a010080402,0x2040004020100804,0x200020408102040,0x500050810204080,0xa000a1120408000,0x1400142241800000,0x2800284482010000,0x5000508804020100,0xa000a01008040201,0x4000402010080402,0x2040810204080,0x5081020408000,0xa112040800000,0x14224180000000,0x28448201000000,0x50880402010000,0xa0100804020100,0x40201008040201,
  /* rook */
  0x1010101010101fe,0x2020202020202fd,0x4040404040404fb,0x8080808080808f7,0x10101010101010ef,0x20202020202020df,0x40404040404040bf,0x808080808080807f,0x10101010101fe01,0x20202020202fd02,0x40404040404fb04,0x80808080808f708,0x101010101010ef10,0x202020202020df20,0x404040404040bf40,0x8080808080807f80,0x101010101fe0101,0x202020202fd0202,0x404040404fb0404,0x808080808f70808,0x1010101010ef1010,0x2020202020df2020,0x4040404040bf4040,0x80808080807f8080,0x1010101fe010101,0x2020202fd020202,0x4040404fb040404,0x8080808f7080808,0x10101010ef101010,0x20202020df202020,0x40404040bf404040,0x808080807f808080,0x10101fe01010101,0x20202fd02020202,0x40404fb04040404,0x80808f708080808,0x101010ef10101010,0x202020df20202020,0x404040bf40404040,0x8080807f80808080,0x101fe0101010101,0x202fd0202020202,0x404fb0404040404,0x808f70808080808,0x1010ef1010101010,0x2020df2020202020,0x4040bf4040404040,0x80807f8080808080,0x1fe010101010101,0x2fd020202020202,0x4fb040404040404,0x8f7080808080808,0x10ef101010101010,0x20df202020202020,0x40bf404040404040,0x807f808080808080,0xfe01010101010101,0xfd02020202020202,0xfb04040404040404,0xf708080808080808,0xef10101010101010,0xdf20202020202020,0xbf40404040404040,0x7f80808080808080,
  /* queen */
  0x81412111090503fe,0x2824222120a07fd,0x404844424150efb,0x8080888492a1cf7,0x10101011925438ef,0x2020212224a870df,0x404142444850e0bf,0x8182848890a0c07f,0x412111090503fe03,0x824222120a07fd07,0x4844424150efb0e,0x80888492a1cf71c,0x101011925438ef38,0x20212224a870df70,0x4142444850e0bfe0,0x82848890a0c07fc0,0x2111090503fe0305,0x4222120a07fd070a,0x844424150efb0e15,0x888492a1cf71c2a,0x1011925438ef3854,0x212224a870df70a8,0x42444850e0bfe050,0x848890a0c07fc0a0,0x11090503fe030509,0x22120a07fd070a12,0x4424150efb0e1524,0x88492a1cf71c2a49,0x11925438ef385492,0x2224a870df70a824,0x444850e0bfe05048,0x8890a0c07fc0a090,0x90503fe03050911,0x120a07fd070a1222,0x24150efb0e152444,0x492a1cf71c2a4988,0x925438ef38549211,0x24a870df70a82422,0x4850e0bfe0504844,0x90a0c07fc0a09088,0x503fe0305091121,0xa07fd070a122242,0x150efb0e15244484,0x2a1cf71c2a498808,0x5438ef3854921110,0xa870df70a8242221,0x50e0bfe050484442,0xa0c07fc0a0908884,0x3fe030509112141,0x7fd070a12224282,0xefb0e1524448404,0x1cf71c2a49880808,0x38ef385492111010,0x70df70a824222120,0xe0bfe05048444241,0xc07fc0a090888482,0xfe03050911214181,0xfd070a1222428202,0xfb0e152444840404,0xf71c2a4988080808,0xef38549211101010,0xdf70a82422212020,0xbfe0504844424140,0x7fc0a09088848281,
&#125;;
I implemented it also on cpu, but it is not competitive to magic bitboards...at least it should be SIMD friendly now...

--
Srdja