Split Index Super Set Yielding (SISSY) Bitboards

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Mike Sherwin »

Could it really have been this easy all this time and I just could not think of it?

Adding this to the bottom of InitializeBSS() ...

Code: Select all

  for (sq = 0; sq < 64; sq++) {
    for (i = 0; i < 128; i++) {
      for (j = 0; j < 128; j++) {
        if (!(i & j)) {
          bss[sq][i | j][0] = bss[sq][i][0] & bss[sq][j][3];
          bss[sq][i | j][1] = bss[sq][i][1] & bss[sq][j][4];
          bss[sq][i | j][2] = bss[sq][i][2] & bss[sq][j][5];
        }
      }
    }
  }
... looks correct but is it? Time to test!

The complete function.

Code: Select all

void InitializeBSS() {
  u08 sq, sqr, i, j, k, l;
  s08 x, dx, y, dy;
  u64 b, bb;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for (i = 0; i < 128; i++) {
      if (i ^ 1) {
        j = i >> 1;
        for (k = 8, l = 0; k <= 48; k += 8, l++) {
          bb = 0;
          b = (u64)i << k;
          for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
          }
          for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
          }
          for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
          }
          for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
          }
          bss[sq][j][l] = bb;
        }
      }
    }
  }

  for (sq = 0; sq < 64; sq++) {
    for (i = 0; i < 128; i++) {
      for (j = 0; j < 128; j++) {
        if (!(i & j)) {
          bss[sq][i | j][0] = bss[sq][i][0] & bss[sq][j][3];
          bss[sq][i | j][1] = bss[sq][i][1] & bss[sq][j][4];
          bss[sq][i | j][2] = bss[sq][i][2] & bss[sq][j][5];
        }
      }
    }
  }

}
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Mike Sherwin »

Mike Sherwin wrote: Sat Feb 13, 2021 8:53 pm Could it really have been this easy all this time and I just could not think of it?

Adding this to the bottom of InitializeBSS() ...

Code: Select all

  for (sq = 0; sq < 64; sq++) {
    for (i = 0; i < 128; i++) {
      for (j = 0; j < 128; j++) {
        if (!(i & j)) {
          bss[sq][i | j][0] = bss[sq][i][0] & bss[sq][j][3];
          bss[sq][i | j][1] = bss[sq][i][1] & bss[sq][j][4];
          bss[sq][i | j][2] = bss[sq][i][2] & bss[sq][j][5];
        }
      }
    }
  }
... looks correct but is it? Time to test!

The complete function.

Code: Select all

void InitializeBSS() {
  u08 sq, sqr, i, j, k, l;
  s08 x, dx, y, dy;
  u64 b, bb;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for (i = 0; i < 128; i++) {
      if (i ^ 1) {
        j = i >> 1;
        for (k = 8, l = 0; k <= 48; k += 8, l++) {
          bb = 0;
          b = (u64)i << k;
          for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
          }
          for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
          }
          for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
          }
          for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
          }
          bss[sq][j][l] = bb;
        }
      }
    }
  }

  for (sq = 0; sq < 64; sq++) {
    for (i = 0; i < 128; i++) {
      for (j = 0; j < 128; j++) {
        if (!(i & j)) {
          bss[sq][i | j][0] = bss[sq][i][0] & bss[sq][j][3];
          bss[sq][i | j][1] = bss[sq][i][1] & bss[sq][j][4];
          bss[sq][i | j][2] = bss[sq][i][2] & bss[sq][j][5];
        }
      }
    }
  }

}
Nope, not that simple. The problem is all the pieces contribute to the indexes. Therefore, the simple test for collisions is not correct. A way to fix this that comes to mind is to and the occupancy bits with the empty board bishop bitboard to limit the number of indexes that point to supersets. But, that would add an extra step to both initialization and to generation. And I'm not even sure that I am thinking correctly. So unless someone else steps in with a solution I won't be seen here again unless I can solve it. Bye for now! :)
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by mar »

Mike Sherwin wrote: Sat Feb 13, 2021 12:57 pm

Code: Select all

U64 bss[6][64][64]; // 192 K

U64 bishopAttacks(U64 occ, enumSquare sq) {
  return
    bss[0][sq][(occ >>  9) & 63] &  // 2nd rank
    bss[1][sq][(occ >> 17) & 63] &  
    bss[2][sq][(occ >> 25) & 63] &  
    bss[3][sq][(occ >> 33) & 63] &  
    bss[4][sq][(occ >> 41) & 63] &  
    bss[5][sq][(occ >> 49) & 63] ;  // 7th rank
}    
hmm, looking at this code, I wonder if one could fuse some the occupancies in a simple way, cutting it down to 3 lookups by using (much) bigger table, say:

Code: Select all


// 6MB!!
uint64_t bss[3][16][64*256];

uint64_t bishopAttacks2(uint64_t occ, uint8_t sq)
{
    const uint64_t tmp = (uint64_t)63;
    const uint64_t mask =
        (tmp << 9) |
        (tmp << 17) |
        (tmp << 25) |
        (tmp << 33) |
        (tmp << 41) |
        (tmp << 49);
    occ &= mask;

    int sqmsb2 = (sq & (3 << 4))*4;
    int sqlsb4 = sq & 15;

  return
    bss[0][sqlsb4][(occ >>  9) + sqmsb2] &  // 2nd&3rd rank
    bss[1][sqlsb4][(occ >>  25) + sqmsb2] &  // 4th&5th rank
    bss[2][sqlsb4][(occ >>  41) + sqmsb2];  // 6th&7th rank
}

the basic idea is to pre-mask and utilize 2 bits of sq to fill gaps in the table
it's very likely that I've made a mistake somewhere as I didn't test this, but the code that gcc produces looks like this:

Code: Select all

bishopAttacks2(unsigned long, unsigned char):
        movabs  rax, 35604928818740736
        lea     rdx, [0+rsi*4]
        and     esi, 15
        and     rdi, rax
        and     edx, 192
        sal     rsi, 14
        mov     rax, rdi
        mov     rcx, rdi
        shr     rdi, 41
        shr     rax, 9
        shr     rcx, 25
        add     rax, rdx
        add     rax, rsi
        add     rsi, rdx
        lea     rdx, [rcx+262144+rsi]
        mov     rax, QWORD PTR pbss[0+rax*8]
        and     rax, QWORD PTR pbss[0+rdx*8]
        lea     rdx, [rsi+524288+rdi]
        and     rax, QWORD PTR pbss[0+rdx*8]
        ret
perhaps more micro-optimizations could be made, like the square setup code, one might lookup a pointer table with sq to fetch the offset table
pointer and so on

I have my doubts though that this would help much (it would likely fry the cache), but who knows... not to mention that this was only the bishops...

I wonder if one might exploit the fact that A&B = B&A is some useful way
Martin Sedlak
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by mar »

refining the idea further, simplifying square setup, because it can be precomputed into an offset table:

Code: Select all

uint64_t pbss3_0[16*64*256];
uint64_t pbss3_1[16*64*256];
uint64_t pbss3_2[16*64*256];
uint32_t pbss3ofs[64];

uint64_t bishopAttacks3(uint64_t occ, uint32_t sq)
{
    const uint64_t tmp = (uint64_t)63;
    const uint64_t mask =
        (tmp << 9) |
        (tmp << 17) |
        (tmp << 25) |
        (tmp << 33) |
        (tmp << 41) |
        (tmp << 49);
    occ &= mask;

    uint32_t ofs = pbss3ofs[sq];

  return
    pbss3_0[(occ >>  9) + ofs] &  // 2nd&3rd rank
    pbss3_1[(occ >>  25) + ofs] &  // 4th&5th rank
    pbss3_2[(occ >>  41) + ofs];  // 6th&7th rank
}

Code: Select all

bishopAttacks3(unsigned long, unsigned int):
        movabs  rax, 35604928818740736
        mov     esi, esi
        and     rdi, rax
        mov     edx, DWORD PTR pbss3ofs[0+rsi*4]
        mov     rax, rdi
        mov     rcx, rdi
        shr     rdi, 41
        shr     rax, 9
        shr     rcx, 25
        add     rdi, rdx
        add     rax, rdx
        add     rcx, rdx
        mov     rax, QWORD PTR pbss3_0[0+rax*8]
        and     rax, QWORD PTR pbss3_1[0+rcx*8]
        and     rax, QWORD PTR pbss3_2[0+rdi*8]
        ret
again, no idea if I haven't made a mistake somewhere, but apart from probably frying the cache, the final assembly looks pretty good I think
Martin Sedlak
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by mar »

of course I'm stupid, "premasking" won't work, so let's fix it:

Code: Select all

uint64_t pbss4_0[16*64*256];
uint64_t pbss4_1[16*64*256];
uint64_t pbss4_2[16*64*256];
uint32_t pbss4ofs[64];

uint64_t bishopAttacks4(uint64_t occ, uint32_t sq)
{
    const uint32_t mask = 63 + (63 << 8);
    uint32_t ofs = pbss4ofs[sq];

  return
    pbss4_0[((occ >>  9) & mask) + ofs] &  // 2nd&3rd rank
    pbss4_1[((occ >>  25) & mask) + ofs] &  // 4th&5th rank
    pbss4_2[((occ >>  41) & mask) + ofs];  // 6th&7th rank
}

Code: Select all

bishopAttacks4(unsigned long, unsigned int):
        mov     esi, esi
        mov     rax, rdi
        mov     rdx, rdi
        shr     rdi, 41
        mov     ecx, DWORD PTR pbss4ofs[0+rsi*4]
        shr     rax, 9
        shr     rdx, 25
        and     edi, 16191
        and     eax, 16191
        and     edx, 16191
        add     rax, rcx
        add     rdx, rcx
        add     rdi, rcx
        mov     rax, QWORD PTR pbss4_0[0+rax*8]
        and     rax, QWORD PTR pbss4_1[0+rdx*8]
        and     rax, QWORD PTR pbss4_2[0+rdi*8]
        ret
Martin Sedlak
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by mar »

and final version, helping the compiler further, baking the offset as pointer into merged table, applying offsets on the go:
(because mov r,[r*8] implies 32-bit displacement anyway)

Code: Select all

uint64_t pbss5_merged[3*16*64*256];
const uint64_t *pbss5ptr[64];

uint64_t bishopAttacks5(uint64_t occ, uint32_t sq)
{
    const uint32_t mask = 63 + (63 << 8);
    const uint64_t *ptr = pbss5ptr[sq];
    const uint32_t ofs = 16*64*256;

  return
    ptr[((occ >>  9) & mask)] &  // 2nd&3rd rank
    ptr[((occ >>  25) & mask) + 1*ofs] &  // 4th&5th rank
    ptr[((occ >>  41) & mask) + 2*ofs];  // 6th&7th rank
}

Code: Select all

bishopAttacks5(unsigned long, unsigned int):
        mov     esi, esi
        mov     rax, rdi
        mov     rcx, rdi
        shr     rdi, 9
        mov     rdx, QWORD PTR pbss5ptr[0+rsi*8]
        shr     rax, 25
        shr     rcx, 41
        and     edi, 16191
        and     eax, 16191
        and     ecx, 16191
        mov     rax, QWORD PTR [rdx+2097152+rax*8]
        and     rax, QWORD PTR [rdx+4194304+rcx*8]
        and     rax, QWORD PTR [rdx+rdi*8]
        ret
Martin Sedlak
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by mar »

one last thing - masking occupancy with relevant bits only (i.e. and with pseudo-attacks for sq) one might in general improve caching behavior a lot.
in the case above, that would mean 1 extra cheap lookup (1 instruction)

magics do the same thing, obviously (but for a different reason) - I wonder if plain SISSY might benefit from this masking as well? i.e. trade off extra lookup for better cache utilization
Martin Sedlak
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Mike Sherwin »

mar wrote: Sun Feb 14, 2021 1:52 am and final version, helping the compiler further, baking the offset as pointer into merged table, applying offsets on the go:
(because mov r,[r*8] implies 32-bit displacement anyway)

Code: Select all

uint64_t pbss5_merged[3*16*64*256];
const uint64_t *pbss5ptr[64];

uint64_t bishopAttacks5(uint64_t occ, uint32_t sq)
{
    const uint32_t mask = 63 + (63 << 8);
    const uint64_t *ptr = pbss5ptr[sq];
    const uint32_t ofs = 16*64*256;

  return
    ptr[((occ >>  9) & mask)] &  // 2nd&3rd rank
    ptr[((occ >>  25) & mask) + 1*ofs] &  // 4th&5th rank
    ptr[((occ >>  41) & mask) + 2*ofs];  // 6th&7th rank
}

Code: Select all

bishopAttacks5(unsigned long, unsigned int):
        mov     esi, esi
        mov     rax, rdi
        mov     rcx, rdi
        shr     rdi, 9
        mov     rdx, QWORD PTR pbss5ptr[0+rsi*8]
        shr     rax, 25
        shr     rcx, 41
        and     edi, 16191
        and     eax, 16191
        and     ecx, 16191
        mov     rax, QWORD PTR [rdx+2097152+rax*8]
        and     rax, QWORD PTR [rdx+4194304+rcx*8]
        and     rax, QWORD PTR [rdx+rdi*8]
        ret
Hi Martin, I just finished a long think while half sleeping. I have no idea how long it was. Wow! This is really something to wake up to! If this really does work it is some incredible work in such a short time. Will you be posting the initialization code? And the assembly looks fast!!! During my working knap I think that I also came up with a solution. Just store supersets for the actual blocker bits and then convert occupancy to blockers in the generation code and the supersets can be anded together correctly. Once that is done a further reduction to just two lookups would be next. However, your code looks incredibly fast and is probably better. I can't wait to see this working. :)
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by Mike Sherwin »

mar wrote: Sun Feb 14, 2021 3:19 am one last thing - masking occupancy with relevant bits only (i.e. and with pseudo-attacks for sq) one might in general improve caching behavior a lot.
in the case above, that would mean 1 extra cheap lookup (1 instruction)

magics do the same thing, obviously (but for a different reason) - I wonder if plain SISSY might benefit from this masking as well? i.e. trade off extra lookup for better cache utilization
I think we are on the same page with what I posted last. I did not consider the positive effect on the cache but that makes sense because there would be far less memory cells to bring in and out of the cache. Do I understand that correctly? Anyway thanks for all your work.
mar
Posts: 2554
Joined: Fri Nov 26, 2010 2:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Split Index Super Set Yielding (SISSY) Bitboards

Post by mar »

I might try (this is just a concept, I've no idea if it'll work), but like I said it has two problems:
1) the table is now 6 megabytes (!) for bishops only
2) the assembly looks fast, but the last 3 instructions will be extremenly cache-inefficient, so it's possible that it would actually run slower than plain SISSY

there's of course only one way to say for sure - to test it in practice. that will take some time, of course. I'd have to find your initialization code first,
I remember I did some experiments in the past but I can't seem to find the code...

and there's also a good chance that I messed up somewhere as this is totally untested

as for pre-masking occupancy with pseudo-attack mask at square: yes, that's exactly what I meant.
but I've no idea if it would be actually beneficial to offset the extra lookup, because the individual lookups in plain SISSY only access a small portion of memory, still it should be trivial to test and see if it helps
Martin Sedlak