Split Index Super Set Yielding (SISSY) Bitboards

Mike Sherwin · Post by **Mike Sherwin** » Sun Feb 14, 2021 5:10 am

mar wrote: ↑Sun Feb 14, 2021 3:49 am I might try (this is just a concept, I've no idea if it'll work), but like I said it has two problems:
1) the table is now 6 megabytes (!) for bishops only
2) the assembly looks fast, but the last 3 instructions will be extremenly cache-inefficient, so it's possible that it would actually run slower than plain SISSY

there's of course only one way to say for sure - to test it in practice. that will take some time, of course. I'd have to find your initialization code first,
I remember I did some experiments in the past but I can't seem to find the code...

and there's also a good chance that I messed up somewhere as this is totally untested

as for pre-masking occupancy with pseudo-attack mask at square: yes, that's exactly what I meant.
but I've no idea if it would be actually beneficial to offset the extra lookup, because the individual lookups in plain SISSY only access a small portion of memory, still it should be trivial to test and see if it helps

I can at least verify that using blockers instead of occupancy does save time. Perft 7 using queen code for all pieces was 204 seconds. After adding the bishop code perft 7 took 207 seconds. After switching over to blockers perft 7 was 202 seconds. So a 2.47% speedup by adding an extra instruction.

I just wanted it to look like I was still trying to do something.

mar · Post by **mar** » Sun Feb 14, 2021 11:42 am

ok, so I made it work and it seems slightly faster than the previous SISSY bishop code I had.
surprisingly, masking relevant occupancy seems to slow things down, both for the new fused version and old SISSY (at least for me, in perft 7)
also the code I found used queen attacks for rooks, even though I found had your rnk table but it wasn't used, no idea

so here's the initialization code (note that the old sissy version I found used some fancy "clever" scheme):

Code: Select all

static uint64_t pbss5_merged[3*16*64*256];
static uint64_t *pbss5ptr[64];

	static inline uint64_t bishopAttacks5(uint64_t occ, uint32_t sq)
	{
		const uint32_t mask = 63 + (63 << 8);
		const uint64_t *ptr = pbss5ptr[sq];
		const uint32_t ofs = 16*64*256;

	  return
		ptr[((occ >>  9) & mask)] &  // 2nd&3rd rank
		ptr[((occ >>  25) & mask) + 1*ofs] &  // 4th&5th rank
		ptr[((occ >>  41) & mask) + 2*ofs];  // 6th&7th rank
	}

init code:

Code: Select all

	for (uint32_t sq4lsb=0; sq4lsb<16; sq4lsb++)
	{
		for (uint32_t occ = 0; occ<64*256; occ++)
		{
			uint32_t sq2msb = (occ >> 6) & 3;

			uint32_t sq = sq4lsb + (sq2msb << 4);

			auto &tmp = bss[sq];

			auto bss0 = tmp.value[(occ & 63)*8+0];
			auto bss1 = tmp.value[((occ >> 8) & 63)*8+1];

			auto bss2 = tmp.value[(occ & 63)*8+2];
			auto bss3 = tmp.value[((occ >> 8) & 63)*8+3];

			auto bss4 = tmp.value[(occ & 63)*8+4];
			auto bss5 = tmp.value[((occ >> 8) & 63)*8+5];

			pbss5_merged[sq4lsb*64*256 + occ] = bss0 & bss1;
			pbss5_merged[sq4lsb*64*256 + 1*16*64*256 + occ] = bss2 & bss3;
			pbss5_merged[sq4lsb*64*256 + 2*16*64*256 + occ] = bss4 & bss5;

			uint32_t offset = sq4lsb*64*256 + (sq2msb << 6);

			pbss5ptr[sq] = pbss5_merged + offset;
		}
	}

note that all those tmp.value[x*8+0] is simply manual indexing into tmp.value[x][0], but it should be trivial to rewrite

the final timings for perft 7 (bulk counting) versus magics is: with magics perft runs ~6.9% faster, but again I had quite a bit of noise between runs,
so it should be taken with a grain of salt
also this is with microsoft compiler, not gcc

I've also tried a hybid (this new fused SISSY + rook magics) and it was still slower than plain magics
a hybrid magic + queen SISSY was also slower than plain magics (I was hoping it might not be the case) - maybe again because of extra memory that has to be accessed, I don't know

very hard to beat magics

Mike Sherwin · Post by **Mike Sherwin** » Sun Feb 14, 2021 6:13 pm

mar wrote: ↑Sun Feb 14, 2021 11:42 am ok, so I made it work and it seems slightly faster than the previous SISSY bishop code I had.
surprisingly, masking relevant occupancy seems to slow things down, both for the new fused version and old SISSY (at least for me, in perft 7)
also the code I found used queen attacks for rooks, even though I found had your rnk table but it wasn't used, no idea

so here's the initialization code (note that the old sissy version I found used some fancy "clever" scheme):
Code: Select all
static uint64_t pbss5_merged[3*16*64*256];
static uint64_t *pbss5ptr[64];

	static inline uint64_t bishopAttacks5(uint64_t occ, uint32_t sq)
	{
		const uint32_t mask = 63 + (63 << 8);
		const uint64_t *ptr = pbss5ptr[sq];
		const uint32_t ofs = 16*64*256;

	  return
		ptr[((occ >>  9) & mask)] &  // 2nd&3rd rank
		ptr[((occ >>  25) & mask) + 1*ofs] &  // 4th&5th rank
		ptr[((occ >>  41) & mask) + 2*ofs];  // 6th&7th rank
	}
init code:
Code: Select all
	for (uint32_t sq4lsb=0; sq4lsb<16; sq4lsb++)
	{
		for (uint32_t occ = 0; occ<64*256; occ++)
		{
			uint32_t sq2msb = (occ >> 6) & 3;

			uint32_t sq = sq4lsb + (sq2msb << 4);

			auto &tmp = bss[sq];

			auto bss0 = tmp.value[(occ & 63)*8+0];
			auto bss1 = tmp.value[((occ >> 8) & 63)*8+1];

			auto bss2 = tmp.value[(occ & 63)*8+2];
			auto bss3 = tmp.value[((occ >> 8) & 63)*8+3];

			auto bss4 = tmp.value[(occ & 63)*8+4];
			auto bss5 = tmp.value[((occ >> 8) & 63)*8+5];

			pbss5_merged[sq4lsb*64*256 + occ] = bss0 & bss1;
			pbss5_merged[sq4lsb*64*256 + 1*16*64*256 + occ] = bss2 & bss3;
			pbss5_merged[sq4lsb*64*256 + 2*16*64*256 + occ] = bss4 & bss5;

			uint32_t offset = sq4lsb*64*256 + (sq2msb << 6);

			pbss5ptr[sq] = pbss5_merged + offset;
		}
	}
note that all those tmp.value[x*8+0] is simply manual indexing into tmp.value[x][0], but it should be trivial to rewrite

the final timings for perft 7 (bulk counting) versus magics is: with magics perft runs ~6.9% faster, but again I had quite a bit of noise between runs,
so it should be taken with a grain of salt
also this is with microsoft compiler, not gcc

I've also tried a hybid (this new fused SISSY + rook magics) and it was still slower than plain magics
a hybrid magic + queen SISSY was also slower than plain magics (I was hoping it might not be the case) - maybe again because of extra memory that has to be accessed, I don't know

very hard to beat magics

Thanks Martin for trying.

Mike Sherwin · Post by **Mike Sherwin** » Mon Apr 12, 2021 1:43 am

mar wrote: ↑Sun Feb 14, 2021 11:42 am ok, so I made it work and it seems slightly faster than the previous SISSY bishop code I had.
surprisingly, masking relevant occupancy seems to slow things down, both for the new fused version and old SISSY (at least for me, in perft 7)
also the code I found used queen attacks for rooks, even though I found had your rnk table but it wasn't used, no idea

so here's the initialization code (note that the old sissy version I found used some fancy "clever" scheme):
Code: Select all
static uint64_t pbss5_merged[3*16*64*256];
static uint64_t *pbss5ptr[64];

	static inline uint64_t bishopAttacks5(uint64_t occ, uint32_t sq)
	{
		const uint32_t mask = 63 + (63 << 8);
		const uint64_t *ptr = pbss5ptr[sq];
		const uint32_t ofs = 16*64*256;

	  return
		ptr[((occ >>  9) & mask)] &  // 2nd&3rd rank
		ptr[((occ >>  25) & mask) + 1*ofs] &  // 4th&5th rank
		ptr[((occ >>  41) & mask) + 2*ofs];  // 6th&7th rank
	}
init code:
Code: Select all
	for (uint32_t sq4lsb=0; sq4lsb<16; sq4lsb++)
	{
		for (uint32_t occ = 0; occ<64*256; occ++)
		{
			uint32_t sq2msb = (occ >> 6) & 3;

			uint32_t sq = sq4lsb + (sq2msb << 4);

			auto &tmp = bss[sq];

			auto bss0 = tmp.value[(occ & 63)*8+0];
			auto bss1 = tmp.value[((occ >> 8) & 63)*8+1];

			auto bss2 = tmp.value[(occ & 63)*8+2];
			auto bss3 = tmp.value[((occ >> 8) & 63)*8+3];

			auto bss4 = tmp.value[(occ & 63)*8+4];
			auto bss5 = tmp.value[((occ >> 8) & 63)*8+5];

			pbss5_merged[sq4lsb*64*256 + occ] = bss0 & bss1;
			pbss5_merged[sq4lsb*64*256 + 1*16*64*256 + occ] = bss2 & bss3;
			pbss5_merged[sq4lsb*64*256 + 2*16*64*256 + occ] = bss4 & bss5;

			uint32_t offset = sq4lsb*64*256 + (sq2msb << 6);

			pbss5ptr[sq] = pbss5_merged + offset;
		}
	}
note that all those tmp.value[x*8+0] is simply manual indexing into tmp.value[x][0], but it should be trivial to rewrite

the final timings for perft 7 (bulk counting) versus magics is: with magics perft runs ~6.9% faster, but again I had quite a bit of noise between runs,
so it should be taken with a grain of salt
also this is with microsoft compiler, not gcc

I've also tried a hybid (this new fused SISSY + rook magics) and it was still slower than plain magics
a hybrid magic + queen SISSY was also slower than plain magics (I was hoping it might not be the case) - maybe again because of extra memory that has to be accessed, I don't know

very hard to beat magics

Hi Martin,
I've studied your code off and on a few times trying to wrap my brain around it. I think that I am beginning to get a handle on it. I think it is close to what I want to do. My idea, because there are no collisions between ranks 2&5, 3&6 and 4&7 is to combine them into 7 bit indexes reducing the accessed memory.

THUS
[d]8/1b6/2b3b1/3b1b2/4B3/3b1b2/2b3b1/8 w - - 0 1
BECOMES (in the upper 32 bits for one less shift)
[d]8/1b6/2bb1bb1/2bb1bb1/4B3/8/8/8 w - - 0 1
Wouldn't this be what we were aiming for?
Edit: I almost forgot. And if that saved shift can be used to shift rank 7 down one rank ...
[d]8/8/1bbb1bb1/2bb1bb1/4B3/8/8/8 w - - 0 1

mar · Post by **mar** » Mon Apr 12, 2021 4:48 pm

hi Michael,

I don't remember as it was several months ago. I believe I did nothing special but simply used a bigger LUT to process two ranks at once?

Mike Sherwin · Post by **Mike Sherwin** » Wed Apr 14, 2021 2:10 am

mar wrote: ↑Mon Apr 12, 2021 4:48 pm hi Michael,

I don't remember as it was several months ago. I believe I did nothing special but simply used a bigger LUT to process two ranks at once?

I think I am going to be able to get it done!

The first step is not trying to save memory with 6 bit indexes and just using the natural 7 bit index. The second step is just computing the indexes for blockers. Here is the new initialization and bishop code.

Code: Select all

void InitializeBSS() {
  u08 sq, sqr, i, ii, k, l;
  s08 x, dx, y, dy;
  u64 b, bb;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for (i = 0; i < 128; i++) {
      for (k = 8, l = 0; k <= 48; k += 8, l++) {
        bb = 0;
        b = ((u64)i << k) & bob[sq];
        for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
        }
        for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
        }
        bss[sq][i][l] = bb;
      }
    }
  }
}

    case BB:
      blockers = allPieces & bob[fs];
      bb = bss[fs][(blockers >> 8) & 127][0]
        & bss[fs][(blockers >> 16) & 127][1]
        & bss[fs][(blockers >> 24) & 127][2]
        & bss[fs][(blockers >> 32) & 127][3]
        & bss[fs][(blockers >> 40) & 127][4]
        & bss[fs][(blockers >> 48) & 127][5]
        & notme;
      break;

And here is the perft 6 result.

r n b q k b n r
p p p p p p p p
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
P P P P P P P P
R N B Q K B N R

Enter Command(w): perft 6
b1a3: 4856835
b1c3: 5708064
g1f3: 5723523
g1h3: 4877234
a2a3: 4463267
a2a4: 5363555
b2b3: 5310358
b2b4: 5293555
c2c3: 5417640
c2c4: 5866666
d2d3: 8073082
d2d4: 8879566
e2e3: 9726018
e2e4: 9771632
f2f3: 4404141
f2f4: 4890429
g2g3: 5346260
g2g4: 5239875
h2h3: 4463070
h2h4: 5385554

Perft(6) = 119060324

more updates as it progresses.

Mike Sherwin · Post by **Mike Sherwin** » Wed Apr 14, 2021 4:59 am

YES,,,YES YES YES I dyxjubf did it!!!

Code: Select all

union SplitBB {
  struct {
    u32 l32;
    u32 h32;
  };
  u64 bb;
};

u64 bss[64][128][6];

void InitializeBSS() {
  u08 sq, sqr, i, j, k, l;
  s08 x, dx, y, dy;
  u64 b, bb, ii, jj;

  for (sq = 0; sq < 64; sq++) {
    y = sq >> 3;
    x = sq & 7;
    for (i = 0; i < 128; i++) {
      for (k = 8, l = 0; k <= 48; k += 8, l++) {
        bb = 0;
        b = ((u64)i << k) & bob[sq];
        for (dx = +1, dy = +1; x + dx < +8 && y + dy < +8; dx++, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = -1, dy = +1; x + dx > -1 && y + dy < +8; dx--, dy++) {
          sqr = (((y + dy) << 3) + x + dx);
          bb |= one << sqr;
          if ((one << sqr) & b) break;
        }
        for (dx = +1, dy = -1; x + dx < +8 && y + dy > -1; dx++, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
        }
        for (dx = -1, dy = -1; x + dx > -1 && y + dy > -1; dx--, dy--) {
            sqr = (((y + dy) << 3) + x + dx);
            bb |= one << sqr;
            if ((one << sqr) & b) break;
        }
        bss[sq][i][l] = bb;
      }
    }
  }
  for (sq = 0; sq <= 63; sq++) {
    for (ii = 0; ii <= 127; ii++) {
      for (jj = 0; jj <= 127; jj++) {
        i = ((ii <<  8) & bob[sq]) >>  8;
        j = ((jj << 32) & bob[sq]) >> 32;
        bss[sq][i | j][0] = bss[sq][i][0] & bss[sq][j][3];
        i = ((ii << 16) & bob[sq]) >> 16;
        j = ((jj << 40) & bob[sq]) >> 40;
        bss[sq][i | j][1] = bss[sq][i][1] & bss[sq][j][4];
        i = ((ii << 24) & bob[sq]) >> 24;
        j = ((jj << 48) & bob[sq]) >> 48;
        bss[sq][i | j][2] = bss[sq][i][2] & bss[sq][j][5];
      }
    }
  }
}

    case WB:
    case BB:
      blocks.bb = allPieces & bob[fs];
      blocks.bb = (blocks.l32 >> 8) | blocks.h32;
      bb = bss[fs][(blocks.bb >>  0) & 127][0]
         & bss[fs][(blocks.bb >>  8) & 127][1]
         & bss[fs][(blocks.bb >> 16) & 127][2]
         & notme;
      break;

Now I just have to optimise it with pointers like Martin did.

r n b q k b n r
p p p p p p p p
. . . . . . . .
. . . . . . . .
. . . . . . . .
. . . . . . . .
P P P P P P P P
R N B Q K B N R

Enter Command(w): perft 6
b1a3: 4856835
b1c3: 5708064
g1f3: 5723523
g1h3: 4877234
a2a3: 4463267
a2a4: 5363555
b2b3: 5310358
b2b4: 5293555
c2c3: 5417640
c2c4: 5866666
d2d3: 8073082
d2d4: 8879566
e2e3: 9726018
e2e4: 9771632
f2f3: 4404141
f2f4: 4890429
g2g3: 5346260
g2g4: 5239875
h2h3: 4463070
h2h4: 5385554

Perft(6) = 119060324

Mike Sherwin · Post by **Mike Sherwin** » Wed Apr 14, 2021 5:50 am

But first, full union with an optimization that only using a union would permit. It took 5 seconds off of perft 7!

Code: Select all

union SplitBB {
  struct {
    u08 r1;
    u08 r2;
    u08 r3;
    u08 r4;
    u08 r5;
    u08 r6;
    u08 r7;
    u08 r8;
  };
  struct {
    u32 l32;
    u32 h32;
  };
  u64 bb;
};

    case WB:
    case BB:
      blocks.bb = allPieces & bob[fs] & 0x7e7e7e7e7e7e7e7e;
      blocks.bb = ((blocks.l32 >>  8) | blocks.h32);
      bb = bss[fs][blocks.r1][0]
         & bss[fs][blocks.r2][1]
         & bss[fs][blocks.r3][2]
         & notme;
      break;

EDIT: Oh, and another optimization would be to initialize a bob[64] equivalent with 0x7e7e7e7e7e7e7e7e already &'ed in! brb.

Mike Sherwin · Post by **Mike Sherwin** » Wed Apr 14, 2021 6:12 am

Hmm, lost 4 seconds?

Code: Select all

    case WB:
    case BB:
      blocks.bb = aPieces & b7e[fs];
      blocks.bb = ((blocks.l32 >>  8) | blocks.h32);
      bb = bss[fs][blocks.r1][0]
         & bss[fs][blocks.r2][1]
         & bss[fs][blocks.r3][2]
         & notme;

Mike Sherwin · Post by **Mike Sherwin** » Wed Apr 14, 2021 9:56 pm

Seems I had the Split Index Super Set Yielding (SISSY) idea back in 2009.
http://www.talkchess.com/forum3/viewtop ... 71#p308215

Code: Select all

u64 RookAttacks(u32 sq, u64 occ) {
  u32 l = ((u32)( occ >> ((sq & 7) +  7))) & 0x00020202;
  u32 h = ((u32)( occ >> ((sq & 7) + 32))) & 0x00010101;
  u32 r = ((u32)((occ << 1) >> (sq - (sq & 7)))) & 0x000000f6;
  u32 i = ( l | h | r) & 0x0000003f;
  u32 j = (((l | h | r) >> 6) | ((l | h) >> 12)) & 0x0000003f;
  return rookSuperSet1[sq][i] & rookSuperSet2[sq][j];
}

I don't remember having written this and I will have to relearn how it works but maybe it is a good start to optimise the rooks?

Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards

Re: Split Index Super Set Yielding (SISSY) Bitboards