Table-less bitboards (bitrays?)

hgm · Post by **hgm** » Tue Jun 18, 2013 8:38 am

For 16x16 boards magic bitboards seem out of the question, and even with rotated bitboard a single ray would have 14 relevant occupancy bits, requiring a table of 16K entries of 256 bits (512KB) for each of the 256 squares (128MB per orientation). Not very competative.

It recently came to my attention that it would be possible to use bitboard techniques without lookup tables, just using the ALU's carry propagation. If a word would contain the occupancy of a ray (bits set to 1 for each occupied square),

Code: Select all

typedef int BitRay;
BitRay sq, oc; int ray;
sq = squares&#91;orientation&#93;&#91;fromSqr&#93;; // square-number to bit mapping
ray = rayNumber&#91;orientation&#93;&#91;fromSqr&#93;; // lookup dictionary for rays
oc = occupancy&#91;ray&#93;; // relevant part of the 'occupancy' bitboard
oc &= ~sq; // clear from-square
oc -= sq; // flip all bits upto and including nearest occupied
oc ^= occupancy&#91;ray&#93;; // set flipped bits
oc &= ~sq; // remove from-square

would give you the attack set along the ray, but only in one direction. So you would need 8 orientations in stead of the usual 4. You could pack two opposite orientations in one word, however. In x64 this would work upto 32x32 boards. You would need a 'boarder guard' to prevent extension of a ray that hits the board edge into the higher half of the word, but this can be done by simply setting the occupancy of the edge square (requiring an extra oc |= BORDER before the suptraction, where BORDER is a contant).

So there would be no need for table lookup of the attack sets. Just a few tables with 4*boardArea (byte-size) elements, to map (orientation, square) pairs on ray numbers of the board representation (which in itself would consist out of 4*boardWidth rays, (actually boardWidth + 3*boardHeigt if boardWidth > boardHeight), the diagonal rays packed in the same way as in rotated bitboard), and a table of 4*boardArea elements the length of a ray to map out the bits corresponding to a given square in the ray of that orientation.

Has this type of representation ever been used for normal Chess (i.e. 8x8 boards)?

kbhearn · Post by **kbhearn** » Tue Jun 18, 2013 8:59 am

Seems like it would be a pain to get the update code working at first (each piece change needing to be recorded in eight different spots) but i suppose you could probably hide a lot of that behind some functions.

That said, the less table intensive bitboard techniques seem like they'd be just fine with a c++ class wrapping a n-bit int used in the same way 64 bit ints are used for chess. Could build in bitscan-and-reset functions for both directions and the logical operators and that would be all you need to calculate with a mask and bitscan to find the blocker location in each direction. the rest of your code could look just the same as it would for a uint64_t.

edit: your way is probably faster though, considering for a 32x32 board my way would involve 1 MB of ray masks, and each direction would involve 16 masks and up to 16 look ups. bitscanning to iterate a set in my method would also involve a bit of coaxing to make it both pretty and efficient.

hgm · Post by **hgm** » Tue Jun 18, 2013 9:16 am

kbhearn wrote:Seems like it would be a pain to get the update code working at first (each piece change needing to be recorded in eight different spots) but i suppose you could probably hide a lot of that behind some functions.

This is not different from ordinary rotated bitboards, is it? There you also have to update each change in four different spots. When opposite directions are stored head-to-head in a single in, it requires only four updates, each update changing pairs of bits. (So the 'squares' table would have elements that each have two bits set. E.g. for an 8x8 board the a4 element of the ranks orientation would be 0x8001, and of the files orientation 0x1008.) The only difference is that the updates requires an extra indirection through the rayNumber table, to see which part of the board you have to update, while conventional rotated bitboard would always update the entire board (because it fits in a single word).

Like

occupancy[rayNumber[RANKS][from]] ^= squares[RANKS][from];
occupancy[rayNumber[RANKS][to]] ^= squares[RANKS][to];

in stead of

occupancy[RANKS] ^= squares[RANKS][from];
occupancy[RANKS] ^= squares[RANKS][to];

But if the total board (for one orientation) is spread out over many machine words (already four for 16x16), doing the acces in the 'ray directory' to know what element you have to update, and then updating only that, seems already a winner (2 reads + 1 write in stead of 4 reads + 4 writes, but a longer latency as one depends on the other).

That said, the less table intensive bitboard techniques seem like they'd be just fine with a c++ class wrapping a n-bit int used in the same way 64 bit ints are used for chess. Could build in bitscan-and-reset functions for both directions and the logical operators and that would be all you need to calculate with a mask and bitscan to find the blocker location in each direction. the rest of your code could look just the same as it would for a uint64_t.

I am not sure what you consider 'less-table-intensive' bitboard techniques. Above I calculated that even on 16x16 normal rotated bitboard would take 0.5 GB in tables. 32x32 seems completely hopeless.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Jun 18, 2013 10:39 am

hgm wrote:For 16x16 boards magic bitboards seem out of the question, and even with rotated bitboard a single ray would have 14 relevant occupancy bits, requiring a table of 16K entries of 256 bits (512KB) for each of the 256 squares (128MB per orientation). Not very competative.

It recently came to my attention that it would be possible to use bitboard techniques without lookup tables, just using the ALU's carry propagation. If a word would contain the occupancy of a ray (bits set to 1 for each occupied square),
Code: Select all
typedef int BitRay;
BitRay sq, oc; int ray;
sq = squares&#91;orientation&#93;&#91;fromSqr&#93;; // square-number to bit mapping
ray = rayNumber&#91;orientation&#93;&#91;fromSqr&#93;; // lookup dictionary for rays
oc = occupancy&#91;ray&#93;; // relevant part of the 'occupancy' bitboard
oc &= ~sq; // clear from-square
oc -= sq; // flip all bits upto and including nearest occupied
oc ^= occupancy&#91;ray&#93;; // set flipped bits
oc &= ~sq; // remove from-square
would give you the attack set along the ray, but only in one direction. So you would need 8 orientations in stead of the usual 4. You could pack two opposite orientations in one word, however. In x64 this would work upto 32x32 boards. You would need a 'boarder guard' to prevent extension of a ray that hits the board edge into the higher half of the word, but this can be done by simply setting the occupancy of the edge square (requiring an extra oc |= BORDER before the suptraction, where BORDER is a contant).

So there would be no need for table lookup of the attack sets. Just a few tables with 4*boardArea (byte-size) elements, to map (orientation, square) pairs on ray numbers of the board representation (which in itself would consist out of 4*boardWidth rays, (actually boardWidth + 3*boardHeigt if boardWidth > boardHeight), the diagonal rays packed in the same way as in rotated bitboard), and a table of 4*boardArea elements the length of a ray to map out the bits corresponding to a given square in the ray of that orientation.

Has this type of representation ever been used for normal Chess (i.e. 8x8 boards)?

I don't exactly understand your source code and how you deal with positive and negative rays. What is squares[orientation][fromSqr]? Only one bit or the ray mask?
Otherwise it looks similar to Subtracting a Rook from a Blocking Piece aka o^(o-2r) applied in Hyperbola Quintessence also for "negative" ray-directions.

hgm · Post by **hgm** » Tue Jun 18, 2013 12:21 pm

Indeed, this is essentially the o^(o-2r) trick.

The squares[orientation][sqr] is a single bit (or two bits, if positive and negative ray are stored in the same word.

E.g. for 8x8:

Code: Select all

int squares&#91;4&#93;&#91;64&#93; = &#123;
 // ranks
 &#123; 0x8001, 0x4002, 0x2004, 0x1008, 0x0810, 0x0420, 0x20240, 0x0180,
   0x8001, 0x4002, 0x2004, 0x1008, 0x0810, 0x0420, 0x20240, 0x0180,
   0x8001, 0x4002, 0x2004, 0x1008, 0x0810, 0x0420, 0x20240, 0x0180,
   0x8001, 0x4002, 0x2004, 0x1008, 0x0810, 0x0420, 0x20240, 0x0180,
   0x8001, 0x4002, 0x2004, 0x1008, 0x0810, 0x0420, 0x20240, 0x0180,
   0x8001, 0x4002, 0x2004, 0x1008, 0x0810, 0x0420, 0x20240, 0x0180,
   0x8001, 0x4002, 0x2004, 0x1008, 0x0810, 0x0420, 0x20240, 0x0180,
   0x8001, 0x4002, 0x2004, 0x1008, 0x0810, 0x0420, 0x20240, 0x0180
 &#125;,
 // files
 &#123; 0x8001, 0x8001, 0x8001, 0x8001, 0x8001, 0x8001, 0x8001, 0x8001,
   0x4002, 0x4002, 0x4002, 0x4002, 0x4002, 0x4002, 0x4002, 0x4002,
   0x2004, 0x2004, 0x2004, 0x2004, 0x2004, 0x2004, 0x2004, 0x2004,
   0x1008, 0x1008, 0x1008, 0x1008, 0x1008, 0x1008, 0x1008, 0x1008,
   0x0810, 0x0810, 0x0810, 0x0810, 0x0810, 0x0810, 0x0810, 0x0810,
   0x0420, 0x0420, 0x0420, 0x0420, 0x0420, 0x0420, 0x0420, 0x0420,
   0x0240, 0x0240, 0x0240, 0x0240, 0x0240, 0x0240, 0x0240, 0x0240,
   0x0180, 0x0180, 0x0180, 0x0180, 0x0180, 0x0180, 0x0180, 0x0180
 &#125;,
 // diagonals
 &#123; // same as files
   ...
 &#125;,
 // anti-diagonals
 &#123;
 &#125;
&#125;;

char rayNumbers&#91;4&#93;&#91;64&#93; = &#123;
 // ranks
 &#123; 0,0,0,0,0,0,0,0,
   1,1,1,1,1,1,1,1,
   2,2,2,2,2,2,2,2,
   3,3,3,3,3,3,3,3,
   4,4,4,4,4,4,4,4,
   5,5,5,5,5,5,5,5,
   6,6,6,6,6,6,6,6,
   7,7,7,7,7,7,7,7,
 &#125;,
 // files
 &#123;  8, 9,10,11,12,13,14,15,
    8, 9,10,11,12,13,14,15,
    8, 9,10,11,12,13,14,15,
    8, 9,10,11,12,13,14,15,
    8, 9,10,11,12,13,14,15,
    8, 9,10,11,12,13,14,15,
    8, 9,10,11,12,13,14,15,
    8, 9,10,11,12,13,14,15
 &#125;,
 // diagonals
 &#123; 16,17,18,19,20,21,22,23,
   23,16,17,18,19,20,21,22,
   22,23,16,17,18,19,20,21,
   21,22,23,16,17,18,19,20,
   20,21,22,23,16,17,18,19,
   19,20,21,22,23,16,17,18,
   18,19,20,21,22,23,16,17,
   17,18,19,20,21,22,23,16
 &#125;,
 //  anti-diagonals
 &#123; 24,25,26,27,28,29,30,31,
   25,26,27,28,29,30,31,24,
   26,27,28,29,30,31,24,25,
   27,28,29,30,31,24,25,26,
   28,29,30,31,24,25,26,27,
   29,30,31,24,25,26,27,28,
   30,31,24,25,26,27,28,29,
   31,24,25,26,27,28,29,30
 &#125;
&#125;;

int occupancy&#91;32&#93;; // first 8&#58; ranks, next 8&#58; files, etc.
int colors&#91;2&#93;&#91;32&#93;;
int pieces&#91;6&#93;&#91;32&#93;;

Ray masks are not needed, as each ray is stored in a separate word. So you can select the ray by addressing it, rather than having to mask it out.

kbhearn · Post by **kbhearn** » Tue Jun 18, 2013 6:32 pm

i was thinking of simple calculation with ray masks. blockers = (ray & occ); catch the no blockers case (or have the outside of the board always marked as a blocker) and depending on direction, firstblocker = bsf(blockers); or firstblocker = bsr(blockers);

For 32x32 that'd be 1024 squares * 8 directions * 128 bytes/mask = 1 MB.

Gerd Isenberg · Post by **Gerd Isenberg** » Tue Jun 18, 2013 11:53 pm

How many sliders do you have on the 16*16 board?

What about AVX2 256-bit ymm bitboards and direction and set-wise Kogge-Stone and o^(o-2r) (for all sliders of one kind in parallel)? Due to SIMD-wise 16 * 16-bit sub, shift, there is no wrap issue. Ok that requires a new Haswell, but there is an AVX2 Software Development Emulator for SSE by Intel.

Code: Select all

__m256i eastAttacks (__m256i occ, __m256i gen /*rooks*/) &#123;
  __m256i tmp;
  occ  = _mm_or_si256 &#40;occ, gen&#41;;  // make rooks member of occupied
  tmp  = _mm_xor_si256&#40;occ, gen&#41;;  // occ - rooks
  tmp  = _mm_sub_epi16&#40;tmp, gen&#41;;  // occ - 2*rooks
  return _mm_xor_si256&#40;occ, tmp&#41;;  // occ ^ &#40;occ - 2*rooks&#41;
&#125;

__m256i westAttacks (__m256i occ, __m256i gen /*rooks*/) &#123; 
  __m256i pro;
  pro  = _mm_xor_si256&#40;  _mm256_cmpeq_epi8&#40;occ, occ&#41;, occ&#41;; // propagator &#58;&#58;= empty = ~occ
  gen  = _mm_or_si256 &#40;gen, _mm_and_si256 &#40;pro, _mm_srli_epi16&#40;gen, 1&#41;));
  pro  =                    _mm_and_si256 &#40;pro, _mm_srli_epi16&#40;pro, 1&#41;);
  gen  = _mm_or_si256 &#40;gen, _mm_and_si256 &#40;pro, _mm_srli_epi16&#40;gen, 2&#41;));
  pro  =                    _mm_and_si256 &#40;pro, _mm_srli_epi16&#40;pro, 2&#41;);
  gen  = _mm_or_si256 &#40;gen, _mm_and_si256 &#40;pro, _mm_srli_epi16&#40;gen, 4&#41;));
  pro  =                    _mm_and_si256 &#40;pro, _mm_srli_epi16&#40;pro, 4&#41;);
  gen  = _mm_or_si256 &#40;gen, _mm_and_si256 &#40;pro, _mm_srli_epi16&#40;gen, 8&#41;));
  return _mm_srli_epi16&#40;gen, 1&#41;;
&#125;

There is a byte-wise shift, where you can shift 2 bytes left/right for one rank16 up/down vertically, but that works only independently on both 16-byte halfs, so there is some extra twiddling required for the otherwise lost rank on the 16-byte bounderis. Shifting two ranks may be done by dword permutation, and masking off the "shifted" zero bits. Dito by four or eight ranks by qword permutation.

hgm · Post by **hgm** » Wed Jun 19, 2013 10:21 am

I used the 16 x 16 just as an example, but there are actually three games that would fit within these limits: Chu Shogi (12 x 12), Dai Shogi (15 x 15) and Tenjiku Shogi (16 x 16). Of course there are also games like Big Chess (16 x 16) or Omega Chess (12 x 12 outside, but with inaccessible parts).

Chu & Dai have 16 different piece types (out of 36) that have a slider component in one or more directions. Six of those types occur in two flavors: those that can promote to something else, and those that cannot. (But probably one should not distinguish those at the level of the piece-type bitboards, but just have one extra 'promotable' bitboard.) Nine of these piece types occur in the initial position, usually in pairs.

Tenjiku has a few more, but it also has 'jumping sliders', which basically experience the board as empty, as they can jump over any number of pieces. The snag is that they cannot always jump each other; the 4 types of those come in 3 'ranks', and can only jump lower-ranked pieces. So you would need several 'occupied' bitboards, indicating successively fewer pieces (only those above a certain rank). Well, Tenjiku is an absolute nightmare to program no matter how you do it...

Problem is that there are also very similar variants on still larger boards: Dai Dai Shogi (17 x 17), Maka Dai Dai Shogi (19 x 19) and Tai Shogi (25 x 25). These would not fit a 256-bit word, but a 64-bit 'bitray' would be able to hold two copies (the negative and positive) of the ray, and pose no special problems on an x64 architecture where 64-bit data-types incur no penalty. And even in 32-bit it only involves a slight overhead, as you would store positive and negative rays in separate words (and would not have to worry about carry overflows from one into the other in compensation). And then there is Taikyoku Shogi (36 x 36), where even the bitray gets awkward, but in x64 can still be done by storing the positive and negative rays in separate 64-bit ints.

The number of slider piece types is huge; e.g. the initial position of TAI Shogi:

(each piece depicted with a radial line in it slides in that direction. And most pieces promote to something else, often slider types not present in the initial position)
So I think it would be better not to choose a representation that limits you to 16 x 16.

Gerd Isenberg · Post by **Gerd Isenberg** » Wed Jun 19, 2013 10:53 am

Wow, I see, so Kogge-Stone with 4 ymm-registers for max 32x32 boards. Knight-Corner already has 512-bit regs

With up to 32-bit ranks, vertical shifting becomes even more easier with dword permutation. Each board doubling requires only one more parallel prefix step with Kogge-Stone.

But I agree that your ray-sets mapped to "positive" directions to apply the o^(o-2r) trick makes a lot of sense.

Evert · Post by **Evert** » Wed Jun 19, 2013 11:16 am

Sjaak uses lookup tables for boards with fewer than 128 squares (and otherwise mostly arbitrary dimensions).

As you say, this clearly doesn't scale well for larger boards and the reason it's done this way is mainly lazyness on my part. However, I have an experimental move generator for arbitrary sized bitboards using C++ templates that I plan to use for a future version of Sjaak. For this I should look into using a lookup-free method to generate at least sliding piece attacks, which should be relatively straightforward if I implement general riders first.

On my very long todo list.

Table-less bitboards (bitrays?)

Table-less bitboards (bitrays?)

Re: Table-less bitboards (bitrays?)

Re: Table-less bitboards (bitrays?)

Re: Table-less bitboards (bitrays?)

Re: Table-less bitboards (bitrays?)

Re: Table-less bitboards (bitrays?)

Re: Table-less bitboards (bitrays?)

Re: Table-less bitboards (bitrays?)

Re: Table-less bitboards (bitrays?)

Re: Table-less bitboards (bitrays?)