I did some magic bitboard "science" and mostly learned not to worry about it

Jakob Progsch · Post by **Jakob Progsch** » Thu Apr 22, 2021 10:27 am

One other thing that bugs me a bit is that we make all this effort to calculate the entire moveset of a piece in one go. But then later we probably end up iterating over that bitset anyway, right?

Somewhere you do something like:

Code: Select all

uint64_t bishop_moves = cool_magic_bitboard_function(bishop, board);
//...
while(bishop_moves) {
	bit = extract_one_bit(bishop_moves);
	//play the move, insert it into a list etc.
	//...
	bishop_moves &= ~bit;
}

When you could have just generated the sliding moves iteratively in the first place

Code: Select all

uint64_t bishop_move = bishop;
while(bishop_move & ~blockers) {
	bishop_move <<= 9u; // iterate towards positive direction
	//play the move, insert it into a list etc.
	//...
}
// repeat for the other directions

Of course being able to calculate these in bulk is still useful if you don't actually need to iterate over them. Such as when calculating controlled squares, looking for checks, pins, forks etc. but for raw move generation the first approach is almost a bit dubious compared to the second?

hgm · Post by **hgm** » Thu Apr 22, 2021 10:30 am

Exactly. This is why mailbox is faster than bitboard. What also appears to slow things down is that pieces don't have the same number of moves everywhere. E.g. if you do Knight moves from a lookup table uint64_t knightMoves[64], a Knight on a1 would have only 2 moves, and a Knight on e4 would have 8. So if you then extract the moves from knightMoves[A1] you would not have the expected 8 moves, but the extraction loop already terminates after 2 iterations. Which makes it much slower than a loop that would have done the expected 8 iterations (or unrolled those a compile time), because it will produce a mispredict at the looping branch.

Jakob Progsch · Post by **Jakob Progsch** » Thu Apr 22, 2021 10:40 am

Here my newness to chess programming specifically shows. Is the usual approach to physically create the move list and explicitly order it? The above thing is why I jumped through some hoops and try to specifically avoid this by generating them on the fly where you again can make some use of the bitwise set operations.

Say I only want captures then

Code: Select all

uint64_t bishop_moves = cool_magic_bitboard_function(bishop, board);
uint64_t bishop_captures = bishop_moves & opposing_pieces;
while(bishop_captures) {
	bit = extract_one_bit(bishop_captures);
	//play the move, insert it into a list etc.
	//...
	bishop_captures &= ~bit;
}

will iterate over less moves than the... uh... "direct iteration"?

Edit: I guess the other reason why I like the bitboards without creating an explicit list is that the storage is fixed at 16 uint64_t instead of some array whose pathological upper bound is like 200 moves but the average is more like 20.

hgm · Post by **hgm** » Thu Apr 22, 2021 10:50 am

Indeed, the usual approach is to write the moves in a moveList array, assigning a sort key to them. And then later extract the moves with the highest remaining sort key when you want to search another move.

And indeed, for generating only captures (as you would do in Quiescence Search), one first calculates the captures 'in bulk' through Boolean operations on the bitboards. So that your extraction loop then only has to deal with captures, and doesn't waste time on extracting non-captures, and having to test and reject those on a per-move basis.

The problem with storing the move set in 16 bitboards is that you cannot sort them. In the 'mailbox trials' thread I therefore use a representation of 32 uint32_t to store captures (and protections) per victim, rather than per attacker; then you can assign the bits that represent the captures in such a way that they extract in the order you want to play them. So that sorting is not needed at all. This would not work for non-captures, of course, which should be sorted by entirely different criteria than the type of the moving piece. But there are virtually no nodes in the tree that search non-captures. So it is not important how you handle these.

Gerd Isenberg · Post by **Gerd Isenberg** » Fri Apr 23, 2021 5:33 pm

Jakob Progsch wrote: ↑Thu Apr 22, 2021 10:40 am Here my newness to chess programming specifically shows. Is the usual approach to physically create the move list and explicitly order it? The above thing is why I jumped through some hoops and try to specifically avoid this by generating them on the fly where you again can make some use of the bitwise set operations.

Say I only want captures then
Code: Select all
uint64_t bishop_moves = cool_magic_bitboard_function(bishop, board);
uint64_t bishop_captures = bishop_moves & opposing_pieces;
while(bishop_captures) {
	bit = extract_one_bit(bishop_captures);
	//play the move, insert it into a list etc.
	//...
	bishop_captures &= ~bit;
}
will iterate over less moves than the... uh... "direct iteration"?

Edit: I guess the other reason why I like the bitboards without creating an explicit list is that the storage is fixed at 16 uint64_t instead of some array whose pathological upper bound is like 200 moves but the average is more like 20.

In my last chess program I used this direction-wise generation of legal moves color-flipper approach with 16 target bitboards (8 ray directions + 8 knight directions), and a finite state machine to pick moves in approapriate order.

Jakob Progsch · Post by **Jakob Progsch** » Mon Apr 26, 2021 9:53 am

Coming back to the original topic. I repeated the experiments on a Raspberry PI 4B running a 64bit Ubuntu (the default rpi os is 32bit).

Since there isn't a convenient rdtsc equivalent on ARM I measured in nanoseconds this time. So the absolute times are not directly comparable to the previous ones. The scale factor is 3.8. Meaning you'd have to multiply these numbers by 3.8 to get values comparable to the previous ones.

Rooks

Code: Select all

            arithmetic         kindergarten           magic
  M         t      t/M          t      t/M          t      t/M

  1       14.04   14.04       13.37   13.37       24.15   24.15
  2       24.73   12.37       18.72    9.36       25.76   12.88
  3       36.11   12.04       26.07    8.69       27.75    9.25
  4       47.47   11.87       32.75    8.19       31.97    7.99
  5       58.16   11.63       41.11    8.22       37.44    7.49
  6       69.52   11.59       49.91    8.32       44.70    7.45
  7       80.88   11.55       57.49    8.21       52.53    7.50
  8       94.92   11.87       69.51    8.69       60.18    7.52

Bishops

Code: Select all

            arithmetic         kindergarten           magic
  M         t      t/M          t      t/M          t      t/M

  1       12.70   12.70       13.37   13.37        9.83    9.83
  2       20.72   10.36       14.22    7.11       10.42    5.21
  3       31.41   10.47       20.72    6.91       12.46    4.15
  4       41.44   10.36       30.74    7.69       15.74    3.93
  5       52.80   10.56       36.09    7.22       19.34    3.87
  6       64.84   10.81       48.79    8.13       22.98    3.83
  7       75.53   10.79       56.81    8.12       26.64    3.81
  8       87.56   10.95       69.51    8.69       31.13    3.89

So a RPI4 core at 1.8GHz is about 4x slower than a core on my AMD 3900x at 4.3GHz. Which is actually pretty impressive In my opinion. The RPI certainly plays more chess per watt and money than my desktop

. There aren't any ARM specific optimizations here. The overall behavior is very similar than on the x86_64 CPU. For a moment I thought there might be nice potential for a ARM specific arithmetic version since it has bit reversal instructions. But annoyingly those only go up to 32bit operands from what I can tell.

Jakob Progsch · Post by **Jakob Progsch** » Mon May 17, 2021 7:54 pm

hgm wrote: ↑Thu Apr 22, 2021 10:50 am Indeed, the usual approach is to write the moves in a moveList array, assigning a sort key to them. And then later extract the moves with the highest remaining sort key when you want to search another move.

I was convinced this would come with a cost. After all you have to extract, pack, write to memory and later read and unpack every move when actually materializing that list. So naturally my perft implementation just directly recursed into the move as I extracted them from the bitboards. Now I tried it and confusingly explicitly creating that list and iterating over it in a second step is... marginally faster? I guess it may lead to better pipelining since the moves are generated in a sequentially "dense" fashion instead of constantly being "interrupted" by recursive calls?

dangi12012 · Post by **dangi12012** » Mon Dec 13, 2021 3:18 pm

hgm wrote: ↑Thu Apr 22, 2021 10:50 am The problem with storing the move set in 16 bitboards is that you cannot sort them. In the 'mailbox trials' thread I therefore use a representation of 32 uint32_t to store captures (and protections) per victim, rather than per attacker; then you can assign the bits that represent the captures in such a way that they extract in the order you want to play them. So that sorting is not needed at all. This would not work for non-captures, of course, which should be sorted by entirely different criteria than the type of the moving piece. But there are virtually no nodes in the tree that search non-captures. So it is not important how you handle these.

There is a way to use 15 bits to store everything a move could ever contain.
From Square : 6
To Square : 6
From Piece : 4
To Piece (capturing, move to empty, taking pawn, taking knight etc) : 4
Specials (EP, castle, promotion) : 2
Promo piece: 2

Yes the sum here is 24 bits. But I am sure you get the idea of how to encode all branches of chess into a single number.
The way its usually done: https://chess.stackexchange.com/questio ... -bitboards
is wasteful imo. You can do every possible move with all possible information in 15 bits - and can round up to a nice 16 bit.

dangi12012 · Post by **dangi12012** » Mon Dec 13, 2021 3:50 pm

Is my idea wrong? We get a few moves to many - but still get a good number:

Code: Select all

int config = 14 + 2 + 8 + 22*10; //EP castling pawn push, all pawn moves. Todo: Pawns can only move to empty squares. Here we assume any square
	for (int i = 0; i < 64; i++) {
		config += std::popcount(Chess_Lookup::Lookup_Switch::KingAttacks[i]);
		config += std::popcount(Chess_Lookup::Lookup_Switch::KnightAttacks[i]);
		config += std::popcount(Chess_Lookup::Lookup_Switch::Queen(i, 0));
		config += std::popcount(Chess_Lookup::Lookup_Switch::Rook(i, 0));
		config += std::popcount(Chess_Lookup::Lookup_Switch::Bishop(i, 0));
	}
	config *= (2 + 6); //Per color. All 5 enemies + empty. King has to be excluded always because it cannot be taken

31296 is more than the total number of distinct moves in 8x8 chess. That fits in 15 bit.
And you would put a templated lamdbafunction into an array of that size. There you can store totally branchless code.

Any move would only be its own ID. By putting taking moves lower into the array you know by the value of the id that its a move that should be explored first.

hgm · Post by **hgm** » Mon Dec 13, 2021 4:12 pm

Executing functions from a table does not qualify as branchless. In indirect jump or call virtually guarantees a mispredict, as it would be very hard to predict what function will be chosen for calling.

I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it

Re: I did some magic bitboard "science" and mostly learned not to worry about it