Bitboards ?: C# vs C++

Chessnut1071 · Post by **Chessnut1071** » Thu Nov 18, 2021 11:27 pm

hgm wrote: ↑Thu Nov 18, 2021 11:04 pm Nothing. But the first doesn't really give you much information. Because there is no meaningful definition for when a search is finished. You can continue it forever, and while you do the number of nodes keeps growing. You can determine the time needed to complete a certain depth, but in the face of reductions and extensions depth is an ill-defined concept as well. You can determine how much time it needs to solve a given problem (like mate, or gaining a piece), but for a given search this would depend enormously on the problem at hand, and what will be very fast in one position will be excedingly slow in another. So you would have to average over many positions, and then make sure these positions are a representative example of what you would encounter in games. And weight them by importance. That boils down to actually playing games, and measuring Elo from the results. Attempts to take shortcuts on this, e.g. by optimizing speed for solving a test suite, almost universally backfire, and the better you are at solving, the lower the Elo usually goes.

Comparing nodes/sec on similar trees (from the same test position) is easy, though. It can be used to directly compare different algorithms for the same tasks (move generation, move sorting, updating the game state, evaluation) speedwise.

"'.... But the first doesn't really give you much information. Because there is no meaningful definition for when a search is finished....."

I think you missed my point of using deep ply mate-in-x puzzles. The search is terminated and finished. It doesn't go on forever like a game, which is exactly the reason for using known ending solutions. I also contend that there is no better method of checking the accuracy of your engine.

hgm · Post by **hgm** » Thu Nov 18, 2021 11:38 pm

But checkmate is not a representative task for what an engine has to do to to win a game. In most positions of a game the checkmate cannot be seen, and in positions where checkmates can be seen, games are almost always already decided. Optimizing engines by the speed at which they find checkmates, would lead to excessively weak engines. For one, you would 'gain' by refraining from quiescence search.

Chessnut1071 · Post by **Chessnut1071** » Fri Nov 19, 2021 12:34 am

hgm wrote: ↑Thu Nov 18, 2021 11:38 pm But checkmate is not a representative task for what an engine has to do to to win a game. In most positions of a game the checkmate cannot be seen, and in positions where checkmates can be seen, games are almost always already decided. Optimizing engines by the speed at which they find checkmates, would lead to excessively weak engines. For one, you would 'gain' by refraining from quiescence search.

"... But checkmate is not a representative task for what an engine has to do to to win a game. ..."

I agree with the above, but, the reason for the test was to compare pre-computed bitboards speed vs. pre-computed byte tables. I'm trying to develop a bitboard for comparative purposes, but, my bitboard board algorithm will probably have too many issues to really compare with a well written one. That is why I would like a bitboard enthusiast to run the test on a 14-ply 7 move mate puzzle. With time, engine calls and node speed per second, I can get some data to compare with. My pre-computed byte tables have no numeric calculations or if statements, just an array lookup and I can't understand how bitboards can be any faster in spite of the bit compares. Well see.

emadsen · Post by **emadsen** » Fri Nov 19, 2021 4:57 am

Chessnut1071 wrote: ↑Fri Nov 19, 2021 12:34 am ...to compare pre-computed bitboards speed vs. pre-computed byte tables... I can't understand how bitboards can be any faster in spite of the bit compares.

I'm trying to decipher the source of your incomprehension. Is it...

1) Not understanding that the "magic" of magic bitboards is how quickly they discard extraneous information? A multiply operation and a shift operation is all that's required to fetch the list of pseudo-legal moves for all conceivable combinations of occupancies along the rank / file or up / down diagonal directions the piece may move. If your rook is on e1, looking to advance up the board, but your own piece occupies e5, the rook can move only to e2, e3, or e4. The occupancy of e6, e7, and e8 are irrelevant to the mobility of your rook on e1. They can be jammed with pieces, they can be empty... doesn't matter. Magic bitboards discard this extraneous information with two operations to arrive at the correct list of pseudo-legal moves.

What are these "pre-computed bytes tables" you speak of? Do they throw away extraneous info as quickly as magic bitboards? Or must they examine square occupancy in a loop? Move to e2... allowed, move to e3... allowed, move to e4... allowed, move to e5... blocked. Then examine square occupancy in a loop to the west, then to the east. Slow. Or are you calculating the nearest neighbor in every direction every time a piece is moved? How expensive is that calculation? Is it cheaper than a multiply + shift?

Or is your incomprehension due to...

2) Trying too hard to understand in theory why some algorithm should outperform another algorithm? Welcome to chess programming. Don't linger too long in the world of theory. "If you have two horses and you want to know which of the two is the faster then race your horses."

In other words, I don't agree you "can't understand." You can. Write the code and measure it empirically.

Mergi · Post by **Mergi** » Fri Nov 19, 2021 8:01 am

to compare pre-computed bitboards speed vs. pre-computed byte tables. I'm trying to develop a bitboard for comparative purposes, but, my bitboard board algorithm will probably have too many issues to really compare with a well written one

If you have two move generators and want to compare their speed, you can use the perft algorithm and get a representative comparison, especially since you say that both generators are bitboard based.

That is why I would like a bitboard enthusiast to run the test on a 14-ply 7 move mate puzzle. With time, engine calls and node speed per second, I can get some data to compare with.

No you cant. The moment you start comparing how long it takes to search a position to a certain depth for 2 different engines, that dont have identical search routine, 90% of the comparison will be about how your search routine is written and not about the move generator at all.

hgm · Post by **hgm** » Fri Nov 19, 2021 9:20 am

emadsen wrote: ↑Fri Nov 19, 2021 4:57 am In other words, I don't agree you "can't understand." You can. Write the code and measure it empirically.

There is a difference between believing and understanding. What you suggest is only a cure for that he doesn't believe bitboards can be faster . But it will provide zero understanding of why.

Yet it should not be so difficult to understand. Suppose you want to generate the moves of a Bishop at a1. When you do that through pre-computed tables, the Bishop table for a1 will contain something like (b2, c3, d4, e5, f6, g7, h8). But a pre-computed table cannot ake account of the fact that, say, e5 is occupied by an opponent piece, so that f6, g7 and h8 are not valid targets. So the move generation loop for captures will look something like

Code: Select all

int i = start[BISHOP][A1];
do {
  int toSqr = moveTable[i];
  int victim = board[toSqr];
  if(victim != EMPTY) {
    if(COLOR(victim) != stm) { // we hit opponent
      moveList[--capturePtr] = NewMove(fromSqr, toSqr, victim); // generate the capture
      i = nextRay[i]; // remainder of ray is blocked, continue with next direction
    }
  } else i++; // next square on ray, or first of next ray
} while(...); // some test on i to see whether we are done with this piece

Now suppose that the diagonal is empty. That means the loop will have 7 iterations, each of which fetch the next toSqr, the piece on this square, test whether this actually is a piece (not), increment i, perform some test on i, and branch back to loop. That is a lot of operations (7 x 7 + 1 = 50, if I count right). Now compare that with how bitboards do it:

Code: Select all

uint64_t rays = mask[BISHOP][A1]; // reachable squares on empty board
rays &= occupied; // all pieces in its path
uint64_t magic = bishopMagics[A1];
uint64_t attacks = bishopTable[rays*magic>>58][A1]; // reachable squares with obstruction
attacks &= occupancy[opponent]; // reachable squares that contain opponent
while(attacks) { // extract those one by one
  int toSqr = BSF(attacks);
  moveList[--capturePtr] = NewMove(fromSqr, toSqr); // generate the capture
  attacks &= attacks - 1; // remove this attack
}

If the ray was empty, the while loop is not executed at all, and degenerates to a single branch for skipping it. There are just 2 AND instructions, a multiply, a shift and a few memory fetches (the mask, the magic, the attacks and two occupancies). That is only 10 operations. And 10 is a lot less than 50.

Even when e4 does contain an enemy piece, the precomputed table of squares method would perform 4 iterations of the loop before it breaks from it, taking 4 x 7 + 1 = 29 operations, plus whatever it takes to store the capture. The bitboard method would now execute the extraction loop once, taking 3 extra operations (so 13) plus storing the capture. Also here 13 < 29.

Does that answer your (i.e. Bill's) question?

Chessnut1071 · Post by **Chessnut1071** » Fri Nov 19, 2021 1:13 pm

hgm wrote: ↑Fri Nov 19, 2021 9:20 am
emadsen wrote: ↑Fri Nov 19, 2021 4:57 am In other words, I don't agree you "can't understand." You can. Write the code and measure it empirically.
There is a difference between believing and understanding. What you suggest is only a cure for that he doesn't believe bitboards can be faster . But it will provide zero understanding of why.

Yet it should not be so difficult to understand. Suppose you want to generate the moves of a Bishop at a1. When you do that through pre-computed tables, the Bishop table for a1 will contain something like (b2, c3, d4, e5, f6, g7, h8). But a pre-computed table cannot ake account of the fact that, say, e5 is occupied by an opponent piece, so that f6, g7 and h8 are not valid targets. So the move generation loop for captures will look something like
Code: Select all
int i = start[BISHOP][A1];
do {
  int toSqr = moveTable[i];
  int victim = board[toSqr];
  if(victim != EMPTY) {
    if(COLOR(victim) != stm) { // we hit opponent
      moveList[--capturePtr] = NewMove(fromSqr, toSqr, victim); // generate the capture
      i = nextRay[i]; // remainder of ray is blocked, continue with next direction
    }
  } else i++; // next square on ray, or first of next ray
} while(...); // some test on i to see whether we are done with this piece
Now suppose that the diagonal is empty. That means the loop will have 7 iterations, each of which fetch the next toSqr, the piece on this square, test whether this actually is a piece (not), increment i, perform some test on i, and branch back to loop. That is a lot of operations (7 x 7 + 1 = 50, if I count right). Now compare that with how bitboards do it:
Code: Select all
uint64_t rays = mask[BISHOP][A1]; // reachable squares on empty board
rays &= occupied; // all pieces in its path
uint64_t magic = bishopMagics[A1];
uint64_t attacks = bishopTable[rays*magic>>58][A1]; // reachable squares with obstruction
attacks &= occupancy[opponent]; // reachable squares that contain opponent
while(attacks) { // extract those one by one
  int toSqr = BSF(attacks);
  moveList[--capturePtr] = NewMove(fromSqr, toSqr); // generate the capture
  attacks &= attacks - 1; // remove this attack
}
If the ray was empty, the while loop is not executed at all, and degenerates to a single branch for skipping it. There are just 2 AND instructions, a multiply, a shift and a few memory fetches (the mask, the magic, the attacks and two occupancies). That is only 10 operations. And 10 is a lot less than 50.

Even when e4 does contain an enemy piece, the precomputed table of squares method would perform 4 iterations of the loop before it breaks from it, taking 4 x 7 + 1 = 29 operations, plus whatever it takes to store the capture. The bitboard method would now execute the extraction loop once, taking 3 extra operations (so 13) plus storing the capture. Also here 13 < 29.

Does that answer your (i.e. Bill's) question?

"Even when e4 does contain an enemy piece, the precomputed table of squares method would perform 4 iterations of the loop before it breaks from it, taking 4current positio x 7 + 1 = 29 operations, plus whatever it takes to store the capture. The bitboard method would now execute the extraction loop once, taking 3 extra operations (so 13) plus storing the capture. Also here 13 < 29."

Negative on that. My pre-computed move vectors contains only the possible moves from the given position, nothing extra. If there is none, no looping. Also, you need a bit scan forward to know when to stop. I use the same stop simply by examining if the square is occupied. These are mirror images of each other. Bitboards use bit ops, mine uses a lookup. The advantage of byte tables is they easier to understand and implement. If there is a speed difference it can't be much because they're both doing the same thing.

Just to be sure, I'm in the process is developing a bitboard algorithm for comparison. If it is faster, I'll use it; otherwise, it's the byte tables. I need to get to 30-ply with no pruning for a special project. I'm no where near that right now, but, I'm getting closer. I learned lot from this board, my engine is 50.6x faster since I logged on here a few months ago.

dangi12012 · Post by **dangi12012** » Fri Nov 19, 2021 3:01 pm

Chessnut1071 wrote: ↑Fri Nov 19, 2021 12:34 am ...to compare pre-computed bitboards speed vs. pre-computed byte tables... I can't understand how bitboards can be any faster in spite of the bit compares.

Ahh thats where you are wrong. Bitboards are not just a replacement for arrays - they enable you to have algorithms that work on 64 elements in one instruction.

Getting the attacked squares of a rook is not 14 shifts + 14 bittests.
Its a single PEXT instruction into a lookup table. The speed is INSANE - and only FPGAS could do better.

hgm · Post by **hgm** » Fri Nov 19, 2021 4:54 pm

@Bill: well, it is difficult to comment on the speed of what you are doing if I have no clue what you are doing. And if you ever posted the code for that, I must have missed it. I thought you were comparing regular maibox to regular bitboard.

If your pre-computed lists take account of whether e4 (or any other square on the ray) is occupied or not, you must have multiple lists for the same square of origin, and I wonder how you can select which one of those to use. It seems to me you cannot do that without examining the squares on that ray; potentially all of those when they are all empty.

Chessnut1071 · Post by **Chessnut1071** » Fri Nov 19, 2021 6:17 pm

hgm wrote: ↑Fri Nov 19, 2021 4:54 pm @Bill: well, it is difficult to comment on the speed of what you are doing if I have no clue what you are doing. And if you ever posted the code for that, I must have missed it. I thought you were comparing regular maibox to regular bitboard.

If your pre-computed lists take account of whether e4 (or any other square on the ray) is occupied or not, you must have multiple lists for the same square of origin, and I wonder how you can select which one of those to use. It seems to me you cannot do that without examining the squares on that ray; potentially all of those when they are all empty.

I'm thinking you're assuming an iterative loop with 8 vectors, or rays. My table has a move counter in front of each vector and I mask the entire board the same way you do with bitboards. I also include a check mask. If there is a blocker, the list is terminated in that direction. It never searches anything that's past the blocker or illegal. Also, with 1 bit you can carry 1 bit of information. With a byte, I get the piece number, direction and if it's a check move: piece 4-bits, direction 3-bits, check 1 - bit. With the move direction and enemy check mask, discovered check jumps out too and requires only 6-bits; piece type and direction. The advantage of bitboards quick machine cycles looses some of its advantage when you have to go back and capture the important tactical information after you have id the move. I'll try the bitboard approach, but, that's probably really going to drag since I'm still exploring them.

Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++

Re: Bitboards ?: C# vs C++