fast(‐ish) pseudo‐legal perft

zamfofex · Post by **zamfofex** » Sun Apr 23, 2023 10:56 pm

Hello, everyone! I’m not sure whether this might be interesting to anyone, but I have recently come up with a fairly terse pseudo‐legal move generator that ended up being fairly fast! The results are nothing groundbreaking, and the fact it doesn’t strictly count properly valid positions make it entirely useless for actually finding perft values.

But I’m happy with how “straightforward” I was able to make it. I think it’s very understandable to read, and it is under 250 lines of plain C, almost entirely under standard ISO C 99, except that with a few optimisations specific to GCC/Clang (such as ‘__attribute__’ and ‘#pragma’).

I call it coa perft!

Code: Select all

gcc -std=c99 -O3 -flto -march=native -o coa coa.c

Some limitations:

Castling is not considered.
The king may be put into (or left in) check and even be captured.
En passant is not considered.
Underpromotion is not considered.

In my computer, I get the following results:

Code: Select all

depth  7 -> 3327446432 nodes /   32.886  s = 101.181 Mn/s

Compared with hgm’s qperft:

Code: Select all

perft( 7)=   3195901860 (18.903 sec)

I might continue working on it to make the results more accurate. In any case, maybe this can serve as some kind of inspiration for someone!

Mike Sherwin · Post by **Mike Sherwin** » Mon Apr 24, 2023 1:15 am

I looked at your code. I don't know if it is interesting or not. It is not bitboard and it does not use a board type with a border. Instead you use x and y coordinates in a clever way. It runs decently fast but then it would run a little slower if it spent time on castling and en-passant. Using x/y coordinates means that piece on board testing can take place before actually looking to see what value is in board[ts]. But that also means two test instead of one in most cases as both x and y needs to be tested. On the plus side it is easy to understand code that a beginner can use to get a quick toy/hobby engine up and running in a short amount of time. So first just create your own TSCP type engine that others can learn from and it will most likely be interesting as a learning engine. Other than that I do not see anyone getting to excited about it.

zamfofex · Post by **zamfofex** » Mon Apr 24, 2023 2:39 am

Thank you for the feedback! I wasn’t expecting it to be particularly impressive by it or anything, I was just proud of how it turned out, and I wanted to be able to share it here. It feels like an achievement for me!

I think part of what allows for it to be fast is that I designed it in a way that allows for (some/most/all of) the bound checking to not need to happen at runtime at all by inlining.

So e.g. for the knight, when it is on the corner, it should end up optimizing the bounds checks away entirely and only verify whether the knight can move to two squares at most.

Note how the loops iterating over the positions on the board are marked to be unrolled. The idea was that the compiler should generate and inline 64 different functions, each to look at a the piece on a specific square and deepen the search.

Likewise, the computations of the array indices are also meant to be removed from the loop altogether. When you see ‘x + y * 8’, no multiplication or addition should need to be performed at runtime, because it was already performed at compile time in each of those 64 generated inline functions.

Mike Sherwin · Post by **Mike Sherwin** » Mon Apr 24, 2023 10:01 pm

Someone will have to explain to me how this "I think part of what allows for it to be fast is that I designed it in a way that allows for (some/most/all of) the bound checking to not need to happen at runtime at all by inlining.", works because I just can't see it.

For the knight I see the inline request (no guarantee the compiler will actually inline the code).

Code: Select all

inline void coa_move_knight(struct coa_board *board, int turn, int x0, int y0)
{
	coa_play(board, turn, x0, y0, x0 + 1, y0 + 2);
	coa_play(board, turn, x0, y0, x0 + 2, y0 + 1);
	coa_play(board, turn, x0, y0, x0 - 1, y0 + 2);
	coa_play(board, turn, x0, y0, x0 - 2, y0 + 1);
	coa_play(board, turn, x0, y0, x0 + 1, y0 - 2);
	coa_play(board, turn, x0, y0, x0 - 2, y0 + 1);
	coa_play(board, turn, x0, y0, x0 - 1, y0 - 2);
	coa_play(board, turn, x0, y0, x0 - 2, y0 - 1);
}

Then each line calls coa_play() where the bounds checking is done.

Code: Select all

inline int coa_play(struct coa_board *board, int turn, int x0, int y0, int x1, int y1)
{
	if (x1 < 0) return 1; // this is runtime code iiuc and these are bounds checks
	if (y1 < 0) return 1;
	if (x1 >= 8) return 1;
	if (y1 >= 8) return 1;
	
	unsigned char piece = board->pieces[x0 + y0 * 8];
	unsigned char other = board->pieces[x1 + y1 * 8];
	if ((other & coa_color) == (piece & coa_color)) return 1;
	
	board->pieces[x0 + y0 * 8] = coa_none;
	board->pieces[x1 + y1 * 8] = piece;
	coa_perft(board, turn ^ 1);
	board->pieces[x0 + y0 * 8] = piece;
	board->pieces[x1 + y1 * 8] = other;
	
	if (other != coa_none) return 1;
	return 0;
}

"Likewise, the computations of the array indices are also meant to be removed from the loop altogether. When you see ‘x + y * 8’, no multiplication or addition should need to be performed at runtime, because it was already performed at compile time in each of those 64 generated inline functions."

When X and Y are variable they can not be computed at compile time.

Someone please explain, my head is turning in circles.

zamfofex · Post by **zamfofex** » Tue Apr 25, 2023 6:23 am

At least on GCC, there is a strict guarantee that the functions will be inlined because of line 26, defining the ‘inline’ macro to ‘static inline __attribute((always_inline))’. If the function is not able to be inlined, the compilation will fail with an error. (GCC docs, Clang docs)

Because the functions are always inlined, and the loops are unrolled, the compiler will be able to optimise those checks away entirely.

‘coa_play’ is inlined into the ‘coa_move_knight’ function. ‘coa_move_knight’ is then inlined into ‘coa_move’, once for each possible x/y values (zero to seven each). Then, since the value of ‘x’ and ‘y’ are known (because those functions were inlined), the compiler knows the values of ‘x0’, ‘y0’, and thus of ‘x1’, ‘y1’, and is able to optimise those operations away entirely.

When you see e.g. ‘if (x1 < 0) return 1;’, remember that this was inlined into the unrolled loop code in the ‘coa_move’ function, so in the case ‘x1 < 0’, the compiler will be able to skip emitting the rest of the code for the function. And in any case, the comparison is not made at runtime.

For sliding pieces, I will admit it’s a bit different (and that wouldn’t necessarily be true to say for them), since their loops aren’t unrolled. (I have tried it, and it led to poorer performance overall.)

Mike Sherwin · Post by **Mike Sherwin** » Tue Apr 25, 2023 8:34 am

I think I'm starting to understand. This kind of thing must be newer than any book on C that I have.

So in the perft function there are unrolled 64 copies of coa_move with compile time knowledge of the x and y value.

Then each copy of coa_move has a copy of the code for each piece type for the known x and y coordinate.

So when coa_move_knight is called when x0 = 0 and y0 = 0 (a1) only these two calls are executed ...

coa_play(board, turn, x0, y0, x0 + 1, y0 + 2);
coa_play(board, turn, x0, y0, x0 + 2, y0 + 1);

... because the compiler is smart enough to eliminate the other 6 possibilities in that copy of the code.

And coa_play itself has a copy for every possibility so that [x1 + y1 * 8] is computed at compile time for that copy because the exact values are already known for that copy of the code.

When did this become a thing? I've only seen people talk like this about their code for the last couple of years but I never understood what was going on. The compiled code must be huge, like 64 times the size of a non-unrolled perft.

Do I understand now?

Thanks for taking the time to explain it!

hgm · Post by **hgm** » Tue Apr 25, 2023 9:33 am

Some remarks:

For x+8*y you would not need any instructions anyway; the address calculation units do this as a side effect of the memory load instruction, as it is one of the addressing modes in i386 architecture. Provided you access an array with byte-size elements.

By fully unrolling the nested loops you indeed get dedicated code for each of the 64 iterations (i.e. each board square). This indeed makes it possible for the code to be optimized for the known values of x and y, and eliminate conditional code for which the test on x and y would be false, while making the code unconditional when the condition is known to be true.

The catch is that examining all 64 squares is inefficient, as more than half will be empty, and half of the others will be enemy pieces. And then there is a switch for determining the piece type on each square. One can hope that the branches that implement this are easily predictable, though, as the occupant of a given square will not change very much in the perft tree.

Without paying attention to branch predictability a more efficient algorithm for doing the board scan is to use a piece list to keep track of the piece locations. Then you only have to loop over the piece list (16 elements max, instead of 64). And when that loop gets unrolled, each iteration corresponds to a specific piece type. Of course one could have done the unrolling partly 'by hand', by splitting the loop over pieces into a separate loop for each piece type (i.e. 8 iterations for the Pawns, non-looping code for King and Queen). Each such loop would then only need to contain the code for generating moves for the type it was for.

To actually generate the moves, there could be a switch on the square number, (obtained from the piece list), where each of the 64 cases is coded separately. The constancy of the piece locations during the perft would lead to good prediction of where a given piece stands, and the code the swicth would branch to would be optimized for that square, i.e. would not need to do the on-board test at run time. Usually the 64 cases are not all different; e.g. for a Knight all squares in the central 4x4 area would be the same, and generate all 8 moves (with destinations that are obtained by a simple addition of the constant step to the piece location). For a King there would only be 9 different cases: central, 4 x edge, 4 x corner, generating 8, 5 or 3 moves, respectively.

In the 'mailbox trials' I only had to generate new moves for a single piece (the moved one) in each node, and then it is of course not predictable anymore which piece this will be. Rather than working with code that depended on the piece location (by a switch on square number) I used a generic move-generation loop that got the steps from a table that was indexed by square number. (Actually I used two generic loops: one for leapers and one for sliders.) That table would then contain a variable number of board steps to make from each square, all guaranteed to be on board.

The problem here was that the loop branch became difficult to predict when the number of steps is variable. So this whole idea of limiting the move-generation attempts to destinations that would surely be on board was a bust. It proved faster to just generate moves in all directions, also those that went off board, so that the loop over steps had a fixed number of iterations, and could be branchlessly unrolled. Provided one did not actually test whether the destination was on board, but used a branchless method to get rid of the off-board moves.

As in the mailbox trials the moves were stored by setting a bit in the attack map for the destination square, it was easy to do this: the attack map was stored not as an 8x8 board, but as a larger structure (e.g. 10x12 or 16x12), so that it had guard bands containing elements that registered attacks on off-board squares. The off-board moves updated those, but since no one would ever be interested in their value, this was harmless. So the moves were always unconditionall generated for each of the 4 or 8 directions the piece could move in.

When storing moves as (from, to) pairs in an array the equivalent trick would be to always do the store, but make the incrementin gof the move pointer that keeps track of where the next move would be stored conditional. E.g. by writing moveCount += onbBoard[toSqr];, where the onBoard[] array would contain 1 for squares that are on the board, and 0 for squares that fell over the edge. There are other ways to do this without extra memory access as well, like moveCount += !(toSqr & 0x88); when you would use 0x88 square numbering.

hgm · Post by **hgm** » Tue Apr 25, 2023 11:06 am

This discussion sparks the following idea:

In the Mailbox Trials I only worried about generating captures, as I was interested in an engine, where typically some 95% of all nodes are QS nodes where you only need captures. For the captures it was efficient to store those as a bit map, with an attackers[pieceNr] element in the piece list where each bit corresponded to a piece, and would be set if that piece pseudo-legally attacked the piece with the given pieceNr. This made it easy to generate the moves victim by victim. It also helped in updating of the map, as when a square got evacuated, you could look up with a single memory access which pieces attacked it, and then displace the attack of the (discovered) sliders amongst those to the obstacle downstream.

For non-captures this is not a good method, as you don't want to generate those by to-square, and pieces have in general many more non-captures than captures, which you would all have to update when the piece moves. So it would be much more efficient to store the non-captures by piece that makes them (or by from-square). Then you would only have to run through the pieces to see where they can move to in a single memory access, rather than to run through all empty square to see what could land there. The moves of each piece could be stored as a bitboard, say in an array noncapts[pieceNr].

The speed now must come from incremental updating of the attackers[] and noncapts[] arrays, so that you don't have to generate those from scratch in every node. When a piece evacuates a square, attackers[pieceNr] would tell you which pieces attack it (friends as well as foes). You can mask the (not-yet-captured) sliders from this, look up their location, determine the direction of their attack, look up the location of the next obstacle downstream in the neighbor[] table, lookup what stands there on the board[], move the attack there by updating the attackers[] element of the piece that was standing there. And finally update the noncapts[] element for the discovered slider, e.g. by ORing it with a bitboard looked up from a table ray[evacuatedSqr][downstreamSquare].

When landing on an empty square you must update for blocking, and no attackers[] element is available for helping you with that (as these are only available for pieces). There isn't even a neighbor[] table element available for that, as these are kept track of only for occupied squares. This makes it more cumbersome. You would have to figure out first what the neighbors are. Which could be done by a ray scan over the mailbox board[], although this probably is the slowest method. You would only have to scan in 4 directions, though, as once you reach an occupied square, its neighbor[] element would be valid, and directly point you to the neighbor in the opposit direction. Once you have the neighbors, you could test if these can capture in the direction of the newly occupied square. If any of them do, you can add those neighbors to a newly created attackers[] element for the piece that is landing on the empty square. And if the neighbor that attacks you is a slider, the move blocked it, and you can remove the blocked moves from its noncapts[] element, by ANDing with ~ray[landingSquare][neighborSq].

So far no need for magic (or any other) attack getters; when you need attack sets you can directly load those from noncapts[] in the piece list. (We could adopt the convention that they also mark the square with the obstacle, although 'noncapts' would then be a poorly chosen name.) It would be nice if there was a more advanced way to figure out the nearest neighbors of a square that gets occupied, though. One could of course use a conventional attack getter for this, e.g. magic bitboards. But I wonder if there isn't a more efficient way here. Because we are only interested in getting a neighbor in one of the antipodal directions; once we have that the neighbor table will point us directly to the opposit neighbor. So we could hunt for neighbors in the direction of the carry propagation, using the latter to make life easier.

E.g. if 'mask[sqr]' would represent a single downstream ray starting at sqr in some direction, b = occupied & mask[sqr]; b &= -b; would give you a bitboard marking the nearest neighbor in the direction of the ray.

hgm · Post by **hgm** » Wed Apr 26, 2023 3:39 pm

It occurred to me that the horizon nodes would only have to calculate how much the number of pseudo-legal moves for the side to move changes because of the move leading to it. You don't care about the number of moves of the side that just moved. So you would, for instance, not have to calculate the number of moves the moved piece has from its new location.

The ways a move alters the number of moves of the opponent are:
1) the moves of the captured piece disappear
2) opponent protection of the to-square become actual captures
3) opponent slider moves over the to-square get blocked
4) opponent slider moves over the from-square get discovered

You will always have (4), but only for captures you will have (1) and (2), while for non-captures you will have (3) instead.

Now accounting for (1) would be a trivial lookup if we keep track of the mobility per piece incrementally, in addition to the total. (2) is just a popcount of the attackers[] element for the captured piece, masked for opponent attackers only. For (4) we would have to consult the attackers[] element of the moved piece before it moved, mask it for the enemy sliders, look up the location of those sliders in the piece list, determine the direction of their attack, look up the next downstream occupied square in that direction in the neighbor table, and determine its distance from the evacuated square. Sounds elaborate, but search statistics have shown that each move on the average only discovers 0.75 opponent slider on average. So often you would not have to do anything at all, because the moved piece was not attacked by sliders to begin with.

That leaves (3), which always is the Achilles heel of incremental methods. (And which in perft you will unfortunately have to do most of the time, as most moves in perft are non-captures.) You would have to determine what the nearest obstacles are in all 8 directions, which of those are enemy sliders, and for each of those proceed similar to (3) to see how many of their moves get lost by blocking. But since you don't really want to do the incremental update of the neighbor table and attack map here, but just would want to know how the mobility changes, you could run through the opponent sliders in the piece list, testing them for alignment with the to-square, and if there is (which usually will not be the case!) determine their distance from the to-square and current length of their path in that direction to know how many of their moves will be blocked.

It seems this would be enormously faster than calculating the number of moves for each and every piece of the opponent, whatever method you might use for that.

hgm · Post by **hgm** » Sun Apr 30, 2023 10:41 pm

I am still intrigued by this. Normally a p-perft(2) call (where p-perft is the pseudo-legal path count) would generate all moves for the side to move, and for each of the ~30 moves it thus obtains it would make them, and then call p-perft(1). Only the p-perft(1) is special, as it doesn't have to make the moves but can just count those. (Possibly by just taking the length of the generated move list.)

We could replace the perft(2) by a routine that in addition to creating a move list for the side to move, also runs the move generator for the opponent. For each piece it would first zero a counter, and then add the number of moves generated for that piece to it. If an attempted move hits a friendly piece, it can subtract 1 from the count of the latter, as that now effectively contributes -1 move by blocking one, in addition to the moves it can do itself. We also calculate the total of all moves we generated. (This would be p-perft(1) after null move.)

As a second preparation we calculates a blockCount for each friendly piece: when a slider move hits it, we look up in the neighbor table where the next downstream obstacle is, and add its distance to the blockCount.

After this preparation (which should take about as much time as doing perft(1) once), we are in a position to incrementally calculate the p-perft(1) of all ~30 daughter nodes incremantally. If the move is a capture, we can deduct the move count for the captured piece from the null-move p-perft(1), to account for the fact that it loses its own moves, and that the presence of an enemy instead of a friend on the capture square adds moves to other pieces which before where just 'protection'. We can also add the blockCount of the mover, to account for the discoverd slider moves over the from-square.

We would have to correct for the discoverd moves of the capture victim, though. So we would have to test whether that could capture to the from-square. Testing for alignment is enough, because we already know the path to it must be empty, as otherwise we could not have captured it. If the victim is a slider that can capture in the direction of the from-square we subtract the distance to the next obstacle in the opposit direction, as we had falsely added those as part of the blockCount of the mover.

For a non-capture it is a bit more involved: the treatment of the from-square (discovering moves) is the same as for captures. But for the to-square we would have to determine if any of the opponent sliders could capture to it. This can be done by getting the slider locations from the piece list, get the direction of their move to the to-square from an alignment table. If such a direction exists, we would determine the distance of the slider to the to-square by a table lookup, and the nearest neighbor of the slider in that direction from the neighbor table, and compare its distance from the slider with that of the to-square, to determine whether that latter is within the free path. If it is, the difference of the two is the number of blocked slider moves, and can be deducted from the p-perft(1) total for the daughter. In this case a correction is needed when the nearest neighbor of the slider in the relevant direction was equal to the from-square. In that case the move stayed on the ray, and we would have falsely assumed it discovered some moves of this slider. So then we would have to determine how many it were, to deduct them, by also looking up the next-nearest neighbor in that direction, and determine its distance to the from-square (like in the capture case).

All in all this would still be very much faster than generating all moves for the daughter from scratch. If we also can think of a fast way to determine how many of the moves in the daughter would be illegal, it would give a fantastically fast normal perft.

fast(‐ish) pseudo‐legal perft

fast(‐ish) pseudo‐legal perft

Re: fast(‐ish) pseudo‐legal perft

Re: fast(‐ish) pseudo‐legal perft

Re: fast(‐ish) pseudo‐legal perft

Re: fast(‐ish) pseudo‐legal perft

Re: fast(‐ish) pseudo‐legal perft

Re: fast(‐ish) pseudo‐legal perft

Re: fast(‐ish) pseudo‐legal perft

Re: fast(‐ish) pseudo‐legal perft

Re: fast(‐ish) pseudo‐legal perft