Extension of bulk counting for specialized perft engines

Discussion of chess software programming and technical issues.

Moderator: Ras

alexjasson
Posts: 4
Joined: Fri Jul 26, 2024 8:10 am
Full name: Alex Jasson

Extension of bulk counting for specialized perft engines

Post by alexjasson »



tl;dw:
- It's faster to not generate any moves (in memory) and instead just directly count the number of moves in the position at "depth 1". This is pretty obvious and as far as I know Daniel Infuehr already did this in Gigantua.
- You don't need to explore moves at "depth 2" if they aren't going to affect the other players moves. You can just multiply the number of these moves by the number of moves the other player has.
- Might be possible to count all nodes at "depth 2" if you could calculate how much the current players moves will affect the next players moves. But this would be extremely time consuming to try and implement.

(Yes this is completely useless for chess engines)

Other random observations:
- PEXT is about 10% faster on my CPU (Intel i5-8400)
- Make/unmake vs copying made very little difference on my CPU

https://github.com/alexjasson/templechess
User avatar
hgm
Posts: 28476
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Extension of bulk counting for specialized perft engines

Post by hgm »

Some time ago I did some preliminary designing on such an incremental perft algorithm, before other activities required my attention.

The tricky thing is that you are dealing with legal moves, while the incremenal mobility I do in engines is typically based on pseudo-legal mobility. Moves that look innocent because they do not block or discover any opponent pieces can still create or abandon pins, or deliver checks, all of which would affect how many of the opponent's pseudo-legal moves become or remain illegal.

My conclusion was that it was doable, mostly in a quite efficient way. Most problematic case were moves that pin an opponent piece that was previously attacking the pinner from another direction. Normally creating a pin would make the piece lose all its moves not along the pin ray, and the algorithm would treat that by subtracting the number of pseudo-legal moves of the piece (which is kept track of in the piece table, as usual) from the opponent total, and adding moves along the pin ray (which are obtained as a nearly free side effect of detecting the pin). But when the pinning piece also unblocks or discovers moves of the pinned piece the recorded number of pseudo-legal moves of the latter is no longer correct.
alexjasson
Posts: 4
Joined: Fri Jul 26, 2024 8:10 am
Full name: Alex Jasson

Re: Extension of bulk counting for specialized perft engines

Post by alexjasson »

Interesting, I haven't read that post but it describes the same challenges I had when I implemented this algorithm over a year ago (https://github.com/alexjasson/templeche ... ses/tag/v1). Here you can see the first implementation of the algorithm which is slightly less efficient than the current one. I briefly tried counting all the moves at depth 2 similar to how you described but couldn't find an efficient method of doing so. I also wasn't sure if an efficient method is even possible.

Counting the full set of moves at depth 2 is much harder, but it would be pretty cool if someone can implement an algorithm to do it efficiently! I just posted this here in hopes someone might figure it out. But I will probably not be doing it myself.
abulmo2
Posts: 491
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Extension of bulk counting for specialized perft engines

Post by abulmo2 »

alexjasson wrote: Wed Mar 11, 2026 3:29 am - It's faster to not generate any moves (in memory) and instead just directly count the number of moves in the position at "depth 1". This is pretty obvious and as far as I know Daniel Infuehr already did this in Gigantua.
H.G. Muller's qperft used bulk counting since at least 2007 viewtopic.php?p=116586#p116586
- You don't need to explore moves at "depth 2" if they aren't going to affect the other players moves. You can just multiply the number of these moves by the number of moves the other player has.
That's interesting. I did some experiment with your code, with (perft-2) or without (perft-1) counting nodes through MoveSetMultiply:
On the standard starting position:

Code: Select all

$ time perft-1 "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 7 | grep Nodes
Nodes searched: 3195901860

real	0m4,230s
$ time perft-2 "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 7 | grep Nodes
Nodes searched: 3195901860

real	0m2,669s
Here your technique seems interesting as perft-2 is 1.58x faster than perft-1

On kiwipete:

Code: Select all

$ time perft-1 "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1" 6 | grep Nodes
Nodes searched: 8031647685

real	0m7,396s
 $ time perft-2 "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1" 6 | grep Nodes
Nodes searched: 8031647980

real	0m7,040s
Here it the improvement is less important as perft-2 is just 1.05x faster, ie a mere 5%.

Other important technique is to use an hash table and multithreading. With my own program mperft (version 4.0):
No bulk counting, no transposition table, no multithreading

Code: Select all

$mperft-4.0 -d 7 -q
perft  7 :         3195901860 leaves in     38.151 s       83769574 leaves/s
With bulk counting at depth 1 only:

Code: Select all

$mperft-4.0 -d 7 -q -b
perft  7 :         3195901860 leaves in      3.165 s     1009866103 leaves/s
With bulk counting & transposition table of 256 MBytes:

Code: Select all

$mperft-4.0 -d 7 -q -b -h 256
perft  7 :         3195901860 leaves in      0.682 s     4683992428 leaves/s
With bulk counting & transposition table of 256 MBytes & 32 threads:

Code: Select all

$mperft-4.0 -d 7 -q -b -h 256 -t 32
perft  7 :         3195901860 leaves in      0.051 s    62623611096 leaves/s
Specialized bulk counting is 5.63x faster than no bulk counting (ie making all moves and incrementin the node counter by 1). A transposition table speeds up perft by 9.94x and multithreading with 32 threads by another 13.37x (My CPU has got 16 physical cores and smp enabled).
Richard Delorme
alexjasson
Posts: 4
Joined: Fri Jul 26, 2024 8:10 am
Full name: Alex Jasson

Re: Extension of bulk counting for specialized perft engines

Post by alexjasson »

abulmo2 wrote: Thu Mar 12, 2026 10:41 am
- You don't need to explore moves at "depth 2" if they aren't going to affect the other players moves. You can just multiply the number of these moves by the number of moves the other player has.
That's interesting. I did some experiment with your code, with (perft-2) or without (perft-1) counting nodes through MoveSetMultiply:
On the standard starting position:

Code: Select all

$ time perft-1 "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 7 | grep Nodes
Nodes searched: 3195901860

real	0m4,230s
$ time perft-2 "rnbqkbnr/pppppppp/8/8/8/8/PPPPPPPP/RNBQKBNR w KQkq - 0 1" 7 | grep Nodes
Nodes searched: 3195901860

real	0m2,669s
Here your technique seems interesting as perft-2 is 1.58x faster than perft-1

On kiwipete:

Code: Select all

$ time perft-1 "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1" 6 | grep Nodes
Nodes searched: 8031647685

real	0m7,396s
 $ time perft-2 "r3k2r/p1ppqpb1/bn2pnp1/3PN3/1p2P3/2N2Q1p/PPPBBPPP/R3K2R w KQkq - 0 1" 6 | grep Nodes
Nodes searched: 8031647980

real	0m7,040s
Here it the improvement is less important as perft-2 is just 1.05x faster, ie a mere 5%.
That's correct. If you choose a position where the subset of moves counted at depth 2 is smaller (ie open positions with lots of pieces), then it won't be as fast. To test it fairly you'd need to test it on a lot of random positions and take an average. But the current algorithm is more of a proof of concept. I'll give full credit to anyone who can make any small improvements on it.

At the moment the algorithm explores captures, blocks, checks, discovered checks, potential pins, pins, castling, castling disruption, double pushes, en passant and promotion to depth 1. There's still a lot more room for improvement if you can add some of those moves to the subset of moves that gets counted at depth 2. For example, if you add capturing moves to the subset and then subtract the number of moves the captured piece had from the total node count.
Other important technique is to use an hash table and multithreading. With my own program mperft (version 4.0):
No bulk counting, no transposition table, no multithreading

Code: Select all

$mperft-4.0 -d 7 -q
perft  7 :         3195901860 leaves in     38.151 s       83769574 leaves/s
With bulk counting at depth 1 only:

Code: Select all

$mperft-4.0 -d 7 -q -b
perft  7 :         3195901860 leaves in      3.165 s     1009866103 leaves/s
With bulk counting & transposition table of 256 MBytes:

Code: Select all

$mperft-4.0 -d 7 -q -b -h 256
perft  7 :         3195901860 leaves in      0.682 s     4683992428 leaves/s
With bulk counting & transposition table of 256 MBytes & 32 threads:

Code: Select all

$mperft-4.0 -d 7 -q -b -h 256 -t 32
perft  7 :         3195901860 leaves in      0.051 s    62623611096 leaves/s
Specialized bulk counting is 5.63x faster than no bulk counting (ie making all moves and incrementin the node counter by 1). A transposition table speeds up perft by 9.94x and multithreading with 32 threads by another 13.37x (My CPU has got 16 physical cores and smp enabled).
Thanks for those stats. I just had a look at your mperft project on github. It's very impressive, well done. I haven't added multithreading or a transposition table yet. Hopefully some day I'll get around to it.
User avatar
hgm
Posts: 28476
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Extension of bulk counting for specialized perft engines

Post by hgm »

The algorithm I had in mind would at 2 ply first generate opponent moves, (i.e. bulk-counting pseudo-legal perft(1) after null move), and count those on a per-piece basis (the piece its 'move count'). For sliders that can capture an enemy, it would continue the slide to see how many more moves it would get in addition, and this would be added to a 'discovery count' of the blocking piece. (Note that this would cheaply detect discovered-check possibilities, if the extended slider path hits our King.) Pawn non-captures are considered sliding, and also add to the discovery count of a piece blocking those. Pieces attacked by a Pawn would get 1 subtracted from this discovery count, as the Pawn capture would disappear.

For capture moves that hit a friendly piece, the move count of that piece would be decremented, because capturing this piece would take its moves away, but would make the recaptures possible.

This makes it possible to quickly calculate the number of opponent pseudo-legal moves after removing a piece that will be moved in the perft(2) node: add its discovery count to the determined total after null move. You then still have to correct for placing that piece in its destination. If the move is a capture this would be achieved by subtracting the move count of the captured piece.

Special case is when a piece captures the piece that was blocking it. The described calculation would fail, as it would still count the moves that were discovered of a piece that is already captured. This can be easily detected, though, and such moves are sufficiently rare that you can do a normal perft(1) on those. Or, as a refinement, we could recalculate the number of moves that would have been discovered if the piece had not been captured. Or even store that in advance, in a 16x16 table that holds the number of moves a piece of the side to move would discover of each enemy piece.

This still leaves the matter of the effect of the piece placement. This can block slider moves. Where the calculation of the effect of 'lifting' a piece can be shared by all moves of that piece, blocking would have to be calculated on a per-move basis. Which makes it likely the most expensive part. We would have to determine which enemy sliders pass over the to-square of the move, and how many moves these would lose.

All this would eventually only give us the number of opponent pseudo-legal moves after each move at the perft(2) node. This would have to be corrected for the illegal moves: stepping the King into check, of moving away pinned pieces. Pins are probably rare, so it is not imporatant to handle that efficiently. They should be detected efficiently, though. But when generating the moves of the stm at 2 ply testing each slider for alignment with the enemy King, and extending the slide in that direction to the next piece in the path (if the first piece was an enemy) to see if it hits that King is cheap. (Especially since usually there wouldn't be alignment.) That gives existing pins, and we would need per-move testing to see if such a pin is abandoned (by moving the pinner, or blocking it). We would also have to test for new pins.

Moves would also have to be tested for delivering check; this is also rare, and we should call a dedicated evasion perft(1) for those.
alexjasson
Posts: 4
Joined: Fri Jul 26, 2024 8:10 am
Full name: Alex Jasson

Re: Extension of bulk counting for specialized perft engines

Post by alexjasson »

The above algorithm looks good. Most of the opponent pseudolegal moves involve blocking/unblocking moves and that's where you would get a lot of speed increase. Captures are not as common. When I tried this before, I handled blocking/unblocking of sliders by just recalculating their moves again with lookups. This ended up being roughly the same speed as just exploring the moves to depth 1. If you can handle unblocks of sliders without needing to do any extra lookups like you said, it would be faster than what I tried.

I didn't find any efficient way of handling illegal moves. Right now any move that might possibility affect king moves of the opponent is just explored to depth 1.

Thanks for taking interest in this thread.
User avatar
hgm
Posts: 28476
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Extension of bulk counting for specialized perft engines

Post by hgm »

The dilemma is always what info to collect that might speed things up. Because collecting that info often requires to clear it for all pieces or all squares. E.g. it would be nice to have an 'attack map', which for each piece would indicate which enemy pieces attack / protect it. Updating that during move generation is relatively cheap, you just set the bit corresponing to the piece you are generating moves of in the map element for the pieces it hits. But the map would then have to be cleared entirely at the start of the node.

So it might be faster to record this info per attacker: keep a bitmap of all pieces a given enemy piece attacks or protects. That still requires clearing of the elements corresponding to all enemy pieces, but clearing and updating the info then is done one piece at the time, so that it likely can be kept in a CPU register, rather than having to be shuttled to and from memory all the time. Also in this form the info is helpful: when you capture a piece you can cheaply check if the attacker occurred in the set of pieces attacked by the victim, after applying a mask that only leaves sliders. So that you know whether you have to correct for discovering moves of a piece that is no longer there.

Recording info on empty squares (such as which sliders attack it) is usually too expensive; there are many such squares for which the info would require clearing, and sliders typically can reach many empty squares. This makes it a problem to figure out how many enemy slider moves get blocked by landing on an empty square (which most moves in perft unfortunately do). The algorithm I described in the other thread might be a good primary filter:

At the start of the node:

Code: Select all

sliderRays = 0;
sliders = ALL_ENEMY_SLIDERS;
for(p = pop_lsb(sliders)) sliderRays |= signature[location[p]] & moveDirections[p];
And when playing a move:

Code: Select all

hit = signature[toSqr] & sliderRays;
if(hit) { // land on ray with enemy slider
  orth = hit & ORTHOGONAL_RAYS;
  if(orth) {
    sliders = ALL_ENEMY_SLIDERS & ORTHOGONALS;
    for(p = pop_lsb(sliders)) if(signature[location[p]] & orth) { // slider on same file or rank
      ... // test if free path
    }
  }
  diag = hit & DIAGONAL_RAYS;
  if(diag) {
    sliders = ALL_ENEMY_SLIDERS & DIAGONALS;
    for(p = pop_lsb(sliders)) if(signature[location[p]] & diag) { // slider on same diagonal
      ... // test if free path
    }
  }
}
The idea is that most moves would not have a hit, so that this code is rarely executed.
abulmo2
Posts: 491
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Extension of bulk counting for specialized perft engines

Post by abulmo2 »

An interesting old thread about perft with nullmove, which looks similar with what you do:
forum3/viewtopic.php?p=638511
Richard Delorme