A comparison of some Perft programs

syzygy · Post by **syzygy** » Tue Jan 13, 2026 1:47 am

phhnguyen wrote: ↑Mon Jan 12, 2026 10:59 am
hgm wrote: ↑Sun Jan 11, 2026 11:46 am The key to doing fast perfts seems to be hashing.
Agreed. Hashing is one of the largest gain methods for Perft.

It also makes the result unreliable, although I guess you can get the error probability as low as you want by repeating the calculation with different sets of 128-bit Zobrist keys.

However, from my experience as well as reading somewhere, hashing may give a factor of about 2 times speed. My computer has 8 cores, and multithreading could bring me a factor of 8 times faster

Way more than a factor of 2 and the factor increases as the perft is deeper (at least if the available memory including optionally disk increases with it).

Uri Blass · Post by **Uri Blass** » Tue Jan 13, 2026 6:29 am

syzygy wrote: ↑Tue Jan 13, 2026 1:47 am
phhnguyen wrote: ↑Mon Jan 12, 2026 10:59 am
hgm wrote: ↑Sun Jan 11, 2026 11:46 am The key to doing fast perfts seems to be hashing.
Agreed. Hashing is one of the largest gain methods for Perft.
It also makes the result unreliable, although I guess you can get the error probability as low as you want by repeating the calculation with different sets of 128-bit Zobrist keys.

However, from my experience as well as reading somewhere, hashing may give a factor of about 2 times speed. My computer has 8 cores, and multithreading could bring me a factor of 8 times faster
Way more than a factor of 2 and the factor increases as the perft is deeper (at least if the available memory including optionally disk increases with it).

For the part of unreliable you can simply save all the board in the way that chest(the best mate solver that I know) does.
Note that I never saw a wrong mate claim by stockfish inspite of all the hash problem.

I wonder if it is possible to get a position when stockfish with a single core claims mate in x that we can reproduce when chest show mo mate in x score or maybe the only mistake that is possible is that stockfish does not see mate when there is a mate.

It may be interesting to compare analysis of stockfish with modified stockfish that use for hash tables all the board like chest(with the same number of positions in the hash table) to see what is the probablility to get a different analysis.

hgm · Post by **hgm** » Tue Jan 13, 2026 9:46 am

syzygy wrote: ↑Tue Jan 13, 2026 1:47 am
phhnguyen wrote: ↑Mon Jan 12, 2026 10:59 am
hgm wrote: ↑Sun Jan 11, 2026 11:46 am The key to doing fast perfts seems to be hashing.
Agreed. Hashing is one of the largest gain methods for Perft.
It also makes the result unreliable, although I guess you can get the error probability as low as you want by repeating the calculation with different sets of 128-bit Zobrist keys.

However, from my experience as well as reading somewhere, hashing may give a factor of about 2 times speed. My computer has 8 cores, and multithreading could bring me a factor of 8 times faster
Way more than a factor of 2 and the factor increases as the perft is deeper (at least if the available memory including optionally disk increases with it).

Encoding a board position without supernumerary pieces with a Hufman encoding takes at most 32 x 1 bits (empty squares) + 16 x 3 bits (pawns) + 16 x 5 bits (pieces, N, B, R or KQ) + 4 bits (to distingush K and Q) = 164 bits. Casting rights can be indicated by encoding a Rook that can still castle as Pawn. So a 192-bit key should be more than long enough to uniquely encode positions. (E.g. in a 32-byte entry with 24 bytes key signature and 8-byte count.)

Hufman keys are not that difficult to update incrementally, despite the variable-length coding. You could have a board-size array that would indicate where every square starts in the key, and with some masking and shifting cut out the piece from its old square, and insert it at its new square. Hash probes will always be slow, and it wouldn't pay for the nodes near the leaves to do them. So there would also be no reason for these to update the key.

With pext it would be trivial to keep a key where each square is encoded with the maximum length, and pack it to make the signature.

phhnguyen · Post by **phhnguyen** » Tue Jan 13, 2026 11:15 am

chessbit wrote: ↑Mon Jan 12, 2026 8:49 pm I'm curious how you compiled the code but I'm going to assume it's not using the highest optimization, because it was implemented with the intel compiler (best results by far).

phhnguyen wrote: ↑Sat Jan 10, 2026 11:25 pm 5) chessbit by Thomas Albert
The project was updated recently. I downloaded and ran the file chessbit_fastest.exe.

I didn't compile your code, but downloaded and ran your exe file chessbit_fastest.exe, from the 2025 Release. Was it the best one?

chessbit wrote: ↑Mon Jan 12, 2026 8:49 pm Fyi, with hashing and parallel processing, chessbit is about 10 times faster.
If you're interested, I have pushed the code including the TT and multithread here

Wow, sounds promising! Yes, I am interested.
However, I am not a good Windows user, especially not good at optimising Visual Studio, and I don't have the Intel compiler either. If you could recompile your new code into a new Release, I will try. Don't worry about the result. It is just a benchmark on an old computer, which doesn't mean much.
I highly appreciate your work and that of all others on the list!

phhnguyen · Post by **phhnguyen** » Tue Jan 13, 2026 11:35 am

hgm wrote: ↑Tue Jan 13, 2026 9:46 am So a 192-bit key should be more than long enough to uniquely encode positions. (E.g. in a 32-byte entry with 24 bytes key signature and 8-byte count.)

Hufman keys are not that difficult to update incrementally, despite the variable-length coding.

Do you want to use Hufman keys to replace completely Zobrist keys? If yes, I am curious how you could convert from your 192-bit Huffman key into a hash index? If you use only some (low) bits, say 20 bits for 1M Hash entries, it means many pieces won't be counted, making many problems, such as conflict, bad distributions... for hash entries. If you use any math algorithm on 192-bit numbers will be terribly slow.

Or do you use both Zobrist keys and Huffman keys (one for hashing, one for verifying data)?

hgm · Post by **hgm** » Tue Jan 13, 2026 12:23 pm

Good point. I was thinking of using the key just for the signature in the hash entry, hence the full three 64-bit words. The it doesn't really matter if you derive the index from it in an invertible way. But it would of course have to be dependent on the location of all pieces, or you would get too many collisions. So I would just multiply the Hufman key with some randomly picked constant, and use as many high bits of the product as I need.

Of course the depth has to be included in the key as well, but there seems enough room for that in 192 bits.

[Edit] I suppose that if you use N bits for the index, and pick a multiplier that has the lowest N bits zero, the original high bits would be reconstructable by multiplying only the low bits with the multiplier, and subtracting the high bits of that result from the index. Then you would not have to store these high bits in the signature, they would be fully redundant. With 1 GB hash and 32 bytes per entry you would have 32M entries, i.e. 25-bit index.

chessbit · Post by **chessbit** » Tue Jan 13, 2026 9:04 pm

phhnguyen wrote: ↑Tue Jan 13, 2026 11:15 am
chessbit wrote: ↑Mon Jan 12, 2026 8:49 pm I'm curious how you compiled the code but I'm going to assume it's not using the highest optimization, because it was implemented with the intel compiler (best results by far).

phhnguyen wrote: ↑Sat Jan 10, 2026 11:25 pm 5) chessbit by Thomas Albert
The project was updated recently. I downloaded and ran the file chessbit_fastest.exe.
I didn't compile your code, but downloaded and ran your exe file chessbit_fastest.exe, from the 2025 Release. Was it the best one?

chessbit wrote: ↑Mon Jan 12, 2026 8:49 pm Fyi, with hashing and parallel processing, chessbit is about 10 times faster.
If you're interested, I have pushed the code including the TT and multithread here
Wow, sounds promising! Yes, I am interested.
However, I am not a good Windows user, especially not good at optimising Visual Studio, and I don't have the Intel compiler either. If you could recompile your new code into a new Release, I will try. Don't worry about the result. It is just a benchmark on an old computer, which doesn't mean much.
I highly appreciate your work and that of all others on the list!

I just added a new release on github if you want to try. The hashing table requires like 6GB of RAM so if it's too much you will have to reduce it in the code. It's hardcoded I didn't bother making it customizable. (If you want I can also recompile using a smaller table)
Btw, the difference between the version you have tested and this one is about 30 times faster. Just run "perft -d 8" for the multithreading. Very curious to know your results, as I was not able to properly compare different engines myself!

phhnguyen · Post by **phhnguyen** » Wed Jan 14, 2026 1:06 pm

chessbit wrote: ↑Tue Jan 13, 2026 9:04 pm
phhnguyen wrote: ↑Tue Jan 13, 2026 11:15 am
chessbit wrote: ↑Mon Jan 12, 2026 8:49 pm I'm curious how you compiled the code but I'm going to assume it's not using the highest optimization, because it was implemented with the intel compiler (best results by far).

phhnguyen wrote: ↑Sat Jan 10, 2026 11:25 pm 5) chessbit by Thomas Albert
The project was updated recently. I downloaded and ran the file chessbit_fastest.exe.
I didn't compile your code, but downloaded and ran your exe file chessbit_fastest.exe, from the 2025 Release. Was it the best one?

chessbit wrote: ↑Mon Jan 12, 2026 8:49 pm Fyi, with hashing and parallel processing, chessbit is about 10 times faster.
If you're interested, I have pushed the code including the TT and multithread here
Wow, sounds promising! Yes, I am interested.
However, I am not a good Windows user, especially not good at optimising Visual Studio, and I don't have the Intel compiler either. If you could recompile your new code into a new Release, I will try. Don't worry about the result. It is just a benchmark on an old computer, which doesn't mean much.
I highly appreciate your work and that of all others on the list!
I just added a new release on github if you want to try. The hashing table requires like 6GB of RAM so if it's too much you will have to reduce it in the code. It's hardcoded I didn't bother making it customizable. (If you want I can also recompile using a smaller table)
Btw, the difference between the version you have tested and this one is about 30 times faster. Just run "perft -d 8" for the multithreading. Very curious to know your results, as I was not able to properly compare different engines myself!

I was confused. Your release has only an exe file named "chessbit.exe" but not perft.exe. BTW, I have downloaded it (chessbit.exe) and run it with 3 ways of params:
chessbit.exe
chessbit.exe -d 8
chessbit.exe perft -d 8

However, for all cases, chessbit.exe has terminated immediately without printing to the screen anything (blank).

Thanks for your work!

chessbit · Post by **chessbit** » Wed Jan 14, 2026 4:47 pm

phhnguyen wrote: ↑Wed Jan 14, 2026 1:06 pm
chessbit wrote: ↑Tue Jan 13, 2026 9:04 pm
phhnguyen wrote: ↑Tue Jan 13, 2026 11:15 am
chessbit wrote: ↑Mon Jan 12, 2026 8:49 pm I'm curious how you compiled the code but I'm going to assume it's not using the highest optimization, because it was implemented with the intel compiler (best results by far).

phhnguyen wrote: ↑Sat Jan 10, 2026 11:25 pm 5) chessbit by Thomas Albert
The project was updated recently. I downloaded and ran the file chessbit_fastest.exe.
I didn't compile your code, but downloaded and ran your exe file chessbit_fastest.exe, from the 2025 Release. Was it the best one?

chessbit wrote: ↑Mon Jan 12, 2026 8:49 pm Fyi, with hashing and parallel processing, chessbit is about 10 times faster.
If you're interested, I have pushed the code including the TT and multithread here
Wow, sounds promising! Yes, I am interested.
However, I am not a good Windows user, especially not good at optimising Visual Studio, and I don't have the Intel compiler either. If you could recompile your new code into a new Release, I will try. Don't worry about the result. It is just a benchmark on an old computer, which doesn't mean much.
I highly appreciate your work and that of all others on the list!
I just added a new release on github if you want to try. The hashing table requires like 6GB of RAM so if it's too much you will have to reduce it in the code. It's hardcoded I didn't bother making it customizable. (If you want I can also recompile using a smaller table)
Btw, the difference between the version you have tested and this one is about 30 times faster. Just run "perft -d 8" for the multithreading. Very curious to know your results, as I was not able to properly compare different engines myself!
I was confused. Your release has only an exe file named "chessbit.exe" but not perft.exe. BTW, I have downloaded it (chessbit.exe) and run it with 3 ways of params:
chessbit.exe
chessbit.exe -d 8
chessbit.exe perft -d 8

However, for all cases, chessbit.exe has terminated immediately without printing to the screen anything (blank).

Thanks for your work!

That is because the hashing table was not able to load (too much RAM needed). I would have implemented this as a command but I lost interest in developing it further for now

If you want, I have compiled it with a smaller table, I think any PC today should be able to run it (chessbit_26) => https://www.sendspace.com/file/rx102h

hgm · Post by **hgm** » Fri Jan 16, 2026 9:44 am

The latest (mailbox) engine I built used incrementally updated mobility. I wonder if this system could also be used in perft. After all, the final ply of a perft basically just returns the mobility. With the catch that it must be the legal mobility, rather than the pseudo-legal one. So you would have to deal with discovery and blocking of pins as well as of slider moves.

The most expensive part of incremental (pseudo-legal) mobility is adjusting the mobility for the blocking of opponent slider moves by non-captures in the forelast ply. The discovery is less of a problem, as it is caused by evacuating the from-square. And the piece on that square usually has many moves, that can share the result of that calculation. The to-squares typically are scattered much more, so they have to be treated individually. Pre-calculation of, say, how many opponent slider moves go over each square is not competative if the result of the caculation is not used multiple times, especially if for many squares it would not be used at all.

As an individual test for slider blocking one can test for alignment with each of the opponent sliders, which is a comparatively cheap table access based on the difference of square numbers, and which rules out most sliders. That an 'expensive' ray scan is then necessary for sliders that pass the test does not hurt very much. (And even then such a ray scan is of course extremely cheap compared to full move generation of all opponent's pieces.) I still don't like it that the alignment test has to be done for all opponent sliders individually, though.

Yesterday the following idea occurred to me. You could make a board-size table that holds a 'signature' for every square. This signature is then a set of ranks, files, diagonals and anti-diagonals (8 + 8 + 13 + 13 = 42 in total), where the bits are set for those rays the square is on. Alignment of two squares can then be tested by ANDing their signatures. If a set bit remains, there would be alignment, and the bit would tell you on which ray.

A pre-calculation could then OR the signature for all enemy sliders, after ANDing those with the directions they move in. (E.g. for a Rook we would clear the diagonal bits.) We could then AND the signature at the to-square of each move with this 'slider summary' as a primary filter. Moves for which this result is zero will immediately be eliminated as blocking any slider moves. If any bits are set these identify the ray on which blocking might occur, and as a consequence to some extent the slider thay could be blocked (orthogonal or diagonal). So you would typically only have to test 2 or 3 individually. It would also be possible to prepare a small table that for each ray that contains sliders specifies the sliders (moving along that ray) they contain. So you can simply look up which sliders the to-square is aligned with.

A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs

Re: A comparison of some Perft programs