Writing the fastest move generator. Up to 4BNodes/s

Discussion of chess software programming and technical issues.

Moderator: Ras

chessbit
Posts: 22
Joined: Fri Dec 29, 2023 4:47 pm
Location: Belgium
Full name: thomas albert

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by chessbit »

Aleks Peshkov wrote: Wed Sep 17, 2025 3:00 am
chessbit wrote: Tue Sep 16, 2025 10:05 pm For those interested, I have released the code on my dusty github: https://github.com/thuijbregts/chessbit
There are some notes about the implementation, and also a binary if you want to try it yourself. I would be interested to know how it performs on your machine!
Did not found any secret that any other bitboard legal generator on the planet does not do. BTW Stockfish goes fine with pseudolegal generator. But the code looks nice and pleasant to see. Well done!
I think the secret lies in many optimizations that make the code look simple yet effective. I don't know how else to explain a boost of 25-30% on Gigantua, which was already a much faster perft engine than the rest as far as I know.
chessbit
Posts: 22
Joined: Fri Dec 29, 2023 4:47 pm
Location: Belgium
Full name: thomas albert

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by chessbit »

syzygy wrote: Tue Sep 16, 2025 10:59 pm With clang++ compilation works, but it is not that much faster:
Glad that it works with clang++. I would really recommend using the binary available on github, which was compiled with the Intel compiler. The code is much faster than with clang, normally.

Not much faster than what, btw?
User avatar
Steve Maughan
Posts: 1295
Joined: Wed Mar 08, 2006 8:28 pm
Location: Florida, USA

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by Steve Maughan »

abulmo2 wrote: Wed Sep 17, 2025 11:55 am.<snip>
So I do not think chessbit is of interest in building a chess engine. And a speed of 100 Mnps on a single core is surely an illusion on current hardware, even for a simple engine.
Fascinating. You may we'll be right. My latest engine (Juggernaut) is written in Delphi (Pascal), which isn't the most performant language. I've tried to squeeze every last drop of speed out of it but it does perft 7 from the starting position in ~20 seconds (~160 Mnps), with bulk counting, on a fast laptop. In a normal middlegame position it searches ~7 mnps. The ratio is approximately 20:1, hence the 100 Mnps. Juggernaut also updates the evaluation and position hash, even in the perft — maybe this is more of a factor than I realized.

Thanks,

Steve
http://www.chessprogramming.net - Juggernaut & Maverick Chess Engine
Aleks Peshkov
Posts: 914
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia
Full name: Aleks Peshkov

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by Aleks Peshkov »

Steve Maughan wrote: Wed Sep 17, 2025 2:29 pmThe ratio is approximately 20:1
It is not. If a perft engine reports real nodes per second (not perft number per second), difference in NPS between perft(minimax) and alpha-beta should be less then 2 times.
syzygy
Posts: 5713
Joined: Tue Feb 28, 2012 11:56 pm

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by syzygy »

chessbit wrote: Wed Sep 17, 2025 12:56 pm
syzygy wrote: Tue Sep 16, 2025 10:59 pm With clang++ compilation works, but it is not that much faster:
Glad that it works with clang++. I would really recommend using the binary available on github, which was compiled with the Intel compiler. The code is much faster than with clang, normally.

Not much faster than what, btw?
Not much faster than what I got with gcc, which seemed to refuse to inline everything that was supposed to be inlined.
Dmi3
Posts: 4
Joined: Thu Sep 11, 2025 11:15 am
Location: Russia
Full name: Dmitry Molodov

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by Dmi3 »

Aleks Peshkov wrote: Wed Sep 17, 2025 5:06 pm
Steve Maughan wrote: Wed Sep 17, 2025 2:29 pmThe ratio is approximately 20:1
If a perft engine reports real nodes per second (not perft number per second), difference in NPS between perft(minimax) and alpha-beta should be less then 2 times.
Why? It all depends on the cost of QS, the presence or absence of Null-Move pruning (or Internal Iterative Deepening), and the cost of evaluation (incremental PESTO vs NEURO).
Aleks Peshkov
Posts: 914
Joined: Sun Nov 19, 2006 9:16 pm
Location: Russia
Full name: Aleks Peshkov

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by Aleks Peshkov »

Dmi3 wrote: Wed Sep 17, 2025 5:42 pm
Aleks Peshkov wrote: Wed Sep 17, 2025 5:06 pm
Steve Maughan wrote: Wed Sep 17, 2025 2:29 pmThe ratio is approximately 20:1
If a perft engine reports real nodes per second (not perft number per second), difference in NPS between perft(minimax) and alpha-beta should be less then 2 times.
Why? It all depends on the cost of QS, the presence or absence of Null-Move pruning (or Internal Iterative Deepening), and the cost of evaluation (incremental PESTO vs NEURO).
Bulk node counting is close amount of work to what chess program with legal move generator and simple PST eval do at leafs. In my perft I do TT lookup and update all incremental stuff (evaluation, repetition stack, attack tables in my case).
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by Rein Halbersma »

abulmo2 wrote: Wed Sep 17, 2025 11:55 am The trick that makes chessbit faster is CTFE (compile time function evaluation) abusing metaprogramming with constexpr & templates. The recursive call to perft and the move generation are intricate and the compiler eliminates all the recursive calls up to depth 18 (the upper limit in chessbit) by slowly building a huge executable. On my computer chessbit executable size is 12.8 Mb versus 83kb for mperft.
3 years ago I realized that with C++20, you can simply put constexpr in front of everything and compute perft inside a static_assert. Did this for my draughts engine to depth 7 and 1M nodes. https://damforum.nl/viewtopic.php?t=8530
benvining
Posts: 38
Joined: Fri May 30, 2025 10:18 pm
Full name: Ben Vining

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by benvining »

Awesome! I was only able to get up to depth 4 within constexpr before my compiler complained that I'd hit the constexpr step limit. I haven't fussed with any flags for this, did you need to do that?
abulmo2
Posts: 473
Joined: Fri Dec 16, 2016 11:04 am
Location: France
Full name: Richard Delorme

Re: Writing the fastest move generator. Up to 4BNodes/s

Post by abulmo2 »

chessbit wrote: Wed Sep 17, 2025 12:50 pm
Could you try the executable available on github? The code was optimized with the Intel compiler from the start, clang and other compilers yield (much) worse results. I think it would be fairer to compare different engines with the best possible compilation results. For instance, I don't know if your engine would be better optimized with the Intel compiler.
I am basically a Linux user, not a Windows one. But I did try you executable on my son's computer:
I got 1.55BNps +/- 0.01Bnps. On the same computer, compiling your code with clang (the vanilla version, not the one included with MSVC) , I got slightly faster result: 1.58bnps +/- 0.01Bnps.
The thread is deviating into chess engine discussions, but this was not the intended goal of this project. I only wrote this code with a perft engine in mind. That being said, I think it would not be too difficult to change the existing code for the purpose of a chess engine. I might implement one just to prove it but I need a break :).
Well the title of the thread is about a move generator, not perft :-)
Even for perft, I do not think it is the fastest one, as it is missing a hashtable and parallel computation. For example, my program mperft, on my son's computer, using an hashtable is faster:

Code: Select all

.\mperft-1.0.exe -d 8 -H 24 -b
[...]
perft  8 :     84998978956 leaves in     16.563 s   5131858900 leaves/s
that's 5.13 Bnps.

The fastest perft can do parallel computation on GPUs: https://github.com/ankan-ban/perft_gpu. This one was fast enough to compute perft 15 of the starting position in a few days on distributed computers.
Richard Delorme