Just a follow-up to this very old thread. I took a break from chess programming after the original discussion but recently got back to it. Comparing what I have now, I'm quite happy with the perft performance: second place out of the engines I tested (for whatever that's worth ...)
Perft 7 of startpos on 11th Gen Intel(R) Core(TM) i7-1165G7 @ 2.80GHz, just copying the relevant output from the programs:
- Gigantua (giga_gcc): Perft Start 7: 3195901860 2797ms 1142.31 MNodes/s
- Mine (./r startpos 7): perft 3195901860 divides 20 Mnps 702.45 time 4.549617
- ZeroLogic (echo "go perft 7" | eval time ./ZeroLogic): Mn/s: 439
- Stockfish 17 (echo "go perft 7" | eval time ./stockfish-ubuntu-x86-64-avx512): 12.455s, so 256.59 Mnps
So a distant second to Gigantua, but I'm not doing the crazy templating/specialization it does. I just refactored my code and made sure the compiler optimizations are actually kicking in. No transposition tables, no hashing, single threaded, just pump through the moves. Bulk counting is when, at depth one, you just add the number of generated moves instead of making each one, right? I'm doing that; it seems like the logical choice. Unlike a lot of chess code, I don't have an "undo-move": I just clone the position, make the move on the copy, and throw the copy away when it's done. A significant performance cost? I doubt it, and it saved me from having to figure out how to undo a move.
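For anyone curious what that structure looks like, here is a minimal sketch of a perft with bulk counting and copy-make. The `Position`, `Move`, `MoveList`-style types and the `generate_legal()`/`make()` calls are stand-ins for whatever your engine provides, not the actual names in my code; only the control flow matters.

```cpp
// Minimal perft sketch: bulk counting at depth 1, copy-make instead of undo.
#include <cstdint>
#include <vector>

struct Move { /* engine-specific */ };

struct Position {
    // engine-specific state (bitboards, side to move, castling rights, ...)
    void make(const Move& m);   // apply a move to this position in place
};

std::vector<Move> generate_legal(const Position& pos);  // engine-specific

uint64_t perft(const Position& pos, int depth) {
    std::vector<Move> moves = generate_legal(pos);

    // Bulk counting: at depth 1 the answer is just the number of legal
    // moves, so none of them actually need to be made.
    if (depth == 1)
        return moves.size();

    uint64_t nodes = 0;
    for (const Move& m : moves) {
        Position child = pos;          // copy-make: clone, never undo
        child.make(m);
        nodes += perft(child, depth - 1);
    }
    return nodes;
}
```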
I think compiler optimization actually plays a big part in these numbers. For example, I noticed the compiler replaced a bunch of bitwise logic code I have with blsr/tzcnt instructions. It had to understand that code well enough to reduce several lines of high-level language to a single assembly instruction. So how smart your compiler is plays a huge part.
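To illustrate what I mean, this is the kind of bitboard loop GCC/Clang will lower to tzcnt and blsr when BMI is available (e.g. -O2 -march=native). It's just an example, not my actual code; the per-square "work" here is simply summing the square indices.

```cpp
// The while(bb) idiom: extract lowest set bit index, then clear that bit.
#include <cstdint>
#include <cstdio>

int sum_of_set_squares(uint64_t bb) {
    int sum = 0;
    while (bb) {
        int sq = __builtin_ctzll(bb);  // index of lowest set bit -> tzcnt
        bb &= bb - 1;                  // clear lowest set bit    -> blsr
        sum += sq;
    }
    return sum;
}

int main() {
    // corner squares a1, h1, a8, h8 -> 0 + 7 + 56 + 63 = 126
    std::printf("%d\n", sum_of_set_squares(0x8100000000000081ULL));
}
```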