I was going through the PTX and comparing the pipeline bottlenecks of all algos. Bitrotation needs to rotate 3x per ray.
GeneticObstructionDifference (the AST-sifted improvement of Obstruction Difference) did not emit optimal PTX.
So I eliminated all locals and optimized the shared memory into a struct. (This also answers the question of Array of Structs vs Struct of Arrays for these algos.)
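For anyone not familiar with the two layouts, here is a minimal sketch of what Array of Structs vs Struct of Arrays means for a per-square mask table. The struct itself is not shown in this post, so `LineMask`, `MasksAoS`, `MasksSoA` and the helper functions are hypothetical names, and the sketch only covers rank masks on the host side, not the actual shared-memory code.

```cpp
#include <array>
#include <cstdint>

// Hypothetical per-square record; the field names are illustrative only.
struct LineMask {
    std::uint64_t lower;  // squares on the line below the slider
    std::uint64_t upper;  // squares on the line above the slider
};

// Array of Structs: both masks of a square are adjacent in memory.
using MasksAoS = std::array<LineMask, 64>;

// Struct of Arrays: each mask kind has its own 64-entry array.
struct MasksSoA {
    std::array<std::uint64_t, 64> lower;
    std::array<std::uint64_t, 64> upper;
};

// Rank (horizontal) masks for square s, with a1 = bit 0 and h8 = bit 63.
inline LineMask rank_mask(int s) {
    const std::uint64_t rank  = 0xFFull << (s & 56);   // the 8 squares of the rank
    const std::uint64_t below = (1ull << s) - 1;       // all bits under s
    const std::uint64_t above = ~((2ull << s) - 1);    // all bits over s
    return {rank & below, rank & above};
}

inline MasksAoS make_aos() {
    MasksAoS m{};
    for (int s = 0; s < 64; ++s) m[s] = rank_mask(s);
    return m;
}

inline MasksSoA make_soa() {
    MasksSoA m{};
    for (int s = 0; s < 64; ++s) {
        const LineMask r = rank_mask(s);
        m.lower[s] = r.lower;
        m.upper[s] = r.upper;
    }
    return m;
}
```

Both tables hold the same data. With the AoS layout one lookup `m[s]` brings in every mask needed for square `s`, while the SoA layout needs one indexed read per mask kind.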
I am happy to announce a new best overall algorithm:
NVIDIA GeForce RTX 3080
Black Magic - Fixed shift: 6.53 GigaQueens/s
QBB Algo : 60.49 GigaQueens/s
Bob Lookup : 58.17 GigaQueens/s
Kogge Stone : 40.33 GigaQueens/s
Hyperbola Quiescence : 16.91 GigaQueens/s
Switch Lookup : 5.52 GigaQueens/s
Slide Arithm : 87.93 GigaQueens/s
Pext Lookup : 15.92 GigaQueens/s
SISSY Lookup : 8.33 GigaQueens/s
Dumb 7 Fill : 26.51 GigaQueens/s
Obstruction Difference : 67.78 GigaQueens/s
Genetic Obstruction Diff : 121.13 GigaQueens/s
Leorik : 61.69 GigaQueens/s
SBAMG o^(o-3cbn) : 71.36 GigaQueens/s
NO HEADACHE : 30.62 GigaQueens/s
AVX Branchless Shift : 29.45 GigaQueens/s
Slide Arithmetic Inline : 71.04 GigaQueens/s
C++ Tree Sifter - 8 Rays : 88.21 GigaQueens/s
Bitrotation o^(o-2r) : 113.19 GigaQueens/s
I also made sure to do the same thing for Bitrotation to make everything as optimized as possible. But 3x bit rotation is simply more work than 1x countl_zero!
The code is quite pleasing and the masks array is the same among many algorithms!
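To show where the single count-leading-zeros comes in, here is a minimal host-side sketch of the classic Obstruction Difference step for one line. It is not the genetic variant from the release. `LineMask` and `line_attacks` are hypothetical names, and `__builtin_clzll` is the GCC/Clang builtin standing in for `std::countl_zero` (or `__clzll` in CUDA).

```cpp
#include <cstdint>

// Hypothetical mask pair for one line through the slider's square.
struct LineMask {
    std::uint64_t lower;  // line squares below the slider
    std::uint64_t upper;  // line squares above the slider
};

// Attacked squares along one line, including the first blocker on each side.
inline std::uint64_t line_attacks(std::uint64_t occ, LineMask m) {
    const std::uint64_t lower = occ & m.lower;
    const std::uint64_t upper = occ & m.upper;
    // Highest blocker below the slider. "| 1" keeps the count defined
    // when there is no blocker at all (the result is then bit 0).
    const std::uint64_t msb = 0x8000000000000000ull >> __builtin_clzll(lower | 1);
    // Lowest blocker above the slider, or 0 if there is none.
    const std::uint64_t lsb = upper & (0 - upper);
    // 2*lsb - msb sets every bit from msb up to and including lsb
    // (or up to bit 63 when lsb is 0).
    const std::uint64_t diff = 2 * lsb - msb;
    return diff & (m.lower | m.upper);
}
```

Example: a rook on d1 (bit 3) with blockers on b1 and g1 uses the masks `{0x07, 0xF0}` and occupancy `0x4A`; the call returns `0x76`, i.e. b1, c1, e1, f1 and g1.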

I also created a GitHub release:
https://github.com/Gigantua/Chess_Moveg ... e_2022.exe
Please share your results!
Last words: the 121 billion queens per second figure is not the number of calculated squares. It's the number of queen positions looked up. The actual number of target squares calculated (relevant for actual perft or movegen) is the sum of set bits over each and every one of those 121 billion results per second.
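To make the distinction concrete, here is a small sketch of how the target-square count would be derived from a batch of lookup results. `total_target_squares` is a hypothetical helper and not part of the benchmark. `__builtin_popcountll` is the GCC/Clang builtin (C++20 has `std::popcount`, CUDA has `__popcll`).

```cpp
#include <cstdint>
#include <vector>

// Sum of set bits over all attack bitboards: one lookup per element,
// but possibly many target squares per lookup.
inline std::uint64_t total_target_squares(const std::vector<std::uint64_t>& attack_sets) {
    std::uint64_t total = 0;
    for (const std::uint64_t attacks : attack_sets) {
        total += static_cast<std::uint64_t>(__builtin_popcountll(attacks));
    }
    return total;
}
```

Three lookups returning `0x76`, `0xF7` and `0` count as 3 queens in the benchmark but as 5 + 7 + 0 = 12 target squares.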