UPDATE: Leorik Lookup and Template Optimisations added
We have two more algos: Leorik and LeorikIL (the 0 kB version), both by lithander. Their performance profile differs from that of the other algos.
One thing that had not been tested yet: some board algorithms know the square as a constant at compile time. So I added a template optimisation where the int square can be passed as a template parameter instead of a function argument. This is the 4th comparison mode, and it gives
up to another +80% in some algos! So if you were to implement something on your own, this could be a good reference to compare against.
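To make the idea concrete, here is a minimal sketch of moving the square into a template parameter. The names `rook_attacks` and `rook_attacks_at` and the simple ray-walk generator are my own illustration (assumptions), not code from the repository. Any of the benchmarked algorithms could take the place of the ray walk.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical 0 kB rook attack generator (plain ray walk), for illustration only.
// Runtime variant: the square is an ordinary function argument.
constexpr uint64_t rook_attacks(int sq, uint64_t occ) {
    assert(sq >= 0 && sq < 64);
    constexpr int dr[4] = {1, -1, 0, 0};   // rank step per direction
    constexpr int df[4] = {0, 0, 1, -1};   // file step per direction
    uint64_t attacks = 0;
    for (int d = 0; d < 4; ++d) {
        int r = sq / 8 + dr[d];
        int f = sq % 8 + df[d];
        while (r >= 0 && r < 8 && f >= 0 && f < 8) {
            const uint64_t bit = 1ULL << (r * 8 + f);
            attacks |= bit;                // a blocker square is still attacked
            if (occ & bit) break;          // but the ray stops there
            r += dr[d];
            f += df[d];
        }
    }
    return attacks;
}

// Template variant: the square is a compile-time constant, so the compiler
// is free to fold everything that depends only on Sq (masks, shifts, loop bounds).
template <int Sq>
constexpr uint64_t rook_attacks_at(uint64_t occ) {
    static_assert(Sq >= 0 && Sq < 64, "square out of range");
    return rook_attacks(Sq, occ);
}

// With both inputs constant, the whole lookup is evaluated at compile time.
static_assert(rook_attacks_at<0>(0) == 0x01010101010101FEULL,
              "rook on a1, empty board");
```

A caller that already knows the square would write `rook_attacks_at<27>(occ)` instead of `rook_attacks(27, occ)`. How much this gains depends on the algorithm and on what the compiler manages to fold, which is why only some rows show templ: yes.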
So there you have it. Almost all algorithms are testable in Clang or Microsoft MSVC, in Singlethread, Multithread, Emulated Games and Constant Lookup mode for optimal comparison. Makefile included - just clone and run:
git clone https://github.com/Gigantua/Chess_Movegen.git
Not all algorithms profit from a known square; the last column (templ: yes/no) shows which ones do. Notice how much faster some algos get from that:
Template optimisation:
Exploading: 151.058MOps 6 kB Optimal perf: imul64 templ: no
Reference: 155.334MOps 0 kB Optimal perf: none templ: yes
Pext Emulated: 99.2648MOps 843 kB Optimal perf: none templ: no
KoggeStone: 113.804MOps 0 kB Optimal perf: none templ: no
RotatedBoard: 69.8078MOps 14 kB Optimal perf: none templ: no
QBB Algo: 186.446MOps 0 kB Optimal perf: countr_zero, countl_zero templ: yes
BobMike: 253.626MOps 8 kB Optimal perf: countr_zero, countl_zero templ: yes
Leorik: 228.003MOps 1 kB Optimal perf: countl_zero templ: no
Leorik Lookup: 202.608MOps 0 kB Optimal perf: countl_zero templ: no
Obstr. Diff: 258.002MOps 6 kB Optimal perf: countl_zero templ: no
Obstr. Inline: 212.233MOps 0 kB Optimal perf: countl_zero templ: yes
SlideArithm: 268.605MOps 2 kB Optimal perf: bzhi_u64, blsmsk_u64 templ: no
SlideA Inline: 193.436MOps 0 kB Optimal perf: bzhi_u64, blsmsk_u64 templ: no
Hyperbola Qsc: 316.284MOps 2 kB Optimal perf: bswap templ: no
Hyperb.Inline: 270.871MOps 0 kB Optimal perf: bswap templ: yes
SISSY BB: 262.717MOps 1409 kB Optimal perf: none templ: no
Hash Variable: 354.248MOps 729 kB Optimal perf: imul64 templ: yes
Hash Plain: 1010.3MOps 2306 kB Optimal perf: imul64 templ: no
Hash Fancy: 1199.83MOps 694 kB Optimal perf: imul64 templ: no
Pext : 1824.13MOps 843 kB Optimal perf: pext_u64 templ: yes
HyperCube: 329.147MOps 841 kB Optimal perf: none templ: yes
First commit perf:
Exploading: 150.89MOps 6 kB Optimal perf: imul64
Reference: 68.93MOps 8 kB Optimal perf: none
KoggeStone: 111.98MOps 0 kB Optimal perf: none
RotatedBoard: 92.37MOps 14 kB Optimal perf: none
QBB Algo: 171.72MOps 0 kB Optimal perf: countr_zero, countl_zero
BobMike: 211.32MOps 8 kB Optimal perf: countr_zero, countl_zero
SlideArithm: 256.04MOps 2 kB Optimal perf: bzhi_u64, blsmsk_u64
XorRookSub: 297.78MOps 2 kB Optimal perf: bswap
Hash Variable: 399.36MOps 729 kB Optimal perf: imul64
Hash Plain: 529.61MOps 2306 kB Optimal perf: imul64
Hash Fancy: 597.36MOps 694 kB Optimal perf: imul64
Pext : 925.24MOps 843 kB Optimal perf: pext_u64
HyperCube: 310.30MOps 841 kB Optimal perf: none
Can someone recommend a single-header table-printing library? Then I can add the author and reference URL to the table itself.
Does somebody have the Dumb7Fill source code?
If you really want to go above and beyond:
Here is the fastest compiler setup known on earth: Clang + Facebook's BOLT in one package
https://github.com/facebookincubator/BOLT
So you can build your own Clang 14 with a post-link optimiser and compile the makefile with that. This yields another +15% because of the optimal memory layout for the bigger algos.
With all the optimisations discussed in this thread - you get a huge bonus in performance for almost all algorithms.
The performance here is still a WORST CASE because every lookup writes to a volatile uint64_t. During normal operation these lookups overlap with, and are hidden behind, other calculations (20 years of superscalar CPU architectures).
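For readers who want to see what "every lookup sets a volatile uint64_t" means in practice, here is a rough sketch of such a measurement loop. It is not the repository's harness: `dummy_lookup`, `sink` and `measure_mops` are hypothetical names, and the lookup is a stand-in rather than one of the benchmarked algorithms.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// Hypothetical stand-in for a slider lookup (not a real attack generator).
inline uint64_t dummy_lookup(int sq, uint64_t occ) {
    return (occ * 0x9E3779B97F4A7C15ULL) >> (sq & 63);
}

// Every result is written here. Because the variable is volatile, the compiler
// may not drop or merge the stores, so each lookup has to be fully computed.
volatile uint64_t sink;

// Returns throughput in million lookups per second (MOps).
double measure_mops(uint64_t iterations) {
    using namespace std::chrono;
    assert(iterations > 0);
    uint64_t occ = 0x0123456789ABCDEFULL;
    const auto start = steady_clock::now();
    for (uint64_t i = 0; i < iterations; ++i) {
        // Cheap LCG step so the occupancy changes on every iteration.
        occ = occ * 6364136223846793005ULL + 1442695040888963407ULL;
        sink = dummy_lookup(static_cast<int>(i & 63), occ);
    }
    const double secs = duration<double>(steady_clock::now() - start).count();
    return static_cast<double>(iterations) / secs / 1e6;
}
```

In an engine the result would feed straight into the next calculation instead of being written to memory each time, so numbers measured this way should be read as a lower bound.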
Have a nice day -
Daniel