Comparison of all known Sliding lookup algorithms

smatovic · Post by **smatovic** » Wed Apr 20, 2022 1:09 pm

tcusr wrote: ↑Sun Jan 16, 2022 7:38 pm
smatovic wrote: ↑Sun Jan 16, 2022 6:36 pm
tcusr wrote: ↑Thu Jan 06, 2022 5:18 pm kogge stone vectorized seems promising, see https://zeta-chess.app26.de/post/64-bit ... tor-based/
i implemented kogge stone with GCC's builtin vectors in my CPU engine and while they were much faster than normal kogge stone it only gave a 7% speed up in search. it's not worth it
That is interesting, I assume you run on 256b AVX2, I also had the Intel Xeon Phi with 2x512b vector-units per x86 core in mind during design, to handle 16 bitboards in one run, some current Intel Xeons seem to have 2x512b units, if this descends to the desktop market it might be worth to take a look again...

--
Srdja
yes, to be competitive -mavx2 is required.
btw i have no fixed sized vector in my code, i templated the directions so that the functions will use a vector big enough to contain the number of directions, the only downside is that they have to have the same sign because of shifting, which i have not found a solution yet.

Inspired by your post with GCC vector-extensions I implemented ulong2 and ulong4 vectorized Kogge-Stone in Zeta Dva:

https://github.com/smatovic/ZetaDva/blo ... /movegen.c

I see only lil overall speedup compared to standard Kogge-Stone, maybe 10%, running on AVX2, SSE2 compile is ~50% slower, so I assume it utilizes the SIMD unit, but I am not sure if I have done it right, AFAIK Skylake has 4x64-bit scalar ALUs and 256-bit AVX2, so in theory the performance should be the same? I am not sure, how you use templates to hide fix sized vectors, but it would be interesting to ponder about how to use wider bit-width, 512b+, for multiple boards in upcoming architectures, I can only imagine to run n*pieces*ulong4 shifts at a same time, or alike.

--
Srdja

tcusr · Post by **tcusr** » Wed Apr 20, 2022 7:08 pm

smatovic wrote: ↑Wed Apr 20, 2022 1:09 pm
tcusr wrote: ↑Sun Jan 16, 2022 7:38 pm
smatovic wrote: ↑Sun Jan 16, 2022 6:36 pm
tcusr wrote: ↑Thu Jan 06, 2022 5:18 pm kogge stone vectorized seems promising, see https://zeta-chess.app26.de/post/64-bit ... tor-based/
i implemented kogge stone with GCC's builtin vectors in my CPU engine and while they were much faster than normal kogge stone it only gave a 7% speed up in search. it's not worth it
That is interesting, I assume you run on 256b AVX2, I also had the Intel Xeon Phi with 2x512b vector-units per x86 core in mind during design, to handle 16 bitboards in one run, some current Intel Xeons seem to have 2x512b units, if this descends to the desktop market it might be worth to take a look again...

--
Srdja
yes, to be competitive -mavx2 is required.
btw i have no fixed sized vector in my code, i templated the directions so that the functions will use a vector big enough to contain the number of directions, the only downside is that they have to have the same sign because of shifting, which i have not found a solution yet.
Inspired by your post with GCC vector-extensions I implemented ulong2 and ulong4 vectorized Kogge-Stone in Zeta Dva:

https://github.com/smatovic/ZetaDva/blo ... /movegen.c

I see only lil overall speedup compared to standard Kogge-Stone, maybe 10%, running on AVX2, SSE2 compile is ~50% slower, so I assume it utilizes the SIMD unit, but I am not sure if I have done it right, AFAIK Skylake has 4x64-bit scalar ALUs and 256-bit AVX2, so in theory the performance should be the same? I am not sure, how you use templates to hide fix sized vectors, but it would be interesting to ponder about how to use wider bit-width, 512b+, for multiple boards in upcoming architectures, I can only imagine to run n*pieces*ulong4 shifts at a same time, or alike.

--
Srdja

Code: Select all

template<size_t size>
using vec __attribute__((vector_size(8 * size))) = uint64_t;

i don't have the code anymore but this is how did it, all the functions then need a "size" template parameter where you can then generalize the code for every vector, you don't even need to pass it explicitly because of template deduction rules. but it's a lot of boilerplate though.
you may also want to look into this for portability.

i really can't comment about performance because i don't know enough about it.

dangi12012 · Post by **dangi12012** » Sat Apr 23, 2022 3:05 pm

Final Algorithmic Updates!

Code: Select all

AMD Ryzen 9 5950X 16-Core Processor

Million Lookups/s Random Squares, Random Occupation/s:
Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
SBAMG o^(o-3cbn)                   268.565725                    576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline                       405.486229                    0       [0kb]       countl_zero, bswap       yes       Syed Fahad and Daniel Inf�hr                 http://www.talkchess.com/forum3/viewtopic.php?t=59845
Hyperbola Quintessence o^(o-2r)    626.543516                    256     [2kb]       bswap                    no        Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline      572.912688                    0       [0kb]       bswap                    yes       Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Genetic 8 Ray                      161.060638                    0       [0kb]       bswap                    no        Daniel Inf�hr (dangi12012)                   Abstract C++ Syntax Tree Sifter (c) Daniel Infuehr
Bitrotation                        117.578817                    0       [0kb]       ReverseBits              no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=20
Binary Neural Network              36.595017                     5852    [45kb]      pdep_u64, AVX2           no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79332
Exploding Bitboards                74.412098                     768     [6kb]       imul64                   no        Harald L��en                                  http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=4523&start=80
Reference (Switch Lookup)          83.331655                     0       [0kb]       none                     yes       Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78235&p=907362&hilit=espresso#p907362
AVX Branchless Shift               278.273130                    0       [0kb]       AVX2                     no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=60
Pext Emulated                      66.443490                     107904  [843kb]     none                     no        Zach Wegner                                  https://randombit.net/bitbashing/posts/haswell_bit_permutations.html
Dumb7 Fill                         298.717010                    0       [0kb]       none                     no        Gunnar Andersson                             https://www.chessprogramming.org/Dumb7Fill
Kogge-Stone                        498.975022                    0       [0kb]       none                     no        Peter M. Kogge, Harold S. Stone              https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards                  46.388734                     1848    [14kb]      none                     no        Robert Hyatt                                 https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine                          355.518687                    0       [0kb]       countr_zero, countl_zero yes       Fabio Gobbato                                https://www.chessprogramming.org/QBBEngine
QBBEngine - Shifted Mask           343.973583                    0       [0kb]       countr_zero, countl_zero no        Fabio Gobbato                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=90#p924623
Classical Bob-Mike                 300.460456                    1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike                  378.937396                    520     [4kb]       countr_zero, countl_zero no        Michael Sherwin and Daniel Inf�hr            http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=50#p924653
Leorik                             333.512133                    128     [1kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Leorik Inline                      351.574173                    0       [0kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Obstruction Difference             309.071241                    768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      328.424302                    0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Genetic Obstruction Difference     389.031894                    384     [3kb]       countl_zero              yes       Daniel Inf�hr and Michael Hoffmann           http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Slide Arithmetic                   258.543795                    256     [2kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf�hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Slide Arithmetic Inline            149.183407                    0       [0kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf�hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Kindergarten                       599.784078                    16640   [130kb]     imul64                   no        Urban Koistinen                              https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards                    253.911292                    180416  [1409kb]    none                     no        Michael Sherwin                              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73083
Fancy Magic BB - Variable shift    522.231845                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
Plain Magic BB                     552.234479                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       633.927817                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     781.127949                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
HyperCube                          43.105213                     107680  [841kb]     none                     yes       Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p916723

There remains only 1 non implemented idea: Solving by linear algebra.
The input is an 8x8 binary matrix (the occupancy) which is an unknown variable (A).
Multiplied with a mask matrix M (which is known per square).
Multiplied with an unknown solve matrix S (which is unknown) - gives the result R which is the solution.
This can be tried out and solved in eigen or wolfram alpha. If it turns out that the occupancy is non reversible than its proven that there is no such solution - if its reversible then we get access to a solution working with linear algebra which was hardware and software optimised for the last 60 years.
Of course with 2 unknowns its not solvable - but we can enumerate all 2^14 * 64 occupancies and put them on a diagonal of a matrix. It would look similar to this:

The formula is simply AMS = R. Inverse (AM) and multiply it with R and you get the solution if it exists.
The dimensions of A would be 1048576 * 1048576. But since its very sparse that should be no problem at all. I will not put in the work there to solve it since I am sure that the resulting 64 8x8 matrices are interesting for Tensor/Matlab but the bottleneck in chessprogramming for sure is not the speed at which the queen moves can be calculated (at the moment). Maybe it interests you?

Final improvements
https://github.com/Gigantua/Chess_Moveg ... f6ff502eR1

New Algo - Bob Advanced
I saw the the initialisation is just the raymasks - so I made the constepr init code very small. Also looking up a shifted value is slower than calulating it. So that together with the reverse trick michael mentioned this i a new version that is 20% faster with half the memory footprint.

Genetic 8Ray
This is a completely new algorithm that was found last week by sifting the sytax tree of bitrotation. It works with 8 masks instead of 4 and in this commit I made smarter shift constants eliminating 4 instructions.

Genetic QBB
Quite pleasing to the eye its the same as QBB but with the elimination of one constant in the result calcuation. QBB is proven by the sifter to be optimized fully. I tried looking up intermediates which made it faster on msvc but slower on clang. So no lookups since clang is the better compiler.

Genetic Obstruction Difference
This one makes me proud the most. Its a quite pleasing faster mask lookup together with a removal of a mask (just xor two lines automatically eliminates the source bit) compared to Obstruction Difference. A faster line_attack method was found and implemented by c++ tree sifting. (Now the core algorithm is just 1 line of code like bitrotation). Also much smarter masking constants also made the table half as big and eliminated more unnecessary masking. So all together I found and implemented 4 improvements: (Core algo improvement, XOR, Masking, Smaller Table)
Keep in mind that the original Obstruction Difference was already optimized once when moving it into my repo.

Code: Select all

Obstruction Difference             309.071241                    768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      328.424302                    0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Genetic Obstruction Difference     389.031894                    384     [3kb]       countl_zero              yes       Daniel Inführ and Michael Hoffmann           http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701

Important Preview
The next post will be the final post since I think I found everything I wanted to find. It will either be a nice scientific paper on arxiv or a post with a shared link to a pdf file since I have the feeling that I know and understand all of the above algorithms by heart - and could improve many of those way beyond the original authors.

I can see a nice conceptual correletion of the inner workings when comparing algorithms which gives rise to distinct idea groups clustered in a picture which should be made public. These are the groups: (Shifting, 4 Mask, 8 Mask, Hashing)

It now is an assortment of the best versions of all the known algorithms.

dangi12012 · Post by **dangi12012** » Thu May 05, 2022 11:01 am

New Algorithm: Foldinghash
https://github.com/Gigantua/Chess_Moveg ... h.hpp#L284

This is folding the 64bit occupancy into 32 bits (similar to matt taylors folding trick) but done so that it also works with the 64 relevant occupancies that exist for 4 rays. So that only 4 32 bit multiplication are needed to solve a queen. No 64 bit multiplication is needed anymore.
This brings the memory consumption to a good 50kb which is more cache friendly than traditional magic with 12 bits.

Code: Select all

Fancy Magic BB - Variable shift    495.437884                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       304.377242                    6468    [50kb]      none                     no        Daniel Inführ                                tbd
Plain Magic BB                     572.335805                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       637.814762                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy

Performance does not look friendly but row major vs colum major inside the struct matters because the compiler can currently not emit AVX intrinsics without shuffling. So this can break 300Mnps soon. Also I have more the gpu side in mind since 50kb fits in the fastest cuda memory and 32bit imul is perfect for gpus (where magic BBs currently are the slowest lookups).

dangi12012 · Post by **dangi12012** » Wed May 18, 2022 12:28 am

Symbol Exports and C# Interoperability + C# Translation

Discussion in this thread - http://www.talkchess.com/forum3/viewtop ... 69#p925969
Brought some new ideas.

By a simple c++ macro:

Code: Select all

#define ExportAlgo(X) extern "C" __declspec(dllexport) uint64_t __cdecl X##_Queen(int sq, uint64_t occ) { return X::Queen(sq, occ); }

Each and every single C++ algo is now being able to be called from other programming languages - and I only needed to add one line of code

Code: Select all

Dump of file Movegen_Compare.exe
File Type: EXECUTABLE IMAGE
  Section contains the following exports for Movegen_Compare.exe
    ordinal hint RVA      name

          1    0 00010F20 ArithmNT_t_Queen = ArithmNT_t_Queen
          2    1 00010F10 Arithm_t_Queen = Arithm_t_Queen
          3    2 00010C20 BinaryNetwork_t_Queen = BinaryNetwork_t_Queen
          4    3 00010C10 Bitrotate_t_Queen = Bitrotate_t_Queen
          5    4 00010EA0 BobAdvanced_t_Queen = BobAdvanced_t_Queen
          6    5 00010E90 Bob_t_Queen = Bob_t_Queen
          7    6 00010D60 Dumb7_t_Queen = Dumb7_t_Queen
          8    7 00010CC0 Explode_t_Queen = Explode_t_Queen
          9    8 0000D600 Fancy_t_Queen = ?Queen@Fancy_t@@SA_KH_K@Z (public: static unsigned __int64 __cdecl Fancy_t::Queen(int,unsigned __int64))
         10    9 00010F40 FoldingHash_t_Queen = FoldingHash_t_Queen
         11    A 00010C00 Genetic8Ray_t_Queen = Genetic8Ray_t_Queen
         12    B 00010F00 GeneticObstructionDiffV2_t_Queen = GeneticObstructionDiffV2_t_Queen
         13    C 00010EF0 GeneticObstructionDiff_t_Queen = GeneticObstructionDiff_t_Queen
         14    D 00010E80 GeneticQBB_t_Queen = GeneticQBB_t_Queen
         15    E 0000D560 HVar_t_Queen = ?Queen@HVar_t@@SA_KH_K@Z (public: static unsigned __int64 __cdecl HVar_t::Queen(int,unsigned __int64))
         16    F 0000D380 Hyper_t_Queen = ?Queen@Hyper_t@@SA_KH_K@Z (public: static unsigned __int64 __cdecl Hyper_t::Queen(int,unsigned __int64))
         17   10 00010BF0 HyperbolaNT_t_Queen = HyperbolaNT_t_Queen
         18   11 00010BE0 Hyperbola_t_Queen = Hyperbola_t_Queen
         19   12 00010F30 Kindergarten_t_Queen = Kindergarten_t_Queen
         20   13 00010DA0 Kogge_t_Queen = Kogge_t_Queen
         21   14 00010EC0 LeorikNT_t_Queen = LeorikNT_t_Queen
         22   15 00010EB0 Leorik_t_Queen = Leorik_t_Queen
         23   16 00010D40 Loop_t_Queen = Loop_t_Queen
         24   17 00010EE0 ObstructNT_t_Queen = ObstructNT_t_Queen
         25   18 00010ED0 Obstruct_t_Queen = Obstruct_t_Queen
         26   19 00010D50 PextEmu_t_Queen = PextEmu_t_Queen
         27   1A 0000D3F0 Pext_t_Queen = ?Queen@Pext_t@@SA_KH_K@Z (public: static unsigned __int64 __cdecl Pext_t::Queen(int,unsigned __int64))
         28   1B 0000D500 Plain_t_Queen = ?Queen@Plain_t@@SA_KH_K@Z (public: static unsigned __int64 __cdecl Plain_t::Queen(int,unsigned __int64))
         29   1C 00010E70 QBB_t_Queen = QBB_t_Queen
         30   1D 00010DB0 Rotate_t_Queen = Rotate_t_Queen
         31   1E 00010BD0 SBAMGNT_t_Queen = SBAMGNT_t_Queen
         32   1F 00010BC0 SBAMG_t_Queen = SBAMG_t_Queen
         33   20 00010520 Sissy_t_Queen = ?Queen@Sissy_t@@SA_KH_K@Z (public: static unsigned __int64 __cdecl Sissy_t::Queen(int,unsigned __int64))
         34   21 00010CD0 Switch_t_Queen = Switch_t_Queen

On top of that I have started to Translate C++ into C#.
I am really looking forward to see C# vs DllImport vs Native C++

So far there is only one algo but even that gives us a hint of how C# will perform. Keep in mind for calling dllimport we cannot inline and have to cross a boundary with pinvoke which has some overhead!

This will be very cool - and if someone wants to translate/or has a slinding piece algorithm in C# please let me know!

Code: Select all

AMD Ryzen 9 5950X 16-Core Processor
Name    Performance [MQueens/s]
Switch               18,15
Switch (DllImport)   34,26

C:\Work\Chess_Movegen\Movegen_dotNET\bin\Release\net6.0\Movegen_dotNET.exe

Work in progress - but being able to compare C# and DllImported native C++ code for chess is really awesome.
As well as having opened the door to bind against rust, python and a myriad of other languages that can import a stdcall function.

dangi12012 · Post by **dangi12012** » Fri Jun 03, 2022 12:16 am

Substantial Chess C# findings!

Now for the past weeks I have been occupied with dynamic compilation. It is such a wonderful tool and in C# its easily possible to compile (not interpret) code at runtime from a string quite easily.
https://github.com/Gigantua/Chess_Moveg ... ilation.cs

A normal comparison between C++ and C# would be very boring since the results are already known and posted previously. So I didnt go and translated all the slider algorithms - but I made sure to incorporate the speciality of C#.

Now running the CSharp project gives this output:

Code: Select all

AMD Ryzen 9 5950X 16-Core Processor
Verify Engines...OK
Algorithm                                Million Queen/s

Implementation Comparison
Pext NoInline                            130,36
Pext Delegate                            108,79
Pext Inlined                             294,06
Pext Manual Inlining                     296,07
Pext Pinvoke C++                         226,69
Pext Dynamic Compilation                 53,11

CSharp Native Code
ObstructionDiff                          94,84
Pext                                     208,47
Switch                                   18,64

Imported Algorithmic Comparison (Essentially PInvoke)
Imported: Obstruct                       109,46
Imported: ObstructNT                     61,31

Awesome Findings:

Pext NoInline
Base algorithm of PEXT implemented in C#. There have been many posts of people not finding Bmi2.X64.ParallelBitExtract - so this is the base implementation in CSharp as one would write. It is the fastest C# slider lookup ever written (for fast bmi2) - and only 130 lines of code.

Pext Delegate
C# does not have macros - but how to call many functions without writing code N times? With reflection its easily possible to target the public Queen function of all classes in a namespace. A delegate in CSharp comes at a huge cost!

Pext Inline
C# JIT is not very aggressive when inlining so this is the performance when adding [MethodImpl(MethodImplOptions.AggressiveInlining)]
https://github.com/Gigantua/Chess_Moveg ... on/Pext.cs

Pext Manual Inlining
I have observed that pulling (constant) intermediates outside of loops is a performance gain in C# sometimes . Even when aggressive inlining is turned on there is performance to be gained by moving constants to the outermost constant scope possible.

Pext Pinvoke C++
Calling C++ code from C# is easy. But the performance loss is there. (In the background the marshaller always checks the stack an does some cleanup to guard the runtime against overwritten memory etc. https://github.com/Gigantua/Chess_Moveg ... Interop.cs
Calling C++ or native code is the best option for performance in C# by far - if you outsource more than a single function call. C++ will run many times faster than optimised C# in most situations.

Pext Dynamic Compilation
This code is compiled from a runtime string with roslyn and the resulting assembly is dynamically loaded.
The performance here should in theory be the same as with delegates - but there are some extra security checks when running dynamically loaded code - but it is the coolest feature by far.
A longterm plan of mine is to incorporate this into actual engine or movegen software. When a position does not contain a knight for example - or castling is impossible etc - one can compile and load (or prepare) an assembly that is specialized and does not contain any code for these situations. Same for dynamic learning - different parts of the code disabled at runtime as if compiled with #if would give huge performance gains.

Summary
CSharp is really not as fast as C++ since literally the same code runs at 770 Million Queens/s with MSVC/CLANG. As a language it is perfect and compiletimes are instant. Dynamic compilation in C++ is almost impossible to pull off and as a bonus one can query the AST generated from a string in C#!
Try to stay away from delegates - and with the absence of macros there is no way around code duplication / manual inlining for optimal performance sometimes. Not doing AggressiveInlining for busy loops costs 50% of performance for this type of project.

As usual you can check out the new changes here:
https://github.com/Gigantua/Chess_Movegen

One of the most fun projects ever for me.

lithander · Post by **lithander** » Fri Jun 03, 2022 11:28 am

Could you maybe include some of the non-Pext algorithms to put the speed you get with PEXT into context? Especially Leorik's would be easy to add because it is already implemented in C#.

https://github.com/lithander/Leorik/blo ... itboard.cs

...and personally I'd find Hyperbola Quintessence also interesting because that was surprisingly fast in C++.

dangi12012 · Post by **dangi12012** » Fri Jun 03, 2022 8:32 pm

lithander wrote: ↑Fri Jun 03, 2022 11:28 am Could you maybe include some of the non-Pext algorithms to put the speed you get with PEXT into context? Especially Leorik's would be easy to add because it is already implemented in C#.

https://github.com/lithander/Leorik/blo ... itboard.cs

...and personally I'd find Hyperbola Quintessence also interesting because that was surprisingly fast in C++.

Haha that ZERO TABLE ALTERNATIVE code looks very familiar.
Here are the results:

Code: Select all

AMD Ryzen 9 5950X 16-Core Processor
Verify Engines...OK
Algorithm                                Million Queen/s

Implementation Comparison
Pext NoInline                            129,43
Pext Delegate                            112,09
Pext Inlined                             296,21
Pext Manual Inlining                     303,52
Pext Dynamic Compilation                 70,29
Pext Pinvoke C++                         224,40

CSharp Native Code
Switch                                   20,17
ObstructionDiff                          156,67
Leorik                                   171,61
HyperbolaQsc                             228,62
Pext Inlined                             232,25

It took me a while to find that bswap is not where the other intrinsics stuff is but its inside: BinaryPrimitives.ReverseEndianness(b);
That could actually be the very first Hyperbola Qsc. algo in C# ever.

Take C# algos with a grain of salt. Pext Inlined does either 300 or 230mnps with the same copied code.
Your assumption was correct and HyperbolaQsc seems to be the best option for C# (because the lookup is a few KB and not 1MB and the speed is the same)

If you read this message - please share C# Results!
https://github.com/Gigantua/Chess_Movegen
Movegen_Compare.sln

lithander · Post by **lithander** » Sun Jun 05, 2022 1:42 am

I had to comment-out some code to get it to run but here are my results:

Code: Select all

AMD Ryzen 5 3600 6-Core Processor
Algorithm                                Million Queen/s

Implementation Comparison
Pext NoInline                            30,87
Pext Delegate                            32,44
Pext Inlined                             31,55
Pext Manual Inlining                     36,16
Pext Dynamic Compilation                 26,71

CSharp Native Code
Switch                                   16,07
ObstructionDiff                          82,92
Leorik                                   117,87
HyperbolaQsc                             42,47
Pext Inlined                             34,44

...it seems like you need a really cutting edge PC to reproduce your good results for PEXT and HyperbolaQSC. I'll stick with Leoriks implementation for the time being, I guess

dangi12012 · Post by **dangi12012** » Sun Jun 05, 2022 11:38 am

lithander wrote: ↑Sun Jun 05, 2022 1:42 am I had to comment-out some code to get it to run but here are my results:
Code: Select all
AMD Ryzen 5 3600 6-Core Processor
Algorithm                                Million Queen/s

Implementation Comparison
Pext NoInline                            30,87
Pext Delegate                            32,44
Pext Inlined                             31,55
Pext Manual Inlining                     36,16
Pext Dynamic Compilation                 26,71

CSharp Native Code
Switch                                   16,07
ObstructionDiff                          82,92
Leorik                                   117,87
HyperbolaQsc                             42,47
Pext Inlined                             34,44
...it seems like you need a really cutting edge PC to reproduce your good results for PEXT and HyperbolaQSC. I'll stick with Leoriks implementation for the time being, I guess

Oh wow that is interesting indeed. Pext is known but
_bswap64 is (1 1 0.25) according to agner fog on zen1-3.

Seems like C# is not using that intrinsic somehow. Anyways I will add fancy magic to see how it behaves with this language.

I love C#. Mostly for Linq and Property getter/setters that can be used like a field and of course the sub 1s compilation for big projects.

Thanks for testing!

Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms