Comparison of all known Sliding lookup algorithms

smatovic · Post by **smatovic** » Sun Jan 16, 2022 6:36 pm

double post

smatovic · Post by **smatovic** » Sun Jan 16, 2022 6:36 pm

tcusr wrote: ↑Thu Jan 06, 2022 5:18 pm kogge stone vectorized seems promising, see https://zeta-chess.app26.de/post/64-bit ... tor-based/
i implemented kogge stone with GCC's builtin vectors in my CPU engine and while they were much faster than normal kogge stone it only gave a 7% speed up in search. it's not worth it

That is interesting, I assume you run on 256b AVX2, I also had the Intel Xeon Phi with 2x512b vector-units per x86 core in mind during design, to handle 16 bitboards in one run, some current Intel Xeons seem to have 2x512b units, if this descends to the desktop market it might be worth to take a look again...

--
Srdja

tcusr · Post by **tcusr** » Sun Jan 16, 2022 7:38 pm

smatovic wrote: ↑Sun Jan 16, 2022 6:36 pm
tcusr wrote: ↑Thu Jan 06, 2022 5:18 pm kogge stone vectorized seems promising, see https://zeta-chess.app26.de/post/64-bit ... tor-based/
i implemented kogge stone with GCC's builtin vectors in my CPU engine and while they were much faster than normal kogge stone it only gave a 7% speed up in search. it's not worth it
That is interesting, I assume you run on 256b AVX2, I also had the Intel Xeon Phi with 2x512b vector-units per x86 core in mind during design, to handle 16 bitboards in one run, some current Intel Xeons seem to have 2x512b units, if this descends to the desktop market it might be worth to take a look again...

--
Srdja

yes, to be competitive -mavx2 is required.
btw i have no fixed sized vector in my code, i templated the directions so that the functions will use a vector big enough to contain the number of directions, the only downside is that they have to have the same sign because of shifting, which i have not found a solution yet.

dangi12012 · Post by **dangi12012** » Sun Jan 16, 2022 8:15 pm

UPDATE: Zero Table Algorithms added

I saw that many of the 2kb algorithms all take the same ray as inputs.
http://www.talkchess.com/forum3/viewtop ... =7&t=79140

These 64x4 rays can be calculated at runtime to take the tablesize of some algorithms down to ZERO! I called this versions INLINE. They will be slower on the CPU - but for the CUDA thread its important to have them here for comparison:

New 0 byte lookup versions:
Obstruction-difference Inline
SlideArithmetic Inline
HyperbolaQuiesc. Inline
Source: https://github.com/Gigantua/Chess_Movegen

The 0kb versions are garuanteed to never ever have any issues with cache pressure - and I am looking forward to a new cuda comparison for these.

Code: Select all

AMD Ryzen 9 5950X 16-Core Processor
NAME            Million LU/s    Tablesize Instructions
Exploading:     136.67MOps      6 kB    Optimal perf: imul64
Reference:      127.53MOps      0 kB    Optimal perf: none
KoggeStone:     177.12MOps      0 kB    Optimal perf: none
RotatedBoard:   109.57MOps      14 kB   Optimal perf: none
QBB Algo:       207.42MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        318.09MOps      8 kB    Optimal perf: countr_zero, countl_zero
Obstr. Diff:    373.76MOps      6 kB    Optimal perf: countl_zero
Obstr. Inline:  196.72MOps      0 kB    Optimal perf: countl_zero
SlideArithm:    363.12MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline:  217.14MOps      0 kB    Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc:  523.18MOps      2 kB    Optimal perf: bswap
Hyperb.Inline:  264.36MOps      0 kB    Optimal perf: bswap
SISSY BB:       228.16MOps      1409 kB Optimal perf: none
Hash Variable:  598.54MOps      729 kB  Optimal perf: imul64
Hash Plain:     641.27MOps      2306 kB Optimal perf: imul64
Hash Fancy:     792.53MOps      694 kB  Optimal perf: imul64
Pext  :         911.42MOps      843 kB  Optimal perf: pext_u64
HyperCube:      263.97MOps      841 kB  Optimal perf: none

Multithread Results:

Code: Select all

Pext  :         10265.80MOps    30Threads
Hash Fancy:     8878.84MOps     32Threads
Hash Variable:  7768.64MOps     31Threads
Hyperbola Qsc:  6700.89MOps     31Threads
Hash Plain:     6089.48MOps     30Threads
Obstr. Diff:    4973.52MOps     32Threads
SlideArithm:    4834.04MOps     32Threads
HyperCube:      4249.17MOps     31Threads
BobMike:        4193.42MOps     32Threads
Hyperb.Inline:  3296.19MOps     32Threads
SISSY BB:       3022.02MOps     31Threads
QBB Algo:       3001.00MOps     32Threads
SlideA Inline:  2850.78MOps     31Threads
Obstr. Inline:  2670.50MOps     32Threads
Exploading:     2300.39MOps     32Threads
KoggeStone:     2213.04MOps     32Threads
RotatedBoard:   2164.86MOps     31Threads
Reference:      1827.13MOps     31Threads

I still need Dumb7fill the code never produces correct results for me. (Upside down and if i fix this - its gets wrong results)
If you have a repo where Dumb7fill is imeplemnted let me know!
And as always - please run the sourcecode and post results!

Desperado · Post by **Desperado** » Sun Jan 16, 2022 9:00 pm

Hello Daniel.

What would be really interesting is to know where the algorithms are described or released before translation, just with a link:

Obstr. Diff 373.76MOps 6 kB Optimal perf: countl_zero

There are some names i never heard of before.

Desperado · Post by **Desperado** » Sun Jan 16, 2022 10:47 pm

Flawless comile with default project settings under vs 2022 (ISO-Standard C++20 (/std:c++20)). Compile time some seconds.

Code: Select all

AMD Ryzen Threadripper 2950X 16-Core Processor
Megalooks Random Positions/s:
Exploading:     62.95MOps       6 kB    Optimal perf: imul64
Reference:      36.60MOps       0 kB    Optimal perf: none
KoggeStone:     76.12MOps       0 kB    Optimal perf: none
RotatedBoard:   99.91MOps       14 kB   Optimal perf: none
QBB Algo:       103.67MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        109.88MOps      8 kB    Optimal perf: countr_zero, countl_zero
Obstr. Diff:    142.53MOps      6 kB    Optimal perf: countl_zero
Obstr. Inline:  78.90MOps       0 kB    Optimal perf: countl_zero
SlideArithm:    127.36MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline:  99.27MOps       0 kB    Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc:  202.67MOps      2 kB    Optimal perf: bswap
Hyperb.Inline:  100.22MOps      0 kB    Optimal perf: bswap
SISSY BB:       173.10MOps      1409 kB Optimal perf: none
Hash Variable:  274.18MOps      729 kB  Optimal perf: imul64
Hash Plain:     389.73MOps      2306 kB Optimal perf: imul64
Hash Fancy:     381.59MOps      694 kB  Optimal perf: imul64
Pext  :         39.51MOps       843 kB  Optimal perf: pext_u64
HyperCube:      201.56MOps      841 kB  Optimal perf: none

Code: Select all

Hash Plain:     3016.46MOps     14Threads
Hash Fancy:     2990.45MOps     28Threads
Hash Variable:  2398.27MOps     30Threads
HyperCube:      2094.42MOps     31Threads
Hyperbola Qsc:  1847.61MOps     29Threads
Obstr. Diff:    1587.66MOps     32Threads
SlideArithm:    1344.93MOps     30Threads
RotatedBoard:   1313.73MOps     30Threads
BobMike:        1228.22MOps     31Threads
Hyperb.Inline:  1094.03MOps     30Threads
QBB Algo:       1077.47MOps     15Threads
SlideA Inline:  1068.18MOps     31Threads
Exploading:     961.22MOps      32Threads
Obstr. Inline:  870.57MOps      31Threads
KoggeStone:     850.21MOps      31Threads
SISSY BB:       841.54MOps      30Threads
Pext  :         635.15MOps      31Threads
Reference:      611.48MOps      31Threads

And another run ...

Code: Select all

AMD Ryzen Threadripper 2950X 16-Core Processor
Megalooks Random Positions/s:
Exploading:     63.08MOps       6 kB    Optimal perf: imul64
Reference:      37.41MOps       0 kB    Optimal perf: none
KoggeStone:     77.15MOps       0 kB    Optimal perf: none
RotatedBoard:   99.90MOps       14 kB   Optimal perf: none
QBB Algo:       102.78MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        110.04MOps      8 kB    Optimal perf: countr_zero, countl_zero
Obstr. Diff:    143.53MOps      6 kB    Optimal perf: countl_zero
Obstr. Inline:  78.11MOps       0 kB    Optimal perf: countl_zero
SlideArithm:    126.56MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline:  100.76MOps      0 kB    Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc:  205.29MOps      2 kB    Optimal perf: bswap
Hyperb.Inline:  100.29MOps      0 kB    Optimal perf: bswap
SISSY BB:       177.88MOps      1409 kB Optimal perf: none
Hash Variable:  258.25MOps      729 kB  Optimal perf: imul64
Hash Plain:     398.92MOps      2306 kB Optimal perf: imul64
Hash Fancy:     377.25MOps      694 kB  Optimal perf: imul64
Pext  :         39.66MOps       843 kB  Optimal perf: pext_u64
HyperCube:      200.75MOps      841 kB  Optimal perf: none

Code: Select all

Summary:
Hash Plain:     3218.11MOps     15Threads
Hash Fancy:     2929.85MOps     27Threads
Hash Variable:  2471.23MOps     32Threads
HyperCube:      1994.40MOps     29Threads
Hyperbola Qsc:  1922.29MOps     32Threads
Obstr. Diff:    1653.29MOps     32Threads
SlideArithm:    1386.83MOps     31Threads
RotatedBoard:   1356.66MOps     32Threads
BobMike:        1242.58MOps     31Threads
QBB Algo:       1124.42MOps     16Threads
Hyperb.Inline:  1100.71MOps     32Threads
SlideA Inline:  1070.43MOps     31Threads
Exploading:     944.86MOps      31Threads
Obstr. Inline:  874.12MOps      30Threads
SISSY BB:       873.95MOps      15Threads
KoggeStone:     855.34MOps      31Threads
Pext  :         638.44MOps      31Threads
Reference:      608.46MOps      30Threads

Poor Pext

lithander · Post by **lithander** » Tue Jan 18, 2022 12:44 am

The friendly reception of my code snippet in your other thread made me curious how my sliding piece algorithm would do in comparison with what you have collected here. My additions are called Leorik and Leorik Lookup in the list below:

Code: Select all

Leorik:         164.23MOps      0 kB    Optimal perf: countl_zero
Leorik Lookup:  185.21MOps      1 kB    Optimal perf: countl_zero
Exploading:     110.41MOps      6 kB    Optimal perf: imul64
Reference:      70.35MOps       0 kB    Optimal perf: none
KoggeStone:     93.66MOps       0 kB    Optimal perf: none
RotatedBoard:   137.44MOps      14 kB   Optimal perf: none
QBB Algo:       151.66MOps      0 kB    Optimal perf: countr_zero, countl_zero
BobMike:        169.74MOps      8 kB    Optimal perf: countr_zero, countl_zero
Obstr. Diff:    188.54MOps      6 kB    Optimal perf: countl_zero
Obstr. Inline:  120.78MOps      0 kB    Optimal perf: countl_zero
SlideArithm:    205.38MOps      2 kB    Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline:  141.47MOps      0 kB    Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc:  238.90MOps      2 kB    Optimal perf: bswap
Hyperb.Inline:  125.25MOps      0 kB    Optimal perf: bswap
SISSY BB:       228.70MOps      1409 kB Optimal perf: none
Hash Variable:  325.30MOps      729 kB  Optimal perf: imul64
Hash Plain:     452.61MOps      2306 kB Optimal perf: imul64
Hash Fancy:     530.11MOps      694 kB  Optimal perf: imul64
Pext  :         42.32MOps       843 kB  Optimal perf: pext_u64
HyperCube:      223.86MOps      841 kB  Optimal perf: none

Code: Select all

Summary:
Hash Fancy:     2846.70MOps     12Threads
Hash Plain:     2293.42MOps     12Threads
Hash Variable:  1843.59MOps     12Threads
HyperCube:      1553.38MOps     12Threads
Hyperbola Qsc:  1379.56MOps     12Threads
SlideArithm:    1206.84MOps     12Threads
Obstr. Diff:    1142.92MOps     11Threads
BobMike:        1078.42MOps     12Threads
Leorik Lookup:  1074.17MOps     12Threads
Leorik:         967.64MOps      12Threads
QBB Algo:       887.62MOps      12Threads
SlideA Inline:  854.31MOps      12Threads
SISSY BB:       784.96MOps      20Threads
Obstr. Inline:  759.87MOps      18Threads
Hyperb.Inline:  754.54MOps      12Threads
RotatedBoard:   676.31MOps      5Threads
Exploading:     632.57MOps      18Threads
KoggeStone:     550.17MOps      12Threads
Pext  :         366.33MOps      12Threads
Reference:      350.86MOps      5Threads

I was happy to find out Leorik tops the list of the zero-table algo's! If you want to include one or both versions on your github let me know!

dangi12012 · Post by **dangi12012** » Tue Jan 18, 2022 2:11 pm

Update:
lithander was nice and sent me some sourcecode. We have to discuss if these algorithms are the same:
https://github.com/Gigantua/Chess_Moveg ... 9c999b84e1
https://github.com/Gigantua/Chess_Moveg ... Arithm.hpp

I am still missing these:
Dumb7Fill (someone please provide me a working solution - I never found a implementation that works)
Kindergarten Boards (https://www.chessprogramming.org/Kindergarten_Bitboards)
SBAMG (https://www.chessprogramming.org/SBAMG)
Bitfoot is in the wiki but is identical to Obstruction Difference: https://www.chessprogramming.org/Bitfoot#ABBitboards - so IMO not really its own algo

I will also write an AVX2 version of either Obstruction Difference, Slider Arithm, or SBAMG. Modern compilers do that by themselves normally so I dont expect a difference.

Todos:
Add above algos,
Print reference URL
Print algo Author
Table formatting and algo naming.

hgm · Post by **hgm** » Tue Jan 18, 2022 3:39 pm

Still no incrementally updated slider tables orth[64] and diag[64]?

lithander · Post by **lithander** » Tue Jan 18, 2022 4:03 pm

hgm wrote: ↑Tue Jan 18, 2022 3:39 pm Still no incrementally updated slider tables orth[64] and diag[64]?

I don't see how that's going to work in the testing framework Daniel has set up. If an algorithm can't be used in such a way where it accepts a square and an occupation-bitboard as parameters and returns the correct queen targets as a bitboard he's basically out.

Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms

Re: Comparison of all known Sliding lookup algorithms