Comparison of all known Sliding lookup algorithms
Moderator: Ras
-
- Posts: 3218
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Comparison of all known Sliding lookup algorithms
double post
Last edited by smatovic on Sun Jan 16, 2022 6:37 pm, edited 2 times in total.
-
- Posts: 3218
- Joined: Wed Mar 10, 2010 10:18 pm
- Location: Hamburg, Germany
- Full name: Srdja Matovic
Re: Comparison of all known Sliding lookup algorithms
That is interesting, I assume you run on 256b AVX2, I also had the Intel Xeon Phi with 2x512b vector-units per x86 core in mind during design, to handle 16 bitboards in one run, some current Intel Xeons seem to have 2x512b units, if this descends to the desktop market it might be worth to take a look again...tcusr wrote: ↑Thu Jan 06, 2022 5:18 pm kogge stone vectorized seems promising, see https://zeta-chess.app26.de/post/64-bit ... tor-based/
i implemented kogge stone with GCC's builtin vectors in my CPU engine and while they were much faster than normal kogge stone it only gave a 7% speed up in search. it's not worth it
--
Srdja
-
- Posts: 325
- Joined: Tue Aug 31, 2021 10:32 pm
- Full name: Mateo
Re: Comparison of all known Sliding lookup algorithms
yes, to be competitive -mavx2 is required.smatovic wrote: ↑Sun Jan 16, 2022 6:36 pmThat is interesting, I assume you run on 256b AVX2, I also had the Intel Xeon Phi with 2x512b vector-units per x86 core in mind during design, to handle 16 bitboards in one run, some current Intel Xeons seem to have 2x512b units, if this descends to the desktop market it might be worth to take a look again...tcusr wrote: ↑Thu Jan 06, 2022 5:18 pm kogge stone vectorized seems promising, see https://zeta-chess.app26.de/post/64-bit ... tor-based/
i implemented kogge stone with GCC's builtin vectors in my CPU engine and while they were much faster than normal kogge stone it only gave a 7% speed up in search. it's not worth it
--
Srdja
btw i have no fixed sized vector in my code, i templated the directions so that the functions will use a vector big enough to contain the number of directions, the only downside is that they have to have the same sign because of shifting, which i have not found a solution yet.
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Comparison of all known Sliding lookup algorithms
UPDATE: Zero Table Algorithms added
I saw that many of the 2kb algorithms all take the same ray as inputs.
http://www.talkchess.com/forum3/viewtop ... =7&t=79140
These 64x4 rays can be calculated at runtime to take the tablesize of some algorithms down to ZERO! I called this versions INLINE. They will be slower on the CPU - but for the CUDA thread its important to have them here for comparison:
New 0 byte lookup versions:
Obstruction-difference Inline
SlideArithmetic Inline
HyperbolaQuiesc. Inline
Source: https://github.com/Gigantua/Chess_Movegen
The 0kb versions are garuanteed to never ever have any issues with cache pressure - and I am looking forward to a new cuda comparison for these.
Multithread Results:
I still need Dumb7fill the code never produces correct results for me. (Upside down and if i fix this - its gets wrong results)
If you have a repo where Dumb7fill is imeplemnted let me know!
And as always - please run the sourcecode and post results!
I saw that many of the 2kb algorithms all take the same ray as inputs.
http://www.talkchess.com/forum3/viewtop ... =7&t=79140
These 64x4 rays can be calculated at runtime to take the tablesize of some algorithms down to ZERO! I called this versions INLINE. They will be slower on the CPU - but for the CUDA thread its important to have them here for comparison:
New 0 byte lookup versions:
Obstruction-difference Inline
SlideArithmetic Inline
HyperbolaQuiesc. Inline
Source: https://github.com/Gigantua/Chess_Movegen
The 0kb versions are garuanteed to never ever have any issues with cache pressure - and I am looking forward to a new cuda comparison for these.
Code: Select all
AMD Ryzen 9 5950X 16-Core Processor
NAME Million LU/s Tablesize Instructions
Exploading: 136.67MOps 6 kB Optimal perf: imul64
Reference: 127.53MOps 0 kB Optimal perf: none
KoggeStone: 177.12MOps 0 kB Optimal perf: none
RotatedBoard: 109.57MOps 14 kB Optimal perf: none
QBB Algo: 207.42MOps 0 kB Optimal perf: countr_zero, countl_zero
BobMike: 318.09MOps 8 kB Optimal perf: countr_zero, countl_zero
Obstr. Diff: 373.76MOps 6 kB Optimal perf: countl_zero
Obstr. Inline: 196.72MOps 0 kB Optimal perf: countl_zero
SlideArithm: 363.12MOps 2 kB Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline: 217.14MOps 0 kB Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc: 523.18MOps 2 kB Optimal perf: bswap
Hyperb.Inline: 264.36MOps 0 kB Optimal perf: bswap
SISSY BB: 228.16MOps 1409 kB Optimal perf: none
Hash Variable: 598.54MOps 729 kB Optimal perf: imul64
Hash Plain: 641.27MOps 2306 kB Optimal perf: imul64
Hash Fancy: 792.53MOps 694 kB Optimal perf: imul64
Pext : 911.42MOps 843 kB Optimal perf: pext_u64
HyperCube: 263.97MOps 841 kB Optimal perf: none
Multithread Results:
Code: Select all
Pext : 10265.80MOps 30Threads
Hash Fancy: 8878.84MOps 32Threads
Hash Variable: 7768.64MOps 31Threads
Hyperbola Qsc: 6700.89MOps 31Threads
Hash Plain: 6089.48MOps 30Threads
Obstr. Diff: 4973.52MOps 32Threads
SlideArithm: 4834.04MOps 32Threads
HyperCube: 4249.17MOps 31Threads
BobMike: 4193.42MOps 32Threads
Hyperb.Inline: 3296.19MOps 32Threads
SISSY BB: 3022.02MOps 31Threads
QBB Algo: 3001.00MOps 32Threads
SlideA Inline: 2850.78MOps 31Threads
Obstr. Inline: 2670.50MOps 32Threads
Exploading: 2300.39MOps 32Threads
KoggeStone: 2213.04MOps 32Threads
RotatedBoard: 2164.86MOps 31Threads
Reference: 1827.13MOps 31Threads
I still need Dumb7fill the code never produces correct results for me. (Upside down and if i fix this - its gets wrong results)
If you have a repo where Dumb7fill is imeplemnted let me know!
And as always - please run the sourcecode and post results!
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Comparison of all known Sliding lookup algorithms
Hello Daniel.
What would be really interesting is to know where the algorithms are described or released before translation, just with a link:
Obstr. Diff 373.76MOps 6 kB Optimal perf: countl_zero
There are some names i never heard of before.
What would be really interesting is to know where the algorithms are described or released before translation, just with a link:
Obstr. Diff 373.76MOps 6 kB Optimal perf: countl_zero
There are some names i never heard of before.
-
- Posts: 879
- Joined: Mon Dec 15, 2008 11:45 am
Re: Comparison of all known Sliding lookup algorithms
Flawless comile with default project settings under vs 2022 (ISO-Standard C++20 (/std:c++20)). Compile time some seconds.
And another run ...
Poor Pext 
Code: Select all
AMD Ryzen Threadripper 2950X 16-Core Processor
Megalooks Random Positions/s:
Exploading: 62.95MOps 6 kB Optimal perf: imul64
Reference: 36.60MOps 0 kB Optimal perf: none
KoggeStone: 76.12MOps 0 kB Optimal perf: none
RotatedBoard: 99.91MOps 14 kB Optimal perf: none
QBB Algo: 103.67MOps 0 kB Optimal perf: countr_zero, countl_zero
BobMike: 109.88MOps 8 kB Optimal perf: countr_zero, countl_zero
Obstr. Diff: 142.53MOps 6 kB Optimal perf: countl_zero
Obstr. Inline: 78.90MOps 0 kB Optimal perf: countl_zero
SlideArithm: 127.36MOps 2 kB Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline: 99.27MOps 0 kB Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc: 202.67MOps 2 kB Optimal perf: bswap
Hyperb.Inline: 100.22MOps 0 kB Optimal perf: bswap
SISSY BB: 173.10MOps 1409 kB Optimal perf: none
Hash Variable: 274.18MOps 729 kB Optimal perf: imul64
Hash Plain: 389.73MOps 2306 kB Optimal perf: imul64
Hash Fancy: 381.59MOps 694 kB Optimal perf: imul64
Pext : 39.51MOps 843 kB Optimal perf: pext_u64
HyperCube: 201.56MOps 841 kB Optimal perf: none
Code: Select all
Hash Plain: 3016.46MOps 14Threads
Hash Fancy: 2990.45MOps 28Threads
Hash Variable: 2398.27MOps 30Threads
HyperCube: 2094.42MOps 31Threads
Hyperbola Qsc: 1847.61MOps 29Threads
Obstr. Diff: 1587.66MOps 32Threads
SlideArithm: 1344.93MOps 30Threads
RotatedBoard: 1313.73MOps 30Threads
BobMike: 1228.22MOps 31Threads
Hyperb.Inline: 1094.03MOps 30Threads
QBB Algo: 1077.47MOps 15Threads
SlideA Inline: 1068.18MOps 31Threads
Exploading: 961.22MOps 32Threads
Obstr. Inline: 870.57MOps 31Threads
KoggeStone: 850.21MOps 31Threads
SISSY BB: 841.54MOps 30Threads
Pext : 635.15MOps 31Threads
Reference: 611.48MOps 31Threads
Code: Select all
AMD Ryzen Threadripper 2950X 16-Core Processor
Megalooks Random Positions/s:
Exploading: 63.08MOps 6 kB Optimal perf: imul64
Reference: 37.41MOps 0 kB Optimal perf: none
KoggeStone: 77.15MOps 0 kB Optimal perf: none
RotatedBoard: 99.90MOps 14 kB Optimal perf: none
QBB Algo: 102.78MOps 0 kB Optimal perf: countr_zero, countl_zero
BobMike: 110.04MOps 8 kB Optimal perf: countr_zero, countl_zero
Obstr. Diff: 143.53MOps 6 kB Optimal perf: countl_zero
Obstr. Inline: 78.11MOps 0 kB Optimal perf: countl_zero
SlideArithm: 126.56MOps 2 kB Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline: 100.76MOps 0 kB Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc: 205.29MOps 2 kB Optimal perf: bswap
Hyperb.Inline: 100.29MOps 0 kB Optimal perf: bswap
SISSY BB: 177.88MOps 1409 kB Optimal perf: none
Hash Variable: 258.25MOps 729 kB Optimal perf: imul64
Hash Plain: 398.92MOps 2306 kB Optimal perf: imul64
Hash Fancy: 377.25MOps 694 kB Optimal perf: imul64
Pext : 39.66MOps 843 kB Optimal perf: pext_u64
HyperCube: 200.75MOps 841 kB Optimal perf: none
Code: Select all
Summary:
Hash Plain: 3218.11MOps 15Threads
Hash Fancy: 2929.85MOps 27Threads
Hash Variable: 2471.23MOps 32Threads
HyperCube: 1994.40MOps 29Threads
Hyperbola Qsc: 1922.29MOps 32Threads
Obstr. Diff: 1653.29MOps 32Threads
SlideArithm: 1386.83MOps 31Threads
RotatedBoard: 1356.66MOps 32Threads
BobMike: 1242.58MOps 31Threads
QBB Algo: 1124.42MOps 16Threads
Hyperb.Inline: 1100.71MOps 32Threads
SlideA Inline: 1070.43MOps 31Threads
Exploading: 944.86MOps 31Threads
Obstr. Inline: 874.12MOps 30Threads
SISSY BB: 873.95MOps 15Threads
KoggeStone: 855.34MOps 31Threads
Pext : 638.44MOps 31Threads
Reference: 608.46MOps 30Threads

-
- Posts: 914
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Comparison of all known Sliding lookup algorithms
The friendly reception of my code snippet in your other thread made me curious how my sliding piece algorithm would do in comparison with what you have collected here. My additions are called Leorik and Leorik Lookup in the list below:
I was happy to find out Leorik tops the list of the zero-table algo's! If you want to include one or both versions on your github let me know!
Code: Select all
Leorik: 164.23MOps 0 kB Optimal perf: countl_zero
Leorik Lookup: 185.21MOps 1 kB Optimal perf: countl_zero
Exploading: 110.41MOps 6 kB Optimal perf: imul64
Reference: 70.35MOps 0 kB Optimal perf: none
KoggeStone: 93.66MOps 0 kB Optimal perf: none
RotatedBoard: 137.44MOps 14 kB Optimal perf: none
QBB Algo: 151.66MOps 0 kB Optimal perf: countr_zero, countl_zero
BobMike: 169.74MOps 8 kB Optimal perf: countr_zero, countl_zero
Obstr. Diff: 188.54MOps 6 kB Optimal perf: countl_zero
Obstr. Inline: 120.78MOps 0 kB Optimal perf: countl_zero
SlideArithm: 205.38MOps 2 kB Optimal perf: bzhi_u64, blsmsk_u64
SlideA Inline: 141.47MOps 0 kB Optimal perf: bzhi_u64, blsmsk_u64
Hyperbola Qsc: 238.90MOps 2 kB Optimal perf: bswap
Hyperb.Inline: 125.25MOps 0 kB Optimal perf: bswap
SISSY BB: 228.70MOps 1409 kB Optimal perf: none
Hash Variable: 325.30MOps 729 kB Optimal perf: imul64
Hash Plain: 452.61MOps 2306 kB Optimal perf: imul64
Hash Fancy: 530.11MOps 694 kB Optimal perf: imul64
Pext : 42.32MOps 843 kB Optimal perf: pext_u64
HyperCube: 223.86MOps 841 kB Optimal perf: none
Code: Select all
Summary:
Hash Fancy: 2846.70MOps 12Threads
Hash Plain: 2293.42MOps 12Threads
Hash Variable: 1843.59MOps 12Threads
HyperCube: 1553.38MOps 12Threads
Hyperbola Qsc: 1379.56MOps 12Threads
SlideArithm: 1206.84MOps 12Threads
Obstr. Diff: 1142.92MOps 11Threads
BobMike: 1078.42MOps 12Threads
Leorik Lookup: 1074.17MOps 12Threads
Leorik: 967.64MOps 12Threads
QBB Algo: 887.62MOps 12Threads
SlideA Inline: 854.31MOps 12Threads
SISSY BB: 784.96MOps 20Threads
Obstr. Inline: 759.87MOps 18Threads
Hyperb.Inline: 754.54MOps 12Threads
RotatedBoard: 676.31MOps 5Threads
Exploading: 632.57MOps 18Threads
KoggeStone: 550.17MOps 12Threads
Pext : 366.33MOps 12Threads
Reference: 350.86MOps 5Threads
-
- Posts: 1062
- Joined: Tue Apr 28, 2020 10:03 pm
- Full name: Daniel Infuehr
Re: Comparison of all known Sliding lookup algorithms
Update:
lithander was nice and sent me some sourcecode. We have to discuss if these algorithms are the same:
https://github.com/Gigantua/Chess_Moveg ... 9c999b84e1
https://github.com/Gigantua/Chess_Moveg ... Arithm.hpp
I am still missing these:
Dumb7Fill (someone please provide me a working solution - I never found a implementation that works)
Kindergarten Boards (https://www.chessprogramming.org/Kindergarten_Bitboards)
SBAMG (https://www.chessprogramming.org/SBAMG)
Bitfoot is in the wiki but is identical to Obstruction Difference: https://www.chessprogramming.org/Bitfoot#ABBitboards - so IMO not really its own algo
I will also write an AVX2 version of either Obstruction Difference, Slider Arithm, or SBAMG. Modern compilers do that by themselves normally so I dont expect a difference.
Todos:
Add above algos,
Print reference URL
Print algo Author
Table formatting and algo naming.
lithander was nice and sent me some sourcecode. We have to discuss if these algorithms are the same:
https://github.com/Gigantua/Chess_Moveg ... 9c999b84e1
https://github.com/Gigantua/Chess_Moveg ... Arithm.hpp
I am still missing these:
Dumb7Fill (someone please provide me a working solution - I never found a implementation that works)
Kindergarten Boards (https://www.chessprogramming.org/Kindergarten_Bitboards)
SBAMG (https://www.chessprogramming.org/SBAMG)
Bitfoot is in the wiki but is identical to Obstruction Difference: https://www.chessprogramming.org/Bitfoot#ABBitboards - so IMO not really its own algo
I will also write an AVX2 version of either Obstruction Difference, Slider Arithm, or SBAMG. Modern compilers do that by themselves normally so I dont expect a difference.
Todos:
Add above algos,
Print reference URL
Print algo Author
Table formatting and algo naming.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Daniel Inführ - Software Developer
-
- Posts: 28335
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: Comparison of all known Sliding lookup algorithms
Still no incrementally updated slider tables orth[64] and diag[64]?
-
- Posts: 914
- Joined: Sun Dec 27, 2020 2:40 am
- Location: Bremen, Germany
- Full name: Thomas Jahn
Re: Comparison of all known Sliding lookup algorithms
I don't see how that's going to work in the testing framework Daniel has set up. If an algorithm can't be used in such a way where it accepts a square and an occupation-bitboard as parameters and returns the correct queen targets as a bitboard he's basically out.