Comparison of all known Sliding lookup algorithms

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

Mike Sherwin wrote: Sun Jan 22, 2023 11:13 pm
381.2 -> 383.4 :D

Code: Select all

	static uint64_t Bishop(int sq, uint64_t occ)
	{
		return 
		  ss.dSubset[sq][(((occ & dMask[sq]) * 0x0202020202020202ull) >> 58)] + 
		  ss.aSubset[sq][(((occ & aMask[sq]) * 0x0202020202020202ull) >> 58)];
	}

	static uint64_t Rook(int sq, uint64_t occ)
	{
		uint64_t hIndex = (occ >> (sq & 0b11111000u) + 1) & 0x000000000000003f;
		uint64_t vIndex = ((((occ >> (sq & 7)) & 0x0101010101010101ull) * 0x0080402010080400ull) >> 58);
		return ss.vSubset[sq][vIndex] | ss.hSubset[sq][hIndex];
	}
I didnt use ss but your other improvements + I made sure to make the arrays static (which i forgot to do in the initial commit)

For me, this commit went from 380 to 490Mnps on my 3950x. :D
https://github.com/Gigantua/Chess_Moveg ... bb2fc1a658
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

Yielding this outcome: :D

MSVC 2022

Code: Select all

Kindergarten Super SISSY Bitboards 491.898030                    16640   [130kb]     imul64                   no        Michael Sherwin                              https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81234&start=30
Fancy Magic BB - Variable shift    378.511164                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       204.041865                    6468    [50kb]      none                     no        Daniel Inführ                                tbd
Plain Magic BB                     488.175971                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       508.980537                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     722.584250                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
CLANG-14

Code: Select all

Kindergarten Super SISSY Bitboards 403.940311                    16640   [130kb]     imul64                   no        Michael Sherwin                              https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81234&start=30
Fancy Magic BB - Variable shift    507.717324                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       310.005843                    6468    [50kb]      none                     no        Daniel Inf�hr                                tbd
Plain Magic BB                     457.223196                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       542.716555                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     771.131668                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Comparison of all known Sliding lookup algorithms

Post by j.t. »

Just one more CPU:

Code: Select all

clang++-15 -O3 -mllvm -inline-threshold=16000 -fdeclspec -march=native -funroll-loops -std=c++20 -pthread Main.cpp -o movegen_compare
Verify Engines...OK!

13th Gen Intel(R) Core(TM) i9-13900K

Million Lookups/s Random Squares, Random Occupation/s:
Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
SBAMG o^(o-3cbn)                   349.140491                    576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline                       357.366818                    0       [0kb]       countl_zero, bswap       yes       Syed Fahad and Daniel Inf�hr                 http://www.talkchess.com/forum3/viewtopic.php?t=59845
GaloisField - AVX512               4.107445                      0       [0kb]       AVX512F_GFNI             no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=81335
Hyperbola Quintessence o^(o-2r)    886.728555                    256     [2kb]       bswap                    no        Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline      508.352520                    0       [0kb]       bswap                    yes       Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Genetic 8 Ray                      162.945000                    0       [0kb]       bswap                    no        Daniel Inf�hr (dangi12012)                   Abstract C++ Syntax Tree Sifter (c) Daniel Infuehr
Bitrotation                        122.911679                    0       [0kb]       ReverseBits              no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=20
Binary Neural Network              34.978753                     5852    [45kb]      pdep_u64, AVX2           no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79332
Exploding Bitboards                83.977980                     768     [6kb]       imul64                   no        Harald L��en	                              http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=4523&start=80
Reference (Switch Lookup)          89.008408                     0       [0kb]       none                     yes       Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78235&p=907362&hilit=espresso#p907362
AVX Branchless Shift               286.360293                    0       [0kb]       AVX2                     no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=60
Pext Emulated                      87.052196                     107904  [843kb]     none                     no        Zach Wegner                                  https://randombit.net/bitbashing/posts/haswell_bit_permutations.html
Dumb7 Fill                         283.861664                    0       [0kb]       none                     no        Gunnar Andersson                             https://www.chessprogramming.org/Dumb7Fill        
Kogge-Stone                        400.634496                    0       [0kb]       none                     no        Peter M. Kogge, Harold S. Stone              https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards                  45.252941                     1848    [14kb]      none                     no        Robert Hyatt                                 https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine                          250.801137                    0       [0kb]       countr_zero, countl_zero yes       Fabio Gobbato                                https://www.chessprogramming.org/QBBEngine        
QBBEngine - Shifted Mask           225.445359                    0       [0kb]       countr_zero, countl_zero no        Fabio Gobbato                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=90#p924623
Classical Bob-Mike                 425.415134                    1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike                  479.101573                    520     [4kb]       countr_zero, countl_zero no        Michael Sherwin and Daniel Inf�hr            http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=50#p924653
Leorik                             336.827002                    128     [1kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine   
Leorik Inline                      318.874321                    0       [0kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine   
Obstruction Difference             233.061599                    768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      295.365600                    0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Genetic Obstruction Difference     319.426480                    384     [3kb]       countl_zero              no        Daniel Inf�hr and Michael Hoffmann           http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Genetic Obstruction Difference V2  298.445698                    768     [6kb]       countl_zero              no        Daniel Inf�hr                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Slide Arithmetic                   238.683012                    256     [2kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf�hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Slide Arithmetic Inline            144.360430                    0       [0kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf�hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Kindergarten                       773.685821                    16640   [130kb]     imul64                   no        Urban Koistinen                              https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards                    485.868416                    180416  [1409kb]    none                     no        Michael Sherwin                              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73083
Kindergarten Super SISSY Bitboards 703.067132                    16640   [130kb]     imul64                   no        Michael Sherwin                              https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81234&start=30
Fancy Magic BB - Variable shift    999.736503                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       469.760951                    6468    [50kb]      none                     no        Daniel Inf�hr                                tbd                                               
Plain Magic BB                     753.563451                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       1152.461228                   88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     1451.599675                   107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
HyperCube                          84.171368                     107680  [841kb]     none                     yes       Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p916723
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Comparison of all known Sliding lookup algorithms

Post by Mike Sherwin »

dangi12012 wrote: Mon Jan 23, 2023 11:27 am
Mike Sherwin wrote: Sun Jan 22, 2023 11:13 pm
381.2 -> 383.4 :D

Code: Select all

	static uint64_t Bishop(int sq, uint64_t occ)
	{
		return 
		  ss.dSubset[sq][(((occ & dMask[sq]) * 0x0202020202020202ull) >> 58)] + 
		  ss.aSubset[sq][(((occ & aMask[sq]) * 0x0202020202020202ull) >> 58)];
	}

	static uint64_t Rook(int sq, uint64_t occ)
	{
		uint64_t hIndex = (occ >> (sq & 0b11111000u) + 1) & 0x000000000000003f;
		uint64_t vIndex = ((((occ >> (sq & 7)) & 0x0101010101010101ull) * 0x0080402010080400ull) >> 58);
		return ss.vSubset[sq][vIndex] | ss.hSubset[sq][hIndex];
	}
I didnt use ss but your other improvements + I made sure to make the arrays static (which i forgot to do in the initial commit)

For me, this commit went from 380 to 490Mnps on my 3950x. :D
https://github.com/Gigantua/Chess_Moveg ... bb2fc1a658
Wow - wow wow wow! :shock: :D

At least on AMD using MSVC 2022 KSSISSY is essentially even with Magic. I wonder why clang does not like my code?
Then according to j.t. on INTEL using clang the results are way worse for KSSISSY, oh well. I wonder if MSVC on INTEL would be alot better. Is optimization finished? Or can more be gained with fancy C++ like constexpr and templates which is beyond my paygrade? Either way I'm happy with the results and like you indicated before with less pressure on the cpu cache KSSISSY should be even more competitive in a real world chess engine! :)
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

j.t. wrote: Mon Jan 23, 2023 4:33 pm Just one more CPU:

Code: Select all

clang++-15 -O3 -mllvm -inline-threshold=16000 -fdeclspec -march=native -funroll-loops -std=c++20 -pthread Main.cpp -o movegen_compare
Verify Engines...OK!

13th Gen Intel(R) Core(TM) i9-13900K

Million Lookups/s Random Squares, Random Occupation/s:
GaloisField - AVX512               4.107445                      0       [0kb]       AVX512F_GFNI             no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=81335
So since you have AVX512 you maybe wanna take a closer look at this one - native code is in the comments I couldnt verify it. But your 13900k has avx512 + gfni extensions to be able to run it.
Mike Sherwin wrote: Mon Jan 23, 2023 7:13 pm Wow - wow wow wow! :shock: :D

At least on AMD using MSVC 2022 KSSISSY is essentially even with Magic. I wonder why clang does not like my code?
Then according to j.t. on INTEL using clang the results are way worse for KSSISSY, oh well. I wonder if MSVC on INTEL would be alot better. Is optimization finished? Or can more be gained with fancy C++ like constexpr and templates which is beyond my paygrade? Either way I'm happy with the results and like you indicated before with less pressure on the cpu cache KSSISSY should be even more competitive in a real world chess engine! :)
Yeah I wouldnt worry to much about it. Sometimes MSVC is better and sometimes its Clang. Especially since both update regularly and msvc is like from last week and clang-14 is a tad older. I am currently building clang-16 (which is the master branch) to see if the c++ x86 backend improved.
I feel like there is more to be gained other than constexpr (constexpr does only improve performance where sq is a compiletime constant or known from the context)
Lookig at it - bishop seems solved.
For rook with a lot of effort there might be some alternate solution but I think its good for now. And like you said its even faster than magic in some cicumstances. Together with the smaller tablesize which will get more L1/2 cache hits it is definitely a big achievement.

There is one more idea to be tested. Taking the (maybe fastest) horizontal rook lookup from there -
https://github.com/Gigantua/Chess_Moveg ... a.hpp#L105
And substitute it for the hSubset? (advantage would be 64x64x8 = 32k vs 0.5k lookup)
https://github.com/Gigantua/Chess_Moveg ... d.hpp#L150
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
User avatar
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Comparison of all known Sliding lookup algorithms

Post by j.t. »

dangi12012 wrote: Mon Jan 23, 2023 8:15 pm So since you have AVX512 you maybe wanna take a closer look at this one - native code is in the comments I couldnt verify it. But your 13900k has avx512 + gfni extensions to be able to run it.
As far as I know, the efficiency cores don't have AVX-512, so I would have to disable them to get AVX-512 support (if that works at all).

@Mike Sherwin, I used clang-15 instead of clang-14, maybe that could explain some of the difference.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Comparison of all known Sliding lookup algorithms

Post by Mike Sherwin »

j.t. wrote: Mon Jan 23, 2023 8:29 pm
dangi12012 wrote: Mon Jan 23, 2023 8:15 pm So since you have AVX512 you maybe wanna take a closer look at this one - native code is in the comments I couldnt verify it. But your 13900k has avx512 + gfni extensions to be able to run it.
As far as I know, the efficiency cores don't have AVX-512, so I would have to disable them to get AVX-512 support (if that works at all).

@Mike Sherwin, I used clang-15 instead of clang-14, maybe that could explain some of the difference.
At the moment MSVC 2022 does a much better job than clang with my code, 492 MSVC / 404 clang = 1.22 for a 22% better performance. So if I approximate your result for KSSISSY as such, 703 * 1.22 = 857.66 then at least KSSISSY beats Plain Magic BB by quite a lot. :D Of course though that really should be tested for the 13900k!
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Comparison of all known Sliding lookup algorithms

Post by Mike Sherwin »

dangi12012 wrote: Mon Jan 23, 2023 8:15 pm
j.t. wrote: Mon Jan 23, 2023 4:33 pm Just one more CPU:

Code: Select all

clang++-15 -O3 -mllvm -inline-threshold=16000 -fdeclspec -march=native -funroll-loops -std=c++20 -pthread Main.cpp -o movegen_compare
Verify Engines...OK!

13th Gen Intel(R) Core(TM) i9-13900K

Million Lookups/s Random Squares, Random Occupation/s:
GaloisField - AVX512               4.107445                      0       [0kb]       AVX512F_GFNI             no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=81335
So since you have AVX512 you maybe wanna take a closer look at this one - native code is in the comments I couldnt verify it. But your 13900k has avx512 + gfni extensions to be able to run it.
Mike Sherwin wrote: Mon Jan 23, 2023 7:13 pm Wow - wow wow wow! :shock: :D

At least on AMD using MSVC 2022 KSSISSY is essentially even with Magic. I wonder why clang does not like my code?
Then according to j.t. on INTEL using clang the results are way worse for KSSISSY, oh well. I wonder if MSVC on INTEL would be alot better. Is optimization finished? Or can more be gained with fancy C++ like constexpr and templates which is beyond my paygrade? Either way I'm happy with the results and like you indicated before with less pressure on the cpu cache KSSISSY should be even more competitive in a real world chess engine! :)
Yeah I wouldnt worry to much about it. Sometimes MSVC is better and sometimes its Clang. Especially since both update regularly and msvc is like from last week and clang-14 is a tad older. I am currently building clang-16 (which is the master branch) to see if the c++ x86 backend improved.
I feel like there is more to be gained other than constexpr (constexpr does only improve performance where sq is a compiletime constant or known from the context)
Lookig at it - bishop seems solved.
For rook with a lot of effort there might be some alternate solution but I think its good for now. And like you said its even faster than magic in some cicumstances. Together with the smaller tablesize which will get more L1/2 cache hits it is definitely a big achievement.

There is one more idea to be tested. Taking the (maybe fastest) horizontal rook lookup from there -
https://github.com/Gigantua/Chess_Moveg ... a.hpp#L105
And substitute it for the hSubset? (advantage would be 64x64x8 = 32k vs 0.5k lookup)
https://github.com/Gigantua/Chess_Moveg ... d.hpp#L150
Thanks Daniel. I do need to get on with the development of my new engine Quixotic and as you indicated it is good enough for now! I will try that other horizontal code but I'm skeptical because of the operation count, 5 for KSS and 11 for HyperbolaQsc. But one never knows for sure unless tried. :)

Edit: On my computer your adding of static, 383 -> 385. :D
krunch
Posts: 10
Joined: Sat Sep 18, 2021 9:36 pm
Full name: Tony Schwebs

Re: Comparison of all known Sliding lookup algorithms

Post by krunch »

dangi12012 wrote: Mon Jan 23, 2023 8:15 pm
So since you have AVX512 you maybe wanna take a closer look at this one - native code is in the comments I couldnt verify it. But your 13900k has avx512 + gfni extensions to be able to run it.
I couldn't figure out the correct steps to get AVX512 instructions working, however we only need the 256 bit GFNI versions for Queens which seem to work:

Code: Select all

Verify Engines...OK!

13th Gen Intel(R) Core(TM) i9-13900K

Million Lookups/s Random Squares, Random Occupation/s:
Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
SBAMG o^(o-3cbn)                   332.169816                    576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline                       215.952793                    0       [0kb]       countl_zero, bswap       yes       Syed Fahad and Daniel Inführ                 http://www.talkchess.com/forum3/viewtopic.php?t=59845
GaloisField - AVX512               773.161004                    0       [0kb]       AVX512F_GFNI             no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=81335
Hyperbola Quintessence o^(o-2r)    335.816510                    256     [2kb]       bswap                    no        Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline      107.513876                    0       [0kb]       bswap                    yes       Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Genetic 8 Ray                      58.763729                     0       [0kb]       bswap                    no        Daniel Inführ (dangi12012)                   Abstract C++ Syntax Tree Sifter (c) Daniel Infuehr
Bitrotation                        52.830621                     0       [0kb]       ReverseBits              no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=20
Binary Neural Network              44.667767                     5852    [45kb]      pdep_u64, AVX2           no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79332
Exploding Bitboards                75.366895                     768     [6kb]       imul64                   no        Harald Lüßen                                  http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=4523&start=80
Reference (Switch Lookup)          57.220132                     0       [0kb]       none                     yes       Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78235&p=907362&hilit=espresso#p907362
AVX Branchless Shift               247.297757                    0       [0kb]       AVX2                     no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=60
Pext Emulated                      86.823393                     107904  [843kb]     none                     no        Zach Wegner                                  https://randombit.net/bitbashing/posts/haswell_bit_permutations.html
Dumb7 Fill                         90.561322                     0       [0kb]       none                     no        Gunnar Andersson                             https://www.chessprogramming.org/Dumb7Fill
Kogge-Stone                        143.259463                    0       [0kb]       none                     no        Peter M. Kogge, Harold S. Stone              https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards                  48.882147                     1848    [14kb]      none                     no        Robert Hyatt                                 https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine                          245.401282                    0       [0kb]       countr_zero, countl_zero yes       Fabio Gobbato                                https://www.chessprogramming.org/QBBEngine
QBBEngine - Shifted Mask           248.537255                    0       [0kb]       countr_zero, countl_zero no        Fabio Gobbato                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=90#p924623
Classical Bob-Mike                 317.352015                    1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike                  360.594259                    520     [4kb]       countr_zero, countl_zero no        Michael Sherwin and Daniel Inführ            http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=50#p924653
Leorik                             317.960811                    128     [1kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Leorik Inline                      125.383071                    0       [0kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Obstruction Difference             348.109620                    768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      111.217179                    0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Genetic Obstruction Difference     335.078534                    384     [3kb]       countl_zero              no        Daniel Inführ and Michael Hoffmann           http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Genetic Obstruction Difference V2  388.473977                    768     [6kb]       countl_zero              no        Daniel Inführ                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Slide Arithmetic                   317.034250                    256     [2kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inführ              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Slide Arithmetic Inline            120.568965                    0       [0kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inführ              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Kindergarten                       644.520235                    16640   [130kb]     imul64                   no        Urban Koistinen                              https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards                    460.233876                    180416  [1409kb]    none                     no        Michael Sherwin                              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73083
Fancy Magic BB - Variable shift    669.429925                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       313.984887                    6468    [50kb]      none                     no        Daniel Inführ                                tbd
Plain Magic BB                     729.709515                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       911.632430                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     1392.418282                   107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
HyperCube                          64.780195                     107680  [841kb]     none                     yes       Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p916723
Here is the updated code:

Code: Select all

namespace Chess_Lookup::GaloisField
{
	constexpr auto Size = 0;

	template<uint64_t bb>
	constexpr uint64_t mask_shift(int ranks) {
		return ranks > 0 ? bb >> (ranks << 3) : bb << -(ranks << 3);
	}

#	define dir_HO(X) (0xFFull << (X & 56))
#	define dir_VE(X) (0x0101010101010101ull << (X & 7))
#	define dir_D1(X) (mask_shift<0x8040201008040201ull>((X & 7) - (X >> 3)))
#	define dir_D2(X) (mask_shift<0x0102040810204080ull>(7 - (X & 7) - (X >> 3)))

	static __m256i* boardMask = new __m256i[64];

	static void InitMask() {

		for (int square = 0; square < 64; ++square) {
			boardMask[square] = _mm256_set_epi64x(dir_HO(square) ^ (1ull << square), dir_VE(square) ^ (1ull << square), dir_D1(square) ^ (1ull << square), dir_D2(square) ^ (1ull << square));
		}
	}

	//Reverses bits in all 64 bytes at once 
	static __m256i bit_reverse(__m256i input) {
		
		__m256i b = _mm256_gf2p8affine_epi64_epi8(input, _mm256_set1_epi64x(0x8040201008040201), 0x00);
		const __m256i shuffle_mask = _mm256_set_epi8(8, 9, 10, 11, 12, 13, 14, 15, 0, 1, 2, 3, 4, 5, 6, 7, 24, 25, 26, 27, 28, 29, 30, 31, 16, 17, 18, 19, 20, 21, 22, 23);
		return _mm256_shuffle_epi8(b, shuffle_mask);
	}

	//This can solve 8 rays, so all moves of two queens at once or 4 (rooks, bishops)
	static __m256i attack8(uint64_t occ, int square, __m256i mask) {
		__m256i o = _mm256_and_epi32(_mm256_set1_epi64x(occ), mask);
		__m256i sq = _mm256_set1_epi64x((1ull << square));
		__m256i sqRev = _mm256_set1_epi64x((0x8000000000000000ull >> square));
		return _mm256_and_epi32(_mm256_xor_epi32(_mm256_sub_epi64(o, sq), bit_reverse(_mm256_sub_epi64(bit_reverse(o), sqRev))), mask);
	}

	static uint64_t Queen(int sq, uint64_t occ) {
		__m256i result = attack8(occ, sq, _mm256_loadu_si256(boardMask + sq));
		__m256i result2 = _mm256_or_epi32(result, _mm256_permute4x64_epi64(result, 0x4E));
		return _mm256_or_epi32(result2, _mm256_permute4x64_epi64(result2, 0x4D)).m256i_u64[0];
	}

#undef dir_HO
#undef dir_VE
#undef dir_D1
#undef dir_D2
}
and corresponding assembly:

Code: Select all

00007FF7D8BF7C60  movzx       edx,byte ptr [r8]  
00007FF7D8BF7C64  movsxd      rax,edx  
00007FF7D8BF7C67  shl         rax,5  
00007FF7D8BF7C6B  vmovdqu     ymm4,ymmword ptr [rax+r13]  
00007FF7D8BF7C71  vpand       ymm3,ymm7,ymm4  
00007FF7D8BF7C75  vgf2p8affineqb ymm0,ymm3,ymm8,0  
00007FF7D8BF7C7B  vpshufb     ymm1,ymm0,ymm9  
00007FF7D8BF7C80  shrx        rcx,r14,rdx  
00007FF7D8BF7C85  vmovq       xmm0,rcx  
00007FF7D8BF7C8A  vpbroadcastq ymm0,xmm0  
00007FF7D8BF7C8F  vpsubq      ymm0,ymm1,ymm0  
00007FF7D8BF7C93  vgf2p8affineqb ymm1,ymm0,ymm8,0  
00007FF7D8BF7C99  vpshufb     ymm2,ymm1,ymm9  
00007FF7D8BF7C9E  shlx        rax,r12,rdx  
00007FF7D8BF7CA3  vmovq       xmm0,rax  
00007FF7D8BF7CA8  vpbroadcastq ymm0,xmm0  
00007FF7D8BF7CAD  vpsubq      ymm0,ymm3,ymm0  
00007FF7D8BF7CB1  vpxor       ymm1,ymm0,ymm2  
00007FF7D8BF7CB5  vpand       ymm2,ymm1,ymm4  
00007FF7D8BF7CB9  vpermq      ymm0,ymm2,4Eh  
00007FF7D8BF7CBF  vpor        ymm3,ymm2,ymm0  
00007FF7D8BF7CC3  vpermq      ymm1,ymm3,4Dh  
00007FF7D8BF7CC9  vpor        ymm0,ymm3,ymm1  
00007FF7D8BF7CCD  vmovq       rax,xmm0  
00007FF7D8BF7CD2  xor         rdi,rax  
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

Very much - very cool!

Your code proves to me that this will be much better than expected!
I can see instantly from your code that the horizontal reduction can be done much more efficiently as well as AVX512 instructions having the same throughput as the 256 bit version. Meaning performance with optimisation can be doubled for sure!

Thinking of it - the horizontal reduction may not be needed in production code since just returning the AVX register to work on it directly (to enumerate valid squares) is as valid as uint64. It might even be preferred since masking away the pinned invalid directions is essentially a 4x64bit operation.

The final algo will probably be a template function that takes in <Rook, Rook, Bishop, Rook> <Queen, Queen> or any permutation of those. As well as smaller 256 bit versions like yours for: 1Queen, 2Bishops, 2Rooks, 1Rook 1Bishop.

Also when going the 64bit route there is a instruction few people ever use which is twice as fast for storing when not immediately reading back results

Code: Select all

_mm512_stream_si512 
non-temporal memory hint - means bypassing the cache hirarchy and being around twice as fast to store to heap.

So when you say you get 773MQ/s (with half the bits of avx512) and pext is 1392MQ/s this will be interesting indeed when being optimized fully.
But no promises before I have my hands on avx512 hardware.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer