Comparison of all known Sliding lookup algorithms

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

If somebody with a math background has time:

You can revisit rotated bitboards at any time with the galois field instruction which is 8x8 matrix multiplication.
GF2P8AFFINEQB can apply this to 8 bitboards at once with different matrices.

Mirroring, 90 Degree clockwise, reverting and all other transformations.

To reverse bits we multiply with Identity Transposed. Mirroring solved = I^t
What form do the matrices have to get the other transformations? (90° CW+CCW, 180°)

There is a huge batch of algos that only works with "positive rays" and if you look at CPW rotating BBs is expensive with 11+ instructions for 1BB.
GF2P8AFFINEQB uses 1 instruction for 8BBs at once.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
gxchess1
Posts: 29
Joined: Thu Sep 15, 2022 4:51 pm
Full name: Elad Yosef Cohen

Re: Comparison of all known Sliding lookup algorithms

Post by gxchess1 »

Wow thats huge if it does 700+M with no LUT at all. Plus it wasn't even optimized yet.
krunch
Posts: 10
Joined: Sat Sep 18, 2021 9:36 pm
Full name: Tony Schwebs

Re: Comparison of all known Sliding lookup algorithms

Post by krunch »

dangi12012 wrote: Tue Jan 31, 2023 10:09 am If somebody with a math background has time:

You can revisit rotated bitboards at any time with the galois field instruction which is 8x8 matrix multiplication.
GF2P8AFFINEQB can apply this to 8 bitboards at once with different matrices.

Mirroring, 90 Degree clockwise, reverting and all other transformations.

To reverse bits we multiply with Identity Transposed. Mirroring solved = I^t
What form do the matrices have to get the other transformations? (90° CW+CCW, 180°)

There is a huge batch of algos that only works with "positive rays" and if you look at CPW rotating BBs is expensive with 11+ instructions for 1BB.
GF2P8AFFINEQB uses 1 instruction for 8BBs at once.
Swapping the argument order with identity / identity transposed will get us diagonal mirroring / 90 degree rotation.

Code: Select all

	// . . X X X X . .     . . X X X X . .
	// . X X X X X X .     . X X X X X X .
	// . X X . . X X .     . X X . . X X .
	// . X X . X X . .     . . X X . X X .
	// . X X X X . . .     . . . X X X X .
	// . X X . X X . .     . . X X . X X .
	// . X X . . X X .     . X X . . X X .
	// . X X . . . X X     X X . . . X X .
	static __m256i mirror_horizontal(__m256i input) {
		return _mm256_gf2p8affine_epi64_epi8(input, _mm256_set1_epi64x(0x8040201008040201), 0);
	}

	// . . X X X X . .     . . . . . . . .
        // . X X X X X X .     X X X X X X X .
        // . X X . . X X .     X X X X X X X X
        // . X X . X X . .     . . . X . . X X
        // . X X X X . . .     . . X X X . X X
        // . X X . X X . .     . X X . X X X X
        // . X X . . X X .     X X . . . X X .
        // . X X . . . X X     X . . . . . . .
	static __m256i rotate_90_degrees(__m256i input) {
		return _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x8040201008040201), input, 0);
	}

	// . . X X X X . .     X . . . . . . .
        // . X X X X X X .     X X . . . X X .
        // . X X . . X X .     . X X . X X X X
        // . X X . X X . .     . . X X X . X X
        // . X X X X . . .     . . . X . . X X
        // . X X . X X . .     X X X X X X X X
        // . X X . . X X .     X X X X X X X .
        // . X X . . . X X     . . . . . . . .
	static __m256i mirror_diagonal(__m256i input) {
		return _mm256_gf2p8affine_epi64_epi8(_mm256_set1_epi64x(0x0102040810204080), input, 0);
	}
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

Its 2023 and another two new movegeneration algorithms are born!

Namely: Correlationmagic and spectral Correlationmagic. The main idea is that sampling 200GB of magic candidates into 100GB of magic solutions you can start to correlate the solutions with each other to find a function that compresses the usual 64slot magic lookup into something smaller.
It was a big brain bug of mine and finally I found a solution to that problem.

That is numbers that satisfy the condition: (occ | mask) * magicM >> (64 - N) where magicM == magicM-1 >> 1 for long ranges. I went so far as to create a competition initially: https://www.talkchess.com/forum3/viewto ... 9853e38986

The competition didnt have all 100GB of magics to work with so that no perfect solution existed but with more data the chance of bigger buckets becomes greater. Also it was unclear what the function is. We can find magics that are valid for multiple squares etc.. In the end most good solutions are bithsifts and can be seen like this:

Code: Select all

uint64_t magic[8] {
                                                          0b000000000000000000000001110000000000000010000000000001000001000000000000,
                                                  0b000000000000000000000000000000011100000000000000100000000000010000000000,
                                          0b000000000000000000000100000000000000000111000000000000001000000000000000,
                                  0b000000000000000000001000000001000000000000000001100000000000001000000000,
                          0b000000000000000000000010000010000000010000000000000000011000000000000000,
                  0b000000000000000000010000000000100000100000000100000000000000000100000000,
          0b000000000000000000000001000100000000001000001000000001000000000000000000,
  0b000000000000000000100000000000010001000000000010000010000000010000000000,
};
I executed exactly the tricks to work on all of the data concurrently via memory mapped IO so that all of that can be accessed all the time without memory overflows. In the end only a gpu approach was used since 16cores of CPU are around 60x slower than my GPU for this type of integer arithmetic.

So without any ado lets look at the code itself:

Code: Select all

static constexpr uint64_t bishop_magics[8] = {
	 281754153844992ull,
	 1100594872576ull,
	 281483600331392ull,
	 2286985260581120ull,
	 1143509276953600ull,
	 563501328237568ull,
	 141289400076288ull,
	 565157600174592ull
};
static inline uint64_t bishop(int sq, uint64_t occ) {
	uint64_t masked_occ = occ | bishop_masks[sq];
	uint64_t magic = bishop_magics[sq >> 3] >> (sq & 7);
	uint64_t index = (masked_occ * magic) >> 55;
	return  *(attack_table_bishop + (sq << 9) + index);
};
The algo in english is this:
Take only 8 magics and shift them up to 8 times - use usual fancy magic on top of that and index into a table with a square dependent offset. This skips two numbers we usually need to look up. The magic (can be known inside a ZMM register) and the offset (which is sq shifted).

From usually 64 elements we can find lots of solutions that are valid when shifted up to 8x. I made no attempts to make the overall attack_table_bishop small seeing in my adding sq << 9 by default. Yet you can see we need much less hot memory and actually can fit all magics inside a ZMM register. The masks still need to be looked up or calculated for 64 squares.


Now what about the second part the spectral magic?
In working with specific popcounts we can get rid of the expensive Imul64 and multiply via bitshifts. This is only valid for SIMD - non simd 6x bitshift is slower than imul64.
This is important because your and my and everyones computer on this planet in AVX and AVX512 cannot do IMUL64 in a simdwise manner. Bitshifts on the other hand even when multiple of them are used are all not hindering this compiler optimisation.

This is a proof oc concept for the algorithm and the author already has much better ideas for this algo. Since 1, 2, 3 bit solutions also exist it is expected that some squares can be looked up by mask and shift alone which in the end is very fast. Here I picked popcount = 6 for all solutions.

These are the same magics as above but for expressiveness reasons I show here the literally same multiplication and shifting. This is essentially a hybrid of spectral shift decomposition and correlationmagic.

Code: Select all

static constexpr uint8_t bishop_spectralmagic[64][6] = {
{ 8, 16, 22, 32, 38, 48 },
{ 8, 16, 20, 23, 30, 40 },
{ 7, 9, 17, 25, 33, 48 },
{ 8, 14, 20, 30, 45, 51 },
{ 10, 11, 22, 34, 44, 50 },
{ 10, 23, 29, 30, 39, 49 },
{ 11, 15, 23, 31, 39, 47 },
{ 9, 13, 25, 33, 41, 49 } };

uint64_t inline spectral_bishop(int sq, uint64_t occ) {
	uint64_t masked_occ = occ | bishop_masks[sq];
	uint64_t* attacks = attack_table_bishop + (sq << 9);
	const uint8_t* s = bishop_spectralmagic[sq >> 3];
	uint64_t index = (
		(masked_occ << (s[0] - (sq & 7))) +
		(masked_occ << (s[1] - (sq & 7))) +
		(masked_occ << (s[2] - (sq & 7))) +
		(masked_occ << (s[3] - (sq & 7))) +
		(masked_occ << (s[4] - (sq & 7))) +
		(masked_occ << (s[5] - (sq & 7)))) >> 55;
	return attacks[index];
};
The algo in english is this:
Take only 8 magics and shift them up to 8 times - use bithsift multiplication on top of that and index into a table with a square dependent offset.
We can easily find magics with 5 or even 4 bits that that is open for the future. Here I chose 6 bits.

Thats all nice Daniel but what does that mean in performance?

Code: Select all

Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
Correlation Magic                  380.457183                    294976  [2304kb]    imul64                   no        Daniel Inführ                                https://www.talkchess.com/forum3/viewtopic.php?f=7&t=82224#p949688
Spectral Magic                     144.279394                    294976  [2304kb]    none                     no        Daniel Inführ                                https://www.talkchess.com/forum3/viewtopic.php?f=7&t=82224#p949688
Gigantua Rotation Algorithm        526.645174                    768     [6kb]       AVX2 or AVX512 + GFNI    no        Daniel Inführ (dangi12012)                   https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81707&sid=ec3532ea1ce1e8014d2b579b29fddd34
SBAMG o^(o-3cbn)                   235.408406                    576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
Fancy Magic BB - Variable shift    377.520046                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       211.290669                    6468    [50kb]      none                     no        Daniel Inführ                                tbd
Plain Magic BB                     478.842733                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       520.237679                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     764.414955                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
So correlation magic eeked out a small win against Variable shift Magic BB, but not against Black Magic BB - Fixed shift. This can be expected as there are still many opportunities to improve this. For example no attempt was made to make the table small yielding 2304kb against condensing this easily to 800kb. Another point is that shifting increasing 12 to 14 for rooks by >> (64-12) yields more magics. You can see the rook didnt let itself compress much and I sifted through all 64 bit numbers up to a popcount of 9 which is ncr(64,9) + ncr(64,8) ... so a lot of numbers to try.

The nice thing about slot lookup algorithms generally is that we dont necessarily need to store the bitboards at the calculated index. We might as well have another structure there. For example an array of SqFrom, SqTo since these are known. Here we chose to return the bitmap of the attacked squares.


Finally I was able to work with this now because I have replaced random magic sampling by exhaustive search. With modern architectures the amount of VRAM or RAM does not matter because a fast NVME SSD and memory mapped IO is around as fast as a DDR2 stick from 2010 for sequential access. Allowing the chess community to access 100s of GB of working memory for this problem and future chess problems easily.

I personally will stick with my Gigantua Rotation Algorithm since that is hamstrung on AVX2 but still faster than magic lookups and completely fits in L1.
You can see all of the code here: https://github.com/Gigantua/Chess_Moveg ... aa0e47077d
Future movegens and releases of mine will be on a google indexed page or blog but I will link here of course too. If you have questions let me know!
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

Busy week and I need to push this idea out there.

When having fancy magic solutions we can limit the popcount to for example 3. This only brings a few solutions so the fixed shift needs to be increased a lot from 12 to someting a lot bigger. If we allow ourselfes to do that we can do 64bit multiplication by 3shifts and 2 additions which is insanely cheap.
Altough Modern processors can do imul64 faster so this movegen is applicable to plattforms where hardware IMUL64 is not existing or with less reciprocal througput than 3 shifts and 2 adds.

So now I want to publish the spectral version of fancy magic with explicit terms. (So everything lives in a switch instead of an array). In addition to the above 2 algos:
https://godbolt.org/z/rKMq64bdT
https://github.com/Gigantua/Chess_Moveg ... 16cb753093

IMO this is just beautiful and would have been great before x86-64 became a thing. As mentioned above applicable anywhere fancy magics IMUL64 is too expensive as the underlying lookup table is 100% compatible.

Code: Select all

    static inline uint64_t rook(int sq, uint64_t occ) {
        uint64_t m = occ | rook_masks[sq];
        switch (sq) {
        case 0: return (attack_table + 84)[((m << 14) + (m << 39) + (m << 55)) >> 46];
        case 1: return (attack_table + 87)[((m << 13) + (m << 36) + (m << 54)) >> 46];
        case 2: return (attack_table + 78)[((m << 12) + (m << 37) + (m << 55)) >> 46];
        case 3: return (attack_table + 60)[((m << 11) + (m << 36) + (m << 55)) >> 46];
        case 4: return (attack_table + 66)[((m << 10) + (m << 35) + (m << 55)) >> 46];
        case 5: return (attack_table + 72)[((m << 9) + (m << 34) + (m << 55)) >> 46];
        case 6: return (attack_table + 0)[((m << 8) + (m << 33) + (m << 56)) >> 46];
        case 7: return (attack_table + 21)[((m << 7) + (m << 32) + (m << 55)) >> 46];
        case 8: return (attack_table + 111)[((m << 13) + (m << 30) + (m << 46)) >> 46];
        case 9: return (attack_table + 390)[((m << 13) + (m << 31) + (m << 47)) >> 46];
        case 10: return (attack_table + 102)[((m << 13) + (m << 28) + (m << 47)) >> 46];
        case 11: return (attack_table + 75)[((m << 10) + (m << 27) + (m << 46)) >> 46];
        case 12: return (attack_table + 57)[((m << 9) + (m << 26) + (m << 46)) >> 46];
        case 13: return (attack_table + 30)[((m << 8) + (m << 25) + (m << 46)) >> 46];
        case 14: return (attack_table + 0)[((m << 0) + (m << 25) + (m << 49)) >> 46];
        case 15: return (attack_table + 9)[((m << 7) + (m << 30) + (m << 46)) >> 46];
        case 16: return (attack_table + 96)[((m << 14) + (m << 39) + (m << 55)) >> 46];
        case 17: return (attack_table + 168)[((m << 13) + (m << 38) + (m << 52)) >> 46];
        case 18: return (attack_table + 1056)[((m << 4) + (m << 21) + (m << 40)) >> 46];
        case 19: return (attack_table + 516)[((m << 3) + (m << 22) + (m << 41)) >> 46];
        case 20: return (attack_table + 213)[((m << 10) + (m << 20) + (m << 39)) >> 46];
        case 21: return (attack_table + 126)[((m << 2) + (m << 20) + (m << 41)) >> 46];
        case 22: return (attack_table + 216)[((m << 0) + (m << 20) + (m << 42)) >> 46];
        case 23: return (attack_table + 108)[((m << 0) + (m << 15) + (m << 41)) >> 46];
        case 24: return (attack_table + 24)[((m << 13) + (m << 30) + (m << 46)) >> 46];
        case 25: return (attack_table + 15)[((m << 11) + (m << 29) + (m << 52)) >> 46];
        case 26: return (attack_table + 0)[((m << 12) + (m << 31) + (m << 53)) >> 46];
        case 27: return (attack_table + 12)[((m << 11) + (m << 31) + (m << 52)) >> 46];
        case 28: return (attack_table + 114)[((m << 10) + (m << 31) + (m << 47)) >> 46];
        case 29: return (attack_table + 186)[((m << 9) + (m << 31) + (m << 42)) >> 46];
        case 30: return (attack_table + 576)[((m << 8) + (m << 31) + (m << 47)) >> 46];
        case 31: return (attack_table + 39)[((m << 8) + (m << 31) + (m << 53)) >> 46];
        case 32: return (attack_table + 417)[((m << 7) + (m << 23) + (m << 46)) >> 46];
        case 33: return (attack_table + 993)[((m << 5) + (m << 23) + (m << 46)) >> 46];
        case 34: return (attack_table + 381)[((m << 5) + (m << 23) + (m << 44)) >> 46];
        case 35: return (attack_table + 114)[((m << 3) + (m << 23) + (m << 44)) >> 46];
        case 36: return (attack_table + 102)[((m << 2) + (m << 25) + (m << 37)) >> 46];
        case 37: return (attack_table + 96)[((m << 3) + (m << 25) + (m << 44)) >> 46];
        case 38: return (attack_table + 51)[((m << 2) + (m << 24) + (m << 41)) >> 46];
        case 39: return (attack_table + 90)[((m << 6) + (m << 22) + (m << 31)) >> 46];
        case 40: return (attack_table + 6)[((m << 15) + (m << 38) + (m << 53)) >> 46];
        case 41: return (attack_table + 27)[((m << 14) + (m << 37) + (m << 61)) >> 46];
        case 42: return (attack_table + 324)[((m << 13) + (m << 19) + (m << 36)) >> 46];
        case 43: return (attack_table + 2805)[((m << 10) + (m << 28) + (m << 35)) >> 46];
        case 44: return (attack_table + 633)[((m << 1) + (m << 18) + (m << 38)) >> 46];
        case 45: return (attack_table + 5166)[((m << 2) + (m << 17) + (m << 36)) >> 46];
        case 46: return (attack_table + 1068)[((m << 3) + (m << 17) + (m << 32)) >> 46];
        case 47: return (attack_table + 174)[((m << 0) + (m << 17) + (m << 31)) >> 46];
        case 48: return (attack_table + 3)[((m << 6) + (m << 29) + (m << 46)) >> 46];
        case 49: return (attack_table + 0)[((m << 6) + (m << 29) + (m << 52)) >> 46];
        case 50: return (attack_table + 0)[((m << 6) + (m << 28) + (m << 51)) >> 46];
        case 51: return (attack_table + 0)[((m << 6) + (m << 27) + (m << 50)) >> 46];
        case 52: return (attack_table + 3)[((m << 6) + (m << 26) + (m << 51)) >> 46];
        case 53: return (attack_table + 3)[((m << 7) + (m << 25) + (m << 50)) >> 46];
        case 54: return (attack_table + 12)[((m << 8) + (m << 33) + (m << 56)) >> 46];
        case 55: return (attack_table + 12)[((m << 7) + (m << 32) + (m << 53)) >> 46];
        case 56: return (attack_table + 0)[((m << 0) + (m << 14) + (m << 39)) >> 46];
        case 57: return (attack_table + 45)[((m << 0) + (m << 13) + (m << 38)) >> 46];
        case 58: return (attack_table + 33)[((m << 0) + (m << 12) + (m << 37)) >> 46];
        case 59: return (attack_table + 12)[((m << 0) + (m << 11) + (m << 36)) >> 46];
        case 60: return (attack_table + 6)[((m << 0) + (m << 10) + (m << 35)) >> 46];
        case 61: return (attack_table + 27)[((m << 0) + (m << 9) + (m << 34)) >> 46];
        case 62: return (attack_table + 45)[((m << 0) + (m << 25) + (m << 48)) >> 46];
        case 63: return (attack_table + 54)[((m << 0) + (m << 15) + (m << 33)) >> 46];
        default: unreachable();
        }
    }
To note here is that m << 0 is free. The table is only a very tiny amount bigger than individual expanded magics (think of it as a buffer size of 2^18 per square). Usually this is 2^12 bits. Anyways the table size is log2(274645) = 18.06 only slightly bigger than single squares.

This is the explicit form where we see the multiplication in Init() (check the github) and the results here.
Also it is time to publish my performance test of current MSVC as of 2023 vs Clang for Computerchess movegenerators:

Here is the raw data of every movegenerators performance in MSVC vs Clang. Copy paste to excel if you want to analyse this.

Code: Select all

Name	CLANG	MSVC	Clang vs MSVC
Correlation Magic	380,235084	383,455187	-0,839759927
Spectral Magic	165,823542	140,443389	18,07144728
Spectral Magic Switch	138,425403	78,232906	76,9401267
SBAMG o^(o-3cbn)	291,481517	207,833811	40,24740036
SBAMG Inline	384,061423	132,008845	190,9361286
GaloisField - AVX512	4,550884	4,942163	-7,917160968
Gigantua Rotation Algo	591,667773	458,393529	29,07419838
Hyperbola Quintessence	786,918095	234,588145	235,4466591
Hyperbola Quintessence	553,289165	85,048851	550,5545442
Genetic 8 Ray	84,806722	41,185067	105,9161929
Bitrotation	117,713879	38,867366	202,8604485
Binary Neural Network	34,035661	48,590727	-29,95441085
Exploding Bitboards	74,24909	64,411367	15,27327156
Reference (Switch Look	57,78067	35,884307	61,01932803
AVX Branchless Shift	271,728106	194,76915	39,51290849
Pext Emulated	71,338672	64,783951	10,11781606
Dumb7 Fill	97,631418	66,632484	46,52225482
Kogge-Stone	481,000789	111,765852	330,3647137
Rotated Bitboards	46,585472	44,765362	4,065889158
QBBEngine	314,724898	176,489587	78,32491047
QBBEngine - Shifted Ma	318,546483	167,736265	89,90913086
Classical Bob-Mike	309,387596	202,548602	52,74733716
Advanced Bob-Mike	333,170104	212,279071	56,94910592
Leorik	301,119182	220,354676	36,65204999
Leorik Inline	320,125672	113,105453	183,032925
Obstruction Difference	291,878385	242,44305	20,39049377
Obstruction Difference	287,345546	92,558469	210,4476004
Genetic Obstruction Di	367,765361	272,002263	35,20672841
Genetic Obstruction Di	387,127914	313,839211	23,35230922
Slide Arithmetic	255,623887	249,682903	2,379411617
Slide Arithmetic Inlin	112,280204	110,186908	1,899768346
Kindergarten	439,524489	436,557944	0,679530642
SISSY Bitboards	223,645943	204,062476	9,596799659
Kindergarten Super SIS	447,212391	510,440204	-12,38691868
Fancy Magic BB - Varia	534,579364	389,204	37,35197069
FoldingHash - 4x fancy	312,658241	218,534971	43,07011805
Plain Magic BB	491,756405	504,557415	-2,537076975
Black Magic BB - Fixed	638,793243	520,870409	22,63957252
Pext constexpr	866,371859	763,187887	13,52012706
HyperCube	89,265628	69,692085	28,08574747
			
AVERAGE	306,905654	210,6734077	45,67840211
I dont recommend MSVC or GCC for that matter for shipping chessprogramming binaries in the aspect of movegens. A few years back the performance difference was not that big. Seems that LLVM backend finds many good optimisations that MSVC is missing.

Here is the graph. Keep in mind that 100 means that clang produces code that is TWICE as fast. and even the run of the mill 30% uplift is non trivial. The aggregate averate of all movegens is 45,67% uplift for Clang++-16. Compiler flags used: clang++ -std=c++20 -O3 -march=native

Image

If you have Zen4 contact me.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
martinn
Posts: 20
Joined: Fri Mar 10, 2023 9:33 am
Full name: Martin Novák

Re: Comparison of all known Sliding lookup algorithms

Post by martinn »

Hi Daniel,
could you also add Split Pext into your Comparison framework? Here is a link for code https://gist.github.com/martinnovaak/b9 ... 1001b13fe8.

And this is link https://www.talkchess.com/forum3/viewto ... 10#p949575 for thread where the algorithm was discussed.

Martin
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Comparison of all known Sliding lookup algorithms

Post by dangi12012 »

Another algorithm comes into the repo. Split Pext.
It technically existed in there for over 1 year but the C# to C++ translation had a bug and was disabled. Now it works and even gets an improvement suggested by our friend Mike Sherwin.

Outcome:
Around the same speed as Magic BBs and 20% slower than PEXT at 1/7th the memory cost. For engine perft with cache pressure you would need to test there.

Code: Select all

Fancy Magic BB - Variable shift    532.018585                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       318.205335                    6468    [50kb]      none                     no        Daniel Inf�hr                                tbd
Plain Magic BB                     478.390423                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       654.241259                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     824.616122                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
Split Pext                         615.332063                    16608   [129kb]     pext_u64                 no        Thomas Jahn                                  https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79049&start=340
HyperCube                          92.042865                     107680  [841kb]     none                     yes       Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p916723
Outcome for this line of code:

Code: Select all

// horizontal pext is probably slower than kindergarten horizontal
// Daniel says: Confirmed and pext will be slower than kindergarten lookup on other platforms as well. 
return horizontal_subset[square][_pext_u64(occupancy, horizontal_mask[square])];  


Interesting Trivia: I have all these algos in my head and I knew other algos use a different horizontal getter which I assumed were fastest.
A few years back the fastest horizontal code was known as this, and this is what hyperbola uses:

Code: Select all

static constexpr uint64_t horizontal_attack(uint64_t pieces, uint32_t x) {
	uint32_t file_mask = x & 7;
	uint32_t rank_mask = x & 56;
	uint64_t o = (pieces >> rank_mask) & 126;

	return ((uint64_t)rank_attack[o * 4 + file_mask]) << rank_mask;
}
I tested this and its slower than the approach in the repo which is this:

Code: Select all

horizontal_subset[square][(occupancy >> horizontal_shift_table[square]) & 63];
So in summary:
3xPext + 1x Hyperbola horizontal getter = 571 MNPS, Naive 4x PEXT = 591 MNPS, 3xPext + 1x Kindergarten = 615 MNPS.
Final summary: Slightly slower than magic and pext but 1/7th the memory.

Wait until I show Gigantua V2 movegen (or engine, depends on how fast NNUE runs in cuda) :!:
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
jlrdh
Posts: 2
Joined: Wed Aug 09, 2023 11:20 pm
Full name: Julien Dehaes

Re: Comparison of all known Sliding lookup algorithms

Post by jlrdh »

Hello Daniel, first of all, thanks for your great work. You'll find hereafter the results for a old Intel hardware. I'll post other results from older Intel machines in the near future. BTW, would you also have interest for results for ARM processors or before-Ryzen AMD processors?
Hope this helps,
Best regards,
Julien

Code: Select all

$ ./movegen_compare 
Verify Engines...OK!

Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz

Million Lookups/s Random Squares, Random Occupation, Transposition Table Cache Pressure Simulation:
Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
Correlation Magic                  230.548117                    294976  [2304kb]    imul64                   no        Daniel Inf�hr                                https://www.talkchess.com/forum3/viewtopic.php?f=7&t=82224#p949688
Spectral Magic                     93.106684                     294976  [2304kb]    none                     no        Daniel Inf�hr                                https://www.talkchess.com/forum3/viewtopic.php?f=7&t=82224#p949688
Spectral Magic Switched            55.550977                     274773  [2146kb]    none                     no        Daniel Inf�hr                                https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=160
SBAMG o^(o-3cbn)                   136.710128                    576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline                       195.245299                    0       [0kb]       countl_zero, bswap       yes       Syed Fahad and Daniel Inf�hr                 http://www.talkchess.com/forum3/viewtopic.php?t=59845
GaloisField - AVX512               3.149912                      0       [0kb]       AVX512F_GFNI             no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=81335
Hyperbola Quintessence o^(o-2r)    212.713752                    256     [2kb]       bswap                    no        Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline      259.804464                    0       [0kb]       bswap                    yes       Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Genetic 8 Ray                      71.250432                     0       [0kb]       bswap                    no        Daniel Inf�hr (dangi12012)                   Abstract C++ Syntax Tree Sifter (c) Daniel Infuehr
Bitrotation                        57.089636                     0       [0kb]       ReverseBits              no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=20
Binary Neural Network              16.845716                     5852    [45kb]      pdep_u64, AVX2           no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79332
Exploding Bitboards                38.731858                     768     [6kb]       imul64                   no        Harald L��en                                  http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=4523&start=80
Reference (Switch Lookup)          47.126966                     0       [0kb]       none                     yes       Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78235&p=907362&hilit=espresso#p907362
AVX Branchless Shift               115.677512                    0       [0kb]       AVX2                     no        Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=60
Pext Emulated                      41.512009                     107904  [843kb]     none                     no        Zach Wegner                                  https://randombit.net/bitbashing/posts/haswell_bit_permutations.html
Dumb7 Fill                         125.971578                    0       [0kb]       none                     no        Gunnar Andersson                             https://www.chessprogramming.org/Dumb7Fill        
Kogge-Stone                        218.000741                    0       [0kb]       none                     no        Peter M. Kogge, Harold S. Stone              https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards                  23.080858                     1848    [14kb]      none                     no        Robert Hyatt                                 https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine                          153.420210                    0       [0kb]       countr_zero, countl_zero yes       Fabio Gobbato                                https://www.chessprogramming.org/QBBEngine        
QBBEngine - Shifted Mask           146.027925                    0       [0kb]       countr_zero, countl_zero no        Fabio Gobbato                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=90#p924623
Classical Bob-Mike                 147.223936                    1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike                  161.354347                    520     [4kb]       countr_zero, countl_zero no        Michael Sherwin and Daniel Inf�hr            http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=50#p924653
Leorik                             149.322507                    128     [1kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine   
Leorik Inline                      147.319707                    0       [0kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine   
Obstruction Difference             148.202371                    768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      141.345824                    0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Genetic Obstruction Difference     161.427833                    384     [3kb]       countl_zero              no        Daniel Inf�hr and Michael Hoffmann           http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Genetic Obstruction Difference V2  179.270111                    768     [6kb]       countl_zero              no        Daniel Inf�hr                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Slide Arithmetic                   146.628112                    256     [2kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf�hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Slide Arithmetic Inline            114.738006                    0       [0kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf�hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Kindergarten                       265.159355                    16640   [130kb]     imul64                   no        Urban Koistinen                              https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards                    86.191465                     180416  [1409kb]    none                     no        Michael Sherwin                              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73083
Kindergarten Super SISSY Bitboards 272.379785                    16672   [130kb]     imul64                   no        Michael Sherwin                              https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81234&start=30
Fancy Magic BB - Variable shift    262.772526                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       165.183501                    6468    [50kb]      none                     no        Daniel Inf�hr                                tbd                                               
Plain Magic BB                     221.416569                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       302.002876                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     365.125694                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
Split Pext                         361.105356                    16608   [129kb]     pext_u64                 no        Group Idea                                   https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79049&start=340
HyperCube                          32.622309                     107680  [841kb]     none                     yes       Daniel Inf�hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p916723

Code: Select all

$ cat /proc/cpuinfo | tail -n26

vendor_id       : GenuineIntel
cpu family      : 6
model           : 61
model name      : Intel(R) Core(TM) i5-5300U CPU @ 2.30GHz
stepping        : 4
microcode       : 0x22
cpu MHz         : 2305.209
cache size      : 3072 KB
physical id     : 0
siblings        : 4
core id         : 1
cpu cores       : 2
apicid          : 3
initial apicid  : 3
fpu             : yes
fpu_exception   : yes
cpuid level     : 20
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti fsgsbase tsc_adjust bmi1 hle avx2 smep bmi2 erms invpcid rtm rdseed adx smap intel_pt xsaveopt dtherm ida arat pln pts
bugs            : cpu_meltdown spectre_v1 spectre_v2 spec_store_bypass l1tf mds swapgs taa itlb_multihit srbds mmio_unknown
bogomips        : 4589.04
clflush size    : 64
cache_alignment : 64
address sizes   : 39 bits physical, 48 bits virtual
power management:
jlrdh
Posts: 2
Joined: Wed Aug 09, 2023 11:20 pm
Full name: Julien Dehaes

Re: Comparison of all known Sliding lookup algorithms

Post by jlrdh »

Hello, you'll find results for another older intel PC. I had to disable the Split Pext algorithm due to 3 errors (one for each _pext_u64) when compiling on that machine without BMI2 support.
Hope this helps,
Best regards,

Code: Select all

$ ./movegen_compare	
Verify Engines...OK!

         Intel(R) Atom(TM) CPU N450   @ 1.66GHz

Million Lookups/s Random Squares, Random Occupation, Transposition Table Cache Pressure Simulation:
Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
Correlation Magic                  8.308243                      294976  [2304kb]    imul64                   no        Daniel Inf hr                                https://www.talkchess.com/forum3/viewtopic.php?f=7&t=82224#p949688
Spectral Magic                     7.434693                      294976  [2304kb]    none                     no        Daniel Inf hr                                https://www.talkchess.com/forum3/viewtopic.php?f=7&t=82224#p949688
Spectral Magic Switched            5.665985                      274773  [2146kb]    none                     no        Daniel Inf hr                                https://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=160
SBAMG o^(o-3cbn)                   22.926894                     576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline                       14.172273                     0       [0kb]       countl_zero, bswap       yes       Syed Fahad and Daniel Inf hr                 http://www.talkchess.com/forum3/viewtopic.php?t=59845
GaloisField - AVX512               0.389438                      0       [0kb]       AVX512F_GFNI             no        Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=81335
Hyperbola Quintessence o^(o-2r)    32.793463                     256     [2kb]       bswap                    no        Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline      10.564876                     0       [0kb]       bswap                    yes       Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Genetic 8 Ray                      7.372978                      0       [0kb]       bswap                    no        Daniel Inf hr (dangi12012)                   Abstract C++ Syntax Tree Sifter (c) Daniel Infuehr
Bitrotation                        2.480609                      0       [0kb]       ReverseBits              no        Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=20
Binary Neural Network              0.087823                      5852    [45kb]      pdep_u64, AVX2           no        Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79332
Exploding Bitboards                7.576257                      768     [6kb]       imul64                   no        Harald L á  en	                              http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=4523&start=80
Reference (Switch Lookup)          15.936083                     0       [0kb]       none                     yes       Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78235&p=907362&hilit=espresso#p907362
AVX Branchless Shift               6.843997                      0       [0kb]       AVX2                     no        Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=60
Pext Emulated                      6.888965                      107904  [843kb]     none                     no        Zach Wegner                                  https://randombit.net/bitbashing/posts/haswell_bit_permutations.html
Dumb7 Fill                         15.652464                     0       [0kb]       none                     no        Gunnar Andersson                             https://www.chessprogramming.org/Dumb7Fill        
Kogge-Stone                        25.938833                     0       [0kb]       none                     no        Peter M. Kogge, Harold S. Stone              https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards                  10.435298                     1848    [14kb]      none                     no        Robert Hyatt                                 https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine                          7.683107                      0       [0kb]       countr_zero, countl_zero yes       Fabio Gobbato                                https://www.chessprogramming.org/QBBEngine        
QBBEngine - Shifted Mask           7.548652                      0       [0kb]       countr_zero, countl_zero no        Fabio Gobbato                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=90#p924623
Classical Bob-Mike                 7.945425                      1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike                  8.939939                      520     [4kb]       countr_zero, countl_zero no        Michael Sherwin and Daniel Inf hr            http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=50#p924653
Leorik                             14.237092                     128     [1kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine   
Leorik Inline                      11.753295                     0       [0kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine   
Obstruction Difference             13.768134                     768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      10.471520                     0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Genetic Obstruction Difference     14.465473                     384     [3kb]       countl_zero              no        Daniel Inf hr and Michael Hoffmann           http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Genetic Obstruction Difference V2  14.531934                     768     [6kb]       countl_zero              no        Daniel Inf hr                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Slide Arithmetic                   14.093006                     256     [2kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Slide Arithmetic Inline            11.584118                     0       [0kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inf hr              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Kindergarten                       14.423311                     16640   [130kb]     imul64                   no        Urban Koistinen                              https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards                    2.478371                      180416  [1409kb]    none                     no        Michael Sherwin                              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73083
Kindergarten Super SISSY Bitboards 12.489517                     16672   [130kb]     imul64                   no        Michael Sherwin                              https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81234&start=30
Fancy Magic BB - Variable shift    12.879612                     93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       17.916294                     6468    [50kb]      none                     no        Daniel Inf hr                                tbd                                               
Plain Magic BB                     5.829206                      295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       15.015647                     88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     6.892822                      107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
HyperCube                          6.329301                      107680  [841kb]     none                     yes       Daniel Inf hr (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p916723

Code: Select all

$ cat /proc/cpuinfo | tail -n26
vendor_id	: GenuineIntel
cpu family	: 6
model		: 28
model name	: Intel(R) Atom(TM) CPU N450   @ 1.66GHz
stepping	: 10
microcode	: 0x107
cpu MHz		: 1611.148
cache size	: 512 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 1
apicid		: 1
initial apicid	: 1
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good nopl cpuid aperfmperf pni dtes64 monitor ds_cpl est tm2 ssse3 cx16 xtpr pdcm movbe lahf_lm dtherm
bugs		:
bogomips	: 3325.06
clflush size	: 64
cache_alignment	: 64
address sizes	: 32 bits physical, 48 bits virtual
power management: