Faster than fancy magic - fits in L1

Discussion of chess software programming and technical issues.

Moderator: Ras

dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Faster than fancy magic - fits in L1

Post by dangi12012 »

I finally consolidated all of my research of my into a movegeneration algorithms that fits into L1 and from my perspective is the final movegenerator I will create for the forseeable future.

It uses a lookup of just 768 slots in 6kb- which already makes it attractive since the transposition table congests with lookups from fancy magic which needs 694kb.
In essence it is a optimized galois field movegenerator with support for normal modern cpus. (So everything of the past decade)

Assembly for normal CPU:

Code: Select all

        mov     r8, qword ptr [rip + Chess_Lookup::Classified::msk@GOTPCREL]
        movsxd  rax, edi
        vmovdqa ymm3, ymmword ptr [rip + .LCPI0_0] # ymm3 = [15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15,15]
        vmovdqa ymm5, ymmword ptr [rip + .LCPI0_1] # ymm5 = [0,128,64,192,32,160,96,224,16,144,80,208,48,176,112,240,0,128,64,192,32,160,96,224,16,144,80,208,48,176,112,240]
        vmovdqa ymm6, ymmword ptr [rip + .LCPI0_2] # ymm6 = [0,8,4,12,2,10,6,14,1,9,5,13,3,11,7,15,0,8,4,12,2,10,6,14,1,9,5,13,3,11,7,15]
        mov     rcx, qword ptr [rip + Chess_Lookup::Classified::sql@GOTPCREL]
        vmovdqa ymm7, ymmword ptr [rip + .LCPI0_3] # ymm7 = [7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8,7,6,5,4,3,2,1,0,15,14,13,12,11,10,9,8]
        vmovq   xmm1, rsi
        mov     rdx, qword ptr [rip + Chess_Lookup::Classified::sqr@GOTPCREL]
        shl     rax, 5
        vpbroadcastq    ymm1, xmm1
        vmovdqa ymm0, ymmword ptr [r8 + rax]
        vpand   ymm1, ymm0, ymm1
        vpsubq  ymm2, ymm1, ymmword ptr [rcx + rax]
        vpand   ymm4, ymm1, ymm3
        vpsrlw  ymm1, ymm1, 4
        vpand   ymm1, ymm1, ymm3
        vpshufb ymm4, ymm5, ymm4a
        vpshufb ymm1, ymm6, ymm1
        vpshufb ymm4, ymm4, ymm7
        vpshufb ymm1, ymm1, ymm7
        vpor    ymm1, ymm1, ymm4
        vpsubq  ymm1, ymm1, ymmword ptr [rdx + rax]
        vpand   ymm4, ymm1, ymm3
        vpsrlw  ymm1, ymm1, 4
        vpand   ymm1, ymm1, ymm3
        vpshufb ymm4, ymm5, ymm4
        vpshufb ymm1, ymm6, ymm1
        vpshufb ymm3, ymm4, ymm7
        vpshufb ymm1, ymm1, ymm7
        vpor    ymm1, ymm1, ymm3
        vpxor   ymm1, ymm2, ymm1
        vpand   ymm0, ymm1, ymm0
        vextracti128    xmm1, ymm0, 1
        vpor    xmm0, xmm0, xmm1
        vpshufd xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        vpor    xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        vzeroupper
        ret
Assembly that gets generated if your CPU with -march=native supports GFNI and AVX512. (For example znver4)

Code: Select all

        mov     r8, qword ptr [rip + Chess_Lookup::Classified::msk@GOTPCREL]
        movsxd  rax, edi
        vpbroadcastq    ymm3, qword ptr [rip + .LCPI0_0] # ymm3 = [9241421688590303745,9241421688590303745,9241421688590303745,9241421688590303745]
        mov     rcx, qword ptr [rip + Chess_Lookup::Classified::sql@GOTPCREL]
        mov     rdx, qword ptr [rip + Chess_Lookup::Classified::sqr@GOTPCREL]
        vpbroadcastq    ymm1, rsi
        shl     rax, 5
        vmovdqa ymm0, ymmword ptr [r8 + rax]
        vpand   ymm1, ymm0, ymm1
        vpsubq  ymm2, ymm1, ymmword ptr [rcx + rax]
        vgf2p8affineqb  ymm1, ymm1, ymm3, 0
        vpsubq  ymm1, ymm1, ymmword ptr [rdx + rax]
        vgf2p8affineqb  ymm1, ymm1, ymm3, 0
        vpternlogq      ymm1, ymm0, ymm2, 72
        vextracti128    xmm0, ymm1, 1
        vpor    xmm0, xmm1, xmm0
        vpshufd xmm1, xmm0, 238                 # xmm1 = xmm0[2,3,2,3]
        vpor    xmm0, xmm0, xmm1
        vmovq   rax, xmm0
        vzeroupper
        ret
With that being said the last 5 instructions are not needed when the rest of the program also works on xmm registers directly.
Here is the result with the znver3 version:

--------------------------------------------------------------------------------------------------------------------------------
AMD Ryzen 9 5950X 16-Core Processor

Million Lookups/s Random Squares, Random Occupation/s:
Name Performance [MQueens/s] Tablesize Dependencies Template Author Reference
SBAMG o^(o-3cbn) 236.461030 576 [4kb] countl_zero, bswap yes Syed Fahad http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline 135.169323 0 [0kb] countl_zero, bswap yes Syed Fahad and Daniel Inführ http://www.talkchess.com/forum3/viewtopic.php?t=59845
GaloisField - AVX512 5.964714 0 [0kb] AVX512F_GFNI no Daniel Inführ (dangi12012) http://www.talkchess.com/forum3/viewtop ... =7&t=81335
Hyperbola Quintessence o^(o-2r) 293.663045 256 [2kb] bswap no Ryan Mack https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline 97.483496 0 [0kb] bswap yes Ryan Mack https://www.chessprogramming.org/Hyperbola_Quintessence
Genetic 8 Ray 44.142536 0 [0kb] bswap no Daniel Inführ (dangi12012) Abstract C++ Syntax Tree Sifter (c) Daniel Infuehr
Bitrotation 41.177125 0 [0kb] ReverseBits no Daniel Inführ (dangi12012) http://www.talkchess.com/forum3/viewtop ... 8&start=20
Unpublished 524.436100 768 [6kb] AVX2 no Daniel Inführ (dangi12012) Unpublished
Binary Neural Network 50.935576 5852 [45kb] pdep_u64, AVX2 no Daniel Inführ (dangi12012) http://www.talkchess.com/forum3/viewtop ... =7&t=79332
Exploding Bitboards 67.071720 768 [6kb] imul64 no Harald Lüßen http://www.open-aurec.com/wbforum/viewt ... 3&start=80
Reference (Switch Lookup) 36.798630 0 [0kb] none yes Daniel Inführ (dangi12012) http://www.talkchess.com/forum3/viewtop ... so#p907362
AVX Branchless Shift 197.760820 0 [0kb] AVX2 no Daniel Inführ (dangi12012) http://www.talkchess.com/forum3/viewtop ... 5&start=60
Pext Emulated 67.034859 107904 [843kb] none no Zach Wegner https://randombit.net/bitbashing/posts/ ... tions.html
Dumb7 Fill 69.428542 0 [0kb] none no Gunnar Andersson https://www.chessprogramming.org/Dumb7Fill
Kogge-Stone 113.308413 0 [0kb] none no Peter M. Kogge, Harold S. Stone https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards 46.859935 1848 [14kb] none no Robert Hyatt https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine 177.567695 0 [0kb] countr_zero, countl_zero yes Fabio Gobbato https://www.chessprogramming.org/QBBEngine
QBBEngine - Shifted Mask 169.989730 0 [0kb] countr_zero, countl_zero no Fabio Gobbato http://www.talkchess.com/forum3/viewtop ... 90#p924623
Classical Bob-Mike 207.630418 1024 [8kb] countr_zero, countl_zero yes Robert Hyatt and Michael Sherwin https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike 213.972092 520 [4kb] countr_zero, countl_zero no Michael Sherwin and Daniel Inführ http://www.talkchess.com/forum3/viewtop ... 50#p924653
Leorik 226.305699 128 [1kb] countl_zero no Thomas Jahn (lithander) https://github.com/lithander/MinimalChessEngine
Leorik Inline 114.931146 0 [0kb] countl_zero no Thomas Jahn (lithander) https://github.com/lithander/MinimalChessEngine
Obstruction Difference 241.274217 768 [6kb] countl_zero no Michael Hoffmann http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline 93.470013 0 [0kb] countl_zero yes Michael Hoffmann http://www.talkchess.com/forum3/viewtopic.php?t=29087
Genetic Obstruction Difference 272.860354 384 [3kb] countl_zero no Daniel Inführ and Michael Hoffmann http://www.talkchess.com/forum3/viewtop ... =7&t=79701
Genetic Obstruction Difference V2 314.512388 768 [6kb] countl_zero no Daniel Inführ http://www.talkchess.com/forum3/viewtop ... =7&t=79701
Slide Arithmetic 254.999044 256 [2kb] bzhi_u64, blsmsk_u64 no Jakob Progsch and Daniel Inführ http://www.talkchess.com/forum3/viewtop ... hm#p914767
Slide Arithmetic Inline 111.597532 0 [0kb] bzhi_u64, blsmsk_u64 no Jakob Progsch and Daniel Inführ http://www.talkchess.com/forum3/viewtop ... Arithm#p91
Kindergarten 448.205051 16640 [130kb] imul64 no Urban Koistinen https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards 208.441390 180416 [1409kb] none no Michael Sherwin http://www.talkchess.com/forum3/viewtop ... =7&t=73083
Kindergarten Super SISSY Bitboards 500.132535 16640 [130kb] imul64 no Michael Sherwin https://www.talkchess.com/forum3/viewto ... 4&start=30
Fancy Magic BB - Variable shift 400.688650 93376 [729kb] imul64 yes Pradu Kannan https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic 218.656741 6468 [50kb] none no Daniel Inführ tbd
Plain Magic BB 496.907579 295168 [2306kb] imul64 no Lasse Hansen https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift 511.605299 88891 [694kb] imul64 no Onno Garms and Volker Annuss https://www.chessprogramming.org/Magic_ ... hift_Fancy
Pext constexpr 797.212944 107904 [843kb] pext_u64 yes Zach Wegner https://www.chessprogramming.org/BMI2#PEXTBitboards
HyperCube 69.920951 107680 [841kb] none yes Daniel Inführ (dangi12012)
--------------------------------------------------------------------------------------------------------------------------------

I probably will publish once I can verify with a Zen4 processor - the AVX512 version looks much jucier in terms of performance and it replaces 20! instructions with a single one that takes 0.5 cylces to execute! So even with the overhead of 20 additional instructions it has the performance above.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Faster than fancy magic - fits in L1

Post by Mike Sherwin »

Hi Daniel,

Nice work! :D

But I have to ask if you saw martinn's improvement.
Kindergarten Super SISSY Bitboards 89.717466 16640 [130kb] imul64 no
Michael Sherwin https://www.talkchess.com/forum3/viewto ... 4&start=30
Modified KGSS 102.305105 16672 [130kb] imul64 no
https://www.talkchess.com/forum3/viewto ... 4&start=30
Fancy Magic BB - Variable shift 81.528829 93376 [729kb] imul64 yes
Pradu Kannan https://www.chessprogramming.org/Magic_Bitboards#Fancy
Plain Magic BB 88.170178 295168 [2306kb] imul64 no
Lasse Hansen https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift 90.846201 88891 [694kb] imul64 no
Onno Garms and Volker Annuss https://www.chessprogramming.org/Magic_ ... hift_Fancy

Code: Select all

    static uint64_t hMask[64];
    static uint64_t dMask[64];
    static uint64_t aMask[64];
    static uint64_t vSubset[64][64];
    static uint64_t hSubset[64][64];
    static uint64_t dSubset[64][64];
    static uint64_t aSubset[64][64];
    static unsigned int shift_horizontal_table[64]; // new lookup table for shifts in calculation of hIndex

    static constexpr uint64_t Size = (64 * 64 * 4 + 64 * 4) * sizeof(uint64_t) + 64 * sizeof(unsigned int);

    enum { FILEa = 0, FILEh = 7};
    enum { RANK1 = 0, RANK8 = 7};

    static void InitSquare(int sq)
    {
        int ts, dx, dy;
        uint64_t occ, index;
        int x = sq % 8;
        int y = sq / 8;

        // Initialize Kindergarten Super SISSY Bitboards
        // diagonals
        for (ts = sq + 9, dx = x + 1, dy = y + 1; dx < FILEh && dy < RANK8; dMask[sq] |= 1ull << ts, ts += 9, dx++, dy++);
        for (ts = sq - 9, dx = x - 1, dy = y - 1; dx > FILEa && dy > RANK1; dMask[sq] |= 1ull << ts, ts -= 9, dx--, dy--);

        // anti diagonals
        for (ts = sq + 7, dx = x - 1, dy = y + 1; dx > FILEa && dy < RANK8; aMask[sq] |= 1ull << ts, ts += 7, dx--, dy++);
        for (ts = sq - 7, dx = x + 1, dy = y - 1; dx < FILEh && dy > RANK1; aMask[sq] |= 1ull << ts, ts -= 7, dx++, dy--);

        // diagonal indexes
        for (index = 0; index < 64; index++)
        {
            dSubset[sq][index] = 0;
            occ = index << 1;
            if ((sq & 7) != FILEh && (sq >> 3) != RANK8)
            {
                for (ts = sq + 9; ; ts += 9)
                {
                    dSubset[sq][index] |= (1ull << ts);
                    if (occ & (1ull << (ts & 7))) break;
                    if ((ts & 7) == FILEh || (ts >> 3) == RANK8) break;
                }
            }
            if ((sq & 7) != FILEa && (sq >> 3) != RANK1)
            {
                for (ts = sq - 9; ; ts -= 9)
                {
                    dSubset[sq][index] |= (1ull << ts);
                    if (occ & (1ull << (ts & 7))) break;
                    if ((ts & 7) == FILEa || (ts >> 3) == RANK1) break;
                }
            }
        }

        // anti diagonal indexes
        for (index = 0; index < 64; index++)
        {
            aSubset[sq][index] = 0;
            occ = index << 1;
            if ((sq & 7) != FILEa && (sq >> 3) != RANK8)
            {
                for (ts = sq + 7; ; ts += 7) {
                    aSubset[sq][index] |= (1ull << ts);
                    if (occ & (1ull << (ts & 7))) break;
                    if ((ts & 7) == FILEa || (ts >> 3) == RANK8) break;
                }
            }
            if ((sq & 7) != FILEh && (sq >> 3) != RANK1)
            {
                for (ts = sq - 7; ; ts -= 7)
                {
                    aSubset[sq][index] |= (1ull << ts);
                    if (occ & (1ull << (ts & 7))) break;
                    if ((ts & 7) == FILEh || (ts >> 3) == RANK1) break;
                }
            }
        }

        // horizontal lines
        for (ts = sq + 1, dx = x + 1; dx < FILEh; hMask[sq] |= 1ull << ts, ts += 1, dx++);
        for (ts = sq - 1, dx = x - 1; dx > FILEa; hMask[sq] |= 1ull << ts, ts -= 1, dx--);

        // vertical indexes
        for (index = 0; index < 64; index++) {
            vSubset[sq][index] = 0;
            uint64_t blockers = 0;
            for (int i = 0; i <= 5; i++) {
                if (index & (1ull << i)) {
                    blockers |= (1ull << (((5 - i) << 3) + 8));
                }
            }
            if ((sq >> 3) != RANK8) {
                for (ts = sq + 8; ; ts += 8) {
                    vSubset[sq][index] |= (1ull << ts);
                    if (blockers & (1ull << (ts - (ts & 7)))) break;
                    if ((ts >> 3) == RANK8) break;
                }
            }
            if ((sq >> 3) != RANK1) {
                for (ts = sq - 8; ; ts -= 8) {
                    vSubset[sq][index] |= (1ull << ts);
                    if (blockers & (1ull << (ts - (ts & 7)))) break;
                    if ((ts >> 3) == RANK1) break;
                }
            }
        }

        // horizontal indexes
        for (index = 0; index < 64; index++) {
            hSubset[sq][index] = 0;
            occ = index << 1;
            if ((sq & 7) != FILEh) {
                for (ts = sq + 1; ; ts += 1) {
                    hSubset[sq][index] |= (1ull << ts);
                    if (occ & (1ull << (ts & 7))) break;
                    if ((ts & 7) == FILEh) break;
                }
            }
            if ((sq & 7) != FILEa) {
                for (ts = sq - 1; ; ts -= 1) {
                    hSubset[sq][index] |= (1ull << ts);
                    if (occ & (1ull << (ts & 7))) break;
                    if ((ts & 7) == FILEa) break;
                }
            }
        }
    }

    static void Init()
    {
        for (int i = 0; i < 64; i++) {
            InitSquare(i);
        }
        
        for (int i = 0; i < 64; i++) {
            shift_horizontal_table[i] = (i & 56) + 1;
        }
    }

    static constexpr uint64_t file_b2_b7 = 0x0002020202020200ull;
    static constexpr uint64_t file_a2_a7 = 0x0001010101010100ull;
    static constexpr uint64_t diag_c2h7  = 0x0080402010080400ull;

    static uint64_t Bishop(int sq, uint64_t occ)
    {
        return  dSubset[sq][(((occ & dMask[sq]) * file_b2_b7) >> 58)] +
                aSubset[sq][(((occ & aMask[sq]) * file_b2_b7) >> 58)];
    }

    static uint64_t Rook(int sq, uint64_t occ)
    {
        return hSubset[sq][(occ >> shift_horizontal_table[sq]) & 63] +
               vSubset[sq][((((occ >> (sq & 7)) & file_a2_a7) * diag_c2h7) >> 58)];
    }

    static uint64_t Queen(int sq, uint64_t occ)
    {
        return Rook(sq, occ) | Bishop(sq, occ);
    }
I'm still recovering from my hospital stay so I have not tried it yet. If my mind will work maybe I can give it a try today.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Faster than fancy magic - fits in L1

Post by Mike Sherwin »

I don't know what I was afraid of, the changes were easy to make.

My R9 3950x does not seem as friendly to my code as other AMD cpu's are.

Kindergarten 373.199313 16640 [130kb] imul64 no Urban Koistinen https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards 137.815618 180416 [1409kb] none no Michael Sherwin http://www.talkchess.com/forum3/viewtop ... =7&t=73083
(martinn) Kindergarten Super SISSY Bitboards 429.176568 16640 [130kb] imul64 no Michael Sherwin https://www.talkchess.com/forum3/viewto ... 4&start=30
Fancy Magic BB - Variable shift 317.670081 93376 [729kb] imul64 yes Pradu Kannan https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic 184.006521 6468 [50kb] none no Daniel Inführ tbd
Plain Magic BB 445.170237 295168 [2306kb] imul64 no Lasse Hansen https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift 482.550957 88891 [694kb] imul64 no Onno Garms and Volker Annuss https://www.chessprogramming.org/Magic_ ... hift_Fancy
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Faster than fancy magic - fits in L1

Post by dangi12012 »

Hey glad you are getting better.

Here is my run to run - but its only 1 system and 1 compiler so once I release it people can make their own conclusions. I updated with the code you posted above so it can be seen that both algorithms are reliably better than the older "fancy magic" many people use for the last many years - both in speed and table size.

I maybe will release before I can test the faster (Intel 13th or Zen4) AVX512 codepath - each day runs faster and faster and you never know how much time you have left to push the envelope in a topic dear to your heart.

RUN1:

Code: Select all

AMD Ryzen 9 5950X 16-Core Processor

Million Lookups/s Random Squares, Random Occupation/s:
Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
SBAMG o^(o-3cbn)                   227.778845                    576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline                       126.747719                    0       [0kb]       countl_zero, bswap       yes       Syed Fahad and Daniel Inführ                 http://www.talkchess.com/forum3/viewtopic.php?t=59845
GaloisField - AVX512               5.758809                      0       [0kb]       AVX512F_GFNI             no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=81335
Hyperbola Quintessence o^(o-2r)    286.238921                    256     [2kb]       bswap                    no        Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline      94.225041                     0       [0kb]       bswap                    yes       Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Genetic 8 Ray                      42.574385                     0       [0kb]       bswap                    no        Daniel Inführ (dangi12012)                   Abstract C++ Syntax Tree Sifter (c) Daniel Infuehr
Bitrotation                        39.602073                     0       [0kb]       ReverseBits              no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=20
CLASSIFIED			   519.596135                    768     [6kb]       AVX2                     no        Daniel Inführ (dangi12012)                   https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81707&sid=ec3532ea1ce1e8014d2b579b29fddd34
Binary Neural Network              50.579152                     5852    [45kb]      pdep_u64, AVX2           no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79332
Exploding Bitboards                63.710750                     768     [6kb]       imul64                   no        Harald Lüßen                                  http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=4523&start=80
Reference (Switch Lookup)          34.935088                     0       [0kb]       none                     yes       Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78235&p=907362&hilit=espresso#p907362
AVX Branchless Shift               190.211766                    0       [0kb]       AVX2                     no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=60
Pext Emulated                      64.885488                     107904  [843kb]     none                     no        Zach Wegner                                  https://randombit.net/bitbashing/posts/haswell_bit_permutations.html
Dumb7 Fill                         66.033988                     0       [0kb]       none                     no        Gunnar Andersson                             https://www.chessprogramming.org/Dumb7Fill
Kogge-Stone                        109.478283                    0       [0kb]       none                     no        Peter M. Kogge, Harold S. Stone              https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards                  44.925100                     1848    [14kb]      none                     no        Robert Hyatt                                 https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine                          169.303809                    0       [0kb]       countr_zero, countl_zero yes       Fabio Gobbato                                https://www.chessprogramming.org/QBBEngine
QBBEngine - Shifted Mask           162.268909                    0       [0kb]       countr_zero, countl_zero no        Fabio Gobbato                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=90#p924623
Classical Bob-Mike                 200.064954                    1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike                  207.116452                    520     [4kb]       countr_zero, countl_zero no        Michael Sherwin and Daniel Inführ            http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=50#p924653
Leorik                             218.016918                    128     [1kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Leorik Inline                      109.900496                    0       [0kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Obstruction Difference             234.004432                    768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      89.950019                     0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Genetic Obstruction Difference     263.007934                    384     [3kb]       countl_zero              no        Daniel Inführ and Michael Hoffmann           http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Genetic Obstruction Difference V2  284.360998                    768     [6kb]       countl_zero              no        Daniel Inführ                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Slide Arithmetic                   249.061350                    256     [2kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inführ              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Slide Arithmetic Inline            107.228391                    0       [0kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inführ              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Kindergarten                       438.541674                    16640   [130kb]     imul64                   no        Urban Koistinen                              https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards                    200.003000                    180416  [1409kb]    none                     no        Michael Sherwin                              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73083
Kindergarten Super SISSY Bitboards 508.602589                    16672   [130kb]     imul64                   no        Michael Sherwin                              https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81234&start=30
Fancy Magic BB - Variable shift    373.689518                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       214.786012                    6468    [50kb]      none                     no        Daniel Inführ                                tbd
Plain Magic BB                     459.326979                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       503.052691                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     762.092503                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
HyperCube                          68.921178                     107680  [841kb]     none                     yes       Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p916723
Run2:

Code: Select all

Verify Engines...OK!

AMD Ryzen 9 5950X 16-Core Processor

Million Lookups/s Random Squares, Random Occupation/s:
Name                               Performance [MQueens/s]       Tablesize           Dependencies             Template  Author                                       Reference
SBAMG o^(o-3cbn)                   230.790357                    576     [4kb]       countl_zero, bswap       yes       Syed Fahad                                   http://www.talkchess.com/forum3/viewtopic.php?t=59845
SBAMG Inline                       126.820857                    0       [0kb]       countl_zero, bswap       yes       Syed Fahad and Daniel Inführ                 http://www.talkchess.com/forum3/viewtopic.php?t=59845
GaloisField - AVX512               5.576593                      0       [0kb]       AVX512F_GFNI             no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=81335
Hyperbola Quintessence o^(o-2r)    256.252344                    256     [2kb]       bswap                    no        Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Hyperbola Quintessence Inline      90.952045                     0       [0kb]       bswap                    yes       Ryan Mack                                    https://www.chessprogramming.org/Hyperbola_Quintessence
Genetic 8 Ray                      40.505088                     0       [0kb]       bswap                    no        Daniel Inführ (dangi12012)                   Abstract C++ Syntax Tree Sifter (c) Daniel Infuehr
Bitrotation                        37.074814                     0       [0kb]       ReverseBits              no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=20
CLASSIFIED			   529.253144                    768     [6kb]       AVX2                     no        Daniel Inführ (dangi12012)                   https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81707&sid=ec3532ea1ce1e8014d2b579b29fddd34
Binary Neural Network              51.032146                     5852    [45kb]      pdep_u64, AVX2           no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79332
Exploding Bitboards                64.321174                     768     [6kb]       imul64                   no        Harald Lüßen                                  http://www.open-aurec.com/wbforum/viewtopic.php?f=4&t=4523&start=80
Reference (Switch Lookup)          35.303614                     0       [0kb]       none                     yes       Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78235&p=907362&hilit=espresso#p907362
AVX Branchless Shift               190.284577                    0       [0kb]       AVX2                     no        Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=60
Pext Emulated                      64.468048                     107904  [843kb]     none                     no        Zach Wegner                                  https://randombit.net/bitbashing/posts/haswell_bit_permutations.html
Dumb7 Fill                         65.857555                     0       [0kb]       none                     no        Gunnar Andersson                             https://www.chessprogramming.org/Dumb7Fill
Kogge-Stone                        110.149385                    0       [0kb]       none                     no        Peter M. Kogge, Harold S. Stone              https://www.chessprogramming.org/Kogge-Stone_Algorithm
Rotated Bitboards                  44.562284                     1848    [14kb]      none                     no        Robert Hyatt                                 https://www.chessprogramming.org/Rotated_Bitboards
QBBEngine                          167.786924                    0       [0kb]       countr_zero, countl_zero yes       Fabio Gobbato                                https://www.chessprogramming.org/QBBEngine
QBBEngine - Shifted Mask           162.279530                    0       [0kb]       countr_zero, countl_zero no        Fabio Gobbato                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79005&start=90#p924623
Classical Bob-Mike                 196.511141                    1024    [8kb]       countr_zero, countl_zero yes       Robert Hyatt and Michael Sherwin             https://www.chessprogramming.org/Classical_Approach
Advanced Bob-Mike                  208.266225                    520     [4kb]       countr_zero, countl_zero no        Michael Sherwin and Daniel Inführ            http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79078&start=50#p924653
Leorik                             217.702951                    128     [1kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Leorik Inline                      107.126272                    0       [0kb]       countl_zero              no        Thomas Jahn (lithander)                      https://github.com/lithander/MinimalChessEngine
Obstruction Difference             232.726189                    768     [6kb]       countl_zero              no        Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Obstruction Difference Inline      89.840547                     0       [0kb]       countl_zero              yes       Michael Hoffmann                             http://www.talkchess.com/forum3/viewtopic.php?t=29087
Genetic Obstruction Difference     266.869487                    384     [3kb]       countl_zero              no        Daniel Inführ and Michael Hoffmann           http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Genetic Obstruction Difference V2  290.625300                    768     [6kb]       countl_zero              no        Daniel Inführ                                http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79701
Slide Arithmetic                   248.601925                    256     [2kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inführ              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Slide Arithmetic Inline            106.298129                    0       [0kb]       bzhi_u64, blsmsk_u64     no        Jakob Progsch and Daniel Inführ              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=78693&p=914767&hilit=SlideArithm#p914767
Kindergarten                       437.261692                    16640   [130kb]     imul64                   no        Urban Koistinen                              https://www.chessprogramming.org/Kindergarten_Bitboards
SISSY Bitboards                    192.489694                    180416  [1409kb]    none                     no        Michael Sherwin                              http://www.talkchess.com/forum3/viewtopic.php?f=7&t=73083
Kindergarten Super SISSY Bitboards 509.400995                    16672   [130kb]     imul64                   no        Michael Sherwin                              https://www.talkchess.com/forum3/viewtopic.php?f=7&t=81234&start=30
Fancy Magic BB - Variable shift    394.711394                    93376   [729kb]     imul64                   yes       Pradu Kannan                                 https://www.chessprogramming.org/Magic_Bitboards#Fancy
FoldingHash - 4x fancy magic       212.353374                    6468    [50kb]      none                     no        Daniel Inführ                                tbd
Plain Magic BB                     468.303283                    295168  [2306kb]    imul64                   no        Lasse Hansen                                 https://www.chessprogramming.org/Magic_Bitboards#Plain
Black Magic BB - Fixed shift       509.266529                    88891   [694kb]     imul64                   no        Onno Garms and Volker Annuss                 https://www.chessprogramming.org/Magic_Bitboards#Fixed_shift_Fancy
Pext constexpr                     743.988572                    107904  [843kb]     pext_u64                 yes       Zach Wegner                                  https://www.chessprogramming.org/BMI2#PEXTBitboards
HyperCube                          68.823543                     107680  [841kb]     none                     yes       Daniel Inführ (dangi12012)                   http://www.talkchess.com/forum3/viewtopic.php?f=7&t=79004&p=916723&hilit=hypercube#p91672
Also a Zen4 7800x3d needs less energy than a 3950x in idle and otherwise so with german energy failed energy policies and insane prices - it makes a difference. As a 28 year old with 3 children its important to have low energy costs.
So Zen4 system coming this year for me.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Faster than fancy magic - fits in L1

Post by Mike Sherwin »

Thanks Daniel for verifying Martin Novak's modification. I never thought a memory lookup would be faster than a couple of operations on a register. Martin did not say what compiler he used to get a 13% increase over Black Magic. Maybe he will let us know.

Yes, we need some people to do more testing if the truth is to be established. In both runs KGSS did come ahead of Black Magic by a tiny margin. So I'm pleased with that result! :D
martinn
Posts: 20
Joined: Fri Mar 10, 2023 9:33 am
Full name: Martin Novák

Re: Faster than fancy magic - fits in L1

Post by martinn »

Hi Mike and Daniel,
I used GCC compiler when I was sharing my results. My MSVC is not big fun of Black Magic:

AMD Ryzen 7 5800H with Radeon Graphics

Million Lookups/s Random Squares, Random Occupation/s:
Name Performance [MQueens/s] Tablesize
SISSY Bitboards 51.930634 180416 [1409kb]
Kindergarten Super SISSY Bitboards 58.108870 16672 [130kb]
KGSSB constexpr 58.518276 16672 [130kb]
Fancy Magic BB - Variable shift 22.130832 93376 [729kb]
FoldingHash - 4x fancy magic 19.368410 6468 [50kb]
Plain Magic BB 55.146025 295168 [2306kb]
Black Magic BB - Fixed shift 23.363252 88891 [694kb]

I tried to make everything constexpr (in KGSSB constexpr) but the speedup is minimal. (The only good thing about it is that you don't have to call Init function.)

With GCC results are much better don't know why.

Kindergarten Super SISSY Bitboards 101.452907 16672 [130kb]
Fancy Magic BB - Variable shift 82.953453 93376 [729kb]
Plain Magic BB 87.797533 295168 [2306kb]
Black Magic BB - Fixed shift 90.756345 88891 [694kb]
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Faster than fancy magic - fits in L1

Post by Mike Sherwin »

martinn wrote: Sat Mar 18, 2023 6:50 pm Hi Mike and Daniel,
I used GCC compiler when I was sharing my results. My MSVC is not big fun of Black Magic:

AMD Ryzen 7 5800H with Radeon Graphics

Million Lookups/s Random Squares, Random Occupation/s:
Name Performance [MQueens/s] Tablesize
SISSY Bitboards 51.930634 180416 [1409kb]
Kindergarten Super SISSY Bitboards 58.108870 16672 [130kb]
KGSSB constexpr 58.518276 16672 [130kb]
Fancy Magic BB - Variable shift 22.130832 93376 [729kb]
FoldingHash - 4x fancy magic 19.368410 6468 [50kb]
Plain Magic BB 55.146025 295168 [2306kb]
Black Magic BB - Fixed shift 23.363252 88891 [694kb]

I tried to make everything constexpr (in KGSSB constexpr) but the speedup is minimal. (The only good thing about it is that you don't have to call Init function.)

With GCC results are much better don't know why.

Kindergarten Super SISSY Bitboards 101.452907 16672 [130kb]
Fancy Magic BB - Variable shift 82.953453 93376 [729kb]
Plain Magic BB 87.797533 295168 [2306kb]
Black Magic BB - Fixed shift 90.756345 88891 [694kb]
Thanks Martin for the compiler information. I looked up the specs for the 5800H. It does not have very much cache memory. That might be why KGSSB does so well in your test since it uses less memory than Magic. And that is good to know because a lot of chess engine enthusiast use laptops! :D
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Faster than fancy magic - fits in L1

Post by dangi12012 »

Update on this movegen. I could successfully integrate it into a bigger project - and what is really missing is finally for me getting an AVX512 capable CPU. While using AVX2 its just around the same speed as existing solutions (a tad faster) but for instance:

Calculating std::countr_zero() for 4 elements at once requires this handstand:

Code: Select all

inline u64x4 find_first_set() const {
		__m256i lsb = _mm256_and_si256(YMM, _mm256_sub_epi64(_mm256_setzero_si256(), YMM));
		lsb = _mm256_sub_epi64(lsb, _mm256_set1_epi64x(1));

		//https://arxiv.org/pdf/1611.07612.pdf
		__m256i lookup = _mm256_setr_epi8(0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4, 0, 1, 1, 2, 1, 2, 2, 3, 1, 2, 2, 3, 2, 3, 3, 4);
		__m256i low_mask = _mm256_set1_epi8(0x0f);
		__m256i lo = _mm256_and_si256(lsb , low_mask);
		__m256i hi = _mm256_and_si256(_mm256_srli_epi32(lsb , 4), low_mask);
		__m256i popcnt1 = _mm256_shuffle_epi8(lookup, lo);
		__m256i popcnt2 = _mm256_shuffle_epi8(lookup, hi);
		__m256i total = _mm256_add_epi8(popcnt1, popcnt2);
		return _mm256_sad_epu8(total, _mm256_setzero_si256());
}
Equivalent to this 4times and in 64 bits:

Code: Select all

unsigned int lowest_set_bit_index(unsigned int num) {
    return (num & -num) ? __builtin_popcount((num & -num) - 1) : 32;
}
Which is quite fast (faster than 4x std::countr_zero which is one instruction to begin with)
BUT: with proper AVX512 this comes down to:

Code: Select all

return _mm256_tzcnt_epi64(YMM); 
So currently it comes down to 450Million nodes/s for kiwipete and other perf start positions (without bulk counting and no zobrist hash and on a single thread) + a small eval function for each leaf.
With having movegen, board structure and other parts of the system being fully vectorized. Pext is not even applicable to this kind of system.
What I mean by that is that I expect a huge speedup by AVX512 beyond any performance seen so far. A fully vectorized system returns YMM registers and works on them natively:
For instance a rook and an unrelated bishop can be solved in one call by getting the masks - and then running the attack4 function.

Code: Select all

static inline u64x4 RookBishop_Rays(int sq_rook, int sq_bish, uint64_t occ) {
		return attack4(occ,
			_mm256_blend_epi32(sql[sq_rook].YMM, sql[sq_bish].YMM, 0b11110000),
			_mm256_blend_epi32(sqr[sq_rook].YMM, sqr[sq_bish].YMM, 0b11110000),
			_mm256_blend_epi32(msk[sq_rook].YMM, msk[sq_bish].YMM, 0b11110000));
	}
Whereas a queen is this:

Code: Select all

static inline u64x4 HV_D12(int sq, uint64_t occ) {
        return attack4(occ, sql[sq], sqr[sq], msk[sq]);
}
returing the 4 rays seperately but in a single register has the advantage of being able to also solve pin situations in 1 instruction further down the line.

Additional advantage:
Then the usual #define Bitloop(X) for(;X; X &= (X - 1))
Becomes this: #define Bitloop_4x64(X) for(;X.is_not_zero(); X &= (X - 1))

Code: Select all

inline bool is_not_zero() const {
		return _mm256_testz_si256(YMM, YMM) == 0;
	}
Same number of conditional branches - 4x the work.

So in summary - I feel very confident to publish further results here with AVX512 in the coming months. I have identified ~15 mm256* AVX512 instructions that are very very juicy for chess and replace multiple lines of code for many bitboards with 1 instruction.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
dangi12012
Posts: 1062
Joined: Tue Apr 28, 2020 10:03 pm
Full name: Daniel Infuehr

Re: Faster than fancy magic - fits in L1

Post by dangi12012 »

So now its truly coming together. Having a full running implementation of SIMD __m256i (which is 4x uint64_t)

I have a full move implementation that goes from a native SIMD board via shuffle directly into the new movegenerator - which spits out the target bits again in 4x uint64_t simdwise manner.
These bitboards get enumerated like you would normally do it - but 4bbs at a time.

For example all sliders - independent of bishop, rook, queen can be consolidated into this loop:
Pins and check evasion is solved by respective masks - pin_valid and check_evasion like in the gigantua movegen.

Code: Select all

u64x4 sliders = brd.rrbb_vec() & pin_valid;
Bitloop_4x64(sliders) {
u64x4 source_squares = sliders.get_lsb();
auto mv = MV::Quad_Sliders(sliders.find_first_set(), occ) & check_evasion;
Bitloop_4x64(mv) {
	apply_move_sliders<core>(brd, source_squares, mv.get_lsb(), mv.is_not_zero_mask());
}}
Now these bitboards can be shuffled and masked in a way that the output of the respective least significant bits feed back into the position directly- to change the bits in the way that corresponds to a move.
Most helpful was this resource: https://www.intel.com/content/www/us/en ... hs=AVX_ALL

Now this V2 movegen does not use templates anymore - so compilation is fast, does not use PEXT or Fancy magic but the optimized galois field movegen which works on very well also on AVX2.

So here is the results with disabled bulk counting, no transposition table and single thread on an ryzen 5950x:

Gigantua - bulk count disabled by chaning this to perft0 and if constexpr depth 0 https://github.com/Gigantua/Gigantua/bl ... ua.cpp#L35
compiletime: 4min

Code: Select all

Perft Start 1: 20 4ms 0.00405186 MNodes/s
Perft Start 2: 400 7ms 0.0513018 MNodes/s
Perft Start 3: 8902 4ms 1.79367 MNodes/s
Perft Start 4: 197281 12ms 15.6237 MNodes/s
Perft Start 5: 4865609 14ms 325.96 MNodes/s
Perft Start 6: 119060324 239ms 498.054 MNodes/s
Perft Start 7: 3195901860 6332ms 504.706 MNodes/s
OK

Perft Kiwi 1: 48 0ms 12 MNodes/s
Perft Kiwi 2: 2039 0ms 145.643 MNodes/s
Perft Kiwi 3: 97862 0ms 482.079 MNodes/s
Perft Kiwi 4: 4085603 7ms 510.956 MNodes/s
Perft Kiwi 5: 193690690 355ms 544.765 MNodes/s
Perft Kiwi 6: 8031647685 15545ms 516.643 MNodes/s
OK

Perft Midgame 1: 46 1ms 0.028732 MNodes/s
Perft Midgame 2: 2079 0ms 189 MNodes/s
Perft Midgame 3: 89890 0ms 505 MNodes/s
Perft Midgame 4: 3894594 6ms 581.978 MNodes/s
Perft Midgame 5: 164075551 288ms 568.779 MNodes/s
Perft Midgame 6: 6923051137 12289ms 563.321 MNodes/s
OK

Perft Endgame 1: 38 0ms 9.5 MNodes/s
Perft Endgame 2: 1129 0ms 161.286 MNodes/s
Perft Endgame 3: 37035 0ms 420.852 MNodes/s
Perft Endgame 4: 1023977 1ms 538.935 MNodes/s
Perft Endgame 5: 31265700 62ms 501.053 MNodes/s
Perft Endgame 6: 849167880 1566ms 541.987 MNodes/s
OK

Perft aggregate: 18999768562 36744ms 517.078 MNodes/s

SIMD Movegen (galois field), SIMD board structure, template free + color agnostic, no bulk count no TT single thread.
compiletime: 15s

Code: Select all

Perft start 1: 20 0ms 1.16347 MNodes/s
Perft start 2: 400 0ms 212.766 MNodes/s
Perft start 3: 8902 0ms 488.316 MNodes/s
Perft start 4: 197281 0ms 547.581 MNodes/s
Perft start 5: 4865609 7ms 615.325 MNodes/s
Perft start 6: 119060324 200ms 594.959 MNodes/s
Perft start 7: 3195901860 4913ms 650.472 MNodes/s

Perft kiwi 1: 48 0ms 69.5652 MNodes/s
Perft kiwi 2: 2039 0ms 723.05 MNodes/s
Perft kiwi 3: 97862 0ms 1167.53 MNodes/s
Perft kiwi 4: 4085603 3ms 1039.66 MNodes/s
Perft kiwi 5: 193690690 162ms 1192.62 MNodes/s
Perft kiwi 6: 8031647685 7992ms 1004.85 MNodes/s

Perft midgame 1: 46 0ms 57.5 MNodes/s
Perft midgame 2: 2079 0ms 434.937 MNodes/s
Perft midgame 3: 89890 0ms 550.5 MNodes/s
Perft midgame 4: 3894594 6ms 587.478 MNodes/s
Perft midgame 5: 164075551 270ms 607.21 MNodes/s
Perft midgame 6: 6923051137 11039ms 627.097 MNodes/s

Perft endgame 1: 38 0ms 52.7778 MNodes/s
Perft endgame 2: 1129 0ms 530.047 MNodes/s
Perft endgame 3: 37035 0ms 924.027 MNodes/s
Perft endgame 4: 1023977 1ms 845.535 MNodes/s
Perft endgame 5: 31265700 35ms 883.452 MNodes/s
Perft endgame 6: 849167880 1052ms 807.175 MNodes/s

Perft aggregate: 18999768562 25686ms 739.667 MNodes/s
Todos: Hash for the leaf nodes
This is still with AVX2 where there are many functions that can be made much faster with avx512 but I hit a wall for now.

A true novelty: with avx2 its quite cheap to do inversion of all bits in 4x uint64_t - so my pawns dont walk up and down the board (<< 8 >>) but the color to move is always on the left of the board! so a pawn marching forward is always pawn >> 1. This has huge advantages of not needing to "if white" or lookup[color] or having seperate codepath as well as not needing a mask on the a and h files for the pawns. Many constants also consolidate away etc. Truly color agnostic chessprogramming helps a lot in terms of performance.
Worlds-fastest-Bitboard-Chess-Movegenerator
Daniel Inführ - Software Developer
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Faster than fancy magic - fits in L1

Post by Mike Sherwin »

When you are ready to fully publish this to make it available to everyone you should also make it a header file library so everyone can actually use it. :D