I had a productive lunch break and implemented two novel move generators. I'm not even kidding. PEXT makes this so easy.
So my original PEXT implementation (which I understand to be the 'classic' version) looks like this:
Code: Select all
using System.Runtime.Intrinsics.X86; // for Bmi2.X64.ParallelBitExtract

static readonly ulong[] Attacks = new ulong[5248 + 102400];
static readonly ulong[] BishopOffset = new ulong[64];
static readonly ulong[] RookOffset = new ulong[64];
static readonly ulong[] BishopMask = new ulong[64];
static readonly ulong[] RookMask = new ulong[64];

public static ulong RookAttacks(ulong occupation, int square)
{
    return Attacks[RookOffset[square] + Bmi2.X64.ParallelBitExtract(occupation, RookMask[square])];
}

public static ulong BishopAttacks(ulong occupation, int square)
{
    return Attacks[BishopOffset[square] + Bmi2.X64.ParallelBitExtract(occupation, BishopMask[square])];
}
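In case you haven't used PEXT before: ParallelBitExtract gathers the occupation bits selected by the mask into the low end of the result, and that packed value is exactly the index into the attack table. A tiny example (values made up for illustration):
Code: Select all
// PEXT collects the source bits at the mask's set positions
// and packs them into the low bits of the result.
ulong occupation = 0b0010_1000; // bits 3 and 5 are set
ulong mask       = 0b0110_1010; // selects bits 1, 3, 5 and 6
ulong index = Bmi2.X64.ParallelBitExtract(occupation, mask);
// bit 1 -> 0, bit 3 -> 1, bit 5 -> 1, bit 6 -> 0  =>  index == 0b0110 == 6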
Here all the bitboards are pregenerated. There exist 5248 bishop attacks for all legal combinations of bishop positions and blockers! And a whopping 102400 rook attacks! So that's more than 100K bitboards we need to store.
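If you want to double-check those numbers yourself, summing 2^popcount over every square's relevant-occupancy mask reproduces them. Here's a throwaway program (not from Leorik, just for verification) that does exactly that:
Code: Select all
using System;
using System.Numerics;

class CountTableEntries
{
    // Walks one ray but stops before the board edge: the last square of a
    // ray never changes the attack pattern, so it is not part of the mask.
    static ulong RelevantRay(int rank, int file, int dr, int df)
    {
        ulong mask = 0;
        int r = rank + dr, f = file + df;
        while (r + dr >= 0 && r + dr <= 7 && f + df >= 0 && f + df <= 7)
        {
            mask |= 1UL << (r * 8 + f);
            r += dr; f += df;
        }
        return mask;
    }

    static void Main()
    {
        (int dr, int df)[] rookDirs = { (0, 1), (0, -1), (1, 0), (-1, 0) };
        (int dr, int df)[] bishopDirs = { (1, 1), (1, -1), (-1, 1), (-1, -1) };
        long rook = 0, bishop = 0;
        for (int square = 0; square < 64; square++)
        {
            ulong rookMask = 0, bishopMask = 0;
            foreach (var (dr, df) in rookDirs)
                rookMask |= RelevantRay(square / 8, square % 8, dr, df);
            foreach (var (dr, df) in bishopDirs)
                bishopMask |= RelevantRay(square / 8, square % 8, dr, df);
            // a mask with N relevant bits contributes 2^N table entries
            rook += 1L << BitOperations.PopCount(rookMask);
            bishop += 1L << BitOperations.PopCount(bishopMask);
        }
        Console.WriteLine($"Rook: {rook}, Bishop: {bishop}"); // 102400 and 5248
    }
}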
Now the subset idea as used in KiSS comes into play: if we can reconstruct the attack bitboard from two subsets at runtime, we don't need to store so many bitboards. How many subsets do we have to store if we pack them as densely as possible? The answer is 1664 for the bishop and 5120 for the rook:
Code: Select all
static readonly ulong[] DiagonalMask = new ulong[64];
static readonly ulong[] AntiDiagonalMask = new ulong[64];
static readonly ulong[] HorizontalMask = new ulong[64];
static readonly ulong[] VerticalMask = new ulong[64];
static readonly ulong[] Attacks = new ulong[1664 + 5120]; // 6784 entries total
static readonly int[] DiagonalOffset = new int[64];
static readonly int[] AntiDiagonalOffset = new int[64];
static readonly int[] HorizontalOffset = new int[64];
static readonly int[] VerticalOffset = new int[64];

public static ulong RookAttacks(ulong occupation, int square)
{
    // the rook attack set is the union of its horizontal and vertical line attacks
    ulong a = Attacks[HorizontalOffset[square] + (int)Bmi2.X64.ParallelBitExtract(occupation, HorizontalMask[square])];
    ulong b = Attacks[VerticalOffset[square] + (int)Bmi2.X64.ParallelBitExtract(occupation, VerticalMask[square])];
    return a | b;
}

public static ulong BishopAttacks(ulong occupation, int square)
{
    // same for the bishop: diagonal | anti-diagonal
    ulong a = Attacks[DiagonalOffset[square] + (int)Bmi2.X64.ParallelBitExtract(occupation, DiagonalMask[square])];
    ulong b = Attacks[AntiDiagonalOffset[square] + (int)Bmi2.X64.ParallelBitExtract(occupation, AntiDiagonalMask[square])];
    return a | b;
}
Note how we now need twice as many 64-entry arrays to look up offsets, and twice as many masks, so you may add that on top. But the memory footprint is still way less than 10K bitboards: 6784 bitboards (roughly 53 KB) plus a few small offset and mask arrays, versus more than 100K bitboards (roughly 841 KB) for classical PEXT. An order of magnitude less!
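For completeness, here is roughly how such a densely packed table can be filled at startup. This is a sketch of the idea, not the exact initializer from Leorik: SlideAttacks is a plain (slow) ray-walker that only runs at init time, and PDEP, the inverse of PEXT, enumerates every blocker configuration of a mask. It assumes the declarations from the block above plus System.Numerics for BitOperations.
Code: Select all
// Slow, loop-based slider attacks - only used while building the tables.
static ulong SlideAttacks(int square, ulong blockers, (int dr, int df)[] dirs)
{
    ulong attacks = 0;
    foreach (var (dr, df) in dirs)
    {
        for (int r = square / 8 + dr, f = square % 8 + df;
             r >= 0 && r <= 7 && f >= 0 && f <= 7; r += dr, f += df)
        {
            attacks |= 1UL << (r * 8 + f);
            if ((blockers & (1UL << (r * 8 + f))) != 0)
                break; // a ray ends at the first blocker
        }
    }
    return attacks;
}

// Fills the table for one line type and records the offsets. Each square's
// slice holds exactly 2^popcount(mask) entries - that's the dense packing.
static int InitLine(ulong[] masks, int[] offsets, (int dr, int df)[] dirs, int nextOffset)
{
    for (int square = 0; square < 64; square++)
    {
        offsets[square] = nextOffset;
        int bits = BitOperations.PopCount(masks[square]);
        for (ulong index = 0; index < (1UL << bits); index++)
        {
            // PDEP scatters the index bits back onto the mask, reconstructing
            // the one blocker configuration that belongs to this table slot.
            ulong blockers = Bmi2.X64.ParallelBitDeposit(index, masks[square]);
            Attacks[nextOffset + (int)index] = SlideAttacks(square, blockers, dirs);
        }
        nextOffset += 1 << bits;
    }
    return nextOffset;
}
You'd call this four times, e.g. InitLine(HorizontalMask, HorizontalOffset, new[] { (0, 1), (0, -1) }, nextOffset), starting the bishop lines at offset 0 and chaining the returned offset into the next call; it reaches 1664 after the two bishop lines and 6784 after all four.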
But we don't have to store the bitboards densely. If you think about it (or print it to the console^^), for each square there can't be more than 64 distinct attack patterns along one line: a line has at most 6 relevant blocker squares (the outermost square of a line and the slider's own square never change the pattern), so PEXT yields an index below 2^6 = 64. If we allow ourselves a bit more space to store the bitboards, we can store them sparsely and don't have to keep track of the offsets, but can instead just compute them on the fly.
Code: Select all
static readonly ulong[] DiagonalMask = new ulong[64];
static readonly ulong[] AntiDiagonalMask = new ulong[64];
static readonly ulong[] HorizontalMask = new ulong[64];
static readonly ulong[] VerticalMask = new ulong[64];
static readonly ulong[] Attacks = new ulong[4 * 64 * 64]; // 16384 entries

public static ulong RookAttacks(ulong occupation, int square)
{
    // rook tables live in the upper half (8192 = 64 squares * 128); each square
    // owns a 128-entry slot: 64 for the horizontal line, 64 for the vertical line
    int offset = 8192 + square * 128;
    ulong a = Attacks[offset + (int)Bmi2.X64.ParallelBitExtract(occupation, HorizontalMask[square])];
    ulong b = Attacks[offset + 64 + (int)Bmi2.X64.ParallelBitExtract(occupation, VerticalMask[square])];
    return a | b;
}

public static ulong BishopAttacks(ulong occupation, int square)
{
    // bishop tables live in the lower half: 64 diagonal + 64 anti-diagonal entries per square
    int offset = square * 128;
    ulong a = Attacks[offset + (int)Bmi2.X64.ParallelBitExtract(occupation, DiagonalMask[square])];
    ulong b = Attacks[offset + 64 + (int)Bmi2.X64.ParallelBitExtract(occupation, AntiDiagonalMask[square])];
    return a | b;
}
So more than half of the slots in the Attacks table are now empty. Wasted space. But not a lot of wasted space compared to PEXT: 16K bitboards is still 90K bitboards *less* than the roughly 105K of classical PEXT, so this could do well in real-world scenarios, provided that the Perft speed is still competitive.
And not surprisingly it is:
Code: Select all
Leorik Perft Sparse SubsetPEXT (16K)
1 OK! 1121ms, 106192K NPS
2 OK! 1703ms, 113734K NPS
3 OK! 1629ms, 109633K NPS
4 OK! 6716ms, 105122K NPS
5 OK! 0ms, 115093K NPS
6 OK! 1410ms, 116317K NPS
Total: 1361558651 Nodes, 12580ms, 108223K NPS
Code: Select all
Leorik Perft Dense SubsetPEXT (7K)
1 OK! 1063ms, 111974K NPS
2 OK! 1735ms, 111630K NPS
3 OK! 1652ms, 108100K NPS
4 OK! 7091ms, 99569K NPS
5 OK! 0ms, 84334K NPS
6 OK! 1475ms, 111225K NPS
Total: 1361558651 Nodes, 13017ms, 104593K NPS
You can see that both versions perform a little better than KiSS on hardware with native PEXT support (like my Ryzen 5900X), and both also have a small memory footprint: the sparse one's is equal to KiSS's, and the dense one uses less than half of that. On the other hand, the densely packed version is about 4M NPS slower than the one that uses twice the memory. So both seem equally promising for a real-life scenario with high cache pressure. And both stand a good chance of performing better than classical PEXT and also better than KiSS!
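And since this whole detour started with a slider-move-gen bug: the cheapest insurance after a rewrite like this is a differential test of the new generator against the old one on random occupancies. A quick sketch (SparsePext and ClassicPext are placeholder names for the two implementations above):
Code: Select all
using System;

class SliderSanityCheck
{
    static void Main()
    {
        var rng = new Random(12345);
        for (int i = 0; i < 1_000_000; i++)
        {
            // AND-ing two random values gives a sparser, more board-like occupancy
            ulong occupation = (ulong)rng.NextInt64() & (ulong)rng.NextInt64();
            int square = rng.Next(64);
            // placeholder classes: new sparse generator vs. trusted classic PEXT
            if (SparsePext.RookAttacks(occupation, square) != ClassicPext.RookAttacks(occupation, square) ||
                SparsePext.BishopAttacks(occupation, square) != ClassicPext.BishopAttacks(occupation, square))
                Console.WriteLine($"Mismatch at square {square}, occupation 0x{occupation:X16}");
        }
        Console.WriteLine("Done.");
    }
}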
I didn't think I'd catch the slider-move-gen bug, but I've fallen right down the rabbit hole this weekend, haha. I bet at least Dangi and Michael will agree that this is exciting! I don't know about the other readers... if you want me to clarify something, feel free to ask.
