Devlog of Leorik

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Leorik has a logo!! 8-)

Image
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
pedrojdm2021
Posts: 157
Joined: Fri Apr 30, 2021 7:19 am
Full name: Pedro Duran

Re: Devlog of Leorik

Post by pedrojdm2021 »

That's a very cool one! Did you draw it yourself?
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

pedrojdm2021 wrote: Sun Mar 27, 2022 6:34 pm That's a very cool one! Did you draw it yourself?
Yes, I used Inkscape, which is a free vector graphics editor. Very programmer-friendly: you don't need a graphics tablet or even much drawing skill, as everything is composed from simple shapes converted to paths.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Henk
Posts: 7216
Joined: Mon May 27, 2013 10:31 am

Re: Devlog of Leorik

Post by Henk »

Reminds me of the voodoo in the James Bond movie Live and Let Die.
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Devlog of Leorik

Post by Mike Sherwin »

Cute logo!

Leoric the skeleton king from Diablo 1
https://static.wikia.nocookie.net/diabl ... 0603170950
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Mike Sherwin wrote: Mon Mar 28, 2022 2:50 am Cute logo!

Leoric the skeleton king from Diablo 1
https://static.wikia.nocookie.net/diabl ... 0603170950
Unshackled from the constraints of minimalism and simplicity, my bare-bones engine MinimalChess rises from its grave, stronger than ever.

...and not only is Leoric an iconic boss from one of my favorite video games; his signature ability is to raise skeletal knights and archers that shield him and fight for him. Not unlike chess.

Aside from that reference the engine is named after my sons Leonard and Frederik.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Devlog of Leorik

Post by mvanthoor »

lithander wrote: Mon Mar 28, 2022 10:17 am Aside from that reference the engine is named after my sons Leonard and Frederik.
And you named Leonard after Leonard McCoy, didn't you? :P

To take the analogy a bit further... McCoy was probably a distant relative of the Skeleton King because everyone called him Bones :lol:
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

It's rare that programming something these days makes me go "WHOAAAT?!?" like when I was a kid in the 90s. But here's a crazy story on optimization...

A.k.a. how to tune high-quality PSTs from scratch (material values) in 20 seconds.

All previous versions of Leorik have just taken their PST values from MinimalChess. I was pretty happy with the quality but tuning these tables took a lot of time. Literally days of trial and error with hours of waiting in between.

For Leorik I wanted to revisit the tuning process to get shorter iteration times for my upcoming experiments with a better evaluation.

I'm not sure how familiar everybody here is with tuning, so maybe a short recap/introduction can't hurt. Everyone knows PSTs, I guess. But where do the values come from? It turns out you can write a program to "extract" these values mathematically from a set of annotated positions. That's typically just a text file where each line is the FEN string of a position plus the result of the game it was taken from: a win, a loss or a draw. And you have a lot of such lines. Maybe a million.
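
For illustration, a few such lines might look like this. (This is an assumed format; data sets differ in how they encode the result. Here 1.0 = White won, 0.5 = draw, 0.0 = Black won.)

Code: Select all

rnbqkbnr/pppp1ppp/8/4p3/4P3/8/PPPP1PPP/RNBQKBNR w KQkq e6 0 2 [0.5]
rnb1kbnr/pppp1ppp/8/4p3/6Pq/5P2/PPPPP2P/RNBQKBNR w KQkq - 1 3 [0.0]
4k3/8/4K3/4P3/8/8/8/8 b - - 0 60 [1.0]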

It's easy to imagine that you can use such a data set to assess the predictive quality of your evaluation. You typically use a sigmoid function to map the unbounded evaluation score, given in centipawns, into the range [-1..1], the spectrum between a loss and a win. The difference between the stored outcome and your prediction is the error. Iterate over the complete set of positions and compute the mean squared error of your eval. The goal of tuning is to find a configuration of PST values that minimizes this error function!
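
The SquareError helper isn't shown in the snippets below, so here is a sketch of what it plausibly looks like, reconstructed from how it's called (the sigmoid matches the one used later in Minimize):

Code: Select all

        //maps a centipawn score into [-1..1]; scalingCoefficient controls how quickly the curve saturates
        public static double Sigmoid(float eval, double scalingCoefficient)
        {
            return 2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1;
        }

        //squared difference between the stored game result and the prediction
        public static double SquareError(float result, float eval, double scalingCoefficient)
        {
            double error = result - Sigmoid(eval, scalingCoefficient);
            return error * error;
        }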

Maybe you've heard of Texel's tuning method or used it yourself. That's what I used for MinimalChess too because it's really... really... simple. You literally just add +1 to one of your PST values and compare the result of the error function before and after the change. Did it help? If not you try to decrease the value instead. You do that for all PST values and you keep doing it until neither increasing nor decreasing any of the PST values will minimize the error function further.
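
In (simplified, illustrative) code, using the MeanSquareError helper shown further below, the whole method could look something like this:

Code: Select all

        //naive Texel tuning: nudge one value at a time as long as it lowers the error
        public static void TexelTune(List<Data2> data, float[] coefficients, double scalingCoefficient)
        {
            double bestError = MeanSquareError(data, coefficients, scalingCoefficient);
            bool improved = true;
            while (improved)
            {
                improved = false;
                for (int i = 0; i < coefficients.Length; i++)
                {
                    coefficients[i] += 1; //try increasing the value first
                    double error = MeanSquareError(data, coefficients, scalingCoefficient);
                    if (error >= bestError)
                    {
                        coefficients[i] -= 2; //didn't help, try decreasing instead
                        error = MeanSquareError(data, coefficients, scalingCoefficient);
                    }
                    if (error < bestError)
                    {
                        bestError = error;
                        improved = true;
                    }
                    else
                        coefficients[i] += 1; //neither helped, restore the old value
                }
            }
        }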

This is pretty slow, because for each change of *one* value you have to run the error function again over all positions; in my case 725,000 positions. You can do that only a few times per second, and you need a lot of +1 and -1 steps until you arrive at good PSTs if you start from just material values.

Over the last year I remember a few posts where people mentioned they were using gradient descent for their tuning. I also remembered a good online course on machine learning that I did a few years ago (and subsequently forgot all details about), so I watched the relevant videos again. Andrew Ng is really a good teacher, but the stuff is very math-heavy and doesn't seem to fit what I'm doing with my chess engine at all. Lots of Greek symbols and partial derivatives. How does all that relate to my incremental evaluation function? With tapering based on game phase? The vector of coefficients is easy: that's just all the PST values together, a vector with 768 elements (6 piece types × 64 squares × 2 game phases). But what's my feature vector?

So I turned the question around: what feature vector would, multiplied with the vector of coefficients, produce the same result as my engine's current evaluation? Turns out you can calculate such a vector for each FEN, and at that point the tuning no longer has much to do with the details of your engine. Now it fits the information you find on Wikipedia or elsewhere.

Code: Select all

        public static float[] GetFeatures(BoardState pos, double phase)
        {
            float[] result = new float[N];

            //phase is used to interpolate between endgame and midgame score but we want to incorporate it into the features vector
            //score = midgameScore + phase * (endgameScore - midgameScore)
            //score = midgameScore + phase * endgameScore - phase * midgameScore
            //score = phase * endgameScore + (1 - phase) * midgameScore;
            float phaseEg = (float)(phase);
            float phaseMG = (float)(1 - phase);

            ulong occupied = pos.Black | pos.White;
            for (ulong bits = occupied; bits != 0; bits = Bitboard.ClearLSB(bits))
            {
                int square = Bitboard.LSB(bits);
                Piece piece = pos.GetPiece(square);
                int pieceOffset = ((int)piece >> 2) - 1;
                int squareIndex = (piece & Piece.ColorMask) == Piece.White ? square ^ 56 : square;
                int sign = (piece & Piece.ColorMask) == Piece.White ? 1 : -1;

                int iMg = pieceOffset * 128 + 2 * squareIndex;
                int iEg = iMg + 1;
                result[iMg] += sign * phaseMG;
                result[iEg] += sign * phaseEg;
            }
            return result;
        } 
Now to compute the MSE on the training set you don't need the engine's evaluation routines anymore. Great!

Code: Select all

        public static double MeanSquareError(List<Data2> data, float[] coefficients, double scalingCoefficient)
        {
            double squaredErrorSum = 0;
            foreach (Data2 entry in data)
            {
                float eval = Evaluate(entry.Features, coefficients);
                squaredErrorSum += SquareError(entry.Result, eval, scalingCoefficient);
            }
            double result = squaredErrorSum / data.Count;
            return result;
        }

        public static float Evaluate(float[] features, float[] coefficients)
        {
            //dot product of features vector with coefficients vector
            float result = 0;
            for (int i = 0; i < N; i++)
                result += features[i] * coefficients[i];
            return result;
        }
Now you just have lots of arrays of 768 floats that you multiply-add together with a standard dot product, and the result is your evaluation. You still try to find values for the coefficients so that the MeanSquareError over the whole set is minimized. That's where it clicked for me conceptually, and from then on it was about optimizing the speed of the implementation again.

So, first of all, gradient descent is already much faster than Texel's tuning method, because with one pass over the training data set you accumulate information for all coefficients at the same time: you take note of whether each one was, on average, contributing too much or too little to the results of the evaluations. (That information is the gradient.) Based on it you can adjust all coefficients at once in the right direction! With one pass over the training set you can improve all 768 values of the PSTs simultaneously. When you start with just material values (the pawn tables containing nothing but 100s, for example), about 2000 such iterations later you already have pretty decent-quality PSTs. One iteration takes less than a second to compute, so a good set of PSTs can be tuned from scratch in maybe half an hour.
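
For reference, this is the textbook form of that update, where f_j is the feature vector of position j, y_j the stored result, c the coefficients and M the number of positions. (Note that the Minimize code shown further below uses the common simplification of dropping the sigmoid-derivative factor and absorbing the constant into the learning rate alpha.)

Code: Select all

    E(c) = \frac{1}{M} \sum_{j=1}^{M} \bigl( \sigma(f_j \cdot c) - y_j \bigr)^2

    \frac{\partial E}{\partial c_i} = \frac{2}{M} \sum_{j=1}^{M} \bigl( \sigma(f_j \cdot c) - y_j \bigr) \, \sigma'(f_j \cdot c) \, f_{j,i}

    c_i \leftarrow c_i - \alpha \, \frac{\partial E}{\partial c_i}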

But can we make it faster?

My first attempt was to use SIMD instructions.

Code: Select all

        public static float EvaluateSIMD(float[] features, float[] coefficients)
        {
            //dot product of features vector with coefficients vector
            float result = 0;
            int slots = Vector<float>.Count;
            for (int i = 0; i < N; i += slots)
            {
                Vector<float> vF = new Vector<float>(features, i);
                Vector<float> vC = new Vector<float>(coefficients, i);
                result += Vector.Dot(vF, vC);
            }
            return result;
        }
This does help. It's pretty much exactly twice as fast as the previously shown implementation, which used a simple for-loop over an array of floats. But given that a 256-bit vector has room for 8 32-bit floats, I was kinda hoping for more. I guess that's about right for C#; not sure how much better C/C++ would do.

If you look at the feature vector, however, you realize that only a few percent of the elements are actually non-zero. The vast majority of the multiplications and additions don't affect the result at all. So I gave each feature vector an index buffer storing the indices of the non-zero values.

Code: Select all

        private static float Evaluate(float[] features, short[] indices, float[] coefficients)
        {
            //dot product of a selection (indices) of elements from the features vector with coefficients vector
            float result = 0;
            foreach (short i in indices)
                result += features[i] * coefficients[i];
            return result;
        }
Turns out this already does better than the SIMD version despite not involving any SIMD. It's almost twice as fast: an iteration of gradient descent over the complete set of positions takes 200ms now. With the SIMD implementation it was 400ms. With plain for-loops it was 700ms.

But even though the index buffer helps to avoid pointless computations, all those zeros still take up a lot of space in memory. After loading the 725,000 positions, each encoded as 768 floats of features, the process uses 2500 MB of memory. To store mostly zeros! :evil: So I changed the encoding to store features as tuples of (float, short), where the float is the feature value and the short its index. No need to store zeros, and no need for a separate index buffer: just iterate over the tuples and you get the value (the float) and the index of the coefficient to multiply it with (the short), sorted so that the same cache lines are reused as much as possible.
The process memory shrank to just 300 MB that way. And now a full iteration of gradient descent takes only 65ms! This is a great example of how important cache-friendly programming is these days! And it was my first big WHOAAT? moment of the day.
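
The Feature type and the matching sparse Evaluate overload aren't shown in this post; reconstructed from how Minimize uses them below, they might look roughly like this (names inferred from the usage):

Code: Select all

        //sparse feature: the non-zero value and the index of the coefficient it scales
        public struct Feature
        {
            public short Index;
            public float Value;
        }

        //drop the zeros: convert a dense feature vector into the sparse representation
        public static Feature[] Condense(float[] features)
        {
            List<Feature> sparse = new List<Feature>();
            for (short i = 0; i < features.Length; i++) //N = 768, small enough for short indices
                if (features[i] != 0)
                    sparse.Add(new Feature { Index = i, Value = features[i] });
            return sparse.ToArray();
        }

        //dot product over just the non-zero entries
        public static float Evaluate(Feature[] features, float[] coefficients)
        {
            float result = 0;
            foreach (Feature f in features)
                result += f.Value * coefficients[f.Index];
            return result;
        }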

This is already quite fast. Probably fast enough for all practical purposes, but it's still running on just one thread. In theory this workload should be well suited to parallelization, and I had been waiting for an excuse to try .NET's Task Parallel Library.
It's one of those cases where the documentation leaves you scratching your head for a while. Seriously... just look at the method signatures of the different Parallel.ForEach overloads.

But in actual code it looks quite elegant.

Code: Select all

        public static void Minimize(List<Data2> data, float[] coefficients, double scalingCoefficient, float alpha)
        {
            float[] accu = new float[N];
            foreach (Data2 entry in data)
            {
                float eval = Evaluate(entry.Features, coefficients);
                double sigmoid = 2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1;
                float error = (float)(sigmoid - entry.Result);

                foreach (Feature f in entry.Features)
                    accu[f.Index] += error * f.Value;
            }

            for (int i = 0; i < N; i++)
                coefficients[i] -= alpha * accu[i] / data.Count;
        }
...becomes...

Code: Select all

        public static void MinimizeParallel(List<Data2> data, float[] coefficients, double scalingCoefficient, float alpha)
        {
            //each thread maintains a local accu. After the loop is complete the accus are combined
            Parallel.ForEach(data,
                //initialize the local variable accu
                () => new float[N],
                //invoked by the loop on each iteration in parallel
                (entry, loop, accu) => 
                {
                    float eval = Evaluate(entry.Features, coefficients);
                    double sigmoid = 2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1;
                    float error = (float)(sigmoid - entry.Result);

                    foreach (Feature f in entry.Features)
                        accu[f.Index] += error * f.Value;

                    return accu;
                },
                //executed when each partition has completed.
                (accu) =>
                {
                    lock(coefficients)
                    {
                        for (int i = 0; i < N; i++)
                            coefficients[i] -= alpha * accu[i] / data.Count;
                    }
                }
            );
        }
It's pretty cool that my specific use case, needing a thread-local accumulation buffer, is not a problem for the API, even though picking the right overload was a bit tricky. Most of the added lines are actually comments. And you don't have to deal with any of the details, like how many threads or cores are best suited to the problem. So it really feels like you're just programming the equivalent of "MAKE THIS RUN FAST PLZZZ!"
And how fast is it? After this change all 12 logical cores of my Ryzen 3600 are reported to be utilized equally at 75% load, and the net result is another 7x speedup: from 65ms down to 9ms per iteration of gradient descent. That was the second time my jaw dropped today.

That's 2000 epochs in less than 20 seconds. And that allows you to tune tapered PSTs from scratch, starting with just material values, in the time it takes to... well... I don't know much else that can be accomplished in 20 seconds, actually. :P
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Mike Sherwin
Posts: 860
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Devlog of Leorik

Post by Mike Sherwin »

lithander wrote: Mon Mar 28, 2022 6:43 pm It's rare that programming something these days makes me go "WHOAAAT?!?" like when I was a kid in the 90s. But here's a crazy story on optimization...

[...]
It seems to me that AlphaZero needing 40 million games to tune its evaluation suggests that 750,000 positions might not be enough. Genetic mutation strategies take way too long starting from scratch. However, now that you have good PSTs, why not try to make them better using some genetic mutation strategy? I would randomly create 1000 small variant versions and have them play each other. After so many games I'd eliminate the bottom 500 and randomly create 500 more from, say, the top 100.
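
A rough skeleton of that idea (purely illustrative; Mutate and PlayTournament are hypothetical placeholders, not anything that exists in Leorik):

Code: Select all

        //evolutionary tuning skeleton: keep the fittest variants, refill from the best
        public static float[][] Evolve(float[] basePst, int generations)
        {
            Random rng = new Random();
            float[][] population = new float[1000][];
            for (int i = 0; i < population.Length; i++)
                population[i] = Mutate(basePst, rng); //hypothetical: small random variation of the PSTs

            for (int gen = 0; gen < generations; gen++)
            {
                PlayTournament(population); //hypothetical: plays games, sorts variants by score
                //eliminate the bottom 500, refill with mutations of the top 100
                for (int i = 500; i < population.Length; i++)
                    population[i] = Mutate(population[rng.Next(100)], rng);
            }
            return population;
        }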
mvanthoor
Posts: 1784
Joined: Wed Jul 03, 2019 4:42 pm
Location: Netherlands
Full name: Marcel Vanthoor

Re: Devlog of Leorik

Post by mvanthoor »

lithander wrote: Mon Mar 28, 2022 6:43 pm

Now to compute the MSE on the training set you don't need the engine's evaluation routines anymore. Great!
Nice write-up; I'll have to re-read it when I pick up development of Rustic again.

However: you don't need your evaluation function? How can this be? PSTs are just the basis where most evaluations start; they then stack lots of other terms on top. You also need to tune those, and they need to be tuned in conjunction with the PST values. So maybe your method works if you only tune PSTs, but as soon as you add other terms to the evaluation, I suspect you'll need it again.
Author of Rustic, an engine written in Rust.
Releases | Code | Docs | Progress | CCRL