Devlog of Leorik

Discussion of chess software programming and technical issues.

Moderator: Ras

User avatar
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

mvanthoor wrote: Tue Mar 29, 2022 12:05 am Nice write-up; I'll have to re-read that again when I pick up development of Rustic again.
Thanks. I was thinking about you when I decided to write this almost tutorial-like update. Let's hope it will prove useful!
mvanthoor wrote: Tue Mar 29, 2022 12:05 am However: you don't need your evaluation function? How can this be? PST's are just the basis where most evaluations start, and then they stack lots of other terms on top of them. You also need to tune those, and they need to be tuned in conjunction with the PST values; so maybe your method will work if you only tune PST's, but as soon as you are adding other terms in the evaluation, I suspect you'll need it again.
At the moment Leorik has only PSTs as an evaluation, so for *me* that was not a problem. If you have other eval terms and you don't plan to tune them, you can just treat them as a constant offset to whatever your Dot(feature, coefficients) function spits out. If you want to tune those terms too, you have to find a way to include them in the feature vector, I guess. But in either case you wouldn't have to call your engine's eval function during the tuning process. (afaik)
Mike Sherwin wrote: Mon Mar 28, 2022 7:20 pm It seems to me that Alpha Zero needing 40 million games to tune its evaluation makes it look like 750,000 positions might not be enough.
That's like comparing apples with a forest of apple trees. A rather small 192x15 Leela Zero network is still 21 MB large. That's a LOT of weights they have to "tune" in their engine. My PSTs are less than 2 KB in comparison, so I also need much less data to tune them.
Mike Sherwin wrote: Mon Mar 28, 2022 7:20 pm Genetic mutation strategies take way too long starting from scratch. However, now that you have good PSTs, why not try to make them better using some genetic mutation strategy? I would randomly create 1000 different small variant versions and have them play each other. After so many games I'd eliminate the bottom 500 and randomly create 500 more from, say, the top 100.
As you have explained and demonstrated earlier, PSTs alone are just too limited to really capture all the details of what makes one board position better than another. So even if evolutionary algorithms were able to find better coefficients than gradient descent, I don't expect large Elo gains from that.

My plan is instead to expand the concept of what is considered the typical use of PSTs. Why limit yourself to only a 'phase' parameter? I'm thinking of adding additional parameters for something like the positions of the kings. Why just a midgame and an endgame table per piece? I'm thinking of splitting pawns into different tables (and thus values) for passed pawns or isolated pawns etc.

What Leorik is missing is the ability to reason about concepts like Pawn structure, King safety and mobility. But instead of adding hand crafted evaluation terms on top of the PSTs or just skipping all that to adopt NNUEs right away (which promises the biggest Elo gains) I'm interested in exploring the vast space between neural networks and classical PSTs.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
jswaff
Posts: 105
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

Re: Devlog of Leorik

Post by jswaff »

I'm graduating from the Texel method to gradient descent (with mini-batches). It's working beautifully, but I simplified my evaluation by disabling "tapered eval" and the draw-scaling-factor while I was working everything out. Now that it's working I'm trying to work out how to fold tapered eval back in. Can you speak a bit more about how you did that?


BTW I just finished the ML course you referenced by Andrew Ng. :-) That really helped tremendously.

EDIT: I think I've answered my own question. I'm currently doing a simple matrix multiplication to compute the eval score for each record:

SimpleMatrix b = x.mult(theta);

x is the feature matrix (one row per training record and one column per feature), and theta contains the weights. So to do tapered eval I suppose I'll have to do a little more work here, like calculating the middle game score (using x and theta), and the endgame score (using x and theta), and then do the interpolation just as the Eval function does. I still don't need to call the evaluation but it's a bit more work than a simple matrix multiplication. Am I on the right track?


lithander wrote: Mon Mar 28, 2022 6:43 pm It's rare that programming something these days makes me go "WHOAAAT?!?" like when I was a kid in the 90s. But here's a crazy story on optimization...

A.k.a. how to tune high-quality PSTs from scratch (material values) in 20 seconds.

All previous versions of Leorik have just taken their PST values from MinimalChess. I was pretty happy with the quality but tuning these tables took a lot of time. Literally days of trial and error with hours of waiting in between.

For Leorik I wanted to revisit the tuning process to get shorter iteration times for my upcoming experiments with a better evaluation.

I'm not sure how familiar everybody here is with tuning, so maybe a short recap/introduction can't hurt. Everyone knows PSTs, I guess. But where do the values come from? It turns out you can write a program to "extract" these values mathematically from a set of annotated positions. That's typically just a text file where each line is a FEN string of a position and the game result: a win, a loss or a draw. And you have a lot of such lines. Maybe a million.

It's easy to imagine that you can use such a data set to assess the predictive quality of your evaluation. You typically use a sigmoid function to map the unbounded evaluation score, given in centipawns, into the range [-1..1], the spectrum between a loss and a win. The difference between the stored outcome and your prediction is the error. Iterate over the complete set of positions and compute the mean squared error of your eval. The goal of tuning is now to find a configuration of PST values that minimizes this error function!
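To make the recap concrete, here's a minimal sketch of that sigmoid mapping and the per-position squared error. The scaling constant of 100 in the demo line is a placeholder, not Leorik's tuned value:

```csharp
using System;

// Map an unbounded centipawn score into [-1..1]: -1 = loss, 0 = draw, 1 = win.
// scalingCoefficient is a tuned constant; 100 below is just a placeholder.
static double Sigmoid(double eval, double scalingCoefficient) =>
    2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1;

// Squared difference between the stored game result and the prediction.
static double SquareError(double result, double eval, double scalingCoefficient)
{
    double delta = result - Sigmoid(eval, scalingCoefficient);
    return delta * delta;
}

Console.WriteLine(Sigmoid(0, 100)); // prints 0: an even score predicts a draw
```

Summing SquareError over all positions and dividing by their count gives the mean squared error that the tuner minimizes.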

Maybe you've heard of Texel's tuning method or used it yourself. That's what I used for MinimalChess too, because it's really... really... simple. You literally just add +1 to one of your PST values and compare the result of the error function before and after the change. Did it help? If not, you try to decrease the value instead. You do that for all PST values, and you keep doing it until neither increasing nor decreasing any of the PST values reduces the error function further.
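For illustration, a minimal sketch of one pass of that +1/-1 scheme. The error delegate standing in for the full MSE computation is my assumption, not Leorik's actual code:

```csharp
using System;

// One pass of Texel-style local search: try +1, then -1, for every PST value
// and keep whichever change lowers the error. Returns false when nothing
// helped, which is the signal to stop iterating.
static bool TexelPass(int[] pst, Func<int[], double> error)
{
    bool improved = false;
    double best = error(pst);
    for (int i = 0; i < pst.Length; i++)
    {
        pst[i] += 1; //try increasing the value
        double e = error(pst);
        if (e < best) { best = e; improved = true; continue; }

        pst[i] -= 2; //undo the +1 and try decreasing instead
        e = error(pst);
        if (e < best) { best = e; improved = true; continue; }

        pst[i] += 1; //neither helped: restore the original value
    }
    return improved;
}
```

Running TexelPass in a loop until it returns false is the whole method; every single probe re-evaluates the error over the entire position set, which is why it is so slow.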

This is pretty slow, because for each change of *one* value you have to run the error function again over all positions. In my case 725,000 positions. You can do that only a few times per second, and you need a lot of +1s and -1s until you arrive at good PSTs if you start from just material values.

Over the last year I've seen a few posts where people mentioned they were using gradient descent for their tuning. I also remembered a good online course on machine learning that I did a few years ago (and subsequently forgot all the details of), so I watched the relevant videos again. Andrew Ng is really a good teacher, but the material is very math heavy and doesn't seem to fit at all with what I'm doing in my chess engine. Lots of Greek symbols and partial derivatives. How does all that relate to my incremental evaluation function? With tapering based on game phase? The vector of coefficients is easy: that's just all the PST values together, a vector with 768 elements. But what's my feature vector?

So I turned the question around: what feature vector would produce the same result, when multiplied with a vector of coefficients, as my engine's current evaluation? Turns out you can calculate such a vector for each FEN, and at that point the tuning no longer has much to do with the details of your engine. Now it fits the information you find on Wikipedia or elsewhere.

Code: Select all

        public static float[] GetFeatures(BoardState pos, double phase)
        {
            float[] result = new float[N];

            //phase is used to interpolate between endgame and midgame score but we want to incorporate it into the features vector
            //score = midgameScore + phase * (endgameScore - midgameScore)
            //score = midgameScore + phase * endgameScore - phase * midgameScore
            //score = phase * endgameScore + (1 - phase) * midgameScore;
            float phaseEg = (float)(phase);
            float phaseMG = (float)(1 - phase);

            ulong occupied = pos.Black | pos.White;
            for (ulong bits = occupied; bits != 0; bits = Bitboard.ClearLSB(bits))
            {
                int square = Bitboard.LSB(bits);
                Piece piece = pos.GetPiece(square);
                int pieceOffset = ((int)piece >> 2) - 1;
                int squareIndex = (piece & Piece.ColorMask) == Piece.White ? square ^ 56 : square;
                int sign = (piece & Piece.ColorMask) == Piece.White ? 1 : -1;

                int iMg = pieceOffset * 128 + 2 * squareIndex;
                int iEg = iMg + 1;
                result[iMg] += sign * phaseMG;
                result[iEg] += sign * phaseEg;
            }
            return result;
        } 
Now to compute the MSE on the training set you don't need the engine's evaluation routines anymore. Great!

Code: Select all

        public static double MeanSquareError(List<Data2> data, float[] coefficients, double scalingCoefficient)
        {
            double squaredErrorSum = 0;
            foreach (Data2 entry in data)
            {
                float eval = Evaluate(entry.Features, coefficients);
                squaredErrorSum += SquareError(entry.Result, eval, scalingCoefficient);
            }
            double result = squaredErrorSum / data.Count;
            return result;
        }

        public static float Evaluate(float[] features, float[] coefficients)
        {
            //dot product of features vector with coefficients vector
            float result = 0;
            for (int i = 0; i < N; i++)
                result += features[i] * coefficients[i];
            return result;
        }
Now you have just lots of arrays of 768 floats which you multiply-add together with a standard dot product. And the result is your evaluation. And still you try to find values for the coefficients so that the MeanSquareError on the total set is minimized. That's where it clicked for me conceptually and it became about optimizing the speed of the implementation again.

So, first of all, gradient descent is already much faster than Texel's tuning method, because with one pass over the training data set you accumulate information for each coefficient at the same time: you take note of whether it was on average contributing too much or too little to the result of the evaluations. (That information is the gradient.) And based on that information you can adjust all coefficients at once in the right direction! With one pass over the training set you can improve all 768 values of the PSTs at once. When you start with just material values (all Pawn table entries set to 100, for example), about 2000 such iterations later you already have pretty decent PSTs. And one iteration takes less than a second to compute, so it takes maybe half an hour to arrive at a good set of PSTs from scratch.

But can we make it faster?

The first attempt I did was to use SIMD instructions.

Code: Select all

        public static float EvaluateSIMD(float[] features, float[] coefficients)
        {
            //dot product of features vector with coefficients vector
            float result = 0;
            int slots = Vector<float>.Count;
            for (int i = 0; i < N; i += slots)
            {
                Vector<float> vF = new Vector<float>(features, i);
                Vector<float> vC = new Vector<float>(coefficients, i);
                result += Vector.Dot(vF, vC);
            }
            return result;
        }
This does help. It's pretty much exactly twice as fast as the previously shown implementation, which was using a simple for-loop over an array of floats. But given that there's space for eight 32-bit floats in a 256-bit vector I was kinda hoping for more. I guess that's about right for C#, though. Not sure how much better C/C++ would do.

If you look at the feature vector, however, you realize that only a few percent of the elements are actually non-zero. The vast majority of the multiplications and additions don't affect the result at all. So I introduced an index buffer for each feature vector, storing the indices that have non-zero values.

Code: Select all

        private static float Evaluate(float[] features, short[] indices, float[] coefficients)
        {
            //dot product of a selection (indices) of elements from the features vector with coefficients vector
            float result = 0;
            foreach (short i in indices)
                result += features[i] * coefficients[i];
            return result;
        }
Turns out this already does better than the SIMD version despite not involving any SIMD. It's almost twice as fast: an iteration of gradient descent over the complete set of positions now takes 200ms. With the SIMD implementation it was 400ms, and with plain for-loops 700ms.

But even though the index buffer helps to save pointless computations, all these zeros still take up a lot of space in memory. After loading the 725,000 positions, each consisting of 768 floats to encode the features, the process uses 2500 MB of memory. To store mostly zeros! :evil: So I changed the encoding to store each feature as a tuple of (float, short), where the float is the feature value and the short is the index. No need to store zeros, and no need for an index buffer. Just iterate over these tuples and you get the value (the float) and the index of the coefficient to multiply with (the short), sorted so that you can utilize the same cache lines as much as possible.
The process memory shrank to just 300 MB that way. And now a full iteration of gradient descent takes only 65ms! This is a great example of how important cache-friendly programming is these days! And it was my first big WHOAAT? moment of the day.
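In code, the sparse dot product over those (value, index) tuples might look like this. It's a sketch; Leorik's actual struct layout is my assumption:

```csharp
// Sparse dot product: only the stored non-zero features contribute. Keeping
// the tuples sorted by index means the coefficients array is walked front to
// back, which reuses cache lines as much as possible.
static float Evaluate((float Value, short Index)[] features, float[] coefficients)
{
    float result = 0;
    foreach (var f in features)
        result += f.Value * coefficients[f.Index];
    return result;
}
```

With 768 coefficients but only a few dozen non-zero features per position, this touches a tiny fraction of the memory the dense version did.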

This is already quite fast. Fast enough for all practical purposes, probably, but it's still just running on one thread. In theory this workload should be well suited for parallelization, and I was waiting for an excuse to try the Task Parallel Library that was recently added to .NET.
It's one of the examples where the documentation leaves you scratching your head for a while. Seriously... just look at the method signatures of the different overloads.

But in actual code it looks quite elegant.

Code: Select all

        public static void Minimize(List<Data2> data, float[] coefficients, double scalingCoefficient, float alpha)
        {
            float[] accu = new float[N];
            foreach (Data2 entry in data)
            {
                float eval = Evaluate(entry.Features, coefficients);
                double sigmoid = 2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1;
                float error = (float)(sigmoid - entry.Result);

                foreach (Feature f in entry.Features)
                    accu[f.Index] += error * f.Value;
            }

            for (int i = 0; i < N; i++)
                coefficients[i] -= alpha * accu[i] / data.Count;
        }
...becomes...

Code: Select all

        public static void MinimizeParallel(List<Data2> data, float[] coefficients, double scalingCoefficient, float alpha)
        {
            //each thread maintains a local accu. After the loop is complete the accus are combined
            Parallel.ForEach(data,
                //initialize the local variable accu
                () => new float[N],
                //invoked by the loop on each iteration in parallel
                (entry, loop, accu) => 
                {
                    float eval = Evaluate(entry.Features, coefficients);
                    double sigmoid = 2 / (1 + Math.Exp(-(eval / scalingCoefficient))) - 1;
                    float error = (float)(sigmoid - entry.Result);

                    foreach (Feature f in entry.Features)
                        accu[f.Index] += error * f.Value;

                    return accu;
                },
                //executed when each partition has completed.
                (accu) =>
                {
                    lock(coefficients)
                    {
                        for (int i = 0; i < N; i++)
                            coefficients[i] -= alpha * accu[i] / data.Count;
                    }
                }
            );
        }
It's pretty cool that my specific use case, where you need a thread-local accumulation buffer, is not a major problem for the API, even though picking the right overload was a bit tricky. And most of the added lines were actually comments. You don't have to deal with any of the details, like how many threads or cores are best suited for that kind of problem. So it really feels like you're just programming the equivalent of "MAKE THIS RUN FAST PLZZZ!"
And how fast is it? If you run the program after this change, all 12 logical cores of my Ryzen 3600 are reported to be utilized equally at 75% load. And the net result is another 7x speedup! Now it's 9ms per iteration of gradient descent. That was the second time my jaw dropped today.

That's 2000 epochs in less than 20 seconds. And that allows you to tune tapered PSTs from scratch, starting with just material values, in the time it takes to... well... I don't know much that could be accomplished in 20 seconds, actually. :P
User avatar
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

jswaff wrote: Sat Apr 16, 2022 4:31 am I'm graduating from the Texel method to gradient descent (with mini-batches). It's working beautifully, but I simplified my evaluation by disabling "tapered eval" and the draw-scaling-factor while I was working everything out. Now that it's working I'm trying to work out how to fold tapered eval back in. Can you speak a bit more about how you did that?


BTW I just finished the ML course you referenced by Andrew Ng. :-) That really helped tremendously.

EDIT: I think I've answered my own question. I'm currently doing a simple matrix multiplication to compute the eval score for each record:

SimpleMatrix b = x.mult(theta);

x is the feature matrix (one row per training record and one column per feature), and theta contains the weights. So to do tapered eval I suppose I'll have to do a little more work here, like calculating the middle game score (using x and theta), and the endgame score (using x and theta), and then do the interpolation just as the Eval function does. I still don't need to call the evaluation but it's a bit more work than a simple matrix multiplication. Am I on the right track?
Tapered evaluation just means that you have both midgame and endgame tables and want to interpolate between them based on a phase. So for example the phase of a position says that you currently have to take 70% of the midgame score and 30% of the endgame score.

You probably first calculated the midgame and endgame evaluations separately and only then interpolated between them based on phase. But once you realize you can roll that step into the features, you don't have to do any separate interpolation step.

Ideally you'll want to tune all the coefficients (midgame & endgame tables) at the same time. So the number of coefficients doubles, and that means you also need twice the number of features now. Where each feature was previously either 0 or 1 (either a piece is present (1) or it isn't (0)), your feature will now be in the range [0..1] based on the phase. So in the example above you'd just set all the midgame features to 0.7 and all the endgame features to 0.3. And that should calculate exactly the same score as your tapered evaluation did.
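A tiny numeric sanity check of that idea, with made-up weights (50 for the midgame slot and 90 for the endgame slot are hypothetical values, not tuned ones):

```csharp
using System;

// Tapered eval: interpolate between the midgame and endgame weight by phase.
float phase = 0.3f;                 //30% endgame, 70% midgame
float mgWeight = 50f, egWeight = 90f;
float tapered = (1 - phase) * mgWeight + phase * egWeight;

// Fractional features: emit 0.7 into the midgame slot and 0.3 into the
// endgame slot, then take the plain dot product with the two weights.
float mgFeature = 1 - phase, egFeature = phase;
float dot = mgFeature * mgWeight + egFeature * egWeight;

Console.WriteLine($"{tapered} == {dot}"); //both come out at 62
```

The dot product of the fractional features with the weights reproduces the tapered score exactly, so no separate interpolation pass is needed.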
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
jswaff
Posts: 105
Joined: Mon Jun 09, 2014 12:22 am
Full name: James Swafford

Re: Devlog of Leorik

Post by jswaff »

lithander wrote: Sat Apr 16, 2022 10:58 pm
jswaff wrote: Sat Apr 16, 2022 4:31 am I'm graduating from the Texel method to gradient descent (with mini-batches). It's working beautifully, but I simplified my evaluation by disabling "tapered eval" and the draw-scaling-factor while I was working everything out. Now that it's working I'm trying to work out how to fold tapered eval back in. Can you speak a bit more about how you did that?


BTW I just finished the ML course you referenced by Andrew Ng. :-) That really helped tremendously.

EDIT: I think I've answered my own question. I'm currently doing a simple matrix multiplication to compute the eval score for each record:

SimpleMatrix b = x.mult(theta);

x is the feature matrix (one row per training record and one column per feature), and theta contains the weights. So to do tapered eval I suppose I'll have to do a little more work here, like calculating the middle game score (using x and theta), and the endgame score (using x and theta), and then do the interpolation just as the Eval function does. I still don't need to call the evaluation but it's a bit more work than a simple matrix multiplication. Am I on the right track?
Tapered evaluation just means that you have both midgame and endgame tables and want to interpolate between them based on a phase. So for example the phase of a position says that you currently have to take 70% of the midgame score and 30% of the endgame score.

You probably first calculated the midgame and endgame evaluations separately and only then interpolated between them based on phase. But once you realize you can roll that step into the features, you don't have to do any separate interpolation step.

Ideally you'll want to tune all the coefficients (midgame & endgame tables) at the same time. So the number of coefficients doubles, and that means you also need twice the number of features now. Where each feature was previously either 0 or 1 (either a piece is present (1) or it isn't (0)), your feature will now be in the range [0..1] based on the phase. So in the example above you'd just set all the midgame features to 0.7 and all the endgame features to 0.3. And that should calculate exactly the same score as your tapered evaluation did.
Ah, of course! I have all the PSTs in theta (weights vector) and I have middle and endgame features. I hadn't considered using 'fractional features' - makes perfect sense. Thanks for the hint.
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Devlog of Leorik

Post by Mike Sherwin »

I got a winning position against the Skeleton King and then threw it away giving the Skeleton King equality. But managed to get another winning position and this time grinding his bones to dust I found the Undead Crown among his remains. :D This is probably the last version of Leorik that I'll be able to win against.
User avatar
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Mike Sherwin wrote: Sun Apr 17, 2022 11:20 pm I got a winning position against the Skeleton King and then threw it away giving the Skeleton King equality. But managed to get another winning position and this time grinding his bones to dust I found the Undead Crown among his remains. :D This is probably the last version of Leorik that I'll be able to win against.
Given its rating of ~2550 (CCRL), I think it's very impressive that you can still win!

I'm currently experimenting with ideas to improve the evaluation without hardcoding chess knowledge into the engine. (And without using outright neural networks)

Inspired by the idea of using the game phase (midgame..endgame) to modify the material values of pieces, I wanted to explore something similar based on the king position. It's basically another set of 6 PSTs with small values modifying the "normal" material values, but instead of being applied based on the phase of the game they're going to be multiplied with a value that reflects the general position of the king. Probably in the range [-1..1] and derived from a (tuned) lookup table.
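As a sketch of what I mean (the names and the exact way of combining the tables are of course not final):

```csharp
// The normal PST value gets a small king-position-dependent modifier on top.
// kingFactor would come from a tuned lookup table indexed by the king's square
// and lie in [-1..1]; both tables here are placeholders.
static float Score(int tableIndex, float[] pst, float[] kingPst, float kingFactor)
{
    return pst[tableIndex] + kingFactor * kingPst[tableIndex];
}
```

The appeal is that the same tuning machinery applies: the kingPst entries just become more coefficients, with features scaled by the king factor instead of the phase.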

I'm quite curious what this structure will allow me to "encode" in terms of positional knowledge. For example it should be possible to bump the value of pawns that are serving as a pawn shield for the castled king only when this castling move has actually happened.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Devlog of Leorik

Post by Mike Sherwin »

lithander wrote: Tue Apr 19, 2022 2:29 pm
Mike Sherwin wrote: Sun Apr 17, 2022 11:20 pm I got a winning position against the Skeleton King and then threw it away giving the Skeleton King equality. But managed to get another winning position and this time grinding his bones to dust I found the Undead Crown among his remains. :D This is probably the last version of Leorik that I'll be able to win against.
Given its rating of ~2550 (CCRL), I think it's very impressive that you can still win!
The reason I was able to win those games was mainly that at no time was there anything difficult for me to solve. Only in the second game was there a lot of action, but I only had to solve simple problems that didn't really feel like problems. I never felt pressured in any of the games.

I've played 8 long time control games against SF 3 thru 10 and my record is 4 losses and 4 draws for something like a 3000 performance rating. Also I played 3 knight odds games against SF10 at ltc and got 2 draws and one win. Here is my win and a draw.
forum3/viewtopic.php?f=11&t=69474&hilit ... 30#p786107
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Devlog of Leorik

Post by Mike Sherwin »

How is the evaluation coming along?
User avatar
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Mike Sherwin wrote: Fri Apr 29, 2022 6:20 pm How is the evaluation coming along?
There hasn't been much progress sadly. Programming Leorik has to compete with gaming for the same time budget. And at the moment Elden Ring wins almost all evenings! ;)

My idea about the king-phase set of tables that modify the PSTs didn't really work out as well as I hoped. It was a lot of effort to implement (the evaluation and tuner are now roughly twice as complicated as they were with just tapered PSTs), but if I pit this version against its predecessor I get only +20 Elo from it.

It might be that my data could be improved (if you want to tune the king-phase lookup table, it's probably good to have a greater variety of positions with exotic king placements) and there are a few things left I could try. But it feels a bit like I'm "polishing a turd"... The alternative would be to hand-write code to extract relevant features and tune those. Maybe focus on pawn/king structure first so a pawn hash can be used to make this fast. That promises good results and is relatively straightforward, but it's also a tad more boring than trying "novelties"... and so instead of making a hard decision I'm just procrastinating by working towards becoming the next Elden Lord instead!

But thanks for asking! Might be that a bit of public interest is just what I needed to tip the scales in Leorik's favor! :D
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
Mike Sherwin
Posts: 965
Joined: Fri Aug 21, 2020 1:25 am
Location: Planet Earth, Sol system
Full name: Michael J Sherwin

Re: Devlog of Leorik

Post by Mike Sherwin »

lithander wrote: Mon May 02, 2022 2:18 pm
Mike Sherwin wrote: Fri Apr 29, 2022 6:20 pm How is the evaluation coming along?
There hasn't been much progress sadly. Programming Leorik has to compete with gaming for the same time budget. And at the moment Elden Ring wins almost all evenings! ;)

My idea about the king-phase set of tables that modify the PSTs didn't really work out as well as I hoped. It was a lot of effort to implement (the evaluation and tuner are now roughly twice as complicated as they were with just tapered PSTs), but if I pit this version against its predecessor I get only +20 Elo from it.

It might be that my data could be improved (if you want to tune the king-phase lookup table, it's probably good to have a greater variety of positions with exotic king placements) and there are a few things left I could try. But it feels a bit like I'm "polishing a turd"... The alternative would be to hand-write code to extract relevant features and tune those. Maybe focus on pawn/king structure first so a pawn hash can be used to make this fast. That promises good results and is relatively straightforward, but it's also a tad more boring than trying "novelties"... and so instead of making a hard decision I'm just procrastinating by working towards becoming the next Elden Lord instead!

But thanks for asking! Might be that a bit of public interest is just what I needed to tip the scales in Leorik's favor! :D
OliThink 5.10.1
Complete rewrite of OliThink with a very fast move generator (based on OliPerft). Engine runs without any chess information. No pawnstructure, no tables etc.
ELO about 2950. 128 MB fix Hashsize.
2848 CCRL 40/15