Devlog of Leorik

Discussion of chess software programming and technical issues.

Moderator: Ras

j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Devlog of Leorik

Post by j.t. »

lithander wrote: Wed May 24, 2023 2:05 am ... here's the visualization!
Very nice! In the visualization, is the black king the "enemy king" or how should I imagine that? Are we looking only at white pieces? Maybe for future visualizations, having green and red meaning different things might be confusing for colorblind people (I don't know how bad it is, I just noticed it because at first I was looking at this on my phone, which I set to grayscale).

What I did to somewhat alleviate the problem of king squares that only have very few positions was to have a position where the king is on square X contribute not only to the gradient of the PST for that square X, but also with a weight of about 0.2 to the gradients of the PSTs of king squares in the 3x3 neighborhood and with about 0.1 to those in the 5x5 neighborhood.
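
Roughly like this (an illustrative C# sketch of the idea, not my actual tuner code):

Code: Select all

// Illustrative sketch only: each training position with the king on 'kingSqr' contributes
// its full gradient to that square's PST, 0.2x to the PSTs of king squares in the 3x3
// neighborhood and 0.1x to those in the 5x5 neighborhood.
static void AccumulateKingPstGradient(float[][] pstGradients, int kingSqr, int featureIndex, float gradient)
{
    int kFile = kingSqr & 7, kRank = kingSqr >> 3;
    for (int rank = 0; rank < 8; rank++)
        for (int file = 0; file < 8; file++)
        {
            // Chebyshev distance between the actual king square and the neighboring king square
            int dist = Math.Max(Math.Abs(file - kFile), Math.Abs(rank - kRank));
            float weight = dist == 0 ? 1.0f : dist == 1 ? 0.2f : dist == 2 ? 0.1f : 0.0f;
            if (weight > 0.0f)
                pstGradients[rank * 8 + file][featureIndex] += weight * gradient;
        }
}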
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

j.t. wrote: Wed May 24, 2023 2:17 pm Very nice! In the visualization, is the black king the "enemy king" or how should I imagine that?
Yes, in the visualization "we" are white and it's our PSTs and the board is oriented in a way that white is at the bottom and the black pieces would be on the top. In reality you could use the same PSTs for black by mirroring the rank and negating the values you find.
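
For illustration, reusing the white tables for black boils down to something like this (a sketch, not necessarily the exact code in Leorik):

Code: Select all

// Sketch: XOR with 56 flips the rank (a1 <-> a8) while keeping the file, and the score
// is negated because the tables are tuned from white's point of view.
static int BlackPstValue(int[] whitePst, int square)
{
    int mirroredSquare = square ^ 56; // flip rank 0..7 -> 7..0
    return -whitePst[mirroredSquare];
}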
j.t. wrote: Wed May 24, 2023 2:17 pm Maybe for future visualizations, having green and red meaning different things might be confusing for colorblind people (I don't know how bad it is, I just noticed it because at first I was looking at this on my phone, which I set to grayscale).
I'm not sure if the visualization will ever make it into the master branch. I just felt the need to visualize the values my tuner was spitting out at me to get an intuition on how to move forward. A grayscale gradient would be even easier to do, but for me personally it expresses fewer subtleties than one that transitions from red through yellow to green.
j.t. wrote: Wed May 24, 2023 2:17 pm What I did to somewhat alleviate the problem of king squares that only have very few positions was to have a position where the king is on square X contribute not only to the gradient of the PST for that square X, but also with a weight of about 0.2 to the gradients of the PSTs of king squares in the 3x3 neighborhood and with about 0.1 to those in the 5x5 neighborhood.
Are you using 64 different sets of PSTs indexed by the friendly king? Or the enemy king? Or both e.g. 64x64 sets?

I haven't decided yet how to move forward, but I hoped I could get away with only doubling or tripling the total number of parameters that my supposedly handcrafted eval relies on by finding a clever encoding of how both kings' positioning should influence the piece-square values.

Before I did the visualization I was thinking about a linear model based on the king's file and rank but looking at the visualization that doesn't seem so promising anymore.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Devlog of Leorik

Post by j.t. »

lithander wrote: Wed May 24, 2023 3:44 pm Are you using 64 different sets of PSTs indexed by the friendly king? Or the enemy king? Or both e.g. 64x64 sets?
One set of 64 PSTs indexed by the friendly king and one set indexed by the enemy king (and then the values get added together). If I understood correctly, that's similar to what you ended up doing. 64×64 was too big for the stack, and I think probably also overkill. Especially because I have “only” around 6 million training positions.

An idea to reduce the number of parameters would be to make the PSTs for a king on h1 the same as those for a king on a1, just mirrored across the board's vertical axis. I haven't tried that yet though.
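
The indexing for that could look something like this (just an untested sketch of the idea):

Code: Select all

// Sketch (untested): fold the king onto files a-d before indexing the king-relative PSTs,
// so a king on h1 uses the (file-mirrored) tables of a king on a1.
static int FoldedKingSquare(int kingSqr, out bool mirrored)
{
    mirrored = (kingSqr & 7) > 3;            // king on files e-h?
    return mirrored ? kingSqr ^ 7 : kingSqr; // XOR 7 mirrors the file, keeps the rank
}
The piece squares would of course have to be mirrored the same way whenever the folded index is used.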
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

j.t. wrote: Wed May 24, 2023 5:20 pm
lithander wrote: Wed May 24, 2023 3:44 pm Are you using 64 different sets of PSTs indexed by the friendly king? Or the enemy king? Or both e.g. 64x64 sets?
One set of 64 PSTs indexed by the friendly king and one set indexed by the enemy king (and then the values get added together). If I understood correctly, that's similar to what you ended up doing.
It's what I've been doing for the visualization at least, yes! But I haven't tried to use the tuned PSTs in the engine's actual evaluation yet. Good to know that just adding them up works, though! ;)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
JoAnnP38
Posts: 253
Joined: Mon Aug 26, 2019 4:34 pm
Location: Clearwater, Florida USA
Full name: JoAnn Peeler

Re: Devlog of Leorik

Post by JoAnnP38 »

lithander wrote: Wed May 24, 2023 3:44 pm Are you using 64 different sets of PSTs indexed by the friendly king? Or the enemy king? Or both e.g. 64x64 sets?

I haven't decided yet how to move forward, but I hoped I could get away with only doubling or tripling the total number of parameters that my supposedly handcrafted eval relies on by finding a clever encoding of how both kings' positioning should influence the piece-square values.
It can get really expensive in terms of your evaluation. With my 2x2 model I can still incrementally update mg/eg values related to all the boards, but it's not like it's free. Could you imagine a 64x64 set of PSTs!!! I think I'll switch to NNUE before that.
lithander wrote: Wed May 24, 2023 3:44 pm Before I did the visualization I was thinking about a linear model based on the king's file and rank but looking at the visualization that doesn't seem so promising anymore.
I wondered about that when you first mentioned it. It always seemed more complex than that, but I was hoping you had worked out a solution. It always felt more like a multidimensional polynomial to me. (My head is already hurting.) Maybe there is a good-enough heuristic somewhere in all this?
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Devlog of Leorik

Post by j.t. »

JoAnnP38 wrote: Wed May 24, 2023 6:31 pm It can get really expensive in terms of your evaluation. With my 2x2 model I can still incrementally update mg/eg values related to all the boards, but it's not like it's free.
As described above, I use two sets of 64 PSTs. I don't do any incremental evaluation (i.e. all eval is done at each leaf), and if I completely remove the PSTs from my eval, I get a speed improvement of at most 10%. The most expensive parts of the rest of my evaluation for this experiment are not much more than some mobility and some smaller tables.

EDIT: also, because I can't edit my earlier comment: I had in fact already added the vertical mirroring and just forgot about it ...
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Looking at the visualization at a cursory glance, it seems that the tiles of the same piece type share a lot of similarity. So instead of trying to encode each tile's content, I thought I could take a different, simpler route and just share the same tile for every piece of the same type, regardless of where that piece stands.

So I added 5 tables for P, N, B, R, Q indexed by the friendly king square and another 5 such tables for the opponent king square. (10x64x2 new values in total)
The material evaluation now looks like this: Material = Material[piece, square] + KingRelative[piece, kingSquare] + OpponentKingRelative[piece, oppKingSquare]

An example of why that makes sense is the idea that "a king in the middle of the board is easier to attack by sliders, making them more valuable", and the tables can encode that kind of thinking well.
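
In code the lookup is roughly this simple (a sketch with illustrative names, not the exact Leorik source):

Code: Select all

// Sketch with illustrative names: the tapered PST value of a piece gets two additive
// bonuses, one indexed by the friendly king square and one by the opponent king square.
// The piece's own square does not enter the king-relative tables.
static readonly (int Mg, int Eg)[,] Pst = new (int Mg, int Eg)[6, 64];             // [piece, square]
static readonly (int Mg, int Eg)[,] KingRelative = new (int Mg, int Eg)[5, 64];    // P,N,B,R,Q x own king square
static readonly (int Mg, int Eg)[,] OppKingRelative = new (int Mg, int Eg)[5, 64]; // P,N,B,R,Q x enemy king square

static (int Mg, int Eg) MaterialValue(int piece, int square, int kingSqr, int oppKingSqr)
{
    (int Mg, int Eg) value = Pst[piece, square];               // classic piece-square term
    (int Mg, int Eg) own = KingRelative[piece, kingSqr];       // bonus based on our king
    (int Mg, int Eg) opp = OppKingRelative[piece, oppKingSqr]; // bonus based on the enemy king
    return (value.Mg + own.Mg + opp.Mg, value.Eg + own.Eg + opp.Eg);
}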

I just used my old set of training data (it might make sense to select different positions, but for the first experiment I didn't bother) and the results were quite promising:

Code: Select all

Score of Leorik-2.4.2 vs Leorik-2.4-Pext-Net8: 2168 - 1602 - 2650  [0.544] 6420
...      Leorik-2.4.2 playing White: 1250 - 672 - 1288  [0.590] 3210
...      Leorik-2.4.2 playing Black: 918 - 930 - 1362  [0.498] 3210
...      White vs Black: 2180 - 1590 - 2650  [0.546] 6420
Elo difference: 30.7 +/- 6.5, LOS: 100.0 %, DrawRatio: 41.3 %
JoAnnP38 wrote: Wed May 24, 2023 6:31 pm It can get really expensive in terms of your evaluation. With my 2x2 model I can still incrementally update mg/eg values related to all the boards, but it's not like it's free. Could you imagine a 64x64 set of PSTs!!! I think I'll switch to NNUE before that.
I don't think it should really be that expensive performance-wise. It just takes up more memory. Even with 4096 sets of PSTs you would still only use one set at a time, right? And you could do incremental updates with that set as long as no king moves. Only when one of the kings moves would you have to pick a different set and recompute the eval, but even that is still fast compared to everything else that's going on. You only ever use one set at a time (what chrisw explained recently as being a form of sparseness).

My (too?) simple approach to king-relative eval works similarly to that: everything is incrementally updated, but when a king moves you compute the king-relative term from scratch. And that's still fast because for each (piece, color) combination you just fetch the (mg, eg) values from the two kings' tables, multiply them by PopCount(pieceBitboard & color), and add the results up.
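
As a sketch (again with illustrative names rather than the verbatim Leorik code, reusing the tables from the sketch above), that recompute is basically this loop:

Code: Select all

// Illustrative sketch: recompute the king-relative material term for one side from scratch
// after its king has moved.
static (int Mg, int Eg) KingRelativeTerm(ulong[] pieceBitboards, ulong sideMask, int kingSqr, int oppKingSqr)
{
    int mg = 0, eg = 0;
    for (int piece = 0; piece < 5; piece++) // P, N, B, R, Q
    {
        // How many of this piece does the side being evaluated have on the board?
        int count = System.Numerics.BitOperations.PopCount(pieceBitboards[piece] & sideMask);
        (int Mg, int Eg) own = KingRelative[piece, kingSqr];
        (int Mg, int Eg) opp = OppKingRelative[piece, oppKingSqr];
        mg += count * (own.Mg + opp.Mg);
        eg += count * (own.Eg + opp.Eg);
    }
    return (mg, eg);
}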
j.t. wrote: Wed May 24, 2023 10:27 pm As described above, I use two sets of 64 PSTs. I don't do any incremental evaluation (i.e. all eval is done at each leaf), and if I completely remove the PSTs from my eval, I get a speed improvement of at most 10%. The most expensive parts of the rest of my evaluation for this experiment are not much more than some mobility and some smaller tables.
Do you know how much Elo the 64 sets of PSTs brought you over just having one set, i.e. not doing king-relative eval? I'm trying to get a feel for how good my +30 Elo really is in terms of encoding king-relative information.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
j.t.
Posts: 263
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Devlog of Leorik

Post by j.t. »

lithander wrote: Sat May 27, 2023 12:46 pm Do you know how much Elo the 64 sets of PSTs brought you over just having one set, i.e. not doing king-relative eval? I'm trying to get a feel for how good my +30 Elo really is in terms of encoding king-relative information.
I think initially in self-play I got 50 Elo or so, but I wasn't testing very systematically back then (I think that was with only the zurichess set). More recently it was 100 Elo compared to a normal PST (with ~6 million training positions), but in that case I don't remember if it was self-play or against a pool. Witek seemed to get as much as 200 Elo in Caissa (with billions of training positions, and compared to a PST-only evaluation, so no other king safety features or anything like that).
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

TLDR: I'm still working on making my evaluation "king relative".

A quick recap: A few posts back I described how I was tuning separate sets of PSTs and visualized them. The similarity between tiles of the same piece type inspired me to first implement a minimal approximation based on just 10 (additional) piece-square tables. With these, each piece's material value is modified based on the position of the two kings; the placement of the piece itself, however, is not considered for that bonus. Despite this rather crude approximation, the additional tables helped to gain a few Elo.

But my original idea was a bit more ambitious than just adding a bunch of additional lookup tables. It started with a simple realization: while a classic PST is just a lookup table, for "tapered PSTs", which work much better in practice, this is no longer true. Instead, each piece-square value is defined by a very tiny linear function: value = mg + eg * phase

What I wanted to try is to stick to this linear model but add more terms to it: terms based on the kings' files and ranks. My expectation was that the higher dimensionality would let me increase the accuracy of the linear model, but of course it would also carry a higher performance cost to compute these increasingly complex formulas each time a piece moves (Leorik uses incremental updates) and, most importantly, to recompute *all* pieces each time one of the kings moves. Would it be worth it? That's why I started with the minimal viable thing, the 10 (or even just 5) lookup tables, to establish the baseline the higher-dimensional linear model would have to beat.

The other problem with my idea was that the best process I could come up with to compute the coefficients for my linear model was quite cumbersome: I planned to train a lot of "normal" PSTs on selected subsets of my training data (that's where the visualization came from). Then I wanted to extract the relevant features of each position in a subset and use the matching set of PSTs to get the expected value of each piece in that position. That way I'd be able to gather input and expected output data for each piece-square combination.

Then I planned to find the coefficients using QR decomposition as implemented in MathNet library:

Code: Select all

            // Uses MathNet.Numerics (using MathNet.Numerics.LinearAlgebra;
            // using MathNet.Numerics.LinearAlgebra.Factorization;)
            // 'inputs' is a Matrix<double> with one row per sample and one column per feature,
            // 'outputs' is the Vector<double> of expected values.

            // Solve for the regression coefficients using QR decomposition
            QR<double> qr = inputs.QR();

            // Least squares: solve R * coefficients = Q^T * outputs
            Matrix<double> qTranspose = qr.Q.Transpose();
            Vector<double> y = qTranspose * outputs;

            Vector<double> coefficients = qr.R.Solve(y);
But did I really want to add a third-party dependency to Leorik? So far everything was "from scratch", so I tried to learn more about how to solve multiple linear regression problems on my own. And only when "gradient descent" was mentioned did it hit me that I already had everything I needed! There's really no need to compute all these sets of PSTs on subsets of the training set. I should just modify my tuner to work with an arbitrary number of dimensions (instead of just midgame and endgame), and then I could describe the linear model I had in mind by just generating the "variables", and it would automatically find me the coefficients that minimize the MSE.

In my tuner a chess position is converted into an array of floats, and by increasing the dimensionality of my linear model from 2 dimensions (mg + eg) to something like 14 dimensions, all that really changed was the size of that float array (and about a hundred lines of implementation details^^).
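
The tuner core itself is then just plain batch gradient descent on those feature arrays. A generic sketch of one update step (leaving out details like the sigmoid mapping from eval to expected result that such tuners typically use):

Code: Select all

// Generic sketch of one batch gradient-descent step for a linear model minimizing MSE.
static void GradientDescentStep(float[][] features, float[] labels, float[] weights, float learningRate)
{
    var gradient = new float[weights.Length];
    for (int i = 0; i < features.Length; i++)
    {
        // The prediction is just the dot product of the position's features with the weights
        float prediction = 0f;
        for (int j = 0; j < weights.Length; j++)
            prediction += weights[j] * features[i][j];

        // Accumulate the MSE gradient (up to a constant factor absorbed by the learning rate)
        float error = prediction - labels[i];
        for (int j = 0; j < weights.Length; j++)
            gradient[j] += error * features[i][j];
    }
    // Move the weights a small step against the averaged gradient
    for (int j = 0; j < weights.Length; j++)
        weights[j] -= learningRate * gradient[j] / features.Length;
}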

The code below adds 14 features for each piece-square combination in a position into an array of about 11,000 floats total.

Code: Select all

        private static void AddFeatures(float[] features, int index, int sign, float phase, int kingSqr, int oppKingSqr)
        {
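            // Map file/rank from 0..7 to the symmetric range [-1, +1]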
            float kFile = Bitboard.File(kingSqr) / 3.5f - 1f;
            float kRank = Bitboard.Rank(kingSqr) / 3.5f - 1f;
            float oppKFile = Bitboard.File(oppKingSqr) / 3.5f - 1f;
            float oppKRank = Bitboard.Rank(oppKingSqr) / 3.5f - 1f;
            //Base
            features[index++] += sign;
            features[index++] += sign * kFile * kFile;
            features[index++] += sign * kFile;
            features[index++] += sign * kRank * kRank;
            features[index++] += sign * kRank;
            features[index++] += sign * oppKFile * oppKFile;
            features[index++] += sign * oppKFile;
            features[index++] += sign * oppKRank * oppKRank;
            features[index++] += sign * oppKRank;
            //Endgame (* phase)
            features[index++] += sign * phase;
            features[index++] += sign * kFile * phase;
            features[index++] += sign * kRank * phase;
            features[index++] += sign * oppKFile * phase;
            features[index++] += sign * oppKRank * phase;
        }
...and the tuner finds the associated coefficients that then look like this:

Code: Select all

405,  -3.57f, -13.43f,  29.03f,  32.36f,  -8.51f,  10.91f,  11.32f,  28.78f, 143.63f,  24.87f,  76.69f,  45.98f,  61.77f
And that's how the material of a piece on a square is now calculated: look up the above coefficients and use them in a formula based on the same features that we used in the tuner:

Code: Select all

        private void AddMaterial(int pieceIndex, int squareIndex, int myKingSqr, int oppKingSqr)
        {
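            // Each (piece, square) combination stores Weights.MaterialTerms (here 14) coefficients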
            int entryIndex = Weights.MaterialTerms * ((pieceIndex << 6) | squareIndex);

            float kFile = Bitboard.File(myKingSqr) / 3.5f - 1f;
            float kRank = Bitboard.Rank(myKingSqr) / 3.5f - 1f;
            float oppKFile = Bitboard.File(oppKingSqr) / 3.5f - 1f;
            float oppKRank = Bitboard.Rank(oppKingSqr) / 3.5f - 1f;

            Material.Base += Weights.MaterialWeights[entryIndex + 0] // Base
                + Weights.MaterialWeights[entryIndex + 1] * kFile * kFile
                + Weights.MaterialWeights[entryIndex + 2] * kFile
                + Weights.MaterialWeights[entryIndex + 3] * kRank * kRank
                + Weights.MaterialWeights[entryIndex + 4] * kRank
                + Weights.MaterialWeights[entryIndex + 5] * oppKFile * oppKFile
                + Weights.MaterialWeights[entryIndex + 6] * oppKFile
                + Weights.MaterialWeights[entryIndex + 7] * oppKRank * oppKRank
                + Weights.MaterialWeights[entryIndex + 8] * oppKRank;

            Material.Endgame += Weights.MaterialWeights[entryIndex + 9] // Phase-Bonus
                + Weights.MaterialWeights[entryIndex + 10] * kFile
                + Weights.MaterialWeights[entryIndex + 11] * kRank
                + Weights.MaterialWeights[entryIndex + 12] * oppKFile
                + Weights.MaterialWeights[entryIndex + 13] * oppKRank;
        }
For now it seems to work as expected: the MSE (mean squared error between predicted outcome and label on a set of 2.8M positions sourced from self-play games) goes down significantly, which indicates that the accuracy has improved. 8-) And the other expectation is also met: it's pretty slow, and the NPS of my engine tanks from 7M nps to 3M nps. :roll:

If that sounds crazy, I'm sorry. Not even I'm convinced that it's a good idea, but it wasn't obviously bad either, which piqued my curiosity about whether it'd work or not. And now I know that it "worked" insofar as a build with that eval doesn't stand a chance in a time-limited match, but in a node-limited match the new version is ~70 Elo above baseline Leorik 2.4.

The remaining work is all about optimizing the code for performance, which to me is the most fun part of chess programming.
For example, the above looks like a great opportunity to use SIMD, right? The expensive part of the code above is basically just a dot product. And I'm flexible enough to adjust the number of features to exactly fill up a 256-bit register.

In the worst case scenario I still might not make any real progress on Leorik here but at least it's a fun way of not making progress! =)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
lithander
Posts: 915
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Devlog of Leorik

Post by lithander »

Optimizing the speed of my king relative material functions went better than I dared to hope.

I now use 18 terms to encode the material value of a piece on a specific square: 9 of them are used for the midgame value and 9 for the endgame value; this is still kept separate for backward compatibility with the other parts of the eval. Of these 9 terms, one is a base value (intercept) and the remaining 8 are based on king-relative features (see the code example in the last post).

Initially my implementation caused a big speed loss (50% nps), but it's important to realize that incremental updates still work, that the king-relative features only need to be updated when a king moves, and that the features for black and white pieces are just shuffled versions of the same 8 values. And last but not least, 8 floats fit exactly into a 256-bit SIMD register.

Multiplying 8 king-relative features of the current position with the weights that my tuner found and stored for each piece-square combination comes down to a single call:

Code: Select all

Vector256.Dot(vars, weights)
And this is surprisingly fast! The other major thing I did to improve speed is that I don't compute the king features from scratch but store them with the position's eval and update them only when necessary.
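
To give an idea of the shape of that code (a simplified sketch rather than the literal Leorik implementation; Vector256.Dot requires .NET 7 or newer):

Code: Select all

using System.Runtime.Intrinsics;

// Simplified sketch: the 8 king-relative features of the current position are kept in a
// Vector256<float>, each (piece, square) entry stores 8 matching weights, and the whole
// term is one multiply-and-sum call.
static float KingRelativeTerm(float[] materialWeights, int entryIndex, Vector256<float> kingFeatures)
{
    Vector256<float> weights = Vector256.Create(
        materialWeights[entryIndex + 0], materialWeights[entryIndex + 1],
        materialWeights[entryIndex + 2], materialWeights[entryIndex + 3],
        materialWeights[entryIndex + 4], materialWeights[entryIndex + 5],
        materialWeights[entryIndex + 6], materialWeights[entryIndex + 7]);

    return Vector256.Dot(kingFeatures, weights);
}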

Here's the performance comparison (measured on the startpos searched to depth 22) of the three different king-relative evaluations that I tried:

Code: Select all

Leorik-2.4 (no-king-relative features) 7.5M nps
Leorik-2.4.1 (5 king-relative tables per king (=10)) 7.3M nps
Leorik-2.4.2 (5 king-relative tables for own king only) 7.35M nps
Leorik-2.4.3 (18 king-relative terms per Piece-Square) 6.5M nps
The 18-dimensional linear model is able to encode how a piece on a specific square relates to the two kings, and that improves the eval quality a lot more than what the lookup tables can do. This makes 2.4.3 significantly stronger than the other king-relative versions (~ +20 Elo) despite it evaluating 1M fewer nodes per second.

The above results may suggest that the two established paradigms of A) adding up values pulled from different sets of context-specific piece-square tables and B) using NNUE are not the only viable approaches. Considering the ubiquity of SIMD-capable hardware, it seems interesting to explore whether linear models as described above present a viable alternative, simply because they let you leverage the capabilities of modern hardware better and compute the multiply-and-add operations on 8 terms (using 32-bit floats) in parallel.

Also, I found that this approach solves issues with overfitting some PSTs. When tuning context-specific sets of PSTs, only a subset of your training data is used for each of them, and for some exotic positions that means you may not have enough labeled positions to derive meaningful general results. With a linear function that you try to fit over a much larger space, overfitting seems to be much less of a problem. I should probably do another visualization based on the linear model; could be interesting to compare.

This stuff has me operating at the limits of my mathematical abilities, btw. Let me know if you think my conclusions make no sense! ;)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess