Evaluation roadblocks


algerbrex
Posts: 596
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Evaluation roadblocks

Post by algerbrex »

I'm in the process of trying to improve Blunder using automatic evaluation parameter tuning (Texel tuning), as I realize I'm past the stage where I can play around with a parameter and hand-tune it effectively.

And I've made a lot of good progress with my tuner, implementing many suggestions and ideas from folks around here, but I'm still running into a few issues:

First, with my current hardware (Google Pixelbook, 8GB RAM, i5 processor), I can only load in about 100k-200k positions for tuning. So part of the reason I suspect my tuning isn't producing very good results right now is that it's overfitting. I've tried to compensate for this by shuffling the training data frequently during tuning, but can this number of positions be made to work to any effective degree?
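
(The shuffle itself is just an in-place Fisher-Yates pass between epochs. A rough sketch in C# of the idea, not my actual code; the array and RNG stand in for whatever the tuner uses:)

Code: Select all

    // Reshuffle the training set in place between epochs (Fisher-Yates).
    // 'entries' stands in for whatever array the tuner iterates over.
    static void Shuffle<T>(T[] entries, Random rng)
    {
        for (int i = entries.Length - 1; i > 0; i--)
        {
            int j = rng.Next(i + 1); // uniform index in [0, i]
            (entries[i], entries[j]) = (entries[j], entries[i]);
        }
    }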

Second, the tuning process itself is producing weird values, as I mentioned above. For instance, I'm currently trying to tune some mobility tables, and the values being produced after running for a few minutes don't make much sense intuitively:

Code: Select all

Iteration 13 complete. Best error=0.122326
Knight MG mobility bonuses
[-54 -23 -22 -31 -15 -24 -38 -52 -61]
Bishop MG mobility bonuses
[34 -55 -29 -3 18 4 -5 -19 -8 -27 -41 -60 -59 22]
Rook MG mobility bonuses
[-7 59 -40 -29 -28 -22 -31 -20 -19 -23 -37 -1 -45 -44 -58]
Queen MG mobility bonuses
[-4 -8 -12 54 -75 -9 7 8 24 20 11 7 -7 -26 10 -24 -48 -37 -6 -60 11 27 73 69 -30 66 42 73 24]
Knight EG mobility bonuses
[-19 -23 -12 19 10 6 -13 3 -61]
Bishop EG mobility bonuses
[24 -70 -14 -43 -37 -1 5 -4 -13 -12 9 -20 1 -28]
Rook EG mobility bonuses
[-7 39 -70 -59 -8 -32 19 15 16 12 18 4 20 -14 -3]
Queen EG mobility bonuses
[-4 -13 -12 -11 -55 6 -68 -22 -71 -70 -14 -33 33 59 0 36 62 23 -6 50 1 12 -17 -6 75 76 -28 78 24]
Now, I realize the tuning takes longer than several minutes, but it seems the first few minutes are indicative of the tuner's effectiveness. Any idea what could be causing this? It might also be helpful to note that right now I always limit the tuner to the parameters I'm currently interested in, so it only ever sees a subset of the evaluation parameters.

I'm implementing mobility evaluation by counting the number of pseudo-legal moves a particular piece has, excluding squares controlled by enemy pawns. I've also tried starting the weights from intuitive values and from zeros, and both approaches have worked quite poorly.
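
In rough C#-style code, the count per piece type looks something like this (a sketch for illustration, not Blunder's actual code; PieceAttacks() and PawnAttacks() are placeholder bitboard helpers):

Code: Select all

    using System.Numerics;

    // Sketch: count pseudo-legal target squares for all pieces of one type,
    // excluding squares controlled by enemy pawns. PieceAttacks() and
    // PawnAttacks() are placeholders for the engine's attack generators.
    static int Mobility(int pieceType, ulong piecesOfType, ulong occupied, ulong enemyPawns)
    {
        ulong unsafeSquares = PawnAttacks(enemyPawns); // squares enemy pawns control
        int total = 0;
        for (ulong pieces = piecesOfType; pieces != 0; pieces &= pieces - 1)
        {
            int sq = BitOperations.TrailingZeroCount(pieces); // lowest set bit
            ulong moves = PieceAttacks(pieceType, sq, occupied) & ~unsafeSquares;
            total += BitOperations.PopCount(moves); // squares this piece can reach
        }
        return total;
    }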

Lastly, part of the reason I'm focusing on improving the evaluation is that right now, adding various pruning methods to Blunder doesn't contribute much Elo. My working theory is that the evaluation isn't yet good enough to capitalize on the extra nodes the search gains. Does this sound about right?
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Evaluation roadblocks

Post by lithander »

algerbrex wrote: Mon Aug 30, 2021 9:51 am First, with my current hardware (Google Pixelbook, 8GB RAM, i5 processor), I can only load in about 100k-200k positions for tuning.
Intuitively it would seem that you should easily be able to fit orders of magnitude more positions into 8GB of memory.

I did a rough calculation, and the limit you state would mean that each position takes up 40KB of memory. That seems quite excessive! Did you look into reducing that memory footprint?
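(Roughly: 8GB / 200,000 positions = 40KB per position, while a bitboard-based position should be on the order of 100 bytes: 12 piece bitboards at 8 bytes each, plus a little game state.)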
Last edited by lithander on Mon Aug 30, 2021 1:37 pm, edited 2 times in total.
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Evaluation roadblocks

Post by lithander »

algerbrex wrote: Mon Aug 30, 2021 9:51 am Lastly, part of the reason I'm focusing on improving the evaluation is that right now, adding various pruning methods to Blunder doesn't contribute much Elo. My working theory is that the evaluation isn't yet good enough to capitalize on the extra nodes the search gains. Does this sound about right?
The original PeSTO reaches over 3000 Elo with [url=http://rofchade.nl/?p=307]only tapered Piece Square Tables as an evaluation[/url]. This strength can only be explained by it having a great search that explores the right branches deep enough regardless of a simplistic evaluation. So even a simple (well tuned) evaluation should allow you to capitalize on extra nodes in search. If you don't, maybe you're pruning and reducing too many good moves?

I have just released a new version of MinimalChess that searches twice as deep as the previous version with the same NPS. So there's some heavy pruning and reductions going on and my eval is also pretty simplistic. (PST + a 13th table for mobility)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
algerbrex
Posts: 596
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Evaluation roadblocks

Post by algerbrex »

lithander wrote: Mon Aug 30, 2021 1:18 pm
algerbrex wrote: Mon Aug 30, 2021 9:51 am First, with my current hardware (Google Pixelbook, 8GB RAM, i5 processor), I can only load in about 100k-200k positions for tuning.
Intuitively it would seem that you should easily be able to fit orders of magnitude more positions into 8GB of memory.

I did a rough calculation, and the limit you state would mean that each position takes up 40KB of memory. That seems quite excessive! Did you look into reducing that memory footprint?
Yes, you're right. I should've explained a bit more why I was limited to so few positions.

I can easily load ~2-3M positions into memory if I store each position as, say, a FEN string. But it's a trade-off between more positions in memory and speed.

If I load the positions into memory as, say, FEN strings, and then parse each FEN string when calculating the mean squared error, this works, but it becomes incredibly slow as I increase the number of positions. So although I could use this approach, I'm not entirely sure it'd be practical. It's simply been much faster and more practical to load all of the positions into memory as Position objects once at the start of the tuning.
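
For reference, the error I'm minimizing is the standard Texel mean squared error; a sketch of the idea in C# (Position and Evaluate() stand in for my actual types, and K is the scaling constant fitted before tuning):

Code: Select all

    // Texel tuning error: mean squared difference between the game result
    // (1.0 win, 0.5 draw, 0.0 loss) and a sigmoid of the static evaluation.
    // Evaluate() and Position are placeholders for the engine's own types.
    static double MeanSquaredError((Position Pos, double Result)[] entries, double K)
    {
        double sum = 0.0;
        foreach (var (pos, result) in entries)
        {
            double q = Evaluate(pos); // static eval in centipawns
            double sigmoid = 1.0 / (1.0 + Math.Pow(10.0, -K * q / 400.0));
            double diff = result - sigmoid;
            sum += diff * diff;
        }
        return sum / entries.Length;
    }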
algerbrex
Posts: 596
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Evaluation roadblocks

Post by algerbrex »

lithander wrote: Mon Aug 30, 2021 1:32 pm
algerbrex wrote: Mon Aug 30, 2021 9:51 am Lastly, part of the reason I'm focusing on improving the evaluation is that right now, adding various pruning methods to Blunder doesn't contribute much Elo. My working theory is that the evaluation isn't yet good enough to capitalize on the extra nodes the search gains. Does this sound about right?
The original PeSTO reaches over 3000 Elo with [url=http://rofchade.nl/?p=307]only tapered Piece Square Tables as an evaluation[/url]. This strength can only be explained by it having a great search that explores the right branches deep enough regardless of a simplistic evaluation. So even a simple (well tuned) evaluation should allow you to capitalize on extra nodes in search. If you don't, maybe you're pruning and reducing too many good moves?

I have just released a new version of MinimalChess that searches twice as deep as the previous version with the same NPS. So there's some heavy pruning and reductions going on and my eval is also pretty simplistic. (PST + a 13th table for mobility)
Ah yes, I forgot about PeSTO, that is true. Perhaps I do need to play around more with the pruning that's occurring. Your results with MinimalChess sound encouraging, so I'll look back at the dev versions of Blunder with pruning.

Also, I've seen your newest version of MinimalChess and its table for mobility, and I was quite curious: how does using a single table to evaluate mobility work? The usual approach I've seen is a mobility table for each piece.
lithander
Posts: 881
Joined: Sun Dec 27, 2020 2:40 am
Location: Bremen, Germany
Full name: Thomas Jahn

Re: Evaluation roadblocks

Post by lithander »

algerbrex wrote: Mon Aug 30, 2021 3:13 pm It's simply been much faster and more practical to load all of the positions into memory as Position objects once at the start of the tuning.
Sure, I do it the same way. But why would a Position instance be several thousand bytes large?
algerbrex wrote: Mon Aug 30, 2021 3:20 pm Also, I've seen your newest version of MinimalChess and its table for mobility, and I was quite curious: how does using a single table to evaluate mobility work? The usual approach I've seen is a mobility table for each piece.
This is my mobility table...

Code: Select all

        public static short[] MobilityValues = new short[13 * 6]
        {
         //                 opponent                            friendly                  
         // -     P     N     B     R     Q     K      P      N     B     R     Q     K   
            0,    0,   28,   36,   17,   31,   89,     10,    9,   11,    3,   -1,   -5,  // P
            1,   -4,    0,   20,   19,   18,   34,      1,    3,    0,    2,    4,    3,  // N
            2,   -1,   22,    0,   15,   30,   67,     -8,    3,   54,    4,   -1,   -3,  // B
            2,   -1,    6,   15,    0,   31,   36,    -11,    3,   -2,    4,    2,   -1,  // R
            3,   -3,   -3,    4,    1,    0,   75,     -3,    5,    6,    1,  -99,   -4,  // Q
            0,   30,    3,   12,    5,  -99,    0,      6,    4,    7,   -8,    6,    0,  // K
        };
For each piece you look at all the squares it can move to, the pieces it attacks, and the friendly pieces it protects, and you accumulate a bonus by adding up the values you find in this table. For example, the rook (4th row) receives a bonus of 2cp per empty square. It also receives a malus of -11cp for "protecting" a friendly pawn, so the engine is encouraged to give the rook open files and ranks. The engine gets a big bonus for using a rook to threaten opponent bishops (+15) or queens (+31). Pawns are encouraged to protect each other (+10), which helps with the pawn structure.
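
In code the accumulation looks roughly like this (a sketch, not the exact MinimalChess source; PieceTypeOf(), GetTargets() and ContentOf() are placeholders):

Code: Select all

    // Sketch: for each piece, add the table value for every square it
    // attacks, indexed by moving piece type (row) and what sits on the
    // target square (column 0 = empty, 1-6 = opponent, 7-12 = friendly).
    int bonus = 0;
    foreach (int square in ownPieceSquares)
    {
        int row = PieceTypeOf(square); // 0..5 => P, N, B, R, Q, K
        foreach (int target in GetTargets(square)) // pseudo-legal targets
            bonus += MobilityValues[row * 13 + ContentOf(target)];
    }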

The values were initialized to zero and then auto-tuned, but they make some sense in hindsight. Some extremely large values like -99 or +54 seem odd at first glance, until you realize that those situations are just very rare or even impossible. But the smaller values seem to make some sense, and they helped the engine play better. So... that's my thirteenth table! :)
Minimal Chess (simple, open source, C#) - Youtube & Github
Leorik (competitive, in active development, C#) - Github & Lichess
j.t.
Posts: 239
Joined: Wed Jun 16, 2021 2:08 am
Location: Berlin
Full name: Jost Triller

Re: Evaluation roadblocks

Post by j.t. »

algerbrex wrote: Mon Aug 30, 2021 3:13 pm I can easily load ~2-3M positions into memory if I store each position as, say, a FEN string. But it's a trade-off between more positions in memory and speed.
But why not store the position in the format you use in your chess engine? It shouldn't be much bigger than a FEN string, and if you use bitboards it should even be a bit smaller.
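
For example, a bitboard position fits in roughly 100 bytes (a sketch; the exact fields depend on the engine):

Code: Select all

    // Sketch of a compact position: 12 piece bitboards plus a few bytes of
    // game state, ~100 bytes total. An FEN string is often a similar length
    // in characters, but in managed languages it carries per-object overhead
    // on top. 'fixed' keeps the array inline in the struct (needs /unsafe).
    unsafe struct CompactPosition
    {
        public fixed ulong PieceBitboards[12]; // one per piece type and color
        public byte SideToMove;
        public byte CastlingRights;
        public byte EnPassantSquare;
        public byte HalfmoveClock;
    }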

Independently of this, it may indeed become a performance issue to calculate the gradient and error over all positions for every gradient step (though if you don't have a very complex/big evaluation function, it should only be an issue for maybe 10'000'000 or more positions). In that case you might want to use stochastic gradient descent (or, more generally, optimize in batches; this trick can also be used with other optimization methods).
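
One epoch of batched optimization could look like this (a sketch; 'entries', 'weights' and 'learningRate' are placeholders, and ComputeGradient() and ApplyStep() stand in for the tuner's own gradient and update code):

Code: Select all

    // Sketch: one gradient step per mini-batch instead of one per full
    // pass over the training set.
    const int BatchSize = 16384;
    for (int start = 0; start < entries.Length; start += BatchSize)
    {
        int count = Math.Min(BatchSize, entries.Length - start);
        var batch = new ArraySegment<(Position, double)>(entries, start, count);
        double[] gradient = ComputeGradient(batch, weights);
        ApplyStep(weights, gradient, learningRate); // e.g. plain SGD update
    }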

In my experience, however, it makes more sense to invest more time and optimize over as many positions as possible at the same time. This resulted in better-performing evaluations (not only regarding the error, but also in test games).

Also, unless you are very careful in designing your evaluation features, I would always tune all parameters, especially because it shouldn't make a big performance difference.
algerbrex
Posts: 596
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Evaluation roadblocks

Post by algerbrex »

lithander wrote: Mon Aug 30, 2021 4:09 pm Sure, I do it the same way. But why would a Position instance be several thousand bytes large?

Huh, I assumed storing millions of position objects was consuming too much memory, but I never actually checked. You're right: even with the size of my position objects, the amount of RAM used should only be 0.1-0.3 GB, well under 8.

Linux was simply showing that the process had been killed whenever I tried to load millions of positions into memory, so I assumed it was a memory issue. But I definitely need to look back into it now, thanks.
algerbrex
Posts: 596
Joined: Sun May 30, 2021 5:03 am
Location: United States
Full name: Christian Dean

Re: Evaluation roadblocks

Post by algerbrex »

j.t. wrote: Mon Aug 30, 2021 4:17 pm
algerbrex wrote: Mon Aug 30, 2021 3:13 pm I can easily load ~2-3M positions into memory if I store each position as, say, a FEN string. But it's a trade-off between more positions in memory and speed.
But why not store the position in the format you use in your chess engine? It shouldn't be much bigger than a FEN string, and if you use bitboards it should even be a bit smaller.

Independently of this, it may indeed become a performance issue to calculate the gradient and error over all positions for every gradient step (though if you don't have a very complex/big evaluation function, it should only be an issue for maybe 10'000'000 or more positions). In that case you might want to use stochastic gradient descent (or, more generally, optimize in batches; this trick can also be used with other optimization methods).

In my experience, however, it makes more sense to invest more time and optimize over as many positions as possible at the same time. This resulted in better-performing evaluations (not only regarding the error, but also in test games).

Also, unless you are very careful in designing your evaluation features, I would always tune all parameters, especially because it shouldn't make a big performance difference.
Oh, I am, but see my replies to lithander. I need to go back and actually investigate what the issue was when I was trying to load in millions of positions.

Also, I'll try tuning all of the parameters in the evaluation, but from my testing before, this made quite a big performance difference, unless I'm doing something grossly wrong.