Hi Peter & Miguel,
thanks for sharing this. I will definitely give it a try in combination with the genetic algorithm framework that I already have.
The GA will just do the optimization of the parameters, but I will use your method as an alternative fitness function. Probably, like Miguel, I will use quiet positions, as I then only need a fast eval call instead of a slower qsearch call. So it's a bit more pre-processing to save time later.
My current algorithm uses game play in combination with solving a test set of easy positional problems. A parameter set that solves significantly fewer positions than the winner of the last tournament is not allowed to play; it would not win the tournament anyway. This saves a bit of time in the early generations; in the later ones (where the entropy has already dropped) all parameter sets unfortunately clear that threshold.
This makes the fitness function very expensive, so the algorithm takes a lot of time (2-3 weeks for a reasonable convergence of about 500 bits in my evaluation weights).
When I first optimized my evaluation between iCE 0.3 and iCE 1.0 I gained more than 100 ELO; the gains from subsequent runs after eval modifications are now much smaller, so it is almost not worth the effort. Your approach might give it another boost.
My game-play fitness function is still beneficial for tuning the parameters that control the search (futility margins, LMR, null move margins etc.). I can't imagine an alternative to game play in that area.
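The gating scheme described above could be sketched like this (a minimal illustration with hypothetical callback names, not the actual framework code):

```python
def gated_fitness(candidate, champion_solved, test_set, solve, play_games,
                  threshold=0.9):
    """Fitness gate: a candidate must solve nearly as many easy test
    positions as the last tournament winner before it is allowed to
    enter the (expensive) game-play stage. `solve` and `play_games`
    are engine-supplied callbacks."""
    solved = sum(1 for pos in test_set if solve(candidate, pos))
    if solved < threshold * champion_solved:
        return 0.0  # too weak: skip the expensive game-play stage
    return play_games(candidate)
```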
Thomas...
Re: The texel evaluation function optimization algorithm
petero2 wrote: the pawn race bonus which is only 157cp even though it is supposed to trigger only when one side can promote a queen and the opponent can not. Investigating the cause for these and other odd looking parameter values may suggest further improvements to the evaluation function.

This one I think I understand, at the very least I've tried tuning the value of an unstoppable pawn several times (in more than one program) and it always converges to slightly below a minor for me too (so a bonus of ~150cp).
The reason is that the unstoppable pawn is not worth a queen, it is worth (slightly less than) the least valuable piece that the opponent is able to sacrifice to keep it from promoting. In general, that would be a minor piece.
This does suggest that the value should depend on the material remaining on the board, and the value should approach that of a queen when the pawn can never be captured at all, and it should approach the value of a rook when the opponent has no minors (but does have rooks). Experiments in that direction never produced any measurable improvements though...
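That reasoning can be made concrete with a small sketch (piece values and the margin are illustrative assumptions, not anyone's actual engine code): the bonus is capped by the cheapest piece the opponent could sacrifice to stop the pawn.

```python
# Illustrative centipawn piece values (not from any particular engine).
QUEEN, ROOK, MINOR = 950, 500, 325

def unstoppable_pawn_bonus(opp_minors, opp_rooks):
    """Value of an unstoppable passed pawn: slightly less than the
    cheapest piece the opponent can give up to stop promotion, and
    close to a queen if nothing can stop or trade off the new queen."""
    if opp_minors > 0:
        cap = MINOR
    elif opp_rooks > 0:
        cap = ROOK
    else:
        cap = QUEEN          # nothing can profitably stop the pawn
    return cap - 25          # "slightly less than": the margin is a guess
```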
Re: The texel evaluation function optimization algorithm
I only bother with unstoppable pawns when the opponent has no piece left. Unlike all my other eval code this one is dependent on the side to move (1 tempo can make the difference) so it interferes with my eval cache that stores evaluations independent of the stm.
Therefore I limit the positions where this code is executed and usually if no pieces are left my eval is pretty fast, so here it does not hurt a lot.
Thomas
Re: The texel evaluation function optimization algorithm
tpetzke wrote: I only bother with unstoppable pawns when the opponent has no piece left. Unlike all my other eval code this one is dependent on the side to move (1 tempo can make the difference) so it interferes with my eval cache that stores evaluations independent of the stm.

I keep track of whether any part of the evaluation would depend on side to move and if it does I skip the eval cache (or rather, I skip flipping the sign and storing the position with reverse colours).
Probably by the time this really matters the game is decided anyway...
Re: The texel evaluation function optimization algorithm
michiguel wrote: It looks almost identical to what I have been doing with Gaviota
http://www.talkchess.com/forum/viewtopi ... =&start=11

Interesting. I also have the impression that it has made my engine play more "wild" moves. Sometimes that leads to beautiful crushing wins, but unfortunately at other times it leads to humiliating losses where my engine looks like a patzer.
michiguel wrote: The technique has some other advantages. For instance, you can tell what parameters are really not doing anything and become candidates to be removed.

I have noticed this too, for example the protected passed pawn bonus in my engine is currently 0. In this case it is a candidate for further investigation, not removal, because I believe a protected passed pawn is supposed to be a good thing. Maybe the implementation is broken somehow.
Re: The texel evaluation function optimization algorithm
Evert wrote:
petero2 wrote: the pawn race bonus which is only 157cp even though it is supposed to trigger only when one side can promote a queen and the opponent can not. Investigating the cause for these and other odd looking parameter values may suggest further improvements to the evaluation function.

This one I think I understand, at the very least I've tried tuning the value of an unstoppable pawn several times (in more than one program) and it always converges to slightly below a minor for me too (so a bonus of ~150cp). The reason is that the unstoppable pawn is not worth a queen, it is worth (slightly less than) the least valuable piece that the opponent is able to sacrifice to keep it from promoting. In general, that would be a minor piece. This does suggest that the value should depend on the material remaining on the board, and the value should approach that of a queen when the pawn can never be captured at all, and it should approach the value of a rook when the opponent has no minors (but does have rooks). Experiments in that direction never produced any measurable improvements though...

My pawn race bonus is only used in pawn endgames. It is based on a description from Robert Hyatt of how the pawn race code in Crafty works. It takes a number of things into account:
1. The pawn distance to the promotion square for the most advanced white and black pawn.
2. The side to move.
3. If the opponent king is outside the "square" of the pawn.
4. If a king is blocking its own pawn.
5. If a pawn will give check when it promotes.
Things that are not taken into account:
1. Whether the promoted pawn will control the promotion square of the opponent's pawn.
2. After promotion the result could be a drawn KQKP endgame with a pawn on a2/c2/f2/h2.
3. Tactical stuff like in the Reti endgame study.
4. If one side has more than one passed pawn, only the most advanced pawn that "outsquares" the opponent king is considered. This may not be the best pawn to advance.
My assumption was that the ignored factors are rare enough not to matter, but maybe that is an invalid assumption.
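For reference, factor 3 in the list, the classic "square of the pawn" rule, can be sketched like this (the coordinate convention and tempo handling are a simplification, not Crafty's actual code):

```python
def outside_square(pawn_file, pawn_rank, king_file, king_rank,
                   pawn_is_white, defender_to_move):
    """True if the defending king cannot catch the pawn (rule of the
    square). Files and ranks are 0-7, ranks counted from White's side.
    A pawn on its start rank effectively stands one rank further
    forward because of the double step."""
    if pawn_is_white:
        if pawn_rank == 1:
            pawn_rank = 2            # account for the double step
        dist_to_promo = 7 - pawn_rank
        king_dist = max(abs(king_file - pawn_file), 7 - king_rank)
    else:
        if pawn_rank == 6:
            pawn_rank = 5
        dist_to_promo = pawn_rank
        king_dist = max(abs(king_file - pawn_file), king_rank)
    if defender_to_move:
        king_dist -= 1               # defender moves first: gains a tempo
    return king_dist > dist_to_promo
```

For example, a white pawn on a2 with the black king on h8 and White to move is outside the square (the king needs seven moves, the pawn only five).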
Re: The texel evaluation function optimization algorithm
AlvaroBegue wrote: There is another possible definition for the evaluation error, which is the cross-entropy of the results given the evaluation:
E = -1/N * sum(i=1..N, WhiteWon * Log(Sigmoid(qScore(pos))) + BlackWon * Log(1-Sigmoid(qScore(pos))))
where WhiteWon is 1 for a white victory, 1/2 for a draw and 0 for a black victory, and BlackWon is the obvious analogous number for black.
I wonder if the resulting weights would be very different.
Thanks for this formula. I have now tested it and have observed the following:
1. It took my local search algorithm around 19 hours to find a local minimum. For some reason the search was "spiraling" in towards the minimum instead of going more directly towards it. The Gauss Newton method could perhaps have avoided this problem.
2. Many evaluation weights were changed. Most by a small amount, but some by a large amount. See this diff
3. The effect on playing strength (-2.2 elo) in a 30000 game match was too small to say for sure which version is best:
Code:
Rank Name Elo + - games score oppo. draws
1 Texel104a4 1.1 3.9 3.7 30222 50.4% -1.1 39.2%
2 TexelHead -1.1 3.7 3.9 30222 49.6% 1.1 39.2%
Tex Tex
Texel104a4 910
TexelHead 89
Code:
Rank Name Elo + - games score oppo. draws
QvsRR : 1 TexelHead 21 12 12 608 55% -21 21%
QvsRM : 1 TexelHead 24 11 10 960 56% -24 14%
QvsRMM : 1 TexelHead 38 20 20 244 60% -38 17%
BvsPPP : 1 TexelHead 10 8 8 1515 53% -10 21%
NvsPPP : 1 TexelHead 12 9 9 1349 53% -12 22%
BvsPPPP: 1 TexelHead 28 15 14 443 58% -28 23%
NvsPPPP: 1 TexelHead 24 16 16 356 57% -24 21%
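For concreteness, the two objective functions being compared here can be written as a short sketch; the sigmoid scaling constant K is a placeholder, not the value Texel actually uses.

```python
import math

K = 1.0 / 400.0   # placeholder scaling constant; engine-specific in practice

def sigmoid(score_cp):
    """Map a centipawn q-search score to an expected result in (0, 1)."""
    return 1.0 / (1.0 + 10.0 ** (-K * score_cp))

def cross_entropy_error(scores, results):
    """Alvaro's objective: E = -1/N * sum(w*log(s) + (1-w)*log(1-s)),
    where w is 1 / 0.5 / 0 for a white win / draw / black win."""
    n = len(scores)
    return -sum(w * math.log(sigmoid(q)) + (1 - w) * math.log(1 - sigmoid(q))
                for q, w in zip(scores, results)) / n

def squared_error(scores, results):
    """The original Texel objective: mean squared prediction error."""
    n = len(scores)
    return sum((w - sigmoid(q)) ** 2 for q, w in zip(scores, results)) / n
```

Both objectives have the same minimizer in the limit of a perfectly calibrated model, which is consistent with the small strength difference measured above.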
Re: The texel evaluation function optimization algorithm
Wow, thanks for testing the alternative formula! It's fun to go through the weights to see what changes quite a bit (e.g., a white bishop on d1 and e1 during the middle game gets much more penalized, if I am reading the tables correctly). I wonder to what extent this is just reflecting the fact that those scores are not very well determined by the data (perhaps they don't happen very often?).
Your 30,000-game test seems to indicate that the exact formula you use doesn't matter a whole lot for playing performance. I wouldn't try to read too much into the performance in small subsets of that test.
Re: The texel evaluation function optimization algorithm
AlvaroBegue wrote: Wow, thanks for testing the alternative formula! It's fun to go through the weights to see what changes quite a bit (e.g., a white bishop on d1 and e1 during the middle game gets much more penalized, if I am reading the tables correctly). I wonder to what extent this is just reflecting the fact that those scores are not very well determined by the data (perhaps they don't happen very often?). Your 30,000-game test seems to indicate that the exact formula you use doesn't matter a whole lot for playing performance. I wouldn't try to read too much into the performance in small subsets of that test.

Actually it is the score for a white bishop on d8/e8 in the middle game that is changed from -5 to -50.
The tables are defined for black pieces (hence the final "b" in the variable names) and the white piece tables are mirror versions implemented by the TableParamMirrored class. All evaluation values are defined so that positive values are good for white (and the score is negated as the last step in the evaluation function in case side to move == black).
All in all, this means that you can think of the tables as valid for white pieces with the usual orientation of the chess board, so that a1 = lower left corner, h1 = lower right corner, a8 = upper left corner.
I have a function that determines how often a particular evaluation weight affects the q-search score. For a white bishop on d8/e8 in the middle games, the result is 0.26% of the positions (from a total of about 8.8 million positions) and 7.2% of the games (from a total of 64000 games).
I don't know how stable the parameter estimates are for the less frequently used evaluation weights. I will readily confess that I am nowhere near finished investigating this tuning algorithm.
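One way to implement such a usage counter (a sketch with hypothetical engine callbacks, not Texel's real code) is to perturb the weight and count the positions whose q-search score changes:

```python
def weight_usage(positions, qscore, set_weight, get_weight, name, delta=1):
    """Fraction of positions whose q-search score changes when the
    evaluation weight `name` is nudged by `delta`. `qscore`,
    `set_weight` and `get_weight` are engine-supplied callbacks."""
    base = get_weight(name)
    affected = 0
    for pos in positions:
        before = qscore(pos)
        set_weight(name, base + delta)
        after = qscore(pos)
        set_weight(name, base)           # restore the original weight
        if after != before:
            affected += 1
    return affected / len(positions)
```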
Re: The texel evaluation function optimization algorithm
Hi Peter,

petero2 wrote: The average error E is now a function of the M parameters. Find parameter values such that E is a local minimum in parameter space. The exact optimization method is not that important. It could for example use local search varying one parameter at a time, the Gauss-Newton method, the conjugate gradient method, or a hybrid approach mixing those methods.
If the evaluation function parameters are all of integer type with finite range (which they typically are in a chess engine), the local search method is guaranteed to eventually terminate since there is only a finite number of elements in the parameter space. Also in this case the conventional gradient isn't mathematically defined, but you can use difference quotients instead of the "real" gradient in the GN and CG methods.
This optimization method sounds really nice, and thanks for the really clear explanation! I'd been wanting to try something like it for a long time.
I've got the engine (Stockfish [1]) all hooked up to expose its internal eval features, and built up the FEN database with game results from recent games. The part I don't seem to be moving very fast on is the actual optimization. My math skills are on the lower side, so I was just trying out the scipy optimization toolkit. It has a conjugate-gradient method built in, but it doesn't seem to be doing a great job so far [2].
Can you be more specific about what you meant by using the difference quotient?
Thanks!
Gary
[1] https://github.com/glinscott/Stockfish/ ... r...tuning
[2] https://gist.github.com/glinscott/8814218
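On the difference-quotient question: it simply replaces the analytic partial derivative dE/dp_i with the forward quotient (E(p + h*e_i) - E(p)) / h, with h = 1 for integer parameters. A minimal sketch of that, together with one pass of the coordinate-wise local search mentioned in the quote (the error function in the test is a stand-in, not a real engine objective):

```python
def finite_diff_gradient(error, params, h=1):
    """Approximate the gradient of `error` at integer-valued `params`
    using forward difference quotients with step `h`."""
    base = error(params)
    grad = []
    for i in range(len(params)):
        bumped = list(params)
        bumped[i] += h
        grad.append((error(bumped) - base) / h)
    return grad

def local_search_step(error, params):
    """One pass of coordinate-wise local search: try +/-1 on each
    parameter in turn and keep any change that lowers the error."""
    best = error(params)
    params = list(params)
    for i in range(len(params)):
        for step in (1, -1):
            trial = list(params)
            trial[i] += step
            e = error(trial)
            if e < best:
                best, params = e, trial
                break
    return params, best
```

Repeating `local_search_step` until no parameter change helps is guaranteed to terminate for integer parameters with finite range, as noted in the quoted text.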