How Do You Automatically Tune Your Evaluation Tables

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The texel evaluation function optimization algorithm

Post by AlvaroBegue »

Thanks for that, Peter.

There is another possible definition for the evaluation error, which is the cross-entropy of the results given the evaluation:

E = -1/N * sum(i=1..N, WhiteWon * Log(Sigmoid(qScore(pos))) + BlackWon * Log(1-Sigmoid(qScore(pos))))

where WhiteWon is 1 for a white victory, 1/2 for a draw and 0 for a black victory, and BlackWon is the obvious analogous number for black.
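For concreteness, a minimal sketch of computing this error, assuming a hypothetical qScore(fen) that returns the quiescence score in pawns from white's point of view, and the logistic Sigmoid with the scaling constant C from my earlier description (I got C around 0.58):

Code: Select all

#include <cmath>
#include <string>
#include <vector>

// Hypothetical hook into the engine: quiescence score in pawns, white's view.
double qScore(const std::string& fen);

double Sigmoid(double s, double C = 0.58) {        // predicted winning probability
    return 1.0 / (1.0 + std::exp(-C * s));
}

struct TestPos { std::string fen; double whiteWon; };  // 1, 0.5 or 0

// Cross-entropy of the game results given the evaluation.
double crossEntropy(const std::vector<TestPos>& data) {
    double e = 0;
    for (const TestPos& p : data) {
        double prob = Sigmoid(qScore(p.fen));
        double blackWon = 1.0 - p.whiteWon;
        e += p.whiteWon * std::log(prob) + blackWon * std::log(1.0 - prob);
    }
    return -e / data.size();
}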

I wonder if the resulting weights would be very different.
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: The texel evaluation function optimization algorithm

Post by michiguel »

petero2 wrote:
AlvaroBegue wrote:A different approach is to try to make the evaluation function be a good predictor of game result, without playing games. (Of course I would play many games after tuning to verify the new weights are better than the previous ones.)

The basic mechanism would be something like this:
(1) Get a database of [~1 million?] positions with associated results.
(2) From each position, run quiescence search and extract the position that the evaluation ultimately came from; in the process, write a new database of quiescent positions with associated results.
(3) Define the probability of winning as sigmoid(C*evaluation), where sigmoid(x):=1/(1+exp(-x)) and the constant C is chosen so the evaluation has the usual scale in pawns (I got C=0.58 or something like that, but I am quoting from memory).
(4) Use non-linear regression to estimate the parameters of the evaluation function that maximize the [log-]likelihood of the results.

One needs to do something to handle draws, but probably treating them as half a victory and half a loss would be fine.

Notice that if your evaluation function is a linear combination of terms and you are trying to figure out the coefficients, step (4) is logistic regression.
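A rough sketch of what that amounts to, one gradient-ascent step on the log-likelihood (hypothetical data layout, not any particular engine's code):

Code: Select all

#include <cmath>
#include <vector>

// Sketch: when eval(pos) = sum_j w[j]*feature[j](pos), maximizing the
// log-likelihood of the results is logistic regression.
struct Sample {
    std::vector<double> features; // extracted from the quiescent position
    double result;                // 1 = white win, 0.5 = draw, 0 = black win
};

void gradientStep(std::vector<double>& w, const std::vector<Sample>& data,
                  double C, double learningRate) {
    std::vector<double> grad(w.size(), 0.0);
    for (const Sample& s : data) {
        double eval = 0;
        for (size_t j = 0; j < w.size(); j++)
            eval += w[j] * s.features[j];
        double p = 1.0 / (1.0 + std::exp(-C * eval)); // predicted win probability
        for (size_t j = 0; j < w.size(); j++)
            grad[j] += (s.result - p) * C * s.features[j];
    }
    for (size_t j = 0; j < w.size(); j++)
        w[j] += learningRate * grad[j] / data.size();
}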

I have only done small-scale tests with this idea, but the Junior team seems to have used it extensively, as described in this paper: http://www.ratio.huji.ac.il/node/2362 . They seem to handle draws in a complicated way, but other than that I think their ideas are similar to mine (I haven't read the paper in a while).
Since about one month ago, I have been using something similar to tune the evaluation function in my chess engine texel. Here is a description of the method.

Method

Take 64000 games played at a fast time control (such as 1s+0.08s/move) between the current and/or previous versions of the engine. Extract all positions from those games, except positions within the opening book and positions where the engine found a mate score during the game. This typically gives about 8.8 million positions which are stored as FEN strings in a text file.

Define the average evaluation error E:

Code: Select all

E = 1/N * sum(i=1,N, (result[i] - Sigmoid(qScore(pos[i])))^2)
where:

N is the number of test positions.

result is the result of the game corresponding to position i. 0 for black win, 0.5 for draw and 1 for white win.

qScore(pos) is the value returned by the chess engine quiescence function. The algorithm assumes the qScore function is deterministic. If the qScore function uses transposition tables or the history heuristic, this may not be the case.

Sigmoid(s) = 1 / (1 + 10^(-K*s/400))

K is a scaling constant.

Compute the K that minimizes E. K is never changed again by the algorithm.
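A minimal sketch of how E and the fit of K could be computed, assuming a hypothetical qScore(fen) hook returning centipawns from white's point of view (this is not the actual texel code):

Code: Select all

#include <cmath>
#include <string>
#include <vector>

// Hypothetical hook into the engine: quiescence score in centipawns,
// from white's point of view, for a FEN string.
double qScore(const std::string& fen);

struct TestPos { std::string fen; double result; }; // result: 0, 0.5 or 1

double sigmoid(double score, double K) {
    return 1.0 / (1.0 + std::pow(10.0, -K * score / 400.0));
}

double avgError(const std::vector<TestPos>& data, double K) {
    double e = 0;
    for (const TestPos& p : data) {
        double d = p.result - sigmoid(qScore(p.fen), K);
        e += d * d;
    }
    return e / data.size();
}

// Fit K once with a simple 1-D scan; after that K is never changed.
double fitK(const std::vector<TestPos>& data) {
    double bestK = 0, bestE = 1e30;
    for (double K = 0.5; K <= 2.0; K += 0.01) {
        double e = avgError(data, K);
        if (e < bestE) { bestE = e; bestK = K; }
    }
    return bestK;
}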

If needed, refactor the source code so that the qScore function depends on a set of evaluation function parameters w[j], 1<=j<=M.

The average error E is now a function of the M parameters. Find parameter values such that E is a local minimum in parameter space. The exact optimization method is not that important. It could for example use local search varying one parameter at a time, the Gauss-Newton method, the conjugate gradient method, or a hybrid approach mixing those methods.

If the evaluation function parameters are all of integer type with finite range (which they typically are in a chess engine), the local search method is guaranteed to eventually terminate since there is only a finite number of elements in the parameter space. Also in this case the conventional gradient isn't mathematically defined, but you can use difference quotients instead of the "real" gradient in the GN and CG methods.
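As an illustration, the simplest of these optimizers, local search varying one integer parameter at a time, could look roughly like this (hypothetical setParam/computeE hooks, not the actual texel implementation):

Code: Select all

#include <vector>

// Hypothetical hooks: write parameter j into the evaluation function and
// recompute the average error E over all test positions.
void setParam(int j, int value);
double computeE();

// Vary one integer parameter at a time by +/-1 and keep any change that
// lowers E; stop when no single-parameter change improves E any more.
std::vector<int> localSearch(std::vector<int> w) {
    double bestE = computeE();
    bool improved = true;
    while (improved) {
        improved = false;
        for (int j = 0; j < (int)w.size(); j++) {
            for (int delta : {+1, -1}) {
                setParam(j, w[j] + delta);
                double e = computeE();
                if (e < bestE) {
                    bestE = e;
                    w[j] += delta;
                    improved = true;
                } else {
                    setParam(j, w[j]); // revert to the current best value
                }
            }
        }
    }
    return w;
}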

Advantages

One big advantage of this method is that it can simultaneously optimize several hundred evaluation function parameters in a reasonable amount of time. The first step that collects a set of games does not take any extra time, because those games have already been played when previous changes to the engine were tested. The step that converts from PGN to FEN only takes a few minutes. The time consuming step is to compute the local minimum. In my engine M is currently around 400 and computing the gradient takes around 25 minutes on a 16-core Dell T620 computer. A local minimum can usually be computed within 6 hours, faster if only small changes to the parameter values are needed.

While 6 hours may seem like a lot, consider how long it would take CLOP to simultaneously optimize 400 parameters (assuming you have enough memory to actually do that). I have not worked out the math but I guess it would take at least ten years to get reliable results.

Another advantage is that no external source of knowledge is needed. You don't need to find a large set of games played by engines or humans that are stronger than the engine you want to improve. While this may not be a practical advantage for the current strength of my engine, I still find this property theoretically pleasing, somewhat similar to the concept of self calibration.

A third advantage is that the need for different evaluation terms to be "orthogonal" disappears, since the optimization will automatically deal with dependencies between evaluation terms.

A fourth advantage, compared to the method used by Amir Ban to tune the Junior evaluation function, is that my method does not need the implementation of a "drawishness" function in the chess engine.

Concerns

My biggest concern about this method is what statisticians call "correlation does not imply causation". I don't have a specific example for chess, but the Wikipedia example about ice cream sales and drowning accidents is quite illuminating. It is a known fact that when ice cream sales increase, so do drowning accidents, but that does not imply that stopping ice cream sales would reduce the number of drowning accidents. It is conceivable that the chess engine would learn to appreciate positional features that are correlated to increased winning chances, even though actively striving to reach such positions does not increase winning chances. Anyway the method does work in my engine, possibly because of the counter argument that "correlation is not causation but it sure is a hint".

Another concern is that the chess engine will not be able to learn things that it does not understand at all. For example, assume the chess engine did not know how to win KBNK endings, so all games leading to this endgame would end in a draw. Because of this, E would tend to be smaller if the evaluation function scored all KBNK endings as draws, so there is a risk the evaluation actually gets worse after optimization.

On the other hand, the algorithm can be expected to improve the engine's knowledge about things it already partially knows, or about things that are beneficial even if the engine does not know why. For example, assume the engine does not know that the bishop pair deserves a bonus. However, if the engine has general knowledge about mobility, it is likely to win more than it loses anyway when it has the bishop pair advantage. Therefore the engine will be able to learn that the bishop pair deserves a bonus. If the algorithm is then repeated with a new set of games, the engine might learn that the bishop pair is even better once it knows that it should not unnecessarily trade a bishop.

It may also be argued that since more than one position is picked from each game (in fact about 140 positions per game on average), the result is invalid because there is a dependence between terms in the summation that defines E, whereas least squares theory in general assumes that different data points are independent. The way I see it, E has contributions from 64000 independent events, and the fact that there are 140 times more terms in the summation is similar to what would happen if you solved a weighted least squares problem where the weights are determined by how often a particular type of position occurs in typical games.

While it would probably be better (or at least equally good) to take one position each from 8.8 million different games, obtaining that many games would require more computation time than I am willing to spend. I believe 64000 independent events is more than enough to estimate 400 parameters anyway.

Results

When I started using this method, I calculated that K=1.13 best matched the then current evaluation function in my engine. I have not changed the value of K since then and do not intend to change it in the future either, since it is just an arbitrary scaling constant.

Since 2014-01-01 when the first evaluation change based on this algorithm was included in my engine, the elo improvements caused by evaluation weight tuning (measured using 32000 games at 1+0.08) have been:

24.6 + 4.0 + 5.8 + 2.8 + 12.8 + 39.4 + 10.2 = 99.6 elo

The 24.6 improvement was when I made most of the evaluation parameters accessible to the optimization algorithm. The next three improvements came when I exposed more evaluation parameters to the algorithm in three stages. The 12.8 improvement came when I re-created the set of test positions based on the most recently played games.

The 39.4 improvement came when I changed the criteria for which positions to include in the test set. Initially I also removed all positions where the q-search score deviated too much from the search score in the actual game (which conveniently is saved by cutechess-cli in PGN comments). I believed that including those positions would just raise the "noise level" of the data and cause a worse solution to be found. Apparently this is not the case. I now believe that even though including these positions causes noise, the q-search function has to deal with them all the time in real games, so trying to learn how those positions should be evaluated on average is still beneficial.

The 10.2 improvement came when I added separate queen piece square tables for middle game and end game. Previously texel only had one queen PST, which had horizontal, vertical and diagonal symmetry.

The last improvement was made after the texel 1.03 release. The others are included in the 1.03 release and I believe they are responsible for most of the strength increase in texel 1.03.

I don't know if the large strength increase was made possible because the method is good or because the texel evaluation function was particularly bad before I started using this algorithm. Anyway I don't expect to be able to get equally big improvements in the future as I got in texel 1.03, but I hope the algorithm will nevertheless help finding a fair amount of smaller evaluation improvements.

Future improvements

Currently local search is used to find a local optimum. I believe it would be faster to initially use a few Gauss-Newton iterations and then switch to local search when the remaining corrections are small.
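For reference, one Gauss-Newton iteration could look roughly like this, using difference quotients for the Jacobian (a sketch only, with a hypothetical residuals() hook and Eigen assumed for the linear algebra; real code would round back to integer parameters and guard against a singular normal matrix):

Code: Select all

#include <Eigen/Dense>
#include <vector>

// Hypothetical hook: residuals r[i] = result[i] - Sigmoid(qScore(pos[i]))
// for the given parameter vector w, so that E = sum(r[i]^2) / N.
std::vector<double> residuals(const std::vector<double>& w);

// One Gauss-Newton step: solve (J^T J) delta = J^T r and update w -= delta,
// where J is a finite-difference Jacobian of the residuals.
std::vector<double> gaussNewtonStep(std::vector<double> w, double h = 1.0) {
    std::vector<double> r0 = residuals(w);
    int N = (int)r0.size(), M = (int)w.size();
    Eigen::MatrixXd J(N, M);
    Eigen::VectorXd r(N);
    for (int i = 0; i < N; i++) r(i) = r0[i];
    for (int j = 0; j < M; j++) {              // difference quotients
        std::vector<double> wp = w;
        wp[j] += h;
        std::vector<double> rp = residuals(wp);
        for (int i = 0; i < N; i++) J(i, j) = (rp[i] - r0[i]) / h;
    }
    Eigen::VectorXd delta =
        (J.transpose() * J).ldlt().solve(J.transpose() * r);
    for (int j = 0; j < M; j++) w[j] -= delta(j);
    return w;
}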

A number of evaluation parameters have values that appear quite unintuitive. For example, the middle game value of a white queen on b7/g7 (and a black queen on b2/g2 because of symmetry) is -128cp. I have not investigated the cause yet, but I believe the explanation is that almost always when a white/black queen is on b7/b2 in the early middle game, it is because it has just eaten the "poisoned pawn", and the opponent is about to win back a pawn by playing Rb1 and Rxb7. If placing the queen on b7/g7/b2/g2 for other reasons is much less common, the optimization algorithm will assign a large penalty for a queen on these squares. While this improves the evaluation function in the average case, it will also cause large evaluation errors for the uncommon cases that do not fall under the "poisoned pawn" category. Implementing a more specific rule about poisoned pawns could be profitable.

Other examples of odd parameter values include: a white king on b8, which in the middle game gives white a 200cp bonus; a white knight on a8 in the middle game, which gives white a 223cp penalty; and the pawn race bonus, which is only 157cp even though it is supposed to trigger only when one side can promote a queen and the opponent can not.

Investigating the cause for these and other odd looking parameter values may suggest further improvements to the evaluation function.

Another interesting approach is to apply data mining techniques on the residuals to try to discover missing knowledge in the evaluation function. I don't know how successful this approach will be. It may be more profitable to test different variants of already known chess evaluation principles instead.


It looks almost identical to what I have been doing with Gaviota
http://www.talkchess.com/forum/viewtopi ... =&start=11
But it did not generate much interest.

The main difference is that I do not use quiescence search values, but the score of eval. For that reason, I use positions that are already quiescent. Also, I do not use games played by Gaviota, but games from computer databases, which allows me to use a handful of millions of positions. I am not sure if that many are needed.

The technique has some other advantages. For instance, you can tell what parameters are really not doing anything and become candidates to be removed.

Miguel
PS: I will read your post in more detail.
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

Hi Peter & Miguel,

thanks for sharing this. I will definitely give it a try in combination with my genetic algorithm framework that I already have.

The GA will just do the optimization of the parameters, but I will use your method as an alternative fitness function. Probably, like Miguel, I will use quiet positions, since then I only need a fast eval call and not a slower qsearch call. So it's a bit more pre-processing to save time later.

My current algorithm uses game play in combination with solving a test set of easy positional problems. A parameter set that solves significantly fewer positions than the winner of the last tournament is not allowed to play; it will not win the tournament anyway. This saves a bit of time in the early generations; in the later ones (where the entropy has already dropped), all parameter sets unfortunately clear that threshold.

This makes the fitness function very expensive and so the algorithm takes a lot of time (2-3 weeks for a reasonable convergence of about 500 bits in my evaluation weights).

When I first optimized my evaluation between iCE 0.3 and iCE 1.0 I gained more than 100 ELO, but the gains from subsequent runs after eval modifications are now much smaller, so it is almost not worth the effort. Your approach might give it another boost.

My game play fitness function is still beneficial for tuning the parameters that control the search (futility margins, LMR, null move margins etc.). I can't imagine an alternative to game play in that area.

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: The texel evaluation function optimization algorithm

Post by Evert »

petero2 wrote:the pawn race bonus which is only 157cp even though it is supposed to trigger only when one side can promote a queen and the opponent can not.

Investigating the cause for these and other odd looking parameter values may suggest further improvements to the evaluation function.
This one I think I understand. At the very least, I've tried tuning the value of an unstoppable pawn several times (in more than one program), and it always converges to slightly below a minor for me too (so a bonus of ~150cp).

The reason is that the unstoppable pawn is not worth a queen, it is worth (slightly less than) the least valuable piece that the opponent is able to sacrifice to keep it from promoting. In general, that would be a minor piece.
This does suggest that the value should depend on the material remaining on the board, and the value should approach that of a queen when the pawn can never be captured at all, and it should approach the value of a rook when the opponent has no minors (but does have rooks). Experiments in that direction never produced any measurable improvements though...
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

I only bother with unstoppable pawns when the opponent has no piece left. Unlike all my other eval code this one is dependent on the side to move (1 tempo can make the difference) so it interferes with my eval cache that stores evaluations independent of the stm.

Therefore I limit the positions where this code is executed and usually if no pieces are left my eval is pretty fast, so here it does not hurt a lot.

Thomas
Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
User avatar
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: The texel evaluation function optimization algorithm

Post by Evert »

tpetzke wrote:I only bother with unstoppable pawns when the opponent has no piece left. Unlike all my other eval code this one is dependent on the side to move (1 tempo can make the difference) so it interferes with my eval cache that stores evaluations independent of the stm.
I keep track of whether any part of the evaluation would depend on side to move and if it does I skip the eval cache (or rather, I skip flipping the sign and storing the position with reverse colours).
Probably by the time this really matters the game is decided anyway...
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: The texel evaluation function optimization algorithm

Post by petero2 »

michiguel wrote:It looks almost identical to what I have been doing with Gaviota
http://www.talkchess.com/forum/viewtopi ... =&start=11
Interesting. I also have the impression that it has made my engine play more "wild" moves. Sometimes that leads to beautiful crushing wins, but unfortunately at other times it leads to humiliating losses where my engine looks like a patzer.
michiguel wrote:The technique has some other advantages. For instance, you can tell what parameters are really not doing anything and become candidates to be removed.
I have noticed this too, for example the protected passed pawn bonus in my engine is currently 0. In this case it is a candidate for further investigation, not removal, because I believe a protected passed pawn is supposed to be a good thing. Maybe the implementation is broken somehow.
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: The texel evaluation function optimization algorithm

Post by petero2 »

Evert wrote:
petero2 wrote:the pawn race bonus which is only 157cp even though it is supposed to trigger only when one side can promote a queen and the opponent can not.

Investigating the cause for these and other odd looking parameter values may suggest further improvements to the evaluation function.
This one I think I understand. At the very least, I've tried tuning the value of an unstoppable pawn several times (in more than one program), and it always converges to slightly below a minor for me too (so a bonus of ~150cp).

The reason is that the unstoppable pawn is not worth a queen, it is worth (slightly less than) the least valuable piece that the opponent is able to sacrifice to keep it from promoting. In general, that would be a minor piece.
This does suggest that the value should depend on the material remaining on the board, and the value should approach that of a queen when the pawn can never be captured at all, and it should approach the value of a rook when the opponent has no minors (but does have rooks). Experiments in that direction never produced any measurable improvements though...
My pawn race bonus is only used in pawn endgames. It is based on a description from Robert Hyatt of how the pawn race code in Crafty works. It takes a number of things into account:

1. The pawn distance to the promotion square for the most advanced white and black pawn.
2. The side to move.
3. If the opponent king is outside the "square" of the pawn (a sketch of this test is given at the end of this post).
4. If a king is blocking its own pawn.
5. If a pawn will give check when it promotes.

Things that are not taken into account:
1. Whether the promoted pawn will control the promotion square of the opponent's pawn.
2. After promotion the result could be a drawn KQKP endgame with a pawn on a2/c2/f2/h2.
3. Tactical stuff like in the Reti endgame study.
4. If one side has more than one passed pawn, only the most advanced pawn that "outsquares" the opponent king is considered. This may not be the best pawn to advance.

My assumption was that the ignored factors are rare enough not to matter, but maybe that is an invalid assumption.
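For reference, a simplified sketch of the "square of the pawn" test from item 3 above (generic code, not the actual texel implementation; it ignores obstruction and all the other listed factors):

Code: Select all

#include <algorithm>
#include <cstdlib>

// Simplified "square of the pawn" test for a white passed pawn running
// towards rank 8. Squares are 0..63 with a1 = 0.
bool kingOutsideSquare(int pawnSq, int blackKingSq, bool blackToMove) {
    int pawnRank = pawnSq / 8, pawnFile = pawnSq % 8;
    if (pawnRank == 1) pawnRank = 2;            // pawn may advance two squares
    int stepsToPromote = 7 - pawnRank;
    int kingRank = blackKingSq / 8, kingFile = blackKingSq % 8;
    int kingDist = std::max(std::abs(kingRank - 7),
                            std::abs(kingFile - pawnFile));
    if (!blackToMove) stepsToPromote--;         // white moves first: pawn gains a tempo
    return kingDist > stepsToPromote;           // true: the king cannot catch the pawn
}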
petero2
Posts: 684
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: The texel evaluation function optimization algorithm

Post by petero2 »

AlvaroBegue wrote:There is another possible definition for the evaluation error, which is the cross-entropy of the results given the evaluation:

E = -1/N * sum(i=1..N, WhiteWon * Log(Sigmoid(qScore(pos))) + BlackWon * Log(1-Sigmoid(qScore(pos))))

where WhiteWon is 1 for a white victory, 1/2 for a draw and 0 for a black victory, and BlackWon is the obvious analogous number for black.

I wonder if the resulting weights would be very different.

Thanks for this formula. I have now tested it and have observed the following:

1. It took my local search algorithm around 19 hours to find a local minimum. For some reason the search was "spiraling" in towards the minimum instead of going more directly towards it. The Gauss Newton method could perhaps have avoided this problem.

2. Many evaluation weights were changed. Most by a small amount, but some by a large amount. See this diff

3. The effect on playing strength (-2.2 elo) in a 30000 game match was too small to say for sure which version is best:

Code: Select all

Rank Name           Elo      +      - games   score   oppo.   draws 
   1 Texel104a4     1.1    3.9    3.7 30222   50.4%    -1.1   39.2% 
   2 TexelHead     -1.1    3.7    3.9 30222   49.6%     1.1   39.2% 
            Tex Tex
Texel104a4      910
TexelHead    89    
4. I also looked at different subsets of the games where particular types of material imbalances were present. The filtering only detects if the imbalance is present in a game; it does not try to determine if the imbalance was stable, so imbalances from the middle of capture sequences are also included. Anyway:

Code: Select all

         Rank Name         Elo    +    - games score oppo. draws 
QvsRR  :    1 TexelHead     21   12   12   608   55%   -21   21% 
QvsRM  :    1 TexelHead     24   11   10   960   56%   -24   14% 
QvsRMM :    1 TexelHead     38   20   20   244   60%   -38   17% 
BvsPPP :    1 TexelHead     10    8    8  1515   53%   -10   21% 
NvsPPP :    1 TexelHead     12    9    9  1349   53%   -12   22% 
BvsPPPP:    1 TexelHead     28   15   14   443   58%   -28   23% 
NvsPPPP:    1 TexelHead     24   16   16   356   57%   -24   21% 
Note that these subsets of games were all drawn from the same set of 30000 games, so I'm not sure how statistically significant the results are. However it appears to me that the new version is better at handling some types of material imbalances.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The texel evaluation function optimization algorithm

Post by AlvaroBegue »

Wow, thanks for testing the alternative formula! It's fun to go through the weights to see what changes quite a bit (e.g., a white bishop on d1 and e1 during the middle game gets much more penalized, if I am reading the tables correctly). I wonder to what extent this is just reflecting the fact that those scores are not very well determined by the data (perhaps they don't happen very often?).

Your 30,000-game test seems to indicate that the exact formula you use doesn't matter a whole lot for playing performance. I wouldn't try to read too much into the performance in small subsets of that test.