How Do You Automatically Tune Your Evaluation Tables

jdart · Post by **jdart** » Wed Mar 12, 2014 7:54 pm

Also see:

http://en.wikipedia.org/wiki/NEWUOA and related algorithms. CONDOR (http://www.applied-mathematics.net/ is one implementation derived from these. Another is in the dlib library (http://dlib.net/optimization.html).

These algorithms are for optimization problems where the function to be optimized is possibly expensive to compute and noisy.

--Jon

Evert · Post by **Evert** » Thu Mar 13, 2014 9:50 am

Let me see if I understand correctly what the issue is here.

In a simple case, where weights do not depend on game-phase, the evaluation function should return values such that if E1 > E2, then the probability to win the position corresponding to E1 (say P1) is larger than the position corresponding to E2 (say P2). The absolute scale of the evaluation is arbitrary, but one typically fixes the value of the pawn (say, VP = 100, but this is again arbitrary).

In practice, the evaluation is really a 2-vector, for middle-game and end-game scores, which is converted to a scalar by a linear weight depending on the material on the board ("game phase"). We should still have the property that if E1 > E2, then the probability of winning P1 is larger than the probability of winning P2. However, we now have two scale factors: for the middle-game and for the end game. One is still arbitrary, but the other one needs to be fixed somehow, otherwise the interpolation makes no sense.

A simple way to do this is to simply set VP(middle-game) = VP(end game), then tuning will make sure that other terms that become relatively more important in the end game get a higher weight in the end game than terms that are less important.

However, fixing the value of the pawn comes at a price: it is not guaranteed that the probability of winning the game given a +1 evaluation (say) is the same in the middle-game as it is in the end game. In fact, this is very probably not true (and I'm not even thinking about pathological cases where the extra pawn is meaningless; those you should detect and score as nearly equal anyway), but this is a tacit assumption in the tuning method described by Peter.

Instead, one should keep the probability P(win; eval) constant for a given evaluation score across different game phases. But if we also fix the value of the pawn in the end game, the problem is overdetermined (there is no free parameter to play with). Conclusion: the end game value of the pawn has to be allowed to vary along with the other weights in order for this tuning method to work properly.

Does that sound right? It does explain why I got nothing sensible when I tried to apply it to just positional weights (but not piece values).

tpetzke · Post by **tpetzke** » Thu Mar 13, 2014 12:34 pm

Hi Evert,

I guess you're right.

My reasoning was that the other endgame related terms in the evaluation can modify the pawn material value enough if required.

A simple case would be a constant to increase/decrease the PSQ values of the pawn. But as there are more pawn related terms there are multiple ways to do that.

However I'm is not sure whether it can really be compensated. Maybe I rerun the test later.

Thomas...

Evert · Post by **Evert** » Thu Mar 13, 2014 12:56 pm

tpetzke wrote: My reasoning was that the other endgame related terms in the evaluation can modify the pawn material value enough if required.

A simple case would be a constant to increase/decrease the PSQ values of the pawn. But as there are more pawn related terms there are multiple ways to do that.

That was my initial thought too: the value of the pawn is just an arbitrary constant, so what does it matter? The other terms just scale to get the correct relative score. But then it occurred to me that the mapping of evaluation score to winning chances is not necessarily constant and could be messed up by fixing the value of the pawn.

I guess one could test if this is true by plotting the winning rate for a given evaluation advantage as a function of game phase (so a 3-D plot). Ideally you'd probably want that to be independent of game phase.

Note that things get messy again when you consider that the communication protocol states that the printed score is supposed to be in "centi-pawns", which does more-or-less imply fixing the (typical) value of the pawn at 100, for both middle and end game. I guess most people ignore this anyway.

Steve Maughan · Post by **Steve Maughan** » Thu Mar 13, 2014 1:44 pm

Hi Evert,

Evert wrote:(...) But then it occurred to me that the mapping of evaluation score to winning chances is not necessarily constant and could be messed up by fixing the value of the pawn.(...)

Exactly!

This effect may turn out to be cosmetic. But if the pawn's value is significantly different in the opening to the endgame (see Marco Belli's post earlier in the thread http://goo.gl/au9O39) then it could be interesting. I suspect a swing of 100% could easily account for the 20 ELO difference noted by Thomas.

Also, an engine with such a low pawn value in the opening is going to have an "interesting" playing style with lots of pawn sacrifices in return for active play.

Steve

petero2 · Post by **petero2** » Thu Mar 13, 2014 10:25 pm

Evert wrote:However, fixing the value of the pawn comes at a price: it is not guaranteed that the probability of winning the game given a +1 evaluation (say) is the same in the middle-game as it is in the end game. In fact, this is very probably not true (and I'm not even thinking about pathological cases where the extra pawn is meaningless; those you should detect and score as nearly equal anyway), but this is a tacit assumption in the tuning method described by Peter.

I don't think there is any such assumption in my method. My method only fixates K, not the value of any evaluation weight (such as the nominal value of a pawn). Also my method optimizes on the scores returned by the q-search function, which is not a vector-value. The interpolation between MG and EG scores has to happen before search sees the score, otherwise alpha-beta would not work.

In fact my method assumes that there is not a perfect match between q-search scores and win probability. The goal of the optimization is to adjust the weights to make the mis-match smaller.

Also note that if the evaluation function internally uses (MG,EG) pairs for its weights, such weights would correspond to two parameters in the optimization problem. You could also use additional parameters to describe the weighting formula. For example, in texel many evaluation terms are weighted like this:

Code: Select all

S = S_mg  if material >= param1
    S_eg  if material <= param2
    S_mg + &#40;S_eg - S_mg&#41; * &#40;param1 - material&#41; / &#40;param1 - param2&#41; otherwise

The parameters "param1" and "param2" are also included in the optimization problem.

Evert wrote:Does that sound right? It does explain why I got nothing sensible when I tried to apply it to just positional weights (but not piece values).

From your post I get the impression that you are not really using the method I described, but instead a variation of it. In my method the q-search score is used to compute E, the set of training positions include non-quiet positions and positions where the evaluation score is very large, and no evaluation weights are fixed to a particular value (such as pawn=100).

It is possible to optimize only a subset of the evaluation parameters. However as I wrote in the description of my method, one big objection to this method is that it assumes that "correlation implies causation". This is also a possible reason you didn't see any improvement. Nevertheless, I have so far increased the strength of texel about 150 elo points in 2.5 months using this method. It is quite likely that the method will stop working at some point though, probably long before texel gets close to stockfish strength.

Evert · Post by **Evert** » Fri Mar 14, 2014 10:49 am

petero2 wrote: I don't think there is any such assumption in my method. My method only fixates K, not the value of any evaluation weight (such as the nominal value of a pawn). Also my method optimizes on the scores returned by the q-search function, which is not a vector-value. The interpolation between MG and EG scores has to happen before search sees the score, otherwise alpha-beta would not work.

Of course; I think you misunderstand what I was trying to say.

Assuming a traditional linear interpolation between mid-game and end-game scores (I gather you do something slightly different, which honestly sounds reasonable), the evaluation is a 2-vector composed of mid-game and end-game which gets projected to a 1D value for the purpose of alpha-beta. In principle, the two evaluations can use a very different scale (say, centi-pawns for the mid-game and milli-pawns for the end game for a slightly stupid example), but they need to be normalised with respect to each other in order to do the interpolation (of course, if you don't do that and get it badly wrong the program plays bad chess). One easy normalisation is just fixing the value of the pawn in both cases (as Thomas was doing). However, when doing that the winning rate given a +0.5 (say) advantage in the end game need not be the same as the winning rate for a +0.5 advantage in the middle game. This would show up as noise if you pass the +1 score through a logistic function and compare that with the actual outcome of the games from which the positions were taken. Contrary to the noise inherent in playing games, this is a systematic error, an inherent spread in the results that you cannot get rid of unless you allow the value of the pawn to change with game phase.

I know I say evaluation rather than q-score, but it comes down to the same thing; the q-search is only needed to make sure you reach a quiet position that the evaluation can say something about.

In fact my method assumes that there is not a perfect match between q-search scores and win probability. The goal of the optimization is to adjust the weights to make the mis-match smaller.

Of course. I do know how an optimisation works.

However, it should still not matter whether a given score comes from a mid-game or an end-game position: the win probability (distribution) for a given score should be independent of game-phase, otherwise you may get swamped by systematic uncertainties. This, it seems to me, is a tacit assumption that is required for the method to work well.

From your post I get the impression that you are not really using the method I described, but instead a variation of it. In my method the q-search score is used to compute E, the set of training positions include non-quiet positions and positions where the evaluation score is very large,

Of course. Unless you filter out non-quiet positions first (say, positions where qscore != eval) you cannot use eval directly, but I don't see why you'd do that (except to throw away useful positions and possibly sabotage your statistics). I do throw away positions that are "close to" mate, since the evaluation is pointless for those anyway.

However as I wrote in the description of my method, one big objection to this method is that it assumes that "correlation implies causation". This is also a possible reason you didn't see any improvement.

Well, I was also trying to use it on a very young engine which is very poorly tuned to begin with (also not an ortho-chess engine, but that hardly matters for the method), where the range of drawish scores is much larger than in chess (a pawn-advantage is worth less than it is in chess). It also doesn't recognise some dead drawn material combinations. This all combines to give a relatively weak correlation between score and win probability, which is probably a contributing factor to why this didn't work. I'll get back to it at some point in the future.

Michel · Post by **Michel** » Fri Mar 14, 2014 10:57 am

Nevertheless, I have so far increased the strength of texel about 150 elo points in 2.5 months using this method.

Wow!

tpetzke · Post by **tpetzke** » Fri Mar 14, 2014 12:12 pm

Of course. Unless you filter out non-quiet positions first (say, positions where qscore != eval) you cannot use eval directly, but I don't see why you'd do that (except to throw away useful positions and possibly sabotage your statistics). I do throw away positions that are "close to" mate, since the evaluation is pointless for those anyway.

A call to eval and a call to qSearch is computational wise a huge difference. If you want non-quiet positions in the set you should replace the non-quiet position with the position from the qSearch the score comes from.

Thomas...

Antonio Torrecillas · Post by **Antonio Torrecillas** » Fri Mar 14, 2014 4:37 pm

tpetzke wrote:
Of course. Unless you filter out non-quiet positions first (say, positions where qscore != eval) you cannot use eval directly, but I don't see why you'd do that (except to throw away useful positions and possibly sabotage your statistics). I do throw away positions that are "close to" mate, since the evaluation is pointless for those anyway.
A call to eval and a call to qSearch is computational wise a huge difference. If you want non-quiet positions in the set you should replace the non-quiet position with the position from the qSearch the score comes from.

In a tactical path , the static or the quiesce values are not trustable.
If you bother with the computational effort of quiesce evaluation. Just consider the positions where selected from a pgn file and a full search has made the move. If, in a given position, a quiesce will return a different value than the static evaluation, there is two possibilities:
The quiesce move was played, then the quiesce value is presumably better.
The quiesce capture was not played so the static and quiesce value are just as good/bad.

This is what I do:When I parse the pgn file I discard every position from where a capture or a check or an evasion or a promote move was made. This is easy and quick to do: you look for a '=','+','x' characters in the pgn move or a '+' in the previous move for an evasion. I discard positions where a tactical shot is developing and I expect the rest is quiet enough.(I do tune with static values).

best regards.
Antonio.

How Do You Automatically Tune Your Evaluation Tables

Re: The texel evaluation function optimization algorithm

Re: The texel evaluation function optimization algorithm

Re: The texel evaluation function optimization algorithm

Re: The texel evaluation function optimization algorithm

Re: The texel evaluation function optimization algorithm

Re: The texel evaluation function optimization algorithm

Re: The texel evaluation function optimization algorithm

Re: The texel evaluation function optimization algorithm

Re: The texel evaluation function optimization algorithm

Re: The texel evaluation function optimization algorithm