How Do You Automatically Tune Your Evaluation Tables

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The texel evaluation function optimization algorithm

Post by AlvaroBegue »

Antonio Torrecillas wrote:
so I'm looking for cheaper alternatives
Methods with quick convergence tend to be prone to overfitting.
My understanding of overfitting is that it's primarily a problem of the model being too flexible (i.e., having too many parameters to tune). I say "primarily" because there are "early stopping" and regularization methods that can help ameliorate overfitting in some cases.

[Disclaimer: I don't have much experience with tuning chess evaluation functions, but I have some experience with optimization in other contexts for my work.]
Antonio Torrecillas
Posts: 90
Joined: Sun Nov 02, 2008 4:43 pm
Location: Barcelona

Re: The texel evaluation function optimization algorithm

Post by Antonio Torrecillas »

My understanding of overfitting is that it's primarily a problem of the model being too flexible (i.e., having too many parameters to tune). I say "primarily" because there are "early stopping" and regularization methods that can help ameliorate overfitting in some cases.
By overfitting I mean: parameters take on wrong values.

For example: reset all parameters to zero and tune just one parameter, the pawn value. The result will be overestimated, say 187, because the pawn value alone has to account for the whole evaluation. Next tune the knight value; it has to be above the pawn value, about the same as a bishop, and below a rook or queen, all of which are still set to zero. So the knight gets a value of, say, 217. Each step incorporates a new feature into the eval, and while each step is an improvement, the result is far from optimal. Each time we try to explain a wrong evaluation with the wrong feature, we get a bad value (the pawn value standing in for the whole evaluation). Quick convergence, very few parameters, overfitting? Maybe that is not the correct word.
Is it able to leave a local optimum again, or will it optimize towards it and then stop?
I forgot to say that we need a minimum number of votes before awarding an increment. This number controls the inertia of the system. If the number is small, the parameters bounce around to adapt as the data is entered (quick convergence, but the system stays in the local optimum). If the number is large, a parameter only moves if there is strong evidence that it must (slow convergence, no local bouncing).
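Roughly, the vote mechanism looks like this (just a sketch; the names and the fixed ±1 step are illustrative, not the exact implementation):

Code: Select all

#include <cstdlib>

// One tunable evaluation parameter with a vote counter.
struct Param {
    int value = 0;   // current parameter value
    int votes = 0;   // net votes: +1 suggests an increase, -1 a decrease
};

// Record one piece of evidence for moving the parameter up or down.
void vote(Param& p, int direction) {   // direction is +1 or -1
    p.votes += direction;
}

// Apply an increment only when there is enough accumulated evidence.
// minVotes controls the inertia: a small value gives quick but bouncy
// convergence (the system stays near the local optimum), a large value
// moves a parameter only on strong evidence (slow, no local bouncing).
void maybeUpdate(Param& p, int minVotes) {
    if (std::abs(p.votes) >= minVotes) {
        p.value += (p.votes > 0) ? 1 : -1;
        p.votes = 0;   // start collecting evidence again
    }
}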
For sure, it is not the ultimate tuning method, but it shows that other "ad hoc" systems can work fine.
Another example: you can tune without minimizing the error function. You can instead maximize the correlation; you are then not looking for a close match of values but for a similar shape of the evaluation function.
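As a sketch, assuming the engine's scores and some reference scores for the same set of positions (the data layout here is illustrative), the correlation objective could be computed like this:

Code: Select all

#include <cmath>
#include <cstddef>
#include <vector>

// Pearson correlation between the engine's scores and reference scores.
// Maximizing this rewards a similar *shape* of the evaluation function,
// not a close match of the absolute values.
double correlation(const std::vector<double>& engineScore,
                   const std::vector<double>& referenceScore) {
    const double n = engineScore.size();
    double sx = 0, sy = 0, sxx = 0, syy = 0, sxy = 0;
    for (std::size_t i = 0; i < engineScore.size(); i++) {
        sx  += engineScore[i];
        sy  += referenceScore[i];
        sxx += engineScore[i] * engineScore[i];
        syy += referenceScore[i] * referenceScore[i];
        sxy += engineScore[i] * referenceScore[i];
    }
    double cov = sxy - sx * sy / n;   // n * covariance
    double vx  = sxx - sx * sx / n;   // n * variance of engine scores
    double vy  = syy - sy * sy / n;   // n * variance of reference scores
    return cov / std::sqrt(vx * vy);
}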
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The texel evaluation function optimization algorithm

Post by AlvaroBegue »

Antonio Torrecillas wrote:
My understanding of overfitting is that it's primarily a problem of the model being too flexible (i.e., having too many parameters to tune). I say "primarily" because there are "early stopping" and regularization methods that can help ameliorate overfitting in some cases.
By overfitting I mean: parameters take on wrong values.

For example: reset all parameters to zero and tune just one parameter, the pawn value. The result will be overestimated, say 187, because the pawn value alone has to account for the whole evaluation. Next tune the knight value; it has to be above the pawn value, about the same as a bishop, and below a rook or queen, all of which are still set to zero. So the knight gets a value of, say, 217. Each step incorporates a new feature into the eval, and while each step is an improvement, the result is far from optimal. Each time we try to explain a wrong evaluation with the wrong feature, we get a bad value (the pawn value standing in for the whole evaluation). Quick convergence, very few parameters, overfitting? Maybe that is not the correct word.
Oh, that's not what "overfitting" means. An example of an overfit model is one with a million parameters, so if you don't have an overwhelming amount of training data, the optimizer has enough knobs to tweak that it can learn almost the exact value of every position in the training data. This would look very good when you measure the performance in the training set (a.k.a. "in-sample performance"), but then would produce very odd results in new positions.

Another way to think about it is that the training data contains useful information and also random noise. An overfit model is one that has so much capacity for learning that it learns both the useful information and the noise.
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: The texel evaluation function optimization algorithm

Post by petero2 »

Evert wrote:
petero2 wrote: I don't think there is any such assumption in my method. My method only fixates K, not the value of any evaluation weight (such as the nominal value of a pawn). Also, my method optimizes on the scores returned by the q-search function, which are scalars, not vectors. The interpolation between MG and EG scores has to happen before the search sees the score, otherwise alpha-beta would not work.
Of course; I think you misunderstand what I was trying to say.

Assuming a traditional linear interpolation between mid-game and end-game scores (I gather you do something slightly different, which honestly sounds reasonable), the evaluation is a 2-vector composed of a mid-game and an end-game score which gets projected to a 1D value for the purpose of alpha-beta. In principle, the two evaluations can use very different scales (say, centi-pawns for the mid-game and milli-pawns for the end game, for a slightly stupid example), but they need to be normalised with respect to each other in order to do the interpolation (of course, if you don't do that and get it badly wrong, the program plays bad chess). One easy normalisation is just fixing the value of the pawn in both cases (as Thomas was doing).

However, when doing that, the winning rate given a +0.5 (say) advantage in the end game need not be the same as the winning rate for a +0.5 advantage in the middle game. This would show up as noise if you pass the score through a logistic function and compare that with the actual outcome of the games from which the positions were taken. Contrary to the noise inherent in playing games, this is a systematic error, an inherent spread in the results that you cannot get rid of unless you allow the value of the pawn to change with the game phase.
My emphasis above. I would argue that if a +0.5 advantage does not have the same winning rate in MG and EG, the program could make sub-optimal decisions when trying to decide if it should keep many pieces on the board or if it should trade down into an endgame. One idea behind my method is that this difference is not noise, but a systematic error that should be corrected. However if the available tuning parameters and the structure of the evaluation function do not make it possible to correct this type of error (for example because the value of a pawn has been fixed), the result could be that the optimized parameters actually make the engine play worse.
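For concreteness, the traditional linear interpolation Evert describes looks roughly like this (the 0..256 phase scale is one common convention, not necessarily what any engine mentioned here uses):

Code: Select all

// Project the (mgScore, egScore) 2-vector onto a single value for
// alpha-beta. phase runs from 0 (full middle game) to 256 (pure endgame).
// The two scores must be on the same scale for the blend to make sense,
// which is exactly the normalisation issue discussed above.
int interpolate(int mgScore, int egScore, int phase) {
    return (mgScore * (256 - phase) + egScore * phase) / 256;
}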

There is one other assumption in my method that I have not mentioned before though. It is conceivable that a chess engine has an evaluation function where there is a perfect match between winning rate and evaluation score, W=f(S). f may not be a logistic function, but as long as f is strictly increasing, it does not matter to the alpha-beta algorithm what the function is. (This would not be true in the presence of various score-based forward pruning techniques, but I'm going to ignore that for now).

If f is not the same logistic function that my method tries to optimize for, the method will try to adjust the evaluation parameters to make the relation between W and S more similar to the logistic function. This will typically improve (i.e., lower) E, but will also typically make the evaluation worse: since f was already strictly increasing, the evaluation already ranked positions correctly for alpha-beta, and bending it towards the logistic shape can only distort that ranking.
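For reference, a sketch of the error measure E and the logistic mapping discussed here (the data layout is illustrative; K is the scaling constant that is fixated up front):

Code: Select all

#include <cmath>
#include <cstddef>
#include <vector>

// Logistic mapping from a q-search score (in centipawns) to an expected
// winning fraction. K is the scaling constant that is fixated up front.
double winProb(double qScore, double K) {
    return 1.0 / (1.0 + std::pow(10.0, -K * qScore / 400.0));
}

// E = (1/N) * sum over all positions of (result - winProb(qScore))^2,
// where result is 1 for a white win, 0.5 for a draw, 0 for a black win.
double evalError(const std::vector<double>& qScores,
                 const std::vector<double>& results, double K) {
    double sum = 0.0;
    for (std::size_t i = 0; i < qScores.size(); i++) {
        double diff = results[i] - winProb(qScores[i], K);
        sum += diff * diff;
    }
    return sum / qScores.size();
}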
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: The texel evaluation function optimization algorithm

Post by Michel »

Oh, that's not what "overfitting" means. An example of an overfit model is one with a million parameters, so if you don't have an overwhelming amount of training data, the optimizer has enough knobs to tweak that it can learn almost the exact value of every position in the training data. This would look very good when you measure the performance in the training set (a.k.a. "in-sample performance"), but then would produce very odd results in new positions.
That's why you should only use a certain fraction of your training data for fitting and keep the remaining fraction for testing the quality of the fit. If the fit is substantially worse on the test fraction than on the fitting fraction, then you have a case of overfitting.

A function with many parameters will do badly on this test.
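A sketch of such a split (the 80/20 ratio and the data layout are illustrative):

Code: Select all

#include <algorithm>
#include <cstddef>
#include <random>
#include <vector>

struct Position {
    double qScore;   // score of the position (features etc. omitted)
    double result;   // game result: 1, 0.5 or 0
};

// Shuffle the positions, fit on the first 80% and measure on the rest.
// If the error on the held-out 20% is much worse than on the fitted 80%,
// the parameters have been overfitted to the training fraction.
void splitTrainTest(std::vector<Position> data,
                    std::vector<Position>& train,
                    std::vector<Position>& test) {
    std::mt19937 rng(12345);   // fixed seed for a reproducible split
    std::shuffle(data.begin(), data.end(), rng);
    std::size_t cut = data.size() * 8 / 10;
    train.assign(data.begin(), data.begin() + cut);
    test.assign(data.begin() + cut, data.end());
}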
petero2
Posts: 685
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: The texel evaluation function optimization algorithm

Post by petero2 »

tpetzke wrote:Hi,

I tried it with iCE and used a genetic algorithm to optimize the evaluation error. I documented my results here:

http://macechess.blogspot.de/2014/03/th ... ng_10.html
Hi,

I re-read your blog post today and realized there is one aspect of my method that I have not mentioned yet. The local search method is able to find good values even for parameters that have a relatively small effect on E. As a simplified example, assume that:

Code: Select all

E = (par1 - 25)^2 + 1e-4 * (par2 - 40)^2
The first parameter has a 10000 times larger effect on E than the second, but local search will not have any problems finding the best value for the second parameter. I suspect that genetic algorithms may have problems in this case though.
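A sketch of such a local search (hill climbing in ±1 steps per parameter; the computeE callback stands in for a full recomputation of E over the training positions):

Code: Select all

#include <cstddef>
#include <initializer_list>
#include <vector>

// Local search over integer parameters: repeatedly try +1 / -1 steps on
// each parameter and keep a step whenever it lowers E. Because every
// parameter is stepped on its own, even a parameter with a tiny effect
// on E (like par2 above) is eventually tuned to its best value.
std::vector<int> localOptimize(std::vector<int> params,
                               double (*computeE)(const std::vector<int>&)) {
    double bestE = computeE(params);
    bool improved = true;
    while (improved) {
        improved = false;
        for (std::size_t i = 0; i < params.size(); i++) {
            for (int step : {+1, -1}) {
                params[i] += step;
                double newE = computeE(params);
                if (newE < bestE) {
                    bestE = newE;
                    improved = true;
                } else {
                    params[i] -= step;   // step did not help, undo it
                }
            }
        }
    }
    return params;
}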

In a real evaluation function, the first parameter could correspond to, for example, the bishop value, which affects a large number of positions, and the second parameter could correspond to a material imbalance correction for a queen vs rook+bishop imbalance, which affects a much smaller number of positions.

Maybe this effect can explain why many of your non-major evaluation terms became zero after the optimization.
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

Hi Peter,

I found another possible problem: in the first tuning I used the constant -0.58 to map scores to winning probabilities (I think I took it from the wiki).

I later measured which constant actually matches scores to winning probabilities best for the current eval of iCE, and it was -0.20. So my engine has a higher resolution in its eval; I would have to divide all my scores by 2 to bring the absolute scores into the range of other engines. So with -0.58 I was trying to find weights that mirror a probability distribution that is not present.

I have explained it here
http://macechess.blogspot.de/2014/03/pa ... n-ice.html
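A sketch of how such a constant can be measured, assuming the logistic form 1/(1+10^(c*s)) with the score s in pawn units (the exact form iCE uses is described in the post above):

Code: Select all

#include <cmath>
#include <cstddef>
#include <vector>

// Scan candidate constants c and keep the one whose predicted winning
// probabilities best match the actual game results (smallest mean
// squared error). scores are the engine's own evals in pawn units,
// results are 1, 0.5 or 0 from white's point of view.
double fitConstant(const std::vector<double>& scores,
                   const std::vector<double>& results) {
    double bestC = -0.5, bestE = 1e9;
    for (double c = -1.0; c <= -0.05; c += 0.01) {
        double e = 0.0;
        for (std::size_t i = 0; i < scores.size(); i++) {
            double p = 1.0 / (1.0 + std::pow(10.0, c * scores[i]));
            double d = results[i] - p;
            e += d * d;
        }
        e /= scores.size();
        if (e < bestE) { bestE = e; bestC = c; }
    }
    return bestC;   // e.g. -0.58 from the wiki vs. -0.20 measured for iCE
}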

If a term has no influence on the outcome, it will not converge at all; its entropy will just stay high. But the terms converged to 0, which actually means 0 was found to be the best value. Terms of lower importance simply converge later: their entropy stays high until the more important terms have been committed. It is like fine tuning.

After using -0.2 as the constant it looked better, and it still converged. I also verified that the new weights improve the evaluation on a set of positions that were not used for tuning. The final weights played as well as the weights I had before, but took only 2 days instead of 4 weeks to calculate. So I consider this a success.

http://macechess.blogspot.de/2014/03/so ... sults.html

One thing that is not covered yet is the interaction of the weights with the search parameters like futility pruning margins. For a complete picture it would be necessary to retune them as well.

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine