How Do You Automatically Tune Your Evaluation Tables

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Antonio Torrecillas
Posts: 90
Joined: Sun Nov 02, 2008 4:43 pm
Location: Barcelona

Re: The texel evaluation function optimization algorithm

Post by Antonio Torrecillas »

How about getting a strong engine and scoring all of the positions, then converting these scores into probabilities using the sigmoid function?
I've tried this a couple of times. I then compared my evaluation with the two engines I used, and this was my conclusion: this works fine, but your evaluation needs to be similar to the mentor engine's in philosophy and/or implementation.
Surprisingly, a run with Don D.'s similarity tool doesn't put these mentor engines at the top of the similarity ranking.
So the current implementation of the evaluation function determines which mentor is best, and I have started to think about the maximum Elo this evaluation can reach (tuning parameters only). Improving an evaluation must be a sequence: modify the evaluation, then tune, and so on.
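
Just to make the idea concrete, a minimal sketch of the score-to-probability conversion (the 400-centipawn scale is only an illustrative choice, not taken from any particular engine):

Code:

#include <cmath>
#include <cstdio>

// Convert a mentor engine's score (in centipawns, from the side to move's
// point of view) into an expected winning probability via a logistic curve.
// The scale constant of 400 is only an illustrative choice.
double scoreToProbability(double centipawns, double scale = 400.0) {
    return 1.0 / (1.0 + std::pow(10.0, -centipawns / scale));
}

int main() {
    for (double cp : {-400.0, -100.0, 0.0, 100.0, 400.0})
        std::printf("score %+6.0f cp -> P(win) = %.3f\n", cp, scoreToProbability(cp));
    return 0;
}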
User avatar
Steve Maughan
Posts: 1221
Joined: Wed Mar 08, 2006 8:28 pm
Location: Florida, USA

Re: The texel evaluation function optimization algorithm

Post by Steve Maughan »

I've given some more thought to the fitness function.

I think one of the reasons it may work so well is the range of sensitivity. The sigmoid function is *really* sensitive in the critical -100 to +100 centi-pawns. Outside of this range it's close to the asymptotes. So effectively you're tuning the engine to be sensitive in this range - which is a good thing.

In contrast, if you used a simple least-squares error formula based on the position's raw score, a "game-won" score could be +400, +800 or +1200. The variation in that range, which is almost irrelevant, would drown out the subtle variations between -100 and +100 centi-pawns.
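
To put rough numbers on that sensitivity argument (the 400-centipawn scale below is just a typical choice for illustration):

Code:

#include <cmath>
#include <cstdio>

// Logistic win-probability curve and its slope, to show where the
// sensitivity is concentrated.
double sigmoid(double cp)      { return 1.0 / (1.0 + std::pow(10.0, -cp / 400.0)); }
double sigmoidSlope(double cp) {                     // derivative d sigmoid / d cp
    double s = sigmoid(cp);
    return s * (1.0 - s) * std::log(10.0) / 400.0;
}

int main() {
    for (double cp : {0.0, 100.0, 400.0, 800.0, 1200.0})
        std::printf("at %+5.0f cp: sigmoid = %.3f, slope = %.6f per cp\n",
                    cp, sigmoid(cp), sigmoidSlope(cp));
    // The slope at 0 cp comes out about 25 times larger than at 800 cp,
    // so errors in the -100..+100 range dominate the fit.
    return 0;
}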

Steve
http://www.chessprogramming.net - Maverick Chess Engine
petero2
Posts: 689
Joined: Mon Apr 19, 2010 7:07 pm
Location: Sweden
Full name: Peter Osterlund

Re: The texel evaluation function optimization algorithm

Post by petero2 »

Steve Maughan wrote:I think one of the reasons it may work so well is the range of sensitivity. The sigmoid function is *really* sensitive in the critical -100 to +100 centi-pawns. Outside of this range it's close to the asymptotes. So effectively you're tuning the engine to be sensitive in this range - which is a good thing.

In contrast, if you used a simple least-squares error formula based on the position's raw score, a "game-won" score could be +400, +800 or +1200. The variation in that range, which is almost irrelevant, would drown out the subtle variations between -100 and +100 centi-pawns.
This was actually one of the reasons why I chose this fitness function. Another reason was that the relation between pawn advantage and win percentage had already been found to roughly follow a logistic function, see https://chessprogramming.wikispaces.com ... e,+and+ELO.

The third reason was that I had read about H.G. Muller's method of calibrating piece values for non-orthodox chess: play a number of games from unbalanced starting positions, calculate the win percentage, and then use the inverse of the sigmoid function to obtain the piece value. This method is equivalent to solving the following minimization problem:

Code:

     N
min sum (r_i - sigmoid(V))^2
 V  i=1
This follows from the fact that the sum is minimized when sigmoid(V) = mean(r_i). From there it was not a big step to generalize this to more than one unknown variable to be optimized.
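
As a rough illustration of that generalization (this is only a toy sketch, not my actual implementation: the positions are reduced to made-up feature counts, the evaluation is a plain linear function of the parameters, and the coordinate-wise step search is just one of many possible optimizers):

Code:

#include <cmath>
#include <cstdio>
#include <vector>

// Toy model of the generalized problem: each training position is reduced to
// a feature-count vector plus the game result r in {0, 0.5, 1} from white's
// point of view. A real tuner would instead re-evaluate the stored position
// (e.g. with a quiescence search) using the candidate parameters.
struct Sample {
    std::vector<double> features;   // one entry per tunable parameter
    double result;                  // 0 = loss, 0.5 = draw, 1 = win
};

static double sigmoid(double cp) {
    return 1.0 / (1.0 + std::pow(10.0, -cp / 400.0));   // scale constant is illustrative
}

static double evaluate(const Sample& s, const std::vector<double>& params) {
    double cp = 0.0;
    for (size_t j = 0; j < params.size(); ++j)
        cp += params[j] * s.features[j];
    return cp;
}

// E(params): mean squared difference between game results and predicted results.
static double error(const std::vector<Sample>& data, const std::vector<double>& params) {
    double e = 0.0;
    for (const Sample& s : data) {
        double d = s.result - sigmoid(evaluate(s, params));
        e += d * d;
    }
    return e / data.size();
}

// Naive coordinate-wise local search: nudge each parameter by +/-step and keep
// the change whenever E drops; stop when no single change helps any more.
static std::vector<double> tune(const std::vector<Sample>& data,
                                std::vector<double> params, double step = 1.0) {
    double bestE = error(data, params);
    for (bool improved = true; improved; ) {
        improved = false;
        for (size_t j = 0; j < params.size(); ++j) {
            for (double delta : {+step, -step}) {
                params[j] += delta;
                double e = error(data, params);
                if (e < bestE) { bestE = e; improved = true; }
                else           { params[j] -= delta; }       // revert
            }
        }
    }
    return params;
}

int main() {
    // A handful of made-up samples so the sketch runs end to end.
    std::vector<Sample> data = {
        { { 1.0,  0.0 }, 1.0 },   // up one pawn, won
        { { 1.0,  0.0 }, 0.5 },   // up one pawn, drawn
        { { 0.0,  1.0 }, 1.0 },   // up one knight, won
        { { 0.0, -1.0 }, 0.0 },   // down one knight, lost
        { { 0.0, -1.0 }, 0.5 },   // down one knight, drawn
    };
    std::vector<double> params = { 100.0, 300.0 };   // initial guesses in centipawns
    params = tune(data, params);
    std::printf("params[0] = %.0f, params[1] = %.0f, E = %.5f\n",
                params[0], params[1], error(data, params));
    return 0;
}

With real positions, evaluate() would of course be replaced by the engine's own (quiescence-resolved) evaluation of the stored position, and the numbers produced by such a tiny made-up data set mean nothing.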
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: The texel evaluation function optimization algorithm

Post by michiguel »

jdart wrote:Sounds like Miguel is doing it a different way, using a large fixed set of positions from comp games (I infer, not necessarily his own engine). I am not sure whether this is better than continually re-generating the test set from computer matches. But it is simpler.

--Jon
Yes, that is correct. The main reason is that I quickly get a large number of reasonable-quality positions. In a way, the parameters will reflect a statistical survey of the data, and if the data does not cover the whole landscape, the final results could show wild numbers for the parameters that are not well represented. At least, that is what I thought.

Using positions generated by the engine has other advantages. If there is a hole in the evaluation or a parameter is wrong, it will be well represented and hence seriously punished, so those problems will get fixed. For instance, if your evaluation thinks that NN is better than Q (to give a wild example), there won't be many examples of that in a database of computer games. But if you play games with your own engine, the engine itself will drive the games towards those imbalances (and the opponent engine won't disagree...).

Using games generated by Gaviota has been on my todo list for a while, so I do not have proof of any kind, but I believe the best approach may be to set the parameters first with a broad and diverse set of games from a database, and later use self-generated ones.

Miguel
User avatar
Bloodbane
Posts: 154
Joined: Thu Oct 03, 2013 4:17 pm

Re: The texel evaluation function optimization algorithm

Post by Bloodbane »

I would like to add some of my experiences and thoughts on tuning by curve fitting.

First, I am using what Österlund posted (as in, that was version 1) with a lot of modifications by now. My sigmoid is 1/(1+exp(-x/C)), which is equivalent to the sigmoid Österlund posted, just with a different C (pretty obvious). What didn't work for me was calculating the constant by minimizing the error over C, or changing from least squares to cross-entropy; the resulting values were always weaker. So I came up with the idea of calculating C by fitting my sigmoid to match the sigmoid in http://chessprogramming.wikispaces.com/ ... 2C+and+ELO. This has worked and I have gotten a nice improvement so far. I don't know if it is theoretically sound or anything, though.
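
For what it's worth, if the curve on that page is the usual 1/(1+10^(-pawnAdvantage/4)) form and x is measured in centipawns, the fit reduces to a change of base; a tiny sanity-check sketch:

Code:

#include <cmath>
#include <cstdio>

// My sigmoid uses base e: 1/(1+exp(-x/C)); the curve on the wiki page uses
// base 10 in pawn units: 1/(1+10^(-pawns/4)). With x in centipawns the two
// agree exactly when x/C = x*ln(10)/400, i.e. C = 400/ln(10).
int main() {
    double C = 400.0 / std::log(10.0);
    std::printf("C = %.2f centipawns\n", C);             // about 173.72
    // Sanity check at +100 cp: both forms should give the same probability.
    double a = 1.0 / (1.0 + std::exp(-100.0 / C));
    double b = 1.0 / (1.0 + std::pow(10.0, -1.0 / 4.0));
    std::printf("exp form: %.4f   base-10 form: %.4f\n", a, b);
    return 0;
}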

I've also been having a problem with what I think is overfitting. I tuned material, pawn structure and king safety at the same time over a weekend. The iteration count reached 222 with no end in sight. The values at iteration 222 were horrible, and it turns out the best values I have found so far were within the first 50 iterations (which are better than the originals, mind you). My data set is 1 million positions, so I think it should be large enough for this not to happen, but SOMETHING is wrong (and it isn't the code, as it terminates properly with a smaller number of variables and the final values are then the best ones). Any feedback on this is appreciated.

I think the relationship between dynamic play and curve fitting comes from the fact that we are fitting our evaluation function to match the game result. I can't really explain WHY, though; it's just a feeling and I haven't thought about it that much. It might be interesting to compare the playing styles of an engine with evaluation values obtained by temporal difference learning versus curve fitting.

By the way, this thread is a goldmine of information. I've saved every post on my computer just in case.
Last edited by Bloodbane on Tue Mar 04, 2014 6:01 pm, edited 1 time in total.
Functional programming combines the flexibility and power of abstract mathematics with the intuitive clarity of abstract mathematics.
https://github.com/mAarnos
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: The texel evaluation function optimization algorithm

Post by AlvaroBegue »

I have been thinking about the least-squares vs. cross-entropy issue, and I now think that least-squares is preferable, even in theory. My thinking about the situation is not completely clear, but the argument that convinced me is this: the penalty for thinking that a position is totally won when it is not is infinite with the cross-entropy formula, but amounts to only one misclassified position with least-squares.
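
A quick numerical illustration of that argument (r is the game result, p the predicted probability; nothing engine-specific is assumed here):

Code:

#include <cmath>
#include <cstdio>

// Loss for a single position with game result r and predicted probability p.
double leastSquares(double r, double p) { return (r - p) * (r - p); }
double crossEntropy(double r, double p) {
    return -(r * std::log(p) + (1.0 - r) * std::log(1.0 - p));
}

int main() {
    // The position was actually lost (r = 0) but the model is ever more
    // convinced it is won (p -> 1).
    for (double p : {0.9, 0.99, 0.999, 0.999999})
        std::printf("p = %.6f  least-squares = %.4f  cross-entropy = %.2f\n",
                    p, leastSquares(0.0, p), crossEntropy(0.0, p));
    // Least-squares saturates at 1 (one misclassified position); the
    // cross-entropy term grows without bound as p approaches 1.
    return 0;
}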

On the issue of over-fitting, you could have a set of positions on the side that you use for measuring out-of-sample fitness. That doesn't fix the problem, but it detects it, which is better than nothing.
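
A minimal sketch of that idea, assuming a placeholder Sample type standing in for whatever the tuner already stores per position:

Code:

#include <algorithm>
#include <cstdio>
#include <random>
#include <vector>

// Placeholder for a training position; the real record would hold a position
// (or its features) plus the game result.
struct Sample { double result; };

// Shuffle once and split off a fraction of the data. The training error is
// still what gets minimized; the validation error is only measured after each
// iteration, and a rise in it signals over-fitting.
void splitHoldout(std::vector<Sample> all, double validationFraction,
                  std::vector<Sample>& training, std::vector<Sample>& validation) {
    std::mt19937 rng(12345);                       // fixed seed for reproducible splits
    std::shuffle(all.begin(), all.end(), rng);
    const size_t nVal = static_cast<size_t>(all.size() * validationFraction);
    validation.assign(all.begin(), all.begin() + nVal);
    training.assign(all.begin() + nVal, all.end());
}

int main() {
    std::vector<Sample> all(1000000, Sample{0.5}), training, validation;
    splitHoldout(all, 0.1, training, validation);  // hold out 10% of the positions
    std::printf("training: %zu  validation: %zu\n", training.size(), validation.size());
    // ... tune on 'training', track the error on 'validation' each iteration ...
    return 0;
}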
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: The texel evaluation function optimization algorithm

Post by michiguel »

Bloodbane wrote:I would like to add some of my experiences and thoughts on tuning by curve fitting.

First, I am using what Österlund posted (as in, that was version 1) with a lot of modifications by now. My sigmoid is 1/(1+exp(-x/C)), which is equivalent to the sigmoid Österlund posted, just with a different C (pretty obvious). What didn't work for me was calculating the constant by minimizing the error over C, or changing from least squares to cross-entropy; the resulting values were always weaker. So I came up with the idea of calculating C by fitting my sigmoid to match the sigmoid in http://chessprogramming.wikispaces.com/ ... 2C+and+ELO. This has worked and I have gotten a nice improvement so far. I don't know if it is theoretically sound or anything, though.

I've also been having a problem with what I think is overfitting. I tuned material, pawn structure and king safety at the same time over a weekend. The iteration count reached 222 with no end in sight. The values at iteration 222 were horrible, and it turns out the best values I have found so far were within the first 50 iterations (which are better than the originals, mind you). My data set is 1 million positions, so I think it should be large enough for this not to happen, but SOMETHING is wrong (and it isn't the code, as it terminates properly with a smaller number of variables and the final values are then the best ones). Any feedback on this is appreciated.

I think the relationship between dynamic play and curve fitting comes from the fact that we are fitting our evaluation function to match the game result. I can't really explain WHY, though; it's just a feeling and I haven't thought about it that much. It might be interesting to compare the playing styles of an engine with evaluation values obtained by temporal difference learning versus curve fitting.

By the way, this thread is a goldmine of information. I've saved every post on my computer just in case.
Overfitting implies the data is not big enough to be representative. What is most likely, and I have seen it, is that the parameters used are totally inappropriate or redundant. You will get crazy numbers if you try to fit a parabola with a straight line, or a straight line with an equation with 5 parameters. Sometimes a parameter with a crazy number can be completely eliminated and the eval won't suffer; the other numbers will adjust accordingly, which means the removed parameter was redundant. That is EXTREMELY useful and has been an eye opener for me countless times. This technique has proven useful to me not only for fitting reasonable numbers, but also for alerting me that some of them could be tossed.
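
In practice the check can be as simple as the comparison below (retune is a stand-in for whatever optimizer is already in place, and the tolerance is of course a judgment call):

Code:

#include <cstdio>
#include <functional>
#include <vector>

// Given a tuner that returns the minimized error E for a given set of active
// parameters, test whether dropping parameter k really matters. 'retune' is a
// stand-in for the existing optimization routine.
bool looksRedundant(const std::function<double(const std::vector<bool>&)>& retune,
                    std::vector<bool> active, size_t k, double tolerance) {
    const double eAll = retune(active);       // error with every parameter tuned
    active[k] = false;                        // freeze/remove the suspicious parameter
    const double eWithout = retune(active);   // let the remaining ones readjust
    std::printf("E with = %.6f, E without = %.6f\n", eAll, eWithout);
    return eWithout - eAll < tolerance;
}

int main() {
    // Mock tuner just so the sketch runs: pretend parameter 2 barely matters.
    auto mockRetune = [](const std::vector<bool>& active) {
        return active[2] ? 0.0900 : 0.0901;   // made-up errors
    };
    std::vector<bool> active = { true, true, true };
    bool redundant = looksRedundant(mockRetune, active, 2, 0.0005);
    std::printf("parameter 2 redundant? %s\n", redundant ? "yes" : "no");
    return 0;
}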

Miguel
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

The values at iteration 222 were horrible
What makes them horrible? Do they look strange? Do they produce a bigger error E than earlier versions, or does the engine play weaker with them than with previous generations?

In my tests the derived values are often far away from intuitive values. I have also repeated tuning runs that then derived very different parameter sets with almost equal strength.

So strange-looking values are not so bad; however, they might give you a hint about which parameters are important and which are less important (maybe even redundant).

If the error E increases over time (while it should decrease), that indicates a problem with the tuning algorithm. This can be fixed.

The worst case is that the final set produces the lowest error but performs badly in game play. That means we are not solving the real problem. I have experienced this before: I once used a set of EPD positions (up to 175,000) and counted how often a certain parameter set found the correct move in a 2-ply search. The set was filtered so that there was a chance to find the correct move within 2 plies (no deep tactics or the like). In the end I had a parameter set that was really good at finding those moves, yet it played 50 Elo worse than the original set. I then switched to game play as the training method.
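
The metric amounted to a counting loop along these lines (a sketch only: the EPD records are reduced to the expected move, and the search callback stands in for the real 2-ply search):

Code:

#include <cstdio>
#include <functional>
#include <string>
#include <vector>

// Minimal stand-ins: an EPD record reduced to its best-move field, and a
// callback representing "run a 2-ply search with this parameter set and
// return the move it picks".
struct EpdRecord { std::string fen; std::string bestMove; };
using ShallowSearch = std::function<std::string(const EpdRecord&)>;

// Fitness = number of filtered test positions where the shallow search finds
// the expected move. This is the metric that did NOT correlate with strength.
int countSolved(const std::vector<EpdRecord>& suite, const ShallowSearch& search) {
    int solved = 0;
    for (const EpdRecord& epd : suite)
        if (search(epd) == epd.bestMove)
            ++solved;
    return solved;
}

int main() {
    // Mock data and a mock "search" so the sketch runs on its own.
    std::vector<EpdRecord> suite = {
        { "fen-1", "e2e4" }, { "fen-2", "g1f3" }, { "fen-3", "d2d4" } };
    ShallowSearch mock = [](const EpdRecord&) { return std::string("e2e4"); };
    std::printf("solved %d of %zu\n", countSolved(suite, mock), suite.size());
    return 0;
}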

I'm planning to use this tuning method and compare it against a set that was trained by game play. I'm really interested to see how it performs, or whether we are buying saved time at the cost of lower quality.

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
gladius
Posts: 568
Joined: Tue Dec 12, 2006 10:10 am
Full name: Gary Linscott

Re: The texel evaluation function optimization algorithm

Post by gladius »

I thought I'd post an update on how things went when I tried it out on SF. The material values produced were very strange. As an example, the pawn value in the middlegame was set to 30 cp, while the pawn value in the endgame was a more reasonable 78 cp. Overall, the tuned values got blown off the map by the originals. It may be that the logistic is not a good fit for the SF eval and search, which are tuned to use margins based on the current eval.

There may still be bugs lurking in my implementation, of course, but the error value was converging downwards and the process terminated cleanly.
User avatar
Bloodbane
Posts: 154
Joined: Thu Oct 03, 2013 4:17 pm

Re: The texel evaluation function optimization algorithm

Post by Bloodbane »

The engine plays worse with iteration 222's values. A lot worse.

newE > bestE would have been a pretty funny mistake, but it's not that. I've got a bunch of unit tests which consist of a function (for example e^(x^2 + y^2)) and its minima/maxima to catch stupid mistakes like that.
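
One of those tests looks conceptually like this (the coordinate-descent loop is only a generic stand-in for the tuner's real search routine):

Code:

#include <cassert>
#include <cmath>
#include <cstdio>

// Test function with a known global minimum: f(0,0) = 1.
double f(double x, double y) { return std::exp(x * x + y * y); }

int main() {
    // Generic stand-in for the tuner's search routine: naive coordinate
    // descent with a shrinking step, started away from the optimum.
    double x = 3.0, y = -2.0;
    for (double step = 1.0; step > 1e-6; step /= 2.0) {
        bool improved = true;
        while (improved) {
            improved = false;
            for (double* v : {&x, &y}) {
                for (double d : {+step, -step}) {
                    double before = f(x, y);
                    *v += d;
                    if (f(x, y) < before) improved = true;
                    else                  *v -= d;
                }
            }
        }
    }
    std::printf("found minimum near (%.4f, %.4f), f = %.4f\n", x, y, f(x, y));
    assert(std::fabs(x) < 1e-3 && std::fabs(y) < 1e-3);   // should be ~(0, 0)
    return 0;
}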
Last edited by Bloodbane on Tue Mar 04, 2014 7:31 pm, edited 1 time in total.
Functional programming combines the flexibility and power of abstract mathematics with the intuitive clarity of abstract mathematics.
https://github.com/mAarnos