Experiments with eval tuning

Discussion of chess software programming and technical issues.

jdart
Posts: 3825
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Experiments with eval tuning

Post by jdart » Tue Mar 10, 2015 2:40 pm

I have recently made a modified version of my program that facilitates automated evaluation tuning; see http://arasanchess.org/blogs/feb15.html for a description. The code is now available at https://github.com/jdart1/arasan-chess/tree/cma-es.

I have previously used NOMAD software to tune a relatively small number of evaluation parameters, but this is an attempt to do a global tuning of many parameters, using more or less the method described by Peter Österlund (http://talkchess.com/forum/viewtopic.ph ... 22&t=50823).

My latest attempt at this used the following as an objective value: a 1-ply search over a training set of about 3.5 million positions, selected from relatively high-level games. The evals from the search are fed to a sigmoid function, and the objective is the least-squares error between the sigmoid results and the game results (-1, 0, or 1 for a loss, draw, or win). I can run the evaluations in parallel on multiple threads, but the calculation still takes 10 minutes or so.
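[Editor's note: the objective described above can be sketched as follows. This is a minimal illustration, not Arasan's actual code; it uses the common Texel-style encoding of results as 1 / 0.5 / 0 rather than the three-valued encoding mentioned above, and the scaling constant `k` is an assumed free parameter.]

```python
def sigmoid(score_cp, k):
    """Map a centipawn eval to an expected game result in [0, 1]."""
    return 1.0 / (1.0 + 10.0 ** (-k * score_cp / 400.0))

def texel_error(scores_cp, results, k):
    """Mean squared error between predicted and actual game results.

    scores_cp: eval scores in centipawns (from a consistent point of
               view, e.g. White's) over the training positions.
    results:   game outcomes: 1.0 = win, 0.5 = draw, 0.0 = loss.
    """
    n = len(scores_cp)
    return sum((r - sigmoid(s, k)) ** 2
               for s, r in zip(scores_cp, results)) / n
```

The tuner then treats `texel_error` over the training set as the black-box function to be minimised with respect to the evaluation parameters.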

I am using CMA-ES (https://www.lri.fr/~hansen/cmaesintro.html) for tuning the scoring parameters to reduce the objective value. I do not think NOMAD will work for this; the dimensionality is too high.

My latest experiment, however, is not a success. The initial objective was 0.477623. After 1300 evaluations (which took quite some time) the objective had decreased to 0.462272, and it looked like further iterations were not improving it. (This is not a large change, but it is what I have seen in my experience: at least with this type of objective, the values change slowly.) The tuned parameter set looks reasonable. However, when running my standard hyper-bullet game test suite, the rating decreased by 10 Elo (comparing the pre-tuned values to the tuned values).

Now, there are a lot of variables here. Peter used fast games and all positions from those games, not a sample from slower games (I can try that too). There are other objective functions that may work better. CMA-ES is maybe not the best minimizer, although it is reliable and does reduce the objective function. (I also tried BayesOpt (http://mloss.org/software/view/453/), but it core-dumped.)

However, overall I am a bit discouraged.

--Jon

Henk
Posts: 5827
Joined: Mon May 27, 2013 8:31 am

Re: Experiments with eval tuning

Post by Henk » Tue Mar 10, 2015 2:54 pm

If I am ever going to tune my evaluation automatically again, I will compare position values only, not game or match results, because those computations take too long. I don't have that much patience, and the results have mostly been disappointing. It is not only the time, but also the amount of work for nothing.

AlvaroBegue
Posts: 920
Joined: Tue Mar 09, 2010 2:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Experiments with eval tuning

Post by AlvaroBegue » Tue Mar 10, 2015 3:19 pm

It is standard practice in Machine Learning to save part of your data (20%, perhaps?) as your "test set". The other 80% is referred to as the "training set". If your model has too many parameters, you'll be able to fit the noise in the training set, and the resulting model will perform poorly in the test set.
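[Editor's note: a minimal sketch of such a held-out split. The 80/20 ratio and the fixed seed are illustrative choices, not anything from Jon's setup.]

```python
import random

def split_positions(positions, test_fraction=0.2, seed=42):
    """Shuffle the tuning positions and split them into a training set
    and a held-out test set; tune on the former, validate on the latter."""
    rng = random.Random(seed)          # fixed seed for reproducibility
    shuffled = positions[:]            # do not disturb the caller's list
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * (1.0 - test_fraction))
    return shuffled[:cut], shuffled[cut:]
```

If the objective keeps falling on the training set while rising on the test set, the tuner is fitting noise rather than chess knowledge.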

Did you do anything like this?

jdart
Posts: 3825
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Experiments with eval tuning

Post by jdart » Tue Mar 10, 2015 3:46 pm

The real objective is not to predict the game results from a very shallow search. That is only the surrogate measure that is used for tuning.

To validate the results you have to use actual game matches, and that is what I am doing. Even in hyper-bullet games, the engines are searching at least 10-12 ply in these matches. That is a lot more search depth than the tuning environment uses, and a much better predictor of success at regular blitz or standard time controls. The only reason you can't use this for tuning is the compute time involved: the elapsed time for a full match run is about 8 hours using my current test setup with 36 cores, so it would take about a year to get 1000 evaluation points.

--Jon

mar
Posts: 2001
Joined: Fri Nov 26, 2010 1:00 pm
Location: Czech Republic
Full name: Martin Sedlak

Re: Experiments with eval tuning

Post by mar » Tue Mar 10, 2015 4:06 pm

Have you computed K first to match your eval scale to win probability?
This is a crucial step, and the Texel tuning method won't work without it (I learned that the hard way).
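[Editor's note: the K-fitting step can be sketched as a one-dimensional search with the evaluation parameters held fixed, under the usual Texel-style sigmoid. The search bounds and iteration count here are arbitrary assumptions.]

```python
def mse(scores_cp, results, k):
    """MSE of sigmoid-predicted results against actual results for a given K."""
    def sig(s):
        return 1.0 / (1.0 + 10.0 ** (-k * s / 400.0))
    return sum((r - sig(s)) ** 2
               for s, r in zip(scores_cp, results)) / len(results)

def fit_k(scores_cp, results, lo=0.1, hi=3.0, iters=50):
    """Ternary search for the K that minimises the error.  The error is,
    in practice, unimodal in K, so this converges; K is fitted once,
    before any evaluation parameters are touched."""
    for _ in range(iters):
        m1 = lo + (hi - lo) / 3.0
        m2 = hi - (hi - lo) / 3.0
        if mse(scores_cp, results, m1) < mse(scores_cp, results, m2):
            hi = m2
        else:
            lo = m1
    return (lo + hi) / 2.0
```

Fitting K first makes the sigmoid agree with the engine's own eval scale, so the subsequent parameter tuning starts from a calibrated baseline.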

jdart
Posts: 3825
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Experiments with eval tuning

Post by jdart » Tue Mar 10, 2015 5:07 pm

That is a good point. I have probably not optimized that.

--Jon

bob
Posts: 20563
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

Re: Experiments with eval tuning

Post by bob » Tue Mar 10, 2015 6:03 pm

jdart wrote:I have recently made a modified version of my program that facilitates automated evaluation tuning - see http://arasanchess.org/blogs/feb15.html for a description, also code is now available at https://github.com/jdart1/arasan-chess/tree/cma-es.

(snip)

Jon, this is (at least for me) nothing new. I've tried this multiple times. I remember when Anthony Cozzie decided to work on a simulated annealing approach, and his first comment was "this might be the greatest thing to ever hit computer chess." Then he started the actual tuning. I spent a ton of time making every evaluation constant in Crafty use a variable that could be altered with a command to the program. He tried a tuning run with Crafty and sent me the results: the most bizarre PST values I have ever seen. And cluster testing showed a significant Elo loss; I don't remember the exact number, but it was at least 10. We tried tuning just PSTs, everything but PSTs, and subsets of things (e.g. just pawn evaluation terms), but we never reached an Elo gain.

That would mean that either my hand tuning was completely optimal (at that point in time very hard to believe, although it is much more likely today after all the manual cluster tuning I have done), or else the automated tuning was simply not working as expected.

My own personal opinion was that there was simply too much overlap between evaluation terms. That can lead you into circular tuning loops where you raise A, raise B, and lower C, and see little change. Repeat, and repeat. I recently rewrote my passed pawn evaluation (and pawn evaluation) to eliminate a lot of this overlap, just for my own tuning sanity.

Perhaps this is an idea that simply doesn't work as long as a search is involved. The search introduces some pretty interesting side effects. I'd like to have 3 million test positions where tactics do not play any part. Then you could search each one quite deeply to discover what the best move actually is, and tune against that perhaps. But it is VERY hard to make sure tactics don't enter in. And if they do, the evaluation has no chance to help choose the right move in any realistic sort of way.

jdart
Posts: 3825
Joined: Fri Mar 10, 2006 4:23 am
Location: http://www.arasanchess.org

Re: Experiments with eval tuning

Post by jdart » Tue Mar 10, 2015 6:22 pm

I have had some modest success with tuning a relatively small number of parameters while keeping others constant.

There has been a lot of progress in optimization techniques in the past couple of decades. Many of these algorithms now can handle interdependent variables pretty well (in the literature, these problems are called "non-separable"). And there is progress optimizing expensive functions where you have a small evaluation budget. But anything over 100 variables with nonlinearity is still considered by most to be a very large and difficult problem, especially with an expensive and possibly noisy objective.

On top of that I think using a very rough surrogate for the real multi-ply deep search function is dubious.

As I posted some time back, the Go community has been using a different objective, based on the eval difference between the supposed best move (played by strong players) and the move(s) the program considers better. They have had success using this, together with a fairly sophisticated optimization method. I think there is potential in the general idea, but my experience so far is not good.
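[Editor's note: that move-agreement objective might be sketched roughly as follows. The data layout is hypothetical: for each position, a dict of engine evals per candidate move, plus the move the strong player actually chose.]

```python
def move_agreement_loss(position_data):
    """Sum, over all positions, of how much the engine prefers its own
    choice over the move actually played by the strong player.

    position_data: iterable of (evals, played) pairs, where `evals`
    maps each candidate move to the engine's eval (higher = better for
    the side to move) and `played` is the move from the game.
    A perfect score of 0 means the engine's top choice always matched."""
    total = 0.0
    for evals, played in position_data:
        best = max(evals.values())
        total += max(0.0, best - evals[played])
    return total
```

Minimising this pushes the evaluation toward ranking the master move first, rather than toward predicting game outcomes.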

--Jon

mvk
Posts: 589
Joined: Tue Jun 04, 2013 8:15 pm

Re: Experiments with eval tuning

Post by mvk » Tue Mar 10, 2015 6:23 pm

jdart wrote:My latest attempt at this used the following as an objective value: a 1-ply search over a training set of about 3.5 million positions, selected from relatively high-level games. The evals from the search are fed to a sigmoid function and the objective is the least-squared error between the sigmoid results and the game result (0, -1, or 1 for draw, loss or win). I can run the evaluations in parallel on multiple threads, but still the calculation takes 10 minutes or so.
This is the same as what I have been doing. My set is 1M positions, sampled unfiltered from a large PGN set, and I do 2-ply searches; I don't think that matters. I do hill climbing on single parameters, and then cycle through them. After 5-10 or more cycles, the Elo converges. That can take several weeks.
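[Editor's note: the cycling scheme described here can be sketched as plain coordinate-wise hill climbing. The step size and stopping rule below are illustrative assumptions.]

```python
def cycle_hill_climb(params, objective, step=1, cycles=10):
    """Coordinate-wise hill climbing: for each parameter in turn, try
    +step and -step, keeping a change only if it lowers the objective;
    cycle through all parameters repeatedly until a full cycle yields
    no improvement (or the cycle budget runs out)."""
    best = objective(params)
    for _ in range(cycles):
        improved = False
        for i in range(len(params)):
            for delta in (step, -step):
                params[i] += delta
                trial = objective(params)
                if trial < best:
                    best = trial
                    improved = True
                else:
                    params[i] -= delta    # revert the unhelpful change
        if not improved:
            break
    return params, best
```

In this setting `objective` would be the sigmoid error over the position set, so every accepted or rejected step costs a full pass over the 1M positions, which is why convergence takes weeks.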

nionita
Posts: 161
Joined: Fri Oct 22, 2010 7:47 pm
Location: Austria

Re: Experiments with eval tuning

Post by nionita » Wed Mar 11, 2015 7:48 pm

jdart wrote:As I posted some time back, the Go community has been using a different objective which is based on the eval diff between the supposed best move (played by strong players) and the move(s) chosen as better by the program. They have had success using this, together with a fairly sophisticated optimization method. I think there is potential in the general idea. But my experience so far is not good.

--Jon
Whatever I tried with those two methods (actually, objective functions), it didn't work for me.

I have 2 major problems with this procedure:

1. As I said in another thread earlier, there is no theory which proves that such an objective function, when minimised, will give better play (an Elo gain). Even if some people here report gains, I guess this is just luck. It is just our hope that we can "teach" the evaluation better parameters by showing it results from a deeper search.

It is simple to demonstrate this: take two parameter points (far enough apart in the parameter space) and calculate the objective function values for them. Then compare the two different engine versions by playing 30,000 games. I bet there is no relation between the playing strength and the objective function values. If there happens to be, it's just luck; convince yourself with another pair of points. Everyone can do this for himself to prove it; I did it a lot of times.

2. There was a thread here 1 or 2 months ago (I can't find it now) in which somebody tried an obviously better eval in his engine and got a loss of Elo. Somebody else pointed out that there must be some fine tuning between eval and search parameters (like LMR and so on).

But then, if this is the case, what is the chance of getting a better engine by tuning only the eval this way? You would have to tune (at least) some search parameters at the same time. But in this setting that is impossible, as those parameters are not part of the objective function.

In conclusion, in my opinion, we are left with just one method to improve the engine strength: optimising the very noisy objective function given by the results of thousands of games at different parameter points.
