How Do You Automatically Tune Your Evaluation Tables

Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: The texel evaluation function optimization algorithm

Post by Evert »

petero2 wrote: I don't think there is any such assumption in my method. My method only fixes K, not the value of any evaluation weight (such as the nominal value of a pawn). Also, my method optimizes on the scores returned by the q-search function, which is not a vector value. The interpolation between MG and EG scores has to happen before the search sees the score; otherwise alpha-beta would not work.
Of course; I think you misunderstand what I was trying to say.

Assuming a traditional linear interpolation between mid-game and end-game scores (I gather you do something slightly different, which honestly sounds reasonable), the evaluation is a 2-vector composed of a mid-game and an end-game score, which gets projected to a 1D value for the purpose of alpha-beta. In principle the two evaluations can use very different scales (say, centi-pawns for the mid-game and milli-pawns for the end-game, to take a slightly stupid example), but they need to be normalised with respect to each other before the interpolation (of course, if you don't do that, or get it badly wrong, the program plays bad chess). One easy normalisation is to fix the value of the pawn in both (as Thomas was doing).

However, when you do that, the winning rate given a (say) +0.5 advantage in the end-game need not be the same as the winning rate for a +0.5 advantage in the middle-game. This shows up as noise if you pass the score through a logistic function and compare it with the actual outcome of the games the positions were taken from. Unlike the noise inherent in playing games, this is a systematic error: an inherent spread in the results that you cannot get rid of unless you allow the value of the pawn to change with game phase.
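To make the normalisation concrete, here is a minimal sketch of the usual linear interpolation; the 0..256 phase scale and the function name are illustrative, not taken from any particular engine:

Code: Select all

// Tapered eval: project the (mg, eg) 2-vector onto a single value.
// phase runs from 256 (pure middle-game) down to 0 (pure end-game).
int interpolate(int mg, int eg, int phase) {
    return (mg * phase + eg * (256 - phase)) / 256;
}

If mg and eg are expressed on different scales, the same interpolated +100 corresponds to different winning rates depending on phase, which is exactly the systematic spread described above.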

I know I say evaluation rather than q-score, but it comes down to the same thing; the q-search is only needed to make sure you reach a quiet position that the evaluation can say something about.
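For concreteness, the quantity being minimised here is, if I read Peter's description correctly, something like

E = 1/N * sum_i (R_i - sigma(q_i))^2, with sigma(q) = 1 / (1 + 10^(-K*q/400)),

where R_i is the result (0, 0.5 or 1) of the game that position i was taken from, q_i is its q-search score, and K is the scaling constant that stays fixed.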
petero2 wrote: In fact my method assumes that there is not a perfect match between q-search scores and win probability. The goal of the optimization is to adjust the weights to make the mismatch smaller.
Of course. I do know how an optimisation works. ;)
However, it should still not matter whether a given score comes from a mid-game or an end-game position: the win probability (distribution) for a given score should be independent of game-phase, otherwise you may get swamped by systematic uncertainties. This, it seems to me, is a tacit assumption that is required for the method to work well.
petero2 wrote: From your post I get the impression that you are not really using the method I described, but instead a variation of it. In my method the q-search score is used to compute E, and the set of training positions includes non-quiet positions and positions where the evaluation score is very large, …
Of course. Unless you filter out non-quiet positions first (say, positions where qscore != eval) you cannot use eval directly, but I don't see why you'd do that (except to throw away useful positions and possibly sabotage your statistics). I do throw away positions that are "close to" mate, since the evaluation is pointless for those anyway.
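For the filtering variant, a minimal sketch; evaluate() and qSearch() stand for the engine's own static evaluation and quiescence search, with the obvious signatures:

Code: Select all

// Keep only positions whose static eval already agrees with q-search,
// i.e. positions that are quiet. evaluate() and qSearch() are assumed
// engine-specific functions, not from any particular codebase.
bool isQuiet(Position& pos) {
    const int INF = 1000000;
    return evaluate(pos) == qSearch(pos, -INF, INF);
}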
petero2 wrote: However, as I wrote in the description of my method, one big objection to this method is that it assumes that "correlation implies causation". This is also a possible reason you didn't see any improvement.
Well, I was also trying to use it on a very young engine that is very poorly tuned to begin with (and not an ortho-chess engine, though that hardly matters for the method), where the range of drawish scores is much larger than in chess (a pawn advantage is worth less than it is in chess). It also doesn't recognise some dead-drawn material combinations. This all combines to give a relatively weak correlation between score and win probability, which is probably a contributing factor to why this didn't work. I'll get back to it at some point in the future.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: The texel evaluation function optimization algorithm

Post by Michel »

petero2 wrote: Nevertheless, I have so far increased the strength of Texel by about 150 Elo points in 2.5 months using this method.
Wow!
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

Evert wrote: Of course. Unless you filter out non-quiet positions first (say, positions where qscore != eval) you cannot use eval directly, but I don't see why you'd do that (except to throw away useful positions and possibly sabotage your statistics). I do throw away positions that are "close to" mate, since the evaluation is pointless for those anyway.
Computationally, a call to eval and a call to qSearch are hugely different. If you want non-quiet positions in the set, you should replace each non-quiet position with the position the qSearch score actually comes from.

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
Antonio Torrecillas
Posts: 90
Joined: Sun Nov 02, 2008 4:43 pm
Location: Barcelona

Re: The texel evaluation function optimization algorithm

Post by Antonio Torrecillas »

tpetzke wrote:
Evert wrote: Of course. Unless you filter out non-quiet positions first (say, positions where qscore != eval) you cannot use eval directly, but I don't see why you'd do that (except to throw away useful positions and possibly sabotage your statistics). I do throw away positions that are "close to" mate, since the evaluation is pointless for those anyway.
Computationally, a call to eval and a call to qSearch are hugely different. If you want non-quiet positions in the set, you should replace each non-quiet position with the position the qSearch score actually comes from.
In a tactical line, neither the static nor the quiescence value is trustworthy.
If the computational effort of quiescence evaluation bothers you, consider that the positions were selected from a PGN file in which a full search made the move. If, in a given position, quiescence would return a different value than the static evaluation, there are two possibilities:
The quiescence move was played, in which case the quiescence value is presumably better.
The quiescence capture was not played, so the static and quiescence values are equally good/bad.

This is what I do: when I parse the PGN file I discard every position from which a capture, a check, an evasion, or a promotion move was made. This is easy and quick to do: you look for '=', '+', or 'x' characters in the PGN move, or a '+' in the previous move for an evasion. I discard positions where a tactical shot is developing, and I expect the rest to be quiet enough. (I do tune with static values.) A sketch of such a filter is below.
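A minimal sketch of that filter, assuming SAN move strings; the helper names are made up for illustration:

Code: Select all

#include <string>

// True if the SAN move carries a capture, check/mate or promotion marker.
bool isTactical(const std::string& san) {
    return san.find('x') != std::string::npos   // capture
        || san.find('+') != std::string::npos   // check
        || san.find('#') != std::string::npos   // mate
        || san.find('=') != std::string::npos;  // promotion
}

// Keep a position only if the move played from it is not tactical and the
// previous move did not give check (i.e. this move is not an evasion).
bool keepPosition(const std::string& movePlayed, const std::string& previousMove) {
    return !isTactical(movePlayed)
        && previousMove.find('+') == std::string::npos;
}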

best regards.
Antonio.
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

I meant something different.

In the process of optimization, Peter calls qSearch on every position in the test set to get a score. He does not call eval, because there are non-quiet positions in his set.

This makes the optimization expensive. In a test set with 1 million positions and 256 candidate solutions, you call qSearch 256 million times to complete a generation; it would be much faster to just call eval 256 million times.

If you replace each non-quiet position in your test set with the position that the qSearch score at that position actually comes from, you can then just call eval during the optimization, along the lines of the sketch below.
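Something like the following; Position, Move, evaluate(), generateCaptures() and make/unmake are assumed engine primitives, and with alpha-beta cutoffs the reported leaf is only approximate, which should not matter much for building a tuning set:

Code: Select all

// A fail-soft q-search that also reports the quiet position its score
// comes from, so the test set can store that position instead.
int qSearchWithLeaf(Position& pos, int alpha, int beta, Position& leaf) {
    leaf = pos;                        // default: stand-pat at this node
    int best = evaluate(pos);
    if (best >= beta) return best;
    if (best > alpha) alpha = best;
    for (Move m : generateCaptures(pos)) {
        Position childLeaf;
        pos.make(m);
        int score = -qSearchWithLeaf(pos, -beta, -alpha, childLeaf);
        pos.unmake(m);
        if (score > best) { best = score; leaf = childLeaf; }
        if (best >= beta) return best;
        if (best > alpha) alpha = best;
    }
    return best;
}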

So this is as good or as bad as the method Peter uses, but much faster. I'm not concerned with the effort to create the test set; that is a one-time effort. I'm concerned with the effort of processing it in the optimization later.
Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
Antonio Torrecillas
Posts: 90
Joined: Sun Nov 02, 2008 4:43 pm
Location: Barcelona

Re: The texel evaluation function optimization algorithm

Post by Antonio Torrecillas »

Using quiescence or full-search values also makes sense; note that changing some evaluation parameters may lead to a different quiescence result.
Pre-computing the quiescence position can therefore be misleading: for example, exchanging queens may hinge on a third parameter. Change that parameter and quiescence no longer exchanges queens.

For a discussion of the search argument, see for example the Deep Thought eval tuning description by Andreas Nowatzyk from January 2000:
We observed that deeper searches in the tuning code lead to better results, even though it could be argued that evaluations should be orthogonal to searching.
Now, what is better: a large set of quiet positions tuned statically, a smaller set evaluated with quiescence/search, or still more expensive tuning with games?

My approach is a mix of the first and last options. I tune with static values and then run a small feedback tournament (2,000-3,000 games). The PGN file is then parsed to generate the new tuning set (with, I hope, plenty of misevaluated positions).
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

Antonio Torrecillas wrote: Now, what is better: a large set of quiet positions tuned statically, a smaller set evaluated with quiescence/search, or still more expensive tuning with games?
I tried 2 (kind of) and 3 so far.

Method 2 failed. Here I used an EPD set and recorded how often the best move was found with a shallow (2-ply) search, roughly the fitness sketched below.
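For reference, the fitness was roughly this; EpdRecord and searchBestMove() are stand-ins for my actual EPD reader and engine search, and Position/Move are the engine's own types:

Code: Select all

#include <vector>

struct EpdRecord { Position pos; Move bestMove; };  // pos plus the EPD "bm" move

Move searchBestMove(Position& pos, int depth);      // assumed engine search

// Fitness of one candidate weight vector: the fraction of EPD positions
// where a shallow search finds the recorded best move.
double bestMoveHitRate(std::vector<EpdRecord>& testSet) {
    int hits = 0;
    for (EpdRecord& rec : testSet)
        if (searchBestMove(rec.pos, 2) == rec.bestMove)  // shallow 2-ply search
            ++hits;
    return double(hits) / testSet.size();
}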

Method 3 works, but as you said, it is expensive, so I'm looking for cheaper alternatives.

http://macechess.blogspot.de/2013/04/ho ... ction.html

Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: The texel evaluation function optimization algorithm

Post by jdart »

To make method 3 (tuning by games) work, it is helpful to have a tuning algorithm that minimizes the number of iterations (matches) required. I have been experimenting recently with the BOBYQA algorithm from http://dlib.net/optimization.html. It is too early to tell whether this is effective, but it looks promising. The algorithm makes few assumptions about the function it is optimizing and converges rapidly for many optimization problems. One thing I have found is that it is easier to pick an initial trust region if you normalize the variables being tuned so they have the same range, although this is not required.
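A minimal sketch of such a call, with all weights normalized to the same [0, 1] box; the objective below is a dummy stand-in for the real match-playing function:

Code: Select all

#include <dlib/optimization.h>

typedef dlib::matrix<double, 0, 1> column_vector;

int main() {
    const long n = 10;                  // number of weights being tuned
    column_vector x = dlib::uniform_matrix<double>(n, 1, 0.5);  // start mid-range

    // Stand-in objective: a quadratic bowl. In practice this would map x
    // back to engine weights, play a match, and return a loss to minimize.
    auto objective = [](const column_vector& w) {
        double loss = 0;
        for (long i = 0; i < w.size(); ++i)
            loss += (w(i) - 0.3) * (w(i) - 0.3);
        return loss;
    };

    dlib::find_min_bobyqa(objective, x,
        2 * n + 1,                                // interpolation points
        dlib::uniform_matrix<double>(n, 1, 0.0),  // lower bounds
        dlib::uniform_matrix<double>(n, 1, 1.0),  // upper bounds
        0.2,                                      // initial trust region radius
        1e-3,                                     // stopping trust region radius
        500);                                     // max objective evaluations
    return 0;
}

Because every variable lives in the same [0, 1] box, a single trust region radius fits all of them, which is the point of the normalization.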

--Jon
Antonio Torrecillas
Posts: 90
Joined: Sun Nov 02, 2008 4:43 pm
Location: Barcelona

Re: The texel evaluation function optimization algorithm

Post by Antonio Torrecillas »

tpetzke wrote: so I'm looking for cheaper alternatives
Methods with quick convergence tend to be prone to overfitting. (I am a bit of a maniac about tuning; I spent last year trying different methods. Zero Elo of improvement, but a very fun year.) The larger the training set, the less prone it is to overfitting, and the longer it takes to get results.
Is it possible to tune with a single run over a very large data set?
Let me give you an idea, as a thought experiment, or implement it if it looks fine to you. (I have already done so; it is just as good/bad as any other method.)

Running votes with inertia:

Code: Select all

#include <vector>

enum Vote { dont_change, increase, decrease };

const int    MaxParameters = 256;
double       votes[MaxParameters][3];
const double Ref_level = 0.6;          // controls the learning rate

// Engine-specific pieces, assumed to exist elsewhere:
//   Position                      - the engine's position type
//   evalError(pos)                - error of the current eval on pos
//   evalErrorWithIncrease(pos, i) - error with parameter i bumped one step up
//   adjustParameter(i, delta)     - apply the increment/decrement
double evalError(const Position& pos);
double evalErrorWithIncrease(const Position& pos, int i);
void   adjustParameter(int i, int delta);

void tunePass(const std::vector<Position>& positions, int numParameters,
              double reference) {
    for (const Position& pos : positions) {
        for (int i = 0; i < numParameters; ++i) {
            double err = evalError(pos);
            if (err < reference)                           // eval good enough
                votes[i][dont_change]++;
            else if (evalErrorWithIncrease(pos, i) < err)  // increasing helps
                votes[i][increase]++;
            else
                votes[i][decrease]++;

            // Criterion to award an increment.
            double ratio = (votes[i][increase] - votes[i][decrease])
                         / (votes[i][increase] + votes[i][decrease]
                            + votes[i][dont_change]);
            if (ratio > Ref_level) {
                adjustParameter(i, +1);
                // To avoid bouncing up and down, halve the votes instead
                // of resetting them to zero (inertia).
                votes[i][increase]    /= 2;
                votes[i][decrease]    /= 2;
                votes[i][dont_change] /= 2;
            } else if (ratio < -Ref_level) {
                adjustParameter(i, -1);
                votes[i][increase]    /= 2;
                votes[i][decrease]    /= 2;
                votes[i][dont_change] /= 2;
            }
        }
    }
}
Ref_level controls the learning rate.
You can stop, save and restore the state, and resume the process.
A sparse feature does not accumulate enough votes for an increment and can be dropped from the evaluation.
tpetzke
Posts: 686
Joined: Thu Mar 03, 2011 4:57 pm
Location: Germany

Re: The texel evaluation function optimization algorithm

Post by tpetzke »

Antonio Torrecillas wrote: I spent last year trying different methods. Zero Elo of improvement, but a very fun year.

The club seems larger than I thought.

Your method looks interesting. Is it able to leave a local optimum again, or will it optimize towards it and then stop?
Thomas...

=======
http://macechess.blogspot.com - iCE Chess Engine