Texel tuning - some issues


jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Texel tuning - some issues

Post by jdart »

Over the past year or so Arasan has implemented the Texel tuning method for its scoring parameters. Currently there are over 900 parameter values tuned with this method. Because the parameters generally have a linear relationship with the score, Arasan explicitly computes the gradient and currently uses the ADAM algorithm to minimize the objective. For training data it takes an input PGN file, does an initial low-depth search of each position, and stores the end position of the PV for each move. These PV end positions are then used as the training set.
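
For concreteness, here is a minimal sketch of the kind of explicit gradient computation a linear evaluation allows, using the common sigmoid / mean-square-error objective discussed below as an example. The names and structure are illustrative only, not Arasan's actual code:

Code: Select all

   #include <cmath>
   #include <vector>

   // predicted score in [0,1] from an eval, with K the Texel scaling constant
   double sigmoid(double eval, double K) {
      return 1.0 / (1.0 + std::exp(-K * eval));
   }

   // Accumulate the gradient of the squared error (predict - result)^2 for one
   // training position. Because eval = dot(weights, features), we have
   // d(eval)/d(weights[i]) = features[i], so the chain rule gives the gradient
   // without any numerical differentiation.
   void addGradient(std::vector<double> &grad,
                    const std::vector<double> &weights,
                    const std::vector<double> &features,
                    double result, double K) {
      double eval = 0.0;
      for (size_t i = 0; i < weights.size(); ++i)
         eval += weights[i] * features[i];
      double predict = sigmoid(eval, K);
      // d(err)/d(eval) = 2*(predict - result) * K * predict * (1 - predict)
      double dErr_dEval = 2.0 * (predict - result) * K * predict * (1.0 - predict);
      for (size_t i = 0; i < grad.size(); ++i)
         grad[i] += dErr_dEval * features[i];
   }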

Lately I have been taking a closer look at the tuning algorithm and code and trying some variants of it. So far this has not resulted in any further improvement, even though I believe some of the modifications have a better foundation. "More correct, less buggy" has meant lower performance, which is something I think most chess programmers encounter from time to time.

One of the things I have been looking at is the objective function. The CPW writeup suggests using the mean-square error between the predicted score (a sigmoid of the evaluation) and the actual game result (0 for a loss, 0.5 for a draw, 1 for a win). I think this is workable but a bit unusual in machine learning, and it is not actually what Arasan currently uses. A couple of other possibilities:

1. Use ordinal logistic regression (see http://qwone.com/~jason/writing/olr.pdf) with the three possible result values (win, loss, draw). This requires threshold parameters to separate the draw region from the win and loss regions. The loss function would look like this:

Code: Select all

   // softplus: a smooth approximation to max(0, PARAM1*x), where PARAM1
   // plays the role of the Texel "k" constant
   auto h = [] (double x) { return log(1.0+exp(PARAM1*x)); };

   double err;
   if (result == 0) {
      // loss: penalize evaluations above the lower threshold
      err = h(value-THETA1);
   }
   else if (result == 0.5) {
      // draw: penalize evaluations outside the [THETA1, THETA2] band
      err = h(THETA1-value) + h(value-THETA2);
   }
   else {
      // win: penalize evaluations below the upper threshold
      err = h(THETA2-value);
   }

where PARAM1 is the Texel "k" constant and THETA1, THETA2 are the score thresholds bounding the draw region, probably around a pawn in magnitude (these can be tuned, although not with gradient descent). I tried this, but it was not an improvement.

2. Another objective I have looked at is this:

Code: Select all

   // "predict" is the sigmoid of the evaluation, in [0,1]
   if (result == 1.0) {
      // win: standard log loss
      return -log(predict);
   } else if (result == 0.0) {
      // loss: standard log loss
      return -log(1-predict);
   } else {
      // draw: log measure of the distance from the draw score 0.5,
      // scaled so that the error is zero when predict == 0.5
      if (predict > result)
         return -0.5*log(2*(1-predict));
      else
         return -0.5*log(2*predict);
   }
Here "predict" is the sigmoid value. This uses the "log loss" that is commonly used for a 2-value result but for draws it uses a log measure of distance from a draw score. Technically the "if" condition makes it not differentiable but IMO that is not a problem because the function has a smooth transition at 0.5. Arasan currently uses something like this but with a different measure for the draw score, which I think is actually incorrect.

I have also had some concern about the optimization method. One of the issues is that, like most programs, Arasan has integer-valued parameters. ADAM and many other algorithms really assume the values are continuous. For parameters that can take a wide range of values it may be OK to just relax the integer assumption and tune as if the parameters were real-valued, but I have some parameters that take only a limited range of values.

To make things worse, Arasan initially did not scale its parameters but used a fairly large initial step size for all of them. I have now corrected this so that the step sizes are based on the parameter range. And for ADAM I have experimented with constraining the step sizes so that for parameters with a small range they do not decay to zero so quickly, but stay at size 1 for at least some number of iterations. Still, this is a hack IMO. The fundamental problem is that technically finding the minimum is a mixed-integer programming problem, not a continuous optimization problem.
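
For illustration, a minimal sketch of such a constrained ADAM update for a single parameter. The range scaling and the 100-iteration floor are arbitrary choices for the example, not Arasan's actual code:

Code: Select all

   #include <cmath>

   struct AdamState {
      double m = 0.0, v = 0.0;   // first and second moment estimates
   };

   // Returns the amount to subtract from the parameter. "iteration" starts
   // at 1; "range" is the allowed parameter range (max - min).
   double adamStep(AdamState &s, double grad, double range, int iteration,
                   double baseRate = 0.01, double beta1 = 0.9,
                   double beta2 = 0.999, double eps = 1e-8) {
      s.m = beta1 * s.m + (1.0 - beta1) * grad;
      s.v = beta2 * s.v + (1.0 - beta2) * grad * grad;
      double mHat = s.m / (1.0 - std::pow(beta1, iteration));
      double vHat = s.v / (1.0 - std::pow(beta2, iteration));
      // scale the step by the parameter range instead of using one
      // global step size for every parameter
      double step = baseRate * range * mHat / (std::sqrt(vHat) + eps);
      // floor the step so that small-range parameters keep taking
      // integer-sized moves for the first iterations
      if (step != 0.0 && std::abs(step) < 1.0 && iteration < 100)
         step = (step < 0.0 ? -1.0 : 1.0);
      return step;
   }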

I am seeing more variation than I would expect in the tuning results across multiple runs and they seem quite sensitive to the training set used (PGN input). Besides games from my hyperbullet gauntlets, I had also tried using some playchess games, but I think that is problematic because those games typically are from engines with a deep opening book, so the engine is not even calculating until late in the game.

Given this variation and the various other issues noted above, I am even wondering whether at least some of the improvement I have seen after tuning the latest version (19.2) is the result of a lucky tuning run that happened to go in the right direction.

Anyway, I wanted to share these thoughts and experiments and if anyone has suggestions I would be grateful for them.

--Jon
lantonov
Posts: 216
Joined: Sun Apr 13, 2014 5:19 pm

Re: Texel tuning - some issues

Post by lantonov »

I have tried a mixed GLM (Generalized Linear Model) with a logit link, essentially an ordinal logistic regression with five categories (-2, -1, 0, 1, 2). The fixed effects came from a DOE (design of experiments) set of perturbed parameters, and the random effect was the choice of openings.
The results were mixed (sometimes more and sometimes less Elo than the base version), and I couldn't find the source of the inconsistency. The experiments were very bulky and not amenable to sequential testing.
AlvaroBegue
Posts: 931
Joined: Tue Mar 09, 2010 3:46 pm
Location: New York
Full name: Álvaro Begué (RuyDos)

Re: Texel tuning - some issues

Post by AlvaroBegue »

It is true that the different ranges are an issue. In neural networks it helps to use "whitening", where you shift and rescale the inputs so they all have mean 0 and variance 1.
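
For illustration, a minimal sketch of that kind of per-feature standardization (not from any particular engine):

Code: Select all

   #include <cmath>
   #include <vector>

   // Shift and rescale one feature column to mean 0 and variance 1.
   void whiten(std::vector<double> &column) {
      double mean = 0.0;
      for (double x : column) mean += x;
      mean /= column.size();
      double var = 0.0;
      for (double x : column) var += (x - mean) * (x - mean);
      var /= column.size();
      double sd = std::sqrt(var) + 1e-12;   // guard against zero variance
      for (double &x : column) x = (x - mean) / sd;
   }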

If you are doing full batch optimization (i.e., not using stochastic gradient descent), I don't believe Adam is a very appropriate algorithm. Part of what Adam does is filter the gradients, because they are expected to be noisy, but you are probably feeding it gradients with no noise.

Instead of Adam try:
(1) a second-order method, like L-BFGS or Hessian-free optimization; or
(2) a simple adaptive method, like the one described in this lecture by Hinton. You have a nominal learning rate and then a per-parameter multiplier, initialized to 1. If two consecutive gradients have the same sign for some parameter, you add 0.05 to the per-parameter multiplier; otherwise, you multiply it by 0.95. I found that 0.2 and 0.8 worked better for some problems. YMMV. (A sketch of this update follows below.)
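
A minimal sketch of that adaptive update, assuming full-batch gradients and the 0.05 / 0.95 constants mentioned above:

Code: Select all

   #include <vector>

   // One update of all parameters with per-parameter adaptive gains.
   // "gain" should be initialized to 1.0 and "prevGrad" to 0.0.
   void adaptiveStep(std::vector<double> &params,
                     const std::vector<double> &grad,
                     std::vector<double> &gain,
                     std::vector<double> &prevGrad,
                     double learningRate) {
      for (size_t i = 0; i < params.size(); ++i) {
         if (grad[i] * prevGrad[i] > 0.0)
            gain[i] += 0.05;      // same sign as last time: grow the gain
         else
            gain[i] *= 0.95;      // sign flip (or zero): shrink the gain
         params[i] -= learningRate * gain[i] * grad[i];
         prevGrad[i] = grad[i];
      }
   }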
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Texel tuning - some issues

Post by jdart »

Thanks, the adaptive method (2) seems like a reasonable approach.

--Jon
brtzsnr
Posts: 433
Joined: Fri Jan 16, 2015 4:02 pm

Re: Texel tuning - some issues

Post by brtzsnr »

Hi, Jon!

Thank you for your insights.

I also use Texel tuning for Zurichess and I ran into more or less similar issues. I use float weights (continuous values) during training and round to 0.0001 of a pawn for game play.

Some thoughts on choosing the training data:
1) Positions from CCRL are not good because they are not what an engine evaluates during gameplay. I sample from millions of positions evaluated during play. If you want, I can provide you with a sample of 5 million random positions (but no scores).
2) Results for each position should come from equal-strength opponents, to eliminate the effect of tactics. Again, CCRL game results introduce a lot of noise.
3) Tuning has issues with violent positions, because these are resolved by QS and very rarely by the eval.
4) I exclude late endgame positions (5 men or fewer) because these bias the scores on positions where the search is very deep. (A sketch of such a filter follows this list.)
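
As an illustration of points 3 and 4, a filter along these lines might look like the sketch below. Position, isQuiet and pieceCount are hypothetical placeholders, not Zurichess code:

Code: Select all

   // Hypothetical helpers assumed to exist elsewhere:
   //   isQuiet(pos)    - true when the QS value equals the static eval
   //   pieceCount(pos) - total number of men on the board
   struct Position;
   bool isQuiet(const Position &pos);
   int pieceCount(const Position &pos);

   // Keep a position for tuning only if it passes both filters.
   bool useForTuning(const Position &pos) {
      if (!isQuiet(pos))
         return false;            // violent positions are resolved by QS
      if (pieceCount(pos) <= 5)
         return false;            // late endgames bias the scores
      return true;
   }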

Regarding tuning:
1) ADAM gives me consistent results. Retraining usually results in something within 2-3 Elo.
2) Average loss is a good measure only for the same set of features / testing data. Adding new features doesn't necessarily improve game play even if the average loss decreases.
3) If you have non-eval terms dependent on the eval (e.g. futility margins, SEE weights), those can screw up the pruning. I wasted a month of training because my SEE weights were dependent on non-regularized weights.
4) I add some regularization, but not sure how important it is.
5) I tried other optimizers (everything provided by Tensorflow), but they converged very slowly.

That said, Texel tuning is a tool that gives good results, but it's unknown how close the end result is to the optimum.

Regards,
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Texel tuning - some issues

Post by jdart »

brtzsnr wrote: I sample from millions of positions evaluated during play.
How do you know what the result value for these positions is? (Since I assume they may not be in the main line of the game).

--Jon
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Texel tuning - some issues

Post by jdart »

After some more experimentation I have had some success with the combination of:

1. Using the mean-square error (the sigmoid of the search value minus the game result, squared) as the loss function.

2. Using the "quiet-labeled" position set posted here by Alexandru Mosoi as the tuning data.

3. Using an L2 regularization penalty equal to 1.2E-4 times the mean square of the normalized parameter values, where a normalized value is (value - midpoint of range) / parameter range. (A sketch of this loss follows the list.)

4. Using the simple adaptive optimization method suggested by Álvaro Begué.
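
For illustration, a minimal sketch of the loss with the normalization and L2 penalty from point 3. Names and structure are mine, not Arasan's actual code, and the parameter ranges are assumed to be nonzero:

Code: Select all

   #include <vector>

   struct Param {
      double value, minValue, maxValue;
   };

   // Mean-square error over the training set plus an L2 penalty on the
   // normalized parameter values.
   double loss(const std::vector<double> &predicted,   // sigmoid of search value
               const std::vector<double> &results,     // 0, 0.5 or 1
               const std::vector<Param> &params,
               double lambda = 1.2e-4) {
      double mse = 0.0;
      for (size_t i = 0; i < predicted.size(); ++i) {
         double d = predicted[i] - results[i];
         mse += d * d;
      }
      mse /= predicted.size();

      double reg = 0.0;
      for (const Param &p : params) {
         double range = p.maxValue - p.minValue;
         double mid = 0.5 * (p.minValue + p.maxValue);
         double norm = (p.value - mid) / range;
         reg += norm * norm;
      }
      reg /= params.size();   // mean square of the normalized values

      return mse + lambda * reg;
   }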

The tuned values from this are close in performance to what I have in the master branch, which is my reference point. So far they are not better, though. Still, I think this is a sounder method than the one used in the recent Arasan 19.2 release, which had a different loss function and position-selection method.

--Jon