Lately I have been taking a closer look at the tuning algorithm and code and trying some variants of it. So far this has not resulted in any further improvement, even though I believe some of the modifications I have made have a sounder foundation. So far "more correct, less buggy" has meant lower performance, which is something I think most chess programmers encounter from time to time.
One of the things I have been looking at is the objective function. The CPW writeup suggests using the mean squared error between the predicted score (a sigmoid function of the evaluation) and the actual game result (0 for a loss, 0.5 for a draw, 1 for a win). I think this is workable, but it is a bit unusual by machine-learning standards, and it is not actually what Arasan currently uses. A couple of other possibilities:
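For reference, the CPW-style objective can be sketched as follows (illustrative only, not Arasan's actual code; K is the sigmoid scaling constant and eval is in centipawns):

```cpp
#include <cmath>
#include <vector>

// Predicted score from a centipawn evaluation: maps (-inf, inf) to (0, 1).
double sigmoid(double eval, double K) {
    return 1.0 / (1.0 + std::exp(-K * eval));
}

struct Sample {
    double eval;    // engine evaluation for the position, in centipawns
    double result;  // game result: 0, 0.5, or 1
};

// Mean squared error between predicted and actual scores over the training set.
double mse(const std::vector<Sample>& samples, double K) {
    double sum = 0.0;
    for (const auto& s : samples) {
        double diff = s.result - sigmoid(s.eval, K);
        sum += diff * diff;
    }
    return sum / samples.size();
}
```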
1. Use ordinal logistic regression (see http://qwone.com/~jason/writing/olr.pdf) with the three possible result values (win, loss, draw). This requires threshold parameters separating the draw region from the loss and win regions. The loss function would be like this:
Code:
// softplus-style penalty: near zero when x << 0, roughly PARAM1*x when x >> 0
auto h = [] (double x) { return log(1.0 + exp(PARAM1*x)); };
double err;
if (result == 0) {          // loss: penalize scores above THETA1
    err = h(value - THETA1);
}
else if (result == 0.5) {   // draw: penalize scores outside [THETA1, THETA2]
    err = h(THETA1 - value) + h(value - THETA2);
}
else {                      // win: penalize scores below THETA2
    err = h(THETA2 - value);
}
where PARAM1 is the Texel "k" constant and THETA1, THETA2 are score thresholds, probably on the order of one pawn (these can be tuned, although not with gradient descent). I tried this, but it did not yield any improvement.
2. Another objective I have looked at is this:
Code:
if (result == 1.0) {
    return -log(predict);             // win: cross-entropy penalty for low predictions
} else if (result == 0.0) {
    return -log(1.0 - predict);       // loss: cross-entropy penalty for high predictions
} else {
    // draw: zero at predict == 0.5, increasing toward either extreme
    if (predict > result)
        return -0.5*log(2.0*(1.0 - predict));
    else
        return -0.5*log(2.0*predict);
}
I have also had some concern about the optimization method. One of the issues is that, like most programs, Arasan has integer-valued parameters. ADAM and many other algorithms assume the values are continuous. For parameters that can take a wide range of values it may be acceptable to relax the integer assumption and tune them as if they were real-valued, but I have some parameters that take only a limited range of values.
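One common workaround (a sketch of the general idea, not what Arasan does) is to keep a real-valued shadow copy of each parameter for the optimizer and round to the legal integer range only when the engine actually uses the value:

```cpp
#include <algorithm>
#include <cmath>

// Sketch: the optimizer updates the continuous shadow value freely;
// the engine sees a rounded, range-clamped integer.
struct TunableParam {
    double shadow;  // continuous value seen by the optimizer
    int lo, hi;     // legal integer range for the engine

    int value() const {
        return static_cast<int>(std::clamp(std::round(shadow),
                                           static_cast<double>(lo),
                                           static_cast<double>(hi)));
    }
};
```

This at least keeps the optimizer's internal state continuous, though for small-range parameters the rounding can still swallow many consecutive updates.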
To make things worse, Arasan initially did not scale its parameters but used a fairly large initial step size for all of them. I have now corrected this so that step sizes are based on each parameter's range. For ADAM I have also experimented with constraining the step sizes so that for parameters with a small range they do not decay to zero so quickly but stay at size 1 for at least some number of iterations. Still, this is a hack IMO. The fundamental problem is that, technically, finding the minimum is a mixed-integer programming problem (and a nonlinear one, given the sigmoid), not a continuous optimization problem.
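The step-size constraint described above could look roughly like this (illustrative only, not Arasan's code; a standard ADAM update with the base step scaled by the parameter's range, and the step magnitude floored at 1 for the first N iterations):

```cpp
#include <algorithm>
#include <cmath>

// First and second moment estimates for one parameter.
struct AdamState { double m = 0.0, v = 0.0; };

// One ADAM update step. t is the 1-based iteration count; range is
// (hi - lo) for the parameter, used to scale the base learning rate.
double adamStep(AdamState& s, double grad, int t, double range,
                double alpha = 0.001, double beta1 = 0.9,
                double beta2 = 0.999, double eps = 1e-8) {
    s.m = beta1 * s.m + (1.0 - beta1) * grad;
    s.v = beta2 * s.v + (1.0 - beta2) * grad * grad;
    double mHat = s.m / (1.0 - std::pow(beta1, t));  // bias correction
    double vHat = s.v / (1.0 - std::pow(beta2, t));
    // base step scaled by the parameter's range
    double step = alpha * range * mHat / (std::sqrt(vHat) + eps);
    // hack: keep the step magnitude at least 1 for the first N iterations,
    // so small-range integer parameters can still move
    const int N = 100;
    if (t <= N && step != 0.0)
        step = std::copysign(std::max(std::fabs(step), 1.0), step);
    return step;
}
```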
I am seeing more variation than I would expect in the tuning results across multiple runs, and they seem quite sensitive to the training set used (PGN input). Besides games from my hyperbullet gauntlets, I have also tried some playchess games, but I think those are problematic: they typically come from engines with a deep opening book, so the engine is not even calculating until late in the game.
Given this variation and the other issues noted above, I am even wondering whether at least some of the improvement I have seen after tuning the latest version (19.2) is the result of a lucky tuning run that happened to go in the right direction.
Anyway, I wanted to share these thoughts and experiments, and if anyone has suggestions I would be grateful for them.
--Jon