tuning via maximizing likelihood

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: tuning via maximizing likelihood

Post by jdart »

I am not sure it is done correctly.

--Jon
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

AlvaroBegue wrote: How do you handle draws? Or does your evaluation function return W/D/L probabilities?

But the real answer is that I don't need to penalize my evaluation function infinitely for getting one case wrong, which using logs would do.
Theoretically, log(logistic(x)) and log(1 - logistic(x)) are perfectly well behaved for any finite value x (since logistic maps (-Inf, +Inf) to (0, 1)). However, to avoid numerical overflow and underflow when using single precision floats, one needs to be careful to scale the linear eval (sum of weights * features) to somewhere in the range [-10, +10] (or [-20, +20] for double precision floats).
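For what it's worth, the usual numerical trick is to never form logistic(x) first and then take the log, but to compute log(logistic(x)) directly via log1p, which sidesteps both overflow and the loss of precision near 0 and 1. A minimal Python sketch (illustrative only, not from any engine's code):

```python
import math

def log_logistic(x: float) -> float:
    # Numerically stable log(logistic(x)) = -log(1 + exp(-x)).
    # Branch so exp() is only ever called on a non-positive
    # argument, which cannot overflow.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def log_one_minus_logistic(x: float) -> float:
    # log(1 - logistic(x)) = log(logistic(-x))
    return log_logistic(-x)
```

With this formulation, even extreme scores like x = 1000 stay finite, so the scaling of the linear eval becomes much less delicate.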
Rein Halbersma

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

AlvaroBegue wrote: I don't know why you think that the convergence would be better than using mean squared error. I don't really know if it would be, but I am curious if you have a reason to believe that a priori.
If you have as MSE loss function for features x and weights w:

Code: Select all

Loss = 1/2 (result - logistic(w * x))^2
then you get as gradient

Code: Select all

Grad(w) = (result - logistic(w * x)) * x * logistic(w * x) * (1 - logistic(w * x))
If you use the log-likelihood loss function

Code: Select all

Loss = (1 - result) * log(1 - logistic(w * x)) + result * log(logistic(w * x))
then you get as gradient

Code: Select all

Grad(w) = (result - logistic(w * x)) * x
For positions where your eval score w * x is very far off the result, the MSE gradient will be almost zero (since either logistic(w * x) is near zero or 1 - logistic(w * x) is near zero). This means that wrong predictions get penalized a lot less with an MSE loss function than with a log-likelihood loss function, and convergence with MSE is expected to be slower (all else being equal: same initialization values, same features, etc.)
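To put a number on this, here is a small Python sketch comparing the two gradients on a badly mispredicted position (the single feature value x = 1 and raw score are arbitrary placeholders, not engine values):

```python
import math

def logistic(x: float) -> float:
    # Standard logistic (sigmoid) function.
    return 1.0 / (1.0 + math.exp(-x))

def mse_grad(result: float, score: float, x: float) -> float:
    # Gradient (up to sign) of 1/2 * (result - logistic(w*x))^2
    # w.r.t. w, for a single feature x; score stands for w * x.
    p = logistic(score)
    return (result - p) * x * p * (1.0 - p)

def loglik_grad(result: float, score: float, x: float) -> float:
    # Gradient of the log-likelihood w.r.t. w.
    return (result - logistic(score)) * x

# A won position (result = 1) that the eval scores at -8:
# the MSE gradient is tiny, the log-likelihood gradient is not.
g_mse = mse_grad(1.0, -8.0, 1.0)
g_ll = loglik_grad(1.0, -8.0, 1.0)
```

Here g_ll is roughly three orders of magnitude larger than g_mse, which is exactly the vanishing-gradient effect described above.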
Rein Halbersma

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

AlvaroBegue wrote: I raised two separate objections. One of them is that, although using log-likelihood is somewhat theoretically motivated, it is not at all clear how draws should be handled.
I agree that plugging result = 1/2 into the log-likelihood for binomial logistical regression is ad hoc (even if it works numerically). The ordered logit model (aka cumulative link model aka proportional odds model) is ideally suited for handling draws. The loss function is then

Code: Select all

if (result == L) return log(logistic(theta_LD - w * x))
if (result == D) return log(logistic(theta_DW - w * x) - logistic(theta_LD - w * x))
if (result == W) return log(1 - logistic(theta_DW - w * x))
Here, theta_LD and theta_DW are threshold parameters, a kind of generalized intercept term. They represent the eval advantage necessary to make a loss/win more likely than a draw. These parameters also need to be optimized (the Arasan source code takes them as constants, but that is suboptimal). Note that "w * x" should not contain a constant term. Also note that the probability terms inside the logs add up to 1. Finally, you can impose the restriction that theta_LD < 0 < theta_DW, or even theta_LD = -theta_DW. In that case, the eval should contain a side-to-move bonus.

Since 1 - logistic(x) = logistic(-x), taking theta_DW = theta_LD = 0, you recover the log-likelihood for binomial regression (draws then get probability zero).
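A sketch of this per-position log-likelihood in Python (the threshold values and score used below are illustrative placeholders, not the Arasan constants):

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ordered_logit_loglik(result: str, score: float,
                         theta_ld: float, theta_dw: float) -> float:
    # score is the linear predictor w * x (no constant term);
    # theta_ld < theta_dw are the loss/draw and draw/win thresholds.
    p_loss = logistic(theta_ld - score)
    p_draw = logistic(theta_dw - score) - p_loss
    p_win = 1.0 - logistic(theta_dw - score)
    # The three probabilities add up to 1 by construction.
    if result == 'L':
        return math.log(p_loss)
    if result == 'D':
        return math.log(p_draw)
    return math.log(p_win)
```

Setting theta_dw = 0 makes the win probability exactly logistic(score), recovering the binomial case for win/loss outcomes.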
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: tuning via maximizing likelihood

Post by Michel »

It seems one simply needs to choose an elo model (e.g. Bayes-Elo or Davidson)...

One may regard the evaluation function as a predictor for the elo of a position. Using an elo model one may convert the elo of a position into actual w/d/l ratios.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Michel

Re: tuning via maximizing likelihood

Post by Michel »

In Arasan this seems to be implemented in a rather non-standard way. If e is the prediction by the evaluation function then (w,d,l) are set to e,1-e,(e*(1-e))**2. These do not sum up to 1 and so they are not probabilities.

My instinct would be to stay with a standard ML implementation (supported by a standard elo model). But of course it is impossible to know without testing.
Rein Halbersma

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

Michel wrote: In Arasan this seems to be implemented in a rather non-standard way. If e is the prediction by the evaluation function then (w, d, l) are set to e, 1-e, (e*(1-e))**2. These do not sum up to 1 and so they are not probabilities.

My instinct would be to stay with a standard ML implementation (supported by a standard elo model). But of course it is impossible to know without testing.
The ordered logit model is such an elo-model. See e.g. Knorr-Held, L., 2000. Dynamic rating of sports teams. Journal of the Royal Statistical Society: Series D (The Statistician), 49(2), pp. 261-276.

The nice thing is that it uses a latent variable approach, so that in a search tree one can use just the linear predictor w * x (the log and logistic are monotone transformations, so they leave the search outcome invariant).
Rein Halbersma

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

Daniel Shawul wrote: Good to know! So far I have had better results with the ML objective function -- even though both barely improved my engine. You seem to use a 1 draw = 2 wins + 2 losses approach unless I am mistaken; is that intentional? I am only aware of elo models that use 1 draw = 1 win + 1 loss (Rao-Kupper), 2 draws = 1 win + 1 loss (Davidson).

Daniel
Plugging result=1/2 into the binomial logistic likelihood forces 2 draw = 1 win + 1 loss.
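This identity is easy to check numerically; a small Python sketch (the score value is an arbitrary placeholder):

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def loglik(result: float, score: float) -> float:
    # Binomial log-likelihood with a fractional "result".
    p = logistic(score)
    return result * math.log(p) + (1.0 - result) * math.log(1.0 - p)

# Two draws contribute exactly the same log-likelihood as
# one win plus one loss, at any score:
s = 0.4
assert abs(2 * loglik(0.5, s) - (loglik(1.0, s) + loglik(0.0, s))) < 1e-12
```

The identity is exact: 2 * (1/2 log p + 1/2 log(1-p)) = log p + log(1-p), independent of the score.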
Michel

Re: tuning via maximizing likelihood

Post by Michel »

Rein Halbersma wrote: The ordered logit model is such an elo-model. See e.g. Knorr-Held, L., 2000. Dynamic rating of sports teams. Journal of the Royal Statistical Society: Series D (The Statistician), 49(2), pp. 261-276.
Interesting. For w,d,l outcomes this seems to translate into the Bayes-Elo model if we impose the additional (very reasonable) requirement that elo=0 implies w=l (assuming of course that we take the logistic function as the response function).
Michel

Re: tuning via maximizing likelihood

Post by Michel »

Rein Halbersma wrote: The ordered logit model is such an elo-model. See e.g. Knorr-Held, L., 2000. Dynamic rating of sports teams. Journal of the Royal Statistical Society: Series D (The Statistician), 49(2), pp. 261-276.
In fact the standard Bayes-Elo model with the "draw elo" and "white advantage" parameters corresponds exactly to the model in the paper.

In the case of evaluation tuning the white advantage parameter is not relevant as it will have been incorporated in the stm bonus in the evaluation function.