tuning via maximizing likelihood

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: tuning via maximizing likelihood

Post by jdart »

I am not sure it is done correctly.

--Jon
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

AlvaroBegue wrote: How do you handle draws? Or does your evaluation function return W/D/L probabilities?

But the real answer is that I don't need to penalize my evaluation function infinitely for getting one case wrong, which using logs would do.
Theoretically, log(logistic(x)) and log(1 - logistic(x)) are perfectly well behaved for any finite value x (since logistic maps (-Inf, +Inf) to (0, 1)). However, to avoid numerical overflow and underflow when using single precision floats, one needs to be careful to scale the linear eval (sum of weights * features) to somewhere in the range [-10, +10] (or [-20, +20] for double precision floats).
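For what it's worth, the usual numerical trick is to never form logistic(x) first and then take the log, but to compute log(logistic(x)) directly via log1p, which sidesteps both overflow and the loss of precision near 0 and 1. A minimal Python sketch (illustrative only, not from any engine's code):

```python
import math

def log_logistic(x: float) -> float:
    # Numerically stable log(logistic(x)) = -log(1 + exp(-x)).
    # Branch so exp() is only ever called on a non-positive
    # argument, which cannot overflow.
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))

def log_one_minus_logistic(x: float) -> float:
    # log(1 - logistic(x)) = log(logistic(-x))
    return log_logistic(-x)
```

With this formulation, even extreme scores like x = 1000 stay finite, so the scaling of the linear eval becomes much less delicate.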
Rein Halbersma

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

AlvaroBegue wrote: I don't know why you think that the convergence would be better than using mean squared error. I don't really know if it would be, but I am curious if you have a reason to believe that a priori.
If you have as MSE loss function for features x and weights w:

Code: Select all

Loss = 1/2 (result - logistic(w * x))^2
then you get as gradient

Code: Select all

Grad(w) = (result - logistic(w * x)) * x * logistic(w * x) * (1 - logistic(w * x))
If you use the log-likelihood loss function

Code: Select all

Loss = (1 - result) * log(1 - logistic(w * x)) + result * log(logistic(w * x))
then you get as gradient

Code: Select all

Grad(w) = (result - logistic(w * x)) * x
For positions where your eval score w * x is very far off the result, the MSE gradient will be almost zero (since either logistic(w * x) is near zero or 1 - logistic(w * x) is near zero). This means that wrong predictions get penalized a lot less with an MSE loss function than with a log-likelihood loss function, and convergence with MSE is expected to be slower (all else being equal: same initialization values, same features, etc.)
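To put a number on this, here is a small Python sketch comparing the two gradients on a badly mispredicted position (the single feature value x = 1 and raw score are arbitrary placeholders, not engine values):

```python
import math

def logistic(x: float) -> float:
    # Standard logistic (sigmoid) function.
    return 1.0 / (1.0 + math.exp(-x))

def mse_grad(result: float, score: float, x: float) -> float:
    # Gradient (up to sign) of 1/2 * (result - logistic(w*x))^2
    # w.r.t. w, for a single feature x; score stands for w * x.
    p = logistic(score)
    return (result - p) * x * p * (1.0 - p)

def loglik_grad(result: float, score: float, x: float) -> float:
    # Gradient of the log-likelihood w.r.t. w.
    return (result - logistic(score)) * x

# A won position (result = 1) that the eval scores at -8:
# the MSE gradient is tiny, the log-likelihood gradient is not.
g_mse = mse_grad(1.0, -8.0, 1.0)
g_ll = loglik_grad(1.0, -8.0, 1.0)
```

Here g_ll is roughly three orders of magnitude larger than g_mse, which is exactly the vanishing-gradient effect described above.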
Rein Halbersma

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

AlvaroBegue wrote: I raised two separate objections. One of them is that, although using log-likelihood is somewhat theoretically motivated, it is not at all clear how draws should be handled.
I agree that plugging result = 1/2 into the log-likelihood for binomial logistical regression is ad hoc (even if it works numerically). The ordered logit model (aka cumulative link model aka proportional odds model) is ideally suited for handling draws. The loss function is then

Code: Select all

if (result == L) return log(logistic(theta_LD - w * x))
if (result == D) return log(logistic(theta_DW - w * x) - logistic(theta_LD - w * x))
if (result == W) return log(1 - logistic(theta_DW - w * x))
Here, theta_LD and theta_DW are threshold parameters, a kind of generalized intercept term. They represent the eval advantage necessary to make a loss/win more likely than a draw. These parameters also need to be optimized (the Arasan source code takes them as constants, but that is suboptimal). Note that "w * x" should not contain a constant term. Also note that the probability terms inside the logs add up to 1. Finally, you can impose the restriction that theta_LD < 0 < theta_DW, or even theta_LD = -theta_DW. In that case, the eval should contain a side-to-move bonus.

Since 1 - logistic(x) = logistic(-x), taking theta_DW = theta_LD = 0, you recover the log-likelihood for binomial regression (draws then get probability zero).
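A sketch of this per-position log-likelihood in Python (the threshold values and score used below are illustrative placeholders, not the Arasan constants):

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def ordered_logit_loglik(result: str, score: float,
                         theta_ld: float, theta_dw: float) -> float:
    # score is the linear predictor w * x (no constant term);
    # theta_ld < theta_dw are the loss/draw and draw/win thresholds.
    p_loss = logistic(theta_ld - score)
    p_draw = logistic(theta_dw - score) - p_loss
    p_win = 1.0 - logistic(theta_dw - score)
    # The three probabilities add up to 1 by construction.
    if result == 'L':
        return math.log(p_loss)
    if result == 'D':
        return math.log(p_draw)
    return math.log(p_win)
```

Setting theta_dw = 0 makes the win probability exactly logistic(score), recovering the binomial case for win/loss outcomes.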
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: tuning via maximizing likelihood

Post by Michel »

It seems one simply needs to choose an elo model (e.g. Bayes-Elo or Davidson)...

One may regard the evaluation function as a predictor for the elo of a position. Using an elo model one may convert the elo of a position into actual w/d/l ratios.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
Michel

Re: tuning via maximizing likelihood

Post by Michel »

In Arasan this seems to be implemented in a rather non-standard way. If e is the prediction by the evaluation function then (w,d,l) are set to e,1-e,(e*(1-e))**2. These do not sum up to 1 and so they are not probabilities.

My instinct would be to stay with a standard ML implementation (supported by a standard elo model). But of course it is impossible to know without testing.
Rein Halbersma

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

Michel wrote: In Arasan this seems to be implemented in a rather non-standard way. If e is the prediction by the evaluation function then (w, d, l) are set to e, 1-e, (e*(1-e))**2. These do not sum up to 1 and so they are not probabilities.

My instinct would be to stay with a standard ML implementation (supported by a standard elo model). But of course it is impossible to know without testing.
The ordered logit model is such an elo-model. See e.g. Knorr-Held, L., 2000. Dynamic rating of sports teams. Journal of the Royal Statistical Society: Series D (The Statistician), 49(2), pp. 261-276.

The nice thing is that it uses a latent variable approach, so that in a search tree one can use just the linear predictor w * x (the log and logistic are monotone transformations, so they leave the search outcome invariant).
Rein Halbersma

Re: tuning via maximizing likelihood

Post by Rein Halbersma »

Daniel Shawul wrote: Good to know! So far I have had better results with the ML objective function -- even though both barely improved my engine. You seem to use a 1 draw = 2 wins + 2 losses approach unless I am mistaken; is that intentional? I am only aware of elo models that use 1 draw = 1 win + 1 loss (Rao-Kupper), 2 draws = 1 win + 1 loss (Davidson).

Daniel
Plugging result=1/2 into the binomial logistic likelihood forces 2 draw = 1 win + 1 loss.
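This identity is easy to check numerically; a small Python sketch (the score value is an arbitrary placeholder):

```python
import math

def logistic(x: float) -> float:
    return 1.0 / (1.0 + math.exp(-x))

def loglik(result: float, score: float) -> float:
    # Binomial log-likelihood with a fractional "result".
    p = logistic(score)
    return result * math.log(p) + (1.0 - result) * math.log(1.0 - p)

# Two draws contribute exactly the same log-likelihood as
# one win plus one loss, at any score:
s = 0.4
assert abs(2 * loglik(0.5, s) - (loglik(1.0, s) + loglik(0.0, s))) < 1e-12
```

The identity is exact: 2 * (1/2 log p + 1/2 log(1-p)) = log p + log(1-p), independent of the score.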
Michel

Re: tuning via maximizing likelihood

Post by Michel »

Rein Halbersma wrote: The ordered logit model is such an elo-model. See e.g. Knorr-Held, L., 2000. Dynamic rating of sports teams. Journal of the Royal Statistical Society: Series D (The Statistician), 49(2), pp. 261-276.
Interesting. For w,d,l outcomes this seems to translate into the Bayes-Elo model if we impose the additional (very reasonable) requirement that elo=0 implies w=l (assuming of course that we take the logistic function as the response function).
Michel

Re: tuning via maximizing likelihood

Post by Michel »

Rein Halbersma wrote: The ordered logit model is such an elo-model. See e.g. Knorr-Held, L., 2000. Dynamic rating of sports teams. Journal of the Royal Statistical Society: Series D (The Statistician), 49(2), pp. 261-276.
In fact the standard Bayes-Elo model with the "draw elo" and "white advantage" parameters corresponds exactly to the model in the paper.

In the case of evaluation tuning the white advantage parameter is not relevant as it will have been incorporated in the stm bonus in the evaluation function.