Advantage for White; Bayeselo (to Rémi Coulom)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by Edmund »

Thanks to hgm and lucas for the suggestions.

Below you find updated graphs representing the same data.
1) bin-size is 4 elo-points
2) minimum bin-size is 4 samples
3) in the elo-delta graph I shifted all models by 20 elo points to compensate for the white to move advantage
4) I added the cdf of the normal distribution with sd=250
5) I added the function hgm suggested to estimate draws scaling by 40/25
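As a sketch, the binning described in points 1) and 2) could be done like this (a hypothetical helper, assuming each game is given as an (elo-delta, score) pair with score in {0, 0.5, 1}):

```python
from collections import defaultdict

BIN_WIDTH = 4      # Elo points per bin, as in point 1)
MIN_SAMPLES = 4    # discard bins with fewer games, as in point 2)

def bin_scores(games):
    """games: iterable of (elo_delta, score) pairs.
    Returns {bin_center: average score} for bins with enough samples."""
    bins = defaultdict(list)
    for delta, score in games:
        key = round(delta / BIN_WIDTH)
        bins[key].append(score)
    return {key * BIN_WIDTH: sum(s) / len(s)
            for key, s in bins.items() if len(s) >= MIN_SAMPLES}
```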

Image

Agreed, the gauss function is a better fit than either the linear function or the logistic function.
Looking at the new graphs I am not so sure about hgm's suggestion regarding the progression of the avg-elo score function. You are right that the next step is to bring elo-delta into the equation.
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by Edmund »

Addendum:

I just did a quick lookup of the outliers in the avg-elo graph and indeed found that the values far below the linear regression feature a large average elo-delta, while the ones above feature a low average elo-delta.
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by hgm »

It is interesting that the Gaussian seems to give a better fit, despite the fact that the ratings were derived using the logistic. (Let's hope the logistic doesn't give a better fit on ratings derived with the Gaussian...) It could be that for this large data set, based mostly on low delta-Elo data, the obtained ratings are not very sensitive to the model used.

What worries me is that the empirical data seems steeper than the curve of the model from which they were derived. That means that Elo differences you have to put into the logistic formula to get the true score percentage are larger than those spit out by BayesElo. In other words, BayesElo systematically underestimates rating differences, compressing the rating scale.

I wonder if this is an artifact caused by the prior. Do you calculate the ratings for this data set yourself? If so, could you recalculate them using a smaller prior (e.g. 0.1 instead of the standard 2.0)?
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by Rémi Coulom »

hgm wrote:
Rémi Coulom wrote:If the frequency does not match the model, then it is a sign that the model is bad. But if it does match the model, it does not mean that the model is good, because the ratings were computed with the model in the first place.
I am not sure I buy that. The number of games is so large that splitting the data set in two and deriving ratings from one half would not give significantly different ratings. And the other half of the data set would be good enough to define the empirical curve using those derived ratings.

If these empirical frequencies then match the model prediction, the model is by definition perfect. Because that was all the model was supposed to do: derive ratings that could be used to accurately predict frequencies.
It is not so obvious.

Imagine the extreme case where the model consists of saying that all players have the same rating, and there is a given probability of winning and drawing. You can make this model fit the data perfectly. But its predictions are not good.

So your definition of "perfect" may make sense, but "perfect" in the sense of being unbiased is not "perfect" in the sense of making the best possible predictions.

For instance, we could imagine multi-dimensional models that make much better predictions than a "perfect" one-dimensional model.

Even for one-dimensional models, I can imagine distributions of player ratings that have no bias in predicting the winning frequency, but produce poor predictions.

Rémi
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by hgm »

Rémi Coulom wrote:For instance, we could imagine multi-dimensional models that make much better predictions than a "perfect" one-dimensional model.
OK, agreed. But I wouldn't say that the one-dimensional model was not good, then. It would be good as a rating model, assigning the best possible ratings and making the best winning-frequency predictions that any model basing the winning frequency on the difference of a single pair of numbers could make. That you might be able to do better, in terms of predictions, by assigning each player an average and a variance (say), rather than assuming the same variance for all, is very likely true.
Rémi Coulom wrote:Even for one-dimensional models, I can imagine distributions of player ratings that have no bias in predicting the winning frequency, but produce poor predictions.
Not sure what you mean by that. What else is there to predict about a game than the winning frequency? Do you mean it might predict the winning frequency against a group of players, but not against the individual players of that group?
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by Edmund »

hgm wrote:It is interesting that the Gaussian seems to give a better fit, despite the fact that the ratings were derived using the logistic. (Let's hope the logistic doesn't give a better fit on ratings derived with the Gaussian...) It could be that for this large data set, based mostly on low delta-Elo data, the obtained ratings are not very sensitive to the model used.

What worries me is that the empirical data seems steeper than the curve of the model from which they were derived. That means that Elo differences you have to put into the logistic formula to get the true score percentage are larger than those spit out by BayesElo. In other words, BayesElo systematically underestimates rating differences, compressing the rating scale.

I wonder if this is an artifact caused by the prior. Do you calculate the ratings for this data set yourself? If so, could you recalculate them using a smaller prior (e.g. 0.1 instead of the standard 2.0)?
I cannot easily recalculate the Elos; I have taken the data from CCRL.

Taking the elo-delta of the players into account, my model predicts the game outcome on average 0.15 percentage points better.

Code:

Elo(x)    = 1 / (1 + 10^(x/400))
Elo^-1(x) = 400 * LOG10(x / (1-x))
P_draw_given_elodelta = Elo(-delta-whiteadvantage) * (1 - Elo(-delta-whiteadvantage)) * 40/25
Elo_draw  = Elo^-1(P_draw_given_elodelta) + 0.096 * eloavg + 75
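In runnable form the above looks roughly like this (a sketch; the 20-Elo white advantage from earlier in the thread is my assumption for the `whiteadvantage` term):

```python
import math

WHITE_ADV = 20.0   # assumed white-to-move advantage in Elo

def elo(x):
    """Elo(x) = 1 / (1 + 10^(x/400))."""
    return 1.0 / (1.0 + 10.0 ** (x / 400.0))

def elo_inv(x):
    """Inverse: Elo^-1(x) = 400 * LOG10(x / (1-x))."""
    return 400.0 * math.log10(x / (1.0 - x))

def draw_elo(delta, eloavg):
    """Predicted draw measure: the delta term plus the average-Elo
    term with the fitted coefficients 0.096 and 75."""
    p = elo(-delta - WHITE_ADV)
    p_draw = p * (1.0 - p) * 40.0 / 25.0
    return elo_inv(p_draw) + 0.096 * eloavg + 75.0
```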
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by hgm »

Edmund wrote:I cannot easily recalculate the Elos; I have taken the data from CCRL.
Too bad. In what form do you have the data? Is it a PGN of complete games? Would it be possible to reduce it to just player and result tags? (That would be enough to feed it to BayesElo).
Taking into account the elo-delta of the players my model predicts the game outcome on average 0.15 percent-points better.
0.15%? :shock: How do you calculate that? In places (e.g. around +200 delta-Elo) the difference between the green curve and the data points is as much as 5 percentage points, and the statistical noise (based on the spread of the points) there is very much lower. Is it just that there are comparatively few games in those points, and a huge number of games in the points from -20 to +20 Elo?

It seems visually clear to me that you can improve the fit (i.e. reduce the 0.15% error) by scaling the ratings up by some 15-20%. I.e. use

Elo(x) = 1 / (1 + 10^(1.2*x/400))

to predict the score.
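The scale factor could be checked by brute force against the binned data points (a sketch; the 1.00-1.50 search grid is my assumption, and the logistic here uses the standard sign convention where positive delta means the stronger player):

```python
def logistic(delta, scale=1.0):
    """Expected score from the logistic, with a rating-scale factor."""
    return 1.0 / (1.0 + 10.0 ** (-scale * delta / 400.0))

def fit_scale(points, scales=None):
    """points: [(elo_delta, empirical_score), ...].
    Returns the scale factor minimizing squared prediction error."""
    if scales is None:
        scales = [1.0 + 0.05 * i for i in range(11)]  # 1.00 .. 1.50
    return min(scales, key=lambda s: sum((logistic(d, s) - y) ** 2
                                         for d, y in points))
```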
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by Rémi Coulom »

hgm wrote:
Rémi Coulom wrote:Even for one-dimensional models, I can imagine distributions of player ratings that have no bias in predicting the winning frequency, but produce poor predictions.
Not sure what you mean by that. What else is there to predict on a game than the winning frequency? Do you mean it might predict the winning frequency against a group of players, but not against the individual players of that group?
I mean that it is nice to have an unbiased estimator of the probability of winning, but it does not necessarily produce the best predictions. For instance, if you have a formula that produces an unbiased estimate of the probability of winning as a function of the rating difference between players, you might still beat the quality of prediction of that unbiased estimator by using another model that takes the mean rating of the players as an additional parameter. Since it seems that the probability of a draw increases with rating, that more advanced model might produce better predictions than your simple unbiased model.

So a model can be unbiased, and still be improved in terms of prediction quality.

Prediction quality should be measured on data that were not used for computing the ratings. It can be measured by the average log-probability of results, for instance.
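As a sketch, assuming a hypothetical `predict(delta)` that returns win/draw/loss probabilities:

```python
import math

def mean_log_prob(games, predict):
    """games: [(elo_delta, outcome), ...] with outcome in
    {'win', 'draw', 'loss'}.  predict(delta) -> dict of outcome
    probabilities.  Returns the average log-probability of the observed
    results (higher is better); this should be computed on held-out
    games, not the games the ratings were fitted on."""
    total = 0.0
    for delta, outcome in games:
        total += math.log(predict(delta)[outcome])
    return total / len(games)
```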

Rémi
Edmund
Posts: 670
Joined: Mon Dec 03, 2007 3:01 pm
Location: Barcelona, Spain

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by Edmund »

I downloaded the full PGN, then used a script to strip everything away but the game id, result, and both Elos. I then imported the data into Excel, where I am manipulating it now.
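The stripping step could be sketched like this (a sketch assuming every game carries WhiteElo, BlackElo and Result tags; the game id is omitted for brevity):

```python
import re

# Match the three PGN tag pairs we care about, e.g. [WhiteElo "2800"]
TAG = re.compile(r'\[(WhiteElo|BlackElo|Result) "([^"]*)"\]')

def strip_pgn(lines):
    """Collect (white_elo, black_elo, result) per game from PGN lines."""
    games, current = [], {}
    for line in lines:
        m = TAG.match(line.strip())
        if m:
            current[m.group(1)] = m.group(2)
        if len(current) == 3:          # all three tags seen for this game
            games.append((int(current["WhiteElo"]),
                          int(current["BlackElo"]),
                          current["Result"]))
            current = {}
    return games
```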

Abs(Elo-delta) vs. Number of games looks like this:
Image
User avatar
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Advantage for White; Bayeselo (to Rémi Coulom)

Post by hgm »

OK, I see what you mean. By adding more parameters, one can always get a better fit. I was just looking for the best possible model that only takes rating difference into account.

I agree that in general predictions had better not be tested on the data they are derived from, because that will make you err in the direction of thinking they are better than they really are. But for a very large data set it hardly matters. (E.g. the N/(N-1) correction you need for variances computed relative to a mean derived from the points themselves, rather than from an independently given one.)
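The N/(N-1) correction as a quick numeric illustration:

```python
def variances(xs):
    """Sample variance relative to the data's own mean.
    Dividing by N is biased low; dividing by N-1 (Bessel's correction)
    removes the bias.  For large N the two are nearly identical."""
    n = len(xs)
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    return ss / n, ss / (n - 1)
```

For N = 3 the two estimates differ by a factor of 1.5; for the hundreds of thousands of games here, the distinction is negligible.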

In that light, what do you think of the fact that the data points seem to suggest a steeper-rising curve than the green logistic on which they are supposed to be based? Is this an indication that BayesElo's default approach is not the optimal way to extract the ratings?