normal vs logistic curve for elo model

Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

normal vs logistic curve for elo model

Post by Daniel Shawul »

What is the rationale for using a logistic curve to estimate the Elo of computer players? I understand that the original Elo formula assumes a normal distribution for the true strength (skill) of a player. But this was found not to be the case for weaker human chess players, as they tend to overperform. That makes sense for humans, but for a computer it seems its true rating should be normally distributed, like most things in the world usually are. Humans learn from game to game, so a distribution skewed to the right makes some sense, but a computer player is static.

Code: Select all

Humans:     Assuming extreme value distribution for true strength of player R    => Logistic model for Ra - Rb (i.e. winning probability)
Computers:  Assuming normal distribution for R                                   => Gaussian model for Ra - Rb
Also, in a recent discussion here in CCC, for those who remember, a Gaussian fit the observed distribution better even though the Elo model used was logistic. Maybe it is wrong to use a logistic model for computer ratings in the first place? I know other reasons have been attributed to that observation, so I am just throwing this out just in case (I probably missed something anyway).

A multi-player rating system known as TrueSkill uses similar techniques to BayesElo, the only difference being that it is multi-dimensional and it uses a Gaussian for a player's skill distribution. Maybe it is even possible to modify the Bradley-Terry model of BayesElo to produce the same ratings as TrueSkill, which uses a normal distribution. Has anyone looked into the similarities/differences of these methods?
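To make the two models above concrete, here is a minimal Python sketch (my own illustration; the scale constants are just the conventional ones: the 400-point logistic scale and Elo's original per-player sd of 200, so the difference has sd 200*sqrt(2)):

Code: Select all

# Expected score of A vs B as a function of the Elo difference d = Ra - Rb,
# under the logistic model (usual 400-point formula) and the Gaussian model.
from math import erf, sqrt

def expected_score_logistic(d):
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

def expected_score_gaussian(d, sigma_diff=200.0 * sqrt(2.0)):
    # Phi(d / sigma_diff), written with erf: Phi(x) = 0.5*(1 + erf(x/sqrt(2)))
    return 0.5 * (1.0 + erf(d / (sigma_diff * sqrt(2.0))))

for d in (0, 100, 200, 400):
    print(d, round(expected_score_logistic(d), 4),
             round(expected_score_gaussian(d), 4))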
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: normal vs logistic curve for elo model

Post by Rein Halbersma »

One reason for preferring the logistic over the normal distribution has to do with the relative difficulty of integrating them. Match predictions are given by the cumulative distribution conditional on the ELO difference:

Code: Select all

P(result|ELO) 
For the normal distribution, no closed-form expression in elementary functions is available, so this requires numerical integration or extensive tables. For the logistic distribution, an analytical expression can be obtained (see Wikipedia).

The converse, ELO inference given the match results, requires Bayesian statistics

Code: Select all

 P(ELO|result) = P(result|ELO) x P(ELO) / P(result)
where P(ELO) is the so-called conjugate distribution of playing strength across the population. Here, the normal distribution is self-conjugate: if playing strength is normally distributed, then match results are also normally distributed (with sqrt(2) times the standard deviation of playing strengths). For the logistic distribution, the conjugate prior is a complicated Dirichlet distribution. So which distribution is most convenient depends on your application.
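As a toy illustration of this Bayesian direction (a sketch on a discretized grid, not how any of the rating tools actually implements it; the prior sd and the match result are made-up numbers):

Code: Select all

# Toy posterior P(ELO | results) on a discretized grid of Elo differences.
# Assumptions: normal prior with sd 200, logistic likelihood per game.
import numpy as np

grid = np.linspace(-800.0, 800.0, 1601)       # candidate Elo differences
prior = np.exp(-0.5 * (grid / 200.0) ** 2)    # unnormalized normal prior

def p_win(d):
    return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

wins, losses = 7, 3                           # hypothetical match result
likelihood = p_win(grid) ** wins * (1.0 - p_win(grid)) ** losses

posterior = prior * likelihood
posterior /= posterior.sum()                  # the division by P(result)

print("posterior mean Elo difference:", (grid * posterior).sum())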

In practice, it doesn't matter much. You can rescale the logistic distribution in two ways to make it more similar to the normal distribution. First, you can rescale the logistic's scale parameter by Sqrt(3)/Pi (=.55) so that it has the same standard deviation as the normal distribution. You can also rescale the logistic's scale parameter by 1/2 Sqrt(Pi/2) (=.63) so that it has the same slope as the normal distribution at delta-ELO=0. In both cases, the relative differences in the cumulative distribution are always between -1% and +3% (percent, not percentage points!).
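A quick tabulation of the three curves (my own check; the sample points are arbitrary):

Code: Select all

# Compare the standard normal CDF with two rescaled logistic CDFs:
# one matched in standard deviation, one matched in slope at 0.
from math import erf, exp, sqrt, pi

def normal_cdf(x):
    return 0.5 * (1.0 + erf(x / sqrt(2.0)))

def logistic_cdf(x, s):
    return 1.0 / (1.0 + exp(-x / s))

s_sd    = sqrt(3.0) / pi        # ~0.551: matches the normal's standard deviation
s_slope = 0.5 * sqrt(pi / 2.0)  # ~0.627: matches the normal's slope at 0

for x in (0.25, 0.5, 1.0, 1.5, 2.0):
    print(f"x={x:4.2f}  normal={normal_cdf(x):.4f}  "
          f"sd-matched={logistic_cdf(x, s_sd):.4f}  "
          f"slope-matched={logistic_cdf(x, s_slope):.4f}")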
hgm
Posts: 27788
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: normal vs logistic curve for elo model

Post by hgm »

This can only be decided by empirical measurement. In particular, if you have player A scoring 65% against B, and B scoring 65% against C, how much will A score against C (averaged over the entire population of players, of course)? This could basically be anything from 65% to 100%.

Not everything in real life is Gaussian; in fact hardly anything ever is. (As stock market theorists have found out to their financial ruin...) If a chess game could be modelled as a sequence of independent decisions, each decision giving up a small amount of winning probability compared to the game-theoretically best move, then the performance of a player in a game would be normally distributed around an average, with a standard deviation that could be derived from the underlying distribution of its move-to-move imperfection, provided the standard deviation of this imperfection is finite, a necessary condition for the central limit theorem to hold.

But in real life the move-to-move imperfection might not have a finite standard deviation at all. There could be arbitrarily large blunders, and the probability for them to occur might tail off slowly, so that the performance in an individual game is not the accumulated value of a large number of independent small imperfections, but is dominated by the biggest blunder.

Compare it to sampling from a variable with a Cauchy distribution (density proportional to 1/(1+x^2)). It turns out that taking the average of a million samples gives a result that is exactly as poor as just taking the first sample. The more samples you take, the larger the probability that there is one lying far out in the tails that completely spoils the average. So the average never approaches a Gaussian distribution; in fact it always keeps the original Cauchy distribution, without even narrowing with a 1/sqrt(N) law.
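A quick simulation sketch (illustrative only):

Code: Select all

# The sample mean of Cauchy draws does not settle down: its spread
# stays the same no matter how many samples you average.
import numpy as np

rng = np.random.default_rng(1)
for n in (1, 100, 10_000, 100_000):
    # 100 independent experiments, each averaging n Cauchy samples
    means = rng.standard_cauchy((100, n)).mean(axis=1)
    q1, q3 = np.percentile(means, [25, 75])
    # for a finite-variance variable this IQR would shrink like 1/sqrt(n);
    # for Cauchy it stays at the IQR of a single sample (about 2)
    print(f"n={n:>7}: IQR of the sample means = {q3 - q1:.2f}")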

Averaging as a method of improving accuracy of a measurement is overrated...
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: normal vs logistic curve for elo model

Post by Daniel Shawul »

hgm wrote:This can only be decided by empirical measurement. In particular, if you have player A scoring 65% against B, and B scoring 65% against C, how much will A score against C (averaged over the entire population of players, of course)? This could basically be anything from 65% to 100%.

Not everything in real life is Gaussian; in fact hardly anything ever is. (As stock market theorists have found out to their financial ruin...) If a chess game could be modelled as a sequence of independent decisions, each decision giving up a small amount of winning probability compared to the game-theoretically best move, then the performance of a player in a game would be normally distributed around an average, with a standard deviation that could be derived from the underlying distribution of its move-to-move imperfection, provided the standard deviation of this imperfection is finite, a necessary condition for the central limit theorem to hold.

But in real life the move-to-move imperfection might not have a finite standard deviation at all. There could be arbitrarily large blunders, and the probability for them to occur might tail off slowly, so that the performance in an individual game is not the accumulated value of a large number of independent small imperfections, but is dominated by the biggest blunder.

Compare it to sampling from a variable with a Cauchy distribution (density proportional to 1/(1+x^2)). It turns out that taking the average of a million samples gives a result that is exactly as poor as just taking the first sample. The more samples you take, the larger the probability that there is one lying far out in the tails that completely spoils the average. So the average never approaches a Gaussian distribution; in fact it always keeps the original Cauchy distribution, without even narrowing with a 1/sqrt(N) law.

Averaging as a method of improving accuracy of a measurement is overrated...
That is an interesting point of view. So we can look at performance from two perspectives. If you suppose that a game is won or lost by a single brilliant move or blunder, then an extreme value distribution is appropriate. However, if you assume that performance arises from the sum of small advantages gained on each move, a normal distribution is the most likely.

Without going into these details, I would prefer to make the general statement that a computer is equally likely to underperform or overperform in a game. Assuming the contrary of this intuitive conclusion requires evidence as to why it is so. When human rating lists changed from the original assumption that strength is normally distributed to a skewed distribution, there was evidence, especially with weak players, that they tend to overperform. I don't see a reason to make the same change for computer rating lists, at least not based on the same evidence as for humans.

The strength distribution of computers doesn't necessarily have to be Gaussian, but it should be symmetric IMO, which is not the case if we assume a logistic model for Ra - Rb (since that corresponds to a skewed extreme-value distribution for each player's strength). There are mathematical difficulties with using a Gaussian in Bayesian inference, I know, but recent posts showed that computer ratings calculated with BayesElo, EloStat, Ordo etc. don't seem to match the model, and in one particular case seemed to fit a normal better, which raises this doubt for me. All of them use logistic or logistic-like Elo models.
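To see the connection between the per-player distribution and the distribution of Ra - Rb, here is a small simulation sketch (my own illustration): the difference of two Gumbel (extreme value) performances comes out exactly logistic, and symmetric, even though each player's own curve is skewed.

Code: Select all

# Difference of two independent Gumbel (extreme value) performances
# follows the standard logistic distribution.
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
diff = rng.gumbel(size=n) - rng.gumbel(size=n)   # Ra - Rb style difference

# Compare the empirical CDF of the difference with the logistic CDF
for x in (-2.0, -1.0, 0.0, 1.0, 2.0):
    empirical = (diff < x).mean()
    logistic = 1.0 / (1.0 + np.exp(-x))
    print(f"x={x:+.1f}: empirical={empirical:.4f}  logistic CDF={logistic:.4f}")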
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: normal vs logistic curve for elo model

Post by Daniel Shawul »

Rein Halbersma wrote:One reason for preferring the logistic over the normal distribution has to do with the relative difficulty of integrating them. Match predictions are given by the cumulative distribution conditional on the ELO difference:

Code: Select all

P(result|ELO) 
For the normal distribution, no closed-form expression in elementary functions is available, so this requires numerical integration or extensive tables. For the logistic distribution, an analytical expression can be obtained (see Wikipedia).

The converse, ELO inference given the match results, requires Bayesian statistics

Code: Select all

 P(ELO|result) = P(result|ELO) x P(ELO) / P(result)
where P(ELO) is the so-called conjugate distribution of playing strength across the population. Here, the normal distribution is self-conjugate: if playing strength is normally distributed, then match results are also normally distributed (with sqrt(2) times the standard deviation of playing strengths). For the logistic distribution, the conjugate prior is a complicated Dirichlet distribution. So which distribution is most convenient depends on your application.
You are right that it is complicated to calculate the posterior for a Gaussian. Most of the time analytical methods are impossible for maximum-likelihood estimates, so numerical methods that update mu and sigma (or their equivalents, as in TrueSkill) are used. I know the differences between a logistic and a normal distribution of the Elo difference are very minimal, but there is no compelling evidence as to why we should use the logistic for computer games. Ease of application of the logistic in Bayesian inference is a good advantage, and in some papers the unavailability of erf in C and the fact that weak players overperform are mentioned as the reasons why chess people moved away from Elo's original normal distribution. FIDE still uses the normal, though, I think.
Rein Halbersma wrote:In practice, it doesn't matter much. You can rescale the logistic distribution in two ways to make it more similar to the normal distribution. First, you can rescale the logistic's scale parameter by Sqrt(3)/Pi (=.55) so that it has the same standard deviation as the normal distribution. You can also rescale the logistic's scale parameter by 1/2 Sqrt(Pi/2) (=.63) so that it has the same slope as the normal distribution at delta-ELO=0. In both cases, the relative differences in the cumulative distribution are always between -1% and +3% (percent, not percentage points!).
Yes, I have read that the use of normal or logistic makes almost no difference in Bayesian inference. It is not very clear to me how variances are handled for humans vs computers. Usually the simplified methods (K-factor) for updating human Elos assume a certain standard deviation, the same for all players, so the rating difference between two players has a standard deviation of sqrt(2) * sigma. However, in Bayesian inference both mu and sigma are separate parameters for each player involved.
Rein Halbersma
Posts: 741
Joined: Tue May 22, 2007 11:13 am

Re: normal vs logistic curve for elo model

Post by Rein Halbersma »

Hi Daniel,

If you want to know more about optimal K-factors and updating, you might want to read about Kalman filtering, see e.g. chapter 1 of this PhD thesis http://www.stat.sdu.dk/publications/mon ... Thesis.pdf

Rein
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: normal vs logistic curve for elo model

Post by Daniel Shawul »

Rein Halbersma wrote:Hi Daniel,

If you want to know more about optimal K-factors and updating, you might want to read about Kalman filtering, see e.g. chapter 1 of this PhD thesis http://www.stat.sdu.dk/publications/mon ... Thesis.pdf

Rein
Thanks, I will read it later, even though I am afraid it may be beyond my scope. I found a short description of the K-factor in the TrueSkill paper. It does not have a strong statistical backing, but it is fairly sound. After one game the skills of both players (s1, s2) are updated such that the observed outcome is the most likely, and s1 + s2 is constant. This is in line with a maximum-likelihood estimate, but the mean is the only parameter considered in the model. For an outcome y in {-1,0,1},

Code: Select all

delta = alpha * sigma * sqrt(pi) * ( (y + 1) / 2 - phi((s1 - s2) / (sqrt(2) * sigma)) )
        |______K-factor_______|     |__Perf_|     |______Expected performance________|
s1 <- s1 + delta
s2 <- s2 - delta
The factors on the left combine to give the K-factor, where alpha is adjustable in (0,1). The larger the K-factor, the more weight is given to the new result. Most of the time the 'momentum' from the old results is above 90%. This causes problems for new players with no prior rating, or for someone competing after a long time off. I suppose the paper you linked to finds optimal values of K for different 'classes' of players, because K should be dynamic in principle. sigma is assumed to be the same constant for everyone, so I don't think it will ever be possible to cover all the deficiencies.
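Transcribed into Python, the update looks like this (my sketch; the alpha and sigma values are only illustrative):

Code: Select all

# Elo-style update: move both ratings by delta = K * (score - expected),
# keeping s1 + s2 constant.
from math import erf, sqrt, pi

def expected_score(s1, s2, sigma):
    # phi((s1 - s2) / (sqrt(2) * sigma)), written with erf:
    # phi(x) = 0.5 * (1 + erf(x / sqrt(2)))
    return 0.5 * (1.0 + erf((s1 - s2) / (2.0 * sigma)))

def elo_update(s1, s2, y, alpha=0.1, sigma=200.0):
    # y = +1 if player 1 wins, 0 for a draw, -1 if player 1 loses
    k = alpha * sigma * sqrt(pi)                    # the K-factor
    delta = k * ((y + 1) / 2.0 - expected_score(s1, s2, sigma))
    return s1 + delta, s2 - delta

print(elo_update(1600.0, 1500.0, -1))   # upset: the lower-rated player wins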

On an unrelated note, it seems that Remi is working on a comparison of the methods in BayesElo, TrueSkill and EloStat. It is just a pre-print that outlines the major differences between them.
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: normal vs logistic curve for elo model

Post by Rémi Coulom »

Daniel Shawul wrote:On an unrelated note, it seems that Remi is working on a comparison of the methods in BayesElo, TrueSkill and EloStat. It is just a pre-print that outlines the major differences between them.
I am not really working on it. I am waiting for someone to finish the paper for me. Maybe I'll find a student some day. I'd be curious to compare the various models on chess data. But I am way too busy with much more exciting projects for the months and probably years to come.

Rémi
kbhearn
Posts: 411
Joined: Thu Dec 30, 2010 4:48 am

Re: normal vs logistic curve for elo model

Post by kbhearn »

I think it's a bit presumptuous to assume there won't be an overperforming bias for the weaker computer, particularly at higher ratings, simply because of draws.

The larger the rating gap grows, and the stronger the players are overall, the harder it becomes for the stronger player to win (rather than draw) enough games to show the expected gap, because in many games its weaker opponent will simply not give up sufficient chances to win.

If you change the scoring system such that a draw is not a relevant result (a draw just means rematch until a win occurs), then it seems more likely that you'll be able to devise a scale such that the score in a direct match between A and C can be reasonably predicted from their performance against other opponents.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: normal vs logistic curve for elo model

Post by Daniel Shawul »

kbhearn wrote:I think it's a bit presumptuous to assume there won't be an overperforming bias for the weaker computer, particularly at higher ratings, simply because of draws.

The larger the rating gap grows, and the stronger the players are overall, the harder it becomes for the stronger player to win (rather than draw) enough games to show the expected gap, because in many games its weaker opponent will simply not give up sufficient chances to win.
I don't think this is an indication of overperformance, rather that draws are simply more frequent at higher Elos. At lower Elos you would expect more variance even if both opponents are equally matched, because both players are at the exponential stage of learning. Strong opponents are at a saturated stage where it is much more difficult to overpower the opponent, just as in checkers. So I would say the estimated true strengths of both players will be very close, with more draws, while the tendency to overperform or underperform is still equal. So the next time they play, the weaker player has the same chance of drawing as before.

But weak human players will have significantly larger chances of performing better from game to game. The reason for the skewed distribution for weak human players is something inherent, i.e. they tend to improve from game to game. This is a problem for any Elo model, because now the true strength varies from time to time.

A draw between two players occurs when the performance difference (Ra - Rb) falls below a threshold value, say theta0. If the true strengths are close and the variances are low, then we will expect more draws. This holds regardless of the kind of strength distribution each player may have. OTOH, computers have a static true strength and their performance fluctuates around that constant value.
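As a sketch of such a draw-threshold model (theta0 and the logistic scale here are assumptions for illustration only):

Code: Select all

# Win/draw/loss probabilities when a draw is declared whenever the
# performance difference falls inside (-theta0, +theta0).
# The performance difference is assumed logistic around the true Elo gap.
import math

def logistic_cdf(x, scale=400.0 / math.log(10.0)):
    # scale chosen so that with theta0 = 0 this reduces to the usual
    # 400-point expected-score formula
    return 1.0 / (1.0 + math.exp(-x / scale))

def outcome_probs(d, theta0=100.0):
    # d = Ra - Rb in Elo; returns (P(win), P(draw), P(loss)) for A
    p_loss = logistic_cdf(-theta0 - d)        # perf diff below -theta0
    p_win = 1.0 - logistic_cdf(theta0 - d)    # perf diff above +theta0
    return p_win, 1.0 - p_win - p_loss, p_loss

print(outcome_probs(0))     # equal players: symmetric win/loss, rest draws
print(outcome_probs(200))   # stronger A: more wins, still some draws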
kbhearn wrote:If you change the scoring system such that a draw is not a relevant result (a draw just means rematch until a win occurs), then it seems more likely that you'll be able to devise a scale such that the score in a direct match between A and C can be reasonably predicted from their performance against other opponents.
Usually a draw can be replaced by a win plus a loss, as in the model used by BayesElo. The pairwise comparison methods I know of usually consider only 'A beats B', with no provision for the occurrence of a draw, so some sort of kludge is needed to cure that.