The deviation between the empirical points and the curves indicates a defect in the model. What is especially strange is that for deltaElo > 0 the blue points seem to follow the yellow line better (the Gaussian model), while for deltaElo < 0, the green line seems a better fit. So it seems the real curve has some asymmetry.

The two curves are supposed to have the same slope at the point (0, 0.5). The Logistic has longer tails than the Gaussian. Apparently the empirical distribution does not have such tails. Whether the fit then follows the tails or matches the slope in the center depends on where most of the data is located. The noise in the blue points suggests that data was most abundant near the center, so it should be no surprise that the fit matches the slope there, at the expense of errors in the tails.
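To make the tail difference concrete, here is a minimal sketch that compares a logistic curve with a Gaussian (normal CDF) curve forced to have the same slope at deltaElo = 0. It assumes the usual Elo convention of 400 points per factor 10 in odds, which the post does not state; the function names and `SIGMA` are made up for illustration and are not taken from any rating program.

```python
import math

ELO_SCALE = 400.0  # assumption: 400 Elo per factor 10 in the odds

def logistic_score(delta_elo):
    """Expected score under the logistic model."""
    return 1.0 / (1.0 + 10.0 ** (-delta_elo / ELO_SCALE))

# Slope of the logistic curve at delta_elo = 0.
LOGISTIC_SLOPE_AT_ZERO = math.log(10.0) / (4.0 * ELO_SCALE)

# A normal CDF has slope 1 / (sigma * sqrt(2*pi)) at its center, so this
# sigma gives the Gaussian curve the same central slope as the logistic.
SIGMA = 1.0 / (LOGISTIC_SLOPE_AT_ZERO * math.sqrt(2.0 * math.pi))

def gaussian_score(delta_elo):
    """Expected score under the Gaussian model with matched central slope."""
    return 0.5 * (1.0 + math.erf(delta_elo / (SIGMA * math.sqrt(2.0))))

if __name__ == "__main__":
    print("deltaElo  logistic  gaussian")
    for d in (-800, -400, -200, 0, 200, 400, 800):
        print(f"{d:8d}  {logistic_score(d):8.4f}  {gaussian_score(d):8.4f}")
```

Near the center the two columns agree closely, while at large |deltaElo| the logistic stays further from 0 and 1, which is the "longer tails" mentioned above.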

It cannot be excluded that the default prior causes compression of the scale. I have seen that problem with BayesElo in the ChessWar promo division, where there is a wild range of playing strengths (some 2000 Elo), and the top and bottom play comparatively few games (because it is a Swiss tourney). Then the prior assumption that the win probabilities can be anything is just no good. Determining the correct prior for a given topology of the pairing network is a very tough problem, though.

Note that games between engines differing by ~400 Elo actually give the most information on the overall scaling, while games with Elo differences close to 0 contribute very little. Testers, however, usually focus on the latter. So it is comparatively easy for any fitting procedure to hide deficiencies of the model by an overall scaling of the ratings, as that scaling is only weakly constrained by the data.
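One way to see why games near deltaElo = 0 say so little about the scaling is to compute the Fisher information that a single game carries about the scale parameter. The sketch below is a simplification under stated assumptions: a logistic model without draws, a rating difference treated as known, and the scale as the only unknown. It is not a derivation from the post, and all names in it are hypothetical.

```python
import math

# Assumption: logistic model with 400 Elo per factor 10 in the odds,
# expressed as a natural-log scale parameter.
S = 400.0 / math.log(10.0)

def win_probability(delta_elo, scale=S):
    """Win probability for a rating difference, ignoring draws."""
    return 1.0 / (1.0 + math.exp(-delta_elo / scale))

def scale_information(delta_elo, scale=S):
    """Fisher information about `scale` carried by one decisive game."""
    p = win_probability(delta_elo, scale)
    # d p / d scale for p = 1 / (1 + exp(-delta_elo / scale))
    dp_dscale = -p * (1.0 - p) * delta_elo / scale ** 2
    # Bernoulli outcome: information = (dp/dtheta)^2 / (p * (1 - p))
    return dp_dscale ** 2 / (p * (1.0 - p))

def most_informative_difference(max_delta=1500):
    """Integer rating difference (0..max_delta) with the largest information."""
    return max(range(0, max_delta + 1), key=scale_information)

if __name__ == "__main__":
    for d in (0, 20, 100, 400, 800):
        print(d, scale_information(d))
    print("maximum near deltaElo =", most_informative_difference())
```

Under these assumptions the information is exactly zero at deltaElo = 0 and peaks a little above 400 Elo, in line with the remark above.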

## EloStat, Bayeselo and Ordo

### Re: EloStat, Bayeselo and Ordo

By the way, playing a bit with expectation values of P for a small number of games, I came to the following closed-form expressions (I didn't know they are so general):

binomial, W wins, L losses, W+L=N (total).

For W, the expectation value is (W+1)/(N+2). Similarly for L.

trinomial, W wins, D draws, L losses, W+D+L=N (total)

For W, the expectation value is (W+1)/(N+3). Similarly for D, L.

For a multinomial of degree M it's (W+1)/(N+M), which would be useful generally. That is all, if I didn't mess anything up.

In chess, for a small number of games, it's a bit surprising:

3 wins, 1 draw, 0 losses mean:

4/7 win probability, 2/7 draw probability, 1/7 loss probability.
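The expressions above can be checked with a few lines of exact arithmetic. This is a minimal sketch, assuming the estimates are meant as posterior expectation values under a uniform prior over the outcome probabilities (the setting in which (W+1)/(N+M) arises); the function name is made up for illustration.

```python
from fractions import Fraction

def expected_probabilities(counts):
    """Expectation value (count + 1) / (N + M) for each of M outcomes.

    `counts` holds the observed count per outcome, e.g. [wins, draws, losses].
    N is the total number of games, M the number of possible outcomes.
    """
    n = sum(counts)
    m = len(counts)
    return [Fraction(c + 1, n + m) for c in counts]

if __name__ == "__main__":
    # 3 wins, 1 draw, 0 losses
    win, draw, loss = expected_probabilities([3, 1, 0])
    print(win, draw, loss)  # 4/7 2/7 1/7
```

Note that the estimates always sum to 1 and that an outcome never observed still gets a nonzero probability, which is why 0 losses in 4 games gives 1/7 rather than 0.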

Kai

- hgm (H G Muller)

### Re: EloStat, Bayeselo and Ordo

Indeed, those are the correct expressions for the best estimates.

What makes the Elo determination problem so tough is that the winning probabilities are in general not independent. (Unless the pairing network is a linear chain.) With N players there are N*(N-1)/2 pairings, each with its own win, draw and loss probabilities. (But of course W+D+L=1 for each pairing.) And they follow from only N parameters (the ratings), of which one can actually be eliminated because only rating differences enter the probability equation. (So an extra condition is needed to uniquely specify the ratings, e.g. that their average is zero.)

The rating model thus implies a dependency between the probabilities of the individual pairings. Even if you forget about draws, the allowed combinations of ratings form an (N-1)-dimensional manifold in the N*(N-1)/2-dimensional space of the probabilities. As this manifold is curved rather than a (hyper-)plane for most rating models, it is quite difficult to decide what you would consider an 'unbiased' prior (i.e. every point of the manifold assumed equally likely). I did not even manage to solve this for a linearized rating model.
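The dependency between pairings can be illustrated with a small sketch. It assumes a logistic model without draws and the 400-point convention, neither of which is specified in the post, and the helper names are hypothetical. It shows that N ratings generate N*(N-1)/2 pairwise probabilities, that a common shift of all ratings changes nothing, and that for three players the A-C probability is already fixed by A-B and B-C.

```python
import math
from itertools import combinations

def win_probability(rating_a, rating_b):
    """Logistic win probability of A against B (assumption: no draws)."""
    return 1.0 / (1.0 + 10.0 ** (-(rating_a - rating_b) / 400.0))

def pairwise_probabilities(ratings):
    """Map each pairing (i, j) with i < j to the win probability of i over j."""
    return {
        (i, j): win_probability(ratings[i], ratings[j])
        for i, j in combinations(range(len(ratings)), 2)
    }

def odds(p):
    """Convert a probability into odds p : (1 - p)."""
    return p / (1.0 - p)

if __name__ == "__main__":
    ratings = [0.0, 100.0, 250.0, 400.0]
    probs = pairwise_probabilities(ratings)
    print(len(probs), "pairings from", len(ratings), "ratings")

    # Only rating differences matter: shifting all ratings changes nothing.
    shifted = pairwise_probabilities([r + 1000.0 for r in ratings])
    print(all(math.isclose(probs[k], shifted[k]) for k in probs))

    # Dependency: in this model the odds multiply along a chain.
    print(odds(probs[(0, 2)]), odds(probs[(0, 1)]) * odds(probs[(1, 2)]))
```

The last line prints two equal numbers: once A-B and B-C are known, A-C is no longer free, so a prior that treats every pairing's probability as independent is inconsistent with the model.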
