Re: EloStat, Bayeselo and Ordo
Posted: Wed Jun 27, 2012 1:26 pm
The deviation between the empirical points and the curves indicates a defect in the model. What is especially strange is that for deltaElo > 0 the blue points seem to follow the yellow line better (the Gaussian model), while for deltaElo < 0, the green line seems a better fit. So it seems the real curve has some asymmetry.
The two curves are supposed to have the same slope at the point (0, 0.5). The Logistic has longer tails than the Gaussian. Apparently the empirical distribution does not have such tails. Whether that would produce a fit that follows the tails or has the same slope in the center will depend on where most of the data is located. The noise in the blue points suggests that data was most abundant near the center, so it should be no surprise that the fit tries to match the slope there, at the expense of errors in the tails.
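To make the slope and tail argument concrete, here is a minimal sketch of the two curves. It assumes the usual logistic Elo formula with a 400-point scale and picks the Gaussian width so that both curves have the same slope at deltaElo = 0. The names logistic_score, gaussian_score and SIGMA are made up for this illustration and are not taken from EloStat, BayesElo or Ordo.

```python
import math

def logistic_score(delta_elo):
    """Expected score under the logistic Elo model (400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** (-delta_elo / 400.0))

# Logistic slope at 0 is ln(10)/1600; Gaussian slope at 0 is 1/(sigma*sqrt(2*pi)).
# Equating the two gives the width below (about 277 Elo).
SIGMA = 1600.0 / (math.log(10.0) * math.sqrt(2.0 * math.pi))

def gaussian_score(delta_elo, sigma=SIGMA):
    """Expected score under a Gaussian (normal CDF) model of width sigma."""
    return 0.5 * (1.0 + math.erf(delta_elo / (sigma * math.sqrt(2.0))))

def slope_at_zero(curve, h=1e-3):
    """Central-difference estimate of the slope of a curve at deltaElo = 0."""
    return (curve(h) - curve(-h)) / (2.0 * h)

if __name__ == "__main__":
    print("slopes at 0:", slope_at_zero(logistic_score), slope_at_zero(gaussian_score))
    for d in (-800, -400, 0, 400, 800):
        print(d, round(logistic_score(d), 4), round(gaussian_score(d), 4))
```

With the slopes matched like this, the logistic curve gives the weaker side a visibly larger expected score at large |deltaElo| than the Gaussian does, which is the "longer tails" referred to above.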
It cannot be excluded that the default prior causes compression of the scale. I have seen that problem with BayesElo in the ChessWar promo division, where there is a wild range of playing strengths (some 2000 Elo), and the top and bottom play comparatively few games (because it is a Swiss tourney). Then the prior assumption that the win probabilities can be anything is just no good. Determining the correct prior for a given topology of the pairing network is a very tough problem, though.
Note that games between engines differing by ~400 Elo actually give the most information on the overall scaling, while games with Elo differences close to 0 contribute only very little. Testers, however, usually focus on the latter. So it is comparatively easy for any fitting procedure to hide deficiencies of the model by an overall scaling of the ratings, as the latter is only weakly restricted by the data.
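One way to see why games near deltaElo = 0 say so little about the scale is a rough Fisher-information estimate. The sketch below is only an illustration under strong assumptions: a logistic model without draws, and a hypothetical scale factor s that multiplies all rating differences (s = 1 being the nominal scale). The function name scale_information is made up for this example.

```python
import math

def scale_information(delta_elo, s=1.0):
    """Fisher information per game about a scale factor s.

    Assumed model (no draws): p = 1 / (1 + 10 ** (-s * delta_elo / 400)).
    Since dp/ds = p * (1 - p) * ln(10) * delta_elo / 400, the information
    (dp/ds)**2 / (p * (1 - p)) simplifies to the expression returned here.
    """
    p = 1.0 / (1.0 + 10.0 ** (-s * delta_elo / 400.0))
    return p * (1.0 - p) * (math.log(10.0) * delta_elo / 400.0) ** 2

if __name__ == "__main__":
    # Scan rating differences on a 1-Elo grid and report where the
    # per-game information about the scale is largest.
    best = max(range(0, 1001), key=scale_information)
    print("most informative difference:", best)
    for d in (0, 50, 100, 200, 400, 800):
        print(d, scale_information(d))
```

Under these assumptions the information is exactly zero at deltaElo = 0, grows roughly quadratically for small differences, and peaks a little above 400 Elo before the lopsided results start to carry less information again.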