EloStat, Bayeselo and Ordo

hgm · Post by **hgm** » Mon Jun 25, 2012 10:35 pm

michiguel wrote:So, the failure is caused by the assumptions (model), not the algorithmic approach to fit the model. Correct?

I don't know what failure you are talking about.

Optimal extraction of the model parameters from the data should always give an effect that skews the parameters towards equality more if you have less data, because on average the probability of 'heads' for a coin that showed H heads and T tails will be (H+1)/(H+T+2), if you don't have any prior bias for what the probabilities can be.

So the estimated win probabilities themselves are dependent on this. That means any modeling of the win probability (e.g. in terms of rating differences) necessarily must also depend on this.
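
A minimal sketch of the coin argument, assuming a uniform prior on the heads probability (the rule of succession, (H+1)/(H+T+2)). The function names are illustrative only and do not come from any of the rating tools discussed here.

```python
from fractions import Fraction


def raw_estimate(heads: int, tails: int) -> Fraction:
    """Plain observed frequency of heads (no prior)."""
    return Fraction(heads, heads + tails)


def laplace_estimate(heads: int, tails: int) -> Fraction:
    """Posterior mean of P(heads) under a uniform prior (assumption)."""
    return Fraction(heads + 1, heads + tails + 2)


# The same 100% observed frequency is pulled towards 1/2 more strongly
# when it rests on fewer tosses.
for heads, tails in [(2, 0), (20, 0), (200, 0)]:
    print(heads, tails,
          float(raw_estimate(heads, tails)),
          float(laplace_estimate(heads, tails)))
```

With only 2 tosses the estimate is 3/4 rather than 1; the pull towards 1/2 fades as the number of tosses grows.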

Rémi Coulom · Post by **Rémi Coulom** » Mon Jun 25, 2012 10:58 pm

hgm wrote:Now the assumption that this power is 1, as the Logistic has, was actually supported by real data, where win probability times loss probability gave a curve that fitted the draw probability very well.

I am not aware of any serious study that compares the quality of different models. Well, I know some, but they are inconclusive (not enough data to make a statistically significant comparison of different models).

See for instance that paper:
http://www.dtic.mil/dtic/tr/fulltext/u2/a236856.pdf

I started to take notes for some research I wanted to do. If I ever find a student willing to do it, I will finish that paper:
http://www.grappa.univ-lille3.fr/~coulo ... tcomes.pdf

I am not very excited by this research project, because I expect all the models make predictions of similar quality, like the paper of Hal Stern showed. But he was using only a few human games. With the huge databases of computer games we have now, I expect we should be able to update the results of that old paper with a more conclusive comparison.

Rémi

hgm · Post by **hgm** » Mon Jun 25, 2012 11:09 pm

Rémi Coulom wrote:With the huge databases of computer games we have now, I expect we should be able to update the results of that old paper with a more conclusive comparison.

Well, this is basically what Adam did, in the link he quoted above.

michiguel · Post by **michiguel** » Mon Jun 25, 2012 11:12 pm

hgm wrote:
michiguel wrote:So, the failure is caused by the assumptions (model), not the algorithmic approach to fit the model. Correct?
I don't know what failure you are talking about.

Optimal extraction of the model parameters from the data should always give an effect that skews the parameters towards equality more if you have less data, because on average the probability of 'heads' for a coin that showed H heads and T tails will be (H+1)/(H+T+2), if you don't have any prior bias for what the probabilities can be.

So the estimated win probabilities themselves are dependent on this. That means any modeling of the win probability (e.g. in terms of rating differences) necessarily must also depend on this.

It is what I asked earlier in the thread:
"How is it explained that you copy the same results 4 times and obtain different results? That is not related to draws or anything like that."

Kai reported that, and it looks like a drawback. I was curious why.

Miguel

Rémi Coulom · Post by **Rémi Coulom** » Mon Jun 25, 2012 11:16 pm

hgm wrote:
Rémi Coulom wrote:With the huge databases of computer games we have now, I expect we should be able to update the results of that old paper with a more conclusive comparison.
Well, this is basically what Adam did, in the link he quoted above.

Can you give the link, please? I cannot find it.

Rémi

hgm · Post by **hgm** » Mon Jun 25, 2012 11:43 pm

Sorry, I am mixing things up. It was Edmund who gave the link above, and also compiled the data. In particular the graph in

http://talkchess.com/forum/viewtopic.ph ... 76&t=42729

Adam Hair · Post by **Adam Hair** » Tue Jun 26, 2012 12:01 am

hgm wrote:
Adam Hair wrote:I have to say I am a little lost. It makes sense to me that 2 draws could be replaced with a win and a loss, but not that one draw can be replaced with a win and a loss.
Well, this is what the Logistic model predicts. Because for the Logistic, L(x)*(1-L(x)), which is the product of the probability for one win (L(x)) and one loss (1-L(x)), happens to be proportional to the probability of a single draw (which is given by the derivative of L(x)). Draws in this model are stronger evidence for equality of the players' ratings than wins and losses, because they only occur with high frequency in a narrow range of rating difference, while the occasional win against a stronger opponent still occurs for much larger rating differences.

Now for other rating models it is in general not true that F*(1-F) ~ F'. (And in such models F' might not even model the draw rate.) For instance, in the Gaussian model there is no exact equivalence between a draw and any number of wins + losses, even for fractional numbers. But in the limit of a large number of games, the likelihood for N draws approaches that of 0.9N wins plus 0.9N losses (IIRC). So one might say that each draw counts for 1.8 games there.

So it all depends on which power of F*(1-F) is proportional to F' (expanded as second-order power series).

Now the assumption that this power is 1, as the Logistic has, was actually supported by real data, where win probability times loss probability gave a curve that fitted the draw probability very well.

Thanks H.G.. I think you have given me enough of an explanation for me to work out my misunderstanding (along with some additional reading and pencil/paper work).
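
The logistic identity quoted above, L'(x) proportional to L(x)*(1-L(x)), can be checked numerically. The sketch below assumes the usual base-10 logistic with a 400-point scale, for which the proportionality constant is ln(10)/400; the helper names are made up for illustration.

```python
import math


def logistic(x: float) -> float:
    """Expected score for a rating difference x (base-10, 400-point scale)."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))


def numeric_derivative(f, x: float, h: float = 1e-3) -> float:
    """Central finite-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2.0 * h)


# Proportionality constant between L'(x) and L(x) * (1 - L(x)).
K = math.log(10.0) / 400.0

for x in (0.0, 100.0, 400.0):
    p = logistic(x)
    print(x, numeric_derivative(logistic, x), K * p * (1.0 - p))
```

The two printed columns agree to within the accuracy of the finite difference, which is the sense in which one draw carries the same weight as one win plus one loss in this model.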

Laskos · Post by **Laskos** » Tue Jun 26, 2012 4:09 am

Sven Schüle wrote:
Laskos wrote:
Sven Schüle wrote:
Laskos wrote:The correct rating here is -191, 0, 191 [...] and is derived analytically
In general there is no such thing like an (absolutely) "correct rating". The mapping between winning percentage (or probability) and rating difference depends on the underlying rating model, and there is more than one model, and not all three programs you are comparing use the same model AFAIK.

Furthermore, as HGM has shown, also the number of draws has an influence on the ratings. In addition to that, BayesElo (I don't know about Ordo here) also accounts for the colors. I don't know whether the (roughly) 55% win rate for White was considered in your example since you didn't mention it.

Sven
I mean correct according to the logistic, assuming there are no white/black colours. There is a 1-to-1 map between percentages and Elo point differences (assuming the logistic and diff/400). Can I not say that 75% in a match (and the only match out there) means 191 points? I do understand where the number of draws enters, but I do not understand it here, in a single match, or in my example of three engines. As far as statistical weight goes, when we have incongruous results we might have to take a weighted average w1*diff1 + w2*diff2 + ..., and the number of draws does enter in w1, w2, ... . Also, for incongruous results (almost all ratings with 3 or more engines), it's true that the ratings are model dependent and there is no such thing as a "correct rating". I can even argue for the weight quantity that 2 draws are always more than 1 win and 1 loss, but to say that 2 draws are always equivalent to 2 wins and 2 losses is wrong; one needs trinomials. For my example of three engines the weights (and therefore the draws) apparently do come in, but they are irrelevant because the results are exactly congruent: I have w1*diff + w2*diff + ..., which is just diff times the sum of the weights, and that sum is a constant.

Ok, I didn't follow the entire Bayeselo algorithm, but I don't think the rating for absurdly simple cases is so counterintuitive.

Kai
It is correct to say that a rating difference of 191 points implies a winning probability of 75% based on the logistic curve. The other way round, deriving the "191" from the "75%", is also "correct" in a certain sense but it has a different meaning. I'll have to go into greater detail so everyone will understand the context.
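
For reference, the two directions mentioned here (rating difference to expected score, and score back to rating difference) can be sketched as follows, assuming the plain logistic with a 400-point scale and ignoring draws and colours. The function names are illustrative.

```python
import math


def expected_score(diff: float) -> float:
    """D => P(D): expected score for a rating difference of diff Elo."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def elo_diff(score: float) -> float:
    """Inverse map: rating difference implied by a score strictly in (0, 1)."""
    if not 0.0 < score < 1.0:
        raise ValueError("score must be strictly between 0 and 1")
    return 400.0 * math.log10(score / (1.0 - score))


print(round(elo_diff(0.75), 1))        # 190.8, i.e. the "191" in the thread
print(round(expected_score(191.0), 3))
```

A 100% or 0% score has no finite rating difference under this map, which is why the inverse direction raises an error there.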

The whole "elo rating" topic has one aspect which I believe is sometimes grossly underestimated, at least in CC, and I think it is important to understand what we do if we calculate ratings from (engine) games.

Whenever a set of games is taken to obtain elo ratings just from that set, this is always an attempt to predict something unknown, i.e. the "true relative playing strength" and, derived from it, the "real winning probabilities" of the participants. Now you can choose different ways to do that prediction but it is still a prediction for which it is not always trivial to say how good it actually is. It is surely possible to test the quality of those predictions produced by the different elo rating tools but what most people actually do is only look at the predictions themselves, e.g. the resulting rating lists, and take them for granted (or don't). At least one of these tools (BayesElo, I'm not sure about the others) also adds information about the error bars to give an impression of the range of its predicted values, a detail that must be considered when judging a tool's quality since a prediction with error bars of +/- 100 appears to be less likely to fail (and thus to be considered "wrong") than one with +/- 5, for instance.

One might argue: "Hey, what a nonsense, so I can easily provide a tool with 'perfect' prediction quality by always letting it state error bars of +/- 3000 ...". That's not the point, of course, since in reality the error bars mostly depend on the number of games you feed into the tool. Of course I agree that it is difficult to compare two tools where one includes error bars and the other doesn't.

So in my opinion you may of course argue about the quality of those predictions if you show that tool A has a lower probability of wrong predictions than tool B, or else some better "quality indicator value" than tool B based on another definition you choose. But just stating that tool B would return "wrong" results without comparing prediction qualities is not the way I believe it should be.

Please note also that the elo rating system is based on winning probabilities (here simply including draws as 0.5 points, and not considering colors!) being derived from rating differences, i.e. from an existing prediction. So the direction is "D ==> P(D)" (still assuming one rating model, e.g. "logistic"!). The other way round is "only" used to calculate the elo prediction, which is done incrementally resp. periodically for human players and "once for all games" for engines. Here I think that using information about draws and colors helps to improve the prediction quality, and it is up to the tool itself how it thinks this information is used in the best way.

My conclusion is: if you want to compare elo rating tools then you need to set up a procedure to measure prediction quality. I don't think this is easy to achieve with a simulation. I would have to think about how it could be done, maybe you are faster and better with that than me.

My first guess would be that

- you need to define "secret" values for the "true playing strength" of your participants,
- then provide a sufficiently large set of games (game results) exactly reflecting the winning probabilities derived from these values (based on a chosen rating model),
- then select (randomly?) a subset of these games and let each rating tool provide its rating predictions based on only that subset (so hiding all other games to them),
- then select one (or more?) subset(s) of (different) games from the whole game set,
- and finally test the prediction quality somehow by comparing the predictions with the newly selected results.
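
The steps above could be sketched roughly as follows. This is only a toy under loud assumptions: no draws, no colours, transitive logistic "true" strengths, and `fit_ratings` is a hypothetical stand-in (plain gradient ascent on the logistic likelihood) for whichever rating tool is being tested. It is not how EloStat, BayesElo or Ordo work internally.

```python
import math
import random


def expected_score(diff):
    """Logistic expected score for a rating difference in Elo points."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def simulate_games(true_ratings, games_per_pair, rng):
    """Play every pairing; each game is (a, b, 1.0 if a won else 0.0)."""
    players = sorted(true_ratings)
    games = []
    for i, a in enumerate(players):
        for b in players[i + 1:]:
            p = expected_score(true_ratings[a] - true_ratings[b])
            for _ in range(games_per_pair):
                games.append((a, b, 1.0 if rng.random() < p else 0.0))
    return games


def fit_ratings(games, players, steps=500, lr=100.0):
    """Hypothetical stand-in for a rating tool (maximum-likelihood fit)."""
    ratings = {p: 0.0 for p in players}
    for _ in range(steps):
        grad = {p: 0.0 for p in players}
        for a, b, result in games:
            err = result - expected_score(ratings[a] - ratings[b])
            grad[a] += err
            grad[b] -= err
        for p in players:
            ratings[p] += lr * grad[p] / len(games)
    return ratings


def mean_log_loss(games, ratings):
    """Prediction quality on unseen games: lower is better."""
    total = 0.0
    for a, b, result in games:
        p = expected_score(ratings[a] - ratings[b])
        total -= math.log(p if result == 1.0 else 1.0 - p)
    return total / len(games)


rng = random.Random(2012)
true_ratings = {"A": 191.0, "B": 0.0, "C": -191.0}   # the "secret" strengths
games = simulate_games(true_ratings, 600, rng)
rng.shuffle(games)
half = len(games) // 2
train, test = games[:half], games[half:]             # the tool only sees train

fitted = fit_ratings(train, sorted(true_ratings))
print(fitted)
print(mean_log_loss(test, fitted), mean_log_loss(test, true_ratings))
```

Log loss on the held-out games is just one possible "quality indicator value"; comparing it against the loss obtained with the secret ratings shows how much is lost to estimation rather than to randomness in the games.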

Several points are open for me, though, like:
- How to deal with the problem of non-transitive ratings when providing the initial set of games and having >= 3 participants?
- How many games are needed at least in the initial set to get a "valid" measurement procedure?
- How many games resp. subsets are needed in the second step to get a "valid" measurement procedure?

I hope you can follow my thoughts, and either you or someone else can state whether doing so would make any sense.

Sven

I am not sure I fully understood you, but I agree with many things you wrote, especially with the testing methodology of the quality of prediction in the last paragraphs. First I have to clarify for me some things.
I think the errors of a prediction (and therefore its quality) depend on the model too (for example logistic), not only on the number of games. Take 3 matches with the following P: 1 vs 2: 0.75, 2 vs 3: 0.75, 3 vs 1: 0.75 (with known X vs X: 0.5 for all X). No monotonic one-parameter model (D=>P(D)) will fit these points perfectly; any such model will introduce large model errors (in addition to statistical errors) in its predictions. Now suppose the P are the following: 1 vs 2: 0.75, 2 vs 3: 0.75, 3 vs 1: 0.10. The logistic D=>P(D) fits the points perfectly (including X vs X: 0.5) and its combined (model and statistical) errors will be smaller than those of a linear model, which does not fit the points. We will call the logistic model a better predictor than the linear model for the given result simply because the combined errors of its predictions are smaller. If the P are the following: 1 vs 2: 0.75, 2 vs 3: 0.75, 3 vs 1: 0.00, then the linear model is a better predictor.
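
The two "perfect fit" cases can be reproduced with a few lines. The sketch assumes rating differences add along the chain 1 -> 2 -> 3, a 400-point logistic, and a linear model of the form P = 0.5 + k*D clipped to [0, 1] (the slope k cancels out); the function names are illustrative.

```python
import math


def logistic_p(diff):
    """Logistic D => P(D)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def logistic_d(p):
    """Rating difference implied by score p under the logistic."""
    return 400.0 * math.log10(p / (1.0 - p))


def implied_p31_logistic(p12, p23):
    """Score of player 3 vs player 1 if the logistic fits exactly."""
    d13 = logistic_d(p12) + logistic_d(p23)
    return 1.0 - logistic_p(d13)


def implied_p31_linear(p12, p23):
    """Same for a linear model P = 0.5 + k*D, clipped to [0, 1]."""
    p13 = 0.5 + (p12 - 0.5) + (p23 - 0.5)
    return 1.0 - min(1.0, max(0.0, p13))


print(implied_p31_logistic(0.75, 0.75))  # about 0.10
print(implied_p31_linear(0.75, 0.75))    # 0.0
```

Neither model can produce 3 vs 1: 0.75 from these inputs, which is the intransitive case where every monotonic one-parameter model carries a model error.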
Take something like EloStat or BayesElo. They give the rating (prediction) converted by a logistic to P with respect to a pool of engines. It is claimed that, having two ratings for two different engines, the prediction P(D) for a match between those engines is given by D (using the logistic). With a large pool of engines, the quality of the prediction is not extremely hard to check (or at least so it seems to me). You are correct that one needs subsets of the matches for the rating tool to give predictions and data, but looking at Edmund's plot, separating into subsets can be done visually.
http://talkchess.com/forum/viewtopic.ph ... t&start=10
We see in his second plot that the predictions using Bayeselo are a bit distorted (maybe some 10% compression) compared to the pure theoretical logistic (green line; Bayeselo has the same logistic with diff/400, but with additional parameters like prior and draw assumptions). I do not know if Bayeselo has a systematic error estimation of its model, probably not, and the total errors are larger than the purely statistical ones. To estimate the total error, take the vertical distance between the theoretical logistic and the dots; it's larger than the pure statistical noise (the distance between the dots and the best fit to them). One could do the same plot for Ordo, and I guess Ordo's prediction of P(D) from D will lie around the theoretical logistic model (green curve). It would mean that Ordo's errors are mostly statistical and smaller (no matter what they show as errors) than Bayeselo's. Also, it seems that Bayeselo could be corrected a bit, mapping D to P(D) by a different logistic, something like diff/360 instead of diff/400.
The correct methodology for choosing the predictor is described well by you, and you also put stress on transitivity. If in a set of games all matches perfectly obeyed the transitivity rule inside a given model D=>P(D), then we would need only to calculate the coupled errors; the ratings would come one by one. The coupled problem mostly comes from the intransitivity (incongruence) of the real results.
But I don't see a single match of +60 =30 -10 as ambiguous in any way, this gives a well defined trinomial distribution, which summed up on the surface of score=75 and games=100, gives us everything like rating difference (according to the logistic or else) and errors. There is no need to minimize errors and such things, I don't know what HGM was saying.
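
As a rough illustration of the single-match case, the sketch below takes the +60 =30 -10 result, computes the score and a standard error from the observed win/draw/loss frequencies, and maps a 95% interval through the logistic. It uses a normal approximation rather than an exact summation over the trinomial distribution, so it is an assumption-laden shortcut and not the calculation described above; the function names are illustrative.

```python
import math


def match_summary(wins, draws, losses):
    """Score and its standard error from observed W/D/L frequencies."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    return score, math.sqrt(var / n)


def elo_diff(score):
    """Rating difference implied by a score under the 400-point logistic."""
    return 400.0 * math.log10(score / (1.0 - score))


score, se = match_summary(60, 30, 10)
low, high = score - 1.96 * se, score + 1.96 * se
print(score, se)
print(elo_diff(low), elo_diff(score), elo_diff(high))

# Same 75% score without draws: the standard error is larger.
print(match_summary(75, 0, 25))
```

The comparison with +75 =0 -25 shows where the number of draws enters even in a single match: it does not change the score, only the width of the error bars.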

Kai

Laskos · Post by **Laskos** » Tue Jun 26, 2012 4:15 am

Rémi Coulom wrote:
hgm wrote:Now the assumption that this power is 1, as the Logistic has, was actually supported by real data, where win probability times loss probability gave a curve that fitted the draw probability very well.
I am not aware of any serious study that compares the quality of different models. Well, I know some, but they are inconclusive (not enough data to make a statistically significant comparison of different models).

See for instance that paper:
http://www.dtic.mil/dtic/tr/fulltext/u2/a236856.pdf

I started to take notes for some research I wanted to do. If I ever find a student willing to do it, I will finish that paper:
http://www.grappa.univ-lille3.fr/~coulo ... tcomes.pdf

I am not very excited by this research project, because I expect all the models make predictions of similar quality, like the paper of Hal Stern showed. But he was using only a few human games. With the huge databases of computer games we have now, I expect we should be able to update the results of that old paper with a more conclusive comparison.

Rémi

I don't know if it's so hard to see. Do you want to bet that Bayeselo is compressing the ratings and the predictions, or to put it differently, that the total (model+statistical) errors it introduces when predicting the results of engine-engine matches are larger than those of Ordo (not those shown as errors in the tools themselves)? I mean Bayeselo as it is applied for CCRL ratings.

Kai

hgm · Post by **hgm** » Tue Jun 26, 2012 8:32 am

That depends on how you use it, of course. You can always adjust the prior so that it produces wrong results.

But with a correct prior, BayesElo should give ratings that produce more accurate predictions. In particular, an analysis that does not take any prior into account tends to expand the rating scale in a small data set (where there are players that scored 0 or 2 out of 2, or 1 out of 8). The same holds for an analysis that does not properly take account of weighting draws: it expands the rating scale, degrading the accuracy of its predictions.

That is basically model-independent. Most programs use the same (Logistic) model, which is not able to handle intransitivity at all. But the superiority of the Bayesian approach can already be seen when you are trying to predict the plain result probabilities or overall scores without any attempt to derive the predictions from a model with a reduced number of parameters.