Sven Schüle wrote: Laskos wrote: Sven Schüle wrote:
Laskos wrote:The correct rating here is -191, 0, 191 [...] and is derived analytically
In general there is no such thing as an (absolutely) "correct rating". The mapping between winning percentage (or probability) and rating difference depends on the underlying rating model; there is more than one model, and not all three programs you are comparing use the same one AFAIK.
Furthermore, as HGM has shown, the number of draws also influences the ratings. In addition, BayesElo (I don't know about Ordo here) also accounts for the colors. I don't know whether the (roughly) 55% win rate for White was considered in your example, since you didn't mention it.
I mean correct according to the logistic curve, assuming there are no white/black colours. There is a 1-to-1 map between percentages and Elo point differences (assuming the logistic curve with the usual 400-point scale). Can't I say that 75% in a match (and the only match out there) means 191 points? I do understand where the number of draws enters, but I do not understand it here, in a single match, or in my example of three engines.
As far as statistical weight goes, when we have incongruous results we might have to take a weighted average w1*diff1 + w2*diff2 + ..., and the number of draws does enter in w1, w2, ... . Also, for incongruous results (almost all ratings with 3 or more engines), it's true that the ratings are model dependent and there is no such thing as a "correct rating". I can even argue, as far as the weight quantity goes, that 2 draws always count for more than 1 win and 1 loss; but to say that 2 draws are always equivalent to 2 wins and 2 losses is wrong, one needs trinomials. For my example of three engines the weights (and therefore the draws) apparently do come in, but they are irrelevant because the results are exactly congruent: I have w1*diff + w2*diff + ..., so the weights only enter through their sum, which is a constant.
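The point that two draws carry more statistical weight than one win plus one loss can be made concrete with the per-game variance of the score (a small sketch of my own under the usual 1/0.5/0 scoring; lower variance means more weight, and it is exactly this spread that a binomial model cannot see):

```python
def score_variance(wins, draws, losses):
    """Per-game variance of the score under 1 / 0.5 / 0 scoring."""
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    second_moment = (wins * 1.0 + draws * 0.25) / n
    return second_moment - mean ** 2

# Both results score 50%, but the spread differs:
print(score_variance(0, 2, 0))  # 0.0  -- two draws: no spread at all
print(score_variance(1, 0, 1))  # 0.25 -- a win and a loss: maximal spread
```

Treating 2 draws as 2 wins plus 2 losses would wrongly assign both cases the same variance, which is why the trinomial view matters.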
Ok, I didn't follow the entire BayesElo algorithm, but I don't think the rating for absurdly simple cases is so counterintuitive.
It is correct to say that a rating difference of 191 points implies a winning probability of 75% based on the logistic curve. The other way round, deriving the "191" from the "75%", is also "correct" in a certain sense but it has a different meaning. I'll have to go into greater detail so everyone will understand the context.
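The two directions of that mapping can be written down directly from the base-400 logistic formula (no draw or color modeling here, just the plain curve both posters are referring to):

```python
import math

def win_prob(diff):
    """Winning probability implied by an Elo difference (logistic, base 400)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def elo_diff(p):
    """Inverse direction: Elo difference implied by a score percentage p."""
    return 400.0 * math.log10(p / (1.0 - p))

print(round(elo_diff(0.75)))      # 191 -- the number from the example above
print(round(win_prob(191), 3))    # ~0.75
```

Numerically the map is indeed 1-to-1; the disagreement in this thread is only about what the inverse direction *means* when applied to a finite sample of games.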
The whole "elo rating" topic has one aspect which I believe is sometimes grossly underestimated, at least in CC, and I think it is important to understand what we do if we calculate ratings from (engine) games.
Whenever a set of games is taken to obtain elo ratings just from that set, this is always an attempt to predict something unknown, i.e. the "true relative playing strength" and, derived from it, the "real winning probabilities" of the participants. Now you can choose different ways to do that prediction, but it is still a prediction, and it is not always trivial to say how good it actually is. It is surely possible to test the quality of the predictions produced by the different elo rating tools, but what most people actually do is only look at the predictions themselves, e.g. the resulting rating lists, and take them for granted (or don't). At least one of these tools (BayesElo, I'm not sure about the others) also adds information about the error bars to give an impression of the range of its predicted values, a detail that must be considered when judging a tool's quality, since a prediction with error bars of +/- 100 appears less likely to fail (and thus to be considered "wrong") than one with +/- 5, for instance.
One might argue: "Hey, what a nonsense, so I can easily provide a tool with 'perfect' prediction quality by always letting it state error bars of +/- 3000 ...". That's not the point, of course, since in reality the error bars mostly depend on the number of games you feed into the tool. Of course I agree that it is difficult to compare two tools where one includes error bars and the other doesn't.
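How strongly the error bars shrink with the number of games can be illustrated with a rough back-of-the-envelope normal approximation (my own sketch, draws and colors ignored; this is not how BayesElo actually derives its intervals):

```python
import math

def elo_error_bar(score, n_games, z=1.96):
    """Rough 95% error bar (in Elo) on a rating difference measured from
    a score over n games. Normal approximation; draws ignored."""
    se = math.sqrt(score * (1.0 - score) / n_games)           # std. error of the score
    slope = 400.0 / (math.log(10) * score * (1.0 - score))    # d(Elo diff)/d(score)
    return z * se * slope

# The bar roughly halves each time the number of games quadruples:
for n in (100, 400, 1600):
    print(n, round(elo_error_bar(0.75, n), 1))
```

So in practice the "+/- 3000" scenario never occurs: with any realistic number of games the bars are forced down to a scale where the prediction can genuinely fail.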
So in my opinion you may of course argue about the quality of those predictions if you show that tool A has a lower probability of wrong predictions than tool B, or some better "quality indicator value" than tool B based on another definition you choose. But just stating that tool B returns "wrong" results, without comparing prediction qualities, is not the way I believe it should be done.
Please note also that the elo rating system is based on winning probabilities (here simply counting draws as 0.5 points, and not considering colors!) being derived from rating differences, i.e. from an existing prediction. So the direction is "D ==> P(D)" (still assuming one rating model, e.g. "logistic"!). The other way round is "only" used to calculate the elo prediction, which is done incrementally or periodically for human players and "once for all games" for engines. Here I think that using information about draws and colors helps to improve the prediction quality, and it is up to each tool to decide how best to use that information.
My conclusion is: if you want to compare elo rating tools then you need to set up a procedure to measure prediction quality. I don't think this is easy to achieve with a simulation. I would have to think about how it could be done; maybe you are faster and better with that than I am.
My first guess would be that
- you need to define "secret" values for the "true playing strength" of your participants,
- then provide a sufficiently large set of games (game results) exactly reflecting the winning probabilities derived from these values (based on a chosen rating model),
- then select (randomly?) a subset of these games and let each rating tool provide its rating predictions based on only that subset (so hiding all other games to them),
- then select one (or more?) subset(s) of (different) games from the whole game set,
- and finally test the prediction quality somehow by comparing the predictions with the newly selected results.
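The steps above can be sketched end-to-end in a few lines. Everything here is an illustrative assumption, not a claim about how BayesElo or Ordo work: the secret strength values are invented, the half/half split stands in for the subset selection, a simple gradient fit of the logistic model stands in for a real rating tool, and the Brier score is one possible quality measure (0.25 is what a know-nothing 50% prediction would get here):

```python
import random

random.seed(42)

def win_prob(diff):
    """Logistic (base-400) winning probability for an Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Step 1: secret "true" strengths (hypothetical numbers, chosen freely).
true_elo = {"A": 200.0, "B": 0.0, "C": -150.0}
players = list(true_elo)

# Step 2: generate games exactly following the model (draws/colors ignored).
games = []
for _ in range(5000):
    p1, p2 = random.sample(players, 2)
    s = 1.0 if random.random() < win_prob(true_elo[p1] - true_elo[p2]) else 0.0
    games.append((p1, p2, s))

# Step 3: fit ratings on one subset only, hiding the rest.
train, test = games[:2500], games[2500:]
est = {p: 0.0 for p in players}
for _ in range(300):                 # plain gradient fit of the logistic model
    grad = {p: 0.0 for p in players}
    for p1, p2, s in train:
        e = win_prob(est[p1] - est[p2])
        grad[p1] += s - e
        grad[p2] += e - s
    for p in players:
        est[p] += 0.1 * grad[p]

# Steps 4-5: judge prediction quality on the held-out games (Brier score).
brier = sum((win_prob(est[p1] - est[p2]) - s) ** 2
            for p1, p2, s in test) / len(test)
print({p: round(r) for p, r in est.items()}, round(brier, 3))
```

Comparing two real tools would mean feeding both the same training subset and checking which one's predicted probabilities score better on the hidden games, error bars included.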
Several points are open for me, though, like:
- How to deal with the problem of non-transitive ratings when providing the initial set of games and having >= 3 participants?
- How many games are needed at least in the initial set to get a "valid" measurement procedure?
- How many games or subsets are needed in the second step to get a "valid" measurement procedure?
I hope you can follow my thoughts, and either you or someone else can state whether doing so would make any sense.