EloStat, Bayeselo and Ordo

hgm · Post by **hgm** » Mon Jun 25, 2012 12:36 am

Laskos wrote:My problem is a bit different, if you are right, then I don't know what "rating" is supposed to mean. In absence of any other information, what is the rating difference between two engines scoring +60 =30 -10 against each other? Is it hard and I cannot do that by hand?

It is the same as +90 =0 -40, which means you have to invert the Logistic for 90/130.

The draw probability is proportional to the product of win and loss probability (for the Logistic model), but in general not exactly equal. But that does not affect the position of the maximum.
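A minimal sketch of that by-hand calculation, assuming the usual base-10 logistic with a 400-point scale. The function names are made up for illustration and do not come from any of the rating tools.

```python
import math

def elo_diff(score):
    """Invert the logistic expected-score curve: score in (0, 1) -> Elo difference."""
    return 400.0 * math.log10(score / (1.0 - score))

def merge_draws(wins, draws, losses):
    """Count each draw as one win plus one loss (the logistic-model equivalence)."""
    return wins + draws, losses + draws

w, l = merge_draws(60, 30, 10)               # +60 =30 -10 becomes +90 =0 -40
print(w, l, round(elo_diff(w / (w + l)), 1)) # 90 40 140.9
print(round(elo_diff(0.75), 1))              # 190.8, the plain 75% figure
```

So the draw-as-win-plus-loss reading gives about 141 points for this match, against about 191 points when the 75% score is inverted directly.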

hgm · Post by **hgm** » Mon Jun 25, 2012 12:44 am

michiguel wrote:How is it explained that you copy the same results 4 times and obtained different results? that is not related to draws or anything like that.

That would just be a result of the prior likelihood, which is equivalent to one draw between the programs. The more games there are, the smaller its effect. This is the Bayesian aspect, where the best estimate for a probability leading to an M-out-of-N result is (M+1)/(N+2).
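A small sketch of how the (M+1)/(N+2) estimate behaves when the same result is copied several times. It only illustrates the rule quoted above, not Bayeselo's actual prior handling; the 30-out-of-40 base result is a made-up example.

```python
def laplace_estimate(m, n):
    """Best estimate for a probability after M successes in N trials."""
    return (m + 1) / (n + 2)

base_m, base_n = 30, 40          # hypothetical 75% result
estimates = []
for copies in (1, 4, 16):
    m, n = base_m * copies, base_n * copies
    estimates.append(laplace_estimate(m, n))
    print(copies, n, round(estimates[-1], 4))
```

The raw percentage is 75% in every row, yet the estimate moves towards 0.75 as the number of games grows, which is the kind of drift seen when a PGN is duplicated.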

Laskos · Post by **Laskos** » Mon Jun 25, 2012 12:54 am

Sven Schüle wrote:
Laskos wrote:The correct rating here is -191, 0, 191 [...] and is derived analytically
In general there is no such thing like an (absolutely) "correct rating". The mapping between winning percentage (or probability) and rating difference depends on the underlying rating model, and there is more than one model, and not all three programs you are comparing use the same model AFAIK.

Furthermore, as HGM has shown, also the number of draws has an influence on the ratings. In addition to that, BayesElo (I don't know about Ordo here) also accounts for the colors. I don't know whether the (roughly) 55% win rate for White was considered in your example since you didn't mention it.

Sven

I mean correct according to the logistic, assuming there are no white/black colours. There is a 1-to-1 map between percentages and Elo point differences (assuming the logistic and 400/diff). Can I not say that 75% in a match (and the only match out there) means 191 points? I do understand where the number of draws enters, but I do not understand it here, in a single match, or in my example of three engines. As statistical weight goes, when we have incongruous results we might have to take a weighted average w1*diff1+w2*diff2+..., and the number of draws does enter in w1, w2, ... . Also, for incongruous results (almost all ratings with 3 or more engines), it's true that the ratings are model dependent and there is no such thing as a "correct rating". I can even argue for the weight quantity that 2 draws are always more than 1 win and 1 loss, but to say that 2 draws are always equivalent to 2 wins and 2 losses is wrong, one needs trinomials. For my example of three engines the weights (and therefore the draws) apparently do come in, but they are irrelevant because the results are exactly congruent: I have w1*diff + w2*diff+..., which is simply diff times the sum of the weights, a constant.
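A quick check of the "exactly congruent" claim, assuming the plain logistic with a 400-point scale and ignoring colours and draws (the helper name is illustrative):

```python
import math

def elo_diff(score):
    """Logistic inversion: score in (0, 1) -> Elo difference."""
    return 400.0 * math.log10(score / (1.0 - score))

d12 = elo_diff(0.75)   # eng1 vs eng2
d23 = elo_diff(0.75)   # eng2 vs eng3
d13 = elo_diff(0.90)   # eng1 vs eng3

# Congruent results: the two small steps add up to the big one.
print(round(d12 + d23, 2), round(d13, 2))
```

Both numbers agree because 0.9/0.1 = 9 is the square of 0.75/0.25 = 3, so the three pairwise differences fit on one rating line with nothing left over.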

Ok, I didn't follow the entire Bayeselo algorithm, but I don't think the rating for absurdly simple cases is so counterintuitive.

Kai

michiguel · Post by **michiguel** » Mon Jun 25, 2012 12:58 am

hgm wrote:
michiguel wrote:How is it explained that you copy the same results 4 times and obtained different results? that is not related to draws or anything like that.
That would just be a result of the prior likelihood, which is equivalent to one draw between the programs. The more games there are, the smaller its effect. This is the Bayesian aspect, where the best estimate for a probability leading to an M-out-of-N result is (M+1)/(N+2).

So, the failure is caused by the assumptions (model), not the algorithmic approach to fit the model. Correct?

Miguel

Laskos · Post by **Laskos** » Mon Jun 25, 2012 1:00 am

hgm wrote:
Laskos wrote:My problem is a bit different, if you are right, then I don't know what "rating" is supposed to mean. In absence of any other information, what is the rating difference between two engines scoring +60 =30 -10 against each other? Is it hard and I cannot do that by hand?
It is the same as +90 =0 -40, which means you have to invert the Logistic for 90/130.

Sorry, then don't wonder about rating compression, was it not expected by you? You already compressed it from 75% to 69%. What rating is that, rating of what?

The draw probability is proportional to the product of win and loss probability (for the Logistic model), but in general not exactly equal. But that does not affect the position of the maximum.

You took 1 draw as exactly 1 win and 1 loss. I can argue that 2 draws are more than 1 win and 1 loss, but strictly speaking 1 draw is not equal to anything win/loss.

Adam Hair · Post by **Adam Hair** » Mon Jun 25, 2012 3:26 am

Adam Hair wrote:
Laskos wrote:I picked one simple example to compare the correctness of the rating programs, an example with 3 engines where I can compute ratings by hand. Knowing that a performance of 75% means 191 Elo points advantage and 90% means exactly double, 2*191=382 Elo points, building a PGN with results eng1-eng2: 75%, eng2-eng3: 75%, eng1-eng3: 90% with the same number of games for each engine, the rating has a fixed point at first iteration:

eng1 +191
eng2 0
eng3 -191

I took the names of Houdini, Strelka and Komodo for the first PGN of 90 games with such properties (from EloStat "programs" file):
Code:
Individual statistics:

1 Houdini 1.5a x64          :  180   60 (+ 42,= 15,-  3), 82.5 %

Komodo64 3                    :  30 (+ 24,=  6,-  0), 90.0 %
Strelka 5                     :  30 (+ 18,=  9,-  3), 75.0 %

2 Strelka 5                 :    0   60 (+ 21,= 18,- 21), 50.0 %

Komodo64 3                    :  30 (+ 18,=  9,-  3), 75.0 %
Houdini 1.5a x64              :  30 (+  3,=  9,- 18), 25.0 %

3 Komodo64 3                : -180   60 (+  3,= 15,- 42), 17.5 %

Strelka 5                     :  30 (+  3,=  9,- 18), 25.0 %
Houdini 1.5a x64              :  30 (+  0,=  6,- 24), 10.0 %
The EloStat rating is:
Code:
    Program                            Score     %    Av.Op.  Elo    +   -    Draws

  1 Houdini 1.5a x64               :  49.5/ 60  82.5    -90    180   91  86   25.0 %
  2 Strelka 5                      :  30.0/ 60  50.0      0      0   75  75   30.0 %
  3 Komodo64 3                     :  10.5/ 60  17.5     90   -180   86  91   25.0 %
We see that EloStat gives -180, 0, 180 instead of the correct -191, 0, 191, compressing the rating by 22 points over the 382 points span of ratings.
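The 180 figure can be reproduced with a short sketch, on the assumption that EloStat rates each engine as its average opponent rating plus the logistic inversion of its total score. That assumption is consistent with the Av.Op. and Elo columns above but is not taken from EloStat's source.

```python
import math

def elo_diff(score):
    """Logistic inversion: score in (0, 1) -> Elo difference."""
    return 400.0 * math.log10(score / (1.0 - score))

# Total score fractions from the table above.
scores = {"eng1": 0.825, "eng2": 0.5, "eng3": 0.175}
ratings = {name: 0.0 for name in scores}

# Iterate: rating = average opponent rating + performance of the total score.
for _ in range(100):
    updated = {}
    for name in scores:
        opponents = [ratings[o] for o in scores if o != name]
        updated[name] = sum(opponents) / len(opponents) + elo_diff(scores[name])
    ratings = updated

print({name: round(r) for name, r in ratings.items()})
```

Under this scheme the 75% and 90% results are first pooled into a single 82.5% score against an averaged opponent, and since the logistic is not linear, that pooling alone is enough to pull the top rating down from 191 to about 180.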

The Bayeselo ratings are (for different mm flags):
Code:
ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
   1 Houdini 1.5a x64   158   60   54    60   83%   -79   25%
   2 Strelka 5           -1   51   51    60   50%     1   30%
   3 Komodo64 3        -157   54   60    60   18%    79   25%
for mm 0 0, ratings -157, -1, 158 instead of -191, 0, 191, compressing the rating by some 66 points.
For mm 1 1 Bayeselo gives
Code:
ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
   1 Houdini 1.5a x64   166   51   46    60   83%   -83   25%
   2 Strelka 5            0   44   44    60   50%     0   30%
   3 Komodo64 3        -165   47   51    60   18%    83   25%
compressing the ratings by some 50 points.

Ordo gives almost the correct result:
Code:
                        ENGINE:  RATING  ERROR  POINTS  PLAYED    (%)
              Houdini 1.5a x64:  190.1   41.6    49.5      60   82.5%
                     Strelka 5:   -0.0   35.9    30.0      60   50.0%
                    Komodo64 3: -190.1   41.7    10.5      60   17.5%
With the same PGN multiplied 4 times for a total of 360 games:
Code:
Individual statistics:

1 Houdini 1.5a x64          :  180  240 (+168,= 60,- 12), 82.5 %

Komodo64 3                    : 120 (+ 96,= 24,-  0), 90.0 %
Strelka 5                     : 120 (+ 72,= 36,- 12), 75.0 %

2 Strelka 5                 :    0  240 (+ 84,= 72,- 84), 50.0 %

Komodo64 3                    : 120 (+ 72,= 36,- 12), 75.0 %
Houdini 1.5a x64              : 120 (+ 12,= 36,- 72), 25.0 %

3 Komodo64 3                : -180  240 (+ 12,= 60,-168), 17.5 %

Strelka 5                     : 120 (+ 12,= 36,- 72), 25.0 %
Houdini 1.5a x64              : 120 (+  0,= 24,- 96), 10.0 %
For this PGN, EloStat gives:
Code:
    Program                            Score     %    Av.Op.  Elo    +   -    Draws

  1 Houdini 1.5a x64               : 198.0/240  82.5    -90    180   44  43   25.0 %
  2 Strelka 5                      : 120.0/240  50.0      0      0   37  37   30.0 %
  3 Komodo64 3                     :  42.0/240  17.5     90   -180   43  44   25.0 %
EloStat gives again ratings of +180, 0, -180 instead of +191, 0, -191, compressing the rating by 22 points.

Bayeselo:
mm 0 0
Code:
ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
   1 Houdini 1.5a x64   166   30   28   240   83%   -83   25%
   2 Strelka 5           -1   26   26   240   50%     1   30%
   3 Komodo64 3        -165   28   30   240   18%    82   25%
mm 1 1
Code:
ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
   1 Houdini 1.5a x64   173   26   25   240   83%   -86   25%
   2 Strelka 5            0   23   23   240   50%     0   30%
   3 Komodo64 3        -172   25   26   240   18%    86   25%
Bayeselo now compresses the rating by some 50 points for mm 0 0 flags, and by 37 points for mm 1 1 flags. Also, the result is different (and a bit closer to correct) compared to the PGN file with 4 times fewer games (90 instead of 360), probably due to the prior.

Ordo gives again almost exactly correct result:
Code:
                        ENGINE:  RATING  ERROR  POINTS  PLAYED    (%)
              Houdini 1.5a x64:  190.1   20.8   198.0     240   82.5%
                     Strelka 5:    0.0   18.3   120.0     240   50.0%
                    Komodo64 3: -190.1   20.4    42.0     240   17.5%
If I understood something, I would recommend using Ordo for direct rating comparison of engines, if one wants to avoid the rating compression and distortion.

Kai

For Bayeselo and mm 1 1, if you set the prior to 0, the ratings will be +176, 0, -176 and will not vary.

If we change the number of draws but keep the scores the same, here are the results:

Arena
Code:
1 A                         :  180   60 (+ 49,=  1,- 10), 82.5 %

B                             :  30 (+ 22,=  1,-  7), 75.0 %
C                             :  30 (+ 27,=  0,-  3), 90.0 %

2 B                         :    0   60 (+ 29,=  2,- 29), 50.0 %

A                             :  30 (+  7,=  1,- 22), 25.0 %
C                             :  30 (+ 22,=  1,-  7), 75.0 %

3 C                         : -180   60 (+ 10,=  1,- 49), 17.5 %

A                             :  30 (+  3,=  0,- 27), 10.0 %
B                             :  30 (+  7,=  1,- 22), 25.0 %
Ordo
Code:
                        ENGINE:  RATING    POINTS  PLAYED    (%)
                             A:   190.1      49.5      60   82.5%
                             B:     0.0      30.0      60   50.0%
                             C:  -190.1      10.5      60   17.5%
Bayeselo (using prior 0, mm 0 1, and covariance)
Code:
Rank Name   Elo    +    - games score oppo. draws
   1 A      193   79   79    60   83%   -97    2%
   2 B        3   66   66    60   50%    -1    3%
   3 C     -196   79   79    60   18%    98    2%
As you can see, the ratings for Elostat and Ordo stay the same, while Bayeselo ratings expand when the percentage of draws decreases. Bayeselo's model assumes that draws indicate 2 engines are closer in strength. Whether or not this causes distortions is in the eye of the beholder.

One correction. mm 0 1 does not set White advantage to 0 as I thought it would. If I set White advantage to zero, then use mm 0 1 (along with setting prior to 0), then I get the following with my example:

Code:

Rank Name   Elo    +    - games score oppo. draws
   1 A      192   78   78    60   83%   -96    2%
   2 B        0   66   66    60   50%     0    3%
   3 C     -192   78   78    60   18%    96    2%

I still get the same result for Kai's example.

Adam Hair · Post by **Adam Hair** » Mon Jun 25, 2012 4:23 am

hgm wrote:
Laskos wrote:My problem is a bit different, if you are right, then I don't know what "rating" is supposed to mean. In absence of any other information, what is the rating difference between two engines scoring +60 =30 -10 against each other? Is it hard and I cannot do that by hand?
It is the same as +90 =0 -40, which means you have to invert the Logistic for 90/130.

The draw probability is proportional to the product of win and loss probability (for the Logistic model), but in general not exactly equal. But that does not affect the position of the maximum.

I have to say I am a little lost. It makes sense to me that 2 draws could be replaced with a win and a loss, but not that one draw can be replaced with a win and a loss. Here is what I get with Bayeselo (prior=0, advantage=0, mm 0 1 to compute draw distribution from pgn):

+60=30-10

Code:

Rank Name   Elo    +    - games score oppo. draws 
   1 A       93   28   28   100   75%   -93   30% 
   2 B      -93   28   28   100   25%    93   30%

+75=0-25

Code:

Rank Name   Elo    +    - games score oppo. draws
   1 A       95   39   39   100   75%   -95    0%
   2 B      -95   39   39   100   25%    95    0%

+90=0-40

Code:

Rank Name   Elo    +    - games score oppo. draws 
   1 A       70   32   32   130   69%   -70    0% 
   2 B      -70   32   32   130   31%    70    0%
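The three results above can be reproduced by hand under some assumptions that are not stated in this post. The sketch assumes a model of the form P(win) = f(D - d), P(loss) = f(-D - d), P(draw) = 1 - P(win) - P(loss), with f the 400-point logistic and d a draw parameter fitted to the data. It also assumes that the displayed ratings are rescaled so that the expected-score curve has the logistic slope at zero, and that there is no prior and no White advantage. The function name and the closed form are a reconstruction, not Bayeselo code.

```python
import math

def two_player_rating(wins, draws, losses):
    """Rating of player A (half the rating difference) for a single match.

    With two free parameters (D and the draw parameter) the maximum-likelihood
    fit matches the observed win and loss fractions exactly. Requires wins > 0
    and losses > 0.
    """
    n = wins + draws + losses
    w, l = wins / n, losses / n
    # Unscaled rating difference that reproduces both fractions.
    diff = 200.0 * math.log10(w * (1.0 - l) / ((1.0 - w) * l))
    # theta = 10**(d/400); theta == 1 means no draws at all.
    theta = math.sqrt((1.0 - w) * (1.0 - l) / (w * l))
    # Assumed rescaling that keeps the slope of the expected score at zero.
    scale = 4.0 * theta / (1.0 + theta) ** 2
    return scale * diff / 2.0

for result in ((60, 30, 10), (75, 0, 25), (90, 0, 40)):
    print(result, round(two_player_rating(*result)))
```

Under these assumptions the sketch prints 93, 95 and 70, matching the three tables. With the draw parameter fitted from the PGN, +60=30-10 lands close to +75=0-25, whereas +90=0-40 is what you get when the draw parameter is held fixed and each draw is counted as a win plus a loss.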

Sven · Post by **Sven** » Mon Jun 25, 2012 11:22 am

Laskos wrote:
Sven Schüle wrote:
Laskos wrote:The correct rating here is -191, 0, 191 [...] and is derived analytically
In general there is no such thing like an (absolutely) "correct rating". The mapping between winning percentage (or probability) and rating difference depends on the underlying rating model, and there is more than one model, and not all three programs you are comparing use the same model AFAIK.

Furthermore, as HGM has shown, also the number of draws has an influence on the ratings. In addition to that, BayesElo (I don't know about Ordo here) also accounts for the colors. I don't know whether the (roughly) 55% win rate for White was considered in your example since you didn't mention it.

Sven
I mean correct according to the logistic, assuming there are no white/black colours. There is a 1-to-1 map between percentages and Elo point differences (assuming the logistic and 400/diff). Can I not say that 75% in a match (and the only match out there) means 191 points? I do understand where the number of draws enters, but I do not understand it here, in a single match, or in my example of three engines. As statistical weight goes, when we have incongruous results we might have to take a weighted average w1*diff1+w2*diff2+..., and the number of draws does enter in w1, w2, ... . Also, for incongruous results (almost all ratings with 3 or more engines), it's true that the ratings are model dependent and there is no such thing as a "correct rating". I can even argue for the weight quantity that 2 draws are always more than 1 win and 1 loss, but to say that 2 draws are always equivalent to 2 wins and 2 losses is wrong, one needs trinomials. For my example of three engines the weights (and therefore the draws) apparently do come in, but they are irrelevant because the results are exactly congruent: I have w1*diff + w2*diff+..., which is simply diff times the sum of the weights, a constant.

Ok, I didn't follow the entire Bayeselo algorithm, but I don't think the rating for absurdly simple cases is so counterintuitive.

Kai

It is correct to say that a rating difference of 191 points implies a winning probability of 75% based on the logistic curve. The other way round, deriving the "191" from the "75%", is also "correct" in a certain sense but it has a different meaning. I'll have to go into greater detail so everyone will understand the context.

The whole "elo rating" topic has one aspect which I believe is sometimes grossly underestimated, at least in CC, and I think it is important to understand what we do if we calculate ratings from (engine) games.

Whenever a set of games is taken to obtain elo ratings just from that set, this is always an attempt to predict something unknown, i.e. the "true relative playing strength" and, derived from it, the "real winning probabilities" of the participants. Now you can choose different ways to do that prediction but it is still a prediction, and it is not always trivial to say how good it actually is. It is surely possible to test the quality of the predictions produced by the different elo rating tools, but what most people actually do is only look at the predictions themselves, e.g. the resulting rating lists, and take them for granted (or not). At least one of these tools (BayesElo, I'm not sure about the others) also adds error bars to give an impression of the range of its predicted values, a detail that must be considered when judging a tool's quality, since a prediction with error bars of +/- 100 is less likely to fail (and thus to be considered "wrong") than one with +/- 5, for instance.

One might argue: "Hey, what nonsense, so I can easily provide a tool with 'perfect' prediction quality by always letting it state error bars of +/- 3000 ...". That's not the point, of course, since in reality the error bars mostly depend on the number of games you feed into the tool. Of course I agree that it is difficult to compare two tools where one includes error bars and the other doesn't.

So in my opinion you may of course argue about the quality of those predictions if you show that tool A has a lower probability of wrong predictions than tool B, or else some better "quality indicator value" than tool B based on another definition you choose. But just stating that tool B would return "wrong" results without comparing prediction qualities is not the way I believe it should be.

Please note also that the elo rating system is based on winning probabilities (here simply including draws as 0.5 points, and not considering colors!) being derived from rating differences, i.e. from an existing prediction. So the direction is "D ==> P(D)" (still assuming one rating model, e.g. "logistic"!). The other way round is "only" used to calculate the elo prediction, which is done incrementally or periodically for human players and "once for all games" for engines. Here I think that using information about draws and colors helps to improve the prediction quality, and it is up to the tool itself to decide how this information is best used.

My conclusion is: if you want to compare elo rating tools then you need to set up a procedure to measure prediction quality. I don't think this is easy to achieve with a simulation. I would have to think about how it could be done, maybe you are faster and better with that than me.

My first guess would be that

- you need to define "secret" values for the "true playing strength" of your participants,
- then provide a sufficiently large set of games (game results) exactly reflecting the winning probabilities derived from these values (based on a chosen rating model),
- then select (randomly?) a subset of these games and let each rating tool provide its rating predictions based on only that subset (so hiding all other games to them),
- then select one (or more?) subset(s) of (different) games from the whole game set,
- and finally test the prediction quality somehow by comparing the predictions with the newly selected results.
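The steps above could be sketched roughly as follows. This is a toy version under several assumptions: only two players, no draws or colours, a plain logistic as the "true" model, log-loss on the held-out games as the quality measure, and a stand-in "rating tool" that simply inverts the smoothed score. All names and numbers are made up for illustration.

```python
import math
import random

def expected_score(diff):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

def elo_diff(score):
    """Logistic inversion: score in (0, 1) -> Elo difference."""
    return 400.0 * math.log10(score / (1.0 - score))

def simulate(true_diff, n_games, rng):
    """Results for player A (1.0 = win, 0.0 = loss); draws are left out."""
    p = expected_score(true_diff)
    return [1.0 if rng.random() < p else 0.0 for _ in range(n_games)]

def predict_diff(results):
    """Stand-in rating tool: invert the logistic on the smoothed score."""
    p = (sum(results) + 1.0) / (len(results) + 2.0)
    return elo_diff(p)

def log_loss(pred_diff, results):
    """Average negative log-likelihood of held-out results (lower is better)."""
    p = expected_score(pred_diff)
    return -sum(math.log(p if r == 1.0 else 1.0 - p) for r in results) / len(results)

rng = random.Random(2012)
true_diff = 191.0                       # the "secret" strength difference
games = simulate(true_diff, 2000, rng)  # the full game set
rng.shuffle(games)
train, holdout = games[:200], games[200:]

pred = predict_diff(train)              # the tool sees only the subset
print(round(pred, 1), round(log_loss(pred, holdout), 4))
```

To compare real tools one would replace `predict_diff` with each tool's output and compare the held-out scores; the open points listed next (three or more participants, how many games are enough) are not addressed by this sketch.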

Several points are open for me, though, like:
- How to deal with the problem of non-transitive ratings when providing the initial set of games and having >= 3 participants?
- How many games are needed at least in the initial set to get a "valid" measurement procedure?
- How many games or subsets are needed in the second step to get a "valid" measurement procedure?

I hope you can follow my thoughts, and either you or someone else can state whether doing so would make any sense.

Sven

Rémi Coulom · Post by **Rémi Coulom** » Mon Jun 25, 2012 3:51 pm

As was already explained by others, bayeselo's model is not identical to the usual logistic model of expected score as a function of Elo difference. That's because bayeselo needs a model that gives the probability of a win, a draw, and a loss as a function of rating difference. The detailed distribution over possible game outcomes is necessary; just giving the expected score is not enough.

In order to produce comparable values for ratings, bayeselo's elo is adjusted so that the expected score as a function of rating difference has the same derivative at zero. That means the curves match for small elo difference, but they don't for large elo difference.

In bayeselo, I used a model that considers one draw like one win and one loss. That's because it was done like that in Hunter's paper. But it is an arbitrary choice that may not be the best.
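A sketch of one three-outcome model with this property. The functional form and the draw-parameter value are assumptions made for illustration (no White advantage, hypothetical names); they are not copied from bayeselo.

```python
import math

def f(x):
    """400-point logistic."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def outcome_probs(diff, elo_draw):
    """(P(win), P(draw), P(loss)) for a rating difference, assumed model."""
    p_win = f(diff - elo_draw)
    p_loss = f(-diff - elo_draw)
    return p_win, 1.0 - p_win - p_loss, p_loss

ELO_DRAW = 100.0  # hypothetical value of the draw parameter

ratios = []
for diff in (-400.0, -100.0, 0.0, 100.0, 400.0):
    p_win, p_draw, p_loss = outcome_probs(diff, ELO_DRAW)
    ratios.append(p_draw / (p_win * p_loss))
    print(diff, round(p_win, 3), round(p_draw, 3), round(p_loss, 3))

# P(draw) / (P(win) * P(loss)) is the same at every rating difference.
print([round(r, 6) for r in ratios])
```

Because the ratio is constant, a draw contributes the same factor to the likelihood as one win times one loss, up to a constant that does not depend on the rating difference.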

As Sven nicely explained, in order to compare the merit of different models, it would be necessary to run a study that compares their prediction abilities. I started to make one myself some time ago, but gave up for lack of motivation.

Rémi

hgm · Post by **hgm** » Mon Jun 25, 2012 10:18 pm

Adam Hair wrote:I have to say I am a little lost. It makes sense to me that 2 draws could be replaced with a win and a loss, but not that one draw can be replaced with a win and a loss.

Well, this is what the Logistic model predicts. For the Logistic, L(x)*(1-L(x)), which is the product of the probability for one win (L(x)) and one loss (1-L(x)), happens to be proportional to the probability of a single draw (which is given by the derivative of L(x)). Draws in this model are stronger evidence for equality of the players' ratings than wins and losses, because they only occur with high frequency in a narrow range of rating difference, while the occasional win against a stronger opponent still occurs for much larger rating differences.
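For reference, the proportionality can be checked directly, assuming the usual base-10 logistic with a 400-point scale (the constants are not spelled out in the thread):

```latex
L(x) = \frac{1}{1 + 10^{-x/400}}, \qquad
L'(x) = \frac{\ln 10}{400}\,\frac{10^{-x/400}}{\left(1 + 10^{-x/400}\right)^{2}}
      = \frac{\ln 10}{400}\, L(x)\,\bigl(1 - L(x)\bigr).
```

So the likelihood factor of one draw, L'(x), differs from that of one win plus one loss, L(x)(1-L(x)), only by a constant, which does not move the maximum.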

Now for other rating models it is in general not true that F*(1-F) ~ F'. (And in such models F' might not even model the draw rate.) For instance, in the Gaussian model there is no exact equivalence between a draw and any number of wins + losses, even for fractional numbers. But in the limit of a large number of games, the likelihood for N draws approaches that of 0.9N wins plus 0.9N losses (IIRC). So one might say that each draw counts for 1.8 games there.

So it all depends on which power of F*(1-F) is proportional to F' (expanded as a second-order power series).

Now the assumption that this power is 1, as the Logistic has, was actually supported by real data, where win probability times loss probability gave a curve that fitted the draw probability very well.
