Page 1 of 5

EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 1:27 pm
I picked one simple example to compare the correctness of the rating programs: an example with 3 engines where I can compute the ratings by hand. Knowing that a performance of 75% means a 191 Elo point advantage and 90% means exactly double that, 2*191 = 382 Elo points, I built a PGN with the results eng1-eng2: 75%, eng2-eng3: 75%, eng1-eng3: 90%, with the same number of games for each pairing. The rating then has a fixed point at the first iteration:

eng1 +191
eng2 0
eng3 -191
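The 75% → 191 and 90% → 382 conversions follow from the standard logistic Elo formula; a quick numeric check (my own sketch, not output from any of the rating tools):

```python
import math

def elo_from_score(p):
    # Logistic Elo model: a score fraction p corresponds to a
    # rating advantage of 400*log10(p/(1-p)) points.
    return 400 * math.log10(p / (1 - p))

print(round(elo_from_score(0.75)))  # 191
print(round(elo_from_score(0.90)))  # 382
```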

I took the names of Houdini, Strelka and Komodo for the first PGN of 90 games with such properties (from EloStat "programs" file):

Code:

Individual statistics:

1 Houdini 1.5a x64          :  180   60 (+ 42,= 15,-  3), 82.5 %

Komodo64 3                    :  30 (+ 24,=  6,-  0), 90.0 %
Strelka 5                     :  30 (+ 18,=  9,-  3), 75.0 %

2 Strelka 5                 :    0   60 (+ 21,= 18,- 21), 50.0 %

Komodo64 3                    :  30 (+ 18,=  9,-  3), 75.0 %
Houdini 1.5a x64              :  30 (+  3,=  9,- 18), 25.0 %

3 Komodo64 3                : -180   60 (+  3,= 15,- 42), 17.5 %

Strelka 5                     :  30 (+  3,=  9,- 18), 25.0 %
Houdini 1.5a x64              :  30 (+  0,=  6,- 24), 10.0 %
The EloStat rating is:

Code:

Program                            Score     %    Av.Op.  Elo    +   -    Draws

1 Houdini 1.5a x64               :  49.5/ 60  82.5    -90    180   91  86   25.0 %
2 Strelka 5                      :  30.0/ 60  50.0      0      0   75  75   30.0 %
3 Komodo64 3                     :  10.5/ 60  17.5     90   -180   86  91   25.0 %
We see that EloStat gives -180, 0, 180 instead of the correct -191, 0, 191, compressing the ratings by 22 points over the 382-point span.
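EloStat's ±180 is what you get if each engine's rating is taken as its performance, 400*log10(p/(1-p)), added to the average rating of its opponents, iterated to a fixed point. A sketch of that average-opponent scheme (my reconstruction of the method, using the score fractions from the tables above):

```python
import math

def perf(p):
    # performance offset for score fraction p under the logistic model
    return 400 * math.log10(p / (1 - p))

# overall score fractions from the EloStat statistics above
score = {"Houdini": 0.825, "Strelka": 0.500, "Komodo": 0.175}
opponents = {"Houdini": ("Strelka", "Komodo"),
             "Strelka": ("Houdini", "Komodo"),
             "Komodo":  ("Houdini", "Strelka")}

ratings = {name: 0.0 for name in score}
for _ in range(200):  # iterate to the fixed point
    ratings = {name: sum(ratings[o] for o in opponents[name]) / 2
                     + perf(score[name])
               for name in score}

for name, r in ratings.items():
    print(f"{name:8s} {r:+7.1f}")  # converges near +180 / 0 / -180
```

The fixed point lands at roughly ±179.6, which matches the EloStat table (Elo ±180, Av.Op. ∓90) rather than the analytic ±191.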

The Bayeselo ratings are (for different mm flags):

Code:

ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
1 Houdini 1.5a x64   158   60   54    60   83%   -79   25%
2 Strelka 5           -1   51   51    60   50%     1   30%
3 Komodo64 3        -157   54   60    60   18%    79   25%
For mm 0 0, the ratings are -157, -1, 158 instead of -191, 0, 191, compressing the ratings by some 66 points.
For mm 1 1 Bayeselo gives

Code:

ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
1 Houdini 1.5a x64   166   51   46    60   83%   -83   25%
2 Strelka 5            0   44   44    60   50%     0   30%
3 Komodo64 3        -165   47   51    60   18%    83   25%
compressing the ratings by some 50 points.

Ordo gives almost the correct result:

Code:

ENGINE:  RATING  ERROR  POINTS  PLAYED    (%)
Houdini 1.5a x64:  190.1   41.6    49.5      60   82.5%
Strelka 5:   -0.0   35.9    30.0      60   50.0%
Komodo64 3: -190.1   41.7    10.5      60   17.5%

With the same PGN multiplied four times, for a total of 360 games:

Code:

Individual statistics:

1 Houdini 1.5a x64          :  180  240 (+168,= 60,- 12), 82.5 %

Komodo64 3                    : 120 (+ 96,= 24,-  0), 90.0 %
Strelka 5                     : 120 (+ 72,= 36,- 12), 75.0 %

2 Strelka 5                 :    0  240 (+ 84,= 72,- 84), 50.0 %

Komodo64 3                    : 120 (+ 72,= 36,- 12), 75.0 %
Houdini 1.5a x64              : 120 (+ 12,= 36,- 72), 25.0 %

3 Komodo64 3                : -180  240 (+ 12,= 60,-168), 17.5 %

Strelka 5                     : 120 (+ 12,= 36,- 72), 25.0 %
Houdini 1.5a x64              : 120 (+  0,= 24,- 96), 10.0 %
For this PGN, EloStat gives:

Code:

Program                            Score     %    Av.Op.  Elo    +   -    Draws

1 Houdini 1.5a x64               : 198.0/240  82.5    -90    180   44  43   25.0 %
2 Strelka 5                      : 120.0/240  50.0      0      0   37  37   30.0 %
3 Komodo64 3                     :  42.0/240  17.5     90   -180   43  44   25.0 %
EloStat again gives ratings of +180, 0, -180 instead of +191, 0, -191, compressing the ratings by 22 points.

Bayeselo:
mm 0 0

Code:

ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
1 Houdini 1.5a x64   166   30   28   240   83%   -83   25%
2 Strelka 5           -1   26   26   240   50%     1   30%
3 Komodo64 3        -165   28   30   240   18%    82   25%
mm 1 1

Code:

ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
1 Houdini 1.5a x64   173   26   25   240   83%   -86   25%
2 Strelka 5            0   23   23   240   50%     0   30%
3 Komodo64 3        -172   25   26   240   18%    86   25%
Bayeselo now compresses the ratings by some 50 points for the 0 0 mm flags and by 37 points for the 1 1 mm flags. Also, the result differs (and is a bit closer to correct) from the PGN with four times fewer games (90 instead of 360), probably due to the prior.

Ordo again gives an almost exactly correct result:

Code:

ENGINE:  RATING  ERROR  POINTS  PLAYED    (%)
Houdini 1.5a x64:  190.1   20.8   198.0     240   82.5%
Strelka 5:    0.0   18.3   120.0     240   50.0%
Komodo64 3: -190.1   20.4    42.0     240   17.5%
If I have understood correctly, I would recommend using Ordo for direct rating comparisons of engines, if one wants to avoid rating compression and distortion.

Kai

Re: EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 4:38 pm
Laskos wrote:I picked one simple example to compare the correctness of the rating programs, an example with 3 engines where I can compute ratings by hand.

[...]

If I understood something, I would recommend using Ordo for direct rating comparison of engines, if one wants to avoid the rating compression and distortion.

Kai
For Bayeselo and mm 1 1, if you set prior to 0, the ratings will be +176, 0, -176 and will not vary.

If we change the number of draws but keep the scores the same, here are the results:

Arena

Code:

1 A                         :  180   60 (+ 49,=  1,- 10), 82.5 %

B                             :  30 (+ 22,=  1,-  7), 75.0 %
C                             :  30 (+ 27,=  0,-  3), 90.0 %

2 B                         :    0   60 (+ 29,=  2,- 29), 50.0 %

A                             :  30 (+  7,=  1,- 22), 25.0 %
C                             :  30 (+ 22,=  1,-  7), 75.0 %

3 C                         : -180   60 (+ 10,=  1,- 49), 17.5 %

A                             :  30 (+  3,=  0,- 27), 10.0 %
B                             :  30 (+  7,=  1,- 22), 25.0 %
Ordo

Code:

ENGINE:  RATING    POINTS  PLAYED    (%)
A:  190.1      49.5      60   82.5%
B:    0.0      30.0      60   50.0%
C:  -190.1      10.5      60   17.5%
Bayeselo (using prior 0, mm 0 1, and covariance)

Code:

Rank Name   Elo    +    - games score oppo. draws
1 A      193   79   79    60   83%   -97    2%
2 B        3   66   66    60   50%    -1    3%
3 C     -196   79   79    60   18%    98    2%

As you can see, the ratings for EloStat and Ordo stay the same, while the Bayeselo ratings expand as the percentage of draws decreases. Bayeselo's model assumes that draws indicate two engines are closer in strength. Whether or not this causes distortion is in the eye of the beholder.
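One way to see the draw dependence numerically: if each draw enters the likelihood like one win plus one loss (the assumption attributed to Bayeselo here), the same 75% score maps to different Elo gaps depending on the draw count. A small sketch of that assumed model (my own, using the two 75% lines from the Arena tables):

```python
import math

def draw_aware_elo(wins, draws, losses):
    # Assumed model: each draw weighs like one win plus one loss,
    # so the effective score is (wins + draws) / (games + draws).
    p = (wins + draws) / (wins + losses + 2 * draws)
    return 400 * math.log10(p / (1 - p))

# both lines are a 75% score over 30 games, with different draw counts
print(round(draw_aware_elo(18, 9, 3), 1))   # many draws -> ~140.9
print(round(draw_aware_elo(22, 1, 7), 1))   # few draws  -> ~183.5
```

Fewer draws push the estimate toward the raw logistic value of 191, which matches the expansion seen in the Bayeselo output above.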

Re: EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 5:46 pm
For Bayeselo and mm 1 1, if you set prior to 0, the ratings will be +176, 0, -176 and will not vary.

[...]

As you can see, the ratings for Elostat and Ordo stay the same, while Bayeselo ratings expand when the percentage of draws decreases. Bayeselo's model assumes that draws indicate 2 engines are closer in strength. Whether or not this causes distortions is in the eye of the beholder.
Thanks. It seems that Bayeselo gives a more or less correct result only when using prior 0, zero draws, mm 0 1, etc. The correct rating here is -191, 0, 191 independent of the number of draws, and it is derived analytically; therefore the Bayeselo rating compression and distortion at the parameter values and draw rates normally used is not only in the eye of the beholder, it is a fact. Maybe the assumption that 1 draw is equal to 1 win and 1 loss (if I remember correctly) is a bad extrapolation.

Kai

Re: EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 6:54 pm
Laskos wrote:Thanks. It seems that Bayeselo gives a more or less correct result only when using prior 0, zero draws, mm 0 1, etc. The correct rating here is -191, 0, 191 independent of the number of draws, and it is derived analytically; therefore the Bayeselo rating compression and distortion at the parameter values and draw rates normally used is not only in the eye of the beholder, it is a fact. Maybe the assumption that 1 draw is equal to 1 win and 1 loss (if I remember correctly) is a bad extrapolation.

Kai
You might find this thread from 3 month ago interesting:
http://talkchess.com/forum/viewtopic.php?t=42729

Apart from discussing BayesElo's prior, we also analyzed the probability-of-draw assumption you mentioned.

Re: EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 7:29 pm
Laskos wrote:Thanks. It seems that Bayeselo gives a more or less correct result only by using prior 0, number of draws 0, mm 0 1, etc.
This is only because you have the wrong notion about what would be the 'correct result'. In other words, BayesElo is right, and your hand calculation is wrong.

The number of draws is significant, as the rating models predict and as actual statistics on computer games have confirmed. Your simplistic hand calculation does not take that properly into account. Maximum likelihood, as used by BayesElo, does.

So my advice would be: use BayesElo, rather than the others.

Re: EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 8:24 pm
hgm wrote:
Laskos wrote:Thanks. It seems that Bayeselo gives a more or less correct result only by using prior 0, number of draws 0, mm 0 1, etc.
This is only because you have the wrong notion about what would be the 'correct result'. In other words, BayesElo is right, and your hand calculation is wrong.

The number of draws is significant, as the rating models predict, and actual statistics on computer games confirmed. Your simplistic hand calculation does not take that properly into account. Maximum likelihood, as used by BayesElo, does.

So my advice would be: use BayesElo, rather than the others.
I don't quite understand. Bayeselo starts from a logistic, a function of one variable, and then gives a compressed, wrong logistic. That is visible from my example and from the link Edmund gave. The Gaussian in his fit could well be a compressed logistic; you yourself observed that the Bayeselo rating is compressed compared to the "experimental" data points in those plots. To enlighten me, what is your result for my simple example (3 engines), which seems easy to do by hand?

Kai

Re: EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 8:52 pm
There are two separate issues here: the correctness of the model (logistic, Gaussian, linear), and the correctness of the analysis that determines the parameters once the model is given.

For a logistic model with a (small) draw margin m, the probability of a draw (L(x+m) - L(x-m) ~ dL/dx) is proportional to the probability of one loss plus one win (L(x) * (1 - L(x))). So one observed draw has the same effect on the likelihood of x (the rating difference) as one win plus one loss.

With N wins and M losses the likelihood of x is L(x)^N * (1-L(x))^M, which is maximal when

(N*L(x)^(N-1) * (1-L(x))^M - M*L(x)^N * (1-L(x))^(M-1)) * dL/dx(x) = 0

or

N*(1-L(x)) = M*L(x)
N = (N+M) * L(x)
L(x) = N/(N+M)

i.e. the expected formula based on the fraction of wins.

But as a draw counts as a win plus a loss, a 15-5 result from 15 wins and 5 losses has L(x) = 0.75, while one from 10 wins plus 10 draws gives the same as 20 wins plus 10 losses, i.e. L(x) ≈ 0.67.
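The maximum-likelihood score under this draw-counting rule can be checked numerically, using the 15-5 examples from the paragraph above (a sketch of my own):

```python
def mle_score(wins, draws, losses):
    # Under the model described above, one draw enters the likelihood
    # like one win plus one loss, so effectively
    # wins -> wins + draws and losses -> losses + draws.
    return (wins + draws) / (wins + losses + 2 * draws)

print(mle_score(15, 0, 5))   # 0.75
print(mle_score(10, 10, 0))  # 0.666... (same as 20 wins, 10 losses)
```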

Re: EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 9:46 pm
hgm wrote:There are two separate issues here: the correctness of the model, ( Logistic, Gaussian, linear) and the correctness of the analysis once the model is given (to determine the parameters).
Yes, but if I am not mistaken, the result here can be placed on Edmund's plot, given the Bayeselo results and the true percentages, in the plot "White Score against Elo-delta". Your wondering about compression came from that plot.

[...]
But as draws count for win + loss, a 15-5 result based on 15 wins and 5 losses has L(x) = 0.75, while one based on 10 wins plus 10 draws would give the same as for 20 wins plus 10 losses, i.e. L(x) = 0.66.
That's fine, I already saw something similar. Is the draw just proportional to 1 win and 1 loss, or exactly equal to it? Second, I don't think draws are equal to any win-loss combination in the statistical-weight sense (be it for the maximum-likelihood method); one has to use some summed-up trinomial distribution giving the same percentage, with varying N, M and draws.
My problem is a bit different: if you are right, then I don't know what "rating" is supposed to mean. In the absence of any other information, what is the rating difference between two engines scoring +60 =30 -10 against each other? Is it so hard that I cannot do it by hand?

Kai
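For the +60 =30 -10 example, the two viewpoints in this thread give concretely different answers; a rough comparison under the logistic formula (my own calculation, not output from any of the tools):

```python
import math

def elo(p):
    return 400 * math.log10(p / (1 - p))

wins, draws, losses = 60, 30, 10
games = wins + draws + losses

# plain score fraction: each draw counts half a point
p_simple = (wins + 0.5 * draws) / games
# draw-as-win-plus-loss rule discussed above
p_drawful = (wins + draws) / (games + draws)

print(round(elo(p_simple), 1))   # ~190.8
print(round(elo(p_drawful), 1))  # ~140.9
```

So the hand calculation from score percentage gives about 191 Elo, while the draw-weighted likelihood view gives about 141: a 50-point gap from the draws alone.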

Re: EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 10:05 pm
[...]
My problem is a bit different, if you are right, then I don't know what "rating" is supposed to mean. In absence of any other information, what is the rating difference between two engines scoring +60 =30 -10 against each other? Is it hard and I cannot do that by hand?

Kai
How do you explain that you copied the same results 4 times and obtained different ratings? That is not related to draws or anything like that.

Miguel

Re: EloStat, Bayeselo and Ordo

Posted: Sun Jun 24, 2012 10:21 pm
Laskos wrote:The correct rating here is -191, 0, 191 [...] and is derived analytically
In general there is no such thing as an (absolutely) "correct rating". The mapping between winning percentage (or probability) and rating difference depends on the underlying rating model; there is more than one model, and not all three programs you are comparing use the same model, AFAIK.

Furthermore, as HGM has shown, the number of draws also has an influence on the ratings. In addition, BayesElo (I don't know about Ordo here) accounts for the colors as well. I don't know whether the (roughly) 55% win rate for White was considered in your example, since you didn't mention it.

Sven