EloStat, Bayeselo and Ordo
Posted: Sun Jun 24, 2012 3:27 pm
I picked one simple example to compare the correctness of the rating programs: an example with 3 engines where I can compute the ratings by hand. A performance of 75% means an advantage of 191 Elo points, and 90% means exactly double that, 2*191=382 Elo points. So if I build a PGN with the results eng1-eng2: 75%, eng2-eng3: 75%, eng1-eng3: 90%, with the same number of games for each engine, the rating has a fixed point at the first iteration:
eng1 +191
eng2 0
eng3 -191
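The 191 and 382 figures follow from the usual logistic Elo formula, which is assumed here because it is the one under which 90% comes out as exactly twice 75%. A minimal Python sketch (the function names are illustrative, not taken from any of the rating programs) that checks both numbers and confirms that the hand-computed ratings reproduce all three pairwise scores:

```python
import math

def elo_diff(score):
    """Elo advantage implied by a score fraction (logistic model, assumed)."""
    return 400.0 * math.log10(score / (1.0 - score))

def expected_score(diff):
    """Expected score fraction for a given Elo advantage (inverse of elo_diff)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

d75 = elo_diff(0.75)  # about 190.8, quoted as 191 in the post
d90 = elo_diff(0.90)  # about 381.7, twice d75 because 9 = 3**2

# Hand-computed ratings from the example.
ratings = {"eng1": d75, "eng2": 0.0, "eng3": -d75}

# Each pairing should give back the score that was put into the PGN.
for a, b in [("eng1", "eng2"), ("eng2", "eng3"), ("eng1", "eng3")]:
    print(a, b, round(expected_score(ratings[a] - ratings[b]), 4))
```

The three printed scores are 0.75, 0.75 and 0.9, so no further iteration changes the ratings.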
I took the names of Houdini, Strelka and Komodo for the first PGN of 90 games with such properties (from EloStat "programs" file):
The EloStat rating is:
We see that EloStat gives -180, 0, 180 instead of the correct -191, 0, 191, compressing the ratings by 22 points over the 382-point span.
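One way to arrive at exactly these numbers is to rate each engine as "average opponent rating plus the Elo difference implied by its overall score". This is only an assumption that happens to match the Av.Op. and Elo columns of the EloStat output for this PGN (-90 and 180, a gap of 270); it is not taken from EloStat's source. A minimal sketch:

```python
import math

def elo_diff(score):
    """Elo advantage implied by a score fraction (logistic model, assumed)."""
    return 400.0 * math.log10(score / (1.0 - score))

# Overall score of each engine in the example PGN.
scores = {"Houdini": 0.825, "Strelka": 0.50, "Komodo": 0.175}
ratings = {name: 0.0 for name in scores}

# Every engine meets the other two equally often, so its average opponent
# rating is just the mean of the other two ratings.
for _ in range(100):
    updated = {}
    for name in scores:
        opponents = [ratings[o] for o in scores if o != name]
        avg_opp = sum(opponents) / len(opponents)
        updated[name] = avg_opp + elo_diff(scores[name])
    ratings = updated

print({name: round(r) for name, r in ratings.items()})
# {'Houdini': 180, 'Strelka': 0, 'Komodo': -180}
```

Averaging the 75% and 90% results into a single 82.5% score before converting to Elo is what loses the 11 points at each end in this sketch: 82.5% corresponds to about 269 Elo over an average opponent at -90.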
The Bayeselo ratings are (for different mm flags):
For mm 0 0 the ratings are -157, -1, 158 instead of -191, 0, 191, compressing the ratings by some 66 points.
For mm 1 1 Bayeselo gives -165, 0, 166, compressing the ratings by some 50 points.
Ordo gives almost the correct result:
With the same PGN repeated 4 times, for a total of 360 games:
For this PGN, EloStat gives:
EloStat again gives ratings of +180, 0, -180 instead of +191, 0, -191, compressing the ratings by 22 points.
Bayeselo gives 166, -1, -165 for mm 0 0 and 173, 0, -172 for mm 1 1. It now compresses the ratings by some 50 points with the mm 0 0 flags, and by 37 points with the mm 1 1 flags. The result is also different from (and a bit closer to correct than) the one for the PGN file with 4 times fewer games (90 instead of 360), probably due to the prior.
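The compression figures quoted in this post can be recomputed from the top and bottom rating of each list. A small sketch; the labels are just descriptive strings, the ratings are copied from the program outputs in this post, and the reference span of about 381.7 assumes the logistic formula:

```python
import math

# Span between eng1 and eng3 in the hand-computed solution: 2 * 400 * log10(3).
CORRECT_SPAN = 2 * 400.0 * math.log10(3)

# (top rating, bottom rating) as reported by each program.
reported = {
    "EloStat, 90 games": (180, -180),
    "EloStat, 360 games": (180, -180),
    "Bayeselo mm 0 0, 90 games": (158, -157),
    "Bayeselo mm 1 1, 90 games": (166, -165),
    "Bayeselo mm 0 0, 360 games": (166, -165),
    "Bayeselo mm 1 1, 360 games": (173, -172),
    "Ordo, 90 games": (190.1, -190.1),
    "Ordo, 360 games": (190.1, -190.1),
}

def compression(top, bottom):
    """How many Elo points the reported span falls short of the correct one."""
    return CORRECT_SPAN - (top - bottom)

for label, (top, bottom) in reported.items():
    print(f"{label:28s} span {top - bottom:6.1f}  compression {compression(top, bottom):5.1f}")
```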
Ordo again gives an almost exactly correct result:
If I have understood this correctly, I would recommend using Ordo for direct rating comparison of engines, if one wants to avoid the rating compression and distortion.
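For comparison, here is a minimal sketch of a fit that works on the individual pairings instead of on each engine's overall score. It nudges every rating by the surplus of actual over expected score until the two agree. This is not Ordo's actual algorithm, which is not described in this post; the logistic model, the step size and the iteration count are assumptions, and draws are simply counted as half a point.

```python
def expected_score(diff):
    """Expected score fraction for a given Elo advantage (logistic model, assumed)."""
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# (first, second): score fraction of first against second, equal games per pairing.
results = {
    ("Houdini", "Strelka"): 0.75,
    ("Strelka", "Komodo"): 0.75,
    ("Houdini", "Komodo"): 0.90,
}
names = ["Houdini", "Strelka", "Komodo"]
ratings = {name: 0.0 for name in names}

STEP = 200.0  # arbitrary step size, small enough for the loop to converge

for _ in range(2000):
    surplus = {name: 0.0 for name in names}
    for (a, b), score in results.items():
        gap = score - expected_score(ratings[a] - ratings[b])
        surplus[a] += gap   # a scored more (or less) than its rating predicts
        surplus[b] -= gap   # and b correspondingly less (or more)
    for name in names:
        ratings[name] += STEP * surplus[name]

for name in names:
    print(f"{name:8s} {ratings[name]:7.1f}")
```

In this sketch the loop settles at about +190.8, 0 and -190.8, the hand-computed solution, because those ratings reproduce all three pairwise scores at once.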
Kai
I took the names of Houdini, Strelka and Komodo for the first PGN of 90 games with such properties (from EloStat "programs" file):
Individual statistics:
1 Houdini 1.5a x64        :  180   60 (+ 42,= 15,-  3), 82.5 %
    Komodo64 3            :        30 (+ 24,=  6,-  0), 90.0 %
    Strelka 5             :        30 (+ 18,=  9,-  3), 75.0 %
2 Strelka 5               :    0   60 (+ 21,= 18,- 21), 50.0 %
    Komodo64 3            :        30 (+ 18,=  9,-  3), 75.0 %
    Houdini 1.5a x64      :        30 (+  3,=  9,- 18), 25.0 %
3 Komodo64 3              : -180   60 (+  3,= 15,- 42), 17.5 %
    Strelka 5             :        30 (+  3,=  9,- 18), 25.0 %
    Houdini 1.5a x64      :        30 (+  0,=  6,- 24), 10.0 %
The EloStat rating is:
  Program                     Score     %  Av.Op.   Elo    +    -  Draws
1 Houdini 1.5a x64      :  49.5/ 60  82.5     -90   180   91   86  25.0 %
2 Strelka 5             :  30.0/ 60  50.0       0     0   75   75  30.0 %
3 Komodo64 3            :  10.5/ 60  17.5      90  -180   86   91  25.0 %
The Bayeselo ratings are (for different mm flags):
mm 0 0
ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
   1 Houdini 1.5a x64   158   60   54    60   83%   -79   25%
   2 Strelka 5           -1   51   51    60   50%     1   30%
   3 Komodo64 3        -157   54   60    60   18%    79   25%
For mm 1 1 Bayeselo gives:
ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
   1 Houdini 1.5a x64   166   51   46    60   83%   -83   25%
   2 Strelka 5            0   44   44    60   50%     0   30%
   3 Komodo64 3        -165   47   51    60   18%    83   25%
Ordo gives almost the correct result:
ENGINE:              RATING  ERROR  POINTS  PLAYED    (%)
Houdini 1.5a x64:     190.1   41.6    49.5      60  82.5%
Strelka 5:             -0.0   35.9    30.0      60  50.0%
Komodo64 3:          -190.1   41.7    10.5      60  17.5%
With the same PGN repeated 4 times, for a total of 360 games:
Individual statistics:
1 Houdini 1.5a x64        :  180  240 (+168,= 60,- 12), 82.5 %
    Komodo64 3            :       120 (+ 96,= 24,-  0), 90.0 %
    Strelka 5             :       120 (+ 72,= 36,- 12), 75.0 %
2 Strelka 5               :    0  240 (+ 84,= 72,- 84), 50.0 %
    Komodo64 3            :       120 (+ 72,= 36,- 12), 75.0 %
    Houdini 1.5a x64      :       120 (+ 12,= 36,- 72), 25.0 %
3 Komodo64 3              : -180  240 (+ 12,= 60,-168), 17.5 %
    Strelka 5             :       120 (+ 12,= 36,- 72), 25.0 %
    Houdini 1.5a x64      :       120 (+  0,= 24,- 96), 10.0 %
For this PGN, EloStat gives:
  Program                     Score     %  Av.Op.   Elo    +    -  Draws
1 Houdini 1.5a x64      : 198.0/240  82.5     -90   180   44   43  25.0 %
2 Strelka 5             : 120.0/240  50.0       0     0   37   37  30.0 %
3 Komodo64 3            :  42.0/240  17.5      90  -180   43   44  25.0 %
Bayeselo:
mm 0 0
ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
   1 Houdini 1.5a x64   166   30   28   240   83%   -83   25%
   2 Strelka 5           -1   26   26   240   50%     1   30%
   3 Komodo64 3        -165   28   30   240   18%    82   25%
mm 1 1
ResultSet-EloRating>ratings
Rank Name               Elo    +    - games score oppo. draws
   1 Houdini 1.5a x64   173   26   25   240   83%   -86   25%
   2 Strelka 5            0   23   23   240   50%     0   30%
   3 Komodo64 3        -172   25   26   240   18%    86   25%
Ordo again gives an almost exactly correct result:
ENGINE:              RATING  ERROR  POINTS  PLAYED    (%)
Houdini 1.5a x64:     190.1   20.8   198.0     240  82.5%
Strelka 5:              0.0   18.3   120.0     240  50.0%
Komodo64 3:          -190.1   20.4    42.0     240  17.5%