For comparison, the earlier 25,000-game runs were:
Code:
Rank Name                  Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5  123    8    8  5120   66%     2   15%
   2 Fruit 2.1              38    8    7  5119   55%     2   19%
   3 opponent-21.7          28    7    7  5119   54%     2   34%
   4 Crafty-22.2             2    4    4 25597   50%     0   19%
   5 Glaurung 1.1 SMP        2    8    8  5120   50%     2   14%
   6 Arasan 10.0          -193    8    9  5119   26%     2   15%

Rank Name                  Elo    +    - games score oppo. draws
   1 Glaurung 2-epsilon/5  118    8    8  5120   67%   -19   13%
   2 Fruit 2.1              42    8    8  5120   58%   -19   17%
   3 opponent-21.7          32    7    7  5115   58%   -19   36%
   4 Glaurung 1.1 SMP       20    8    8  5120   55%   -19   12%
   5 Crafty-22.2           -19    4    4 25595   47%     4   19%
   6 Arasan 10.0          -193    8    8  5120   28%   -19   16%
If we compare the new result to that of the old second run, we get:
Code:
Rank Name                  Elo    +    -  2nd diff
   1 Glaurung 2-epsilon/5  108    7    7  118  +10
   2 Fruit 2.1              62    7    6   42  -20
   3 opponent-21.7          25    6    6   32   +7
   4 Glaurung 1.1 SMP       10    6    6   20  +10
   5 Crafty-22.2           -21    4    4  -19   +2
   6 Arasan 10.0          -185    7    7 -193   -8
We see that the differences, listed in the last column, are not far from sqrt(2) ~ 1.41 times the uncertainties quoted by BayesElo. The standard deviation of the between-run differences for the 5 opponents of Crafty is sqrt((100+400+49+100+64)/5) ~ 12 Elo.
This is about twice what is expected: since the quoted BayesElo uncertainties are 95% confidence intervals, i.e. ~2 sigma, the expected standard deviation of a difference between two independent runs is sqrt(2) times half the quoted value, roughly 5 Elo. The main contribution to this variance comes from Fruit 2.1.
The first of the two old runs is way off.
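For anyone who wants to check the arithmetic, here is a minimal Python sketch (not part of the original test setup; the ~7 Elo interval used below is just a typical value read off the quoted + and - columns):
Code:
import math

# "diff" column for the 5 opponents of Crafty (Crafty itself excluded)
diffs = [10, -20, 7, 10, -8]

# rms of the between-run differences: sqrt((100+400+49+100+64)/5)
sd_observed = math.sqrt(sum(d * d for d in diffs) / len(diffs))
print(round(sd_observed, 1))   # ~11.9, i.e. ~12 Elo

# Quoted BayesElo intervals are roughly 7 Elo (~95%, i.e. ~2 sigma),
# so one run has sigma ~ 7/2, and the difference of two independent
# runs has sigma ~ sqrt(2) * 7/2 ~ 5 Elo -- about half the observed ~12.
sigma_quoted = 7.0
sd_expected = math.sqrt(2.0) * sigma_quoted / 2.0
print(round(sd_expected, 1))   # ~4.9 Elo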