Ordo's experimental approach
Posted: Sun Oct 06, 2013 1:49 am
Adam posted some experimental ratings in the T&M forum and they generated some questions, so I am opening a thread here.
Ordo is experimenting with an alternative approach to calculating ratings that uses previous knowledge from the user (a Bayesian concept, for the mathematically inclined). This is useful when the data is scarce, which otherwise produces rankings with wild swings at the beginning of a new rating list or tournament. Traditionally in human ratings, an initial Elo is given and then updated. That approach has the opposite problem: some players/engines get stuck with a (wrong) initial rating that later becomes very difficult to adjust. Ordo is trying to accept an initial rating, but also how uncertain it is (a Gaussian prior distribution, for the mathematically inclined). I will show an example using nTCEC, which is a good case study.
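To see why a prior with an explicit uncertainty avoids both problems, here is a toy calculation (an illustration of the Gaussian-prior idea only, not Ordo's actual algorithm; the 3050 ± 120 "games-only" estimate is invented for the example). The combined estimate is a precision-weighted average, so with few games it stays near the prior, and as games accumulate the data takes over:

```python
import math

def combine_gaussians(prior_mean, prior_sd, data_mean, data_sd):
    """Precision-weighted combination of a Gaussian prior with a
    Gaussian estimate obtained from game results alone."""
    wp = 1.0 / prior_sd ** 2    # precision of the prior
    wd = 1.0 / data_sd ** 2     # precision of the data-only estimate
    mean = (wp * prior_mean + wd * data_mean) / (wp + wd)
    sd = math.sqrt(1.0 / (wp + wd))
    return mean, sd

# Hypothetical: prior 3214 +/- 39 (Houdini from CCRL); a handful of
# poor early games alone would suggest, say, 3050 +/- 120.
mean, sd = combine_gaussians(3214, 39, 3050, 120)
print(round(mean), round(sd))   # prints: 3198 37  (stays near the prior)
```

With a large uncertainty in the prior (e.g. 253 instead of 39), the same formula lets the games move the rating much faster, which is exactly the flexible behavior described above.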
For instance, one line of the file that contains the “previous information” has
Code:
"Houdini 3", 3214, 39

That means Houdini’s initial rating is 3214 (taken from CCRL) with an uncertainty of 39. Here the user should make the best possible educated guess. I combined the uncertainty of CCRL (15 elo for Houdini) with other factors: I assumed that the relative rating between single-core and 16-core engines could have an uncertainty of 30 elo (some engines will scale better or worse than others) and that going from blitz to a very long time control may have an uncertainty of 20. In this case, sqrt(15^2 + 30^2 + 20^2) ≈ 39.
On the other hand, in this example I used for The Baron:
Code:
"The Baron 3.35a", 2559, 253

It means that the initial rating taken from CCRL (2559) has a big uncertainty of 253. That is because the last public release was five years ago, and I estimated that the uncertainty increases by 50 elo per year. Other sources of error are the CCRL error (8) and the ones analyzed above (30 for SMP scaling, 20 for a different time control). So, sqrt(250^2 + 8^2 + 20^2 + 30^2) ≈ 253.
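The combination rule used above is the standard root-sum-of-squares for independent Gaussian uncertainties. A quick sketch, using the component values quoted in the post:

```python
import math

def combined_sigma(*components):
    """Combine independent uncertainty components in quadrature
    (root sum of squares)."""
    return math.sqrt(sum(c * c for c in components))

# Houdini: CCRL error 15, SMP scaling 30, time-control change 20
print(round(combined_sigma(15, 30, 20)))           # prints: 39

# The Baron: 5 years * 50 elo/year = 250, CCRL error 8,
# time-control change 20, SMP scaling 30
print(round(combined_sigma(5 * 50, 8, 20, 30)))    # prints: 253
```

Note how the 250 elo "staleness" term dominates The Baron's total: combining in quadrature means the largest component largely determines the result.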
Another problem in this type of tournament is that version upgrades appear with no previous rating. However, we know that a new version cannot have a very different rating from the previous one. Here, the user should make the best educated guess. In this example, I decided that new versions will have a similar rating, with an uncertainty of 20 (between stages) or 50 (between seasons). This “relative” previous information goes in a separate file with lines like this:
Code:
"Bouquet 1.8a", "Bouquet 1.8", 0, 20

That means version 1.8a came after 1.8 and is estimated to have the same elo (a difference of 0) with an uncertainty of 20. With different versions, you can have different lines. Stockfish, for instance, has:
Code:
"Stockfish 160913", "Stockfish 4", 0, 20
"Stockfish 4", "Stockfish 250413", 0, 50 <---- from different seasons!
"Stockfish 250413", "Stockfish 120413", 0, 20
"Stockfish 120413", "Stockfish 250313", 0, 20

When two versions are radically different, you can say nothing and they will be treated as different engines; or, for instance:
Code:
"Komodo 1063", "Komodo 4534", 0, 1000

The first is a complete rewrite plus SMP, so the uncertainty of 1000 reflects that fact and makes them virtually disconnected. You can use bigger numbers, but it won't matter. If you wanted to include more specific info, you could say:
Code:
"Komodo 1063", "Komodo 4534", 160, 100

Here, 160 is the estimate for going from 1 core to 16, and 100 represents how uncertain that estimate is. So, Ordo calculates the ratings using a “previous_info” file and a “relative_info” file with the version-update connections. I did this with all the games played in nTCEC so far with 16 cores (that includes the last stages of last season).
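One way to read a relative line is that the newer version inherits the older version's prior, shifted by the estimated difference, with the two uncertainties combined in quadrature (a sketch of the idea only; Ordo's internal treatment may differ). For example, Stockfish 4's prior of 3122 ± 64 plus a 0 ± 20 link implies roughly 3122 ± 67 for Stockfish 160913:

```python
import math

def implied_prior(base_mean, base_sd, delta, delta_sd):
    """Prior implied for a new version: the old version's prior shifted
    by the estimated difference, uncertainties added in quadrature."""
    return base_mean + delta, math.sqrt(base_sd ** 2 + delta_sd ** 2)

# "Stockfish 4", 3122, 64  combined with the relative line
# "Stockfish 160913", "Stockfish 4", 0, 20
mean, sd = implied_prior(3122, 64, 0, 20)
print(round(mean), round(sd))   # prints: 3122 67
```

A 0 ± 1000 link, as in the Komodo example above, yields a combined uncertainty so large that the implied prior is effectively uninformative, which is why the two versions behave as disconnected engines.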
Here you have the ratings and the two files used (Adam's example was more elaborate, since he used all the available info he could get, relative estimated strength among versions, etc.). Note that The Baron's rating here is reasonable, as well as Jonny's:
Code:
# PLAYER : RATING POINTS PLAYED (%)
1 Komodo 1092 : 3191 8.5 10 85.0%
Komodo 1063 : 3188 5.0 7 71.4%
2 Houdini 3 : 3186 55.5 95 58.4%
Stockfish 250413 : 3164 23.0 48 47.9%
Stockfish 250313 : 3162 8.0 12 66.7%
Stockfish 120413 : 3162 9.5 18 52.8%
3 Stockfish 160913 : 3162 7.5 9 83.3%
Stockfish 4 : 3158 5.0 7 71.4%
4 Bouquet 1.8a : 3156 7.0 9 77.8%
Bouquet 1.8 : 3155 5.5 7 78.6%
5 Critter 1.6a : 3135 9.0 16 56.2%
6 Rybka 4.1 : 3105 26.5 47 56.4%
7 Gull 2.2 : 3103 12.0 16 75.0%
Komodo 4534 : 3087 14.0 30 46.7%
8 Equinox 2b : 3052 9.0 17 52.9%
9 Hiarcs 14 : 3007 15.5 30 51.7%
10 Vitruvius 1.19 : 2996 5.0 12 41.7%
11 Naum 4.5 : 2991 5.0 10 50.0%
Naum 4.2 : 2988 4.5 7 64.3%
12 Hannibal 220813 : 2979 3.5 7 50.0%
13 Shredder 12 : 2961 8.5 18 47.2%
14 Spike 1.4 : 2928 9.0 17 52.9%
15 Junior 13.3 : 2913 9.0 17 52.9%
16 Spark 1 : 2913 8.0 18 44.4%
17 Minkochess 1.3 : 2858 3.5 7 50.0%
18 Jonny 6 : 2854 7.5 18 41.7%
Toga II 280513 : 2851 3.5 7 50.0%
19 Toga II 140913 : 2848 2.5 11 22.7%
20 Exchess 7.15b : 2824 6.0 18 33.3%
21 The Baron 3.35a : 2809 3.5 7 50.0%
22 Tornado 5 : 2809 3.0 11 27.3%
23 Sjeng WC2008 : 2809 3.5 7 50.0%
Tornado 4.88 : 2809 3.5 7 50.0%
24 Onno 1.27 : 2807 6.0 17 35.3%
25 Quazar 0.4 : 2796 2.0 12 16.7%
26 Gaviota 0.87a8 : 2792 3.5 7 50.0%
27 Scorpio 2.76 : 2789 3.0 7 42.9%
28 Crafty 23.6 : 2764 3.0 7 42.9%
29 Octochess 5178 : 2693 1.5 6 25.0%
30 Arasan 16 : 2636 1.5 6 25.0%
31 Redqueen 1.14 : 2636 1.0 6 16.7%
32 Nebula 2 : 2615 1.0 6 16.7%
33 Arminius 100813 : 2576 2.0 7 28.6%
34 Hamsters 0.71 : 2573 3.5 7 50.0%
35 Alfil 13.1 : 2571 2.5 7 35.7%
36 Delphil 3 : 2426 2.0 6 33.3%
37 Firefly 2.6 : 2176 0.0 6 0.0%

This is the previous info file:
Code:
"Houdini 3", 3214, 39
"Stockfish 4", 3122, 64
"Critter 1.6a", 3158, 39
"Komodo 1063", 3158, 64
"Bouquet 1.8", 3127, 63
"Rybka 4.1", 3096, 37
"Gull 2.2", 3069, 40
"Hannibal 220813", 2992, 39
"Hiarcs 14", 2988, 40
"Naum 4.2", 2982, 37
"Shredder 12", 2953, 37
"Spark 1", 2919, 37
"Spike 1.4", 2905, 38
"Toga II 280513", 2885, 69
"Junior 13.3", 2880, 40
"Minkochess 1.3", 2871, 40
"Onno 1.27", 2812, 39
"Sjeng WC2008", 2810, 38
"Scorpio 2.76", 2802, 46
"Gaviota 0.87a8", 2801, 63
"Crafty 23.6", 2793, 64
"Exchess 7.15b", 2790, 73
"Tornado 4.88", 2787, 47
"Jonny 6", 2761, 155
"Octochess 5178", 2726, 66
"Arasan 16", 2694, 64
"Redqueen 1.14", 2678, 44
"Nebula 2", 2643, 40
"Alfil 13.1", 2603, 115
"Hamsters 0.71", 2552, 37
"The Baron 3.35a", 2559, 253
"Delphil 3", 2396, 47
"Firefly 2.6", 2182, 40
"Arminius 100813", 2530, 155
"Equinox 2b", 3116, 63

And the relative info file for versions:
Code:
"Bouquet 1.8a", "Bouquet 1.8", 0, 20
"Komodo 1092", "Komodo 1063", 0, 20
"Naum 4.5", "Naum 4.2", 0, 20
"Toga II 140913", "Toga II 280513", 0, 20
"Tornado 5", "Tornado 4.88", 0, 20
"Stockfish 160913", "Stockfish 4", 0, 20
"Stockfish 4", "Stockfish 250413", 0, 50
"Stockfish 250413", "Stockfish 120413", 0, 20
"Stockfish 120413", "Stockfish 250313", 0, 20
"Komodo 1063", "Komodo 4534", 0, 1000

The output with only the latest version of each engine is:
Code:
# PLAYER : RATING POINTS PLAYED (%)
1 Komodo 1092 : 3191 8.5 10 85.0%
2 Houdini 3 : 3186 55.5 95 58.4%
3 Stockfish 160913 : 3162 7.5 9 83.3%
4 Bouquet 1.8a : 3156 7.0 9 77.8%
5 Critter 1.6a : 3135 9.0 16 56.2%
6 Rybka 4.1 : 3105 26.5 47 56.4%
7 Gull 2.2 : 3103 12.0 16 75.0%
8 Equinox 2b : 3052 9.0 17 52.9%
9 Hiarcs 14 : 3007 15.5 30 51.7%
10 Vitruvius 1.19 : 2996 5.0 12 41.7%
11 Naum 4.5 : 2991 5.0 10 50.0%
12 Hannibal 220813 : 2979 3.5 7 50.0%
13 Shredder 12 : 2961 8.5 18 47.2%
14 Spike 1.4 : 2928 9.0 17 52.9%
15 Junior 13.3 : 2913 9.0 17 52.9%
16 Spark 1 : 2913 8.0 18 44.4%
17 Minkochess 1.3 : 2858 3.5 7 50.0%
18 Jonny 6 : 2854 7.5 18 41.7%
19 Toga II 140913 : 2848 2.5 11 22.7%
20 Exchess 7.15b : 2824 6.0 18 33.3%
21 The Baron 3.35a : 2809 3.5 7 50.0%
22 Tornado 5 : 2809 3.0 11 27.3%
23 Sjeng WC2008 : 2809 3.5 7 50.0%
24 Onno 1.27 : 2807 6.0 17 35.3%
25 Quazar 0.4 : 2796 2.0 12 16.7%
26 Gaviota 0.87a8 : 2792 3.5 7 50.0%
27 Scorpio 2.76 : 2789 3.0 7 42.9%
28 Crafty 23.6 : 2764 3.0 7 42.9%
29 Octochess 5178 : 2693 1.5 6 25.0%
30 Arasan 16 : 2636 1.5 6 25.0%
31 Redqueen 1.14 : 2636 1.0 6 16.7%
32 Nebula 2 : 2615 1.0 6 16.7%
33 Arminius 100813 : 2576 2.0 7 28.6%
34 Hamsters 0.71 : 2573 3.5 7 50.0%
35 Alfil 13.1 : 2571 2.5 7 35.7%
36 Delphil 3 : 2426 2.0 6 33.3%
37 Firefly 2.6 : 2176 0.0 6 0.0%

Miguel