Ordo's experimental approach


michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Ordo's experimental approach

Post by michiguel »

Adam posted some experimental ratings in the T&M forum and some questions came up, so I will open a thread here.

Ordo is experimenting with an alternative approach that calculates ratings using previous knowledge supplied by the user (a Bayesian concept, for the mathematically inclined). This is useful when data is scarce, which otherwise produces rankings with wild swings at the beginning of a new rating list or tournament. Traditionally in human ratings, an initial Elo is given and then updated. That has the opposite problem: some players/engines get stuck with a (wrong) initial rating that is then very difficult to adjust. Ordo instead accepts an initial rating plus a measure of how uncertain it is (a Gaussian prior distribution, for the mathematically inclined). I will show an example using nTCEC, which is a good case study.

For instance, one line of the file that contains the “previous information” has

Code: Select all

"Houdini 3",       3214,  39
That means Houdini's initial rating is 3214 (taken from CCRL) with an uncertainty of 39. Here the user should make the best possible educated guess. I combined the uncertainty of CCRL (15 elo for Houdini) with other factors. I assumed that the relative rating between single-core and 16-core engines could have an uncertainty of 30 elo (some engines will scale better or worse) and that going from blitz to a very long time control may have an uncertainty of 20. In this case, sqrt(15^2 + 30^2 + 20^2) = 39.

On the other hand, in this example I used for The Baron:

Code: Select all

"The Baron 3.35a", 2559, 253
It means that the initial rating taken from CCRL (2559) has a big uncertainty of 253. That is because the last public release was five years ago, and I estimated that the uncertainty increases by 50 elo per year. Other sources of error are the CCRL error (8) and the ones analyzed above (30 for SMP scaling, 20 for a different time control). So, sqrt(250^2 + 8^2 + 20^2 + 30^2) = 253.
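For the mathematically inclined, the combination rule above is just the root-sum-of-squares for independent error sources. A minimal Python sketch (my illustration, not Ordo code) that reproduces both numbers:

Code: Select all

from math import sqrt

def combine(*sigmas):
    # Standard deviation of a sum of independent error sources
    return sqrt(sum(s * s for s in sigmas))

# Houdini 3: CCRL error 15, SMP scaling 30, blitz-to-long-TC 20
print(round(combine(15, 30, 20)))        # 39

# The Baron 3.35a: 5 years * 50 elo/year = 250, CCRL 8, SMP 30, TC 20
print(round(combine(250, 8, 30, 20)))    # 253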

Another problem in this type of tournament is that version upgrades appear with no previous rating. However, we know that a new version cannot have a very different rating from the previous one. Here, too, the user should make the best educated guess. In this example, I decided that new versions have a similar rating, with an uncertainty of 20 (between stages) or 50 (between seasons). This "relative" previous information goes in a separate file with lines like this:

Code: Select all

"Bouquet 1.8a",     "Bouquet 1.8",        0, 20
That means version 1.8a came after 1.8 and is estimated to have the same elo (difference 0) with an uncertainty of 20. With different versions you can have different lines. Stockfish, for instance, has:

Code: Select all

"Stockfish 160913", "Stockfish 4",        0, 20
"Stockfish 4",      "Stockfish 250413",   0, 50  <---- from different seasons!
"Stockfish 250413", "Stockfish 120413",   0, 20
"Stockfish 120413", "Stockfish 250313",   0, 20
When two versions are radically different, you can say nothing, and they will be treated as different engines; or you can write, for instance:

Code: Select all

"Komodo 1063",      "Komodo 4534",        0, 1000
The first is a complete rewrite and SMP. So, the uncertainty of 1000 reflects this fact and makes them virtually disconnected. You can use bigger numbers, but it won't matter. If you wanted to include more specific info, you could say:

Code: Select all

"Komodo 1063",      "Komodo 4534",        160, 100
160 is the estimate for going from 1 core to 16, and 100 represents how uncertain that is. So, Ordo calculates the ratings using a "previous_info" file and a "relative_info" file with the version-update connections. I did this with all the games played in nTCEC so far with 16 cores (which includes the last stages of last season). Here you have the ratings and the two files used (Adam's example was more elaborate, since he used all the available info he could get: relative estimated strength among versions, etc.).
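To picture what the two files do: each line can be read as an independent Gaussian prior that adds a penalty term to the log-likelihood, which is maximized over all ratings at once. A conceptual sketch in Python (not Ordo's actual code):

Code: Select all

def log_prior(r, previous_info, relative_info):
    # r: dict mapping engine name -> rating being fitted
    # previous_info: list of (engine, rating, uncertainty) tuples
    # relative_info: list of (engine_a, engine_b, delta, uncertainty) tuples
    lp = 0.0
    for name, mu, sigma in previous_info:        # absolute priors
        lp -= (r[name] - mu) ** 2 / (2.0 * sigma ** 2)
    for a, b, delta, sigma in relative_info:     # relative priors
        lp -= (r[a] - r[b] - delta) ** 2 / (2.0 * sigma ** 2)
    return lp   # added to the log-likelihood of the game results

A huge uncertainty (e.g. the 1000 between the two Komodos) makes its penalty negligible, which is why those versions behave as virtually disconnected engines.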

Note that The Baron's rating here is reasonable, as is Jonny's:

Code: Select all

   # PLAYER              : RATING    POINTS  PLAYED    (%)
   1 Komodo 1092         :   3191       8.5      10   85.0%
     Komodo 1063         :   3188       5.0       7   71.4%
   2 Houdini 3           :   3186      55.5      95   58.4%
     Stockfish 250413    :   3164      23.0      48   47.9%
     Stockfish 250313    :   3162       8.0      12   66.7%
     Stockfish 120413    :   3162       9.5      18   52.8%
   3 Stockfish 160913    :   3162       7.5       9   83.3%
     Stockfish 4         :   3158       5.0       7   71.4%
   4 Bouquet 1.8a        :   3156       7.0       9   77.8%
     Bouquet 1.8         :   3155       5.5       7   78.6%
   5 Critter 1.6a        :   3135       9.0      16   56.2%
   6 Rybka 4.1           :   3105      26.5      47   56.4%
   7 Gull 2.2            :   3103      12.0      16   75.0%
     Komodo 4534         :   3087      14.0      30   46.7%
   8 Equinox 2b          :   3052       9.0      17   52.9%
   9 Hiarcs 14           :   3007      15.5      30   51.7%
  10 Vitruvius 1.19      :   2996       5.0      12   41.7%
  11 Naum 4.5            :   2991       5.0      10   50.0%
     Naum 4.2            :   2988       4.5       7   64.3%
  12 Hannibal 220813     :   2979       3.5       7   50.0%
  13 Shredder 12         :   2961       8.5      18   47.2%
  14 Spike 1.4           :   2928       9.0      17   52.9%
  15 Junior 13.3         :   2913       9.0      17   52.9%
  16 Spark 1             :   2913       8.0      18   44.4%
  17 Minkochess 1.3      :   2858       3.5       7   50.0%
  18 Jonny 6             :   2854       7.5      18   41.7%
     Toga II 280513      :   2851       3.5       7   50.0%
  19 Toga II 140913      :   2848       2.5      11   22.7%
  20 Exchess 7.15b       :   2824       6.0      18   33.3%
  21 The Baron 3.35a     :   2809       3.5       7   50.0%
  22 Tornado 5           :   2809       3.0      11   27.3%
  23 Sjeng WC2008        :   2809       3.5       7   50.0%
     Tornado 4.88        :   2809       3.5       7   50.0%
  24 Onno 1.27           :   2807       6.0      17   35.3%
  25 Quazar 0.4          :   2796       2.0      12   16.7%
  26 Gaviota 0.87a8      :   2792       3.5       7   50.0%
  27 Scorpio 2.76        :   2789       3.0       7   42.9%
  28 Crafty 23.6         :   2764       3.0       7   42.9%
  29 Octochess 5178      :   2693       1.5       6   25.0%
  30 Arasan 16           :   2636       1.5       6   25.0%
  31 Redqueen 1.14       :   2636       1.0       6   16.7%
  32 Nebula 2            :   2615       1.0       6   16.7%
  33 Arminius 100813     :   2576       2.0       7   28.6%
  34 Hamsters 0.71       :   2573       3.5       7   50.0%
  35 Alfil 13.1          :   2571       2.5       7   35.7%
  36 Delphil 3           :   2426       2.0       6   33.3%
  37 Firefly 2.6         :   2176       0.0       6    0.0%
This is the previous-info file:

Code: Select all

"Houdini 3",       3214,  39
"Stockfish 4",     3122,  64
"Critter 1.6a",    3158,  39
"Komodo 1063",     3158,  64
"Bouquet 1.8",     3127,  63
"Rybka 4.1",       3096,  37
"Gull 2.2",        3069,  40
"Hannibal 220813", 2992,  39
"Hiarcs 14",       2988,  40
"Naum 4.2",        2982,  37
"Shredder 12",     2953,  37
"Spark 1",         2919,  37
"Spike 1.4",       2905,  38
"Toga II 280513",  2885,  69
"Junior 13.3",     2880,  40
"Minkochess 1.3",  2871,  40
"Onno 1.27",       2812,  39
"Sjeng WC2008",    2810,  38
"Scorpio 2.76",    2802,  46
"Gaviota 0.87a8",  2801,  63
"Crafty 23.6",     2793,  64
"Exchess 7.15b",   2790,  73
"Tornado 4.88",    2787,  47
"Jonny 6",         2761, 155
"Octochess 5178",  2726,  66
"Arasan 16",       2694,  64
"Redqueen 1.14",   2678,  44
"Nebula 2",        2643,  40
"Alfil 13.1",      2603, 115
"Hamsters 0.71",   2552,  37
"The Baron 3.35a", 2559, 253
"Delphil 3",       2396,  47
"Firefly 2.6",     2182,  40
"Arminius 100813", 2530, 155
"Equinox 2b",      3116,  63
And the relative-info file for versions:

Code: Select all

"Bouquet 1.8a",     "Bouquet 1.8",        0, 20
"Komodo 1092",      "Komodo 1063",        0, 20
"Naum 4.5",         "Naum 4.2",           0, 20
"Toga II 140913",   "Toga II 280513",     0, 20
"Tornado 5",        "Tornado 4.88",       0, 20
"Stockfish 160913", "Stockfish 4",        0, 20
"Stockfish 4",      "Stockfish 250413",   0, 50
"Stockfish 250413", "Stockfish 120413",   0, 20
"Stockfish 120413", "Stockfish 250313",   0, 20
"Komodo 1063",      "Komodo 4534",        0, 1000
The output with only the latest version of each engine is:

Code: Select all

   # PLAYER              : RATING    POINTS  PLAYED    (%)
   1 Komodo 1092         :   3191       8.5      10   85.0%
   2 Houdini 3           :   3186      55.5      95   58.4%
   3 Stockfish 160913    :   3162       7.5       9   83.3%
   4 Bouquet 1.8a        :   3156       7.0       9   77.8%
   5 Critter 1.6a        :   3135       9.0      16   56.2%
   6 Rybka 4.1           :   3105      26.5      47   56.4%
   7 Gull 2.2            :   3103      12.0      16   75.0%
   8 Equinox 2b          :   3052       9.0      17   52.9%
   9 Hiarcs 14           :   3007      15.5      30   51.7%
  10 Vitruvius 1.19      :   2996       5.0      12   41.7%
  11 Naum 4.5            :   2991       5.0      10   50.0%
  12 Hannibal 220813     :   2979       3.5       7   50.0%
  13 Shredder 12         :   2961       8.5      18   47.2%
  14 Spike 1.4           :   2928       9.0      17   52.9%
  15 Junior 13.3         :   2913       9.0      17   52.9%
  16 Spark 1             :   2913       8.0      18   44.4%
  17 Minkochess 1.3      :   2858       3.5       7   50.0%
  18 Jonny 6             :   2854       7.5      18   41.7%
  19 Toga II 140913      :   2848       2.5      11   22.7%
  20 Exchess 7.15b       :   2824       6.0      18   33.3%
  21 The Baron 3.35a     :   2809       3.5       7   50.0%
  22 Tornado 5           :   2809       3.0      11   27.3%
  23 Sjeng WC2008        :   2809       3.5       7   50.0%
  24 Onno 1.27           :   2807       6.0      17   35.3%
  25 Quazar 0.4          :   2796       2.0      12   16.7%
  26 Gaviota 0.87a8      :   2792       3.5       7   50.0%
  27 Scorpio 2.76        :   2789       3.0       7   42.9%
  28 Crafty 23.6         :   2764       3.0       7   42.9%
  29 Octochess 5178      :   2693       1.5       6   25.0%
  30 Arasan 16           :   2636       1.5       6   25.0%
  31 Redqueen 1.14       :   2636       1.0       6   16.7%
  32 Nebula 2            :   2615       1.0       6   16.7%
  33 Arminius 100813     :   2576       2.0       7   28.6%
  34 Hamsters 0.71       :   2573       3.5       7   50.0%
  35 Alfil 13.1          :   2571       2.5       7   35.7%
  36 Delphil 3           :   2426       2.0       6   33.3%
  37 Firefly 2.6         :   2176       0.0       6    0.0%
Miguel
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo's experimental approach

Post by Michel »

This is similar to what online rating systems such as Glicko (FICS) and TrueSkill (XBox Live) do.

You start with a Gaussian prior and update it with new information whenever it comes in.

The resulting posterior will usually not be Gaussian, but you can approximate it with a Gaussian.

This approximated posterior then becomes the new prior, and you repeat.
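A minimal sketch of one such update step (my toy illustration: wins/losses only, assuming the logistic Elo expected-score curve; Glicko and TrueSkill differ in the details and also model draws):

Code: Select all

import numpy as np

def update(mu, sigma, opp_rating, won, scale=400.0):
    # Gaussian prior on our rating, evaluated on a grid
    r = np.linspace(mu - 5 * sigma, mu + 5 * sigma, 2001)
    prior = np.exp(-0.5 * ((r - mu) / sigma) ** 2)
    # Logistic expected score against a fixed opponent rating
    p_win = 1.0 / (1.0 + 10.0 ** ((opp_rating - r) / scale))
    post = prior * (p_win if won else 1.0 - p_win)
    post /= np.trapz(post, r)
    # Moment-match the posterior with a Gaussian: new prior for the next game
    new_mu = np.trapz(r * post, r)
    new_sigma = np.sqrt(np.trapz((r - new_mu) ** 2 * post, r))
    return new_mu, new_sigma

print(update(2800.0, 200.0, 2900.0, True))   # win against a stronger engine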

In contrast, traditional Maximum Likelihood Estimation first takes _all_ the data and only then computes the maximum of the resulting posterior. This is what BayesElo does.
Last edited by Michel on Sun Oct 06, 2013 6:19 am, edited 2 times in total.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Ordo's experimental approach

Post by Daniel Shawul »

Bayesian rating is better, no doubt about that. But IMO the prior should be uniform and something that evaporates quickly; otherwise you will have a tough time convincing others about the calculated ratings. Bayeselo uses a uniform prior of 1 or 2 draws against a 'null opponent' (better) or against each other (the current implementation) to get the benefit of using priors. If you assign non-uniform priors, strong engines that have proven their strength in other rating lists will always have a distinct advantage, even if they perform really badly in the current tournament. So the prior used should be uniform, reflecting equality at the start, and not so strong that the current results don't have much effect.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Ordo's experimental approach

Post by michiguel »

Michel wrote:This is similar to what online rating systems such as Glicko (FICS) and TrueSkill (XBox Live) do.

You start with a Gaussian prior and update it with new information whenever it comes in.

The resulting posterior will usually not be Gaussian, but you can approximate it with a Gaussian.

This approximated posterior then becomes the new prior, and you repeat.
That is not what this experimental version is doing.
In contrast, traditional Maximum Likelihood Estimation first takes _all_ the data and only then computes the maximum of the resulting posterior. This is what BayesElo does.
That is what Ordo has always done: it takes all the data and calculates the ranking in one go. But here it includes several prior distributions that relate individual engines to certain ratings, and engines to each other, if desired.

This is sort of an extension of the feature you requested before. Rather than "pinning" or "anchoring" certain engines to a given "well established" rating, here we can assign a rating with an uncertainty. Assigning "Engine_A", 2500, 0 (uncertainty 0) will have the same effect as pinning Engine_A to 2500. Now you can also do "relative" pins: Engine_A - Engine_B will be 200 elo, for instance. And you can relax that by giving it an uncertainty too.
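In file terms (hypothetical engine names; the first line would go in the "previous_info" file, the other two in the "relative_info" file):

Code: Select all

"Engine_A", 2500,   0              <---- uncertainty 0 = hard pin at 2500
"Engine_A", "Engine_B", 200,  0    <---- hard relative pin: A exactly 200 elo above B
"Engine_A", "Engine_B", 200, 30    <---- relaxed relative pin: 200 elo, uncertainty 30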

Miguel
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Ordo's experimental approach

Post by michiguel »

Daniel Shawul wrote:Bayesian rating is better, no doubt about that. But IMO the prior should be uniform and something that evaporates quickly; otherwise you will have a tough time convincing others about the calculated ratings. Bayeselo uses a uniform prior of 1 or 2 draws against a 'null opponent' (better) or against each other (the current implementation) to get the benefit of using priors. If you assign non-uniform priors, strong engines that have proven their strength in other rating lists will always have a distinct advantage, even if they perform really badly in the current tournament. So the prior used should be uniform, reflecting equality at the start, and not so strong that the current results don't have much effect.
It is up to the user to decide what type of previous knowledge is supplied. I just gave an example to illustrate the possibilities. If the user wants to set all the initial ratings to zero and wants that to evaporate quickly, it is possible to include in the priors file:

"Engine_A", 0, 1000
"Engine_B", 0, 1000
etc.

This means it assumes all engines start at zero with a very big uncertainty (1000 in this case), which is what controls how fast the rating updates.

Here, the user is free to estimate whether Houdini has the same starting rating as Firefly, whether there is a given difference based on whatever previous info they have (other lists, pre-runs, etc.), or to make no assumption whatsoever (i.e. no line included in the "previous info" file). Personally, I would like a rating that includes realistic prior information with realistic uncertainties, or no assumptions at all if nothing is available. But that is me talking as a user. As a programmer, I am giving the options.

Miguel
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Ordo's experimental approach

Post by Michel »

This is sort of an extension of the feature you requested before. Rather than "pinning" or "anchoring" certain engines to a given "well established" rating, here we can assign a rating with an uncertainty. Assigning "Engine_A", 2500, 0 (uncertainty 0) will have the same effect as pinning Engine_A to 2500. Now you can also do "relative" pins: Engine_A - Engine_B will be 200 elo, for instance. And you can relax that by giving it an uncertainty too.
Ok I see! I had misunderstood.
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Ordo's experimental approach

Post by Milos »

michiguel wrote:This is sort of an extension of the feature you requested before. Rather than "pinning" or "anchoring" certain engines to a given "well established" rating, here we can assign a rating with an uncertainty. Assigning "Engine_A", 2500, 0 (uncertainty 0) will have the same effect as pinning Engine_A to 2500. Now you can also do "relative" pins: Engine_A - Engine_B will be 200 elo, for instance. And you can relax that by giving it an uncertainty too.

Miguel
You assume that all the uncertainties are independent because that makes it easy to combine the distributions.
The uncertainties are not independent, and therefore, even though your calculation is correct for the particular case, in reality it is totally pointless.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Ordo's experimental approach

Post by michiguel »

Milos wrote:
michiguel wrote:This is sort of an extension of the feature you requested before. Rather than "pinning" or "anchoring" certain engines to a given "well established" rating, here we can assign a rating with an uncertainty. Assigning "Engine_A", 2500, 0 (uncertainty 0) will have the same effect as pinning Engine_A to 2500. Now you can also do "relative" pins: Engine_A - Engine_B will be 200 elo, for instance. And you can relax that by giving it an uncertainty too.

Miguel
You assume that all the uncertainties are independent because that makes it easy to combine the distributions.
The uncertainties are not independent, and therefore, even though your calculation is correct for the particular case, in reality it is totally pointless.
The uncertainties may be independent of each other, or not. But there is no reason to assume they have a strong dependency to start with, or that, if there is any dependency at all, it will affect the final result much. In addition, we do not even know that they distribute normally. We do not even know that they distribute symmetrically. All this does is provide some sort of "reasonable" ballpark estimation of the uncertainty, which will be used to "buffer" the big swings caused by the very few games the system is fed at the beginning. That is the goal of using Bayesian statistics for this type of rating. The better the estimation of the uncertainties, the better the buffer. Ultimately, those priors will fade away as the number of games increases.
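The buffering is easy to see in a two-Gaussian toy model (my illustration, not Ordo's internals): the combined estimate is a precision-weighted average of the prior and the performance suggested by the games so far, so a tight prior dominates early and loses influence as more games shrink the performance error.

Code: Select all

def buffered_rating(mu0, s0, m, s):
    # Prior N(mu0, s0^2) combined with a performance estimate N(m, s^2)
    w0, w = 1.0 / s0 ** 2, 1.0 / s ** 2
    return (w0 * mu0 + w * m) / (w0 + w)

# Loose prior (The Baron-like, 2559 +/- 253): moves a lot toward the games
print(buffered_rating(2559, 253, 2900, 120))   # ~2837

# Tight prior (Houdini-like, 3214 +/- 39): a few games barely move it
print(buffered_rating(3214, 39, 3000, 120))    # ~3194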

Miguel
Milos
Posts: 4190
Joined: Wed Nov 25, 2009 1:47 am

Re: Ordo's experimental approach

Post by Milos »

michiguel wrote:The uncertainties may be independent of each other, or not. But there is no reason to assume they have a strong dependency to start with, or that, if there is any dependency at all, it will affect the final result much. In addition, we do not even know that they distribute normally. We do not even know that they distribute symmetrically. All this does is provide some sort of "reasonable" ballpark estimation of the uncertainty, which will be used to "buffer" the big swings caused by the very few games the system is fed at the beginning. That is the goal of using Bayesian statistics for this type of rating. The better the estimation of the uncertainties, the better the buffer. Ultimately, those priors will fade away as the number of games increases.

Miguel
As I said, if you assume that the random variables related to the different factors that impact Elo are independent and follow a normal distribution, then the calculation of the distribution of the sum of those variables is quite trivial. I would have to see the source code to check whether you implemented it correctly in Ordo, but since the resulting distribution's mean and variance are easy to calculate, you probably also calculated the error bars correctly.

However, the problem is that your implementation makes little or no sense in the real world. As I said, there are quite a few correlations between the different input random variables (SMP scaling and TC, improvement per year and SMP), and most of them are not even symmetrical (or Gaussian); for example, a dev version is almost always stronger than the retail version, and the error bars are never symmetrical.

In addition to that, there are the ridiculous assumptions (made by Adam, for example) which are fed into your tool: not even guesstimates, just random numbers that do not even take into account the error bars the tool produces. This somehow makes the impression of a scientific result in the end (and it is used by laymen who have no clue about statistics), while it is not even quasi-scientific but only a misuse of science for personal wishful thinking.
If you look at Adam's result, he has a list of some 15 engines with all the variants within 100 Elo, where the error bars for 2 sigma are at least 200 Elo, a totally ludicrous result (while at the same time people just laugh at each other when someone publishes a result from a 100-game sample with 50 Elo error bars, etc.).
By providing your tool you basically just feed ridiculous speculation and ignorance on the forum (it is the same as with that "similarity detector" thing).
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Ordo's experimental approach

Post by michiguel »

Milos wrote:
michiguel wrote:The uncertainties may be independent of each other, or not. But there is no reason to assume they have a strong dependency to start with, or that, if there is any dependency at all, it will affect the final result much. In addition, we do not even know that they distribute normally. We do not even know that they distribute symmetrically. All this does is provide some sort of "reasonable" ballpark estimation of the uncertainty, which will be used to "buffer" the big swings caused by the very few games the system is fed at the beginning. That is the goal of using Bayesian statistics for this type of rating. The better the estimation of the uncertainties, the better the buffer. Ultimately, those priors will fade away as the number of games increases.

Miguel
As I said, if you assume that the random variables related to the different factors that impact Elo are independent and follow a normal distribution, then the calculation of the distribution of the sum of those variables is quite trivial. I would have to see the source code to check whether you implemented it correctly in Ordo, but since the resulting distribution's mean and variance are easy to calculate, you probably also calculated the error bars correctly.

However, the problem is that your implementation makes little or no sense in the real world. As I said, there are quite a few correlations between the different input random variables (SMP scaling and TC, improvement per year and SMP), and most of them are not even symmetrical (or Gaussian); for example, a dev version is almost always stronger than the retail version, and the error bars are never symmetrical.

In addition to that, there are the ridiculous assumptions (made by Adam, for example) which are fed into your tool: not even guesstimates, just random numbers that do not even take into account the error bars the tool produces. This somehow makes the impression of a scientific result in the end (and it is used by laymen who have no clue about statistics), while it is not even quasi-scientific but only a misuse of science for personal wishful thinking.
If you look at Adam's result, he has a list of some 15 engines with all the variants within 100 Elo, where the error bars for 2 sigma are at least 200 Elo, a totally ludicrous result (while at the same time people just laugh at each other when someone publishes a result from a 100-game sample with 50 Elo error bars, etc.).
By providing your tool you basically just feed ridiculous speculation and ignorance on the forum (it is the same as with that "similarity detector" thing).
Basically, your criticism centers on the input, not the tool itself. As I told you yesterday, it is up to the user to come up with an estimation of the initial elo and its uncertainty to feed the program. Whatever it is, if done properly (or even half decently), it is much better than providing nothing, since it provides a buffer and contains the wild swings of a rating based on the few results coming in. That is the whole point of using Bayesian concepts here.

If you are trying to say that the error bars for the TCEC ratings are big because there are few games: yes, of course.

<sigh> ...and I am pretty sure I am not trying to feed ignorance into the forum by providing a tool (which I have not even provided yet). If you have a comment, you are welcome. If you are trying to pick a silly fight out of the blue with this, I really do not have time for it.

Miguel