understanding elo rating

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

flok

understanding elo rating

Post by flok »

Hi,

For a while I've been using bayeselo to compare versions of my chess program.
For the last couple of weeks I have been pulling my hair out over an unexplainable difference in Elo rating.

Code:

Rank Name                  Elo    +    - games score oppo. draws 
   1 Stockfish 030914     2500  -75  202  1697  100%   925    0%  
   2 XboardEngine         1703   61   52  1698   91%   980    0%  
   3 Embla2001-lmr        1025   14   14  1697   57%  1048   40%  
   4 Embla-lmr-2006c2b     999   14   14  1705   53%  1045   38%  
   5 Embla-lmr-2006c2b-2   980   14   14  1696   50%  1048   35%  
   6 Embla2065_2045        951   13   13  1702   46%  1038   60%  
   7 Embla2067_2045        946   13   13  1698   45%  1049   59%  
   8 Embla2067_2065_2045   941   13   13  1698   45%  1051   51%  
   9 Embla2067             937   13   13  1699   46%  1050   54%  
  10 Embla2067_2065        933   13   13  1697   44%  1038   52%  
  11 Embla2067b            932   13   13  1700   42%  1040   58%  
  12 Embla2065             926   13   13  1699   43%  1050   59%  
  13 Embla2045             893   14   15  1700   37%  1041   37%  
  14 ParisHilton          -134  133  184  1696    0%  1130    0%
Embla2001-lmr, Embla-lmr-2006c2b and Embla-lmr-2006c2b-2 are exactly the same source code, and still they show a difference in rating.
To verify that I did not accidentally compile a different version, I recompiled the Embla-lmr-2006c2b code as Embla-lmr-2006c2b-2. So if I made a mistake, then Embla-lmr-2006c2b and Embla-lmr-2006c2b-2 could have different ratings, but one of those two must be the same as Embla2001-lmr.

Anyone got an idea what could be going on here?

(XboardEngine is tscp 1.81 and ParisHilton is an engine that aims at playing the worst move possible, so that I have a lower bound)
User avatar
Guenther
Posts: 4610
Joined: Wed Oct 01, 2008 6:33 am
Location: Regensburg, Germany
Full name: Guenther Simon

Re: understanding elo rating

Post by Guenther »

flok wrote:
[original post with the rating table snipped; see above]

What happens if you remove Stockfish and 'Paris Hilton' from the PGN pool? It is possible that both the 100% player and the 0% player lead to some
undesired miscalculations, because they can only be represented with approximated values (0% and 100% cannot be mapped to real ratings; moreover,
it does not make much sense to have them in the pool if you want to compare versions Elo-wise).

Last but not least, you surely noticed that even with your setup the two programs you mentioned are _within_ the given error bars:
1025 - 14 = 1011 and 999 + 14 = 1013.
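A quick sketch of that overlap check (the numbers come straight from the bayeselo table above):

```python
# 95% intervals from the bayeselo table: Elo minus/plus the error bar
embla2001_lmr = (1025 - 14, 1025 + 14)   # (1011, 1039)
embla_2006c2b = (999 - 14, 999 + 14)     # (985, 1013)

def overlap(a, b):
    """True if two closed intervals [lo, hi] share at least one point."""
    return a[0] <= b[1] and b[0] <= a[1]

print(overlap(embla2001_lmr, embla_2006c2b))  # → True
```

Since the intervals share the range 1011..1013, the two ratings are statistically indistinguishable at this number of games.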

Guenther
User avatar
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Understanding Elo rating.

Post by Ajedrecista »

Hello Folkert:
flok wrote:
[original post with the rating table snipped; see above]
I am not an expert, but I would say that it is totally normal and due to randomness. Just try a match between two copies of the same code and you will see that it is difficult to get an exact 50% - 50% split.

That being said, I think that the differences are somewhat big for that number of games. I did the following math, which I think is correct:

Code:

/* Pseudocode that does not follow C/C++ syntax:

1 <- Embla2001-lmr
2 <- Embla-lmr-2006c2b
3 <- Embla-lmr-2006c2b-2

Sample standard deviations (i = 1, 2, 3): */

s(i) = sqrt ( { score(i) * [ 1.0 - score(i) ] - 0.25 * draw_ratio(i) } / [ games(i) - 1.0 ] )

/* Difference of two normal distributions:
http://mathworld.wolfram.com/NormalDifferenceDistribution.html

s(i,j) <- sample standard deviation of the difference. */

s(i,j) = sqrt [ s(i) * s(i) + s(j) * s(j) ]

abs [ z(i,j) ] = abs [ score(i) - score(j) ] / s(i,j)

/* If I am not wrong:

s(1) ~ 0.00925
s(2) ~ 0.00951
s(3) ~ 0.00979

s(1,2) ~ 0.01327
s(1,3) ~ 0.01347
s(2,3) ~ 0.01365

abs [ z(1,2) ] ~ 3.01
abs [ z(1,3) ] ~ 5.20
abs [ z(2,3) ] ~ 2.20

The used values of score(i) and draw_ratio(i) are rounded to 1% (the Bayeselo output). */
Embla2001-lmr and Embla-lmr-2006c2b-2 are too far apart in my opinion, but these things can happen sometimes, like winning the lottery.
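For anyone who wants to reproduce those numbers, here is a minimal runnable Python version of the pseudocode above (scores and draw ratios read off the bayeselo table, rounded to 1% as in the output):

```python
from math import sqrt

# (score, draw_ratio, games) for the three identical builds
engines = {
    "Embla2001-lmr":       (0.57, 0.40, 1697),
    "Embla-lmr-2006c2b":   (0.53, 0.38, 1705),
    "Embla-lmr-2006c2b-2": (0.50, 0.35, 1696),
}

def sample_std(score, draw_ratio, games):
    # s(i) from the pseudocode above
    return sqrt((score * (1.0 - score) - 0.25 * draw_ratio) / (games - 1.0))

names = list(engines)
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        a, b = names[i], names[j]
        # standard deviation of the difference of two normals
        s_ab = sqrt(sample_std(*engines[a]) ** 2 + sample_std(*engines[b]) ** 2)
        z = abs(engines[a][0] - engines[b][0]) / s_ab
        print(f"|z({a}, {b})| = {z:.2f}")
```

This reproduces the z values quoted in the code box (about 3.0, 5.2 and 2.2).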

------------------------

Coming back again to the match between two identical engines: the expected score for each one is 50%, but the sample standard deviation is s = 0.5 * sqrt [ ( 1.0 - draw_ratio ) / ( games - 1.0 ) ] (just replacing score by 0.5 in the corresponding formula in the code box). So there is an interval of scores at z-sigma confidence, [ 0.5 - z * s, 0.5 + z * s ], and the observed score will usually not be exactly 0.5. I am writing in terms of the score, but it can also be checked with the error bars, just as Guenther pointed out.

For example: suppose an a priori draw ratio of 40% and 1000 games. Then I would say that the score of each engine will end up within the range [0.476, 0.524] = [ 47.6%, 52.4% ] with 95% confidence (z ~ 1.96), if I am not wrong. Scores can differ from 50% - 50%, just as the draw ratio can differ from 40%. More games -> more accuracy, but more time consumed and more expensive electricity bills. It is always a trade-off.
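That interval calculation can be sketched directly from the formula above:

```python
from math import sqrt

def self_match_interval(draw_ratio, games, z=1.96):
    # 95% score interval for one of two identical engines (expected score 0.5)
    s = 0.5 * sqrt((1.0 - draw_ratio) / (games - 1.0))
    return 0.5 - z * s, 0.5 + z * s

lo, hi = self_match_interval(0.40, 1000)
print(f"[{lo:.3f}, {hi:.3f}]")  # → [0.476, 0.524]
```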

Regards from Spain.

Ajedrecista.
flok

Re: Understanding Elo rating.

Post by flok »

Hi,

Ok thanks!
I'll just wait a bit longer before I conclude anything.

Regarding the power bill: I'm running things on 12 Raspberry Pi 2s. That's about 0.4 A each (I actually measured 0.33 A, but I round it up because of overhead in the laptop power supply I use). So that is +/- 25 watt in total, or somewhere around 51 euro per year. That I can afford :D
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: understanding elo rating

Post by bob »

flok wrote:
[original post with the rating table snipped; see above]
Elo is given here as a 95% confidence that the rating is in the window N +/- the error bar. If you want zero variability, you have to play an infinite number of games. To get to +/- 1 Elo, you need around 100,000 games.
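A rough sketch of how the error bar shrinks with games, using the normal approximation near a 50% score (the 40% draw ratio is an assumption, and Bayeselo's actual model differs in detail):

```python
from math import log, sqrt

def elo_error_bar(games, draw_ratio=0.4, z=1.96):
    """Approximate 95% error bar in Elo for a score near 50%.

    Standard deviation of the score, converted to Elo via the slope
    of Elo = 400*log10(p / (1 - p)) at p = 0.5 (about 695)."""
    s = 0.5 * sqrt((1.0 - draw_ratio) / (games - 1.0))
    slope = 400.0 / (log(10.0) * 0.25)
    return z * slope * s

for n in (1700, 10_000, 100_000):
    print(f"{n:>7} games: +/- {elo_error_bar(n):.1f} Elo")
```

With a 40% draw ratio this roughly reproduces the +/- 13..14 bars in the table above for ~1700 games, and shows that getting down to +/- 1 or 2 Elo indeed takes on the order of 100,000 games.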
flok

Re: understanding elo rating

Post by flok »

bob wrote:
flok wrote:
[original post with the rating table snipped; see above]
Elo is given here as a 95% confidence that the rating is in the window N +/- the error bar. If you want zero variability, you have to play an infinite number of games. To get to +/- 1 Elo, you need around 100,000 games.
Thanks.

But what are those +/- columns then for?
I thought they indicated in what range a rating is.
Like Embla2065_2045 would then be 938...964?
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: understanding elo rating

Post by bob »

flok wrote:
bob wrote:
flok wrote:
[original post with the rating table snipped; see above]
Elo is given here as a 95% confidence that the rating is in the window N +/- the error bar. If you want zero variability, you have to play an infinite number of games. To get to +/- 1 Elo, you need around 100,000 games.
Thanks.

But what are those +/- columns then for?
I thought they indicated in what range a rating is.
Like Embla2065_2045 would then be 938...964?
The Elo rating you see should be interpreted as this:

I am 95% confident that the actual rating is Elo plus or minus the error bar.

For the first Embla entry (Embla2001-lmr) in your original post, that would be:

I am 95% confident that the rating lies somewhere between 1011 and 1039. 5% of the time it will be outside that window. The window narrows with more games. The stockfish rating has a huge error bar because if it wins every game, the rating could be 2500 or 3500, since it should win almost everything with either of those.

Stockfish is screwing up your numbers. It is pointless to play games against a program where you lose every single game. It adds nothing to the elo calculation at all, and actually hurts. Remove it.
User avatar
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Understanding Elo rating.

Post by Ajedrecista »

Hello again:
flok wrote:
bob wrote:Elo is given here as a 95% confidence that the rating is in the window N +/- the error bar. If you want zero variability, you have to play an infinite number of games. To get to +/- 1 Elo, you need around 100,000 games.
Thanks.

But what are those +/- columns then for?
I thought they indicated in what range a rating is.
Like Embla2065_2045 would then be 938...964?
This is exactly what they mean. The width of this interval is proportional to (games)^(-1/2).
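That scaling can be seen with the self-match formula from the earlier post: quadrupling the games halves the interval width.

```python
from math import sqrt

def interval_width(games, draw_ratio=0.4, z=1.96):
    # full width of [0.5 - z*s, 0.5 + z*s] for a self-match
    return 2.0 * z * 0.5 * sqrt((1.0 - draw_ratio) / (games - 1.0))

print(interval_width(1000) / interval_width(4000))  # ~2.0
```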

I think that the Bayeselo default output for error bars is 95% confidence ~ 1.96-sigma confidence. I suppose it can be changed; just reading the help:

Bayesian Elo Rating

Once you have Bayeselo open in the command prompt, type ? and press Enter to get help. The level of confidence can be changed by typing elo, pressing Enter, then typing confidence X and pressing Enter again, where 0 < X < 1 (for example: 99% = 0.99, whose z value is around 2.5758 IIRC).

Just have a glance here to see these values of z. Remember: confidence = 1 - alpha and you want z(alpha/2) because it is a two tailed test; z implies a confidence level.

Those error bars are related to the intervals [score - z·s, score + z·s] that I wrote in my other post. Bayeselo probably uses different formulae to compute s, but the meaning of those +/- columns should be as I described:

Code:

You have a score, which translates into Elo.
A very basic idea would be (surely not used in Bayeselo):

Own_Elo = Average_Elo_of_the_opponents + 400*log[score/(1 - score)]
log(x): base 10.

The same can be done with "score - z·s" and "score + z·s":

Lower_own_Elo = Average_Elo_of_the_opponents + 400*log{(score - z·s)/[1 - (score - z·s)]}
Upper_own_Elo = Average_Elo_of_the_opponents + 400*log{(score + z·s)/[1 - (score + z·s)]}

Column(-) = Own_Elo - Lower_own_Elo
Column(+) = Upper_own_Elo - Own_Elo
I hope no typos.
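A runnable sketch of those formulas, applied to Embla2001-lmr's row from the first table (this is the "very basic idea" above, not Bayeselo's actual algorithm):

```python
from math import log10, sqrt

def own_elo(avg_opp, score):
    # the "very basic idea" formula (surely not what Bayeselo uses)
    return avg_opp + 400.0 * log10(score / (1.0 - score))

def plus_minus_columns(avg_opp, score, draw_ratio, games, z=1.96):
    # s from the sample-standard-deviation formula in the earlier post
    s = sqrt((score * (1.0 - score) - 0.25 * draw_ratio) / (games - 1.0))
    mid = own_elo(avg_opp, score)
    lower = own_elo(avg_opp, score - z * s)
    upper = own_elo(avg_opp, score + z * s)
    return mid - lower, upper - mid   # Column(-), Column(+)

# Embla2001-lmr's row: 57% score, 40% draws, 1697 games, oppo. 1048
minus, plus = plus_minus_columns(1048, 0.57, 0.40, 1697)
print(f"-{minus:.1f} / +{plus:.1f}")
```

The result comes out close to the +/- 14 in the table, even though Bayeselo computes its bars differently.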

------------------------

I agree with Bob's post: SF adds no information for you, it only wastes time. Moreover, the negative value in the + column suggests that some convergence problem of the algorithm took place.

I would say that engines that are too good or too bad make rating computations difficult, due to the nature of the logit function logit(score) = ln[score/(1 - score)] when the score is near 0 or near 1: 400*log[score/(1 - score)] = [400/ln(10)]*logit(score), which tends to ±infinity.
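A small sketch showing how the implied Elo gap blows up as the score approaches 1:

```python
from math import log10

def implied_elo_gap(score):
    # Elo difference implied by a score under the logistic model
    return 400.0 * log10(score / (1.0 - score))

for p in (0.9, 0.99, 0.999, 0.9999):
    print(f"score {p}: {implied_elo_gap(p):7.1f} Elo")
```

Each extra "9" in the score adds roughly another 400 Elo, which is why a 100% or 0% player cannot be pinned down.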

Regards from Spain.

Ajedrecista.
flok

Re: understanding elo rating

Post by flok »

Jesús, Robert, thanks for the elaborate replies!

I realised that the ratings shown are not hard values. Even after thousands of games they should not be considered to be, well, exactly that value. For example, I tried running 3 compilations of my program, all 3 the same version with the only difference being their name as reported by the 'uci' command, and the result was a difference of around 50 Elo points between the highest and lowest rating. That was with about 3000 games played in total with 7 programs.

Now I'm curious: bayeselo often gives a different result from ordo.

bayeselo:

Code:

Rank Name                        Elo    +    - games score oppo. draws 
   1 Embla-lmr-2006c2b            60   29   29   329   61%    -5   42%  
   2 Emblar2130-lmr-trunk-2114    57   29   29   330   60%    -3   37%  
   3 Embla-lmr-2006c2b-2          55   29   28   330   60%    -6   43%  
   4 Embla2001-lmr                45   29   29   329   59%    -7   40%  
   5 Embla2067b                   -6   27   27   324   48%     0   65%  
   6 Embla2065                    -9   26   26   332   48%     2   67%  
   7 Embla2045                   -10   27   27   327   49%    -2   61%  
   8 Emblalmr-trunk-2114         -16   27   27   325   45%     2   65%  
   9 Embla2065_2045              -16   26   26   328   48%     0   67%  
  10 Embla2067_2065              -19   27   27   326   46%     1   64%  
  11 Embla2067                   -20   27   27   333   47%     3   62%  
  12 Embla2067_2045              -23   27   27   330   45%     4   63%  
  13 Embla2067_2065_2045         -27   27   27   331   46%     5   63%  
  14 Embla2001                   -31   26   27   337   44%     1   61%  
  15 Embla2043                   -40   27   28   337   43%     3   50% 
ordo:

Code:

   # PLAYER                       : RATING    POINTS  PLAYED    (%)
   1 Embla-lmr-2006c2b            : 2370.7     199.5     329   60.6%
   2 Embla-lmr-2006c2b-2          : 2366.3     198.5     330   60.2%
   3 Emblar2130-lmr-trunk-2114    : 2366.1     197.0     330   59.7%
   4 Embla2001-lmr                : 2357.9     195.0     329   59.3%
   5 Embla2065                    : 2289.2     160.0     332   48.2%
   6 Embla2045                    : 2288.7     159.5     327   48.8%
   7 Embla2065_2045               : 2286.3     158.0     328   48.2%
   8 Embla2067b                   : 2284.3     155.0     324   47.8%
   9 Embla2067                    : 2283.1     157.0     333   47.1%
  10 Embla2067_2065_2045          : 2274.1     151.0     331   45.6%
  11 Embla2067_2065               : 2273.0     150.0     326   46.0%
  12 Embla2067_2045               : 2272.5     150.0     330   45.5%
  13 Emblalmr-trunk-2114          : 2270.2     147.5     325   45.4%
  14 Embla2001                    : 2260.8     149.5     337   44.4%
  15 Embla2043                    : 2256.6     146.5     337   43.5%
I wonder which one I should believe?
User avatar
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Understanding Elo rating.

Post by Ajedrecista »

Hello Folkert:
flok wrote:Jesús, Robert, thanks for the elaborate replies!

[quoted post with the bayeselo and ordo tables snipped; see above]
I wonder which one I should believe?
What really matters are the confidence intervals of the ratings; this is why playing more games is desirable, to narrow the intervals. It is like coin tossing, which has a 50% chance of heads and 50% of tails: flip a coin ten times and you may well not obtain five heads and five tails (that outcome has a probability of only 252/1024 ~ 24.61%, using the probability mass function of the binomial distribution).
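That coin-toss probability can be checked in one line:

```python
from math import comb

# P(exactly 5 heads in 10 fair coin flips)
p = comb(10, 5) / 2**10
print(f"{comb(10, 5)}/1024 = {p:.4f}")  # → 252/1024 = 0.2461
```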

Bayeselo and Ordo use different algorithms, hence the different results that you note. In your examples, the average rating of Bayeselo is 0 and the average rating in Ordo is 2300. Rating calculations are important for their differences, not for their absolute value: you could have only two engines with ratings 100 and 0, or 2700 and 2600 and it would be the same.

There were a lot of topics in TalkChess in the past about Ordo vs. Bayeselo (just search them with the help of the search engine of the forum) and I do not have any preference. Both are valid in my opinion. You will see that the Elo span in your example with Bayeselo was 60 - (-40) = 100 Bayeselo and it is 2370.7 - 2256.6 = 114.1 Ordo points. It is important to note that, in general, 1 Bayeselo is not equal to 1 Ordo point, and it is not bad at all. Which is better: Celsius or Fahrenheit? Both of them.

Peter Österlund shared a method for comparing lists from different rating programmes: compute the average rating and the sample standard deviation of the ratings for each list, then compare z(engine_i, list_k) = [rating(engine_i, list_k) - average_rating(list_k)] / sample_standard_deviation(list_k) across the lists: the values should be similar [z(engine_i, list_1) versus z(engine_i, list_2) in this case, with i = 1, 2, 3, ..., 15]. Please bear in mind that the order of the 15 engines is not the same in each list.
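A sketch of that comparison, using a handful of entries from the two tables above (all 15 engines would work the same way):

```python
from statistics import mean, stdev

# (bayeselo Elo, Ordo rating) for a few engines from the tables above
ratings = {
    "Embla-lmr-2006c2b": (60.0, 2370.7),
    "Embla2001-lmr":     (45.0, 2357.9),
    "Embla2065":         (-9.0, 2289.2),
    "Embla2001":         (-31.0, 2260.8),
    "Embla2043":         (-40.0, 2256.6),
}

for k, name in ((0, "bayeselo"), (1, "ordo")):
    vals = [r[k] for r in ratings.values()]
    m, s = mean(vals), stdev(vals)
    # z-score of each engine within this list
    for engine, r in ratings.items():
        print(f"{name:8} {engine:20} z = {(r[k] - m) / s:+.2f}")
```

The z-scores line up closely between the two lists even though the raw scales differ, which is the point of the method.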

Regards from Spain.

Ajedrecista.