I have compared the ratings produced by these three programs, for various databases and parameter settings, to the models they are based on. Ideally, the computed ratings would match what the models predict. In reality, this is not the case for any of the programs.
I used the results/games from IPON, ChessWar, and CCRL 40/4 in this study. I chose these databases because of the differences in how the engines are connected to each other (by connection I mean games played between opponents and games played against common opponents). IPON is a highly connected set of reasonably equal engines. ChessWar is a sparsely connected set with engines spanning a large range of strengths. And CCRL 40/4 is a large database with mixed characteristics.
Because Ordo and ELOStat lack an assumed prior distribution over results, games involving an engine that scored 100% or 0% were removed from the ChessWar database. In fact, I went beyond this. I found that, due to the lack of a prior, Ordo and ELOStat could not give sensible ratings for some engines that played only a few games. (Note: Bayeselo is not completely immune from this 'problem'; any ratings program needs a minimum number of games to give realistic ratings.) Also, ELOStat has an upper limit of 1500 engines. So I culled all engines that played fewer than 11 games, and repeated this process until no more engines could be removed by this filter.
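The culling has to be repeated because removing one engine's games can push another engine below the threshold. A minimal sketch of the idea, assuming a hypothetical list of (white, black, result) tuples rather than the actual PGN tooling I used:

```python
# Iterative culling sketch: drop engines with fewer than min_games games,
# then repeat, since each pass can push other engines below the threshold.
from collections import Counter

def cull(games, min_games=11):
    """games: list of (white, black, result) tuples; returns the filtered list."""
    while True:
        counts = Counter()
        for white, black, _ in games:
            counts[white] += 1
            counts[black] += 1
        keep = {e for e, n in counts.items() if n >= min_games}
        filtered = [g for g in games if g[0] in keep and g[1] in keep]
        if len(filtered) == len(games):  # fixed point: nothing more to remove
            return games
        games = filtered
```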
Then, for each database and program, I computed the ratings using the appropriate parameter values and/or commands. Next, using NotePad++ (regular expressions can be great once you learn how to use them), several of Norm Pollock's PGN utilities (once again, thank you very much, Norm), a small Python script I got help with at Stack Overflow, and Open Office, I extracted and prepared the data for ZunZun, a great website that lets one perform statistical regressions. The data consisted of White Elo (in 4-Elo bins), White score, and the number of samples in each bin (minimum of 4 samples). My method of comparison was to treat the denominator of 400 in each model as unknown and estimate it from the data by weighted regression. If the computed ratings follow the model, the estimated denominator should come out close to 400.
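The regression itself amounts to fitting the model with the denominator as the free parameter, weighting each bin by its sample count. A rough sketch of the same fit, with toy bins and a simple grid search standing in for the weighted regression done at ZunZun:

```python
# Sketch of the denominator estimate, assuming binned data rows of
# (elo_diff, observed_white_score_pct, n_samples). A grid search over the
# denominator stands in for ZunZun's weighted regression machinery.
def model(elo_diff, denom):
    """Expected White score (%) under the logistic model with a free denominator."""
    return 50.0 * (1.0 + 1.0 / (1.0 + 10.0 ** (-elo_diff / denom))
                       - 1.0 / (1.0 + 10.0 ** (elo_diff / denom)))

def fit_denominator(rows, lo=200.0, hi=500.0, step=0.25):
    """Weighted least squares: each bin's squared error weighted by its sample count."""
    def sse(denom):
        return sum(n * (model(d, denom) - s) ** 2 for d, s, n in rows)
    candidates = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return min(candidates, key=sse)

# Toy bins generated from denominator 400, just to show the fit recovers it.
rows = [(d, model(d, 400.0), 25) for d in range(-400, 401, 4)]
print(fit_denominator(rows))  # 400.0
```

On real binned results, the minimizer lands wherever the data pull it, which is how the estimates in the tables below can differ from 400.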
For Ordo, the model for the White score (in percent) is:
50*(1 + 1/(1+10^(-Elo Diff/400)) - 1/(1+10^(Elo Diff/400)))
Note: the actual model uses 398.34 instead of 400.
ELOStat:
50*(1 + 1/(1+10^(-Elo Diff/400)) - 1/(1+10^(Elo Diff/400)))
This is a guess on my part. I have not been able to find the articles from Computerschach und Spiele that present the statistical theory behind ELOStat.
For Bayeselo:
50*(1 + 1/(1+10^((-Elo Diff - Advantage + DrawElo)/400)) - 1/(1+10^((Elo Diff + Advantage + DrawElo)/400)))
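As a quick sanity check on this formula (written out below in Python), the default parameters Advantage=32.8 and DrawElo=97.3 give two equally rated engines an expected White score of about 54.4%, i.e. the built-in White advantage:

```python
# The Bayeselo expected-score formula above, with its default parameters.
def bayeselo_white_score(elo_diff, advantage=32.8, draw_elo=97.3):
    """Expected White score (%) under the Bayeselo model."""
    p_white_win = 1.0 / (1.0 + 10.0 ** ((-elo_diff - advantage + draw_elo) / 400.0))
    p_black_win = 1.0 / (1.0 + 10.0 ** ((elo_diff + advantage + draw_elo) / 400.0))
    return 50.0 * (1.0 + p_white_win - p_black_win)

print(round(bayeselo_white_score(0.0), 1))  # 54.4 for two equal engines
```

With Advantage set to 0, the formula returns exactly 50% for equal engines, as it should.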
I also noted whether the computed ratings exhibited an offset from the model (a horizontal displacement). All models exhibit an offset when no White advantage correction is used; I take that as confirmation that the White advantage term in the Bayeselo model is justified.
Here are tables showing the estimated denominator for each database and each program. For Bayeselo, I made one estimate using the default values (Advantage=32.8; DrawElo=97.3; prior=2) and another with prior=0.1 and the Advantage and DrawElo values computed from each database. For Ordo, I have a version that allows a White advantage value to be specified, so I made one estimate using no White advantage and a second using a White advantage found by regression. ELOStat accepts no parameter input, so I made only one estimate with that program.
IPON                     Estimated Denominator   95% Confidence Interval
ELOStat                  363.87                  (351.30, 376.43)
Ordo                     397.86                  (384.78, 410.94)
Ordo with offset (38.7)  399.64                  (394.63, 404.65)
Bayeselo default         320.46                  (315.47, 325.45)
Bayeselo adjusted        277.61                  (270.02, 285.20)
  (Advantage=50.3872; DrawElo=167.083; prior=0.1)
[Graphs: Ordo; Ordo with offset of 38.7; Bayeselo (default); Bayeselo (computed values)]
CCRL 40/4                 Estimated Denominator   95% Confidence Interval
ELOStat                   374.63                  (366.66, 382.61)
Ordo                      396.61                  (387.59, 405.63)
Ordo with offset (26.35)  399.08                  (393.79, 404.37)
Bayeselo default          361.44                  (355.42, 367.46)
Bayeselo adjusted         351.33                  (344.44, 358.21)
  (Advantage=32.0745; DrawElo=118.751; prior=0.1)
[Graphs: Ordo; Ordo with offset of 26.35; Bayeselo (default); Bayeselo (computed values)]
ChessWar                  Estimated Denominator   95% Confidence Interval
ELOStat                   300.45                  (289.79, 311.12)
Ordo                      372.58                  (359.00, 386.16)
Ordo with offset (28.34)  376.71                  (364.29, 389.14)
Bayeselo default          304.94                  (292.27, 317.61)
Bayeselo adjusted         338.71                  (324.49, 352.92)
  (Advantage=33.2002; DrawElo=104.259; prior=0.1)
[Graphs: Ordo; Ordo with offset of 28.34; Bayeselo (default); Bayeselo (computed values)]
As can be seen from the tables and graphs, ELOStat exhibits both ratings compression and an offset. The offset can be explained by the lack of a method to account for White's advantage. I have no answer at this time as to why the ratings are compressed* relative to its model (if it is indeed the correct model).
*The compression can be seen in this example:
From the computed ratings for ChessWar, the estimated denominator is 300.45. Using this value gives an expected White score of 68.27% when White is rated 100 Elo above Black. From ELOStat's model, the expected score when White is 100 Elo stronger than Black is 64.01%. In the model, a score of 68.27% corresponds to an Elo difference of approximately 133.
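These numbers are easy to reproduce. Note that the two-term formula above reduces to the single logistic 100/(1 + 10^(-Elo Diff/denominator)), which the sketch below uses:

```python
# Reproducing the compression example: the same 68.27% score implies
# different Elo differences under the fitted and the model denominators.
import math

def expected_score(elo_diff, denom):
    """Expected White score (%) from the logistic model."""
    return 100.0 / (1.0 + 10.0 ** (-elo_diff / denom))

print(round(expected_score(100, 300.45), 2))  # 68.27 (fitted ChessWar denominator)
print(round(expected_score(100, 400.0), 2))   # 64.01 (the model's denominator)

# Inverting the model: what Elo difference corresponds to 68.27% at denom 400?
p = expected_score(100, 300.45) / 100.0
print(round(400.0 * math.log10(p / (1.0 - p))))  # 133
```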
Likewise, Ordo suffers from an offset. However, it shows ratings compression only for the ChessWar database, and that compression is less than for the other programs.
I did not expect the results I found for Bayeselo. As expected, it exhibited no offset, but its ratings are compressed relative to its model. At the moment, I have no explanation for this behavior; I would have guessed that using the computed values for Advantage and DrawElo, and possibly decreasing the prior, would remove all compression.
One more test I could perform is one that Rémi Coulom used to judge Bayeselo: for any of the databases, use half of the data points to rate the engines, then examine the other half to see how well those ratings predicted the results. I will probably do this once I figure out how to manage it.
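A minimal sketch of how the prediction half might be scored, assuming the ratings have already been computed from the first half (the data layout and the use of plain mean squared error are my own assumptions, not necessarily what Rémi Coulom did):

```python
# Hypothetical scoring for the holdout test: given ratings fit on one half
# of the games, measure how well they predict the other half.
def expected_score(rating_white, rating_black, denom=400.0):
    """Expected White score (0..1) from the basic logistic model, ignoring White advantage."""
    return 1.0 / (1.0 + 10.0 ** ((rating_black - rating_white) / denom))

def holdout_mse(heldout_games, ratings):
    """heldout_games: (white, black, result) tuples, result in {1.0, 0.5, 0.0}."""
    errs = [(expected_score(ratings[w], ratings[b]) - s) ** 2
            for w, b, s in heldout_games]
    return sum(errs) / len(errs)
```

A lower error on the held-out half would mean a program's ratings predict unseen results better, which is arguably the property a rating list exists to provide.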
The graphs, the regression reports (in PDF format), and the data (Open Office spreadsheets) can be found here. You can also find this report at my blog.