I have compared the ratings produced by these three programs, for various databases and parameter settings, to the models they are based on. Ideally, the computed ratings would match what the models predict. In reality, this is not the case for any of the programs.
I used the results/games from IPON, ChessWar, and CCRL 40/4 in this study. I chose these databases because of the differences in how the engines are connected to each other (by connection I mean games played between opponents and games played against common opponents). IPON is a highly connected set of reasonably equal engines. ChessWar is a sparsely connected set with engines spanning a large range of strengths. And CCRL 40/4 is a large database with mixed characteristics.
Because Ordo and ELOStat lack an assumed prior distribution over results, games involving an engine that scored 100% or 0% were removed from the ChessWar database. In fact, I went beyond this. I found that, due to the lack of a prior, Ordo and ELOStat could not give sensible ratings for some engines that played only a few games. (Note: Bayeselo is not completely immune from this 'problem'; any ratings program needs a minimum number of games to give realistic ratings.) Also, ELOStat has an upper limit of 1500 engines. So I culled all engines that played fewer than 11 games, and repeated this process until no more engines could be removed by this filter.
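The culling has to be repeated because removing one engine's games can push another engine below the threshold. A minimal sketch of the idea, assuming a hypothetical list of (white, black, result) tuples rather than the actual PGN tooling I used:

```python
# Iterative culling sketch: drop engines with fewer than min_games games,
# then repeat, since each pass can push other engines below the threshold.
from collections import Counter

def cull(games, min_games=11):
    """games: list of (white, black, result) tuples; returns the filtered list."""
    while True:
        counts = Counter()
        for white, black, _ in games:
            counts[white] += 1
            counts[black] += 1
        keep = {e for e, n in counts.items() if n >= min_games}
        filtered = [g for g in games if g[0] in keep and g[1] in keep]
        if len(filtered) == len(games):  # fixed point: nothing more to remove
            return games
        games = filtered
```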
Then, for each database and program, I computed the ratings using the appropriate parameter values and/or commands. Next, using NotePad++ (regular expressions can be great once you learn how to use them), several of Norm Pollock's PGN utilities (once again, thank you very much, Norm), a small Python script I got help with at Stack Overflow, and Open Office, I extracted and prepared the data for ZunZun, a great website that lets one perform statistical regressions. The data consisted of White Elo (in 4-Elo bins), White score, and the number of samples in each bin (minimum of 4 samples). My method of comparison was to treat the denominator of 400 in each model as unknown and estimate it from the data by weighted regression. If the computed ratings follow the model, the estimated denominator should come out close to 400.
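The regression itself amounts to fitting the model with the denominator as the free parameter, weighting each bin by its sample count. A rough sketch of the same fit, with toy bins and a simple grid search standing in for the weighted regression done at ZunZun:

```python
# Sketch of the denominator estimate, assuming binned data rows of
# (elo_diff, observed_white_score_pct, n_samples). A grid search over the
# denominator stands in for ZunZun's weighted regression machinery.
def model(elo_diff, denom):
    """Expected White score (%) under the logistic model with a free denominator."""
    return 50.0 * (1.0 + 1.0 / (1.0 + 10.0 ** (-elo_diff / denom))
                       - 1.0 / (1.0 + 10.0 ** (elo_diff / denom)))

def fit_denominator(rows, lo=200.0, hi=500.0, step=0.25):
    """Weighted least squares: each bin's squared error weighted by its sample count."""
    def sse(denom):
        return sum(n * (model(d, denom) - s) ** 2 for d, s, n in rows)
    candidates = [lo + i * step for i in range(int((hi - lo) / step) + 1)]
    return min(candidates, key=sse)

# Toy bins generated from denominator 400, just to show the fit recovers it.
rows = [(d, model(d, 400.0), 25) for d in range(-400, 401, 4)]
print(fit_denominator(rows))  # 400.0
```

On real binned results, the minimizer lands wherever the data pull it, which is how the estimates in the tables below can differ from 400.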
For Ordo, the model for the White score (in percent) is:
50*(1 + 1/(1+10^(-Elo Diff/400)) - 1/(1+10^(Elo Diff/400)))
Note: the actual model uses 398.34 instead of 400.
ELOStat:
50*(1 + 1/(1+10^(-Elo Diff/400)) - 1/(1+10^(Elo Diff/400)))
This is a guess on my part. I have not been able to find the articles from Computerschach und Spiele that present the statistical theory behind ELOStat.
For Bayeselo:
50*(1 + 1/(1+10^((-Elo Diff - Advantage + DrawElo)/400)) - 1/(1+10^((Elo Diff + Advantage + DrawElo)/400)))
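As a quick sanity check on this formula (written out below in Python), the default parameters Advantage=32.8 and DrawElo=97.3 give two equally rated engines an expected White score of about 54.4%, i.e. the built-in White advantage:

```python
# The Bayeselo expected-score formula above, with its default parameters.
def bayeselo_white_score(elo_diff, advantage=32.8, draw_elo=97.3):
    """Expected White score (%) under the Bayeselo model."""
    p_white_win = 1.0 / (1.0 + 10.0 ** ((-elo_diff - advantage + draw_elo) / 400.0))
    p_black_win = 1.0 / (1.0 + 10.0 ** ((elo_diff + advantage + draw_elo) / 400.0))
    return 50.0 * (1.0 + p_white_win - p_black_win)

print(round(bayeselo_white_score(0.0), 1))  # 54.4 for two equal engines
```

With Advantage set to 0, the formula returns exactly 50% for equal engines, as it should.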
I also noted whether the computed ratings exhibited an offset from the model (a horizontal displacement). All models exhibit an offset when no White advantage correction is used; I take that as confirmation that the White advantage term in the Bayeselo model is justified.
Here are tables showing the estimated denominator for each database and each program. For Bayeselo, I made one estimate using the default values (Advantage=32.8; DrawElo=97.3; prior=2) and another with prior=0.1 and the Advantage and DrawElo values computed from each database. For Ordo, I have a version that allows a White advantage value to be specified, so I made one estimate using no White advantage and a second using a White advantage found by regression. ELOStat accepts no parameter input, so I made only one estimate with that program.
IPON                     Estimated Denominator   95% Confidence Interval
ELOStat                  363.87                  (351.30, 376.43)
Ordo                     397.86                  (384.78, 410.94)
Ordo with offset (38.7)  399.64                  (394.63, 404.65)
Bayeselo default         320.46                  (315.47, 325.45)
Bayeselo adjusted        277.61                  (270.02, 285.20)
  (Advantage=50.3872; DrawElo=167.083; prior=0.1)
[Graphs: Ordo; Ordo with offset of 38.7; Bayeselo (default); Bayeselo (computed values)]
CCRL 40/4                 Estimated Denominator   95% Confidence Interval
ELOStat                   374.63                  (366.66, 382.61)
Ordo                      396.61                  (387.59, 405.63)
Ordo with offset (26.35)  399.08                  (393.79, 404.37)
Bayeselo default          361.44                  (355.42, 367.46)
Bayeselo adjusted         351.33                  (344.44, 358.21)
  (Advantage=32.0745; DrawElo=118.751; prior=0.1)
[Graphs: Ordo; Ordo with offset of 26.35; Bayeselo (default); Bayeselo (computed values)]
ChessWar                  Estimated Denominator   95% Confidence Interval
ELOStat                   300.45                  (289.79, 311.12)
Ordo                      372.58                  (359.00, 386.16)
Ordo with offset (28.34)  376.71                  (364.29, 389.14)
Bayeselo default          304.94                  (292.27, 317.61)
Bayeselo adjusted         338.71                  (324.49, 352.92)
  (Advantage=33.2002; DrawElo=104.259; prior=0.1)
[Graphs: Ordo; Ordo with offset of 28.34; Bayeselo (default); Bayeselo (computed values)]
As can be seen from the tables and graphs, ELOStat exhibits both ratings compression and an offset. The offset can be explained by the lack of a method to account for White's advantage. I have no answer at this time as to why the ratings are compressed* relative to its model (if it is indeed the correct model).
*The compression can be seen in this example:
From the computed ratings for ChessWar, the estimated denominator is 300.45. Using this value gives an expected White score of 68.27% when White is rated 100 Elo above Black. From ELOStat's model, the expected score when White is 100 Elo stronger than Black is 64.01%. In the model, a score of 68.27% corresponds to an Elo difference of approximately 133.
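These numbers are easy to reproduce. Note that the two-term formula above reduces to the single logistic 100/(1 + 10^(-Elo Diff/denominator)), which the sketch below uses:

```python
# Reproducing the compression example: the same 68.27% score implies
# different Elo differences under the fitted and the model denominators.
import math

def expected_score(elo_diff, denom):
    """Expected White score (%) from the logistic model."""
    return 100.0 / (1.0 + 10.0 ** (-elo_diff / denom))

print(round(expected_score(100, 300.45), 2))  # 68.27 (fitted ChessWar denominator)
print(round(expected_score(100, 400.0), 2))   # 64.01 (the model's denominator)

# Inverting the model: what Elo difference corresponds to 68.27% at denom 400?
p = expected_score(100, 300.45) / 100.0
print(round(400.0 * math.log10(p / (1.0 - p))))  # 133
```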
Likewise, Ordo suffers from an offset. However, it shows ratings compression only for the ChessWar database, and that compression is less than for the other programs.
I did not expect the results I found for Bayeselo. As expected, it exhibited no offset, but its ratings are compressed relative to its model. At the moment, I have no explanation for this behavior; I would have guessed that using the computed values for Advantage and DrawElo, and possibly decreasing the prior, would remove all compression.
One more test I could perform is one that Rémi Coulom used to judge Bayeselo: for any of the databases, use half of the data points to rate the engines, then examine the other half to see how well those ratings predicted the results. I will probably do this once I figure out how to manage it.
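A minimal sketch of how the prediction half might be scored, assuming the ratings have already been computed from the first half (the data layout and the use of plain mean squared error are my own assumptions, not necessarily what Rémi Coulom did):

```python
# Hypothetical scoring for the holdout test: given ratings fit on one half
# of the games, measure how well they predict the other half.
def expected_score(rating_white, rating_black, denom=400.0):
    """Expected White score (0..1) from the basic logistic model, ignoring White advantage."""
    return 1.0 / (1.0 + 10.0 ** ((rating_black - rating_white) / denom))

def holdout_mse(heldout_games, ratings):
    """heldout_games: (white, black, result) tuples, result in {1.0, 0.5, 0.0}."""
    errs = [(expected_score(ratings[w], ratings[b]) - s) ** 2
            for w, b, s in heldout_games]
    return sum(errs) / len(errs)
```

A lower error on the held-out half would mean a program's ratings predict unseen results better, which is arguably the property a rating list exists to provide.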
The graphs, the regression reports (in PDF format), and the data (Open Office spreadsheets) can be found here. You can also find this report at my blog.