Daniel Shawul wrote: I used the default of course, since Remi suggests that to avoid ugly error-margin reports in some cases, as shown here. Elostat also gives the same error margins. It seems to me that even when we have one pool (two players), the default gives better results. If I am given one result set with 200-200-100, then I can calculate by hand a margin of error of 27, which the default shows, but I get half of that with exactdist...
ResultSet-EloRating>jointdist
00:00:00,17
ResultSet-EloRating>ratings
Rank Name      Elo     +     -  games score oppo. draws
   1 Engine_D    0 -1499 -1499  22176   50%     0   33%
   2 Engine_C    0 -1499 -1499  22176   50%     0   33%
ResultSet-EloRating>
I am not sure I am interpreting your results correctly, as it was a long night of traveling last night. But if you introduce an opponent with a small number of games, and therefore high uncertainty, doesn't that increase the uncertainty (in terms of the error bar) for ANY program that plays against them?
I'll try to re-read after I recover from traveling to see what I might have misread...
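Daniel's "by hand" figure of about 27 Elo for the 200-200-100 result can be reproduced with an elostat-style Gaussian approximation of the score. A minimal sketch (the function name and the 95% z-value are my own choices for illustration, not anything taken from bayeselo):

```python
import math

# Sketch: Gaussian (elostat-style) error margin for a wins-losses-draws
# result, converted from score space to Elo via the logistic curve.
def elo_error_margin(wins, losses, draws, z=1.96):
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (outcomes are 1, 0, and 0.5)
    var = (wins * 1.0 + draws * 0.25) / n - score ** 2
    se = math.sqrt(var / n)               # standard error of the mean score
    lo, hi = score - z * se, score + z * se
    to_elo = lambda s: -400 * math.log10(1.0 / s - 1.0)
    return (to_elo(hi) - to_elo(lo)) / 2  # half-width of the interval, in Elo

print(round(elo_error_margin(200, 200, 100)))  # ~27, matching Daniel's figure
```

Note that draws shrink the per-game variance, which is why a drawish 500-game match gives a tighter margin than a pure win/loss model would suggest.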
That is true, but we were talking about that in slightly earlier posts. I noticed something curious about the joint-distribution confidence intervals in Daniel's last post and changed direction to comment on that.
I think we found that the default confidence intervals in Bayeselo are unaffected by introducing an opponent with a small number of games, while the intervals produced by 'exactdist' and 'covariance' are affected. Since none of us were on the same page, there was some dispute about how Bayeselo produces confidence intervals.
I've been using covariance. Remi suggested that when we were talking about my cluster testing here several years ago...
jointdist works with a naive sampling of the joint posterior distribution. Sampling is done on a constant-resolution uniform grid. When the number of games gets high, the uniform grid becomes so coarse that it has no sample in the high-probability area, and may produce wrong results. This could be improved by using a form of MCMC; Gibbs sampling would work well even with a high number of players. But I probably won't spend time on this.
I'd like to repeat my advice: ignore confidence intervals, and use LOS. That is the only reasonable way to compare two engines with bayeselo.
If you don't like the Gaussian approximation of LOS, the best way might be to compute LOS with MCMC samples. But certainly not confidence intervals. You can tell nothing from confidence intervals when there is covariance and more than two players. In practice, there is always covariance.
That paper gives an efficient method for Gibbs sampling: http://arxiv.org/abs/1011.1761
Maybe Daniel will find the implementation of this method in bayeselo an interesting exercise.
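For the two-player case, Remi's suggestion of computing LOS from MCMC samples can be sketched with a plain random-walk Metropolis sampler (not the Gibbs method from the linked paper). The draw model here, a BayesElo-style fixed "drawelo" offset of 97 and a flat prior on the Elo difference, is an assumption for illustration, not bayeselo's actual code:

```python
import math
import random

# Sketch of LOS-from-MCMC for two players: sample the posterior of the
# Elo difference d, then report the fraction of samples with d > 0.
def log_likelihood(d, wins, losses, draws, drawelo=97.0):
    # Assumed draw model: a fixed drawelo offset shifts both win curves
    p_win = 1.0 / (1.0 + 10 ** ((drawelo - d) / 400.0))
    p_loss = 1.0 / (1.0 + 10 ** ((drawelo + d) / 400.0))
    p_draw = max(1.0 - p_win - p_loss, 1e-12)
    return (wins * math.log(p_win) + losses * math.log(p_loss)
            + draws * math.log(p_draw))

def los_mcmc(wins, losses, draws, steps=20000, burn=2000,
             step_size=10.0, seed=1):
    random.seed(seed)
    d = 0.0
    ll = log_likelihood(d, wins, losses, draws)
    above = 0
    for i in range(steps + burn):
        cand = d + random.gauss(0.0, step_size)   # random-walk proposal
        cand_ll = log_likelihood(cand, wins, losses, draws)
        # Metropolis acceptance rule (flat prior on d)
        if cand_ll >= ll or random.random() < math.exp(cand_ll - ll):
            d, ll = cand, cand_ll
        if i >= burn:
            above += d > 0
    return above / steps

# Daniel's symmetric 200-200-100 result: LOS should come out near 0.5
print(los_mcmc(200, 200, 100))
```

With more than two players, this single-variable walk would be replaced by sweeping the players one at a time, which is where Gibbs sampling comes in.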
Hi Remi
Thanks for the explanation! I have never looked at the details of how bayeselo does error bars (except for the default), so I do not know what jointdist and exactdist do. I will definitely take a closer look to get a better idea. The elostat way is easier to grasp since it assumes a Gaussian and reports the variance (or standard deviation) alone, but error bars with multiple players don't make much sense when there is covariance. LOS definitely seems better since it reports a different value for each opponent and includes the covariances as well: sd(A-B) = sqrt(V_A + V_B - 2*Cov_AB). The Gaussian assumption seems justified to me for >= 30 games, because I usually get approximately normal results from uniform pseudo-random numbers (proportions) at that sample size. However, Wikipedia says binomial proportions may sometimes be non-normal even for larger n. So I guess MCMC is used to sample this distribution. I will dig deeper into the code.
Thanks
Daniel
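Daniel's formula for the standard deviation of a rating difference, sd(A-B) = sqrt(V_A + V_B - 2*Cov_AB), turns into a Gaussian LOS in a couple of lines. A minimal sketch, with made-up numbers for illustration:

```python
import math

# Gaussian LOS: probability that A is stronger than B, given the Elo
# estimates, their variances, and their covariance (inputs illustrative).
def los_gaussian(elo_a, elo_b, var_a, var_b, cov_ab):
    diff_sd = math.sqrt(var_a + var_b - 2.0 * cov_ab)
    z = (elo_a - elo_b) / diff_sd
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # Phi(z), normal CDF

# A is 10 Elo ahead; each estimate has variance 196 (sd 14), with some
# positive covariance from playing shared opponents.
print(los_gaussian(10, 0, 196, 196, 50))
```

Note how positive covariance shrinks the denominator: sharing opponents makes the difference A-B better determined than the individual ratings, which is exactly why per-player error bars mislead.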
It may be non-normal with large n only if the win rate is close to 0% or 100%. Otherwise the Gaussian approximation is very accurate after a few games, so in most ordinary testing of chess programs the error of the Gaussian approximation is completely negligible.
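Remi's point about the approximation failing only near 0% or 100% is easy to see with the naive Gaussian (Wald) interval for a win rate: near extreme scores its lower bound can even go negative. A quick illustration with hypothetical game counts:

```python
import math

# Wald (Gaussian) interval for a win rate: fine at ordinary scores,
# nonsensical near 0% or 100% even with many games.
def wald_interval(wins, n, z=1.96):
    p = wins / n
    se = math.sqrt(p * (1.0 - p) / n)
    return p - z * se, p + z * se

print(wald_interval(250, 500))  # a sane interval around 50%
print(wald_interval(2, 500))    # lower bound dips below 0
```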
Yes, the overwhelming number of games played by the first 10 opponents keeps the uncertainty in their Elo estimates low. However, Miguel's point is not without merit.
Let's look at a more realistic example:
Ha? I didn't come up with the example; I only showed that it didn't behave as he explained.
I did not say that you came up with the example. And Miguel's example is unrealistic in the sense that rarely (if ever) are ~1 million games per engine combined into a single PGN.
"rarely" being the key word. I do this fairly frequently. Say I test a set of changes and have to run 100 complete tests. That is 100 runs of 30 K games, with 6 K games per opponent. So stockfish will have 600K games against various versions of Crafty. When trying to tune some parameters, I have run hundreds of such matches, in one run, where stockfish would then have well over a million games total...