Yet Another Testing Question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Yet Another Testing Question

Post by bob »

Adam Hair wrote:
Daniel Shawul wrote:I used the default of course, since Rémi suggests that to avoid ugly error-margin reports in some cases, as shown here. Elostat also gives the same error margins. It seems to me that even in the case where we have one pool (two players), the default gives better results. If I am given one result set with 200-200-100, then I can calculate the margin of error by hand to be 27, which the default shows, but I get half of that with exactdist...

Code: Select all

ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   27   27   500   50%     0   20%
   2 Player1     0   27   27   500   50%     0   20%
ResultSet-EloRating>covariance
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   14   14   500   50%     0   20%
   2 Player1     0   14   14   500   50%     0   20%
With the larger number of games and all methods:

Code: Select all

ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    2    2 100000   50%     0   20%
   2 Player1     0    2    2 100000   50%     0   20%
ResultSet-EloRating>covariance
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    1    1 100000   50%     0   20%
   2 Player1     0    1    1 100000   50%     0   20%
ResultSet-EloRating>exactdist
00:00:00,05
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    3    3 100000   50%     0   20%
   2 Player1     0    3    3 100000   50%     0   20%
ResultSet-EloRating>jointdist
00:00:08,15
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0 -1439 -1439 100000   50%     0   20%
   2 Player1     0 -1439 -1439 100000   50%     0   20%
ResultSet-EloRating>los
Not exactly getting more accurate, is it? Maybe the algorithm has problems with two players...
Interesting. When I use joint distribution on a smaller pgn, I get this:

Code: Select all

ResultSet-EloRating>jointdist
00:00:00,01
ResultSet-EloRating>ratings
Rank Name       Elo    +    - games score oppo. draws
   1 Engine_B     0   16   13   504   50%     0   25%
   2 Engine_A     0   16   13   504   50%     0   25%
ResultSet-EloRating>
When I increase the # of games, I get the same as you:

Code: Select all

ResultSet-EloRating>jointdist
00:00:00,17
ResultSet-EloRating>ratings
Rank Name       Elo    +    - games score oppo. draws
   1 Engine_D     0 -1499 -1499 22176   50%     0   33%
   2 Engine_C     0 -1499 -1499 22176   50%     0   33%
ResultSet-EloRating>
I am not sure I am interpreting your results correctly, as last night was a long night of traveling. But if you introduce an opponent with a small number of games, and therefore high uncertainty, doesn't that increase the uncertainty for ANY program that plays against them, in terms of the error bar?

I'll try to re-read after I recover from traveling to see what I might have misread...
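As a sanity check on Daniel's hand calculation above, here is a minimal sketch (plain Python, not part of bayeselo) of the standard two-sided 95% interval for a 200-200-100 result set, converted to Elo through the usual logistic curve; the ±27 figure falls out directly:

Code: Select all

import math

# Result set quoted above: 200 wins, 200 losses, 100 draws.
wins, losses, draws = 200, 200, 100
n = wins + losses + draws

# Per-game score: win = 1, draw = 0.5, loss = 0.
mean = (wins + 0.5 * draws) / n                  # 0.5
var = (wins + 0.25 * draws) / n - mean ** 2      # E[x^2] - E[x]^2 = 0.2
se = math.sqrt(var / n)                          # standard error of the mean = 0.02

# Elo as a function of expected score: elo(s) = -400 * log10(1/s - 1).
def elo(score):
    return -400.0 * math.log10(1.0 / score - 1.0)

half_width = 1.96 * se                           # two-sided 95%
print(round(elo(mean + half_width) - elo(mean))) # 27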
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Yet Another Testing Question

Post by Adam Hair »

bob wrote:
I am not sure I am interpreting your results correctly, as last night was a long night of traveling. But if you introduce an opponent with a small number of games, and therefore high uncertainty, doesn't that increase the uncertainty for ANY program that plays against them, in terms of the error bar?

I'll try to re-read after I recover from traveling to see what I might have misread...
That is true. But we were talking about that in the slightly earlier posts. I noticed something curious about the joint-distribution confidence intervals in Daniel's last post and changed direction to comment on that.

I think we found that the default confidence intervals from Bayeselo are unaffected by introducing an opponent with a small number of games, while the intervals produced by 'exactdist' and 'covariance' are affected. Since none of us were on the same page, there was some dispute about how Bayeselo produces confidence intervals.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Yet Another Testing Question

Post by bob »

Adam Hair wrote:
That is true. But we were talking about that in the slightly earlier posts. I noticed something curious about the joint-distribution confidence intervals in Daniel's last post and changed direction to comment on that.

I think we found that the default confidence intervals from Bayeselo are unaffected by introducing an opponent with a small number of games, while the intervals produced by 'exactdist' and 'covariance' are affected. Since none of us were on the same page, there was some dispute about how Bayeselo produces confidence intervals.
I've been using covariance. Remi suggested that when we were talking about my cluster testing here several years ago...
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Yet Another Testing Question

Post by Rémi Coulom »

Daniel Shawul wrote:
I used the default of course, since Rémi suggests that to avoid ugly error-margin reports in some cases, as shown here. [...] Not exactly getting more accurate, is it? Maybe the algorithm has problems with two players...
Hi,

I have some explanations of the algorithms in that other post:
http://talkchess.com/forum/viewtopic.ph ... 99&t=22731

jointdist works with a naive sampling of the joint posterior distribution. Sampling is done on a constant-resolution uniform grid. When the number of games gets high, the posterior becomes so narrow relative to the fixed grid spacing that the grid may have no sample in the high-probability area, and the method may produce wrong results. This could be improved by using a form of MCMC; Gibbs sampling would work well even with a high number of players. But I probably won't waste time on this.
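To make the grid problem concrete, here is a minimal sketch (plain Python; the grid step is an assumption for illustration, not bayeselo's actual resolution). Under the Gaussian approximation the posterior width of a rating shrinks like 1/sqrt(n), so a constant-resolution grid eventually has at most one point, or none, inside the region that carries almost all of the probability:

Code: Select all

import math

# Gaussian-approximation posterior sd (in Elo) of a 50% score with a given
# draw ratio; 400 / (ln 10 * 0.25) ~ 694.4 is d(Elo)/d(score) at 50%.
def elo_sd(n_games, draw_ratio=0.2):
    score_var = 0.25 * (1.0 - draw_ratio)   # per-game score variance at 50%
    return 694.4 * math.sqrt(score_var / n_games)

GRID_STEP = 4.0   # hypothetical constant grid resolution, in Elo (assumed)

for n in (500, 10_000, 100_000, 1_000_000):
    sd = elo_sd(n)
    points = int(4.0 * sd / GRID_STEP) + 1   # grid points inside +/- 2 sd
    print(f"{n:>9} games: sd ~ {sd:5.2f} Elo, ~{points} grid point(s) in +/-2 sd")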

I'd like to repeat my advice: ignore confidence intervals, and use LOS. That is the only reasonable way to compare two engines with bayeselo.

If you don't like the Gaussian approximation of LOS, the best way might be to compute LOS from MCMC samples. But certainly not from confidence intervals: you can tell nothing from confidence intervals when there is covariance and more than two players, and in practice there is always covariance.
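For readers who want the two-engine Gaussian version of LOS spelled out, a sketch (plain Python; the rating means, variances, and covariance are made-up numbers, and the difference formula is the one Daniel writes out below):

Code: Select all

import math

# Likelihood of superiority under the Gaussian approximation:
# elo_A - elo_B ~ N(mu_A - mu_B, V_A + V_B - 2*Cov_AB), LOS = P(diff > 0).
def los(mu_a, mu_b, var_a, var_b, cov_ab):
    sigma = math.sqrt(var_a + var_b - 2.0 * cov_ab)
    z = (mu_a - mu_b) / sigma
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))  # standard normal CDF

# Illustrative numbers: a 5-Elo edge, sd of 7 Elo for each rating.
print(los(5.0, 0.0, 49.0, 49.0, 30.0))  # ~0.79 with strong positive covariance
print(los(5.0, 0.0, 49.0, 49.0, 0.0))   # ~0.69 if covariance is ignored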

That old post shows an example that illustrates why confidence intervals are evil:
http://www.open-aurec.com/wbforum/viewt ... t=60#p7614

That paper gives an efficient method for Gibbs sampling:
http://arxiv.org/abs/1011.1761
Maybe Daniel will find implementing this method in bayeselo an interesting exercise.

Rémi
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Yet Another Testing Question

Post by Daniel Shawul »

Hi Remi
Thanks for the explanation! I have never looked at the details of how bayeselo does error bars (except for the default), so I do not know what jointdist and exactdist do. I will definitely take a closer look to get a better idea. The elostat way is easier to grasp, since it assumes a Gaussian and reports the variance (or standard deviation) alone; consequently, its error bars don't make much sense with multiple players, because there is covariance. LOS definitely seems better, since it reports a different value for each opponent and includes the covariance as well: sigma(A - B) = sqrt(V_A + V_B - 2*Cov_AB). The Gaussian assumption seems justified to me for >= 30 games, because I usually get a normal distribution from uniform pseudo-random numbers (proportions) at that sample size. However, Wikipedia says binomial proportions may sometimes be non-normal even for larger n. So I guess MCMC is used to sample this distribution. I will dig deeper into the code.
Thanks
Daniel
Rémi Coulom
Posts: 438
Joined: Mon Apr 24, 2006 8:06 pm

Re: Yet Another Testing Question

Post by Rémi Coulom »

Daniel Shawul wrote:
[...] The Gaussian assumption seems justified to me for >= 30 games, because I usually get a normal distribution from uniform pseudo-random numbers (proportions) at that sample size. However, Wikipedia says binomial proportions may sometimes be non-normal even for larger n. [...]
It may be non-normal with large n only if the win rate is close to 0% or 100%. Otherwise, the Gaussian approximation becomes very accurate after just a few games, so in most ordinary testing of chess programs the error of the Gaussian approximation is completely negligible.
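A quick simulation sketch of that caveat (plain Python; the win rates and match length are my own illustrative choices). The skewness of a binomial proportion is (1 - 2p) / sqrt(n p (1 - p)), so it dies off quickly at p = 0.5 but remains visible when p is near 0 or 1:

Code: Select all

import random
import statistics

# Empirical skewness of the observed win rate over many simulated matches.
def winrate_skew(p, n_games, trials=20_000, seed=1):
    rng = random.Random(seed)
    rates = [sum(rng.random() < p for _ in range(n_games)) / n_games
             for _ in range(trials)]
    mu = statistics.fmean(rates)
    sd = statistics.pstdev(rates)
    return statistics.fmean((r - mu) ** 3 for r in rates) / sd ** 3

print(winrate_skew(0.50, 100))  # ~0.0: effectively normal already
print(winrate_skew(0.98, 100))  # ~-0.7: still clearly skewed at n = 100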

Rémi
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Yet Another Testing Question

Post by bob »

Adam Hair wrote:
Daniel Shawul wrote:
Yes, the overwhelming number of games played by the first 10 opponents keeps the uncertainty in their Elo estimates low. However, Miguel's point is not without merit.

Let's look at a more realistic example:
Ha? I didn't come up with the example; I only showed that it didn't behave as he explained.
I did not say that you came up with the example. And Miguel's example is unrealistic in the sense that rarely (if ever) are ~1 million games per engine combined into a PGN.
"rarely" being the key word. I do this fairly frequently. Say I test a set of changes and have to run 100 complete tests. That is 100 runs of 30 K games, with 6 K games per opponent. So stockfish will have 600K games against various versions of Crafty. When trying to tune some parameters, I have run hundreds of such matches, in one run, where stockfish would then have well over a million games total...