Yet Another Testing Question

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Yet Another Testing Question

Post by Adam Hair »

Daniel Shawul wrote:You said it would increase _significantly_, which it didn't. The error margin is 5 vs. 6 after you add a few more games from the second pool. You simply forgot that we still have the old games when adding the new pool. Of course I wouldn't expect the error margin to be the same with different sets of players, but it will not change the error margins significantly as you claimed. Here is a direct translation of your first example. With or without the second pool, the error margin still remains 1...

Code: Select all

1   Player9   164   1   1   900100   50.00%   164   20.00% 
2   Player8   164   1   1   900100   50.00%   164   20.00% 
3   Player7   164   1   1   900100   50.00%   164   20.00% 
4   Player6   164   1   1   900100   50.00%   164   20.00% 
5   Player5   164   1   1   900100   50.00%   164   20.00% 
6   Player4   164   1   1   900100   50.00%   164   20.00% 
7   Player3   164   1   1   900100   50.00%   164   20.00% 
8   Player2   164   1   1   900100   50.00%   164   20.00% 
9   Player1   164   1   1   900100   50.00%   164   20.00% 
10   Player0   164   1   1   900100   50.00%   164   20.00% 
11   Player10   -163   52   52   190   28.90%   9   20.00% 
12   Player11   -163   52   52   190   28.90%   9   20.00% 
13   Player12   -163   52   52   190   28.90%   9   20.00% 
14   Player13   -163   52   52   190   28.90%   9   20.00% 
15   Player14   -163   52   52   190   28.90%   9   20.00% 
16   Player15   -163   52   52   190   28.90%   9   20.00% 
17   Player16   -163   52   52   190   28.90%   9   20.00% 
18   Player17   -163   52   52   190   28.90%   9   20.00% 
19   Player18   -163   52   52   190   28.90%   9   20.00% 
20   Player19   -163   52   52   190   28.90%   9   20.00% 
Now compare this result to what you said:
All the errors will increase tremendously, because now the values against the average of the pool are uncertain.
Obviously it didn't increase at all, even though the average Elo of the pool decreased by 163. You forgot that we still have those 100,000 games among themselves; otherwise you wouldn't bring up the distance examples you gave, whose relevance here I fail to see.

For completeness, here are the results ignoring one of the pools. You can see there isn't much of a difference for either pool, even though they have tremendously different numbers of games and Elos as well.

The first pool's error is the same:

Code: Select all

1	Player0	0	1	1	900000	50.00%	0	20.00%
2	Player1	0	1	1	900000	50.00%	0	20.00%
3	Player2	0	1	1	900000	50.00%	0	20.00%
4	Player3	0	1	1	900000	50.00%	0	20.00%
5	Player4	0	1	1	900000	50.00%	0	20.00%
6	Player5	0	1	1	900000	50.00%	0	20.00%
7	Player6	0	1	1	900000	50.00%	0	20.00%
8	Player7	0	1	1	900000	50.00%	0	20.00%
9	Player8	0	1	1	900000	50.00%	0	20.00%
10	Player9	0	1	1	900000	50.00%	0	20.00%
The second pool's error bar is 52 vs. 53:

Code: Select all

1	Player0	187	82	82	100	90.00%	-186	20.00%
2	Player1	187	82	82	100	90.00%	-186	20.00%
3	Player2	187	82	82	100	90.00%	-186	20.00%
4	Player3	187	82	82	100	90.00%	-186	20.00%
5	Player4	187	82	82	100	90.00%	-186	20.00%
6	Player5	187	82	82	100	90.00%	-186	20.00%
7	Player6	187	82	82	100	90.00%	-186	20.00%
8	Player7	187	82	82	100	90.00%	-186	20.00%
9	Player8	187	82	82	100	90.00%	-186	20.00%
10	Player9	187	82	82	100	90.00%	-186	20.00%
11	Player10	-186	53	53	190	28.90%	10	20.00%
12	Player11	-186	53	53	190	28.90%	10	20.00%
13	Player12	-186	53	53	190	28.90%	10	20.00%
14	Player13	-186	53	53	190	28.90%	10	20.00%
15	Player14	-186	53	53	190	28.90%	10	20.00%
16	Player15	-186	53	53	190	28.90%	10	20.00%
17	Player16	-186	53	53	190	28.90%	10	20.00%
18	Player17	-186	53	53	190	28.90%	10	20.00%
19	Player18	-186	53	53	190	28.90%	10	20.00%
20	Player19	-186	53	53	190	28.90%	10	20.00%
Daniel
Yes, the overwhelming number of games played by the first 10 opponents keeps the uncertainty in their Elo estimates low. However, Miguel's point is not without merit.

Let's look at a more realistic example:

Code: Select all

Rank Name       Elo    +    - games score oppo. draws 
   1 Engine_B     0   14   14   504   50%     0   25% 
   2 Engine_A     0   14   14   504   50%     0   25% 
Now, let's add two more engines that have played 6 games against each other and against each of A and B:

Code: Select all

Rank Name       Elo    +    - games score oppo. draws 
   1 Engine_B     0   61   61   516   50%     0   25% 
   2 Engine_A     0   61   61   516   50%     0   25% 
   3 Engine_C     0  102  102    18   50%     0   33% 
   4 Engine_D     0  102  102    18   50%     0   33% 
The uncertainty in the ratings for A and B does rise dramatically.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Yet Another Testing Question

Post by michiguel »

Adam Hair wrote:
Daniel Shawul wrote:You said it would increase _significantly_, which it didn't. The error margin is 5 vs. 6 after you add a few more games from the second pool. You simply forgot that we still have the old games when adding the new pool. Of course I wouldn't expect the error margin to be the same with different sets of players, but it will not change the error margins significantly as you claimed. Here is a direct translation of your first example. With or without the second pool, the error margin still remains 1...

Code: Select all

1   Player9   164   1   1   900100   50.00%   164   20.00% 
2   Player8   164   1   1   900100   50.00%   164   20.00% 
3   Player7   164   1   1   900100   50.00%   164   20.00% 
4   Player6   164   1   1   900100   50.00%   164   20.00% 
5   Player5   164   1   1   900100   50.00%   164   20.00% 
6   Player4   164   1   1   900100   50.00%   164   20.00% 
7   Player3   164   1   1   900100   50.00%   164   20.00% 
8   Player2   164   1   1   900100   50.00%   164   20.00% 
9   Player1   164   1   1   900100   50.00%   164   20.00% 
10   Player0   164   1   1   900100   50.00%   164   20.00% 
11   Player10   -163   52   52   190   28.90%   9   20.00% 
12   Player11   -163   52   52   190   28.90%   9   20.00% 
13   Player12   -163   52   52   190   28.90%   9   20.00% 
14   Player13   -163   52   52   190   28.90%   9   20.00% 
15   Player14   -163   52   52   190   28.90%   9   20.00% 
16   Player15   -163   52   52   190   28.90%   9   20.00% 
17   Player16   -163   52   52   190   28.90%   9   20.00% 
18   Player17   -163   52   52   190   28.90%   9   20.00% 
19   Player18   -163   52   52   190   28.90%   9   20.00% 
20   Player19   -163   52   52   190   28.90%   9   20.00% 
Now compare this result to what you said:
All the errors will increase tremendously, because now the values against the average of the pool are uncertain.
Obviously it didn't increase at all, even though the average Elo of the pool decreased by 163. You forgot that we still have those 100,000 games among themselves; otherwise you wouldn't bring up the distance examples you gave, whose relevance here I fail to see.

For completeness, here are the results ignoring one of the pools. You can see there isn't much of a difference for either pool, even though they have tremendously different numbers of games and Elos as well.

The first pool's error is the same:

Code: Select all

1	Player0	0	1	1	900000	50.00%	0	20.00%
2	Player1	0	1	1	900000	50.00%	0	20.00%
3	Player2	0	1	1	900000	50.00%	0	20.00%
4	Player3	0	1	1	900000	50.00%	0	20.00%
5	Player4	0	1	1	900000	50.00%	0	20.00%
6	Player5	0	1	1	900000	50.00%	0	20.00%
7	Player6	0	1	1	900000	50.00%	0	20.00%
8	Player7	0	1	1	900000	50.00%	0	20.00%
9	Player8	0	1	1	900000	50.00%	0	20.00%
10	Player9	0	1	1	900000	50.00%	0	20.00%
The second pool's error bar is 52 vs. 53:

Code: Select all

1	Player0	187	82	82	100	90.00%	-186	20.00%
2	Player1	187	82	82	100	90.00%	-186	20.00%
3	Player2	187	82	82	100	90.00%	-186	20.00%
4	Player3	187	82	82	100	90.00%	-186	20.00%
5	Player4	187	82	82	100	90.00%	-186	20.00%
6	Player5	187	82	82	100	90.00%	-186	20.00%
7	Player6	187	82	82	100	90.00%	-186	20.00%
8	Player7	187	82	82	100	90.00%	-186	20.00%
9	Player8	187	82	82	100	90.00%	-186	20.00%
10	Player9	187	82	82	100	90.00%	-186	20.00%
11	Player10	-186	53	53	190	28.90%	10	20.00%
12	Player11	-186	53	53	190	28.90%	10	20.00%
13	Player12	-186	53	53	190	28.90%	10	20.00%
14	Player13	-186	53	53	190	28.90%	10	20.00%
15	Player14	-186	53	53	190	28.90%	10	20.00%
16	Player15	-186	53	53	190	28.90%	10	20.00%
17	Player16	-186	53	53	190	28.90%	10	20.00%
18	Player17	-186	53	53	190	28.90%	10	20.00%
19	Player18	-186	53	53	190	28.90%	10	20.00%
20	Player19	-186	53	53	190	28.90%	10	20.00%
Daniel
Yes, the overwhelming number of games played by the first 10 opponents keeps the uncertainty in their Elo estimates low. However, Miguel's point is not without merit.

Let's look at a more realistic example:

Code: Select all

Rank Name       Elo    +    - games score oppo. draws 
   1 Engine_B     0   14   14   504   50%     0   25% 
   2 Engine_A     0   14   14   504   50%     0   25% 
Now, let's add two more engines that have played 6 games against each other and against each of A and B:

Code: Select all

Rank Name       Elo    +    - games score oppo. draws 
   1 Engine_B     0   61   61   516   50%     0   25% 
   2 Engine_A     0   61   61   516   50%     0   25% 
   3 Engine_C     0  102  102    18   50%     0   33% 
   4 Engine_D     0  102  102    18   50%     0   33% 
The uncertainty in the ratings for A and B does rise dramatically.
The example I showed was wrong because I used Bayeselo in the wrong way. I was starting to think something was funny, and it was. I downloaded it here for this and followed the example on the website, which uses exactdist. Using covariance, which is the proper way to do it, I get

Code: Select all

Rank Name       Elo    +    - games score oppo. draws 
   1 Engine_B     0    3    3  6336   50%     0   50% 
   2 Engine_A     0    3    3  6336   50%     0   50% 

Rank Name       Elo    +    - games score oppo. draws 
   1 Engine_A     0   42   42  6360   50%     0   50% 
   2 Engine_B     0   42   42  6360   50%     0   50% 
   3 Engine_C     0   71   71    24   50%     0   50% 
   4 Engine_D     0   71   71    24   50%     0   50% 
Looking at the help, we see

Code: Select all

covariance ...... compute intervals with the full Hessian
which is the way to go, and

Code: Select all

exactdist [p] ... compute intervals assuming exact opponent Elos
which is a gross approximation, completely invalid in this case.

For a totally unrelated issue, Adam mentioned to me that he uses covariance (thanks!), and now, looking at the help, I understand why. In fact, why does the example on the website use exactdist? Probably covariance is too slow for many players because of the inversion of a gigantic matrix?
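As a rough sketch of what a curvature-based ("full Hessian") interval amounts to in the simplest two-player case, here is a delta-method calculation, assuming the logistic Bradley–Terry score curve; the function name and the treatment of draws are my own illustration, not Bayeselo's actual code:

```python
import math

def one_sigma_elo(wins, losses, draws):
    """One-sigma Elo uncertainty of a two-player match via the delta
    method: standard error of the mean score divided by the slope of
    the logistic score-vs-Elo curve at the observed score.  This is
    the 1x1 analogue of a Hessian/covariance computation, not
    Bayeselo's actual algorithm."""
    n = wins + losses + draws
    mu = (wins + 0.5 * draws) / n               # mean score per game
    var = (wins + 0.25 * draws) / n - mu ** 2   # per-game score variance
    se = math.sqrt(var / n)                     # std. error of the mean
    slope = math.log(10) / 400 * mu * (1 - mu)  # d(score)/d(Elo diff)
    return se / slope

# 6336 games at a 50% score with 50% draws, as in the table above:
print(round(one_sigma_elo(1584, 1584, 3168), 1))   # ~3.1 Elo
```

For that input the sketch gives roughly ±3 Elo, in line with the covariance output above.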

Miguel
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Yet Another Testing Question

Post by Adam Hair »

IIRC, the cost of computing the ratings with 'exactdist' is linear with respect to the number of players, while the cost for 'covariance' is cubic. However, I found that for well-connected databases, 'covariance' is much faster than 'exactdist'. A test I ran recently using the CCRL 40/4 database found that 'exactdist' took 2 minutes 42 seconds to compute the estimated rating intervals, while 'covariance' took 15 seconds. Since 'covariance' gives correct ratings (as compared to 'exactdist') for pathological PGNs, I believe it is working properly. I have an idea about why 'covariance' is faster in the situations I mention, but I really find it hard to look at Bayeselo's source and see what it is doing (the deficiency is mine, not Rémi's). So I am probably completely wrong in my guess (that information is retained from the estimation of the maximum-likelihood ratings).
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Yet Another Testing Question

Post by Daniel Shawul »

Yes, the overwhelming number of games played by the first 10 opponents keeps the uncertainty in their Elo estimates low. However, Miguel's point is not without merit.

Let's look at a more realistic example:
Huh? I didn't come up with the example; I only showed that it didn't behave as he explained. Let's look at yours now:
Code:
Rank Name Elo + - games score oppo. draws
1 Engine_B 0 14 14 504 50% 0 25%
2 Engine_A 0 14 14 504 50% 0 25%


Now, let's add two more engines that have played 6 games against each other and against each of A and B:

Code:
Rank Name Elo + - games score oppo. draws
1 Engine_B 0 61 61 516 50% 0 25%
2 Engine_A 0 61 61 516 50% 0 25%
3 Engine_C 0 102 102 18 50% 0 33%
4 Engine_D 0 102 102 18 50% 0 33%


The uncertainty in the ratings for A and B does rise dramatically.
You are doing something wrong. I get this:

Code: Select all

500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   27   27   500   50%     0   20%
   2 Player1     0   27   27   500   50%     0   20%
ResultSet-EloRating>
After games are added:

Code: Select all

550 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0   148   27   27   520   52%   137   20%
   2 Player1   148   27   27   520   52%   137   20%
   3 Player3  -148  123  123    30   23%    49   20%
   4 Player2  -148  123  123    30   23%    49   20%
ResultSet-EloRating>
So the uncertainty is still 27...
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Yet Another Testing Question

Post by michiguel »

Daniel Shawul wrote:
Yes, the overwhelming number of games played by the first 10 opponents keeps the uncertainty in their Elo estimates low. However, Miguel's point is not without merit.

Let's look at a more realistic example:
Huh? I didn't come up with the example; I only showed that it didn't behave as he explained. Let's look at yours now:
Code:
Rank Name Elo + - games score oppo. draws
1 Engine_B 0 14 14 504 50% 0 25%
2 Engine_A 0 14 14 504 50% 0 25%


Now, let's add two more engines that have played 6 games against each other and against each of A and B:

Code:
Rank Name Elo + - games score oppo. draws
1 Engine_B 0 61 61 516 50% 0 25%
2 Engine_A 0 61 61 516 50% 0 25%
3 Engine_C 0 102 102 18 50% 0 33%
4 Engine_D 0 102 102 18 50% 0 33%


The uncertainty in the ratings for A and B does rise dramatically.
You are doing something wrong. I get this:

Code: Select all

500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   27   27   500   50%     0   20%
   2 Player1     0   27   27   500   50%     0   20%
ResultSet-EloRating>
After games are added:

Code: Select all

550 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0   148   27   27   520   52%   137   20%
   2 Player1   148   27   27   520   52%   137   20%
   3 Player3  -148  123  123    30   23%    49   20%
   4 Player2  -148  123  123    30   23%    49   20%
ResultSet-EloRating>
So the uncertainty is still 27...
I see that you are using the default, which is a very crude approximation and not adequate for this case.

http://talkchess.com/forum/viewtopic.php?p=206722

Miguel
EDIT:
http://talkchess.com/forum/viewtopic.ph ... _view=flat
The previous link did not work as I expected.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Yet Another Testing Question

Post by Daniel Shawul »

I used the default, of course, since Rémi suggests it to avoid ugly error margin reports in some cases, as shown here. Elostat also gives the same error margins. It seems to me that even when we have one pool (two players), the default gives better results. If I am given one result set with 200-200-100, then I can calculate the margin of error by hand to be 27, which the default shows, but I get half of that with exactdist...

Code: Select all

ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   27   27   500   50%     0   20%
   2 Player1     0   27   27   500   50%     0   20%
ResultSet-EloRating>covariance
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   14   14   500   50%     0   20%
   2 Player1     0   14   14   500   50%     0   20%
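For what it's worth, the by-hand figure of 27 for the 200-200-100 set can be reproduced with a small normal-approximation sketch (my own illustration: 1.96 standard errors on the mean score, mapped through the logistic Elo formula; this is not the algorithm Bayeselo or Elostat actually implements):

```python
import math

def elo_margin_95(wins, losses, draws):
    """95% Elo error margin of a two-player match via the normal
    approximation on the score fraction (a by-hand sketch, not
    Bayeselo's method)."""
    n = wins + losses + draws
    mu = (wins + 0.5 * draws) / n               # mean score
    var = (wins + 0.25 * draws) / n - mu ** 2   # per-game score variance
    se = math.sqrt(var / n)                     # std. error of the mean
    to_elo = lambda s: -400 * math.log10(1 / s - 1)
    return (to_elo(mu + 1.96 * se) - to_elo(mu - 1.96 * se)) / 2

print(round(elo_margin_95(200, 200, 100), 1))        # ~27.3
print(round(elo_margin_95(40000, 40000, 20000), 2))  # ~1.93
```

The same sketch at one sigma instead of 1.96 gives about 14, which is the exactdist-like half-width shown above.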
With the larger number of games and all methods:

Code: Select all

ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    2    2 100000   50%     0   20%
   2 Player1     0    2    2 100000   50%     0   20%
ResultSet-EloRating>covariance
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    1    1 100000   50%     0   20%
   2 Player1     0    1    1 100000   50%     0   20%
ResultSet-EloRating>exactdist
00:00:00,05
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    3    3 100000   50%     0   20%
   2 Player1     0    3    3 100000   50%     0   20%
ResultSet-EloRating>jointdist
00:00:08,15
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0 -1439 -1439 100000   50%     0   20%
   2 Player1     0 -1439 -1439 100000   50%     0   20%
ResultSet-EloRating>los
Not exactly getting more accurate, is it? Maybe the algorithm has problems with two players...
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Yet Another Testing Question

Post by Adam Hair »

Daniel Shawul wrote:
Yes, the overwhelming number of games played by the first 10 opponents keeps the uncertainty in their Elo estimates low. However, Miguel's point is not without merit.

Let's look at a more realistic example:
Huh? I didn't come up with the example; I only showed that it didn't behave as he explained.
I did not say that you came up with the example. And Miguel's example is unrealistic in the sense that rarely (if ever) are ~1 million games per engine combined into a PGN.
Daniel Shawul wrote: Let's look at yours now:
Code:
Rank Name Elo + - games score oppo. draws
1 Engine_B 0 14 14 504 50% 0 25%
2 Engine_A 0 14 14 504 50% 0 25%


Now, let's add two more engines that have played 6 games against each other and against each of A and B:

Code:
Rank Name Elo + - games score oppo. draws
1 Engine_B 0 61 61 516 50% 0 25%
2 Engine_A 0 61 61 516 50% 0 25%
3 Engine_C 0 102 102 18 50% 0 33%
4 Engine_D 0 102 102 18 50% 0 33%


The uncertainty in the ratings for A and B does rise dramatically.
You are doing something wrong. I get this:

Code: Select all

500 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   27   27   500   50%     0   20%
   2 Player1     0   27   27   500   50%     0   20%
ResultSet-EloRating>
After games are added:

Code: Select all

550 game(s) loaded, 0 game(s) with unknown result ignored.
ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0   148   27   27   520   52%   137   20%
   2 Player1   148   27   27   520   52%   137   20%
   3 Player3  -148  123  123    30   23%    49   20%
   4 Player2  -148  123  123    30   23%    49   20%
ResultSet-EloRating>
So the uncertainty is still 27...
I see that there was misunderstanding all around. I used 'covariance' to compute the intervals. After reading this post from Rémi and the link you provided, I will give less weight to confidence intervals from Bayeselo (or from other programs).
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Yet Another Testing Question

Post by Adam Hair »

Daniel Shawul wrote:I used the default, of course, since Rémi suggests it to avoid ugly error margin reports in some cases, as shown here. Elostat also gives the same error margins. It seems to me that even when we have one pool (two players), the default gives better results. If I am given one result set with 200-200-100, then I can calculate the margin of error by hand to be 27, which the default shows, but I get half of that with exactdist...

Code: Select all

ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   27   27   500   50%     0   20%
   2 Player1     0   27   27   500   50%     0   20%
ResultSet-EloRating>covariance
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   14   14   500   50%     0   20%
   2 Player1     0   14   14   500   50%     0   20%
With the larger number of games and all methods:

Code: Select all

ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    2    2 100000   50%     0   20%
   2 Player1     0    2    2 100000   50%     0   20%
ResultSet-EloRating>covariance
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    1    1 100000   50%     0   20%
   2 Player1     0    1    1 100000   50%     0   20%
ResultSet-EloRating>exactdist
00:00:00,05
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    3    3 100000   50%     0   20%
   2 Player1     0    3    3 100000   50%     0   20%
ResultSet-EloRating>jointdist
00:00:08,15
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0 -1439 -1439 100000   50%     0   20%
   2 Player1     0 -1439 -1439 100000   50%     0   20%
ResultSet-EloRating>los
Not exactly getting more accurate, is it? Maybe the algorithm has problems with two players...
Interesting. When I use 'jointdist' on a smaller PGN, I get this:

Code: Select all

ResultSet-EloRating>jointdist
00:00:00,01
ResultSet-EloRating>ratings
Rank Name       Elo    +    - games score oppo. draws
   1 Engine_B     0   16   13   504   50%     0   25%
   2 Engine_A     0   16   13   504   50%     0   25%
ResultSet-EloRating>
When I increase the number of games, I get the same as you:

Code: Select all

ResultSet-EloRating>jointdist
00:00:00,17
ResultSet-EloRating>ratings
Rank Name       Elo    +    - games score oppo. draws
   1 Engine_D     0 -1499 -1499 22176   50%     0   33%
   2 Engine_C     0 -1499 -1499 22176   50%     0   33%
ResultSet-EloRating>
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Yet another testing question: error bars.

Post by Ajedrecista »

Hello Daniel and Adam:
Adam Hair wrote:
Daniel Shawul wrote:I used the default, of course, since Rémi suggests it to avoid ugly error margin reports in some cases, as shown here. Elostat also gives the same error margins. It seems to me that even when we have one pool (two players), the default gives better results. If I am given one result set with 200-200-100, then I can calculate the margin of error by hand to be 27, which the default shows, but I get half of that with exactdist...

Code: Select all

ResultSet>elo
ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   27   27   500   50%     0   20%
   2 Player1     0   27   27   500   50%     0   20%
ResultSet-EloRating>covariance
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0   14   14   500   50%     0   20%
   2 Player1     0   14   14   500   50%     0   20%
With the larger number of games and all methods:

Code: Select all

ResultSet-EloRating>mm 1 1
00:00:00,00
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    2    2 100000   50%     0   20%
   2 Player1     0    2    2 100000   50%     0   20%
ResultSet-EloRating>covariance
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    1    1 100000   50%     0   20%
   2 Player1     0    1    1 100000   50%     0   20%
ResultSet-EloRating>exactdist
00:00:00,05
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0    3    3 100000   50%     0   20%
   2 Player1     0    3    3 100000   50%     0   20%
ResultSet-EloRating>jointdist
00:00:08,15
ResultSet-EloRating>ratings
Rank Name      Elo    +    - games score oppo. draws
   1 Player0     0 -1439 -1439 100000   50%     0   20%
   2 Player1     0 -1439 -1439 100000   50%     0   20%
ResultSet-EloRating>los
Not exactly getting more accurate, is it? Maybe the algorithm has problems with two players...
Interesting. When I use 'jointdist' on a smaller PGN, I get this:

Code: Select all

ResultSet-EloRating>jointdist
00:00:00,01
ResultSet-EloRating>ratings
Rank Name       Elo    +    - games score oppo. draws
   1 Engine_B     0   16   13   504   50%     0   25%
   2 Engine_A     0   16   13   504   50%     0   25%
ResultSet-EloRating>
When I increase the number of games, I get the same as you:

Code: Select all

ResultSet-EloRating>jointdist
00:00:00,17
ResultSet-EloRating>ratings
Rank Name       Elo    +    - games score oppo. draws
   1 Engine_D     0 -1499 -1499 22176   50%     0   33%
   2 Engine_C     0 -1499 -1499 22176   50%     0   33%
ResultSet-EloRating>
Just my two cents: regarding error bars in a match between two engines only, I can calculate error bars with my own method (maybe similar to the EloStat algorithm, I do not know), although my method has its drawbacks (for example, with a draw ratio of 100% or near it, or with very unbalanced matches of more than 300 Elo of difference between the engines). I am not an expert in using BayesElo, but I sometimes notice that the error bars are too tiny, like 1-sigma confidence instead of 95% confidence (example here); I must type 'confidence 0.95' to get the usual error bars, as shown in my thread about diminishing returns at fixed-depth testing.

In the case of a match between two engines, if there are lots of games (small standard deviation) and the score of each engine is near 50%, the error bars can be multiplied by 1.96 to go from 1-sigma confidence to 95% confidence. Taking a formula for the average error <e> from the last equation of this post, here is my explanation:
Score = mu ~ 1/2; standard deviation = sd << 1:

|<e(k)>/<e(k = 1)>| = (200·log{[mu + k·(sd)][1 - mu + k·(sd)]/[mu - k·(sd)][1 - mu - k·(sd)]})/{200·log[(mu + sd)(1 - mu + sd)/(mu - sd)(1 - mu - sd)]} =

= ([200/ln(10)]·ln{[0.5 + k·(sd)][0.5 + k·(sd)]/[0.5 - k·(sd)][0.5 - k·(sd)]})/{[200/ln(10)]·ln[(0.5 + sd)(0.5 + sd)/(0.5 - sd)(0.5 - sd)]} =

= 2·ln{[0.5 + k·(sd)]/[0.5 - k·(sd)]}/{2·ln[(0.5 + sd)/(0.5 - sd)]} = ln{[1 + 2k·(sd)]/[1 - 2k·(sd)]}/ln{[1 + 2·(sd)]/[1 - 2·(sd)]} =

= {ln[1 + 2k·(sd)] - ln[1 - 2k·(sd)]}/{ln[1 + 2·(sd)] - ln[1 - 2·(sd)]} ~ [2k·(sd) - (-2k)·(sd)]/[2·(sd) - (-2)·(sd)] = [4k·(sd)]/[4·(sd)] = k
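The linear-in-k approximation above can be checked numerically; the function below simply evaluates the <e(k)> expression from the quoted formula at a 50% score and a small standard deviation (the values of mu and sd are illustrative):

```python
import math

def avg_error(mu, sd, k):
    # <e(k)> from the quoted formula: 200*log10 of the odds-ratio
    # spread evaluated k standard deviations away from the score mu
    return 200 * math.log10((mu + k * sd) * (1 - mu + k * sd)
                            / ((mu - k * sd) * (1 - mu - k * sd)))

mu, sd = 0.5, 0.01        # 50% score, small standard deviation
for k in (1.0, 1.96, 3.0):
    ratio = avg_error(mu, sd, k) / avg_error(mu, sd, 1.0)
    print(k, round(ratio, 3))   # the ratio stays close to k, as derived
```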
Using my programme LOS_and_Elo_uncertainties_calculator, I obtain these error bars for the examples above:

Code: Select all

For 95% confidence:
-------------------

Daniel's examples:

+200 -200 =100 (n = 500, D = 20%); e ~ ± 27.29 Elo

+40000 -40000 =20000 (n = 100000, D = 20%); e ~ ± 1.93 Elo

------------------------

Adam's examples:

+189 -189 =126 (n = 504, D = 25%); e ~ ± 26.32 Elo

+7392 -7392 =7392 (n = 22176, D = 1/3); e ~ ± 3.73 Elo
So, in Daniel's examples, 'mm 1 1' and 'ratings' give results similar to mine; in Adam's first example, I am puzzled by the asymmetric error bars for a score of 50% (they are also too tiny, probably due to the issue I explained at the top of this post).

My message is not a solution, but I want to give my idea about what the correct error bars for your examples should be.

Regards from Spain.

Ajedrecista.
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Yet Another Testing Question

Post by bob »

brianr wrote:Can I correctly say that v8.52 is in fact better than v8.71 based on the following?

First is heads-up play, which looks pretty clear.

Code: Select all

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker852x64    15    7    7  1800   55%   -15   44%
   2 Tinker871x64   -15    7    7  1800   45%    15   44%
ResultSet-EloRating>los
              Ti Ti
Tinker852x64     99
Tinker871x64   0
But, look again with a much larger opponent pool. Of course, in this case the error margins look inconclusive.

Code: Select all

Rank Name                     Elo    +    - games score oppo. draws
   1 		              276   16   16  1674   72%    92   19%
   2                          269   11   11  3183   73%    86   19%
   3                          241    6    7 12702   72%    53   17%
   4                          221   26   25   582   69%    68   23%
   5                          210    6    6 13740   68%    61   17%
   6                          203    6    6 16336   68%    52   18%
   7                          147    9    9  4670   63%    46   20%
   8                          142    7    7  8168   65%    34   29%
   9                          141    7    6 10662   63%    42   23%
  10                          119    7    7  9673   61%    36   28%
  11                          103    5    5 20711   58%    44   20%
  12                           60    6    5 16730   51%    51   19%
  13                           58    6    6 11093   53%    38   24%
  14                            8    6    6 11419   45%    48   21%
  15 Tinker871x64               0    8    7  5859   45%    35   40%
  16 		                    -5    6    6 10976   43%    43   32%
  17 Tinker852x64              -7    5    5 23280   47%    13   37%
  18                           -8    3    4 34496   51%   -13   40%
  19                          -11    5    4 23803   41%    60   21%
  20                          -18    5    6 14485   48%     0   34%
  21                          -18    6    6 11354   43%    33   30%
  22                          -25    9    9  4099   44%    18   25%
  23                          -26    5    5 31361   36%    90   18%
  24                          -35    5    6 12305   44%     2   40%
  25                          -44    8    9  4706   40%    30   36%
  26                          -46    9    9  4770   36%    53   31%
  27                          -53   28   28   358   45%   -18   41%
  28                          -66    8    8  6689   32%    80   21%
  29                          -72    6    6 15102   33%    70   16%
  30                         -268   23   24  1117   14%    64   10%
  31                         -371  160  249    17    6%     1   12%
Thanks
I simply do not like heads-up play for evaluating changes. I've done too many such tests during the time I developed my cluster-testing approach, and heads-up play simply produces distorted results.

I suppose it depends on your goal. If your goal is to write a version that beats your current version, and nothing else, then self-testing will work well. But if your goal is to produce a version that is stronger overall, you need to test against a pool of opponents instead, so that multiple different programs get a chance to attack/break your new changes in ways your own program won't. A simple example might be king safety. If your old version doesn't know how to do kingside attacks, then your new version that understands better how to defend against them won't seem any better, because the code is not used. But against attacking opponents, you might find your new code is actually worse, because it is poorly tuned and you play too defensively and get strangled...