Yet Another Testing Question

brianr
Posts: 540
Joined: Thu Mar 09, 2006 3:01 pm
Full name: Brian Richardson

Yet Another Testing Question

Post by brianr »

Can I correctly say that v8.52 is in fact better than v8.71 based on the following?

First is heads-up play, which looks pretty clear.

Code:

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker852x64    15    7    7  1800   55%   -15   44%
   2 Tinker871x64   -15    7    7  1800   45%    15   44%
ResultSet-EloRating>los
              Ti Ti
Tinker852x64     99
Tinker871x64   0
But, look again with a much larger opponent pool. Of course, in this case the error margins look inconclusive.

Code:

Rank Name                     Elo    +    - games score oppo. draws
   1                          276   16   16  1674   72%    92   19%
   2                          269   11   11  3183   73%    86   19%
   3                          241    6    7 12702   72%    53   17%
   4                          221   26   25   582   69%    68   23%
   5                          210    6    6 13740   68%    61   17%
   6                          203    6    6 16336   68%    52   18%
   7                          147    9    9  4670   63%    46   20%
   8                          142    7    7  8168   65%    34   29%
   9                          141    7    6 10662   63%    42   23%
  10                          119    7    7  9673   61%    36   28%
  11                          103    5    5 20711   58%    44   20%
  12                           60    6    5 16730   51%    51   19%
  13                           58    6    6 11093   53%    38   24%
  14                            8    6    6 11419   45%    48   21%
  15 Tinker871x64               0    8    7  5859   45%    35   40%
  16                           -5    6    6 10976   43%    43   32%
  17 Tinker852x64              -7    5    5 23280   47%    13   37%
  18                           -8    3    4 34496   51%   -13   40%
  19                          -11    5    4 23803   41%    60   21%
  20                          -18    5    6 14485   48%     0   34%
  21                          -18    6    6 11354   43%    33   30%
  22                          -25    9    9  4099   44%    18   25%
  23                          -26    5    5 31361   36%    90   18%
  24                          -35    5    6 12305   44%     2   40%
  25                          -44    8    9  4706   40%    30   36%
  26                          -46    9    9  4770   36%    53   31%
  27                          -53   28   28   358   45%   -18   41%
  28                          -66    8    8  6689   32%    80   21%
  29                          -72    6    6 15102   33%    70   16%
  30                         -268   23   24  1117   14%    64   10%
  31                         -371  160  249    17    6%     1   12%
Thanks
ZirconiumX
Posts: 1361
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: Yet Another Testing Question

Post by ZirconiumX »

I'd take the gauntlet result over the self-play - otherwise you are simply tuning the engine against itself. But that is just me.

Matthew:out
tu ne cede malis, sed contra audentior ito
Ajedrecista
Posts: 2135
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Yet another testing question.

Post by Ajedrecista »

Hello Brian:
brianr wrote:Can I correctly say that v8.52 is in fact better than v8.71 based on the following?

I agree with Matthew: a group of different engines brings a wider variety of styles, which is preferable. Typical tournaments include different engines, not different versions of the same engine. That said, I think that self-play tests can be useful, although they tend to enlarge the Elo gap between versions.

If I consider the first test as +594 -414 =792 in favour of 8.52 version:

Code:

LOS_and_Elo_uncertainties_calculator, ® 2012.

----------------------------------------------------------------
Calculation of Elo uncertainties in a match between two engines:
----------------------------------------------------------------

(The input and output data is referred to the first engine).

Please write down non-negative integers.

Maximum number of games supported: 2147483647.

Write down the number of wins (up to 1825361100):

594

Write down the number of loses (up to 1825361100):

414

Write down the number of draws (up to 2147482639):

792

 Write down the confidence level (in percentage) between 65% and 99.9% (it will be rounded up to 0.01%):

95

Write down the clock rate of the CPU (in GHz), only for timing the elapsed time of the calculations:

3

---------------------------------------
Elo interval for 95.00 % confidence:

Elo rating difference:     34.86 Elo

Lower rating difference:   22.87 Elo
Upper rating difference:   46.93 Elo

Lower bound uncertainty:  -11.99 Elo
Upper bound uncertainty:   12.07 Elo
Average error:        +/-  12.03 Elo

K = (average error)*[sqrt(n)] =  510.33

Elo interval: ]  22.87,   46.93[
---------------------------------------

Number of games of the match:      1800
Score: 55.00 %
Elo rating difference:   34.86 Elo
Draw ratio: 44.00 %

*********************************************************
Standard deviation:  1.7130 % of the points of the match.
*********************************************************

 Error bars were calculated with two-sided tests; values are rounded up to 0.01 Elo, or 0.01 in the case of K.

-------------------------------------------------------------------
Calculation of likelihood of superiority (LOS) in a one-sided test:
-------------------------------------------------------------------

LOS (taking into account draws) is always calculated, if possible.

LOS (not taking into account draws) is only calculated if wins + loses < 16001.

LOS (average value) is calculated only when LOS (not taking into account draws) is calculated.
______________________________________________

LOS: 100.00 % (taking into account draws).
LOS: 100.00 % (not taking into account draws).
LOS: 100.00 % (average value).
______________________________________________

These values of LOS are rounded up to 0.01%

End of the calculations. Approximated elapsed time:   91 ms.

Thanks for using LOS_and_Elo_uncertainties_calculator. Press Enter to exit.
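The interval above can be reproduced approximately from the win/loss/draw counts alone. What follows is a minimal Python sketch, not the source of LOS_and_Elo_uncertainties_calculator: it assumes a normal approximation for the mean score, the logistic Elo formula, and a common erf-based approximation for LOS that ignores draws. The function names are made up for illustration.

```python
import math
from statistics import NormalDist


def score_to_elo(score):
    """Logistic Elo difference for an expected score strictly between 0 and 1."""
    return 400.0 * math.log10(score / (1.0 - score))


def elo_interval(wins, losses, draws, confidence=0.95):
    """Return (elo, lower, upper) using a normal approximation of the mean score."""
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result (1, 0.5 or 0) around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    z = NormalDist().inv_cdf(0.5 + confidence / 2.0)  # two-sided quantile
    half_width = z * math.sqrt(var / n)
    return (score_to_elo(score),
            score_to_elo(score - half_width),
            score_to_elo(score + half_width))


def los(wins, losses):
    """Approximate likelihood of superiority; draws carry no information here."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))


elo, lower, upper = elo_interval(594, 414, 792)
print(f"Elo {elo:.2f}, 95% interval [{lower:.2f}, {upper:.2f}]")
print(f"LOS {100.0 * los(594, 414):.2f} %")
```

With +594 -414 =792 this gives about 34.86 Elo and an interval of roughly 22.87 to 46.93, in line with the output above.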
LOS is near 100% (as BayesElo shows) and version 8.52 should be at least around 23 Elo better than version 8.71 with 95% confidence. But the second test looks more convincing. In fact, the error bars are a little inconclusive here: Robert Houdart said that in this case, the square root of the sum of squares could be a good indicator. Taking a look at your data:

Code:

  15 Tinker871x64               0    8    7  5859   45%    35   40%
  ...
  17 Tinker852x64              -7    5    5 23280   47%    13   37%
I take rating(Tinker 8.71 x64) ~ 0 ± 7.5; rating(Tinker 8.52 x64) ~ -7 ± 5. Then, the difference should be:

Code:

Rating(Tinker 8.71 x64) - rating(Tinker 8.52 x64) ~ 0 - (-7) ± sqrt[(7.5)² + 5²] = 7 ± sqrt(81.25) ~ 7 ± 9.0139 ~ 7 ± 9 Elo ~ [-2, 16].
According to your second test, Tinker 8.71 seems better than Tinker 8.52, but their Elo difference is not statistically conclusive at 95% confidence. It would be good to take a look at the LOS value that BayesElo computes between these two versions of Tinker in the second test.
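The same arithmetic as a short Python sketch. It assumes the two error bars are independent, which may not hold when both ratings come out of the same rating fit, so the result is only a rough indicator. The helper name is hypothetical.

```python
import math


def difference_with_error(elo_a, err_a, elo_b, err_b):
    """Combine two error bars in quadrature, assuming independent errors."""
    diff = elo_a - elo_b
    err = math.sqrt(err_a ** 2 + err_b ** 2)
    return diff, err


# Values read from the rating list: 8.71 is 0 (+8/-7, averaged to 7.5), 8.52 is -7 (+/-5).
diff, err = difference_with_error(0.0, 7.5, -7.0, 5.0)
low, high = diff - err, diff + err
print(f"difference = {diff:.0f} +/- {err:.2f} Elo, interval [{low:.1f}, {high:.1f}]")
print("interval contains zero:", low <= 0.0 <= high)
```

Because the interval contains zero, these error bars alone do not separate the two versions.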

Regards from Spain.

Ajedrecista.
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Yet Another Testing Question

Post by michiguel »

brianr wrote:Can I correctly say that v8.52 is in fact better than v8.71 based on the following?

In this context and for what you want to know, errors are meaningless. The errors you see represent the variation compared to the average of the pool, but what you need is the "head to head" error. You are interested in the parameter (EloEngine15 - EloEngine17), which has its own error, and could be _much_ smaller.

For instance, you can have ten engines (A, B, C, D, E, F, G, H, I, J) that played 100,000 games against each other, and each will have an error that would be less than 1. But, you incorporate a pgn with fewer games of ten other engines, which played only 10 games against the former ones and among themselves. All the errors will increase tremendously, because now the values against the average of the pool are uncertain.

Suppose we are standing close to one another, 1 meter apart. We are 1 meter +/- 0.01 meter apart, and we are both in line with Buenos Aires, Argentina. Now, let's suppose I give my distance to Argentina and your distance to Argentina from Chicago. That would be 9,000,000 meters +/- 1000 meters, and possibly 9,000,001 meters +/- 1000 meters. The error between us is not that high!
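The analogy can be checked with a small simulation. This is only an illustrative Python sketch: the two standard deviations are hypothetical numbers chosen to mimic the example, and the key assumption is that both distances share the same large error (the distance from the room to Buenos Aires), while each person adds only a tiny error of their own.

```python
import random
import statistics

random.seed(1)

SHARED_SD = 1000.0  # hypothetical error of the Chicago-to-Buenos-Aires distance (m)
LOCAL_SD = 0.007    # hypothetical error of each person's position in the room (m)

mine, yours, gaps = [], [], []
for _ in range(10_000):
    shared = random.gauss(9_000_000.0, SHARED_SD)  # same draw for both people
    a = shared + random.gauss(0.0, LOCAL_SD)
    b = shared + 1.0 + random.gauss(0.0, LOCAL_SD)
    mine.append(a)
    yours.append(b)
    gaps.append(b - a)

sd_mine = statistics.stdev(mine)
sd_yours = statistics.stdev(yours)
sd_gap = statistics.stdev(gaps)
print(f"my distance:       +/- {sd_mine:.0f} m")
print(f"your distance:     +/- {sd_yours:.0f} m")
print(f"gap between us:    +/- {sd_gap:.3f} m")
```

Each distance on its own is uncertain by about a kilometer, yet the gap stays at the centimeter level, because the shared error cancels in the difference.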

You may like to compare the parameter LOS, but I am not sure how it is calculated.

Alternatively, you may want to take a look at Ordo with simulations, fixing the rating of the reference engine. All the errors will be relative to the reference. In addition, you get a matrix with all the "head to head" errors (A v B, A v C, etc.). That is what you should be looking for, and that is exactly the reason why I wrote the simulations feature.

That would be (1000 simulations):

ordo -p yourpgnfile.pgn -o output.txt -W -s 1000 -e error.csv

and open error.csv with Excel, or

ordo -p yourpgnfile.pgn -o output.txt -W -s 1000 -e error.csv -a 0 -A "Tinker852x64"

to put Tinker852x64 as the reference (all the errors will be relative to it, not to the average of the pool, and you can make a direct comparison).

Miguel
PS: However, it is likely in your case that the difference you see is related to the fact that a gauntlet may have different outcomes than a direct match.
brianr
Posts: 540
Joined: Thu Mar 09, 2006 3:01 pm
Full name: Brian Richardson

Re: Yet Another Testing Question

Post by brianr »

michiguel wrote: In this context and for what you want to know, errors are meaningless. The errors you see represent the variation compared to the average of the pool, but what you need is the "head to head" error. You are interested in the parameter (EloEngine15 - EloEngine17), which has its own error, and could be _much_ smaller.
Bayeselo says:

Code:

ResultSet-EloRating>pairstats 0 1
-- pairstats between i = 0, and j = 1
Name[i] = Tinker871x64
Name[j] = Tinker852x64
elo[i] - elo[j] = 6.9787
Games = 2256
w_ij = 364
d_ij = 551
l_ij = 254
w_ji = 449
d_ji = 459
l_ji = 179
elo[i] - elo[j] = -23.2599
So, does this mean 8.71 is about 23 elo worse than 8.52, and if so, with what confidence?
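One rough cross-check is to turn the pairstats counts into a plain head-to-head score. The Python sketch below rests on an assumption about the output format that the output itself does not state: that w_ij, d_ij and l_ij are the wins, draws and losses of player i with White against j, and w_ji, d_ji and l_ji the same for player j with White. It also uses the simple logistic formula rather than the BayesElo model, so the number is not expected to match exactly.

```python
import math

# Counts copied from the pairstats output (i = Tinker871x64, j = Tinker852x64).
w_ij, d_ij, l_ij = 364, 551, 254  # assumed: results of the games where i had White
w_ji, d_ji, l_ji = 449, 459, 179  # assumed: results of the games where j had White

wins_i = w_ij + l_ji      # i's wins with White, plus j's losses with White
losses_i = l_ij + w_ji    # i's losses with White, plus j's wins with White
draws = d_ij + d_ji
games = wins_i + losses_i + draws

score_i = (wins_i + 0.5 * draws) / games
elo_i_minus_j = 400.0 * math.log10(score_i / (1.0 - score_i))
print(f"{games} games, i scores {100.0 * score_i:.2f} %, about {elo_i_minus_j:.1f} Elo")
```

Under that assumption the counts add up to the 2256 games reported, and 8.71 scores about 46.5% head to head, which points the same way as the -23.2599 line.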
When I do exactdist 0/1 the Distribution interface opens and I get:

Code:

ResultSet-EloRating>exactdist 0
ResultSet-EloRating-Distribution>mostlikely
-38.961
Distribution commands
~~~~~~~~~~~~~~~~~~~~~
lower [c] ....... lower bound (c=confidence, default=0.95)
upper [c] ....... upper bound
mostlikely ...... most likely value
write ........... write complete distribution
ResultSet-EloRating-Distribution>lower
-47.1862
ResultSet-EloRating-Distribution>upper
-30.3499

ResultSet-EloRating>exactdist 1
ResultSet-EloRating-Distribution>mostlikely
-44.955
ResultSet-EloRating-Distribution>lower
-50.8457
ResultSet-EloRating-Distribution>upper
-41.4676
Or, does this mean they are about 6 elo apart?

Thanks again.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Yet Another Testing Question

Post by Daniel Shawul »

michiguel wrote: In this context and for what you want to know, errors are meaningless. The errors you see represent the variation compared to the average of the pool, but what you need is the "head to head" error. You are interested in the parameter (EloEngine15 - EloEngine17), which has its own error, and could be _much_ smaller.

For instance, you can have ten engines (A, B, C, D, E, F, G, H, I, J) that played 100,000 games against each other, and each will have an error that would be less than 1. But, you incorporate a pgn with fewer games of ten other engines, which played only 10 games against the former ones and among themselves. All the errors will increase tremendously, because now the values against the average of the pool are uncertain.
Why? It still has the 100,000 head-to-head games, so why would it increase at all? Error margins do not have anything to do with the average opponent, AFAIK.
michiguel wrote: Suppose we are standing close to one another, 1 meter apart. We are 1 meter +/- 0.01 meter apart, and we are both in line with Buenos Aires, Argentina. Now, let's suppose I give my distance to Argentina and your distance to Argentina from Chicago. That would be 9,000,000 meters +/- 1000 meters, and possibly 9,000,001 meters +/- 1000 meters. The error between us is not that high!
I think you are confused; the BayesElo error margin calculation does not work like that. In this example, you are comparing A with B and then A with C, which will of course have different error margins. The example you gave before with 100,000 games should not show an increase in error bars after adding a few 10-game sets from another pool of players. The pool that played fewer games will have larger error margins, and those who played 100,000 games will still have 1 Elo...
ZirconiumX
Posts: 1361
Joined: Sun Jul 17, 2011 11:14 am
Full name: Hannah Ravensloft

Re: Yet Another Testing Question

Post by ZirconiumX »

23 Elo with 95% confidence - Jesus' calculations are with 95% confidence and the BE results agree with them.

Matthew:out
tu ne cede malis, sed contra audentior ito
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: Yet Another Testing Question

Post by michiguel »

Daniel Shawul wrote:
michiguel wrote: In this context and for what you want to know, errors are meaningless. The errors you see represent the variation compared to the average of the pool, but what you need is the "head to head" error. You are interested in the parameter (EloEngine15 - EloEngine17), which has its own error, and could be _much_ smaller.

For instance, you can have ten engines (A, B, C, D, E, F, G, H, I, J) that played 100,000 games against each other, and each will have an error that would be less than 1. But, you incorporate a pgn with fewer games of ten other engines, which played only 10 games against the former ones and among themselves. All the errors will increase tremendously, because now the values against the average of the pool are uncertain.
Why? It still has the 100,000 head-to-head games, so why would it increase at all? Error margins do not have anything to do with the average opponent, AFAIK.
michiguel wrote: Suppose we are standing close to one another, 1 meter apart. We are 1 meter +/- 0.01 meter apart, and we are both in line with Buenos Aires, Argentina. Now, let's suppose I give my distance to Argentina and your distance to Argentina from Chicago. That would be 9,000,000 meters +/- 1000 meters, and possibly 9,000,001 meters +/- 1000 meters. The error between us is not that high!
I think you are confused; the BayesElo error margin calculation does not work like that. In this example, you are comparing A with B and then A with C, which will of course have different error margins. The example you gave before with 100,000 games should not show an increase in error bars after adding a few 10-game sets from another pool of players. The pool that played fewer games will have larger error margins, and those who played 100,000 games will still have 1 Elo...
No, I am not confused, and this is program independent. The error has to be based on the difference between the rating of the engine and a reference value, since the scale is relative. In the way things are generally calculated, that reference value is the average of the pool. So, if you have a rating of 100 +/- 10, that means that the distance between your rating and the average is 100 with an error of 10. Two engines could have a bigger error relative to the general pool, but a smaller error between them. So, this behavior is expected; otherwise, the question would be what the given error means.

And I think you can see it. For instance, if you have a pgn between A and B with equal results (draws, white and black wins, etc.), cut and pasted several times to increase the numbers, you get:

Code:

Rank Name       Elo    +    - games score oppo. draws 
   1 Engine_B     0    5    5  6336   50%     0   50% 
   2 Engine_A     0    5    5  6336   50%     0   50% 
If you now include extra games between C and D, plus C-A, C-B, D-A, and D-B, you get:

Code:

Rank Name       Elo    +    - games score oppo. draws 
   1 Engine_A     0    6    6  6360   50%     0   50% 
   2 Engine_B     0    6    6  6360   50%     0   50% 
   3 Engine_C     0   80   80    24   50%     0   50% 
   4 Engine_D     0   80   80    24   50%     0   50% 
The error did not go down, it went up, even though it should have stayed the same or gone down microscopically (EDIT: ... gone down if there were an absolute scale).

If you are interested in knowing the difference between two specific engines, you have to specifically recalculate the error between those. This of course can be done (but it is NOT directly related to the previous individual errors) and I think the info in BE is somehow included in the LOS calculation. But I do not know if it was done that way. It looks like it.
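As a rough illustration of why the error of a difference is not fixed by the two individual errors, here is a small Python sketch of the standard variance formula for a difference, Var(A - B) = Var(A) + Var(B) - 2 Cov(A, B). It is not how BayesElo or Ordo compute anything; the function name is hypothetical and the correlation values are invented just to show the trend.

```python
import math


def error_of_difference(err_a, err_b, correlation):
    """Error of (A - B) when the two rating errors are correlated."""
    variance = err_a ** 2 + err_b ** 2 - 2.0 * correlation * err_a * err_b
    return math.sqrt(variance)


# Error bars 7.5 and 5 are taken from the gauntlet list earlier in the thread;
# the correlation values are hypothetical, chosen only to show the trend.
for rho in (0.0, 0.5, 0.9):
    print(f"correlation {rho:.1f}: +/- {error_of_difference(7.5, 5.0, rho):.2f}")
```

With zero correlation this reduces to the square root of the sum of squares; the more the two ratings move together relative to the pool, the smaller the error between them becomes.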

Miguel
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Yet Another Testing Question

Post by Adam Hair »

Given both the self-play and the games against the pool of opponents, v8.52 appears to be better. After all, the second test does not differentiate between the two versions.

In reality, they appear to be quite close in strength. As has been noted, self-play accentuates the difference. More games played by v8.71 against the pool would make things more definitive. In the end, I would also lean towards the results of the second test, after v8.71 plays more games. Playing against a pool of opponents adds more noise (which requires more games to overcome), but it seems to give more reliable results.
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: Yet Another Testing Question

Post by Daniel Shawul »

You said it would increase _significantly_, which it didn't. The error margin is 5 vs 6 after you add a few more games from the second pool. You simply forgot that we still have the old games when adding the new pool. Of course I wouldn't expect the error margins to be the same with different sets of players, but they will not change significantly as you claimed. Here is a direct translation of your first example. With or without the second pool, the error margin still remains 1...

Code:

Rank Name       Elo    +    -   games   score oppo.   draws
   1 Player9    164    1    1  900100  50.00%   164  20.00%
   2 Player8    164    1    1  900100  50.00%   164  20.00%
   3 Player7    164    1    1  900100  50.00%   164  20.00%
   4 Player6    164    1    1  900100  50.00%   164  20.00%
   5 Player5    164    1    1  900100  50.00%   164  20.00%
   6 Player4    164    1    1  900100  50.00%   164  20.00%
   7 Player3    164    1    1  900100  50.00%   164  20.00%
   8 Player2    164    1    1  900100  50.00%   164  20.00%
   9 Player1    164    1    1  900100  50.00%   164  20.00%
  10 Player0    164    1    1  900100  50.00%   164  20.00%
  11 Player10  -163   52   52     190  28.90%     9  20.00%
  12 Player11  -163   52   52     190  28.90%     9  20.00%
  13 Player12  -163   52   52     190  28.90%     9  20.00%
  14 Player13  -163   52   52     190  28.90%     9  20.00%
  15 Player14  -163   52   52     190  28.90%     9  20.00%
  16 Player15  -163   52   52     190  28.90%     9  20.00%
  17 Player16  -163   52   52     190  28.90%     9  20.00%
  18 Player17  -163   52   52     190  28.90%     9  20.00%
  19 Player18  -163   52   52     190  28.90%     9  20.00%
  20 Player19  -163   52   52     190  28.90%     9  20.00%
Now compare this result to what you said:
michiguel wrote: All the errors will increase tremendously, because now the values against the average of the pool are uncertain.
Obviously it didn't increase at all, even though the average Elo of the pool decreased by 163. You forgot that we still have those 100,000 games between themselves; otherwise you wouldn't have brought up the distance example, whose relevance here I fail to see at all.

For completeness, here are the results ignoring one of the pools. You can see there isn't much of a difference for either pool, even though they have tremendously different numbers of games and Elos as well.

The first pool's error is the same:

Code:

Rank Name       Elo    +    -   games   score oppo.   draws
   1 Player0      0    1    1  900000  50.00%     0  20.00%
   2 Player1      0    1    1  900000  50.00%     0  20.00%
   3 Player2      0    1    1  900000  50.00%     0  20.00%
   4 Player3      0    1    1  900000  50.00%     0  20.00%
   5 Player4      0    1    1  900000  50.00%     0  20.00%
   6 Player5      0    1    1  900000  50.00%     0  20.00%
   7 Player6      0    1    1  900000  50.00%     0  20.00%
   8 Player7      0    1    1  900000  50.00%     0  20.00%
   9 Player8      0    1    1  900000  50.00%     0  20.00%
  10 Player9      0    1    1  900000  50.00%     0  20.00%
The second pool's error bar is 52 vs 53:

Code:

Rank Name       Elo    +    -   games   score oppo.   draws
   1 Player0    187   82   82     100  90.00%  -186  20.00%
   2 Player1    187   82   82     100  90.00%  -186  20.00%
   3 Player2    187   82   82     100  90.00%  -186  20.00%
   4 Player3    187   82   82     100  90.00%  -186  20.00%
   5 Player4    187   82   82     100  90.00%  -186  20.00%
   6 Player5    187   82   82     100  90.00%  -186  20.00%
   7 Player6    187   82   82     100  90.00%  -186  20.00%
   8 Player7    187   82   82     100  90.00%  -186  20.00%
   9 Player8    187   82   82     100  90.00%  -186  20.00%
  10 Player9    187   82   82     100  90.00%  -186  20.00%
  11 Player10  -186   53   53     190  28.90%    10  20.00%
  12 Player11  -186   53   53     190  28.90%    10  20.00%
  13 Player12  -186   53   53     190  28.90%    10  20.00%
  14 Player13  -186   53   53     190  28.90%    10  20.00%
  15 Player14  -186   53   53     190  28.90%    10  20.00%
  16 Player15  -186   53   53     190  28.90%    10  20.00%
  17 Player16  -186   53   53     190  28.90%    10  20.00%
  18 Player17  -186   53   53     190  28.90%    10  20.00%
  19 Player18  -186   53   53     190  28.90%    10  20.00%
  20 Player19  -186   53   53     190  28.90%    10  20.00%
Daniel