Do You really need 1000s of games for testing?

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Jouni
Posts: 3286
Joined: Wed Mar 08, 2006 8:15 pm

Do You really need 1000s of games for testing?

Post by Jouni »

Yes, I know the mathematics, but when I look at IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than ±5 points, I think. Maybe the secret is to select a good, broad selection of opponents? If you don't see an improvement in 200 games, it is not significant. 200 games is already something like 20,000 individual moves / positions...

Jouni
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Do You really need 1000s of games for testing?

Post by hgm »

The mathematics really tells you everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed, because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is. (E.g., what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where it changes by 5 or less. But in more than half the cases, it should change by 10 or more. If not, something fishy is going on with that rating list...
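
For what it is worth, here is a minimal sketch of where those numbers come from (my own illustration in Python, assuming independent games and a simple trinomial W/D/L model with roughly 50% draws):

Code: Select all

import math

def score_sd(games, draw_rate):
    """Standard deviation of a match's average score, assuming independent games."""
    per_game_var = (1.0 - draw_rate) / 4.0   # variance of one game's score (0, 0.5 or 1) around 0.5
    return math.sqrt(per_game_var / games)

# Around a 50% score, one unit of score corresponds to about 695 Elo:
elo_per_score = 400.0 / (math.log(10) * 0.25)

sd = score_sd(200, 0.50)            # 200 games, ~50% draws
print(round(100 * sd, 1), "%")      # ~2.5% score standard deviation
print(round(sd * elo_per_score))    # ~17 Elo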
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Do You really need 1000s of games for testing?

Post by Laskos »

Jouni wrote:Yes, I know the mathematics, but when I look at IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than ±5 points, I think. Maybe the secret is to select a good, broad selection of opponents? If you don't see an improvement in 200 games, it is not significant. 200 games is already something like 20,000 individual moves / positions...

Jouni
±3 Elo points at 2 standard deviations (95% confidence) is achieved in 25,000-35,000 games, depending on the draw rate (which I assume to be 30-50%); the more draws, the fewer games are needed. I cannot imagine a serious developer not testing for 3 Elo point improvements.
For tasks such as testing TBs vs. no TBs, THIS is the order of magnitude of the benefit or harm. Also, testing many Ippos against Rybka 4 often needs 10,000+ games for a LOS of 95%. Those are the rules of statistics. The result is not measured in positions; it is simply a trinomial of W/D/L.
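
A back-of-the-envelope sketch of those numbers (my own illustration in Python, assuming independent games and the simple trinomial W/D/L model):

Code: Select all

import math

def games_needed(elo_margin, draw_rate, z=2.0):
    """Games needed so that z standard deviations of the match score
    correspond to elo_margin Elo points, around a 50% score."""
    elo_per_score = 400.0 / (math.log(10) * 0.25)       # ~695 Elo per unit of score
    target_score_sd = (elo_margin / z) / elo_per_score  # allowed SD of the match score
    per_game_var = (1.0 - draw_rate) / 4.0              # trinomial variance of one game
    return per_game_var / target_score_sd ** 2

print(round(games_needed(3, 0.30)))  # ~37,500 games at 30% draws
print(round(games_needed(3, 0.50)))  # ~27,000 games at 50% draws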

What you are trying to say is that a confidence of 30-50%, achieved with much smaller samples, satisfies you.

Kai
mhull
Posts: 13447
Joined: Wed Mar 08, 2006 9:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

Re: Do You really need 1000s of games for testing?

Post by mhull »

Jouni wrote:Yes, I know the mathematics, but when I look at IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than ±5 points, I think. Maybe the secret is to select a good, broad selection of opponents? If you don't see an improvement in 200 games, it is not significant. 200 games is already something like 20,000 individual moves / positions...

Jouni
To illustrate that a few games are not good enough, compare Crafty 23.3's performance in the 22nd and 23rd (Division 2) CCRL events: wide fluctuation against the same opponents.

Not enough games to nail down even its strength relative to any one program.
Matthew Hull
brianr
Posts: 536
Joined: Thu Mar 09, 2006 3:01 pm

Re: Do You really need 1000s of games for testing?

Post by brianr »

Here are some troubling test results (note the switch in LOS) from self-play, so clearly at least several opponents are better than none:

Code: Select all

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker805x64    11   18   18   246   54%   -11   34%
   2 Tinker801x64   -11   18   18   246   46%    11   34%
ResultSet-EloRating>los
              Ti Ti
Tinker805x64     88
Tinker801x64  11

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker805x64     4   15   15   377   51%    -4   35%
   2 Tinker801x64    -4   15   15   377   49%     4   35%
ResultSet-EloRating>los
              Ti Ti
Tinker805x64     69
Tinker801x64  30

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker801x64     4    7    7  1532   51%    -4   35%
   2 Tinker805x64    -4    7    7  1532   49%     4   35%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     87
Tinker805x64  12

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker801x64     3    6    6  2128   51%    -3   36%
   2 Tinker805x64    -3    6    6  2128   49%     3   36%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     86
Tinker805x64  13

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker801x64     3    5    5  3157   51%    -3   38%
   2 Tinker805x64    -3    5    5  3157   49%     3   38%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     84
Tinker805x64  15

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker801x64     1    4    4  4119   50%    -1   38%
   2 Tinker805x64    -1    4    4  4119   50%     1   38%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     61
Tinker805x64  38

Rank Name           Elo    +    - games score oppo. draws
   1 Tinker801x64     1    4    4  5071   50%    -1   38%
   2 Tinker805x64    -1    4    4  5071   50%     1   38%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     73
Tinker805x64  26

Here is some data against a couple of other opponents:

Code: Select all

Rank Name               Elo    +    - games score oppo. draws
   1 C                   46   13   13  1233   60%   -26   28%
   2 A                    3   13   13  1250   54%   -26   26%
   3 Tinker801x64       -20   15   15   869   44%    25   27%
   4 Tinker805x64       -30   12   12  1614   42%    25   27%
ResultSet-EloRating>los
                  C  A  Ti Ti
C                    99 99 99
A                  0    97 99
Tinker801x64       0  2    77
Tinker805x64       0  0 22
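
(For reference, the LOS numbers above are, as far as I understand, essentially the probability that one engine is stronger given only the win/loss counts. A common approximation, which may differ slightly from what the rating tool itself computes, is sketched below in Python:)

Code: Select all

import math

def los(wins, losses):
    """Approximate likelihood of superiority of the first engine over the second,
    using only wins and losses (draws carry no information about who is stronger)."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# e.g. a hypothetical 20-win / 10-loss / 20-draw mini-match:
print(round(100 * los(20, 10)))  # ~97% chance the first engine is really stronger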
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: Do You really need 1000s of games for testing?

Post by bob »

Jouni wrote:Yes, I know the mathematics, but when I look at IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than ±5 points, I think. Maybe the secret is to select a good, broad selection of opponents? If you don't see an improvement in 200 games, it is not significant. 200 games is already something like 20,000 individual moves / positions...

Jouni
The math does not lie. The answer to your question is "yes, you do need thousands of games," unless the two programs are so far apart in skill level that a small number of games is enough to establish which is better. But if you want a high level of accuracy in measuring the two Elos, you will _still_ need a ton of games...
hgm
Posts: 27795
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Do You really need 1000s of games for testing?

Post by hgm »

I just aborted a test after 50 games, because one side was leading 35-15. I test positions, not engines, but that does not really matter for the math. What I did was handicap the winning side by an extra Pawn and restart the run. With the additional Pawn odds the position should be more equal, and the result of the test clearer. Should the 35-15 have been a statistical fluke, I will know, because with the additional Pawn odds awarded unjustly, that side should now lose heavily, and I would only have wasted a few games, not arrived at a wrong conclusion.
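
(As a rough sketch, my own numbers: treating the 35-15 as pure wins and losses, the chance of a lead that lopsided after 50 games between genuinely equal sides is small:)

Code: Select all

from math import comb

# Probability of scoring 35 or more out of 50 by luck alone, if both sides
# really were equal (a plain binomial tail; draws would only make it less likely).
p_fluke = sum(comb(50, k) for k in range(35, 51)) / 2 ** 50
print(round(p_fluke, 4))  # ~0.0033, i.e. about a 0.3% chance the lead is a fluke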
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Do You really need 1000s of games for testing?

Post by Uri Blass »

hgm wrote:The mathematics really tells you everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed, because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is. (E.g., what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where it changes by 5 or less. But in more than half the cases, it should change by 10 or more. If not, something fishy is going on with that rating list...
I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in math tells me that it is impossible.

You make assumptions, like the assumption that the games are independent (if you play the same opening with both colors, this assumption is not correct).

Nothing in math tells me that it is impossible to have 200 positions such that, every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on those 200 positions.

I do not know of 200 positions for which this holds, but I cannot prove that you cannot build 200 positions for which it does.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Do You really need 1000s of games for testing?

Post by Laskos »

Uri Blass wrote:
hgm wrote:The mathematics really tells you everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed, because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is. (E.g., what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where it changes by 5 or less. But in more than half the cases, it should change by 10 or more. If not, something fishy is going on with that rating list...
I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in math tells me that it is impossible.

You make assumptions, like the assumption that the games are independent (if you play the same opening with both colors, this assumption is not correct).

Nothing in math tells me that it is impossible to have 200 positions such that, every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on those 200 positions.

I do not know of 200 positions for which this holds, but I cannot prove that you cannot build 200 positions for which it does.
Uri, what you mean by that is systematic error, which is much worse than statistical error. You can get a 2 Elo point difference every time in a 400-game match, and it means nothing until you eliminate exactly that systematic error. You can even get a correct result despite the systematic error, but you cannot know when. For better or worse, most engines exhibit sufficiently random behaviour, even at fixed time or fixed depth controls, to get rid of systematic errors; that is, with a set of balanced, random opening positions. I am even in favour of playing each position with reversed colours too, as this eliminates the effect of unbalanced positions.

ONE CAN NOT DO BETTER THAN THE STATISTICAL ERROR MARGINS.
ONE CAN DO WORSE THAN THAT.
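
To illustrate the reversed-colours point with a toy calculation (my own sketch, using the logistic expected-score model and a made-up opening that favours White by 100 Elo):

Code: Select all

def expected_score(elo_diff):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

true_edge = 2       # engine A really is 2 Elo stronger than engine B
opening_bias = 100  # this particular opening favours White by 100 Elo

# A gets the opening only as White: the result mostly measures the opening.
one_sided = expected_score(true_edge + opening_bias)

# A plays the opening once as White and once as Black; average the pair.
paired = 0.5 * (expected_score(true_edge + opening_bias) +
                expected_score(true_edge - opening_bias))

print(round(one_sided, 4))  # ~0.6427: dominated by the opening, not by A's 2 Elo edge
print(round(paired, 4))     # ~0.5027: the bias cancels (to first order), the tiny edge survives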

Kai
Uri Blass
Posts: 10282
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Do You really need 1000s of games for testing?

Post by Uri Blass »

Laskos wrote:
Uri Blass wrote:
hgm wrote:The mathematics really tells you everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed, because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is. (E.g., what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where it changes by 5 or less. But in more than half the cases, it should change by 10 or more. If not, something fishy is going on with that rating list...
I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in math tells me that it is impossible.

You make assumptions, like the assumption that the games are independent (if you play the same opening with both colors, this assumption is not correct).

Nothing in math tells me that it is impossible to have 200 positions such that, every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on those 200 positions.

I do not know of 200 positions for which this holds, but I cannot prove that you cannot build 200 positions for which it does.
Uri, what you mean by that is systematic error, which is much worse than statistical error. You can get a 2 Elo point difference every time in a 400-game match, and it means nothing until you eliminate exactly that systematic error. You can even get a correct result despite the systematic error, but you cannot know when. For better or worse, most engines exhibit sufficiently random behaviour, even at fixed time or fixed depth controls, to get rid of systematic errors; that is, with a set of balanced, random opening positions. I am even in favour of playing each position with reversed colours too, as this eliminates the effect of unbalanced positions.

ONE CAN NOT DO BETTER THAN THE STATISTICAL ERROR MARGINS.
ONE CAN DO WORSE THAN THAT.

Kai
My point is not that I know a way to do better, but that the impossibility is not something that has been proved mathematically, and in theory it is possible to have 200 positions such that a match based on them correctly tells us which program is better for a real difference of at least 2 Elo (even if the real difference is calculated based on some constant big set of 1,000,000 positions).