## Do You really need 1000s of games for testing?

Discussion of chess software programming and technical issues.

Moderators: bob, hgm, Harvey Williamson

Jouni
Posts: 2053
Joined: Wed Mar 08, 2006 7:15 pm

### Do You really need 1000s of games for testing?

Yes, I know the mathematics, but when I look at IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than ±5 points, I think. Maybe the secret is to select a good/broad selection of opponents? If you don't see an improvement in 200 games, it's not significant. 200 games is already something like 20,000 individual moves/positions...

Jouni

hgm
Posts: 23871
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

### Re: Do You really need 1000s of games for testing?

The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed, because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is. (E.g., what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where the rating changes by 5 or less. But in more than half the cases, it should change by 10 or more. If not, something fishy is going on with that rating list...
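The 2.5% / 17-Elo figure above can be reproduced with a short sketch (a minimal illustration assuming independent games, an even expected score, and a 50% draw rate; `elo_error` is a made-up helper name, not anything from an actual rating tool):

```python
import math

def elo_error(n_games, draw_ratio=0.5, score=0.5):
    """One standard deviation of the Elo estimate after n_games,
    assuming independent games with the given expected score."""
    # Per-game variance of the trinomial score (win=1, draw=0.5, loss=0).
    var = score * (1 - score) - draw_ratio / 4
    sd_score = math.sqrt(var / n_games)
    # Slope of the Elo curve at this score: d(Elo)/d(score).
    slope = 400 / (math.log(10) * score * (1 - score))
    return sd_score * slope

print(round(elo_error(200), 1))  # -> 17.4, close to the ~17 Elo quoted above
```

Note that the error only shrinks with the square root of the game count, which is why going from 200 to 2000 games merely cuts it by a factor of about 3.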

Posts: 9725
Joined: Wed Jul 26, 2006 8:21 pm

### Re: Do You really need 1000s of games for testing?

Jouni wrote:Yes, I know the mathematics, but when I look at IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than ±5 points, I think. Maybe the secret is to select a good/broad selection of opponents? If you don't see an improvement in 200 games, it's not significant. 200 games is already something like 20,000 individual moves/positions...

Jouni
±3 Elo points at 2 standard deviations (95% confidence) is achieved in 25,000-35,000 games, depending on the draw rate (which I assume to be 30-50%): the more draws, the fewer games are needed. I cannot imagine a serious developer not testing for 3 Elo point improvements.
For tasks such as testing TBs/no TBs, THIS is the order of magnitude of the benefit or harm. Also, testing many Ippos against Rybka 4 often needs 10,000+ games for a LOS of 95%. Those are the rules of statistics. The result is not measured in positions; it is simply a trinomial of W/D/L.

What you are trying to say is that a confidence of 30-50%, achieved with much smaller samples, satisfies you.

Kai
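The game counts Kai quotes can be checked by inverting the error formula (a sketch under the same independence assumptions; `games_needed` is a hypothetical name, and the exact counts shift slightly with the draw rate and the rounding of the 95% z-value):

```python
import math

def games_needed(elo_margin, draw_ratio, z=2.0):
    """Games required so that z standard deviations of the Elo
    estimate fit inside elo_margin, around an even score."""
    slope = 400 / (math.log(10) * 0.25)   # d(Elo)/d(score) at a 50% score
    sd_score = elo_margin / (z * slope)   # required score standard deviation
    var = 0.25 - draw_ratio / 4           # per-game score variance
    return math.ceil(var / sd_score ** 2)

# More draws -> lower per-game variance -> fewer games for the same margin.
print(games_needed(3, 0.5))  # roughly 27,000 games at a 50% draw rate
print(games_needed(3, 0.3))  # roughly 38,000 games at a 30% draw rate
```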

mhull
Posts: 12538
Joined: Wed Mar 08, 2006 8:02 pm
Location: Dallas, Texas
Full name: Matthew Hull

### Re: Do You really need 1000s of games for testing?

Jouni wrote:Yes, I know the mathematics, but when I look at IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than ±5 points, I think. Maybe the secret is to select a good/broad selection of opponents? If you don't see an improvement in 200 games, it's not significant. 200 games is already something like 20,000 individual moves/positions...

Jouni
To illustrate that a small number of games is not good enough, compare Crafty 23.3's performance in the 22nd and 23rd (Division 2) CCRL events: wide fluctuations against the same opponents.

Not enough games to nail down even its strength relative to any one program.
Matthew Hull

brianr
Posts: 359
Joined: Thu Mar 09, 2006 2:01 pm

### Re: Do You really need 1000s of games for testing?

Here are some troubling test results from self-play (note how the LOS switches), so clearly several opponents are better than none:

Code: Select all

``````Rank Name           Elo    +    - games score oppo. draws
1 Tinker805x64    11   18   18   246   54%   -11   34%
2 Tinker801x64   -11   18   18   246   46%    11   34%
ResultSet-EloRating>los
              Ti Ti
Tinker805x64     88
Tinker801x64  11

Rank Name           Elo    +    - games score oppo. draws
1 Tinker805x64     4   15   15   377   51%    -4   35%
2 Tinker801x64    -4   15   15   377   49%     4   35%
ResultSet-EloRating>los
              Ti Ti
Tinker805x64     69
Tinker801x64  30

Rank Name           Elo    +    - games score oppo. draws
1 Tinker801x64     4    7    7  1532   51%    -4   35%
2 Tinker805x64    -4    7    7  1532   49%     4   35%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     87
Tinker805x64  12

Rank Name           Elo    +    - games score oppo. draws
1 Tinker801x64     3    6    6  2128   51%    -3   36%
2 Tinker805x64    -3    6    6  2128   49%     3   36%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     86
Tinker805x64  13

Rank Name           Elo    +    - games score oppo. draws
1 Tinker801x64     3    5    5  3157   51%    -3   38%
2 Tinker805x64    -3    5    5  3157   49%     3   38%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     84
Tinker805x64  15

Rank Name           Elo    +    - games score oppo. draws
1 Tinker801x64     1    4    4  4119   50%    -1   38%
2 Tinker805x64    -1    4    4  4119   50%     1   38%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     61
Tinker805x64  38

Rank Name           Elo    +    - games score oppo. draws
1 Tinker801x64     1    4    4  5071   50%    -1   38%
2 Tinker805x64    -1    4    4  5071   50%     1   38%
ResultSet-EloRating>los
              Ti Ti
Tinker801x64     73
Tinker805x64  26

``````
Here is some data against a couple of other opponents:

Code: Select all

``````Rank Name               Elo    +    - games score oppo. draws
1 C                   46   13   13  1233   60%   -26   28%
2 A                    3   13   13  1250   54%   -26   26%
3 Tinker801x64       -20   15   15   869   44%    25   27%
4 Tinker805x64       -30   12   12  1614   42%    25   27%
ResultSet-EloRating>los
                   C  A  Ti Ti
C                    99 99 99
A                  0    97 99
Tinker801x64       0  2    77
Tinker805x64       0  0 22``````
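LOS tables like the ones above can be approximated from the win/loss counts alone (a sketch using the normal approximation; draws drop out, and `los` here is a made-up helper, not the actual bayeselo implementation):

```python
import math

def los(wins, losses):
    """Likelihood of superiority: probability that the first engine is
    genuinely stronger, given the decisive games (draws carry no
    information in this approximation)."""
    if wins + losses == 0:
        return 0.5
    # Normal approximation to the binomial sign test on decisive games.
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))
```

For example, a 35-15 decisive split gives a LOS above 99%, while an even 10-10 split gives exactly 50%.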

bob
Posts: 20664
Joined: Mon Feb 27, 2006 6:30 pm
Location: Birmingham, AL

### Re: Do You really need 1000s of games for testing?

Jouni wrote:Yes, I know the mathematics, but when I look at IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than ±5 points, I think. Maybe the secret is to select a good/broad selection of opponents? If you don't see an improvement in 200 games, it's not significant. 200 games is already something like 20,000 individual moves/positions...

Jouni
The math does not lie. The answer to your question is "yes, you do need thousands of games", unless the two programs are so far apart in skill that a small number of games is enough to establish which is better. But if you want a high level of accuracy in measuring the two Elos, you will _still_ need a ton of games...

hgm
Posts: 23871
Joined: Fri Mar 10, 2006 9:06 am
Location: Amsterdam
Full name: H G Muller
Contact:

### Re: Do You really need 1000s of games for testing?

I just aborted a test after 50 games, because one side was leading 35-15. I test positions, not engines, but that does not really matter for the math. What I did was to handicap the winning side by an extra Pawn, and restart the run. With the additional Pawn odds the position should be more equal, and the result of the test clearer. Should the 35-15 have been a statistical fluke, I will know, because with the additional Pawn odds awarded unjustly, that side should now lose heavily, and I would only have wasted a few games, not arrived at a wrong conclusion.
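How unlikely is a 35-15 lead by pure chance? A quick exact-binomial sketch (treating all 50 games as decisive, which is an assumption, since part of that 35-15 score may come from draws; `fluke_probability` is a made-up name):

```python
from math import comb

def fluke_probability(score_a, score_b):
    """Two-sided probability of a split at least this lopsided
    between two actually equal sides (exact binomial, p = 0.5)."""
    n = score_a + score_b
    k = max(score_a, score_b)
    # Upper tail: chance of k or more successes out of n fair coin flips.
    tail = sum(comb(n, j) for j in range(k, n + 1)) / 2 ** n
    return min(1.0, 2 * tail)

print(fluke_probability(35, 15))  # well under 1% -- hardly a fluke
```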

Uri Blass
Posts: 8642
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

### Re: Do You really need 1000s of games for testing?

hgm wrote:The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed, because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is. (E.g., what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where the rating changes by 5 or less. But in more than half the cases, it should change by 10 or more. If not, something fishy is going on with that rating list...
I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in math tells me that it is impossible.

You make assumptions, such as the assumption that the games are independent (if you play the same opening with both colours, this assumption is not correct).

Nothing in math tells me that it is impossible to have 200 positions such that, every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on those 200 positions.

I do not know of 200 positions for which this holds, but I cannot prove that you cannot build 200 positions for which it holds.

Posts: 9725
Joined: Wed Jul 26, 2006 8:21 pm

### Re: Do You really need 1000s of games for testing?

Uri Blass wrote:
hgm wrote:The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed, because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is. (E.g., what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where the rating changes by 5 or less. But in more than half the cases, it should change by 10 or more. If not, something fishy is going on with that rating list...
I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in math tells me that it is impossible.

You make assumptions, such as the assumption that the games are independent (if you play the same opening with both colours, this assumption is not correct).

Nothing in math tells me that it is impossible to have 200 positions such that, every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on those 200 positions.

I do not know of 200 positions for which this holds, but I cannot prove that you cannot build 200 positions for which it holds.
Uri, what you mean there is the systematic error, which is much worse than the statistical error. You can get a 2 Elo point difference every time in a 400-game match, and it means nothing until you eliminate exactly that systematic error. You can even get a correct result despite the systematic error, but you cannot know when. For better or worse, most engines exhibit sufficiently random behaviour, even at fixed time or fixed depth controls, to get rid of systematic errors, provided you use a set of balanced, random opening positions. I am even in favour of playing each position with reversed colours too, as this cancels the bias from unbalanced positions.

ONE CANNOT DO BETTER THAN THE STATISTICAL ERROR MARGINS.
ONE CAN DO WORSE THAN THAT.

Kai
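Kai's argument for playing each opening with reversed colours can be made concrete with a toy linear model of the expected score (an illustration with made-up names; the real score curve is nonlinear, which this sketch ignores): if a position favours one colour by some bias, the two-game pair cancels the bias and keeps only the strength edge.

```python
def paired_score(strength_edge, position_bias):
    """Expected two-game score for engine A when one opening is played
    with colours reversed (toy linear model of the expected score)."""
    as_white = 0.5 + strength_edge + position_bias  # A gets the favoured side
    as_black = 0.5 + strength_edge - position_bias  # colours reversed
    return as_white + as_black                      # bias cancels: 1 + 2*edge

# A heavily unbalanced opening yields the same paired score as a balanced one.
print(round(paired_score(0.05, 0.20), 6))  # 1.1
print(round(paired_score(0.05, 0.00), 6))  # 1.1
```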

Uri Blass
Posts: 8642
Joined: Wed Mar 08, 2006 11:37 pm
Location: Tel-Aviv Israel

### Re: Do You really need 1000s of games for testing?

Uri Blass wrote:
hgm wrote:The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed, because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is. (E.g., what does "usually" mean, and how many engines were included in this observation?) With 200 games the score standard deviation should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where the rating changes by 5 or less. But in more than half the cases, it should change by 10 or more. If not, something fishy is going on with that rating list...
I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in math tells me that it is impossible.

You make assumptions, such as the assumption that the games are independent (if you play the same opening with both colours, this assumption is not correct).

Nothing in math tells me that it is impossible to have 200 positions such that, every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on those 200 positions.

I do not know of 200 positions for which this holds, but I cannot prove that you cannot build 200 positions for which it holds.
Uri, what you mean there is the systematic error, which is much worse than the statistical error. You can get a 2 Elo point difference every time in a 400-game match, and it means nothing until you eliminate exactly that systematic error. You can even get a correct result despite the systematic error, but you cannot know when. For better or worse, most engines exhibit sufficiently random behaviour, even at fixed time or fixed depth controls, to get rid of systematic errors, provided you use a set of balanced, random opening positions. I am even in favour of playing each position with reversed colours too, as this cancels the bias from unbalanced positions.

ONE CANNOT DO BETTER THAN THE STATISTICAL ERROR MARGINS.
ONE CAN DO WORSE THAN THAT.

Kai
My point is not that I know a way to do better, but that the impossibility is not something proved mathematically: in theory it is possible to have 200 positions such that a match based on them correctly tells us which program is better whenever the real difference is at least 2 Elo (even if the real difference is calculated based on some constant big set of 1,000,000 positions).