## Do You really need 1000s of games for testing?


### Do You really need 1000s of games for testing?

Yes, I know the mathematics, but when I look at the IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than +-5 points, I think. Maybe the secret is to select a good, broad selection of opponents? If you don't see an improvement in 200 games, it's not significant. 200 games is already something like 20,000 individual moves/positions...

Jouni


- hgm
**Posts:** 23498 **Joined:** Fri Mar 10, 2006 9:06 am **Location:** Amsterdam **Full name:** H G Muller

### Re: Do You really need 1000s of games for testing?

The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.

I don't know how solid your observation is (e.g. what "usually" means, and how many engines were included in it). With 200 games the standard deviation of the score should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where the rating changes by 5 or less. But in more than half the cases it should change by 10 or more. If not, something fishy is going on with that rating list...
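The figures in this post can be checked with a few lines of Python. This is only a minimal sketch: it assumes evenly matched engines and a 50% draw rate (neither is stated in the post), treats the rating error as normally distributed, and linearises the logistic Elo curve around a 50% score. The function names are made up for illustration.

```python
import math

def score_sd(games, draw_rate):
    """Standard deviation of the match score fraction for evenly matched engines."""
    # With win=1, draw=0.5, loss=0 and a mean of 0.5, the per-game variance is (1 - draw_rate) / 4.
    return math.sqrt((1.0 - draw_rate) / 4.0 / games)

def elo_per_score_fraction():
    """Slope of the logistic Elo curve at a 50% score (Elo per unit of score)."""
    return 1600.0 / math.log(10.0)

def prob_change_at_least(elo_change, sd_elo):
    """Two-sided probability that a normal error with this SD is elo_change or larger."""
    return math.erfc(elo_change / (sd_elo * math.sqrt(2.0)))

sd = score_sd(200, 0.5)                  # assumed 50% draws
sd_elo = sd * elo_per_score_fraction()
print(f"score SD: {sd:.2%}, Elo SD: {sd_elo:.1f}")
print(f"P(|error| >= 10 Elo): {prob_change_at_least(10, sd_elo):.2f}")
```

A lower assumed draw rate makes the standard deviation larger, so the 2.5% figure sits at the optimistic end for 200 games.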


### Re: Do You really need 1000s of games for testing?

> Jouni wrote:
> Yes, I know the mathematics, but when I look at the IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than +-5 points, I think. Maybe the secret is to select a good, broad selection of opponents? If you don't see an improvement in 200 games, it's not significant. 200 games is already something like 20,000 individual moves/positions...
>
> Jouni

An error margin of +/- 3 Elo points at 2 standard deviations (95% confidence) takes 25,000-35,000 games, depending on the draw rate (which I assume to be 30-50%); the more draws, the fewer games are needed. I cannot imagine a serious developer not testing for 3 Elo point improvements.

For tasks such as testing TBs against no TBs, THIS is the order of magnitude of the benefit or harm. Also, testing many Ippos against Rybka 4 often needs 10,000+ games for a LOS of 95%. Those are the rules of statistics. The result is not measured in positions; it is simply a trinomial of W/D/L.

What you are trying to say is that a confidence of 30-50%, achieved with much smaller samples, satisfies you.

Kai
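The game counts quoted in this post can be reproduced approximately with the following sketch. It assumes evenly matched engines, a normal approximation, and the logistic Elo curve linearised at a 50% score; `games_needed` is a hypothetical helper, not part of any rating tool.

```python
import math

ELO_PER_SCORE = 1600.0 / math.log(10.0)  # slope of the logistic Elo curve at a 50% score

def games_needed(margin_elo, draw_rate, sigmas=2.0):
    """Games needed so that `sigmas` standard deviations equal `margin_elo` Elo."""
    per_game_sd = math.sqrt((1.0 - draw_rate) / 4.0)   # evenly matched engines assumed
    target_sd = margin_elo / ELO_PER_SCORE / sigmas    # allowed SD of the score fraction
    return math.ceil((per_game_sd / target_sd) ** 2)

for draw_rate in (0.30, 0.40, 0.50):
    print(f"draws {draw_rate:.0%}: about {games_needed(3, draw_rate):,} games")
```

Under these assumptions the results land in the tens of thousands of games, the same range the post gives, and the count falls as the draw rate rises.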

### Re: Do You really need 1000s of games for testing?

> Jouni wrote:
> Yes, I know the mathematics, but when I look at the IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than +-5 points, I think. Maybe the secret is to select a good, broad selection of opponents? If you don't see an improvement in 200 games, it's not significant. 200 games is already something like 20,000 individual moves/positions...
>
> Jouni

To illustrate that few games are not good enough, compare Crafty 23.3's performance in the 22nd and 23rd (Division 2) CCRL events. Wide fluctuation against the same opponents.

Not enough games to nail down even its relative strength to any one program.

Matthew Hull

### Re: Do You really need 1000s of games for testing?

Here are some troubling tests from self-play (note the switch in LOS), so clearly several opponents are better than none:

```
Rank Name          Elo    +    - games score oppo. draws
   1 Tinker805x64   11   18   18   246   54%   -11   34%
   2 Tinker801x64  -11   18   18   246   46%    11   34%
ResultSet-EloRating>los
             Ti Ti
Tinker805x64    88
Tinker801x64 11

Rank Name          Elo    +    - games score oppo. draws
   1 Tinker805x64    4   15   15   377   51%    -4   35%
   2 Tinker801x64   -4   15   15   377   49%     4   35%
ResultSet-EloRating>los
             Ti Ti
Tinker805x64    69
Tinker801x64 30

Rank Name          Elo    +    - games score oppo. draws
   1 Tinker801x64    4    7    7  1532   51%    -4   35%
   2 Tinker805x64   -4    7    7  1532   49%     4   35%
ResultSet-EloRating>los
             Ti Ti
Tinker801x64    87
Tinker805x64 12

Rank Name          Elo    +    - games score oppo. draws
   1 Tinker801x64    3    6    6  2128   51%    -3   36%
   2 Tinker805x64   -3    6    6  2128   49%     3   36%
ResultSet-EloRating>los
             Ti Ti
Tinker801x64    86
Tinker805x64 13

Rank Name          Elo    +    - games score oppo. draws
   1 Tinker801x64    3    5    5  3157   51%    -3   38%
   2 Tinker805x64   -3    5    5  3157   49%     3   38%
ResultSet-EloRating>los
             Ti Ti
Tinker801x64    84
Tinker805x64 15

Rank Name          Elo    +    - games score oppo. draws
   1 Tinker801x64    1    4    4  4119   50%    -1   38%
   2 Tinker805x64   -1    4    4  4119   50%     1   38%
ResultSet-EloRating>los
             Ti Ti
Tinker801x64    61
Tinker805x64 38

Rank Name          Elo    +    - games score oppo. draws
   1 Tinker801x64    1    4    4  5071   50%    -1   38%
   2 Tinker805x64   -1    4    4  5071   50%     1   38%
ResultSet-EloRating>los
             Ti Ti
Tinker801x64    73
Tinker805x64 26
```

Here is some data against a couple of other opponents:

```
Rank Name          Elo    +    - games score oppo. draws
   1 C              46   13   13  1233   60%   -26   28%
   2 A               3   13   13  1250   54%   -26   26%
   3 Tinker801x64  -20   15   15   869   44%    25   27%
   4 Tinker805x64  -30   12   12  1614   42%    25   27%
ResultSet-EloRating>los
               C  A Ti Ti
C                99 99 99
A              0    97 99
Tinker801x64   0  2    77
Tinker805x64   0  0 22
```
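For readers unfamiliar with the LOS (likelihood of superiority) figures in the output above, here is a rough sketch of the commonly used normal approximation, which depends only on wins and losses. It is an assumption that this approximates what the rating tool reports; the tool's own calculation may differ, and the win/loss counts below are hypothetical because the tables give only score and draw percentages.

```python
import math

def los(wins, losses):
    """Likelihood of superiority via the normal approximation; draws do not enter."""
    if wins + losses == 0:
        return 0.5  # no decisive games, no information either way
    z = (wins - losses) / math.sqrt(2.0 * (wins + losses))
    return 0.5 * (1.0 + math.erf(z))

# Hypothetical counts, not taken from the tables above.
print(f"short run: {los(90, 72):.2f}")
print(f"long run:  {los(1600, 1550):.2f}")
```

Because only the win/loss difference relative to its square root matters, a small lead in a short run can produce a high LOS that later evaporates or reverses, as the self-play runs show.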

### Re: Do You really need 1000s of games for testing?

> Jouni wrote:
> Yes, I know the mathematics, but when I look at the IPON or CCRL ratings I have the feeling that an engine's rating doesn't change much from 200 games to 2000 games! Usually no more than +-5 points, I think. Maybe the secret is to select a good, broad selection of opponents? If you don't see an improvement in 200 games, it's not significant. 200 games is already something like 20,000 individual moves/positions...
>
> Jouni

The math does not lie. The answer to your question is "yes, you do need thousands of games", unless the two programs are so far apart in skill that a small number of games is enough to establish which is better. But if you want a high level of accuracy in measuring the two Elos, you will _still_ need a ton of games...

- hgm

### Re: Do You really need 1000s of games for testing?

I just aborted a test after 50 games, because one side was leading 35-15. I test positions, not engines, but that does not really matter for the math. What I did was to handicap the winning side by an extra Pawn and restart the run. With the additional Pawn odds the position should be more equal, and the result of the test clearer. Should the 35-15 have been a statistical fluke, then I will know, because with the additional Pawn odds awarded unjustly, that side should now lose heavily, and I would only have wasted a few games, not arrived at a wrong conclusion.
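How unlikely a 35-15 lead is between equal sides can be estimated with an exact binomial tail. This sketch assumes that 35-15 means 35 wins and 15 losses with no draws, which the post does not say; counting draws as half points would change the numbers. `tail_probability` is a hypothetical helper name.

```python
from math import comb

def tail_probability(successes, trials, p=0.5):
    """Exact probability of at least `successes` out of `trials` with success chance p."""
    return sum(comb(trials, k) * p**k * (1 - p)**(trials - k)
               for k in range(successes, trials + 1))

# Assumption: 35 wins, 15 losses, no draws, and both sides equally strong.
p_fluke = tail_probability(35, 50)
print(f"P(35 or more out of 50 by chance): {p_fluke:.4f}")
```

A lead this lopsided is the case mentioned elsewhere in the thread where the two sides are far enough apart that a small number of games is already informative.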

### Re: Do You really need 1000s of games for testing?

> hgm wrote:
> The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.
>
> I don't know how solid your observation is (e.g. what "usually" means, and how many engines were included in it). With 200 games the standard deviation of the score should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where the rating changes by 5 or less. But in more than half the cases it should change by 10 or more. If not, something fishy is going on with that rating list...

I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in the math tells me that it is impossible.

You make assumptions, such as the assumption that the games are independent (if you play the same opening with both colors, this assumption is not correct).

Nothing in the math tells me that it is impossible to have 200 positions such that, every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on those 200 positions.

I do not know 200 positions for which this holds, but I cannot prove that you cannot construct 200 positions for which it holds.

### Re: Do You really need 1000s of games for testing?

> Uri Blass wrote:
> > hgm wrote:
> > The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.
> >
> > I don't know how solid your observation is (e.g. what "usually" means, and how many engines were included in it). With 200 games the standard deviation of the score should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where the rating changes by 5 or less. But in more than half the cases it should change by 10 or more. If not, something fishy is going on with that rating list...
>
> I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in the math tells me that it is impossible.
>
> You make assumptions, such as the assumption that the games are independent (if you play the same opening with both colors, this assumption is not correct).
>
> Nothing in the math tells me that it is impossible to have 200 positions such that, every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on those 200 positions.
>
> I do not know 200 positions for which this holds, but I cannot prove that you cannot construct 200 positions for which it holds.

Uri, what you mean by that is the systematic error, which is much worse than the statistical error. You can get a 2 Elo point difference every time in a 400-game match, which means nothing until you eliminate exactly that systematic error. You can even get a correct result with the systematic error, but you cannot know when. For better or worse, most engines exhibit sufficiently random behaviour, even at fixed time or fixed depth controls, to get rid of systematic errors. That is, with a set of balanced, random opening positions. I am even in favour of playing each position with reversed colours too, as this eliminates the effect of unbalanced positions.

ONE CAN NOT DO BETTER THAN THE STATISTICAL ERROR MARGINS.

ONE CAN DO WORSE THAN THAT.

Kai
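The remark about reversed colours can be illustrated with a toy model. The sketch below assumes, purely for illustration, that an engine's expected score is 50% plus its true edge plus or minus an additive opening bias for White; real openings need not behave this simply, and all names and numbers are hypothetical.

```python
def expected_score(strength_edge, white_bias, plays_white):
    """Expected score of engine A under a simple additive model (an assumption)."""
    side = white_bias if plays_white else -white_bias
    return 0.5 + strength_edge + side

def pair_score(strength_edge, white_bias):
    """Average expected score of A over one position played with both colours."""
    as_white = expected_score(strength_edge, white_bias, True)
    as_black = expected_score(strength_edge, white_bias, False)
    return (as_white + as_black) / 2.0

edge = 0.01  # hypothetical true edge of A, as a score fraction
for bias in (0.0, 0.10, 0.25):
    single = expected_score(edge, bias, True)
    print(f"bias {bias:.2f}: single game {single:.2f}, "
          f"colour-reversed pair {pair_score(edge, bias):.2f}")
```

In this model the opening bias cancels in the colour-reversed pair, which removes that one systematic error. It does nothing to shrink the statistical error margins, which is the point made in capitals above.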

### Re: Do You really need 1000s of games for testing?

> Laskos wrote:
> > Uri Blass wrote:
> > > hgm wrote:
> > > The mathematics really tells everything there is to tell. If you see that practice does not conform to the mathematical predictions, it means that something is horribly wrong with the practice, and that the results should not be believed because you know something is going wrong. Such is the nature of mathematics: there is no arguing with it. If 1+1=2, and my pocket calculator says 1+1=3, the calculator is broken. Even if it says it all the time. Even if a hundred other calculators say it too.
> > >
> > > I don't know how solid your observation is (e.g. what "usually" means, and how many engines were included in it). With 200 games the standard deviation of the score should be about 2.5%, which translates to 17 Elo points. That still means there should be a great number of cases where the rating changes by 5 or less. But in more than half the cases it should change by 10 or more. If not, something fishy is going on with that rating list...
> >
> > I do not know a way to reach a conclusion about a small difference from a small number of games, but nothing in the math tells me that it is impossible.
> >
> > You make assumptions, such as the assumption that the games are independent (if you play the same opening with both colors, this assumption is not correct).
> >
> > Nothing in the math tells me that it is impossible to have 200 positions such that, every time A is better than B by at least 2 Elo, A beats B in a match of 400 games based on those 200 positions.
> >
> > I do not know 200 positions for which this holds, but I cannot prove that you cannot construct 200 positions for which it holds.
>
> Uri, what you mean by that is the systematic error, which is much worse than the statistical error. You can get a 2 Elo point difference every time in a 400-game match, which means nothing until you eliminate exactly that systematic error. You can even get a correct result with the systematic error, but you cannot know when. For better or worse, most engines exhibit sufficiently random behaviour, even at fixed time or fixed depth controls, to get rid of systematic errors. That is, with a set of balanced, random opening positions. I am even in favour of playing each position with reversed colours too, as this eliminates the effect of unbalanced positions.
>
> ONE CAN NOT DO BETTER THAN THE STATISTICAL ERROR MARGINS.
>
> ONE CAN DO WORSE THAN THAT.
>
> Kai

My point is not that I know a way to do better, but that this is not something that has been proved mathematically. In theory it is possible to have 200 positions such that a match based on them correctly tells us which program is better whenever the real difference is at least 2 Elo (even if the real difference is computed over some fixed, large set of 1,000,000 positions).