margin of error

lkaufman · Post by **lkaufman** » Sun Sep 16, 2012 2:16 am

Here is a question related to the recent post about testing. Let's say I run a gauntlet of my program against many others, and after 16,500 games my software shows a margin of error of 5 elo. Now I run a new version against the exact same opponents, and the rating is a few elo better, again with a margin of error of 5. It is my understanding that the margin of error for the difference between the two engines is 5 times the square root of 2, or 7. So far, correct, right?
Now I play a match directly between th two versions. I play the same number of games as in the gauntlet, 16,500. The software shows the margin of error for each version to be 5. Is the margin of error for the elo difference between them equal to 5, to 7, or something else?

Daniel Shawul · Post by **Daniel Shawul** » Sun Sep 16, 2012 3:18 am

lkaufman wrote:Here is a question related to the recent post about testing. Let's say I run a gauntlet of my program against many others, and after 16,500 games my software shows a margin of error of 5 elo. Now I run a new version against the exact same opponents, and the rating is a few elo better, again with a margin of error of 5. It is my understanding that the margin of error for the difference between the two engines is 5 times the square root of 2, or 7. So far, correct, right?
Now I play a match directly between th two versions. I play the same number of games as in the gauntlet, 16,500. The software shows the margin of error for each version to be 5. Is the margin of error for the elo difference between them equal to 5, to 7, or something else?

It is 7. It doesn't matter which pool of opponents you have. So for a result obtained in any way (self-test or against multiple opponents) : A = Elo1 +/- D1 and B = Elo2 +/- D2, the uncertainity of A-B is sqrt(D1^2+D2^2)

lkaufman · Post by **lkaufman** » Sun Sep 16, 2012 4:14 am

Thanks, that's what I thought. I think most people who run direct matches set the elo of the old version at a constant, say 3000, and just look at the elo and margin of error of the new version. That's the way I worked when I was working on Rybka 3, but I frequently complained to Vas that something was wrong, because I was getting way too many results outside of the margin of error. I eventually concluded that the mistake was failing to multiply the given margin of error by the square root of 2, but I was never 100% sure that this was correct for a direct match. I appreciate independent confirmation of this.

zamar · Post by **zamar** » Sun Sep 16, 2012 8:28 am

Yes, that's correct.

See: http://en.wikipedia.org/wiki/Propagation_of_uncertainty, especially section "partial derivates".

Rémi Coulom · Post by **Rémi Coulom** » Sun Sep 16, 2012 9:12 am

Note that Daniel's formula is correct only if the two confidence intervals are independent measurements. This may be approximately correct in the case of two long gauntlets against the same opponents. But I recommend loading all games into bayeselo and looking at LOS. It should be correct every time.

There are cases where the LOS reported by bayeselo with all games will be different from what you get when assuming measurements are independent:
- If the number of games is small, there is interference from the prior
- The correctness of the formula depends on how the confidence intervals are computed. Do the confidence intervals take rating uncertainty of opponents into consideration? If they do, there may be problems too.
- Considering all the games of the two gauntlets at the same time allows to estimate the strength of opponents more accurately.
- In case the two programs don't have the exact same opponents in the same proportion, then Daniel's formula can be very wrong.

I should probably remove confidence intervals and force people to use LOS in the next version, because as soon as there is correlation, there is no way to interpret confidence intervals in a correct way to compare ratings. There is almost always some correlation.

Rémi

Don · Post by **Don** » Sun Sep 23, 2012 3:32 pm

Rémi Coulom wrote:Note that Daniel's formula is correct only if the two confidence intervals are independent measurements. This may be approximately correct in the case of two long gauntlets against the same opponents. But I recommend loading all games into bayeselo and looking at LOS. It should be correct every time.

There are cases where the LOS reported by bayeselo with all games will be different from what you get when assuming measurements are independent:
- If the number of games is small, there is interference from the prior
- The correctness of the formula depends on how the confidence intervals are computed. Do the confidence intervals take rating uncertainty of opponents into consideration? If they do, there may be problems too.
- Considering all the games of the two gauntlets at the same time allows to estimate the strength of opponents more accurately.
- In case the two programs don't have the exact same opponents in the same proportion, then Daniel's formula can be very wrong.

I should probably remove confidence intervals and force people to use LOS in the next version, because as soon as there is correlation, there is no way to interpret confidence intervals in a correct way to compare ratings. There is almost always some correlation.

Rémi

Rémi,

What does LOS actually mean? If program A reports 98 over program B given that we agreed beforehand on the number of games to play, is that supposed to imply that it's 98% likely to be superior?

Presumably, if one program is in fact superior, even if it's only by 1 ELO, I assume that number will show 100% eventually if you run the test long enough. Is that correct?

Is there a way mathematically to make this work without specifying the number of games in advance? Presumably it will require a lot more games to make a clear distinction but it would be useful to know while a test is in progress.

Years ago we had a rule to simply stop a test once one side was ahead by N games, it had the effect that even if the "wrong" version won the match, it was very likely the superiority was minor if N was large enough.

Currently we run all tests to the same number of games (20,000) but we stop a test early when it become "obvious" that it is not going to catch up to our best version. Due to limited testing resources we have to make compromises like this but it would be a big win to have mathematically sound rules to stopping a test.

Don

Rémi Coulom · Post by **Rémi Coulom** » Sun Sep 23, 2012 4:14 pm

Don wrote: What does LOS actually mean? If program A reports 98 over program B given that we agreed beforehand on the number of games to play, is that supposed to imply that it's 98% likely to be superior?

Yes.

Presumably, if one program is in fact superior, even if it's only by 1 ELO, I assume that number will show 100% eventually if you run the test long enough. Is that correct?

Yes.

Is there a way mathematically to make this work without specifying the number of games in advance? Presumably it will require a lot more games to make a clear distinction but it would be useful to know while a test is in progress.

Years ago we had a rule to simply stop a test once one side was ahead by N games, it had the effect that even if the "wrong" version won the match, it was very likely the superiority was minor if N was large enough.

Currently we run all tests to the same number of games (20,000) but we stop a test early when it become "obvious" that it is not going to catch up to our best version. Due to limited testing resources we have to make compromises like this but it would be a big win to have mathematically sound rules to stopping a test.

Don

Measuring statistical significance of the comparison when doing that kind of early stopping is very subtle and difficult. There was a thread about this topic some time ago. I am sorry I cannot find it any more. Lucas ran experiments if I remember correctly. Michel also participated. It is a complicated topic.

That paper is a reference that comes to my mind:
www.di.ens.fr/willow/pdfs/ICML08b.pdf

The idea is that if you decide to stop early, the experiment becomes less significant than what the LOS tells.

Rémi

Don · Post by **Don** » Sun Sep 23, 2012 4:44 pm

Rémi Coulom wrote:
Don wrote: What does LOS actually mean? If program A reports 98 over program B given that we agreed beforehand on the number of games to play, is that supposed to imply that it's 98% likely to be superior?
Yes.

Presumably, if one program is in fact superior, even if it's only by 1 ELO, I assume that number will show 100% eventually if you run the test long enough. Is that correct?
Yes.

Is there a way mathematically to make this work without specifying the number of games in advance? Presumably it will require a lot more games to make a clear distinction but it would be useful to know while a test is in progress.

Years ago we had a rule to simply stop a test once one side was ahead by N games, it had the effect that even if the "wrong" version won the match, it was very likely the superiority was minor if N was large enough.

Currently we run all tests to the same number of games (20,000) but we stop a test early when it become "obvious" that it is not going to catch up to our best version. Due to limited testing resources we have to make compromises like this but it would be a big win to have mathematically sound rules to stopping a test.

Don
Measuring statistical significance of the comparison when doing that kind of early stopping is very subtle and difficult. There was a thread about this topic some time ago. I am sorry I cannot find it any more. Lucas ran experiments if I remember correctly. Michel also participated. It is a complicated topic.

That paper is a reference that comes to my mind:
www.di.ens.fr/willow/pdfs/ICML08b.pdf

The idea is that if you decide to stop early, the experiment becomes less significant than what the LOS tells.

Rémi

I wonder if something simple will suffice - even if it's not that precise mathematically. Here is an example stopping rule:

1. Set number of games in advance, e.g. 20,000

2. Set a threshold LOS for stopping, i.e. +/- 98 LOS

2. For calculating LOS "incrementally", assume the unplayed games end will come out even - and the draws in the same ratio as the played games.

3. Stop at 20,000 games or else at the first moment you cross the predefined LOS threshold.

This would probably not be the correct LOS but could it be considered a safe "bound"? In other words if it reports 80% that would give you more confidence that the program was improved than if 80% were after 20,000 games, right?

Note that there are 2 issues here -

1. You can stop at any arbitrary point.
2. You must predefine your stopping rule.

I think you have to predefine your stopping rule. Even if it's not a fixed number of games the rule itself has to be "fixed" such as what I described above. Otherwise you can always delay stopping in an attempt to influence the conclusion.

I'll take a look at that paper, but I don't know if I will be able to understand it. I get lost when the math gets heavy.

Don

Rémi Coulom · Post by **Rémi Coulom** » Sun Sep 23, 2012 5:21 pm

That paper is very probably not the best thing to read. I'll try to find that older thread if none of its participants gives us a link to it.

Even if you don't understand deep theory, there are simple ways to test any stopping method you design: just simulate it. For each elo point difference (1, 2, 3...) you can run many simulations of your early stopping method, and measure at which frequency it makes the wrong decision.

Rémi

Don · Post by **Don** » Sun Sep 23, 2012 5:26 pm

Rémi Coulom wrote:That paper is very probably not the best thing to read. I'll try to find that older thread if none of its participants gives us a link to it.

Even if you don't understand deep theory, there are simple ways to test any stopping method you design: just simulate it. For each elo point difference (1, 2, 3...) you can run many simulations of your early stopping method, and measure at which frequency it makes the wrong decision.

Rémi

Excellent idea. I use simulations like this a lot, especially when I don't understand the math or I'm too lazy to wade through it.

Don

margin of error

margin of error

Re: margin of error

Re: margin of error

Re: margin of error

Re: margin of error

Re: margin of error

Re: margin of error

Re: margin of error

Re: margin of error

Re: margin of error