margin of error
Moderators: hgm, Harvey Williamson, bob
Forum rules
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
This textbox is used to restore diagrams posted with the [d] tag before the upgrade.
margin of error
Here is a question related to the recent post about testing. Let's say I run a gauntlet of my program against many others, and after 16,500 games my software shows a margin of error of 5 elo. Now I run a new version against the exact same opponents, and the rating is a few elo better, again with a margin of error of 5. It is my understanding that the margin of error for the difference between the two engines is 5 times the square root of 2, or 7. So far, correct, right?
Now I play a match directly between th two versions. I play the same number of games as in the gauntlet, 16,500. The software shows the margin of error for each version to be 5. Is the margin of error for the elo difference between them equal to 5, to 7, or something else?
Now I play a match directly between th two versions. I play the same number of games as in the gauntlet, 16,500. The software shows the margin of error for each version to be 5. Is the margin of error for the elo difference between them equal to 5, to 7, or something else?

 Posts: 3724
 Joined: Tue Mar 14, 2006 10:34 am
 Location: Ethiopia
 Contact:
Re: margin of error
It is 7. It doesn't matter which pool of opponents you have. So for a result obtained in any way (selftest or against multiple opponents) : A = Elo1 +/ D1 and B = Elo2 +/ D2, the uncertainity of AB is sqrt(D1^2+D2^2)lkaufman wrote:Here is a question related to the recent post about testing. Let's say I run a gauntlet of my program against many others, and after 16,500 games my software shows a margin of error of 5 elo. Now I run a new version against the exact same opponents, and the rating is a few elo better, again with a margin of error of 5. It is my understanding that the margin of error for the difference between the two engines is 5 times the square root of 2, or 7. So far, correct, right?
Now I play a match directly between th two versions. I play the same number of games as in the gauntlet, 16,500. The software shows the margin of error for each version to be 5. Is the margin of error for the elo difference between them equal to 5, to 7, or something else?
Re: margin of error
Thanks, that's what I thought. I think most people who run direct matches set the elo of the old version at a constant, say 3000, and just look at the elo and margin of error of the new version. That's the way I worked when I was working on Rybka 3, but I frequently complained to Vas that something was wrong, because I was getting way too many results outside of the margin of error. I eventually concluded that the mistake was failing to multiply the given margin of error by the square root of 2, but I was never 100% sure that this was correct for a direct match. I appreciate independent confirmation of this.
Re: margin of error
Yes, that's correct.
See: http://en.wikipedia.org/wiki/Propagation_of_uncertainty, especially section "partial derivates".
See: http://en.wikipedia.org/wiki/Propagation_of_uncertainty, especially section "partial derivates".
Joona Kiiski

 Posts: 429
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: margin of error
Note that Daniel's formula is correct only if the two confidence intervals are independent measurements. This may be approximately correct in the case of two long gauntlets against the same opponents. But I recommend loading all games into bayeselo and looking at LOS. It should be correct every time.
There are cases where the LOS reported by bayeselo with all games will be different from what you get when assuming measurements are independent:
 If the number of games is small, there is interference from the prior
 The correctness of the formula depends on how the confidence intervals are computed. Do the confidence intervals take rating uncertainty of opponents into consideration? If they do, there may be problems too.
 Considering all the games of the two gauntlets at the same time allows to estimate the strength of opponents more accurately.
 In case the two programs don't have the exact same opponents in the same proportion, then Daniel's formula can be very wrong.
I should probably remove confidence intervals and force people to use LOS in the next version, because as soon as there is correlation, there is no way to interpret confidence intervals in a correct way to compare ratings. There is almost always some correlation.
Rémi
There are cases where the LOS reported by bayeselo with all games will be different from what you get when assuming measurements are independent:
 If the number of games is small, there is interference from the prior
 The correctness of the formula depends on how the confidence intervals are computed. Do the confidence intervals take rating uncertainty of opponents into consideration? If they do, there may be problems too.
 Considering all the games of the two gauntlets at the same time allows to estimate the strength of opponents more accurately.
 In case the two programs don't have the exact same opponents in the same proportion, then Daniel's formula can be very wrong.
I should probably remove confidence intervals and force people to use LOS in the next version, because as soon as there is correlation, there is no way to interpret confidence intervals in a correct way to compare ratings. There is almost always some correlation.
Rémi
Re: margin of error
Rémi,Rémi Coulom wrote:Note that Daniel's formula is correct only if the two confidence intervals are independent measurements. This may be approximately correct in the case of two long gauntlets against the same opponents. But I recommend loading all games into bayeselo and looking at LOS. It should be correct every time.
There are cases where the LOS reported by bayeselo with all games will be different from what you get when assuming measurements are independent:
 If the number of games is small, there is interference from the prior
 The correctness of the formula depends on how the confidence intervals are computed. Do the confidence intervals take rating uncertainty of opponents into consideration? If they do, there may be problems too.
 Considering all the games of the two gauntlets at the same time allows to estimate the strength of opponents more accurately.
 In case the two programs don't have the exact same opponents in the same proportion, then Daniel's formula can be very wrong.
I should probably remove confidence intervals and force people to use LOS in the next version, because as soon as there is correlation, there is no way to interpret confidence intervals in a correct way to compare ratings. There is almost always some correlation.
Rémi
What does LOS actually mean? If program A reports 98 over program B given that we agreed beforehand on the number of games to play, is that supposed to imply that it's 98% likely to be superior?
Presumably, if one program is in fact superior, even if it's only by 1 ELO, I assume that number will show 100% eventually if you run the test long enough. Is that correct?
Is there a way mathematically to make this work without specifying the number of games in advance? Presumably it will require a lot more games to make a clear distinction but it would be useful to know while a test is in progress.
Years ago we had a rule to simply stop a test once one side was ahead by N games, it had the effect that even if the "wrong" version won the match, it was very likely the superiority was minor if N was large enough.
Currently we run all tests to the same number of games (20,000) but we stop a test early when it become "obvious" that it is not going to catch up to our best version. Due to limited testing resources we have to make compromises like this but it would be a big win to have mathematically sound rules to stopping a test.
Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.

 Posts: 429
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: margin of error
Yes.Don wrote: What does LOS actually mean? If program A reports 98 over program B given that we agreed beforehand on the number of games to play, is that supposed to imply that it's 98% likely to be superior?
Yes.Presumably, if one program is in fact superior, even if it's only by 1 ELO, I assume that number will show 100% eventually if you run the test long enough. Is that correct?
Measuring statistical significance of the comparison when doing that kind of early stopping is very subtle and difficult. There was a thread about this topic some time ago. I am sorry I cannot find it any more. Lucas ran experiments if I remember correctly. Michel also participated. It is a complicated topic.Is there a way mathematically to make this work without specifying the number of games in advance? Presumably it will require a lot more games to make a clear distinction but it would be useful to know while a test is in progress.
Years ago we had a rule to simply stop a test once one side was ahead by N games, it had the effect that even if the "wrong" version won the match, it was very likely the superiority was minor if N was large enough.
Currently we run all tests to the same number of games (20,000) but we stop a test early when it become "obvious" that it is not going to catch up to our best version. Due to limited testing resources we have to make compromises like this but it would be a big win to have mathematically sound rules to stopping a test.
Don
That paper is a reference that comes to my mind:
www.di.ens.fr/willow/pdfs/ICML08b.pdf
The idea is that if you decide to stop early, the experiment becomes less significant than what the LOS tells.
Rémi
Re: margin of error
I wonder if something simple will suffice  even if it's not that precise mathematically. Here is an example stopping rule:Rémi Coulom wrote:Yes.Don wrote: What does LOS actually mean? If program A reports 98 over program B given that we agreed beforehand on the number of games to play, is that supposed to imply that it's 98% likely to be superior?
Yes.Presumably, if one program is in fact superior, even if it's only by 1 ELO, I assume that number will show 100% eventually if you run the test long enough. Is that correct?
Measuring statistical significance of the comparison when doing that kind of early stopping is very subtle and difficult. There was a thread about this topic some time ago. I am sorry I cannot find it any more. Lucas ran experiments if I remember correctly. Michel also participated. It is a complicated topic.Is there a way mathematically to make this work without specifying the number of games in advance? Presumably it will require a lot more games to make a clear distinction but it would be useful to know while a test is in progress.
Years ago we had a rule to simply stop a test once one side was ahead by N games, it had the effect that even if the "wrong" version won the match, it was very likely the superiority was minor if N was large enough.
Currently we run all tests to the same number of games (20,000) but we stop a test early when it become "obvious" that it is not going to catch up to our best version. Due to limited testing resources we have to make compromises like this but it would be a big win to have mathematically sound rules to stopping a test.
Don
That paper is a reference that comes to my mind:
www.di.ens.fr/willow/pdfs/ICML08b.pdf
The idea is that if you decide to stop early, the experiment becomes less significant than what the LOS tells.
Rémi
1. Set number of games in advance, e.g. 20,000
2. Set a threshold LOS for stopping, i.e. +/ 98 LOS
2. For calculating LOS "incrementally", assume the unplayed games end will come out even  and the draws in the same ratio as the played games.
3. Stop at 20,000 games or else at the first moment you cross the predefined LOS threshold.
This would probably not be the correct LOS but could it be considered a safe "bound"? In other words if it reports 80% that would give you more confidence that the program was improved than if 80% were after 20,000 games, right?
Note that there are 2 issues here 
1. You can stop at any arbitrary point.
2. You must predefine your stopping rule.
I think you have to predefine your stopping rule. Even if it's not a fixed number of games the rule itself has to be "fixed" such as what I described above. Otherwise you can always delay stopping in an attempt to influence the conclusion.
I'll take a look at that paper, but I don't know if I will be able to understand it. I get lost when the math gets heavy.
Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.

 Posts: 429
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: margin of error
That paper is very probably not the best thing to read. I'll try to find that older thread if none of its participants gives us a link to it.
Even if you don't understand deep theory, there are simple ways to test any stopping method you design: just simulate it. For each elo point difference (1, 2, 3...) you can run many simulations of your early stopping method, and measure at which frequency it makes the wrong decision.
Rémi
Even if you don't understand deep theory, there are simple ways to test any stopping method you design: just simulate it. For each elo point difference (1, 2, 3...) you can run many simulations of your early stopping method, and measure at which frequency it makes the wrong decision.
Rémi
Re: margin of error
Excellent idea. I use simulations like this a lot, especially when I don't understand the math or I'm too lazy to wade through it.Rémi Coulom wrote:That paper is very probably not the best thing to read. I'll try to find that older thread if none of its participants gives us a link to it.
Even if you don't understand deep theory, there are simple ways to test any stopping method you design: just simulate it. For each elo point difference (1, 2, 3...) you can run many simulations of your early stopping method, and measure at which frequency it makes the wrong decision.
Rémi
Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.