margin of error

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

lkaufman
Posts: 5960
Joined: Sun Jan 10, 2010 6:15 am
Location: Maryland USA

Re: margin of error

Post by lkaufman »

This is getting to be quite confusing, with differing answers depending on what software is assumed, and so on.
I'll restate the question in an unambiguous manner that is software-independent. Let's assume we are comparing two versions and want to play enough games to be able to tell which is stronger with an error margin of 5 Elo. We can either play each of them against a foreign gauntlet or play them directly against each other. Let's assume the draw percentage is the same in either case, and let's ignore the possibility that the results may be different even with an infinite sample size.
So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)? It seems that all three answers could be inferred from various posts on this topic.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: margin of error

Post by Daniel Shawul »

ernest wrote:
hgm wrote:I would expect the A-B error to be 10 in that case, because the errors would perfectly anti-correlate.
Why would they anti-correlate?... (instead of being independent)
No no, I believe the best estimate for the A-B error is 7 (the sqrt thing).
For A's score to rise, B's score has to fall, so the correlation coefficient is -1. If you play two separate gauntlets for A and B, they are independent, since A's score doesn't affect B's; so there is no correlation and the error is sqrt(D1^2+D2^2).
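A minimal numeric sketch of this (assuming two measurements with standard errors D1 and D2 and a given correlation coefficient; the function name is mine, just for illustration):

```python
import math

def std_of_difference(d1, d2, rho):
    """Standard deviation of A - B for two measurements with standard
    deviations d1 and d2 and correlation coefficient rho, using
    Var(A - B) = Var(A) + Var(B) - 2*Cov(A, B) with Cov = rho*d1*d2."""
    return math.sqrt(d1 * d1 + d2 * d2 - 2.0 * rho * d1 * d2)

# Head-to-head: A's gain is B's loss, so rho = -1 and the errors add linearly.
print(std_of_difference(5, 5, -1.0))  # 10.0
# Independent gauntlets: rho = 0 and the errors add in quadrature.
print(std_of_difference(5, 5, 0.0))   # ~7.07
```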
User avatar
hgm
Posts: 27828
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: margin of error

Post by hgm »

lkaufman wrote: So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
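The arithmetic above can be spelled out in a few lines (a sketch only; it assumes the standard error scales as 1/sqrt(number of games)):

```python
import math

target_error = 5.0                               # desired Elo error on A - B
# The two gauntlet errors add root-mean-square-wise:
#   sqrt(e^2 + e^2) = target_error  =>  e = target_error / sqrt(2)
per_gauntlet_error = target_error / math.sqrt(2)           # about 3.54 Elo
# The error scales as 1/sqrt(N), so the game count scales as 1/error^2:
games_factor_per_gauntlet = (target_error / per_gauntlet_error) ** 2  # 2x
total_games_factor = 2 * games_factor_per_gauntlet                    # 4x
print(per_gauntlet_error, total_games_factor)
```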
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: margin of error

Post by Daniel Shawul »

hgm wrote:
lkaufman wrote: So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
I don't really understand this, because you are not really measuring the variance of A-B directly. You still have to calculate the variances and covariances of A and B separately and use the formula for A-B to get the final margin of error. Negative correlation drives up the variance of A-B, as we agreed, so independence of the results of A and B can only decrease the variance of A-B, provided the variances of A and B are kept the same. Let's take as an example a result of 200-200-100 between two engines.

Code: Select all

Var(A)=Var(B)=1/5 
Cov(A,B) = -1/5 
Var(A-B)=1/5+1/5-2(-1/5)=4/5 
So the variance of A-B is exactly four times that of either of the players.
Say instead A played against C, and B against C, in two independent experiments, and both got 200-200-100 against C.

Code: Select all

Var(A)=Var(B)=1/5 
Cov(A,B)=0 even though Cov(A,C)=Cov(B,C)=-1/5 
Var(A-B)= 1/5+1/5=2/5 
So with independent experiments the variance is actually halved, at twice the number of games. So they give comparable results.

--------------------------------------------------
Edit: In fact, the standard error of the difference is better in the case where we play against a third engine:
1st method: s.e = sqrt(4/5) / sqrt(500) = sqrt(4/(5*500))
2nd method: s.e = sqrt(2/5) / sqrt(2*500) = sqrt(1/(5*500))
We would need to play four times as many games between A and B to reach the same margin of error as the second method, and that would be twice the second method's total.
So on equal grounds, the second method requires half the number of games to reach the same standard error. Quite the opposite of your claim.
----------------------------------------------------

P.S: I left out confidence intervals and standard error (which would require division by sqrt(N)) in my calculations. So Larry, you should be careful when applying the formula directly to confidence values.

P.S.1: Detailed calculations

Code: Select all

Var(A)=(200(1-0.5)^2+200(0-0.5)^2+100(0.5-0.5)^2)/500 
Cov(A,B)=(200(1-0.5)(0-0.5)+200(0-0.5)(1-0.5)+100(0.5-0.5)(0.5-0.5))/500 
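These numbers can be double-checked with a short script (a sketch; it expands the 200-200-100 result into per-game score pairs for A and B):

```python
# Verify the per-game variance and covariance for a 200-200-100 result.
wins, losses, draws = 200, 200, 100
# Per-game score pairs (A's score, B's score); B's score is always 1 - A's.
games = [(1.0, 0.0)] * wins + [(0.0, 1.0)] * losses + [(0.5, 0.5)] * draws
n = len(games)
mean_a = sum(a for a, _ in games) / n                              # 0.5
mean_b = sum(b for _, b in games) / n                              # 0.5
var_a = sum((a - mean_a) ** 2 for a, _ in games) / n               # 1/5
cov_ab = sum((a - mean_a) * (b - mean_b) for a, b in games) / n    # -1/5
var_diff = var_a + var_a - 2 * cov_ab                              # 4/5
print(var_a, cov_ab, var_diff)
```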
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

hgm wrote:
lkaufman wrote: So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
This is exactly correct - I ran a simulation to prove it empirically, but intuitively I already believed it would come out 4x.

Essentially I ran simulated 20,000-game matches where player B was 2 Elo stronger than player A, and counted how many times it returned a "correct" result. It turns out that it is correct about 87.3 percent of the time, give or take half a percent. Then I ran another test where both programs played a foreign opponent and compared their scores against the foreign player, measuring them indirectly. This was only reliable about 79% of the time although it required twice as many games. When I ran this with 40,000-game matches it very closely matched the 87.3 of the initial run - but required 4x as many games.
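A rough sketch of such a simulation (not Don's actual harness; it uses a normal approximation to the binomial, ignores draws, and assumes the foreign opponent C sits midway between A and B in strength):

```python
import math
import random

def expected_score(elo_diff):
    """Expected per-game score of the stronger side, from the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))

def detection_rates(trials=20000, n_games=20000, elo_diff=2.0, seed=1):
    """Fraction of simulated experiments in which each method correctly
    ranks B above A.  Match score fractions are drawn from a normal
    approximation to the binomial, so this is only a rough sketch."""
    rng = random.Random(seed)
    p_direct = expected_score(elo_diff)         # B's expected score vs A
    p_b = expected_score(elo_diff / 2.0)        # B's expected score vs C
    p_a = expected_score(-elo_diff / 2.0)       # A's expected score vs C
    sd = math.sqrt(0.25 / n_games)              # s.d. of one score fraction
    direct = indirect = 0
    for _ in range(trials):
        # Direct: one A-vs-B match; B judged stronger if it scores over 50%.
        if rng.gauss(p_direct, sd) > 0.5:
            direct += 1
        # Indirect: A vs C and B vs C (twice the total games); compare scores.
        if rng.gauss(p_b, sd) > rng.gauss(p_a, sd):
            indirect += 1
    return direct / trials, indirect / trials

direct_rate, indirect_rate = detection_rates()
print(direct_rate, indirect_rate)  # the direct match detects B more often
```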
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: margin of error

Post by Daniel Shawul »

Don wrote:
hgm wrote:
lkaufman wrote: So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
This is exactly correct - I ran a simulation to prove it empirically, but intuitively I already believed it would come out 4x.

Essentially I ran simulated 20,000-game matches where player B was 2 Elo stronger than player A, and counted how many times it returned a "correct" result. It turns out that it is correct about 87.3 percent of the time, give or take half a percent. Then I ran another test where both programs played a foreign opponent and compared their scores against the foreign player, measuring them indirectly. This was only reliable about 79% of the time although it required twice as many games. When I ran this with 40,000-game matches it very closely matched the 87.3 of the initial run - but required 4x as many games.
Well then I am truly baffled. My calculation says playing A vs B directly requires twice as many games as doing A vs C and B vs C. And now you guys are saying the other way round requires 4x as many games. So we have a difference of 8x :!:
Can you provide data and explain how you calculated error bars? I am really curious now, as I believe negative correlation drives up the variance of A-B but decreases that of A+B, which I think is causing the confusion here. Well, maybe I am wrong, but we lack a good explanation ...
User avatar
michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: margin of error

Post by michiguel »

Daniel Shawul wrote:
Don wrote:
hgm wrote:
lkaufman wrote: So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
This is exactly correct - I ran a simulation to prove it empirically, but intuitively I already believed it would come out 4x.

Essentially I ran simulated 20,000-game matches where player B was 2 Elo stronger than player A, and counted how many times it returned a "correct" result. It turns out that it is correct about 87.3 percent of the time, give or take half a percent. Then I ran another test where both programs played a foreign opponent and compared their scores against the foreign player, measuring them indirectly. This was only reliable about 79% of the time although it required twice as many games. When I ran this with 40,000-game matches it very closely matched the 87.3 of the initial run - but required 4x as many games.
Well then I am truly baffled. My calculation says playing A vs B directly requires twice as many games as doing A vs C and B vs C. And now you guys are saying the other way round requires 4x as many games. So we have a difference of 8x :!:
Can you provide data and explain how you calculated error bars? I am really curious now, as I believe negative correlation drives up the variance of A-B but decreases that of A+B, which I think is causing the confusion here. Well, maybe I am wrong, but we lack a good explanation ...
HGM's explanation is correct.

A plays B and the difference (deltaAB) has an error Eab.
Then, with the same number of games, we can calculate deltaAC, and it will have error Eac = Eab (since the number of games is the same).
Also, with the same number of games, we can calculate deltaCB, and it will have error Ecb = Eab (since the number of games is the same).

So, we can calculate indirectly
deltaAB = deltaAC + deltaCB

Here we can already see that the error of this indirect calculation is bigger than Eab, no matter what, and we are already playing twice as many games.

deltaAC and deltaCB are independent, so the error for the indirect calculation is
IndirectError_ab = sqrt(Eac^2 + Ecb^2)
IndirectError_ab = sqrt(Eac^2 + Eac^2)
IndirectError_ab = sqrt(2*Eac^2)
IndirectError_ab = sqrt(2) * Eac

If we want the IndirectError_ab to be Eab, we have to make Eac = Eab/sqrt(2). We can do that by playing twice as many games, which makes the total 4x.

Miguel
User avatar
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: margin of error

Post by Don »

Daniel Shawul wrote:
Don wrote:
hgm wrote:
lkaufman wrote: So the question is: Do we need the same number of total games to be played in each case, do we need twice as many in the case of gauntlet testing (i.e. the same number of games for each program as would be needed for a match between them), or do we need four times as many total games for the gauntlet method (i.e. twice as many games per engine X two engines tested)?
You would need 4 times as many games. The direct confrontation would give you an error of 5 Elo. To get that same error from a difference of two measurements, you need to do each of those measurements to an accuracy of 3.5 Elo, because their errors will add (root-mean-square-wise) when taking the difference. And that requires twice as many games in each gauntlet. So four times as many in total.
This is exactly correct - I ran a simulation to prove it empirically, but intuitively I already believed it would come out 4x.

Essentially I ran simulated 20,000-game matches where player B was 2 Elo stronger than player A, and counted how many times it returned a "correct" result. It turns out that it is correct about 87.3 percent of the time, give or take half a percent. Then I ran another test where both programs played a foreign opponent and compared their scores against the foreign player, measuring them indirectly. This was only reliable about 79% of the time although it required twice as many games. When I ran this with 40,000-game matches it very closely matched the 87.3 of the initial run - but required 4x as many games.
Well then I am truly baffled. My calculation says playing A vs B directly requires twice as many games as doing A vs C and B vs C. And now you guys are saying the other way round requires 4x as many games.
Logically if I play A vs B and use 20,000 games, both players have played 20,000 games. If I play A vs C, and then play a second match B vs C, I have to play at least twice as many games: 20,000 for the first match and 20,000 for the second match. In other words I have wasted a lot of testing resources by involving a 3rd party. So I think it's pretty obvious that the answer is at least 2x.

The other difference is that we are gauging the relative strength of the players indirectly, through player C. But player C also has error margins, adding to the imprecision of the results. So it's more than just a mere 2x, since we are involving a 3rd party along with the error it introduces.

I don't know if this is a good analogy, but imagine that we were nearly the same height - an easy way to determine who was taller would be for us to stand back to back and compare directly. Probably both of us would be squirming around a bit, so let's say that each of us could be off by as much as a quarter of an inch in either direction.

Another way to see who is taller is to measure each of us separately, using (let's say) an equally imprecise measurement - a person holding a paper tape measure that was wrinkled up a bit - and he could also be accurate only to within 1/4 inch either way, as he would have his hands full wrestling with the tape. In the back-to-back measurement you have 2 sources of error; in the tape scenario you have 4 sources of error (the tape measurement is applied twice).

So we have a difference of 8x :!:
Can you provide data and explain how you calculated error bars? I am really curious now, as I believe negative correlation drives up the variance of A-B but decreases that of A+B, which I think is causing the confusion here. Well, maybe I am wrong, but we lack a good explanation ...
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: margin of error

Post by Daniel Shawul »

michiguel wrote:
HGM's explanation is correct.

A plays B and the difference (deltaAB) has an error Eab.
Then, with the same number of games, we can calculate deltaAC, and it will have error Eac = Eab (since the number of games is the same).
Also, with the same number of games, we can calculate deltaCB, and it will have error Ecb = Eab (since the number of games is the same).

So, we can calculate indirectly
deltaAB = deltaAC + deltaCB

Here we can already see that the error of this indirect calculation is bigger than Eab, no matter what, and we are already playing twice as many games.

deltaAC and deltaCB are independent, so the error for the indirect calculation is
IndirectError_ab = sqrt(Eac^2 + Ecb^2)
IndirectError_ab = sqrt(Eac^2 + Eac^2)
IndirectError_ab = sqrt(2*Eac^2)
IndirectError_ab = sqrt(2) * Eac

If we want the IndirectError_ab to be Eab, we have to make Eac = Eab/sqrt(2). We can do that by playing twice as many games, which makes the total 4x.

Miguel
No, you are missing the covariance completely. In the first A vs B test you have a big covariance, so it affects the variance of A - B big time. Even HGM agreed that for the example I gave of two standard errors of 5 Elo each, std(A-B) = 10, which your calculation ignores.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: margin of error

Post by Daniel Shawul »

Don wrote:
Logically if I play A vs B and use 20,000 games, both players have played 20,000 games. If I play A vs C, and then play a second match B vs C, I have to play at least twice as many games: 20,000 for the first match and 20,000 for the second match. In other words I have wasted a lot of testing resources by involving a 3rd party. So I think it's pretty obvious that the answer is at least 2x.
Yes, but you too are forgetting covariance. Remember, Rémi warned that my formula could be wrong since there is usually covariance. Please look at my calculation and see how the covariance affects A-B significantly. It is equal in magnitude to the variance.

Code: Select all

For A vs B
var(A)=var(B)
cov(A,B)=-sqrt(var(A)var(B))=-var(A)
So var(A-B)=var(A)+var(A)-2(-var(A))=4var(A)
So when you match A with B, you have 4 times as big a variance. Look at my example and tell me where I made a mistake.
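The identity behind this is easy to confirm empirically (a sketch with simulated head-to-head results; since B's per-game score is exactly 1 - A's, Var(A-B) = 4*Var(A) regardless of the win/draw/loss mix chosen here):

```python
import random

random.seed(7)
n = 100000
# Simulate head-to-head games: A scores 1 (win), 0 (loss), or 0.5 (draw).
a_scores = [random.choice([1.0, 0.0, 0.5]) for _ in range(n)]
b_scores = [1.0 - a for a in a_scores]  # B's score is the complement of A's

def var(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

var_a = var(a_scores)
var_diff = var([a - b for a, b in zip(a_scores, b_scores)])
print(var_diff / var_a)  # 4.0: Var(A - B) = 4 * Var(A) when B = 1 - A
```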