margin of error

Discussion of chess software programming and technical issues.

Moderator: Ras

michiguel
Posts: 6401
Joined: Thu Mar 09, 2006 8:30 pm
Location: Chicago, Illinois, USA

Re: margin of error

Post by michiguel »

Daniel Shawul wrote:
No, both default and covariance assume Gaussian. That has nothing to do with this.
http://www.talkchess.com/forum/viewtopi ... &start=199

The point is that one assumes the opponents' ratings are the true ratings (a gross approximation that could be valid when there are many opponents and lots of games), and the other does not. So, that is why you get a difference that is close to 2.
I am talking about default and exactdist, which is what we are comparing here. One gives 27 and the other 15, and the difference is exactly the Gaussian assumption.
That is not the only difference!

From the link:
"exactdist": assume opponents ratings are their true ratings, but does not assume Gaussian distribution. This will produce asymmetric intervals, especially for very high or very low winning rates. Cost is linear in the number of players.
"covariance": assume Gaussian distribution, but not that the rating of opponents are true. This may be very costly if you have thousands of players, but it is more accurate than the default. The cost is cubic in the number of players (it is a matrix inversion)

Miguel
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: margin of error

Post by Daniel Shawul »

michiguel wrote:
Daniel Shawul wrote:
When you measure A vs B, you measure only one percentage.

Miguel
This is what I am saying elostat and bayeselo do differently from your assumption. They do assume that they measured two percentages, so for a winning percentage of 50%-50% there is a correlation of -1. It is evident because they store separate scores for A and B, as WWWWLLLLDD and LLLLWWWWDD. If you calculate the elo margin for 200-200-100 like I did, var(A) = 1/5, which translates to 27 elo as displayed by both elostat and the bayeselo default. If it was a direct measure of A-B as you claimed, their reported error margins are wrong and should have been 27/2.
No, I claimed that the correct way to calculate the errors in BE is in covariance mode, and that is the error on the value (A minus the average of the pool).

Once you do that, you get DeltaAB = 2*A and Eab = 2*(the error that BE reports).

Regardless of how you do it, you will end up having a bigger error if you do this process indirectly through a third party.

Miguel
Well then why would you say the following:
Yes, if you have

Code: Select all

            Elo   Error +/- 
Engine_A   +100    10 
Engine_B   -100    10
That is a way to represent the results, but the direct measure is Engine_A - Engine_B = 200 +/- 20. These are the numbers I am talking about: DeltaAB and Eab.
I get that you are talking about DeltaAB, but your errors for A and B are 10 each, which is what I am saying.
There is a difference in the basic assumption, as both elostat and the bayeselo default assume separate results for A and B with a correlation of -1.
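For concreteness, here is a minimal sketch (not the actual elostat or bayeselo code; a plain trinomial model with the usual 95% normal quantile) of the arithmetic behind the 27-elo figure for the 200-200-100 example:

Code: Select all

# Sketch of the 27-elo margin for W=200, L=200, D=100, from A's side only.
import math

W, L, D = 200, 200, 100
N = W + L + D
p = (W + 0.5 * D) / N                  # A's mean score: 0.5
var = (W + 0.25 * D) / N - p * p       # per-game score variance: 0.2 = 1/5
se = math.sqrt(var / N)                # standard error of the mean score

def elo(x):
    return -400.0 * math.log10(1.0 / x - 1.0)

margin = elo(p + 1.96 * se) - elo(p)   # 95% margin: about 27 elo
print(round(margin, 1))
# Under the correlation -1 assumption the difference A-B moves twice as
# much as A alone, so DeltaAB = 2*A with error 2*margin.
print(round(2 * margin, 1))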
Daniel Shawul
Posts: 4186
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: margin of error

Post by Daniel Shawul »

michiguel wrote:
Daniel Shawul wrote:
No, both default and covariance assume Gaussian. That has nothing to do with this.
http://www.talkchess.com/forum/viewtopi ... &start=199

The point is that one assumes the opponents' ratings are the true ratings (a gross approximation that could be valid when there are many opponents and lots of games), and the other does not. So, that is why you get a difference that is close to 2.
I am talking about default and exactdist, which is what we are comparing here. One gives 27 and the other 15, and the difference is exactly the Gaussian assumption.
That is not the only difference!

From the link:
"exactdist": assume opponents ratings are their true ratings, but does not assume Gaussian distribution. This will produce asymmetric intervals, especially for very high or very low winning rates. Cost is linear in the number of players.
"covariance": assume Gaussian distribution, but not that the rating of opponents are true. This may be very costly if you have thousands of players, but it is more accurate than the default. The cost is cubic in the number of players (it is a matrix inversion)

Miguel
For the third time, I said the ones I am comparing are default and exactdist (which is what I have given the results for). For the third time, the difference between the two is exactly the Gaussian assumption.
Default: assume opponents' ratings are their true ratings, and a Gaussian distribution.
"exactdist": assume opponents' ratings are their true ratings, but does not assume a Gaussian distribution. This will produce asymmetric intervals, especially for very high or very low winning rates. Cost is linear in the number of players.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

Michel wrote:
If you test against a set of engines with known elo you can use the likelihood ratio test (also known as Wald test). This test is easy to use.


I'll look for that - do you have any references?

We don't really test head to head, but each version of Komodo tests against a set of opponents (not Komodo versions.)
The Wald test also works for non head to head tests. The key point is that the elo of the opponents needs to be known (which is the case for a head to head test since you may assume that the elo of the single opponent is zero).

The Wald test is for testing a single parameter (e.g. elo). It is implemented as a random walk where each outcome (win/draw/loss) changes a certain number (the likelihood ratio) in a precomputed way. Once the likelihood ratio falls outside a precomputed interval you accept either H0 or H1.

In practice you also have a truncation time and a threshold which determines whether you accept H0 or H1 at truncation.

I wrote a python utility that computes the parameters (alas, for now only for a head-to-head test; I could extend it to more opponents).

http://hardy.uhasselt.be/Toga/wald

Code: Select all

Usage: wald [OPTION]...
Generate parameters for a truncated Wald test and optionally
produce an informative graph.

  --alpha       type I error probability when there is no elo difference
  --beta        type II error probability with elo difference equal to epsilon
  --epsilon     see --beta
  --truncation  truncate test after this many observations
  --graph       produce graph
  --draw_ratio  draw ratio when there is no elo difference
  --help        print this message
It will print the parameters of the random walk (i.e. what to add in case of W/D/L) as well as the stopping conditions.
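To illustrate how such parameters are consumed, here is a minimal sketch of the test loop (the increments, bounds, and the stubbed game below are placeholders, not the utility's actual output):

Code: Select all

# Minimal truncated sequential (Wald) test loop; all numbers are placeholders.
import random

llr_inc = {"win": 0.0085, "draw": 0.0, "loss": -0.0085}  # placeholder increments
lower, upper = -2.94, 2.94      # e.g. log(beta/(1-alpha)), log((1-beta)/alpha)
truncation = 20000              # stop unconditionally after this many games
threshold = 0.0                 # accept H1 at truncation if llr > threshold

def play_one_game():
    # Stub standing in for a real game against the opponent pool.
    return random.choice(["win", "draw", "loss"])

llr = 0.0
for _ in range(truncation):
    llr += llr_inc[play_one_game()]
    if llr <= lower:
        print("accept H0"); break
    if llr >= upper:
        print("accept H1"); break
else:
    print("accept H1" if llr > threshold else "accept H0")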

It will also produce a graph which looks as follows

http://hardy.uhasselt.be/Toga/graph3.png
lucasart
Posts: 3241
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: margin of error

Post by lucasart »

Don wrote:
Rémi Coulom wrote:
Don wrote: What does LOS actually mean? If program A reports 98 over program B given that we agreed beforehand on the number of games to play, is that supposed to imply that it's 98% likely to be superior?
Yes.
Presumably, if one program is in fact superior, even if it's only by 1 ELO, I assume that number will show 100% eventually if you run the test long enough. Is that correct?
Yes.
Is there a way mathematically to make this work without specifying the number of games in advance? Presumably it will require a lot more games to make a clear distinction but it would be useful to know while a test is in progress.

Years ago we had a rule to simply stop a test once one side was ahead by N games; it had the effect that even if the "wrong" version won the match, it was very likely the superiority was minor if N was large enough.

Currently we run all tests to the same number of games (20,000) but we stop a test early when it becomes "obvious" that it is not going to catch up to our best version. Due to limited testing resources we have to make compromises like this, but it would be a big win to have mathematically sound rules for stopping a test.

Don
Measuring statistical significance of the comparison when doing that kind of early stopping is very subtle and difficult. There was a thread about this topic some time ago. I am sorry I cannot find it any more. Lucas ran experiments if I remember correctly. Michel also participated. It is a complicated topic.

That paper is a reference that comes to my mind:
www.di.ens.fr/willow/pdfs/ICML08b.pdf

The idea is that if you decide to stop early, the experiment becomes less significant than what the LOS indicates.

Rémi
I wonder if something simple will suffice - even if it's not that precise mathematically. Here is an example stopping rule:

1. Set the number of games in advance, e.g. 20,000

2. Set a threshold LOS for stopping, e.g. +/- 98 LOS

3. For calculating the LOS "incrementally", assume the unplayed games will come out even, with draws in the same ratio as the played games.

4. Stop at 20,000 games, or else at the first moment you cross the predefined LOS threshold.

This would probably not be the correct LOS, but could it be considered a safe "bound"? In other words, if it reports 80%, that would give you more confidence that the program was improved than an 80% reported after 20,000 games, right?
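For what it is worth, a sketch of steps 3 and 4 under the usual normal-approximation LOS formula (my reading of the proposal; the game counts are made up). Note that splitting the unplayed games evenly keeps w - l fixed while growing w + l, which pulls the LOS toward 50%, i.e. in the conservative direction:

Code: Select all

# Incremental LOS with unplayed games assumed to come out even.
import math

def los(w, l):
    # LOS = Phi((w - l) / sqrt(w + l)), via the error function.
    return 0.5 * (1.0 + math.erf((w - l) / math.sqrt(2.0 * (w + l))))

def incremental_los(w, l, d, total=20000):
    played = w + l + d
    remaining = total - played
    extra_d = remaining * d / played        # draws in the same ratio
    extra = (remaining - extra_d) / 2.0     # the rest split evenly
    return los(w + extra, l + extra)

print(los(550, 450))                    # LOS after 1100 games (100 draws)
print(incremental_los(550, 450, 100))   # projected pessimistic LOS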

Note that there are 2 issues here -

1. You can stop at any arbitrary point.
2. You must predefine your stopping rule.

I think you have to predefine your stopping rule. Even if it's not a fixed number of games the rule itself has to be "fixed" such as what I described above. Otherwise you can always delay stopping in an attempt to influence the conclusion.

I'll take a look at that paper, but I don't know if I will be able to understand it. I get lost when the math gets heavy.

Don
I've tried many hacks along those lines, and trust me, it will never be reliable.
The best thing to use for the practical case of detecting an improvement in your program is arguably the sequential Wald test.
http://talkchess.com/forum/viewtopic.ph ... t&start=40
In your case you don't want to do self testing. So the unitary experiment should be 1 game vs each opponent, and the result should be the score achieved. You can easily set up a sequential Wald test for that.
The only problem is that the stopping rule must be coded in the tournament manager. You could use cutechess-cli to run one experiment (i.e. a game vs each opponent) and a python script that runs it and tests the stopping rule.
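A rough sketch of such a driver (the score values, the increment mapping, and the stubbed cutechess-cli call are placeholders; a real driver would parse the actual match output):

Code: Select all

# Driver: one experiment = one game vs each opponent; feed the score
# into a sequential stopping rule. Everything numeric is a placeholder.
import random

def run_experiment():
    # Real version: subprocess.run(["cutechess-cli", ...]) and parse the
    # total score from the output. Stubbed here with a random score.
    return random.choice([0.0, 0.5, 1.0, 1.5, 2.0])  # score vs 2 opponents

def llr_increment(score):
    return 0.01 * (score - 1.0)   # placeholder, not the real mapping

llr, lower, upper = 0.0, -2.94, 2.94
while lower < llr < upper:
    llr += llr_increment(run_experiment())
print("accept H1" if llr >= upper else "accept H0")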
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

You could use cutechess-cli to run one experiment (i.e. a game vs each opponent) and a python script that runs it and tests the stopping rule.
Actually that is a brilliant idea: put a truncated Wald test in cutechess-cli (I believe cutechess-cli can run tournaments now).

The math for determining the parameters of a truncated Wald test is quite complicated (not the execution of the test itself, which is trivial), but I wrote some notes about it, so I am available for consulting if necessary.
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: margin of error

Post by Rein Halbersma »

Michel wrote:
Michel wrote:
If you test against a set of engines with known elo you can use the likelihood ratio test (also known as Wald test). This test is easy to use.


I'll look for that - do you have any references?

We don't really test head to head, but each version of Komodo tests against a set of opponents (not Komodo versions.)
The Wald test also works for non head to head tests. The key point is that the elo of the opponents needs to be known (which is the case for a head to head test since you may assume that the elo of the single opponent is zero).

The Wald test is for testing a single parameter (e.g. elo). It is implemented as a random walk where each outcome (win/draw/loss) changes a certain number (the likelihood ratio) in a precomputed way. Once the likelihood ratio falls outside a precomputed interval you accept either H0 or H1.

In practice you also have a truncation time and a threshold which determines whether you accept H0 or H1 at truncation.

I wrote a python utility that computes the parameters (alas, for now only for a head-to-head test; I could extend it to more opponents).

http://hardy.uhasselt.be/Toga/wald

Code: Select all

Usage: wald [OPTION]...
Generate parameters for a truncated Wald test and optionally
produce an informative graph.

  --alpha       type I error probability when there is no elo difference
  --beta        type II error probability with elo difference equal to epsilon
  --epsilon     see --beta
  --truncation  truncate test after this many observations
  --graph       produce graph
  --draw_ratio  draw ratio when there is no elo difference
  --help        print this message
It will print the parameters of the random walk (i.e. what to add in case of W/D/L) as well as the stopping conditions.

It will also produce a graph which looks as follows

http://hardy.uhasselt.be/Toga/graph3.png
The Wald-test is different from the Likelihood-Ratio test. For general hypothesis testing, if you have estimated a model for an alternative hypothesis H1, that gives you back a log-likelihood function LogL1(theta) and a parameter theta1 that maximizes LogL1, the Wald-test to check whether theta1 is significantly different from its value theta0 under the null hypothesis is simply the statistic

(theta1 - theta0)^2 / var(theta1)

which is distributed as a chi-squared variable with 1 degree of freedom in the case of a single rating. Note that you don't need to estimate a model of the null hypothesis. Here, var(theta1) is given by the inverse of the negative second derivative of the log-likelihood function at the maximum.

The likelihood-ratio test, on the other hand, needs the log-likelihood for two models, the null and the alternative hypothesis, say LogL0(theta0) and LogL1(theta1). The likelihood-ratio test is then the statistic

2(LogL1-LogL0)

which is also distributed as a chi-squared variable with 1 degree of freedom. Asymptotically (as the sample size goes to infinity) the two tests behave the same, but for finite samples LR <= Wald. So with the Wald-test you will reject the null hypothesis more often than with the LR-test (at least for small samples).
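A tiny worked example for a binomial model (k wins in n games, H0: theta = 0.5; purely illustrative) showing both statistics and the LR <= Wald ordering:

Code: Select all

# Wald vs likelihood-ratio statistic for k wins in n win/loss games.
import math

n, k = 100, 60
t1, t0 = k / n, 0.5

# Wald: squared distance from theta0, scaled by the variance at the MLE.
wald = (t1 - t0) ** 2 / (t1 * (1 - t1) / n)

def logl(t):
    return k * math.log(t) + (n - k) * math.log(1 - t)

# LR: twice the log-likelihood difference between H1 and H0.
lr = 2.0 * (logl(t1) - logl(t0))

print(wald, lr)   # about 4.17 vs 4.03: LR <= Wald, as stated above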
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

You are confusing the non-sequential version of the tests with the sequential version (which we are discussing here).

The sequential version of the likelihood ratio test is known as the (sequential) Wald test.

http://en.wikipedia.org/wiki/Sequential ... ratio_test
Rein Halbersma
Posts: 751
Joined: Tue May 22, 2007 11:13 am

Re: margin of error

Post by Rein Halbersma »

Michel wrote:You are confusing the non-sequential version of the tests with the sequential version (which we are discussing here).

The sequential version of the likelihood ratio test is known as the (sequential) Wald test.

http://en.wikipedia.org/wiki/Sequential ... ratio_test
Ah thanks for pointing that out. I learnt something today!

The nomenclature is a bit unfortunate though: it really is a sequential application of the likelihood-ratio test, and although it was invented by Wald, it's completely different from the other Wald-test!
Michel
Posts: 2292
Joined: Mon Sep 29, 2008 1:50 am

Re: margin of error

Post by Michel »

I cleaned up my notes on (continuous) random walks a little bit and put them here

http://hardy.uhasselt.be/Toga/support_random_walk.pdf

Since the sequential Wald test is a random walk on an interval, the formulas in those notes may be used to compute the various associated probabilities.

Inverting the formulas (which is only possible numerically) allows one to design tests with prescribed characteristics.

This is how the utility

http://hardy.uhasselt.be/Toga/wald

works.
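As a toy example of that inversion step (using the standard exit probability of a Brownian motion with drift as a stand-in for the actual formulas in the notes; the drift and variance are made up), one can solve numerically for the interval half-width that gives a prescribed type I error:

Code: Select all

# Solve for the interval half-width b that yields alpha = 5%.
import math
from scipy.optimize import brentq

mu, sigma2 = -1.0, 4.0   # drift and variance per game under H0 (made up)

def alpha_of(b):
    # P(exit at +b before -b) for Brownian motion with drift mu < 0,
    # started at 0 on [-b, +b]: 1 / (1 + exp(-2*mu*b/sigma2)).
    return 1.0 / (1.0 + math.exp(-2.0 * mu * b / sigma2))

b = brentq(lambda x: alpha_of(x) - 0.05, 0.01, 100.0)
print(b)   # about 5.89 under these assumptions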