margin of error
Moderators: hgm, Rebel, chrisw
-
- Posts: 2272
- Joined: Mon Sep 29, 2008 1:50 am
Re: margin of error
If you test against a set of engines with known Elo you can use the likelihood ratio test (closely related to the Wald test). This test is easy to use.
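Michel does not spell out the test here; in engine testing the sequential form, Wald's sequential probability ratio test (SPRT), is the usual instance of this idea. Below is a minimal sketch under assumed hypotheses elo0 = 0 and elo1 = 5 and a simple trinomial game model; the bounds, error rates and model are illustrative choices, not anything prescribed in this thread.

```python
import math

def elo_to_score(elo):
    """Expected score from an Elo difference (logistic model)."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, losses, draws, elo0, elo1):
    """Log-likelihood ratio of H1 (Elo gain >= elo1) vs H0 (<= elo0),
    using a simple trinomial model with the observed draw ratio."""
    n = wins + losses + draws
    if n == 0 or wins == 0 or losses == 0:
        return 0.0  # sketch simplification: treat degenerate starts as no evidence
    draw_ratio = draws / n

    def probs(elo):
        s = elo_to_score(elo)
        # split the non-draw probability mass according to the expected score
        w = max(s - draw_ratio / 2.0, 1e-9)
        l = max(1.0 - s - draw_ratio / 2.0, 1e-9)
        return w, l

    w0, l0 = probs(elo0)
    w1, l1 = probs(elo1)
    return wins * math.log(w1 / w0) + losses * math.log(l1 / l0)

def sprt_decision(wins, losses, draws, elo0=0.0, elo1=5.0,
                  alpha=0.05, beta=0.05):
    """Return 'H1', 'H0' or 'continue' after the games played so far."""
    llr = sprt_llr(wins, losses, draws, elo0, elo1)
    lower = math.log(beta / (1.0 - alpha))        # accept H0 below this
    upper = math.log((1.0 - beta) / alpha)        # accept H1 above this
    if llr >= upper:
        return "H1"
    if llr <= lower:
        return "H0"
    return "continue"
```

Unlike a fixed-length match, the test itself tells you when enough games have been played, with the error rates alpha and beta fixed in advance.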
-
- Posts: 5106
- Joined: Tue Apr 29, 2008 4:27 pm
Re: margin of error
Michel wrote: If you test against a set of engines with known elo you can use the likelihood ratio test. This test is easy to use.

I'll look for that - do you have any references?
We don't really test head to head, but each version of Komodo tests against a set of opponents (not Komodo versions.)
Don
Capital punishment would be more effective as a preventive measure if it were administered prior to the crime.
-
- Posts: 27809
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: margin of error
If you directly play the versions against each other, the error in the difference is actually 5 Elo, not 7, and with only half the number of games. Measuring a difference from two separate measurements always adds the errors in quadrature (the squared errors add, and the total error is the square root of that sum), a problem you don't have if you measure the difference directly.
Compare it to the thought experiment where the gauntlet opponents were all version A, without you knowing it. Then you would play 16,500 games B vs A, as in the direct measurement. But instead of taking that result, you now subtract the result of the A-A gauntlet, which is pure error, as you know the difference between A and A has to be zero. So you wasted 16,500 games to produce a number that is nothing but error, only to add it to your more accurate B-A result and spoil it.
In measuring piece values I try to exploit this as much as possible, by playing the pieces I want to compare directly against each other. And preferably in pairs. (E.g. two Chancellors for one side, two Archbishops for the other.) Then you measure the difference of the pair values with the accuracy belonging to that number of games, making the error in the individual pieces half of that. (A one-vs-one measurement would have taken 4 times as many games, and two one-vs-others measurements 16 times as many games.)
Of course you have to beware of systematic errors, and also here it holds that an accurate piece value requires games against a variety of material combinations, not just one. And we know from Bishops that pairs can already show significant cooperative effects.
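The gauntlet argument above can be checked numerically. This is a rough Monte Carlo sketch, not anyone's actual test setup: the 50% draw ratio, 2,000 games per match and 300 repetitions are arbitrary assumptions.

```python
import random
import statistics

random.seed(1)
GAMES = 2000    # games per match (arbitrary)
TRIALS = 300    # repetitions to estimate the spread (arbitrary)

def match_score():
    """Average score of one match between two exactly equal engines
    (25% win, 50% draw, 25% loss per game)."""
    total = 0.0
    for _ in range(GAMES):
        r = random.random()
        if r < 0.25:
            total += 1.0      # win
        elif r < 0.75:
            total += 0.5      # draw
    return total / GAMES

direct, subtracted = [], []
for _ in range(TRIALS):
    # direct measurement: one B-vs-A match measures the difference itself
    direct.append(match_score() - 0.5)
    # indirect: (B vs gauntlet) minus (A vs gauntlet), two independent matches
    subtracted.append(match_score() - match_score())

sd_direct = statistics.stdev(direct)
sd_subtracted = statistics.stdev(subtracted)
# the ratio comes out near sqrt(2): the subtracted result needs twice the
# games for the same error, exactly the point made above
```

So the indirect route spends twice the games and still ends up with a larger error bar on the difference.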
-
- Posts: 27809
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: margin of error
Rémi Coulom wrote: I should probably remove confidence intervals and force people to use LOS in the next version, because as soon as there is correlation, there is no way to interpret confidence intervals in a correct way to compare ratings. There is almost always some correlation.

The information contained in the LOS approaches zero as it gets very close to 1. It would be more informative to have a matrix that gives you the error in each individual difference.
I like to know how much progress I make. Not just the likelihood I made some unspecified progress.
-
- Posts: 1971
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Margin of error.
Don wrote: What does LOS actually mean? If program A reports 98 over program B, given that we agreed beforehand on the number of games to play, is that supposed to imply that it's 98% likely to be superior?

Rémi Coulom wrote: Yes.

Don wrote: Presumably, if one program is in fact superior, even if it's only by 1 Elo, I assume that number will show 100% eventually if you run the test long enough. Is that correct?

Rémi Coulom wrote: Yes.

Don wrote: Is there a way mathematically to make this work without specifying the number of games in advance? Presumably it will require a lot more games to make a clear distinction, but it would be useful to know while a test is in progress.

Rémi Coulom wrote: Measuring statistical significance of the comparison when doing that kind of early stopping is very subtle and difficult. There was a thread about this topic some time ago. I am sorry I cannot find it any more. Lucas ran experiments if I remember correctly. Michel also participated. It is a complicated topic.

That paper is a reference that comes to my mind:
www.di.ens.fr/willow/pdfs/ICML08b.pdf
The idea is that if you decide to stop early, the experiment becomes less significant than what the LOS tells you.

Rémi

Don wrote: Years ago we had a rule to simply stop a test once one side was ahead by N games; it had the effect that even if the "wrong" version won the match, the superiority was very likely minor if N was large enough.

Currently we run all tests to the same number of games (20,000), but we stop a test early when it becomes "obvious" that it is not going to catch up to our best version. Due to limited testing resources we have to make compromises like this, but it would be a big win to have mathematically sound rules for stopping a test.

I wonder if something simple will suffice, even if it's not that precise mathematically. Here is an example stopping rule:

1. Set the number of games in advance, e.g. 20,000.
2. Set a threshold LOS for stopping, e.g. 98% (or 2%) LOS.
3. For calculating LOS "incrementally", assume the unplayed games will come out even, with draws in the same ratio as the played games.
4. Stop at 20,000 games, or at the first moment you cross the predefined LOS threshold.

This would probably not be the correct LOS, but could it be considered a safe "bound"? In other words, if it reports 80%, that would give you more confidence that the program was improved than if 80% were reported after 20,000 games, right?

Note that there are 2 issues here:

1. You can stop at any arbitrary point.
2. You must predefine your stopping rule.

I think you have to predefine your stopping rule. Even if it's not a fixed number of games, the rule itself has to be "fixed", such as what I described above. Otherwise you can always delay stopping in an attempt to influence the conclusion.

I'll take a look at that paper, but I don't know if I will be able to understand it. I get lost when the math gets heavy.

Don

Hello Don:

I am not the most authorized person to shed light on this issue, but I will try to put in my two cents.

First of all, I fully agree with Rémi in his answers to your post. I can be wrong, but I understand LOS as a one-sided test where the probability of being wrong is min(LOS, 1 - LOS)... of course, a decent number of games is needed for narrowing the error bars.

I will write about tests of only two engines: reading your points, the first one (20,000 games) seems sensible, but I do not like the second and especially the third one. Please imagine a bad start by one of the engines:

Code: Select all
+7 -1 =1

The number of games is VERY small (only nine), but LOS ~ 98.05% > 98% according to my own programme. I suppose that this LOS value for this example is the same as Rémi proposed almost three years ago in this post. So, you plan 20,000 games and LOS > 98% is first surpassed in the ninth game, but it is highly inaccurate IMHO: first play a lot of games, then take a look at LOS.

A LOS value of only 80% is very low to my understanding, because you can be wrong 1/5 of the time, which is a lot. I match the confidence of error bars and LOS as follows (I can be wrong):

Code: Select all
(Confidence) = 2*LOS - 1.
LOS = 0.5 + (Confidence)/2.

So, for the standard 95% confidence, the equivalent LOS is 97.5%; IIRC, you usually prefer 98% confidence, which is 99% LOS with my model.

I can write nothing about stopping rules because I do not know much about them, and they must be a complicated issue, as Rémi already wrote. But I programmed some Fortran 95 programmes that try to help here (I think that they are Windows compatible only, sorry for non-Windows users, but the sources are included):

Three Fortran Programs

Steve.R wrote: Statistical programs for interpreting test results by Jesús Muñoz
http://www14.zippyshare.com/v/95422764/file.html
1- LOS and Elo uncertainties calculator
Programme for calculating uncertainties (± Elo) in a match between two engines, and likelihood of superiority (LOS).
2- Minimum number of games
Programme for calculating the minimum number of games needed to ensure an Elo gain with a given LOS value.
3- Minimum score for no regression
Programme for calculating the minimum score required to avoid regression with a given LOS value.

The first programme calculates error bars just from the number of wins, draws and losses and the desired confidence interval, and it calculates LOS values at the same time. The second programme uses a model that calculates the minimum number of games required for a given LOS value (using the mean and standard deviation of a normal distribution) under the assumption of no draws, but you can compute the minimum number of games for a given draw ratio yourself by multiplying the output by sqrt(1 - draw_ratio). The third programme is based on Ciarrochi's tables, but with more freedom in the input of the number of games, the draw ratio and LOS = 1 - cut-off = 1 - alpha (results agree more closely with a high number of games).

I am not an erudite or a bright person on the issue; I just try to add my grain of salt. My humble tip is that setting a LOS threshold can be double-edged with a small number of games, especially if the threshold is not very high. I also think that you must set a double threshold (for example LOS < 2% or LOS > 98%), just in case the version you expected to perform better finally performs worse.

I see that self-tests are not the custom of Komodo development (I think that a variety of engines is better), but unfortunately I can not write anything about non-head-to-head matches, and I hope that you will find better help from other forum members.

Regards from Spain.

Ajedrecista.
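What the first programme is described as doing, error bars plus LOS from a wins/draws/losses record, can be sketched in a few lines. This is not Ajedrecista's Fortran code; it is a rough Python equivalent using the normal-approximation LOS formula common in the computer-chess community, so its numbers need not match his programme exactly (for +7 -1 =1 it gives about 98.3% rather than 98.05%).

```python
import math

def elo(p):
    """Elo difference corresponding to a score fraction p (logistic model)."""
    return -400.0 * math.log10(1.0 / p - 1.0)

def inverse_erf(y, lo=-6.0, hi=6.0):
    """Invert math.erf by bisection (good enough for a sketch)."""
    for _ in range(80):
        mid = 0.5 * (lo + hi)
        if math.erf(mid) < y:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def error_bar(wins, draws, losses, confidence=0.95):
    """Elo estimate plus upper/lower error bars at the given confidence."""
    n = wins + draws + losses
    p = (wins + 0.5 * draws) / n
    # variance of the per-game score, then standard error of the mean score
    var = (wins * 1.0 + draws * 0.25) / n - p * p
    sigma = math.sqrt(var / n)
    # two-sided z-value: z ~ 1.96 for 95% confidence
    z = math.sqrt(2.0) * inverse_erf(confidence)
    return elo(p), elo(p + z * sigma) - elo(p), elo(p - z * sigma) - elo(p)

def los(wins, losses):
    """Likelihood of superiority, ignoring draws (normal approximation)."""
    return 0.5 * (1.0 + math.erf((wins - losses) /
                                 math.sqrt(2.0 * (wins + losses))))
```

The identity quoted above, (Confidence) = 2*LOS - 1, is just the statement that a two-sided confidence interval read one-sidedly halves the error probability.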
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: margin of error
Well, isn't the software displaying margins of error for the Elo of each player, and not for the difference? So even in a head-to-head match we still have to multiply by sqrt(2), unless the report is directly of A-B and its associated margin of error, which I don't think BayesElo does. Say I have the following report:
Engine A 180 elo +5 -5
Engine B -180 elo +5 -5
Is the margin of error of A-B still 5?
-
- Posts: 1971
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Re: Margin of error.
Hello Daniel:
Daniel Shawul wrote: Well isn't the software displaying error of margins for the elos of each player, and not the difference? So even in a head to head match we still have to multiply by sqrt(2) unless the report is directly of A-B and associated error or margin, which I don't think bayeselo does. Say I have the following report:
Engine A 180 elo +5 -5
Engine B -180 elo +5 -5
Is A-B 's error of margin still 5?

Robert Houdart once wrote that in this case SRSS (the square root of the sum of squares) could be a good indicator: in your example, A - B = 180 - (-180) ± sqrt(5² + 5²) = 360 ± sqrt(50) ≈ 360 ± 7.07. I am not wise in this issue, but it does not look bad to me.
Regards from Spain.
Ajedrecista.
-
- Posts: 27809
- Joined: Fri Mar 10, 2006 10:06 am
- Location: Amsterdam
- Full name: H G Muller
Re: margin of error
Daniel Shawul wrote: Well isn't the software displaying error of margins for the elos of each player, and not the difference? So even in a head to head match we still have to multiply by sqrt(2) unless the report is directly of A-B and associated error or margin, which I don't think bayeselo does. Say I have the following report:
Engine A 180 elo +5 -5
Engine B -180 elo +5 -5
Is A-B 's error of margin still 5?

I would expect the A-B error to be 10 in that case, because the errors would perfectly anti-correlate. But it depends on the software. I usually estimate my errors by hand, as 40%/sqrt(NrOfGames); then I know exactly what I am doing.
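The hand estimate can be unpacked as follows: with a typical draw rate the per-game score has a standard deviation of roughly 0.4 points, so the match score is uncertain by about 40%/sqrt(N) percentage points, and near a 50% score one percentage point is worth about 7 Elo. This reading of the rule, and the 32% draw ratio chosen to reproduce the ~40% figure, are my assumptions, not numbers given in the post.

```python
import math

def score_sigma_pct(n_games, draw_ratio=0.32):
    """Standard deviation of the match score, in percentage points."""
    # per-game score variance at a 50% expected score: a decisive game
    # deviates by 0.5 points (variance 0.25), a draw is exactly the mean
    per_game_var = 0.25 * (1.0 - draw_ratio)
    return 100.0 * math.sqrt(per_game_var / n_games)

def elo_error(n_games, draw_ratio=0.32):
    """Approximate 1-sigma Elo error: ~7 Elo per percentage point
    near a 50% score."""
    return 7.0 * score_sigma_pct(n_games, draw_ratio)
```

For a 10,000-game match this gives roughly a 0.4 percentage-point, or 3 Elo, 1-sigma error.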
-
- Posts: 4185
- Joined: Tue Mar 14, 2006 11:34 am
- Location: Ethiopia
Re: margin of error
hgm wrote: I would expect the A-B error to be 10 in that case, because the errors would perfectly anti-correlate. But it depends on the software. I usually estimate my errors by hand, as 40%/sqrt(NrOfGames), then I know exactly what I am doing.

That is what I thought too. The correlation is -1, so the error would be even greater. But I am not sure what margin of error should be attached to each player, since we have covariances. Just last week we had a dispute about which margin of error is the right one given a score of 200-200-100 between two players. The Elostat way gave about twice the value of the other, more sophisticated methods. What do you think would be an appropriate margin of error to report for that score? It is now somewhat evident to me that reporting a single margin of error is wrong, but reporting separate variance and covariance matrices is not a solution either.
-
- Posts: 2041
- Joined: Wed Mar 08, 2006 8:30 pm
Re: margin of error
hgm wrote: I would expect the A-B error to be 10 in that case, because the errors would perfectly anti-correlate.

Why would they anti-correlate? (instead of being independent)
No no, I believe the best estimate for the A-B error is 7 (the square-root-of-sum-of-squares thing).
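Whether 7 (independent errors, sqrt(5² + 5²)) or 10 (perfectly anti-correlated errors) is right depends on how the tool derived the two ratings. A Monte Carlo sketch of the head-to-head case, where pinning the ratings symmetrically around zero makes the errors anti-correlate exactly; the game count and the true score are arbitrary assumptions.

```python
import math
import random
import statistics

random.seed(7)
GAMES = 1000        # games per simulated match (arbitrary)
TRIALS = 1000       # number of simulated matches
TRUE_SCORE = 0.55   # true expected score of A vs B; no draws, for simplicity

def elo(p):
    """Elo difference corresponding to a score fraction p."""
    return -400.0 * math.log10(1.0 / p - 1.0)

elo_a, elo_b = [], []
for _ in range(TRIALS):
    score = sum(random.random() < TRUE_SCORE for _ in range(GAMES)) / GAMES
    diff = elo(score)          # the measured Elo difference A-B
    elo_a.append(+diff / 2.0)  # symmetric ratings, as in the report above
    elo_b.append(-diff / 2.0)

sd_a = statistics.stdev(elo_a)
sd_diff = statistics.stdev(a - b for a, b in zip(elo_a, elo_b))
# by construction sd_diff == 2 * sd_a: the two errors fully anti-correlate,
# so they add linearly (2x), not in quadrature (sqrt(2)x)
```

If instead A and B had each been measured against an independent pool, the two errors would be uncorrelated and the difference error would be sqrt(2) times the individual one, which is the 7-Elo answer; so both answers are defensible, for different reporting conventions.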