Type I error in LOS based early stopping rule

Laskos · Post by **Laskos** » Wed Aug 07, 2013 1:43 am

If one is stopping early a match of planned N games (not shorter than 50 games) as soon as the Likelihood Of Superiority (LOS) reaches a certain value, and is not using SPRT as a stopping rule, he should be aware that a fixed LOS steadily accumulates Type I errors with the number N of planned games. Here is a table of Type I errors with the number of games (in fact wins+losses, as LOS is independent of draws).

Code: Select all

                 TYPE   I   ERROR
N Games  LOS=0.95   LOS=0.99   LOS=0.999

  100       27%        7.1%       1.2% 
  200       38%       11.5%       1.7%
  400       49%       14.6%       2.3%
  800       56%       17.8%       2.7%
 1500       62%       21.9%       2.9%
 3000       68%       24.0%       3.6%
 5000       72%       25.8%       4.4%
10000       76%       28.9%       4.8%
30000       82%       33.9%       5.4%

LOS of 95% is totally useless as early stopping. If one wants to have a Type I error of less than 5% for N up to ~30,000 (wins+losses) games, a LOS of 99.9% could be used as an early stopping rule. One can stop the match as soon as LOS gets to 99.9%, if the match is shorter than 50,000 games, and longer than 50 games. LOS is easy to calculate as (1 + Erf[(wins - losses)/Sqrt{2*(wins + losses)}])/2. Or just use SPRT of cutechess-cli.

Michel · Post by **Michel** » Wed Aug 07, 2013 8:01 am

Thanks!

It is well known that one should not use LOS (or p value) as an early stopping rule, but your test demonstrates that the error is much more extreme than what one would intuitively think.

Luckily there is the SPRT!

Uri Blass · Post by **Uri Blass** » Wed Aug 07, 2013 8:43 am

I do not know if one should not use LOS as an early stopping rule.

It is possible to use LOS when you stop only with 99.9%
confidence or after 30,000 (wins+losses).

I do not know if it is better or worse than normal SPRT and I think that it may be dependent on the distribution of the elo changes in the changes that you test.

SPRT is clearly better when you test 2 simple case H0 no improvement against H1 (6 elo improvement) but reality is that there are more than 2 cases.

LOS may be clearly faster than SPRT to reject clear regressions or accept clear improvements and for example if 50% of the changes that you test are 200 elo improvement and 50% of the changes that you test are 200 elo reductions then LOS is clearly faster to make improvement relative to SPRT.

Of course I took an extreme case that does not happen but my point is that the best testing rules are dependent on what you believe about the distribution of the changes that you test.

Laskos · Post by **Laskos** » Wed Aug 07, 2013 6:05 pm

Michel wrote:Thanks!

It is well known that one should not use LOS (or p value) as an early stopping rule, but your test demonstrates that the error is much more extreme than what one would intuitively think.

Luckily there is the SPRT!

Until the SPRT issue with drawelo is not fixed in cutechess-cli, I think I can rely on LOS of 99.9% with Type I error under 5% for matches of 50-50,000 games (including draws) in searching for true positives. LOS has an advantage that it is computed whenever one wants, and not sequentially. Also, LOS 99.9% means 3.1 SD, and one doesn't even have to calculate LOS, but usual 2 SD error margins, and say a result 16 +/- 10 Elo is conclusive, and a stop is allowed, while 14 +/- 10 Elo is not. I think SPRT saves some 10-20% time (typical LOS at SPRT stop is about 99.0-99.5%), and Type II errors can be dealt with, but one has to be careful choosing hypothesis. Sure, when drawelo will be computed from actual games in cutechess-cli, there is no issue that it is better, and LOS is just a check for out of range cases like Uri describes.

lucasart · Post by **lucasart** » Thu Aug 08, 2013 5:49 am

Laskos wrote:
Michel wrote:Thanks!

It is well known that one should not use LOS (or p value) as an early stopping rule, but your test demonstrates that the error is much more extreme than what one would intuitively think.

Luckily there is the SPRT!
Until the SPRT issue with drawelo is not fixed in cutechess-cli, I think I can rely on LOS of 99.9% with Type I error under 5% for matches of 50-50,000 games (including draws) in searching for true positives. LOS has an advantage that it is computed whenever one wants, and not sequentially. Also, LOS 99.9% means 3.1 SD, and one doesn't even have to calculate LOS, but usual 2 SD error margins, and say a result 16 +/- 10 Elo is conclusive, and a stop is allowed, while 14 +/- 10 Elo is not. I think SPRT saves some 10-20% time (typical LOS at SPRT stop is about 99.0-99.5%), and Type II errors can be dealt with, but one has to be careful choosing hypothesis. Sure, when drawelo will be computed from actual games in cutechess-cli, there is no issue that it is better, and LOS is just a check for out of range cases like Uri describes.

The issue has been fixed:
https://github.com/cutechess/cutechess/issues/6
https://github.com/cutechess/cutechess/ ... 894552b45b

What the fix does is two things:
1/ draw_elo is not hardcoded anymore, but estimated out of sample (draw_elo has much more impact on the stopping rule that I initially imagined)
2/ elo0 and elo1 are expressed in usual ELO units, not in unscaled Bayes ELO units.

The idea is that the end user is not supposed to know anything about the Bayes ELO model, which is only used internally to reduce the dimension of the problem to 1 (instead of 2) so that SPRT can be applied.

Now the user interface presents SPRT just like the litterature would:
* alpha is the max type I error of the test. This maximum is reached for elo = elo0.
* beta is the max type II error of the test, in the zone elo >= elo1. This maximum is reached for elo=elo1.
* H0: elo=elo0, H1: elo=elo1. All elos are expressed in usual units, so people don't have to figure out to rescale them into bayes elo, depending on the value of draw_elo...

To use this, you must either:
* download the latest source code from Ilari's repo and compile it.
* or use version 0.6, which Ilari released a while ago.

I hope this clarifies.

As for Uri's post:
* I have no idea what he is talking about, and I don't think it has to do with draw_elo.
* He has yet to explain to us by which measure he thinks his "p-value 99.9% test when win+draw>30,000" is better than SPRT.
* Until he provides some concrete figures and a comparison apples to apples of both tests, you should disregard his posts. He has been trolling the Stockfish development forum in a similar manner when he joined.

PS: Implementation has been validated by a simulator, which itseld has been validated against the theory (formulas by Michel Van den Bergh).

Uri Blass · Post by **Uri Blass** » Thu Aug 08, 2013 6:45 am

I ask the moderators to delete a post when somebody say that I am a troll(note that there is only one person in this forum who blamed me on trolling and I am not going to give names so people who read the subject later after the moderators delete the relevant post may not know the name of that person that is really not important and the only important thing is to stop his bad behaviour)

The reason that I ask the moderator to delete the post is because the post is against the CCC charter.

People are not allowed to post a message that is abusive in nature and claiming that I am trolling and was trolling in a different forum in a similiar manner is clearly a post that is abusive in nature.

Note that I even do not repsond directly to the question and respond to my post because I am afraid that if the moderator delete a post then all the responses for this post also get deleted.

For the relevant questions:

1)It is correct that my post has nothing to do with draw elo
2) p-value 99.9% test when win+losses>30,000
can be better than SPRT when the target is to get the biggest expected improvement for the same number of games.

It is dependent on the distribution of the value of the changes that you test and people should have some opinion about the distribution.

The extreme example that I gave is that the changes that you have are very big changes of 200 elo.

Let imagine even more extreme case.
Every change that you test gives 100% against the previous version or 0% against the previous version and you know it in advance and the only thing that you do not know is if it is going to give 100% or 0%.

The best testing strategy is simply to have a single game in every test.
p-value 99.9% is going to give 10 games in every test and give the same result when SPRT in the way that it is used in the stockfish testing is going to give nearly 100 games in every test and give the same result.

It is clear that p-value 99.9% is about 10 times faster than SPRT in this case.

The point is that the best testing strategy is dependent on the apriory distribution of the changes and claiming that something is the best testing strategy without assumptions about the distribution of the changes
that you test is simply not correct.

lucasart · Post by **lucasart** » Thu Aug 08, 2013 12:20 pm

Uri,

You are Comparing (0) Apples (1) and Oranges (2), and giving me an example of Cucumbers (3):

(0) Comparison
A comparison doesn't mean hand waving. I am not interested in arguments made of words and fine assertions. I have yet to see any analysis and numbers from you to support any of your arguments.

(1) Apples

SPRT(elo0,elo1,alpha,beta)

For example

SPRT(0,6,5%,5%)

is (in layman terms), "a test with 5% error level and 6 elo resolution".

(2) Oranges

Uri Blass wrote: p-value 99.9% test when win+losses>30,000

What is the error level of that test and testing resolution ?
To what SPRT test are you "comparing" it ? (what elo0, elo1, alpha, beta)

(3) Cucumbers

Uri Blass wrote:The best testing strategy is simply to have a single game in every test.
p-value 99.9% is going to give 10 games in every test and give the same result when SPRT in the way that it is used in the stockfish testing is going to give nearly 100 games in every test and give the same result.

It is clear that p-value 99.9% is about 10 times faster than SPRT in this case.

What happenned to the win+draw>30,000 ? This is a completely different test, and the average stopping times, the error level, and testing resolution, are not even remotely comparable (see Kai's initial post).

I have studied this problem in great depth, and experimented it to death. I will not do it again, just for you. I do not have the patience. I will say this for the last time, and stop arguing with you from now on: please learn and experiment with SPRT, before you continue to make affirmations about it.

Thank you.

Uri Blass · Post by **Uri Blass** » Thu Aug 08, 2013 12:59 pm

1)I compare SPRT(0,6,5%,5%) that the stockfish team use to
p-value 99.9% test when you stop when win+losses>30,000 if you do not achieve this p-value.

2)I do not claim that one of them is better and I claim that the better test is dependent on the distribution of changes.

3)I think that a logical target is simply to achieve the maximal expected elo improvement in the program when you have limited number of games(that may be for example the number of games that you expect to play in 2 months).

Suppose that you have a list of 1,000,000 changes that you are interested in testing and you can use only 1,000,000 games.

What is the strategy that you are going to use?
My claim is that the optimal strategy (that I do not claim to know how to calculate in the general case) is dependent on the distribution of the elo improvement in the changes that you test.

I cannot say that SPRT is better and in an extreme case I can prove that p-value achieve better results inspite of not being optimal.

The extreme case that I talk about it is if 50% of your changes are 10000 elo improvement and 50% of your changes are 10000 elo reduction(10000 elo difference means that the probability to win for the better program is practically 1 and more than 0.999999999).

In this specific case the best testing strategy is going to be having 1 game for every change and you can expect getting 500,000*10000 elo improvement in 1,000,000 games by this strategy.

SPRT(0,6,5%,5%) in this case is going to give you the possibility to test only nearly 10,000 tests out of 1,000,000 that means that you are going to get only 5000*10000 elo improvement.

p-value 99.9% is going to give you 5000*100,000 elo improvement better than SPRT but not close to optimal.

Of course the case when every change is 10,000 elo improvement or 10,000 elo reduction does not happen but the point is that the best testing algorithm to get the biggest improvement is dependent on the distribution of the rating difference between the original program and the modified program.

AlvaroBegue · Post by **AlvaroBegue** » Thu Aug 08, 2013 3:52 pm

You are absolutely right that the optimal stopping rule depends on the distribution of ELO increments of the ideas being tested.

Could the people with lots of testing experience give us an idea for what the distribution looks like? A mean and a standard deviation would probably suffice.

I have thought of a procedure to compute the optimal stopping rule given the distribution of ELO increments, and I know that at least Don expressed interest in it in the past. The quantity to maximize is expected ELO gain per game played. Unfortunately, I haven't found the time to actually implement it: I have three kids under 3...

zamar · Post by **zamar** » Thu Aug 08, 2013 4:15 pm

AlvaroBegue wrote:You are absolutely right that the optimal stopping rule depends on the distribution of ELO increments of the ideas being tested.

Could the people with lots of testing experience give us an idea for what the distribution looks like? A mean and a standard deviation would probably suffice.

I have thought of a procedure to compute the optimal stopping rule given the distribution of ELO increments, and I know that at least Don expressed interest in it in the past. The quantity to maximize is expected ELO gain per game played. Unfortunately, I haven't found the time to actually implement it: I have three kids under 3...

Of course the optimal stopping rule depends on the distribution. If one tests only +100 elo patches, we need zero games per patch and can commit them straight away.

But seriously speaking I prefer to stay with well-known and well-respected theory (SPRT) rather than to read the endless hand waving from Uri which just confuses beginners.

Type I error in LOS based early stopping rule

Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule