Type I error in LOS based early stopping rule

Michel · Post by **Michel** » Fri Aug 09, 2013 6:00 pm

But a simple test with elo0=-1, elo1=3, and the true value elo=8 needs as many games or more than the LOS of 99.9% with the same Type I error.

This is not a valid comparison. As Joona and Lucas were saying. If you already know the elo difference NO games are necessary.

This is performing simultaneously a pair of one-sided SPRTs? One for acceptance/rejection of H0, and another for acceptance/rejection of H1?

Yes

There seem to be two cases where SPRT is sub-optimal: elo=(elo0+elo1)/2 and elo>>elo1>~elo0.

I do not accept the second case as being suboptimal (see above).

Laskos · Post by **Laskos** » Fri Aug 09, 2013 7:16 pm

Michel wrote:
But a simple test with elo0=-1, elo1=3, and the true value elo=8 needs as many games or more than the LOS of 99.9% with the same Type I error.
This is not a valid comparison. As Joona and Lucas were saying. If you already know the elo difference NO games are necessary.

You seem not to get it. SPRT is less than optimal and less efficient than LOS of 99.9% at the same Type I error for elo0=-1, elo1=3, in the ranges elo> +8 and elo< -6. It's exactly the opposite, with SPRT one has to guess beforehand that elo is between -6 and +8 (and not very close to (3-1)/2=1) for SPRT to be close to optimal.

There seem to be two cases where SPRT is sub-optimal: elo=(elo0+elo1)/2 and elo>>elo1>~elo0.
I do not accept the second case as being suboptimal (see above).

See the above.

Michel · Post by **Michel** » Fri Aug 09, 2013 7:59 pm

Ok I misunderstood. You are not fixing the stronger engine, but you are simply assuming that the elo difference is large. Well I stand by my opinion that optimizing for large elo differences is pointless (read my earlier post on this).

Uri Blass · Post by **Uri Blass** » Fri Aug 09, 2013 9:15 pm

Michel wrote:Ok I misunderstood. You are not fixing the stronger engine, but you are simply assuming that the elo difference is large. Well I stand by my opinion that optimizing for large elo differences is pointless (read my earlier post on this).

I think that it may be not relevant for stockfish but it is clearly relevant for new engine when you expect a big improvement by a single change and I think adding null move pruning or adding LMR or adding mobility evaluation can clearly add more than 8 elo so changes that add more than 8 elo are clearly common for a new engine.

Laskos · Post by **Laskos** » Sat Aug 10, 2013 1:23 pm

Uri Blass wrote:
Michel wrote:Ok I misunderstood. You are not fixing the stronger engine, but you are simply assuming that the elo difference is large. Well I stand by my opinion that optimizing for large elo differences is pointless (read my earlier post on this).
I think that it may be not relevant for stockfish but it is clearly relevant for new engine when you expect a big improvement by a single change and I think adding null move pruning or adding LMR or adding mobility evaluation can clearly add more than 8 elo so changes that add more than 8 elo are clearly common for a new engine.

Some patches are clearly bad by more than 6 Elo points, and with a window of elo0=-1 and elo1=3, SPRT will again have difficulty to reject it. But SPRT is never very bad. If elo is close to the window or inside it, it's usually 20-30% faster than LOS 99.9%, if it's far away from the window it's 20-40% slower than LOS. But SPRT has clearly defined alpha and beta, and usually one is not far away from the window. So, it's not optimal for all Elo differences, but even when it's sub-optimal, it's not by much. And SPRT is easy to implement.

lucasart · Post by **lucasart** » Sat Aug 10, 2013 1:31 pm

Laskos wrote:If one is stopping early a match of planned N games (not shorter than 50 games) as soon as the Likelihood Of Superiority (LOS) reaches a certain value, and is not using SPRT as a stopping rule, he should be aware that a fixed LOS steadily accumulates Type I errors with the number N of planned games. Here is a table of Type I errors with the number of games (in fact wins+losses, as LOS is independent of draws).
Code: Select all
                 TYPE   I   ERROR
N Games  LOS=0.95   LOS=0.99   LOS=0.999

  100       27%        7.1%       1.2% 
  200       38%       11.5%       1.7%
  400       49%       14.6%       2.3%
  800       56%       17.8%       2.7%
 1500       62%       21.9%       2.9%
 3000       68%       24.0%       3.6%
 5000       72%       25.8%       4.4%
10000       76%       28.9%       4.8%
30000       82%       33.9%       5.4%
LOS of 95% is totally useless as early stopping. If one wants to have a Type I error of less than 5% for N up to ~30,000 (wins+losses) games, a LOS of 99.9% could be used as an early stopping rule. One can stop the match as soon as LOS gets to 99.9%, if the match is shorter than 50,000 games, and longer than 50 games. LOS is easy to calculate as (1 + Erf[(wins - losses)/Sqrt{2*(wins + losses)}])/2. Or just use SPRT of cutechess-cli.

I don't understand how you calculate these numbers. For a given cap/floor to the number of games (respectively N and 50), the type I error %age is a function of the probability:

Code: Select all

type I = f&#40;P&#40;win&#41;, P&#40;draw&#41;) = g&#40;bayeselo, drawelo&#41; = h&#40;elo, drawratio&#41;

depending on your parametrization.

First you reduce the problem to 1 dimension by using the bayeselo model (you fix drawelo and the variable is bayeselo):

Code: Select all

type I = g_drawelo&#40;bayeselo&#41;

Now you have to do simulations to measure the type I error rate for each value of the parameter bayeselo.

The max type I error of the p-value based test (regardless of the quantile 99.9% or anything else, and regardless of your floor at 50 games and your cap at N games or no floor/cap) is 50%. This value is obviously reached when bayeselo = 0-:

Code: Select all

limit&#40;elo->0, elo < 0&#41; g&#40;bayeselo&#41; = 50%

Your testing hypothesis are:

Code: Select all

H0 bayeselo < 0
H1 = H0^C = elo < 0

[I'm not sure if the p-value test is almost surely finite, and if the expected stopping converged at the point elo = 0. so I remove this case, which is non restrictive in practice]

So your worst type I error rate is 50%, which is simply not acceptable.

Any attmept to compare this to SPRT is always going to result in Apples vs. Pears:

Code: Select all

1/ SPRT uses binary hypothesie
2/ p-value 99.9% uses elo < 0 vs elo > 0

SPRT is optimal for the tests belonging to family 1/

p-value test is a very bad test of family 2/. There are some tests of family 2/, and I have done some experiments, but in the end, I prefered SPRT (despite the tradeoff and the [elo0,elo1] zone which is annoying):
certis.enpc.fr/~audibert/Mes%20articles/ICML08b.pdf‎

Michel · Post by **Michel** » Sat Aug 10, 2013 1:57 pm

There are some tests of family 2/, and I have done some experiments, but in the end, I prefered SPRT (despite the tradeoff and the [elo0,elo1] zone which is annoying):

Tests that test elo>0 versus elo<0 have unbounded expected running time. Since in practice you are always forced to truncate you will always end up with an elo0,elo1 tradeoff.... It is unavoidable.

Laskos · Post by **Laskos** » Sat Aug 10, 2013 2:05 pm

lucasart wrote:
Laskos wrote:If one is stopping early a match of planned N games (not shorter than 50 games) as soon as the Likelihood Of Superiority (LOS) reaches a certain value, and is not using SPRT as a stopping rule, he should be aware that a fixed LOS steadily accumulates Type I errors with the number N of planned games. Here is a table of Type I errors with the number of games (in fact wins+losses, as LOS is independent of draws).
Code: Select all
                 TYPE   I   ERROR
N Games  LOS=0.95   LOS=0.99   LOS=0.999

  100       27%        7.1%       1.2% 
  200       38%       11.5%       1.7%
  400       49%       14.6%       2.3%
  800       56%       17.8%       2.7%
 1500       62%       21.9%       2.9%
 3000       68%       24.0%       3.6%
 5000       72%       25.8%       4.4%
10000       76%       28.9%       4.8%
30000       82%       33.9%       5.4%
LOS of 95% is totally useless as early stopping. If one wants to have a Type I error of less than 5% for N up to ~30,000 (wins+losses) games, a LOS of 99.9% could be used as an early stopping rule. One can stop the match as soon as LOS gets to 99.9%, if the match is shorter than 50,000 games, and longer than 50 games. LOS is easy to calculate as (1 + Erf[(wins - losses)/Sqrt{2*(wins + losses)}])/2. Or just use SPRT of cutechess-cli.
I don't understand how you calculate these numbers. For a given cap/floor to the number of games (respectively N and 50), the type I error %age is a function of the probability:
Code: Select all
type I = f&#40;P&#40;win&#41;, P&#40;draw&#41;) = g&#40;bayeselo, drawelo&#41; = h&#40;elo, drawratio&#41;
depending on your parametrization.

First you reduce the problem to 1 dimension by using the bayeselo model (you fix drawelo and the variable is bayeselo):
Code: Select all
type I = g_drawelo&#40;bayeselo&#41;
Now you have to do simulations to measure the type I error rate for each value of the parameter bayeselo.

The max type I error of the p-value based test (regardless of the quantile 99.9% or anything else, and regardless of your floor at 50 games and your cap at N games or no floor/cap) is 50%. This value is obviously reached when bayeselo = 0-:
Code: Select all
limit&#40;elo->0, elo < 0&#41; g&#40;bayeselo&#41; = 50%
Your testing hypothesis are:
Code: Select all
H0 bayeselo < 0
H1 = H0^C = elo < 0
[I'm not sure if the p-value test is almost surely finite, and if the expected stopping converged at the point elo = 0. so I remove this case, which is non restrictive in practice]

So your worst type I error rate is 50%, which is simply not acceptable.

Any attmept to compare this to SPRT is always going to result in Apples vs. Pears:
Code: Select all
1/ SPRT uses binary hypothesie
2/ p-value 99.9% uses elo < 0 vs elo > 0
SPRT is optimal for the tests belonging to family 1/

p-value test is a very bad test of family 2/. There are some tests of family 2/, and I have done some experiments, but in the end, I prefered SPRT (despite the tradeoff and the [elo0,elo1] zone which is annoying):
certis.enpc.fr/~audibert/Mes%20articles/ICML08b.pdf‎

I didn't quite followed, but it's simpler than that.
H0 is elo=0
H1 is elo>0 && elo<0.
The null hypothesis H0 is true, and I get the rate of its rejection with a given LOS, N, floor 50, getting the Type I error. Type I error is always 100% for N->infinity for every LOS.
Yes, SPRT being a binary hypothesis, the comparison is of apples to oranges, and it's comparable only in rejection of H0: elo=0. But I think that's the testers want, to check for elo<>0 and stopping with a certain Type I error.

Don · Post by **Don** » Mon Aug 19, 2013 11:17 pm

AlvaroBegue wrote:You are absolutely right that the optimal stopping rule depends on the distribution of ELO increments of the ideas being tested.

Could the people with lots of testing experience give us an idea for what the distribution looks like? A mean and a standard deviation would probably suffice.

I have thought of a procedure to compute the optimal stopping rule given the distribution of ELO increments, and I know that at least Don expressed interest in it in the past. The quantity to maximize is expected ELO gain per game played. Unfortunately, I haven't found the time to actually implement it: I have three kids under 3...

I want to integrate SPRT into our own tester but I am unsure of the math. However it is not difficult to run simulations using the criteria of 2 programs which are equal in strength (simulated programs) while duplicating the draw percentage that you are actually getting. But that is still not SPRT.

So the idea is to find a stopping rule you can live with. You don't want to accept too many regressions. A trivial stopping rule is to stop after your target program is ahead by W games or behind by L games. Using a simulation to avoid the math will tell you quickly how many "regressions" you will accept. You could consider a zero ELO improvement a regression.

But it's also important that the regression you accept are small ones. It's a given that you will accept some regressions but you want those to be very small ones.

So I am going to have to either wade through the math or find the calculation on-line already code up.

Laskos · Post by **Laskos** » Mon Aug 19, 2013 11:50 pm

Don wrote:
AlvaroBegue wrote:You are absolutely right that the optimal stopping rule depends on the distribution of ELO increments of the ideas being tested.

Could the people with lots of testing experience give us an idea for what the distribution looks like? A mean and a standard deviation would probably suffice.

I have thought of a procedure to compute the optimal stopping rule given the distribution of ELO increments, and I know that at least Don expressed interest in it in the past. The quantity to maximize is expected ELO gain per game played. Unfortunately, I haven't found the time to actually implement it: I have three kids under 3...
I want to integrate SPRT into our own tester but I am unsure of the math. However it is not difficult to run simulations using the criteria of 2 programs which are equal in strength (simulated programs) while duplicating the draw percentage that you are actually getting. But that is still not SPRT.

So the idea is to find a stopping rule you can live with. You don't want to accept too many regressions. A trivial stopping rule is to stop after your target program is ahead by W games or behind by L games. Using a simulation to avoid the math will tell you quickly how many "regressions" you will accept. You could consider a zero ELO improvement a regression.

But it's also important that the regression you accept are small ones. It's a given that you will accept some regressions but you want those to be very small ones.

So I am going to have to either wade through the math or find the calculation on-line already code up.

SPRT is easy to implement, summing up sequentially some logarithms of some ratios, even the links in this thread are enough to follow it. SPRT behaves usually very well, and is the best recommendation, but if you don't want to go through all the logarithms, you can use LOS of 99.9% in the 50-50,000 games range for a Type I error smaller than 5%. It's usually slower than SPRT, but if you have Erf function, it's easy to calculate LOS whenever you want (not necessarily sequentially) as (1 + Erf[(wins - losses)/Sqrt{2*(wins + losses)}])/2. If you want to control Type I and II errors, then SPRT is the way to go.

Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule

Re: Type I error in LOS based early stopping rule