Laskos wrote: If one stops a match of N planned games (not shorter than 50 games) early, as soon as the Likelihood Of Superiority (LOS) reaches a certain value, and does not use SPRT as the stopping rule, one should be aware that a fixed LOS threshold steadily accumulates Type I errors as the number N of planned games grows. Here is a table of Type I error rates against the number of games (in fact wins+losses, as LOS is independent of draws).
TYPE I ERROR
N games   LOS=0.95   LOS=0.99   LOS=0.999
    100     27%        7.1%       1.2%
    200     38%       11.5%       1.7%
    400     49%       14.6%       2.3%
    800     56%       17.8%       2.7%
   1500     62%       21.9%       2.9%
   3000     68%       24.0%       3.6%
   5000     72%       25.8%       4.4%
  10000     76%       28.9%       4.8%
  30000     82%       33.9%       5.4%
A LOS of 95% is totally useless as an early stopping rule. If one wants a Type I error of less than 5% for N up to ~30,000 (wins+losses) games, a LOS of 99.9% could be used as an early stopping rule. One can stop the match as soon as LOS reaches 99.9%, provided the match is shorter than 50,000 games and longer than 50 games. LOS is easy to calculate as (1 + Erf[(wins - losses)/Sqrt{2*(wins + losses)}])/2. Or just use the SPRT of cutechess-cli.
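For reference, the LOS formula quoted above can be written as a few lines of Python. This is only a sketch of that one expression; the function name `los` and the choice to return 0.5 when there are no decisive games are my own assumptions, not something from the quoted post.

```python
import math

def los(wins: int, losses: int) -> float:
    """LOS = (1 + erf((wins - losses) / sqrt(2 * (wins + losses)))) / 2.

    Draws are ignored, as in the quoted formula.
    """
    decisive = wins + losses
    if decisive == 0:
        # Assumption: with no decisive games there is no evidence either way.
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

if __name__ == "__main__":
    print(los(60, 40))  # about 0.977 (normal CDF at 2 standard deviations)
```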
I don't understand how you calculated these numbers. For a given cap and floor on the number of games (N and 50 respectively), the type I error rate is a function of the outcome probabilities:
type I = f(P(win), P(draw)) = g(bayeselo, drawelo) = h(elo, drawratio)
depending on your parametrization.
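To make the change of parametrization concrete, here is a minimal sketch of the mapping from (bayeselo, drawelo) to (P(win), P(draw), P(loss)). It assumes the usual BayesElo logistic model with no first-move advantage term; the function name `bayeselo_to_probs` is hypothetical.

```python
def bayeselo_to_probs(bayeselo: float, drawelo: float):
    """Return (P(win), P(draw), P(loss)) under the assumed BayesElo model.

    Assumption: no first-move (white) advantage term is included.
    """
    p_win = 1.0 / (1.0 + 10.0 ** ((-bayeselo + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((bayeselo + drawelo) / 400.0))
    p_draw = 1.0 - p_win - p_loss
    return p_win, p_draw, p_loss

if __name__ == "__main__":
    # Equal strength, drawelo fixed at 200 (an arbitrary illustrative value).
    print(bayeselo_to_probs(0.0, 200.0))
```

With drawelo fixed, the only remaining free parameter is bayeselo, which is the one-dimensional reduction used in the next step.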
First you reduce the problem to 1 dimension by using the bayeselo model (you fix drawelo and the variable is bayeselo):
type I = g_drawelo(bayeselo)
Now you have to do simulations to measure the type I error rate for each value of the parameter bayeselo.
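Such a simulation could look roughly like the sketch below. It is not a reconstruction of how the quoted table was produced: it assumes the LOS is checked after every decisive game from game 50 onwards, it works on decisive games only, and all names (`stops_early`, `type1_rate`, `p_win_decisive`) are hypothetical. Here `p_win_decisive` stands for P(win)/(P(win)+P(loss)), which is 0.5 at bayeselo = 0.

```python
import math
import random

def stops_early(p_win_decisive, threshold, max_games, min_games=50, rng=random):
    """Play up to max_games decisive games; True if LOS ever reaches threshold."""
    wins = losses = 0
    for n in range(1, max_games + 1):
        if rng.random() < p_win_decisive:
            wins += 1
        else:
            losses += 1
        if n >= min_games:
            los = 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * n)))
            if los >= threshold:
                return True  # first engine declared superior, match stopped
    return False

def type1_rate(threshold, max_games, trials=2000, seed=1, p_win_decisive=0.5):
    """Fraction of simulated matches in which the first engine is declared superior."""
    rng = random.Random(seed)  # fixed seed so runs are repeatable
    hits = sum(
        stops_early(p_win_decisive, threshold, max_games, rng=rng)
        for _ in range(trials)
    )
    return hits / trials

if __name__ == "__main__":
    for thr in (0.95, 0.99, 0.999):
        print(thr, type1_rate(thr, max_games=400))
```

Sweeping `p_win_decisive` over values at or below 0.5 would then give the error rate as a function of the parameter, for a fixed drawelo.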
The max type I error of the p-value based test (regardless of the quantile 99.9% or anything else, and regardless of your floor at 50 games and your cap at N games or no floor/cap) is 50%. This value is obviously reached when bayeselo = 0-:
limit(bayeselo->0, bayeselo < 0) g(bayeselo) = 50%
Your test hypotheses are:
H0: bayeselo < 0
H1 = H0^C: bayeselo > 0
[I'm not sure whether the p-value test stops in finite time almost surely, or whether the expected stopping time converges, at the point elo = 0. So I exclude this case, which is not restrictive in practice.]
So your worst type I error rate is 50%, which is simply not acceptable.
Any attempt to compare this to SPRT is always going to end up comparing apples and pears:
1/ SPRT uses binary hypotheses
2/ p-value 99.9% uses elo < 0 vs elo > 0
SPRT is optimal for the tests belonging to family 1/
The p-value test is a very bad test in family 2/. There are some other tests in family 2/, and I have done some experiments with them, but in the end I preferred SPRT (despite the tradeoff and the annoying [elo0,elo1] zone):
certis.enpc.fr/~audibert/Mes%20articles/ICML08b.pdf
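To make family 1/ concrete, here is a minimal sketch of Wald's SPRT between two point hypotheses. It is a simplification that I am assuming for illustration only: it looks at decisive games, with H0: p = p0 and H1: p = p1 for the win probability among decisive games, and it is not the implementation used by cutechess-cli. The function name and the default alpha and beta are hypothetical.

```python
import math

def sprt_decision(wins, losses, p0, p1, alpha=0.05, beta=0.05):
    """Wald SPRT on decisive games: returns 'H0', 'H1' or 'continue'.

    Assumption: each decisive game is a win with probability p,
    and we test p = p0 against p = p1 with p0 < p1.
    """
    # Log-likelihood ratio of H1 against H0 for the observed wins and losses.
    llr = wins * math.log(p1 / p0) + losses * math.log((1.0 - p1) / (1.0 - p0))
    lower = math.log(beta / (1.0 - alpha))   # accept H0 at or below this
    upper = math.log((1.0 - beta) / alpha)   # accept H1 at or above this
    if llr >= upper:
        return "H1"
    if llr <= lower:
        return "H0"
    return "continue"

if __name__ == "__main__":
    print(sprt_decision(wins=230, losses=200, p0=0.50, p1=0.55))
```

Unlike the LOS rule, both error rates are fixed in advance through the two bounds, at the price of having to choose the pair (p0, p1).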
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.