After looking at this graph you posted:
I got to thinking: wouldn't it be nice if we could simultaneously calculate multiple statistics and then stop when any one of them reaches a valid bound? E.g., calculate the statistic(s) for the Elo bounds (0, 4), (3, 8), and (7, 16) after each game played. When any one of these is accepted, you're done. If the (7, 16) bound is rejected, you would continue, because there is still a chance that either the (3, 8) bound or the (0, 4) bound could be accepted with more games. I think a change like this would cut the number of games required by a substantial amount.
Furthermore, we could also introduce bounds like (-2, 2), (-4, -1), and (-8, -3). The point of these calculations would be that if one of them accepts H0, we can be pretty sure that the patch is an Elo loser and there is no need to continue the test.
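The scheme above can be sketched in code. To be clear, this is a minimal illustration and not cutechess-cli's implementation: it assumes a simple trinomial model with a fixed draw ratio (cutechess-cli derives its probabilities from a BayesElo-based model), and all function names here are my own. One LLR statistic is kept per bound; accepting H1 on any bound stops the test, while a bound whose LLR hits the lower threshold is simply dropped.

```python
import math

def expected_score(elo):
    """Logistic expected score for an Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def trinomial_probs(elo, draw_ratio=0.3):
    """Win/draw/loss probabilities for an Elo advantage, assuming a
    fixed draw ratio (a simplification; cutechess-cli uses a
    BayesElo-based model instead)."""
    win = expected_score(elo) - draw_ratio / 2.0
    return win, draw_ratio, 1.0 - win - draw_ratio

def llr(wins, draws, losses, elo0, elo1):
    """Log-likelihood ratio of H1 (Elo = elo1) vs H0 (Elo = elo0)."""
    w0, d0, l0 = trinomial_probs(elo0)
    w1, d1, l1 = trinomial_probs(elo1)
    # With a fixed draw ratio, draws contribute log(d1/d0) = 0.
    return (wins * math.log(w1 / w0) +
            draws * math.log(d1 / d0) +
            losses * math.log(l1 / l0))

def multi_sprt(results, bounds, alpha=0.05, beta=0.05):
    """Run one SPRT per (elo0, elo1) bound over a single game stream.
    Stop as soon as ANY bound accepts H1; a bound whose LLR falls to
    the lower threshold merely drops out, and the test continues with
    the remaining bounds.  results: +1 win, 0 draw, -1 loss."""
    upper = math.log((1.0 - beta) / alpha)   # accept H1
    lower = math.log(beta / (1.0 - alpha))   # accept H0 (drop bound)
    active = list(bounds)
    w = d = l = 0
    for game, r in enumerate(results, start=1):
        if r > 0:
            w += 1
        elif r < 0:
            l += 1
        else:
            d += 1
        for b in list(active):
            stat = llr(w, d, l, b[0], b[1])
            if stat >= upper:
                return 'H1', b, game     # some bound accepted H1: stop
            if stat <= lower:
                active.remove(b)         # this bound accepted H0: drop it
        if not active:
            return 'H0', None, game      # every bound rejected: Elo loser
    return 'undecided', None, w + d + l
```

Feeding in a long winning streak, the widest bound reaches its upper threshold first and stops the whole test early, which is exactly the early-exit behavior described above.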
In a different thread
Laskos wrote: I played a bit with SPRT of cutechess-cli and a true value of 8 Elo points difference. It seems a bit of an art.
Code:
H0   H1   accepted   games   LOS
 0    2      H1      15700   99.99%
 0    3      H1      12000   99.95%
 0    5      H1       6800   99.70%
 0   10      H1       5800   99.28%
 0   15      H1       2300   99.25%
 0   20      H0       8000   99.61%
 0   25      H0       2900   95.60%
H0 is always 0 points here, alpha=beta=0.05, and the true difference is 8 Elo points. For these alpha, beta: if one sets H1 too low, one will play more games before stopping on H1 than it would take to reach 99.9% LOS. If one sets H1 reasonably, one saves ~20% time compared to stopping at LOS 99.9%. If one sets H1 too large, H0 is picked for stopping as a false negative, even while the LOS of the stronger engine can be 99.6%. This is frustrating, as a few more games would give a true positive, but one has to reset H1 to a lower value and redo the test.
I am not claiming that LOS is in fact a stopping rule, but in the range of 50-50,000 games (including draws), where the Type I error is smaller than 5%, stopping at an LOS of 99.9% is only ~20% slower in stopping on a true positive or true negative, without this art of guessing which hypothesis H1 to choose. Sure, SPRT is robust in handling alpha and beta errors and in setting the epsilon interval, and should be used as such, but it will not prevent me from stopping whenever I see a 3.1 SD advantage for one engine in a 50-50,000 games match, if not using SPRT.
After reading this post, I decided that the optimal bound for reducing the number of games required is the upper bound set to twice the expected Elo of the patch, assuming that the lower bound is zero, and that H0 is highly likely to be accepted if the bound is made higher than this. This seems to fit the data you gave in the post. So I'm not so sure it's an art to pick the bounds, as you put it. I think that for the lowest number of games, multiple bounds should be tested simultaneously. If asymmetrical CIs are used, then these tests would seem sufficient to increase efficiency to near-maximum values. My understanding is that calculating the current LLR for a single set of bounds doesn't take much time compared to playing a single game, so I assume that several bounds could be calculated without adding significantly to the time used. Is there any reason you can think of that doing this would produce invalid results? I can't think of any reason that this couldn't be done.
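One way to test the "set H1 to twice the expected Elo" rule of thumb, rather than inferring it from a single run per setting, would be to simulate SPRT stopping times and average them over many runs. A rough sketch, again under an assumed fixed-draw-ratio trinomial model rather than cutechess-cli's actual model, with H0 fixed at 0 Elo as in Laskos's table:

```python
import math
import random

def sprt_games(true_elo, elo1, alpha=0.05, beta=0.05, draw_ratio=0.3,
               max_games=100000, rng=None):
    """Simulate a single SPRT run of H0: Elo = 0 vs H1: Elo = elo1,
    with games drawn from a trinomial model at the true Elo.
    Returns (verdict, games played)."""
    rng = rng or random.Random()

    def win_loss(elo):
        score = 1.0 / (1.0 + 10.0 ** (-elo / 400.0))
        return score - draw_ratio / 2.0, 1.0 - score - draw_ratio / 2.0

    w_true, l_true = win_loss(true_elo)
    w0, l0 = win_loss(0.0)
    w1, l1 = win_loss(elo1)
    upper = math.log((1.0 - beta) / alpha)   # accept H1
    lower = math.log(beta / (1.0 - alpha))   # accept H0
    llr = 0.0
    for n in range(1, max_games + 1):
        u = rng.random()
        if u < w_true:
            llr += math.log(w1 / w0)         # win
        elif u < w_true + l_true:
            llr += math.log(l1 / l0)         # loss
        # draws leave the LLR unchanged under a fixed draw ratio
        if llr >= upper:
            return 'H1', n
        if llr <= lower:
            return 'H0', n
    return 'undecided', max_games
```

Averaging the returned game counts over a few hundred runs for each candidate elo1 (say 4, 8, 16, and 25, with true_elo=8) would show both the mean stopping time and the false-negative rate for each choice, which is the data needed to judge where the optimum actually sits.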
It also seems that, given some idea of the Elo bounds of the individual patches taken as a group, the percentages of patches that fall into each of the different Elo groups could be used to tune the CIs used for each different set of Elo bounds. A kind of self-tuning.
Another idea, for those reluctant to accept asymmetrical CIs:
If a symmetrical CI is used (I'm not sure why anyone would want to do this), then it might be best to also calculate the LOS after each game, at least until the accumulation of Type I errors exceeds the given CI. If the LOS reaches 99.9% (3.1 SD), the test has reached a valid stopping point. The LOS should be calculated both ways: that program 1 (P1), the program being tested, is superior to program 2 (P2), the reference program, at an LOS of 99.9%, AND that P2 is superior to P1 at an LOS of 99.9%. The first test (P1 superior to P2) is a valid early stopping point for accepting H1, and the second test (P2 superior to P1) is a valid early stopping point for accepting H0. This would, in effect, combine the best parts of both tests, so that in the graph the efficiency line for this compound test would follow the 3 SD line until it roughly intercepted the +/- 5% line of SPRT.
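A sketch of this two-sided check, using the LOS formula from Laskos's post quoted further down. Since erf is an odd function, LOS(P2 > P1) = 1 - LOS(P1 > P2), so both directions fall out of a single computation:

```python
import math

def los(wins, losses):
    """Likelihood of superiority of P1 over P2 (draws drop out)."""
    if wins + losses == 0:
        return 0.5
    return 0.5 * (1.0 + math.erf((wins - losses) /
                                 math.sqrt(2.0 * (wins + losses))))

def los_stop(wins, losses, threshold=0.999):
    """Two-sided early stop.  Accept H1 when LOS(P1 > P2) >= threshold;
    accept H0 when LOS(P2 > P1) >= threshold, which is the same as
    LOS(P1 > P2) <= 1 - threshold.  Otherwise keep playing."""
    p = los(wins, losses)
    if p >= threshold:
        return 'accept H1'   # P1 superior at 99.9% LOS
    if p <= 1.0 - threshold:
        return 'accept H0'   # P2 superior at 99.9% LOS
    return 'continue'
```

The symmetry means only one erf evaluation per game is needed, so the per-game cost is negligible next to playing the game itself.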
So the question, for those who insist on using symmetrical CIs (I assume there will be a few holdouts): is there a simple way to calculate LOS's accumulated Type I errors, as shown in your post? By simple, I mean simple enough that it could easily be incorporated into a program using the standard math functions available in most math libraries.
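I don't know of a closed form, but the accumulated Type I error is straightforward to estimate by simulation using only standard math functions: simulate equal-strength engines (H0 true, so each decisive game is a coin flip) and count the runs that ever cross the LOS threshold within N games. A sketch along these lines; the function names and trial count are my own:

```python
import math
import random

def type1_rate(n_games, threshold=0.999, trials=2000, rng=None):
    """Estimate the accumulated Type I error of the rule 'stop when
    LOS >= threshold' over n_games decisive games, by simulating
    equal-strength engines and counting runs that ever cross it."""
    rng = rng or random.Random()

    # Invert the LOS formula once:  LOS >= threshold  is equivalent to
    # (wins - losses) >= z * sqrt(wins + losses),  z = sqrt(2)*erfinv(2t-1).
    # Bisect erf so only the standard library is needed (no erfinv).
    def erfinv(y, lo=0.0, hi=6.0):
        for _ in range(60):
            mid = (lo + hi) / 2.0
            if math.erf(mid) < y:
                lo = mid
            else:
                hi = mid
        return (lo + hi) / 2.0

    z = math.sqrt(2.0) * erfinv(2.0 * threshold - 1.0)  # ~3.09 for 99.9%
    false_positives = 0
    for _ in range(trials):
        diff = 0  # wins - losses under H0: a fair +/-1 random walk
        for n in range(1, n_games + 1):
            diff += 1 if rng.random() < 0.5 else -1
            if diff >= z * math.sqrt(n):
                false_positives += 1
                break
    return false_positives / trials
```

Running this for the N values in the quoted table should reproduce numbers of the same order (e.g. roughly 1% at N=100 for LOS=0.999), with Monte Carlo noise shrinking as the trial count grows.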
Laskos wrote: If one is stopping early a match of planned N games (not shorter than 50 games) as soon as the Likelihood Of Superiority (LOS) reaches a certain value, and is not using SPRT as a stopping rule, he should be aware that a fixed LOS steadily accumulates Type I errors with the number N of planned games. Here is a table of Type I errors with the number of games (in fact wins+losses, as LOS is independent of draws).
Code:
TYPE I ERROR
N Games   LOS=0.95   LOS=0.99   LOS=0.999
    100      27%        7.1%       1.2%
    200      38%       11.5%       1.7%
    400      49%       14.6%       2.3%
    800      56%       17.8%       2.7%
   1500      62%       21.9%       2.9%
   3000      68%       24.0%       3.6%
   5000      72%       25.8%       4.4%
  10000      76%       28.9%       4.8%
  30000      82%       33.9%       5.4%
LOS of 95% is totally useless as an early stopping rule. If one wants to have a Type I error of less than 5% for N up to ~30,000 (wins+losses) games, an LOS of 99.9% could be used as an early stopping rule. One can stop the match as soon as LOS gets to 99.9%, if the match is shorter than 50,000 games and longer than 50 games. LOS is easy to calculate as (1 + Erf[(wins - losses)/Sqrt{2*(wins + losses)}])/2. Or just use SPRT of cutechess-cli.
Anyway, I will be traveling for several days starting Thursday morning. I'm not sure how much internet access I'll have, and I will be busy most of the time, so I thought I would give you a few things to think about and test when the new model is ready. If it turns out that several statistics can be calculated and used, then it would be interesting to know what the best set of such bounds would be, given a limited number of them (i.e., 4 to 8).
Regards,
Zen
Only 2 defining forces have ever offered to die for you.....Jesus Christ and the American Soldier. One died for your soul, the other for your freedom.