Type I error in LOS based early stopping rule


lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Type I error in LOS based early stopping rule

Post by lucasart »

Uri Blass wrote:1) I compare the SPRT(0,6,5%,5%) that the Stockfish team uses to a p-value 99.9% test in which you stop at wins+losses > 30,000 if you do not achieve this p-value earlier.
I'm sorry, I misread that in your initial post. I thought you wanted to look at the LOS condition only *after* 30,000 wins+losses. That invalidates my point about cucumbers.

The examples you give are obviously completely unrealistic. What does it even mean to test a 10,000 elo patch? Have you ever seen such a patch? If so, let me know, and I will gladly put it in DiscoCheck.

SPRT (or any well-defined stopping algorithm) is a systematic rule that avoids the bias and arbitrariness of stopping early by hand, when you feel that you have enough games (lots of people still do that, are not aware of the effects of early stopping, and just look at the p-value). But that doesn't mean you cannot use your brain either:

* If the results look something like 100-0-0, instead of committing the patch and screaming hurray, you should probably start looking for a problem in your testing framework (all games lost on time or due to buggy adjudication, or a segfault, or whatever). Fix the bug and rerun.

* If the results look like 0-100-0, you don't just reject the patch and conclude that the idea doesn't work. Instead you look for bugs in the patch: this can only be the result of a serious bug.

In real life, the patches are very close to zero most of the time. Probably the majority are slightly negative, and some are slightly positive. The Stockfish testing framework could be used to estimate, based on a large history of tested patches, the distribution of that Elo parameter. The various tests used (SPRT, fixed size, or stopped early by hand) might slightly bias the result, but nothing big. My guess is something like E(elo) = -2 Elo and stdev(elo) = 3, no more.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
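As a quick aside, that guessed prior already pins down how rare positive patches would be. A two-line check, assuming Lucas's E(elo) = -2 and stdev(elo) = 3 figures and a normal distribution (his guess, not measured data):

Code: Select all

import math

# P(elo > 0) for a normal prior with mean -2 and standard deviation 3.
mu, sigma = -2.0, 3.0
p_positive = 0.5 * (1.0 - math.erf((0.0 - mu) / (sigma * math.sqrt(2.0))))
print(p_positive)  # ~0.25, i.e. roughly a quarter of patches would be positive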
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Type I error in LOS based early stopping rule

Post by Laskos »

lucasart wrote:
Laskos wrote:
Michel wrote:Thanks!

It is well known that one should not use LOS (or p value) as an early stopping rule, but your test demonstrates that the error is much more extreme than what one would intuitively think.

Luckily there is the SPRT!
Until the SPRT issue with drawelo is fixed in cutechess-cli, I think I can rely on a LOS of 99.9%, with a Type I error under 5%, for matches of 50-50,000 games (including draws) when searching for true positives. LOS has the advantage that it can be computed whenever one wants, not sequentially. Also, LOS 99.9% means 3.1 SD, so one doesn't even have to calculate LOS, just the usual 2 SD error margins: say, a result of 16 +/- 10 Elo is conclusive and a stop is allowed, while 14 +/- 10 Elo is not. I think SPRT saves some 10-20% of the time (the typical LOS at an SPRT stop is about 99.0-99.5%), and Type II errors can be dealt with, but one has to be careful choosing the hypotheses. Sure, once drawelo is computed from actual games in cutechess-cli, there is no question that SPRT is better, and LOS is just a check for out-of-range cases like the ones Uri describes.
The issue has been fixed:
https://github.com/cutechess/cutechess/issues/6
https://github.com/cutechess/cutechess/ ... 894552b45b

What the fix does is two things:
1/ draw_elo is no longer hardcoded, but estimated from the sample (draw_elo has much more impact on the stopping rule than I initially imagined)
2/ elo0 and elo1 are expressed in usual Elo units, not in unscaled BayesElo units.

The idea is that the end user is not supposed to know anything about the BayesElo model, which is only used internally to reduce the dimension of the problem to 1 (instead of 2) so that the SPRT can be applied.

Now the user interface presents the SPRT just as the literature would:
* alpha is the max type I error of the test. This maximum is reached at elo = elo0.
* beta is the max type II error of the test in the zone elo >= elo1. This maximum is reached at elo = elo1.
* H0: elo = elo0, H1: elo = elo1. All elos are expressed in usual units, so people don't have to figure out how to rescale them into BayesElo depending on the value of draw_elo.

To use this, you must either:
* download the latest source code from Ilari's repo and compile it.
* or use version 0.6, which Ilari released a while ago.

I hope this clarifies.

As for Uri's post:
* I have no idea what he is talking about, and I don't think it has to do with draw_elo.
* He has yet to explain by which measure he thinks his "p-value 99.9% test when wins+losses > 30,000" is better than the SPRT.
* Until he provides some concrete figures and an apples-to-apples comparison of the two tests, you should disregard his posts. He trolled the Stockfish development forum in a similar manner when he joined.

PS: The implementation has been validated by a simulator, which itself has been validated against the theory (formulas by Michel Van den Bergh).
I think Uri has a point. I tested two engines 90 Elo points apart. With cutechess-cli, H0 = 0 Elo points and H1 = 5 Elo points (eng1 stronger than eng2), it needed 400 games to accept H1, with a LOS of 99.999999999999%. When H1 is set to 30 points, it needed only 100 games to accept it, with a LOS of 99.7%. When H1 is set to 60 points, it needed only 40 games to accept it, with a LOS of 98.8%. Alpha and beta were 0.1. LOS of 99.9% was quicker than the 5-Elo-point-resolution SPRT, and it seems one has to guess beforehand what to expect from the test and set the hypotheses accordingly.
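For concreteness, here is a minimal Python sketch of the computation described in the quoted post, using the BayesElo trinomial model and Wald's stopping bounds. The function names and the scale-factor derivation are a reconstruction from the published formulas, not the actual cutechess-cli code:

Code: Select all

import math

def bayeselo_to_proba(belo, drawelo):
    # Win/loss/draw probabilities under the BayesElo model.
    pwin = 1.0 / (1.0 + 10.0 ** ((-belo + drawelo) / 400.0))
    ploss = 1.0 / (1.0 + 10.0 ** ((belo + drawelo) / 400.0))
    return pwin, ploss, 1.0 - pwin - ploss

def proba_to_bayeselo(pwin, ploss):
    # Invert the model: recover (bayeselo, drawelo) from win/loss frequencies.
    belo = 200.0 * math.log10(pwin / ploss * (1.0 - ploss) / (1.0 - pwin))
    drawelo = 200.0 * math.log10((1.0 - ploss) / ploss * (1.0 - pwin) / pwin)
    return belo, drawelo

def sprt_llr(wins, losses, draws, elo0, elo1):
    # 1/ draw_elo is estimated from the sample itself (needs w, l, d > 0).
    n = wins + losses + draws
    _, drawelo = proba_to_bayeselo(wins / n, losses / n)
    # 2/ elo0, elo1 arrive in usual Elo units; convert to BayesElo with the
    # scale factor (derivative of the expected score at equal strength).
    x = 10.0 ** (-drawelo / 400.0)
    scale = 4.0 * x / (1.0 + x) ** 2
    p0 = bayeselo_to_proba(elo0 / scale, drawelo)
    p1 = bayeselo_to_proba(elo1 / scale, drawelo)
    return sum(c * math.log(b / a)
               for c, a, b in zip((wins, losses, draws), p0, p1))

def sprt_decision(wins, losses, draws, elo0, elo1, alpha=0.05, beta=0.05):
    llr = sprt_llr(wins, losses, draws, elo0, elo1)
    if llr >= math.log((1.0 - beta) / alpha):
        return 'accept H1'
    if llr <= math.log(beta / (1.0 - alpha)):
        return 'accept H0'
    return 'continue'

For example, sprt_decision(120, 80, 200, 0, 5) says whether a +120 -80 =200 score is already conclusive for SPRT(0, 5, 5%, 5%).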
lucasart
Posts: 3232
Joined: Mon May 31, 2010 1:29 pm
Full name: lucasart

Re: Type I error in LOS based early stopping rule

Post by lucasart »

Laskos wrote:
I think Uri has a point. I tested two engines 90 Elo points apart. With cutechess-cli, H0 = 0 Elo points and H1 = 5 Elo points (eng1 stronger than eng2), it needed 400 games to accept H1, with a LOS of 99.999999999999%. When H1 is set to 30 points, it needed only 100 games to accept it, with a LOS of 99.7%. When H1 is set to 60 points, it needed only 40 games to accept it, with a LOS of 98.8%. Alpha and beta were 0.1. LOS of 99.9% was quicker than the 5-Elo-point-resolution SPRT, and it seems one has to guess beforehand what to expect from the test and set the hypotheses accordingly.
These two stopping algorithms are not comparable. The max type I error of the p-value 99.9% test is much higher than 5%.

You are just comparing them on one example, which is rather trivial and is not the intended use case. Typical use cases:
* test for improvement: elo0 = 0, elo1 = epsilon
* test for non-regression: elo0 = -epsilon, elo1 = 0
* test for release, making sure we have gained at least X elo before releasing: elo0 = X, elo1 = X + epsilon

I'm going to take an extreme example "a la Uri Blass": what if the patch is worth 10^-100 elo? With SPRT(elo0=0, elo1=epsilon, alpha=beta=5%) you have at most a 5% chance of a false positive. But in this situation the p-value 99.9% test is no better than flipping a coin to decide!

The flaw in your comparison is that you are already assuming that you know the result (elo = 90). The whole problem in real-life testing of patches is that you don't. If we follow the extreme logic of Uri Blass, we should (as Joona explained) make the test even simpler: just play one game and see who wins. Of course, with stupid use cases like hypothetical 10,000 elo patches, we are going to conclude that our stopping rule makes almost no mistakes and is much faster than the SPRT. But it is a completely stupid example, because it assumes that you know the very result you are trying to measure.
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.
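Lucas's coin-flip point, and Kai's claim that the error stays under 5% up to 30,000 games, can both be checked by simulation. A rough Monte Carlo sketch (my own; it uses the normal approximation LOS = Phi((W - L) / sqrt(W + L)), peeks after every decisive game, and is slow in pure Python):

Code: Select all

import math, random

def los(wins, losses):
    # Normal approximation to the likelihood of superiority.
    if wins + losses == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def type1_rate(max_games=30000, draw_ratio=0.4, threshold=0.999, trials=1000):
    # The true elo difference is zero, so every early stop is a false positive.
    false_pos = 0
    for _ in range(trials):
        wins = losses = 0
        for _ in range(max_games):
            r = random.random()
            if r >= draw_ratio:  # a decisive game; draws leave LOS unchanged
                if r < draw_ratio + (1.0 - draw_ratio) / 2.0:
                    wins += 1
                else:
                    losses += 1
                if los(wins, losses) >= threshold:
                    false_pos += 1
                    break
    return false_pos / trials

print(type1_rate())

Raising max_games lets one watch the error rate creep upward without bound, which is exactly why LOS alone is a bad stopping rule at unbounded game counts.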
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Type I error in LOS based early stopping rule

Post by Laskos »

lucasart wrote:
These two stopping algorithms are not comparable. The max type I error of the p-value 99.9% test is much higher than 5%.

You are just comparing them on one example, which is rather trivial and is not the intended use case. Typical use cases:
* test for improvement: elo0 = 0, elo1 = epsilon
* test for non-regression: elo0 = -epsilon, elo1 = 0
* test for release, making sure we have gained at least X elo before releasing: elo0 = X, elo1 = X + epsilon

I'm going to take an extreme example "a la Uri Blass": what if the patch is worth 10^-100 elo? With SPRT(elo0=0, elo1=epsilon, alpha=beta=5%) you have at most a 5% chance of a false positive. But in this situation the p-value 99.9% test is no better than flipping a coin to decide!

The flaw in your comparison is that you are already assuming that you know the result (elo = 90). The whole problem in real-life testing of patches is that you don't. If we follow the extreme logic of Uri Blass, we should (as Joona explained) make the test even simpler: just play one game and see who wins. Of course, with stupid use cases like hypothetical 10,000 elo patches, we are going to conclude that our stopping rule makes almost no mistakes and is much faster than the SPRT. But it is a completely stupid example, because it assumes that you know the very result you are trying to measure.
LOS 99.9% has a Type I error of less than 5% for up to 30,000 games (wins+losses), but this Type I error climbs steadily with the number of games (at infinity the error is 100%, which is why LOS is generally a bad stopping rule). I am not disputing the SPRT's optimality, but let's get absurd the SPRT way: set H0 to 0 points and H1 to 0.000000001 points. It will have a hard time picking H1 if the Elo difference is 3 points.
I now tested two engines 30 points apart, with elo0 = 0, elo1 = 5, and alpha = beta = 0.05, and it picked H1 after 1600 games, again slower than 99.9% LOS.
I agree that these cases are not representative. While SPRT is sequentially optimal, a LOS of 99.9% (or 3.1 SD) can be used as a rule of thumb, glancing at the intermediate results, for early stopping in 50-50,000 games.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Type I error in LOS based early stopping rule

Post by Laskos »

Laskos wrote:
LOS 99.9% has a Type I error of less than 5% for up to 30,000 games (wins+losses), but this Type I error climbs steadily with the number of games (at infinity the error is 100%, which is why LOS is generally a bad stopping rule). I am not disputing the SPRT's optimality, but let's get absurd the SPRT way: set H0 to 0 points and H1 to 0.000000001 points. It will have a hard time picking H1 if the Elo difference is 3 points.
I now tested two engines 30 points apart, with elo0 = 0, elo1 = 5, and alpha = beta = 0.05, and it picked H1 after 1600 games, again slower than 99.9% LOS.
I agree that these cases are not representative. While SPRT is sequentially optimal, a LOS of 99.9% (or 3.1 SD) can be used as a rule of thumb, glancing at the intermediate results, for early stopping in 50-50,000 games.
For an 8-point difference, with H0: 0 points and H1: 5 points, SPRT with alpha = beta = 0.05 picked H1 faster than LOS 99.9%. So yes, by choosing the hypotheses reasonably, SPRT works very well. I will probably simulate it to check the alpha and beta margins.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Type I error in LOS based early stopping rule

Post by Laskos »

I played a bit with the SPRT of cutechess-cli and a true difference of 8 Elo points. It seems a bit of an art.

Code: Select all

H0    H1    accepted   games   LOS

 0     2       H1      15700  99.99%
 0     3       H1      12000  99.95%
 0     5       H1       6800  99.70%
 0    10       H1       5800  99.28%
 0    15       H1       2300  99.25%
 0    20       H0       8000  99.61%
 0    25       H0       2900  95.60%
H0 is always 0 points here, alpha=beta=0.05, and the true difference is 8 Elo points. For these alpha and beta, if one sets H1 too low, one plays more games than with 99.9% LOS before a stop picking H1. If one sets H1 reasonably, one saves ~20% of the time compared to LOS 99.9%. If one sets H1 too large, H0 is picked at the stop as a false negative, even while the LOS of the stronger engine can be 99.6%. This is frustrating: a few more games and the result would be a true positive, but one has to reset H1 to a lower value and redo the test.
I am not claiming that LOS is in fact a stopping rule, but in the range of 50-50,000 games (including draws), where the Type I error is smaller than 5%, LOS of 99.9% is only 20% slower in stopping on a true positive or true negative, without this art of guessing which hypothesis H1 to choose. Sure, SPRT is robust in handling the alpha and beta errors and in setting the epsilon interval, and should be used as such, but that will not prevent me from stopping whenever I see a 3.1 SD advantage for one engine, if not using SPRT, in a 50-50,000 game match.
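A back-of-the-envelope way to see why the large-H1 rows of the table flip to H0 is to look at the expected per-game LLR drift at the true elo: the SPRT random-walks toward whichever bound the drift points at. A sketch of mine, reusing bayeselo_to_proba from the earlier post and fixing drawelo at an assumed 200 instead of estimating it from the games:

Code: Select all

import math

def bayeselo_to_proba(belo, drawelo):
    pwin = 1.0 / (1.0 + 10.0 ** ((-belo + drawelo) / 400.0))
    ploss = 1.0 / (1.0 + 10.0 ** ((belo + drawelo) / 400.0))
    return pwin, ploss, 1.0 - pwin - ploss

def llr_drift(true_elo, elo0, elo1, drawelo=200.0):
    # Expected per-game LLR increment at the true elo difference.
    # Positive -> the walk drifts toward accepting H1, negative -> H0.
    x = 10.0 ** (-drawelo / 400.0)
    scale = 4.0 * x / (1.0 + x) ** 2
    pt = bayeselo_to_proba(true_elo / scale, drawelo)
    p0 = bayeselo_to_proba(elo0 / scale, drawelo)
    p1 = bayeselo_to_proba(elo1 / scale, drawelo)
    return sum(t * math.log(b / a) for t, a, b in zip(pt, p0, p1))

for elo1 in (2, 3, 5, 10, 15, 20, 25):
    print(elo1, round(llr_drift(8.0, 0.0, elo1), 6))

With a true difference of 8, the drift changes sign near elo1 = 16 (roughly twice the true value), which is why the H1 = 20 and 25 rows end in a false negative; a very small elo1 also gives a tiny drift and hence a very slow test.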
Robert Pope
Posts: 558
Joined: Sat Mar 25, 2006 8:27 pm

Re: Type I error in LOS based early stopping rule

Post by Robert Pope »

What is SPRT? I haven't heard that abbreviation before.
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Type I error in LOS based early stopping rule

Post by Adam Hair »

Robert Pope wrote:What is SPRT? I haven't heard that abbreviation before.
Sequential Probability Ratio Test

It allows the formulation of a stopping rule that ensures the probabilities of false positives and false negatives are held to predetermined values. It usually, but not always, determines whether a change to a chess engine is an improvement (or not) in fewer games than fixed-length testing. Lucas Braesch and Michel van den Bergh studied its application to engine testing, and it is most prominently used by Stockfish developers and testers.
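In its generic form, Wald's test needs only a few lines of code: accumulate the log-likelihood ratio of the observations under the two hypotheses and stop at precomputed bounds. An illustrative sketch (not any particular engine-testing implementation), with a Bernoulli score standing in for game results:

Code: Select all

import math, random

def sprt(samples, logpdf0, logpdf1, alpha=0.05, beta=0.05):
    # Wald's Sequential Probability Ratio Test.
    upper = math.log((1.0 - beta) / alpha)  # crossing it accepts H1
    lower = math.log(beta / (1.0 - alpha))  # crossing it accepts H0
    llr, n = 0.0, 0
    for x in samples:
        n += 1
        llr += logpdf1(x) - logpdf0(x)
        if llr >= upper:
            return 'H1', n
        if llr <= lower:
            return 'H0', n
    return 'inconclusive', n

# Example: H0: win rate 0.50 vs H1: win rate 0.52, data drawn at 0.52.
def logpmf(p):
    return lambda x: math.log(p if x == 1 else 1.0 - p)

stream = (1 if random.random() < 0.52 else 0 for _ in range(10 ** 6))
print(sprt(stream, logpmf(0.50), logpmf(0.52)))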
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Type I error in LOS based early stopping rule

Post by Michel »

@Kai,

(A) Testing time is dominated by the time it takes to validate/reject small-elo patches. There is no point in trying to optimize the case of large elo differences. Such cases are (1) rare, (2) quickly handled by whatever sequential testing method one uses, and (3) ones where one is generally happy to spend extra time anyway. So, as Lucas was describing: no need to optimize for the occasional 10,000 elo patch!

(B) The SPRT is optimal for binary hypotheses (i.e. elo=elo0 or elo=elo1 and nothing else). It is not optimal if one wants to minimize expected running time over all elo differences (this is the Kiefer-Weiss problem). Asymptotically optimal solutions to the Kiefer-Weiss problem are known (e.g. the 2-SPRT) but it is quite difficult to design such tests with prescribed alpha and beta. This is in contrast to the SPRT for which it is trivial. And the saving of the 2-SPRT over the SPRT in the worst case scenario (elo=(elo0+elo1)/2 when alpha=beta) is not all that big either.

[[ I have been simplifying a bit. The asymptotic optimality of the 2-SPRT has been proved under hypotheses which are strictly speaking not valid for the BayesElo model, but it is close enough. ]]
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Type I error in LOS based early stopping rule

Post by Laskos »

Michel wrote:@Kai,

(A) Testing time is dominated by the time it takes to validate/reject small-elo patches. There is no point in trying to optimize the case of large elo differences. Such cases are (1) rare, (2) quickly handled by whatever sequential testing method one uses, and (3) ones where one is generally happy to spend extra time anyway. So, as Lucas was describing: no need to optimize for the occasional 10,000 elo patch!
I was not talking about those absurd exaggerations. But a simple test with elo0 = -1, elo1 = 3, and a true value of elo = 8 needs as many games as, or more than, LOS 99.9% with the same Type I error.

(B) The SPRT is optimal for binary hypotheses (i.e. elo=elo0 or elo=elo1 and nothing else). It is not optimal if one wants to minimize expected running time over all elo differences (this is the Kiefer-Weiss problem). Asymptotically optimal solutions to the Kiefer-Weiss problem are known (e.g. the 2-SPRT) but it is quite difficult to design such tests with prescribed alpha and beta. This is in contrast to the SPRT for which it is trivial. And the saving of the 2-SPRT over the SPRT in the worst case scenario (elo=(elo0+elo1)/2 when alpha=beta) is not all that big either.

[[ I have been simplifying a bit. The asymptotic optimality of the 2-SPRT has been proved under hypotheses which are strictly speaking not valid for the BayesElo model, but it is close enough. ]]
Is this performing a pair of one-sided SPRTs simultaneously, one for acceptance/rejection of H0 and another for acceptance/rejection of H1? There seem to be two cases where the SPRT is sub-optimal: elo = (elo0+elo1)/2, and elo >> elo1 >~ elo0.