Laskos wrote:
Michel wrote: Thanks!
It is well known that one should not use LOS (or p value) as an early stopping rule, but your test demonstrates that the error is much more extreme than what one would intuitively think.
Luckily there is the SPRT!
Until the SPRT issue with drawelo is fixed in cutechess-cli, I think I can rely on an LOS of 99.9%, with Type I error under 5%, for matches of 50-50,000 games (including draws) when searching for true positives. LOS has the advantage that it can be computed whenever one wants, not sequentially. Also, LOS 99.9% corresponds to 3.1 SD, so one doesn't even have to calculate LOS: using the usual 2 SD error margins, a result of 16 +/- 10 Elo is conclusive and a stop is allowed, while 14 +/- 10 Elo is not. I think SPRT saves some 10-20% of the time (the typical LOS at an SPRT stop is about 99.0-99.5%), and Type II errors can be dealt with, but one has to be careful choosing the hypotheses. Of course, once drawelo is computed from the actual games in cutechess-cli, there is no question that SPRT is better, and LOS is just a check for out-of-range cases like the one Uri describes.
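For reference, the LOS figures above can be computed directly from the win and loss counts (draws carry no information about the sign of the score difference). A minimal sketch, using the normal approximation:

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority: probability that the true score
    difference is positive, by the normal approximation.  Draws do
    not appear because they say nothing about the sign."""
    if wins + losses == 0:
        return 0.5
    z = (wins - losses) / math.sqrt(wins + losses)
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# los(2610, 2390) ≈ 0.9991 — just above the 99.9% threshold,
# i.e. a wins-losses margin of about 3.1 standard deviations.
```

This makes the 3.1 SD remark above concrete: LOS crosses 99.9% exactly when (wins - losses) exceeds 3.09 times sqrt(wins + losses).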
The issue has been fixed:
https://github.com/cutechess/cutechess/issues/6
https://github.com/cutechess/cutechess/ ... 894552b45b
What the fix does is two things:
1/ draw_elo is no longer hardcoded, but estimated from the sample (draw_elo has much more impact on the stopping rule than I initially imagined)
2/ elo0 and elo1 are expressed in usual ELO units, not in unscaled Bayes ELO units.
The idea is that the end user is not supposed to know anything about the Bayes ELO model, which is only used internally to reduce the dimension of the problem to 1 (instead of 2) so that SPRT can be applied.
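The BayesElo model in question can be sketched as follows (function names are mine, and the exact estimator used in cutechess-cli may differ; the forward/inverse relations and the drawelo-dependent scale factor follow the standard BayesElo formulas):

```python
import math

def bayeselo_to_proba(bayeselo: float, drawelo: float):
    """Forward BayesElo model: (P_win, P_loss) for a given strength
    difference and drawelo, both on the internal BayesElo scale."""
    pw = 1.0 / (1.0 + 10.0 ** ((drawelo - bayeselo) / 400.0))
    pl = 1.0 / (1.0 + 10.0 ** ((drawelo + bayeselo) / 400.0))
    return pw, pl

def proba_to_bayeselo(pw: float, pl: float):
    """Invert the model: estimate (bayeselo, drawelo) from observed
    win/loss frequencies (the draw frequency is implied)."""
    bayeselo = 200.0 * math.log10(pw / pl * (1.0 - pl) / (1.0 - pw))
    drawelo = 200.0 * math.log10((1.0 - pw) / pw * (1.0 - pl) / pl)
    return bayeselo, drawelo

def scale(drawelo: float) -> float:
    """Factor converting small BayesElo differences to ordinary Elo.
    At drawelo = 0 the scales coincide (factor 1); the factor shrinks
    as drawelo grows, which is why elo0/elo1 must be rescaled."""
    x = 10.0 ** (-drawelo / 400.0)
    return 4.0 * x / (1.0 + x) ** 2
```

The inversion is exact: feeding the probabilities from `bayeselo_to_proba` back into `proba_to_bayeselo` recovers the original pair, which is essentially how draw_elo can be estimated out of sample.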
Now the user interface presents SPRT just like the literature would:
* alpha is the max type I error of the test. This maximum is reached for elo = elo0.
* beta is the max type II error of the test, in the zone elo >= elo1. This maximum is reached for elo=elo1.
* H0: elo=elo0, H1: elo=elo1. All elos are expressed in usual units, so people don't have to figure out how to rescale them into Bayes ELO, depending on the value of draw_elo...
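Concretely, Wald's SPRT with a given (alpha, beta) amounts to two fixed bounds on a running log-likelihood ratio; a minimal sketch, where the per-game probabilities under H0 and H1 would come from the model (names are mine):

```python
import math

def sprt_bounds(alpha: float, beta: float):
    """Wald's SPRT stopping bounds: accept H0 once the LLR falls below
    the lower bound, accept H1 once it rises above the upper bound."""
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)

def sprt_llr(wins: int, draws: int, losses: int, p0, p1) -> float:
    """Log-likelihood ratio of H1 vs H0 for a trinomial (win/draw/loss)
    sample; p0 and p1 are (P_win, P_draw, P_loss) under elo0 and elo1."""
    return sum(n * math.log(q / p)
               for n, p, q in zip((wins, draws, losses), p0, p1) if n > 0)

# With alpha = beta = 0.05 the bounds are about (-2.944, +2.944),
# i.e. log(1/19) and log(19).
lower, upper = sprt_bounds(0.05, 0.05)
```

The test stops as soon as the accumulated LLR leaves the (lower, upper) band, which is what makes it sequential, in contrast to the fixed-length LOS check discussed above.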
To use this, you must either:
* download the latest source code from Ilari's repo and compile it, or
* use version 0.6, which Ilari released a while ago.
I hope this clarifies.
As for Uri's post:
* I have no idea what he is talking about, and I don't think it has to do with draw_elo.
* He has yet to explain to us by which measure he thinks his "p-value 99.9% test when win+draw>30,000" is better than SPRT.
* Until he provides some concrete figures and an apples-to-apples comparison of both tests, you should disregard his posts. He trolled the Stockfish development forum in a similar manner when he joined.
PS: The implementation has been validated by a simulator, which itself has been validated against the theory (formulas by Michel Van den Bergh).
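A simulator of that kind can be sketched in a few lines. The probabilities, seed, and trial counts below are arbitrary illustrations (not the ones used for the actual validation): games are drawn under H0 and the empirical Type I error is compared with alpha.

```python
import math
import random

def simulate_sprt(p_true, p0, p1, alpha=0.05, beta=0.05,
                  max_games=20000, rng=None):
    """Run one SPRT on games drawn from p_true = (P_win, P_draw, P_loss).
    Returns 'H1', 'H0', or 'none' (hit the game cap without a decision)."""
    rng = rng or random.Random()
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    # Per-game LLR increments for win / draw / loss.
    inc = [math.log(q / p) for p, q in zip(p0, p1)]
    llr = 0.0
    for _ in range(max_games):
        u = rng.random()
        outcome = 0 if u < p_true[0] else (1 if u < p_true[0] + p_true[1] else 2)
        llr += inc[outcome]
        if llr >= upper:
            return 'H1'
        if llr <= lower:
            return 'H0'
    return 'none'

# Type I error check: generate games under H0 and count false accepts of H1.
rng = random.Random(42)
p0, p1 = (0.30, 0.40, 0.30), (0.34, 0.40, 0.26)
trials = 2000
false_pos = sum(simulate_sprt(p0, p0, p1, rng=rng) == 'H1'
                for _ in range(trials))
print(false_pos / trials)  # should land near alpha = 0.05
```

Running the analogous check with games drawn under H1 and counting accepts of H0 would validate beta the same way.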
Theory and practice sometimes clash. And when that happens, theory loses. Every single time.