I'm sorry, I misread that in your initial post. I thought you wanted to look at the LOS condition only *after* 30,000 wins+losses. That invalidates my point about cucumbers.

Uri Blass wrote:
> 1) I compare SPRT(0,6,5%,5%), which the Stockfish team uses, to a
> p-value 99.9% test, where you stop when wins+losses > 30,000 if you do not achieve this p-value.
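For concreteness, the fixed-length test in the quote can be evaluated with the usual normal approximation of the LOS (likelihood of superiority). The function below is a minimal sketch of that rule, nothing more; note that draws cancel out of the formula, which is why only wins+losses matter:

```python
import math

def los(wins, losses):
    """Likelihood of superiority under the usual normal approximation:
    LOS = Phi((wins - losses) / sqrt(wins + losses)).  Draws cancel out."""
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * (wins + losses))))

# The quoted rule: stop at wins+losses > 30,000 and accept iff LOS >= 99.9%.
def fixed_length_accept(wins, losses, threshold=0.999):
    return los(wins, losses) >= threshold
```

At 30,000 decisive games, this threshold requires a wins-losses margin of roughly 600 games, i.e. about a +1 point of score percentage.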
The examples you give are obviously completely unrealistic. What does it even mean to test a 10,000 Elo patch? Have you ever seen such a patch? If so, let me know, and I will gladly put it in DiscoCheck.
SPRT (or any well-defined stopping algorithm) is a systematic rule that avoids the bias and arbitrariness of stopping early by hand, when you feel that you have enough games (lots of people still do that, unaware of the effects of early stopping, and just look at the p-value). But that doesn't mean you cannot use your brain either:
* If the results look something like 100-0-0, instead of committing the patch and screaming hurray, you should probably start looking for a problem in your testing framework (all games lost on time, buggy adjudication, segfaults, or whatever). Fix the bug and rerun.
* If the results look like 0-100-0, you don't just reject the patch and conclude that the idea doesn't work. Instead, you look for bugs in the patch: such a result can only come from a serious bug.
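For the record, the stopping rule itself is just a Wald SPRT on the game results. The sketch below is a deliberately simplified model: it ignores draws and treats each decisive game as a Bernoulli trial with logistic score expectancy 1/(1+10^(-elo/400)); the real fishtest implementation uses a more elaborate (BayesElo-based) likelihood. Parameters match SPRT(0,6,5%,5%):

```python
import math

def expected_score(elo):
    """Logistic expected score for an Elo advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_state(wins, losses, elo0=0.0, elo1=6.0, alpha=0.05, beta=0.05):
    """Wald SPRT on decisive games only (draws ignored -- a simplification).
    Returns 'accept H1', 'accept H0' or 'continue'."""
    p0, p1 = expected_score(elo0), expected_score(elo1)
    llr = wins * math.log(p1 / p0) + losses * math.log((1 - p1) / (1 - p0))
    lower = math.log(beta / (1 - alpha))   # ~ -2.944 for 5%/5%
    upper = math.log((1 - beta) / alpha)   # ~ +2.944 for 5%/5%
    if llr >= upper:
        return "accept H1"
    if llr <= lower:
        return "accept H0"
    return "continue"
```

The point is that the test either crosses a bound or keeps playing games; there is no "I feel I have enough games now" step for the human to inject bias into.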
In real life, patches are very close to zero Elo most of the time. Probably the majority are slightly negative, and some are slightly positive. The Stockfish testing framework could be used to estimate, from a large history of tested patches, the distribution of that Elo parameter. The various stopping rules used (SPRT, fixed size, or early stopping by hand) might slightly bias the result, but nothing big. My guess is something like E(elo) = -2 and stdev(elo) = 3, no more.
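To put a number on that guess (and these are purely illustrative values, not measured data): a Normal(-2, 3) prior on patch quality already implies that only about a quarter of submitted patches are genuinely positive:

```python
import math

def fraction_positive(mu=-2.0, sigma=3.0):
    """P(elo > 0) for a hypothetical Normal(mu, sigma) prior on patch quality.
    The defaults -2/3 are the guess from the text, not measured data."""
    return 0.5 * math.erfc(-mu / (sigma * math.sqrt(2.0)))
```

With mu = -2 and sigma = 3 this comes out to roughly 25%, which matches the intuition that most patches are slightly negative and a minority are slightly positive.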