It's obviously dangerous to keep testing something until it finally gets the result you want. But I think a very general answer to your question is that if this concerns people rejecting a potentially good patch, then you must arrange the test to be more stringent, while asking whether you are willing to accept more regressions. If you want to have your cake and eat it too, it will require a whole lot more games, obviously. Of course, re-running a test is a bit dangerous: sooner or later they will get it accepted, if that is what they want.

Ajedrecista wrote:
Hello:
I see that some patches fail SPRT with a score > 50% from time to time in the SF testing framework, and some people there want to give these patches an additional try. My question is: how often does this scenario occur? I wrote my own SPRT simulator, inspired by this post by Lucas. Here are my results for stages I and II:
The probabilities of passing these two stages are:

Code:
Stage I --> SPRT (-1.5, 4.5): alpha = beta = 0.05 (5%); 'a priori' drawelo = 240. 10000 simulations each time.

Bayeselo  Passes  Fails  <Games>  Fails with score > 50%  Fails with score = 50%
   0.5     2702    7298   26316            1493                    27
   1       3782    6218   28037            1423                    25
   1.5     4995    5005   28599            1211                    21
   2       6196    3804   27908             850                    13
   2.5     7249    2751   26427             552                    12
   3       8127    1873   24447             345                     7
=============================================================================================================
Stage II --> SPRT (0, 6): alpha = beta = 0.05 (5%); 'a priori' drawelo = 270. 10000 simulations each time.

Bayeselo  Passes  Fails  <Games>  Fails with score > 50%  Fails with score = 50%
   0.5      791    9209   20872            3477                    43
   1       1190    8810   23704            3830                    34
   1.5     1808    8192   26478            3950                    32
   2       2737    7263   28387            3733                    26
   2.5     3796    6204   30394            3358                    29
   3       5008    4992   30719            2741                    23
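For reference, a simulator along these lines can be sketched in a few lines of Python. This is a minimal sketch of my own, not Ajedrecista's actual code: it assumes per-game outcomes follow the BayesElo model with the fixed 'a priori' drawelo, and accumulates the per-game log-likelihood ratio between the elo0 and elo1 hypotheses until an SPRT bound is crossed.

```python
# Monte Carlo sketch of one SPRT stage under the BayesElo draw model.
# Assumption (not from the original post): the same 'a priori' drawelo is
# used for the true strength and for both hypotheses.
import math
import random

def bayeselo_probs(elo, drawelo):
    """Win/draw/loss probabilities under the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt_run(true_elo, elo0, elo1, drawelo, alpha=0.05, beta=0.05, rng=random):
    """Play one SPRT to completion; return (passed, games_played)."""
    lower = math.log(beta / (1.0 - alpha))   # fail bound
    upper = math.log((1.0 - beta) / alpha)   # pass bound
    p_true = bayeselo_probs(true_elo, drawelo)
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    # Per-outcome LLR increments for win/draw/loss.
    inc = [math.log(p1[i] / p0[i]) for i in range(3)]
    llr, games = 0.0, 0
    while lower < llr < upper:
        u = rng.random()
        outcome = 0 if u < p_true[0] else (1 if u < p_true[0] + p_true[1] else 2)
        llr += inc[outcome]
        games += 1
    return llr >= upper, games

# Example: Stage I, SPRT(-1.5, 4.5), drawelo = 240, true BayesElo = 3.
rng = random.Random(42)
results = [sprt_run(3.0, -1.5, 4.5, 240.0, rng=rng) for _ in range(200)]
passes = sum(1 for p, _ in results if p)
avg_games = sum(g for _, g in results) / len(results)
# Pass rate should land in the neighbourhood of the table's 0.8127,
# with <games> around 24000-25000.
print(passes / len(results), avg_games)
```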
If a patch passes Stage I, it then plays Stage II. The probability of passing both stages is the product of the two pass probabilities:

Code:
Bayeselo = 0.5: P ~ 0.2702*0.0791 ~ 2.14%
Bayeselo = 1:   P ~ 0.3782*0.1190 ~ 4.50%
Bayeselo = 1.5: P ~ 0.4995*0.1808 ~ 9.03%
Bayeselo = 2:   P ~ 0.6196*0.2737 ~ 16.96%
Bayeselo = 2.5: P ~ 0.7249*0.3796 ~ 27.52%
Bayeselo = 3:   P ~ 0.8127*0.5008 ~ 40.70%

The average number of games until the SPRT stops (counting a game at Stage II as about four times longer than a game at Stage I):

Code:
Bayeselo = 0.5: <games> ~ 26316 + 4*20872 = 109804 @ 15+0.05 ~ 27451 @ 60+0.05
Bayeselo = 1:   <games> ~ 28037 + 4*23704 = 122853 @ 15+0.05 ~ 30713 @ 60+0.05
Bayeselo = 1.5: <games> ~ 28599 + 4*26478 = 134511 @ 15+0.05 ~ 33628 @ 60+0.05
Bayeselo = 2:   <games> ~ 27908 + 4*28387 = 141456 @ 15+0.05 ~ 35364 @ 60+0.05
Bayeselo = 2.5: <games> ~ 26427 + 4*30394 = 148003 @ 15+0.05 ~ 37001 @ 60+0.05
Bayeselo = 3:   <games> ~ 24447 + 4*30719 = 147323 @ 15+0.05 ~ 36831 @ 60+0.05

I hope that someone can confirm my results. Needless to say, my simulator is far from perfect.
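As a quick sanity check on the arithmetic above, a short script (with the per-stage pass rates and average game counts copied from the simulated tables) reproduces the combined probabilities and game counts:

```python
# Combined-pass probabilities and game counts from the per-stage numbers.
# Each entry: bayeselo -> (pass probability, average games) per stage.
stage1 = {0.5: (0.2702, 26316), 1.0: (0.3782, 28037), 1.5: (0.4995, 28599),
          2.0: (0.6196, 27908), 2.5: (0.7249, 26427), 3.0: (0.8127, 24447)}
stage2 = {0.5: (0.0791, 20872), 1.0: (0.1190, 23704), 1.5: (0.1808, 26478),
          2.0: (0.2737, 28387), 2.5: (0.3796, 30394), 3.0: (0.5008, 30719)}

for elo in sorted(stage1):
    p1, g1 = stage1[elo]
    p2, g2 = stage2[elo]
    p_both = p1 * p2        # must pass both stages
    games = g1 + 4 * g2     # a Stage II game counts ~4x (longer time control)
    print(f"Bayeselo = {elo}: P ~ {100 * p_both:.2f}%, "
          f"<games> ~ {games} @ 15+0.05 ~ {round(games / 4)} @ 60+0.05")
```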
Regards from Spain.
Ajedrecista.
I think tinkering with alpha, beta and the two Elo bounds is always going to be a matter of disagreement or opinion.
A more interesting question is the two- or three-phase approach, where a change must pass more than one test. I think that is a really great idea. In the testing framework I'm working on (inspired by what Stockfish uses), the first test will be a very fast one that is easy to pass even if the change is somewhat negative Elo-wise; its only purpose is to quickly dismiss obviously bad changes. That first pass will run at very fast time controls.
I don't know what the numbers are for requiring more than one test to pass, but I think each test should be less strict (obviously) if you do that, or else very little would pass. It's not easy to analyze, because we are assuming the first (or perhaps "early") passes are not as relevant. We have been burned too many times by non-scalable changes that tested very well at fast time controls but represented a regression at any "reasonable" time control.
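To put a rough number on that intuition: if the stages were independent (a strong assumption, since the same patch is tested in both), pass probabilities multiply, so a multi-stage scheme cuts the false-positive rate roughly quadratically, but it also multiplies down the acceptance rate of genuinely good patches. The per-stage rates below are purely illustrative, not measurements:

```python
# Purely illustrative: with independent stages, pass probabilities multiply.
def combined_pass(*stage_rates):
    p = 1.0
    for r in stage_rates:
        p *= r
    return p

# Hypothetical per-stage rates: a neutral patch passes each stage with
# probability ~alpha; a clearly good patch with much higher probability.
neutral = combined_pass(0.05, 0.05)  # false positives drop to ~alpha^2
good = combined_pass(0.81, 0.50)     # but good patches must survive both too
print(f"neutral patch passes both stages: {neutral:.4f}")
print(f"good patch passes both stages:    {good:.4f}")
```

This is why, as noted above, each individual test has to be made less strict when several of them must all pass.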