It's obviously dangerous to keep testing something until it finally gets the result you want. But I think a very general answer to your question is that if this concerns people rejecting a potentially good patch, then you must arrange the test to be more stringent, while asking whether you are willing to accept more regressions. If you want to have your cake and eat it too, it will require a whole lot more games, obviously. Of course, re-running a test is a bit dangerous: sooner or later they will get it accepted, if that is what they want.

Ajedrecista wrote:
Hello:
I see that some patches fail SPRT with a score > 50% from time to time in the SF testing framework, and some people there want to give these patches an additional try. My question is: how often does this scenario occur? I wrote my own SPRT simulator, inspired by this post by Lucas. Here are my results for stages I and II:
The probabilities of passing these two stages are:

Code:
Stage I --> SPRT (-1.5, 4.5): alpha = beta = 0.05 (5%); 'a priori' drawelo = 240. 10000 simulations each time.

Bayeselo  Passes  Fails  <Games>  Fails with score > 50%  Fails with score = 50%
   0.5     2702    7298   26316            1493                    27
   1       3782    6218   28037            1423                    25
   1.5     4995    5005   28599            1211                    21
   2       6196    3804   27908             850                    13
   2.5     7249    2751   26427             552                    12
   3       8127    1873   24447             345                     7
=============================================================================================================
Stage II --> SPRT (0, 6): alpha = beta = 0.05 (5%); 'a priori' drawelo = 270. 10000 simulations each time.

Bayeselo  Passes  Fails  <Games>  Fails with score > 50%  Fails with score = 50%
   0.5      791    9209   20872            3477                    43
   1       1190    8810   23704            3830                    34
   1.5     1808    8192   26478            3950                    32
   2       2737    7263   28387            3733                    26
   2.5     3796    6204   30394            3358                    29
   3       5008    4992   30719            2741                    23
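For reference, a simulator along these lines can be sketched in a few lines of Python. This is a minimal sketch of my own, not Ajedrecista's actual code: it assumes per-game outcomes follow the BayesElo model with the fixed 'a priori' drawelo, and accumulates the per-game log-likelihood ratio between the elo0 and elo1 hypotheses until an SPRT bound is crossed.

```python
# Monte Carlo sketch of one SPRT stage under the BayesElo draw model.
# Assumption (not from the original post): the same 'a priori' drawelo is
# used for the true strength and for both hypotheses.
import math
import random

def bayeselo_probs(elo, drawelo):
    """Win/draw/loss probabilities under the BayesElo model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-elo + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((elo + drawelo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def sprt_run(true_elo, elo0, elo1, drawelo, alpha=0.05, beta=0.05, rng=random):
    """Play one SPRT to completion; return (passed, games_played)."""
    lower = math.log(beta / (1.0 - alpha))   # fail bound
    upper = math.log((1.0 - beta) / alpha)   # pass bound
    p_true = bayeselo_probs(true_elo, drawelo)
    p0 = bayeselo_probs(elo0, drawelo)
    p1 = bayeselo_probs(elo1, drawelo)
    # Per-outcome LLR increments for win/draw/loss.
    inc = [math.log(p1[i] / p0[i]) for i in range(3)]
    llr, games = 0.0, 0
    while lower < llr < upper:
        u = rng.random()
        outcome = 0 if u < p_true[0] else (1 if u < p_true[0] + p_true[1] else 2)
        llr += inc[outcome]
        games += 1
    return llr >= upper, games

# Example: Stage I, SPRT(-1.5, 4.5), drawelo = 240, true BayesElo = 3.
rng = random.Random(42)
results = [sprt_run(3.0, -1.5, 4.5, 240.0, rng=rng) for _ in range(200)]
passes = sum(1 for p, _ in results if p)
avg_games = sum(g for _, g in results) / len(results)
# Pass rate should land in the neighbourhood of the table's 0.8127,
# with <games> around 24000-25000.
print(passes / len(results), avg_games)
```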
If a patch passes Stage I, it then plays Stage II. The probability of passing both stages is the product of the two pass probabilities:

Code:
Bayeselo = 0.5: P ~ 0.2702*0.0791 ~ 2.14%
Bayeselo = 1:   P ~ 0.3782*0.1190 ~ 4.50%
Bayeselo = 1.5: P ~ 0.4995*0.1808 ~ 9.03%
Bayeselo = 2:   P ~ 0.6196*0.2737 ~ 16.96%
Bayeselo = 2.5: P ~ 0.7249*0.3796 ~ 27.52%
Bayeselo = 3:   P ~ 0.8127*0.5008 ~ 40.70%

The average number of games until the SPRT stops (counting a game at Stage II as about four times longer than a game at Stage I):

Code:
Bayeselo = 0.5: <games> ~ 26316 + 4*20872 = 109804 @ 15+0.05 ~ 27451 @ 60+0.05
Bayeselo = 1:   <games> ~ 28037 + 4*23704 = 122853 @ 15+0.05 ~ 30713 @ 60+0.05
Bayeselo = 1.5: <games> ~ 28599 + 4*26478 = 134511 @ 15+0.05 ~ 33628 @ 60+0.05
Bayeselo = 2:   <games> ~ 27908 + 4*28387 = 141456 @ 15+0.05 ~ 35364 @ 60+0.05
Bayeselo = 2.5: <games> ~ 26427 + 4*30394 = 148003 @ 15+0.05 ~ 37001 @ 60+0.05
Bayeselo = 3:   <games> ~ 24447 + 4*30719 = 147323 @ 15+0.05 ~ 36831 @ 60+0.05

I hope that someone can confirm my results. Needless to say, my simulator is far from perfect.
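As a quick sanity check on the arithmetic above, a short script (with the per-stage pass rates and average game counts copied from the simulated tables) reproduces the combined probabilities and game counts:

```python
# Combined-pass probabilities and game counts from the per-stage numbers.
# Each entry: bayeselo -> (pass probability, average games) per stage.
stage1 = {0.5: (0.2702, 26316), 1.0: (0.3782, 28037), 1.5: (0.4995, 28599),
          2.0: (0.6196, 27908), 2.5: (0.7249, 26427), 3.0: (0.8127, 24447)}
stage2 = {0.5: (0.0791, 20872), 1.0: (0.1190, 23704), 1.5: (0.1808, 26478),
          2.0: (0.2737, 28387), 2.5: (0.3796, 30394), 3.0: (0.5008, 30719)}

for elo in sorted(stage1):
    p1, g1 = stage1[elo]
    p2, g2 = stage2[elo]
    p_both = p1 * p2        # must pass both stages
    games = g1 + 4 * g2     # a Stage II game counts ~4x (longer time control)
    print(f"Bayeselo = {elo}: P ~ {100 * p_both:.2f}%, "
          f"<games> ~ {games} @ 15+0.05 ~ {round(games / 4)} @ 60+0.05")
```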
Regards from Spain.
Ajedrecista.
I think tinkering with alpha, beta and the two Elo bounds is always going to be a matter of disagreement or opinion.
A more interesting question is the two- or three-phase approach, where a change must pass more than one test. I think that is a really great idea. In the testing framework I'm working on (inspired by what Stockfish uses), the first test will be a very fast one that is easy to pass even if the change is somewhat negative Elo-wise; its only purpose is to quickly dismiss obviously bad changes. That first pass will run at very fast time controls.
I don't know what the numbers are for requiring more than one test to pass, but I think each test should be less strict (obviously) if you do that, or else very little would pass. It's not easy to analyze, because we are assuming the first (or perhaps "early") passes are not as relevant. We have been burned too many times by non-scalable changes that tested very well at fast time controls but represented a regression at any "reasonable" time control.
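To put a rough number on that intuition: if the stages were independent (a strong assumption, since the same patch is tested in both), pass probabilities multiply, so a multi-stage scheme cuts the false-positive rate roughly quadratically, but it also multiplies down the acceptance rate of genuinely good patches. The per-stage rates below are purely illustrative, not measurements:

```python
# Purely illustrative: with independent stages, pass probabilities multiply.
def combined_pass(*stage_rates):
    p = 1.0
    for r in stage_rates:
        p *= r
    return p

# Hypothetical per-stage rates: a neutral patch passes each stage with
# probability ~alpha; a clearly good patch with much higher probability.
neutral = combined_pass(0.05, 0.05)  # false positives drop to ~alpha^2
good = combined_pass(0.81, 0.50)     # but good patches must survive both too
print(f"neutral patch passes both stages: {neutral:.4f}")
print(f"good patch passes both stages:    {good:.4f}")
```

This is why, as noted above, each individual test has to be made less strict when several of them must all pass.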