Stockfish: no progress in two and a half months, why?

JJJ · Post by **JJJ** » Mon Aug 28, 2017 2:15 pm

I'll ask here. I'd like to know why, with so many green patches, the regression test indicates no progress. Last time it was about +27 Elo and now it seems to be +29 at best. Only 2 Elo from how many green patches? Do you think the way Stockfish is tested does not work anymore?

Just asking, not here to offend the Stockfish team.

MikeB · Post by **MikeB** » Wed Aug 30, 2017 5:21 am

JJJ wrote:I'll ask here. I'd like to know why, with so many green patches, the regression test indicates no progress. Last time it was about +27 Elo and now it seems to be +29 at best. Only 2 Elo from how many green patches? Do you think the way Stockfish is tested does not work anymore?

Just asking, not here to offend the Stockfish team.

Not speaking for the SF team, but for chess programs in general it's never a straight line up. Self-play testing tends to exaggerate the benefit of a patch. Not all patches are tested: some are deemed non-functional, although I have seen the benchmark node count change on "non-functional" patches, and simple logic will tell anyone that if the benchmark changes, it is not a non-functional patch. And some simplification patches cost Elo.

With that said, this is how the process works for any engine. You need simplification patches and sometimes a total rewrite, because the path you're on is only going to give you minimal Elo gains from here on out. Every engine goes through this, and that is most likely the primary reason why solo authors eventually burn out rather quickly (10 to 15 years is considered long). Authors who keep at it for more than 20 to 25 consecutive years without a break are rare birds indeed (Dart, Hyatt and a few others).

Dirt · Post by **Dirt** » Wed Aug 30, 2017 6:06 am

JJJ wrote:I'll ask here. I'd like to know why, with so many green patches, the regression test indicates no progress. Last time it was about +27 Elo and now it seems to be +29 at best. Only 2 Elo from how many green patches? Do you think the way Stockfish is tested does not work anymore?

Just asking, not here to offend the Stockfish team.

Well, Marco is busy making tablebases not work. That can't help.

shrapnel · Post by **shrapnel** » Wed Aug 30, 2017 6:24 am

Stockfish has solved Chess.
Maybe we should all take up Chinese Checkers or something....

Uri Blass · Post by **Uri Blass** » Wed Aug 30, 2017 6:26 am

Most improvements fail at long time control, and some of the improvements that pass may be lucky runs (there is a 5% probability that a 0 Elo change is going to pass).

I think that the way the Stockfish team works is not scientific enough to determine the reasons.

The correct way would be to test every patch that they accept, including non-functional patches, with 40,000 games at LTC against the previous version. (Note that for functional patches I do not suggest dropping SPRT, but running an additional test with a fixed number of games for them as well.)

In this way we can get a better estimate of the value of every patch in Elo terms, and we can really see whether self-play exaggerates the benefit of patches.

Note that I do not believe the problem is that simplifications lose Elo, considering that most simplifications tested at LTC also pass with a score above 50%.

My guess is that simplifications give a positive Elo improvement, and that the problem is that some patches that people consider non-functional, and do not even test, lose Elo.

Note that having the same bench is not proof that a patch is non-functional, not only because bench is based on a small number of positions, but also because bench is not the same as playing a game, where you use the hash in the next move for analysis.

whereagles · Post by **whereagles** » Wed Aug 30, 2017 10:26 am

No progress, because it gets harder and harder as you get better.

Michel · Post by **Michel** » Wed Aug 30, 2017 10:46 am

Uri Blass wrote:Most improvements fail at long time control, and some of the improvements that pass may be lucky runs (there is a 5% probability that a 0 Elo change is going to pass).

It is much smaller: namely 0.05^2=0.25%.

Probably the majority of the patches that pass STC are lucky runs these days (this will happen for 1 neutral patch in 20). However most of those lucky runs will be caught by the LTC test. This creates somehow the perception that the STC test is not a good predictor for the LTC test, leading people to make misguided calls for increasing the STC TC.
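The arithmetic above can be sketched in a few lines of Python. This is only an illustration: it assumes the STC and LTC tests are independent and that each passes a truly neutral patch 5% of the time, as stated in the thread. The `good_fraction` and `power` inputs in the second function are hypothetical numbers chosen for the example, not measured fishtest statistics.

```python
def both_stages_pass(alpha_stc: float = 0.05, alpha_ltc: float = 0.05) -> float:
    """Probability that a truly neutral patch passes two independent tests."""
    return alpha_stc * alpha_ltc


def lucky_share_of_passes(good_fraction: float, power: float,
                          alpha: float = 0.05) -> float:
    """Share of passing patches that are actually neutral (Bayes' rule).

    good_fraction: assumed fraction of submitted patches that are real gains.
    power: assumed probability that a real gain passes the test.
    alpha: probability that a neutral patch passes the test.
    """
    lucky = (1.0 - good_fraction) * alpha
    real = good_fraction * power
    return lucky / (lucky + real)


if __name__ == "__main__":
    print(f"neutral patch passes STC and LTC: {both_stages_pass():.2%}")
    # Hypothetical inputs: 3% of submissions are real gains, caught 80% of the time.
    print(f"lucky share of STC passes: {lucky_share_of_passes(0.03, 0.80):.1%}")
```

With those hypothetical inputs, more than half of the STC passes are lucky runs, which is the situation described above: when real gains are rare, even a 5% false-positive rate dominates the set of passing patches.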

Uri Blass wrote:I think that the way the Stockfish team works is not scientific enough to determine the reasons.

The correct way would be to test every patch that they accept, including non-functional patches, with 40,000 games at LTC against the previous version. (Note that for functional patches I do not suggest dropping SPRT, but running an additional test with a fixed number of games for them as well.)

In this way we can get a better estimate of the value of every patch in Elo terms, and we can really see whether self-play exaggerates the benefit of patches.

I agree that this would be an interesting experiment. But recall that these are 1-2 Elo patches. If you really want to evaluate the value of such patches in a statistically sound way, 40,000 games is far from enough. There is a huge difference between a procedure that evaluates patches correctly on average and a procedure that evaluates every individual patch correctly.
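To put a rough number on "far from enough", here is a minimal sketch of the 95% error margin of a fixed-length match between two nearly equal engines. It uses the standard logistic Elo formula linearised around a 50% score and a normal approximation. The draw ratio of 0.6 is an assumption made for illustration; the thread does not give one.

```python
import math

# Slope of the Elo curve at a 50% score: d(Elo)/d(score) = 1600 / ln(10).
ELO_PER_SCORE = 1600.0 / math.log(10)


def elo_margin_95(games: int, draw_ratio: float) -> float:
    """Approximate 95% error margin, in Elo, for near-equal opponents."""
    # Per-game score variance when wins and losses are equally likely.
    variance = (1.0 - draw_ratio) / 4.0
    score_se = math.sqrt(variance / games)
    return 1.96 * score_se * ELO_PER_SCORE


def games_needed(margin_elo: float, draw_ratio: float) -> int:
    """Games needed to shrink the 95% margin to margin_elo."""
    variance = (1.0 - draw_ratio) / 4.0
    score_se = margin_elo / (1.96 * ELO_PER_SCORE)
    return math.ceil(variance / score_se ** 2)


if __name__ == "__main__":
    # Assumed draw ratio of 0.6; adjust to match the actual test conditions.
    print(f"margin at 40,000 games: +/-{elo_margin_95(40_000, 0.6):.2f} Elo")
    print(f"games for +/-0.5 Elo: {games_needed(0.5, 0.6):,}")
```

Under that assumption, 40,000 games give a margin of roughly plus or minus 2 Elo, which is as large as the patches being measured. Resolving an individual 1-2 Elo patch to within half an Elo would take several hundred thousand games.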

Uri Blass wrote:Note that I do not believe the problem is that simplifications lose Elo, considering that most simplifications tested at LTC also pass with a score above 50%.

My guess is that simplifications give a positive Elo improvement, and that the problem is that some patches that people consider non-functional, and do not even test, lose Elo.

Note that having the same bench is not proof that a patch is non-functional, not only because bench is based on a small number of positions, but also because bench is not the same as playing a game, where you use the hash in the next move for analysis.

mcostalba · Post by **mcostalba** » Wed Aug 30, 2017 11:53 am

Michel wrote: Probably the majority of the patches that pass STC are lucky runs these days (this will happen for 1 neutral patch in 20). However most of those lucky runs will be caught by the LTC test. This creates somehow the perception that the STC test is not a good predictor for the LTC test, leading people to make misguided calls for increasing the STC TC.

This is a comment that makes sense (a novelty in this thread).

In these 2 months there has been a huge number of tests and attempts by many people, no fewer than in the past, and for me this is the most important point. It means developer interest in SF is still high.

Also, finding good patches is a statistical process: sometimes you find 3 in a row, sometimes you fish for months for nothing....

It is still too early to tell whether we have reached a plateau with the current development model or whether it is just a temporary glitch.

Uri Blass · Post by **Uri Blass** » Wed Aug 30, 2017 1:03 pm

Michel wrote:
It is much smaller: namely 0.05^2=0.25%.

Probably the majority of the patches that pass STC are lucky runs these days (this will happen for 1 neutral patch in 20). However most of those lucky runs will be caught by the LTC test. This creates somehow the perception that the STC test is not a good predictor for the LTC test, leading people to make misguided calls for increasing the STC TC.

We do not know whether the problem is that the majority of patches that pass STC are lucky runs, or that the majority are patches that are good only at STC.

0.05^2 applies only to patches that are 0 Elo at both short and long time control; the probability is different for patches that are good at short time control but not at long time control.
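A small sketch of this distinction, still assuming the two tests are independent. The 90% STC pass rate for an "STC-only" patch is a hypothetical value used for illustration, not a figure from the thread.

```python
def pass_probability(p_stc: float, p_ltc: float) -> float:
    """Probability of passing both tests, given each stage's pass probability."""
    return p_stc * p_ltc


# Neutral at both time controls: 5% chance at each stage.
neutral_everywhere = pass_probability(0.05, 0.05)

# Hypothetical patch that genuinely helps at STC (assumed 90% pass rate)
# but is worth 0 Elo at LTC (5% pass rate).
good_only_at_stc = pass_probability(0.90, 0.05)

if __name__ == "__main__":
    print(f"neutral at both: {neutral_everywhere:.2%}")
    print(f"good only at STC: {good_only_at_stc:.2%}")
```

In this example the STC-only patch is many times more likely to slip through than a patch that is neutral everywhere, although the LTC stage still rejects it about 95% of the time.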

Rodolfo Leoni · Post by **Rodolfo Leoni** » Wed Aug 30, 2017 1:21 pm

mcostalba wrote:
Michel wrote: Probably the majority of the patches that pass STC are lucky runs these days (this will happen for 1 neutral patch in 20). However most of those lucky runs will be caught by the LTC test. This creates somehow the perception that the STC test is not a good predictor for the LTC test, leading people to make misguided calls for increasing the STC TC.
This is a comment that makes sense (a novelty in this thread).

In these 2 months there has been a huge number of tests and attempts by many people, no fewer than in the past, and for me this is the most important point. It means developer interest in SF is still high.

Also, finding good patches is a statistical process: sometimes you find 3 in a row, sometimes you fish for months for nothing....

It is still too early to tell whether we have reached a plateau with the current development model or whether it is just a temporary glitch.

I apologize for being so ignorant, but... is it possible that the test conditions should be revisited? What is the suite used for the 10,000 games per patch? And, to conclude, is it possible that SF7 is playing its % of perfect games under those test conditions, so that improvements can't be detected anymore?

My two cents.
