Stockfish: no progress in two and a half months, why?

mcostalba · Post by **mcostalba** » Wed Aug 30, 2017 2:07 pm

Rodolfo Leoni wrote:
mcostalba wrote:
Michel wrote: Probably the majority of the patches that pass STC are lucky runs these days (this will happen for 1 neutral patch in 20). However most of those lucky runs will be caught by the LTC test. This creates somehow the perception that the STC test is not a good predictor for the LTC test, leading people to make misguided calls for increasing the STC TC.
This is a comment that makes sense (a novelty in this thread).

In these 2 months there has been a huge number of tests and attempts tried by many people, no fewer than in the past, and for me this is the most important point. It means the interest of developers in SF is still high.

Also, finding good patches is a statistical process: sometimes you find 3 in a row, sometimes you fish for months for nothing....

It is still too early to tell whether we have reached a plateau with the current development model or it is just a temporary glitch.
I apologize for being so ignorant, but... Is it possible that the test conditions should be revisited? What's the suite of the 10000 games per patch? And, to conclude, is it possible that SF7 is playing its % of perfect games under those test conditions, so that improvements can't be detected anymore?

My two cents.

Current test conditions are able to detect almost always a good patch of at least 1-2 ELO improvement. There is no need to change that. In particular, lowering the threshold would (1) commit many neutral patches, which is very bad in the long term, and (2) sometimes even commit regressions.

The test conditions did not change from last year or the year before and the math behind them did not change too.

Unfortunately the only way to improve SF is to find good patches, there are no shortcuts or workarounds to this very simple reality.

I'd also like to highlight that a detection threshold of 1-2 ELO for submitted patches is already really low. Very few people here realize how low this threshold is; before fishtest, detecting such small improvements in a reliable and statistically sound way was not even thinkable.

Frank Brenner · Post by **Frank Brenner** » Wed Aug 30, 2017 5:04 pm

Current test conditions are able to detect almost always a good patch of at least 1-2 ELO improvement. There is no need to change that.....The test conditions did not change from last year or the year before and the math behind them did not change too

The math did not change, but did the Stockfish team get the math?

During the last 2.5 months many patches were accepted at STC and LTC and added to master.

If you do the math and sum up the ELO gain of each little patch, you get to ... let me say 20 or 30 points.

But according to the latest regression test the sum is not 20 or 30, it is ONLY 2 points.

Last year we were able to sum up the individual patches to 50, and the regression test showed only 20 ...

My two cents for avoiding this and making it better:

Take an arbitrary old version, let's say the last version from May 2017, as the reference.
If you test a new patch P1, let the most recent version including this patch play against the reference at STC and LTC.
If the test is successful, add the patch and note the performance (for example: +13 ELO STC / +8 LTC).

The next patch P2 that is to be tested at STC and LTC must be stronger against the reference than the last version at both time controls!
If P2 is successful, update the score (for example +16 STC / +11 LTC).

Of course, you can cancel a bad patch early, before 40,000 games are played, when the math says that the probability of an improvement is almost 0.
As long as the patch is not rejected, all 40,000 games have to be played in the end, to obtain a well-calculated new score (for example +16/+11).

The next patch must be tested with P1 and P2 against the old May reference.
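The acceptance rule proposed above can be sketched in a few lines of Python. This is only an illustration of the bookkeeping, not anything fishtest does: the names `Score`, `beats`, `run_queue` and `toy_measure` are hypothetical, and the toy measurement assumes noise-free, additive patch values, which real game results are not.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple


@dataclass(frozen=True)
class Score:
    """Measured ELO of a build against the fixed reference (hypothetical type)."""
    stc: float
    ltc: float


def beats(new: Score, best: Score) -> bool:
    """A candidate must be stronger than the last version at both time controls."""
    return new.stc > best.stc and new.ltc > best.ltc


def run_queue(patches: Iterable[str],
              measure: Callable[[List[str]], Score]) -> Tuple[List[str], Score]:
    """Accept patches one by one.

    `measure` stands for playing "master + candidate patch" against the
    fixed reference at STC and LTC and returning the measured scores.
    """
    accepted: List[str] = []
    best = Score(0.0, 0.0)  # the reference scores 0 against itself
    for patch in patches:
        candidate = measure(accepted + [patch])
        if beats(candidate, best):
            accepted.append(patch)
            best = candidate  # update the noted score
    return accepted, best


# Toy data (assumption): each patch has a fixed, additive value.
toy_values = {"P1": Score(13, 8), "P2": Score(3, 3), "P3": Score(-1, 2)}


def toy_measure(applied: List[str]) -> Score:
    return Score(sum(toy_values[p].stc for p in applied),
                 sum(toy_values[p].ltc for p in applied))


accepted, best = run_queue(["P1", "P2", "P3"], toy_measure)
print(accepted, best)
```

With these toy values P1 and P2 are accepted and the noted score moves from +13/+8 to +16/+11, as in the example in the post; P3 is rejected because it lowers the STC score. With real, noisy measurements the comparison against the noted score would itself be subject to statistical error.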


Uri Blass · Post by **Uri Blass** » Wed Aug 30, 2017 6:31 pm

Frank Brenner wrote:
Current test conditions are able to detect almost always a good patch of at least 1-2 ELO improvement. There is no need to change that.....The test conditions did not change from last year or the year before and the math behind them did not change too

The math did not change, but did the Stockfish team get the math?

During the last 2.5 months many patches were accepted at STC and LTC and added to master.

If you do the math and sum up the ELO gain of each little patch, you get to ... let me say 20 or 30 points.

But according to the latest regression test the sum is not 20 or 30, it is ONLY 2 points.


The math is wrong.

I did not see an unbiased estimate for every patch during the last 2.5 months in order to claim that the sum is at least 20 points.

Note that you should get an estimate also for "non functional" changes.

lantonov · Post by **lantonov** » Wed Aug 30, 2017 6:44 pm

IMO, this site gives a good picture of SF's progress.

Rodolfo Leoni · Post by **Rodolfo Leoni** » Wed Aug 30, 2017 8:14 pm

lantonov wrote: IMO, this site gives a good picture of SF's progress.

The graph shows no progress for the last few months. Besides, I must believe Marco, as he has a better view of the whole. From my viewpoint, it'd be safer to test new patches against (at least) 4 different versions. But it's just a viewpoint.

Uri Blass · Post by **Uri Blass** » Wed Aug 30, 2017 8:31 pm

lantonov wrote: IMO, this site gives a good picture of SF's progress.

It seems that they did not test every compile.

For example, I see 2 different compiles from 8.7 and 4 different compiles from 2.7 on the following page:
http://abrok.eu/stockfish/?page=2

On their page I see only 1 compile from 8.7 and 3 compiles from 3.7.

Note that this is a problem: I would like to use their data to get an unbiased estimate of the gain from ELO-gaining functional changes, but that seems impossible unless it is clear that every ELO-gaining patch is tested separately against Stockfish 7, just like the version before that patch.

I would simply like to see the sum of the ELO gain against Stockfish 7 from
1) ELO-gaining patches
2) non-losing patches (tested at (-1,3))
3) non-functional changes

Frank Brenner · Post by **Frank Brenner** » Wed Aug 30, 2017 8:53 pm

Uri Blass wrote:
Frank Brenner wrote:
Current test conditions are able to detect almost always a good patch of at least 1-2 ELO improvement. There is no need to change that.....The test conditions did not change from last year or the year before and the math behind them did not change too

The math did not change, but did the Stockfish team get the math?

During the last 2.5 months many patches were accepted at STC and LTC and added to master.

If you do the math and sum up the ELO gain of each little patch, you get to ... let me say 20 or 30 points.

But according to the latest regression test the sum is not 20 or 30, it is ONLY 2 points.

The math is wrong.

I did not see an unbiased estimate for every patch during the last 2.5 months in order to claim that the sum is at least 20 points.

Note that you should get an estimate also for "non functional" changes.

To get an unbiased estimate for a single patch you can do the following math; here is one example:

Take a look at the green results in the Stockfish testing framework and pick one:
Code:
25-05-17  Vo  yellowCombo2  diff
LLR: 2.95 (-2.94,2.94) [0.00,5.00]
Total: 11928 W: 1613 L: 1457 D: 8858
sprt @ 60+0.6 th 1  Give this a shot at long time control... low throughput.
The performance is (1613 + 8858/2) / 11928 = 0.506539... (that means 50.6539 percent)

Take the Elo formula and calculate the best estimate for the ELO gain:

ELO gain = -400 * log10(1/performance - 1)
= -400 * log10(1/0.506539 - 1) = +4.544 ELO

So for this single patch, based on 11928 games, the most probable ELO gain is 4.544 ELO.
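The arithmetic above can be reproduced with a short Python sketch. The point estimate uses exactly the formula from the post; the interval is an addition of mine, a plain normal approximation that treats the games as independent and ignores the fact that the test was stopped by SPRT, so it should be read as a rough indication only. The function names are hypothetical.

```python
import math


def elo_from_score(score: float) -> float:
    """Logistic Elo formula from the post: -400 * log10(1/score - 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_interval(wins: int, losses: int, draws: int, z: float = 1.96):
    """Return (estimate, low, high) in ELO for a W/L/D record.

    The interval is a naive normal approximation (assumption): it does
    not account for the SPRT stopping rule.
    """
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of a result taking the values 1, 0.5 and 0.
    variance = (wins + 0.25 * draws) / games - score ** 2
    stderr = math.sqrt(variance / games)
    return (elo_from_score(score),
            elo_from_score(score - z * stderr),
            elo_from_score(score + z * stderr))


# The yellowCombo2 LTC result quoted above.
elo, low, high = elo_with_interval(wins=1613, losses=1457, draws=8858)
print(f"estimate {elo:+.2f} ELO, approx. 95% interval [{low:+.2f}, {high:+.2f}]")
```

The estimate comes out at about +4.54 ELO, matching the hand calculation, while the naive interval is roughly plus or minus 3 ELO wide on each side, so the number of games matters more than the post suggests.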

Now you can assign a random variable to each patch that has successfully passed LTC and STC.

For the single patch above the expected value is 4.544 ELO. (Of course, the other parameters and the exact distribution depend mainly on the number of games played... but this is not so important.)
What is important: for each epsilon > 0, the probability that the real-world ELO gain of this single patch lies in the interval [Value - epsilon, Value + epsilon] is highest for Value = 4.544.

This is a reliable way to make a rough estimate of the ELO gain of each individual patch.

If the sum of these estimates is much higher than the real ELO gain coming from the regression test, you should start searching for a reason, because there is something wrong with the framework math.

In our situation the real value is +2 ELO instead of 15 or 20 or 30 as it should be (I did not do the calculation for each patch).

It is a wrong way of thinking that the non-functional patches are responsible for this issue, since a non-functional patch should not make any change in playing strength, and if it does, the change should be very small.
However, if you think that this issue has happened just because of all the non-functional patches, you should skip this kind of patch when they are so expensive that they destroy all the "good" patches of a period of over 2.5 months .... But I don't think that the non-functional patches have a deep influence on this issue.

Henk · Post by **Henk** » Thu Aug 31, 2017 12:10 am

whereagles wrote: no progress because it gets harder and harder as you get better

This is true for all hobbies. It takes more and more effort the closer you get to the top. Reaching the top is dangerous too.

Uri Blass · Post by **Uri Blass** » Thu Aug 31, 2017 12:15 am

Frank Brenner wrote:
Uri Blass wrote:
Frank Brenner wrote:
Current test conditions are able to detect almost always a good patch of at least 1-2 ELO improvement. There is no need to change that.....The test conditions did not change from last year or the year before and the math behind them did not change too

The math did not change, but did the Stockfish team get the math?

During the last 2.5 months many patches were accepted at STC and LTC and added to master.

If you do the math and sum up the ELO gain of each little patch, you get to ... let me say 20 or 30 points.

But according to the latest regression test the sum is not 20 or 30, it is ONLY 2 points.

The math is wrong.

I did not see an unbiased estimate for every patch during the last 2.5 months in order to claim that the sum is at least 20 points.

Note that you should get an estimate also for "non functional" changes.

To get an unbiased estimate for a single patch you can do the following math; here is one example:

Take a look at the green results in the Stockfish testing framework and pick one:
Code:
25-05-17  Vo  yellowCombo2  diff
LLR: 2.95 (-2.94,2.94) [0.00,5.00]
Total: 11928 W: 1613 L: 1457 D: 8858
sprt @ 60+0.6 th 1  Give this a shot at long time control... low throughput.
The performance is (1613 + 8858/2) / 11928 = 0.506539... (that means 50.6539 percent)

Take the Elo formula and calculate the best estimate for the ELO gain:

ELO gain = -400 * log10(1/performance - 1)
= -400 * log10(1/0.506539 - 1) = +4.544 ELO

So for this single patch, based on 11928 games, the most probable ELO gain is 4.544 ELO.

Now you can assign a random variable to each patch that has successfully passed LTC and STC.

For the single patch above the expected value is 4.544 ELO. (Of course, the other parameters and the exact distribution depend mainly on the number of games played... but this is not so important.)
What is important: for each epsilon > 0, the probability that the real-world ELO gain of this single patch lies in the interval [Value - epsilon, Value + epsilon] is highest for Value = 4.544.

This is a reliable way to make a rough estimate of the ELO gain of each individual patch.

If the sum of these estimates is much higher than the real ELO gain coming from the regression test, you should start searching for a reason, because there is something wrong with the framework math.

In our situation the real value is +2 ELO instead of 15 or 20 or 30 as it should be (I did not do the calculation for each patch).

It is a wrong way of thinking that the non-functional patches are responsible for this issue, since a non-functional patch should not make any change in playing strength, and if it does, the change should be very small.
However, if you think that this issue has happened just because of all the non-functional patches, you should skip this kind of patch when they are so expensive that they destroy all the "good" patches of a period of over 2.5 months .... But I don't think that the non-functional patches have a deep influence on this issue.
No.

An estimate taken from the SPRT run is clearly biased, because you already know that the patch passed before you make the estimate.

The same patch could also have failed SPRT, but if the patch fails you simply do not count it.

If you want to make an unbiased estimate, then you need to run a test that has no influence on whether the patch is accepted, such as a fixed number of games played after you have already decided to accept the patch.
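The selection effect described here can be illustrated with a small simulation. This is a sketch under simplifying assumptions, not the fishtest implementation: batch scores are drawn from a normal distribution, the per-game variance is treated as known (derived from an assumed draw ratio of 0.74), and the log-likelihood ratio is the textbook Gaussian form. All function and parameter names are hypothetical; the bounds [0, 5] and the LLR limit 2.94 are taken from the quoted test.

```python
import math
import random


def expected_score(elo: float) -> float:
    """Logistic Elo model: expected score for a given ELO advantage."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def elo_from_score(score: float) -> float:
    """Inverse of expected_score (the formula used earlier in the thread)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def run_sprt(true_elo, rng, elo0=0.0, elo1=5.0, bound=2.94,
             draw_ratio=0.74, batch=200, max_games=400_000):
    """Simulate one SPRT run; return (passed, naive ELO estimate at stop)."""
    s0, s1 = expected_score(elo0), expected_score(elo1)
    mu = expected_score(true_elo)
    var = (1.0 - draw_ratio) / 4.0  # per-game score variance, assumed known
    games, total = 0, 0.0
    while games < max_games:
        # Gaussian approximation of the summed score of one batch of games.
        total += rng.gauss(batch * mu, math.sqrt(batch * var))
        games += batch
        llr = (s1 - s0) / var * (total - games * (s0 + s1) / 2.0)
        if llr >= bound:
            return True, elo_from_score(total / games)
        if llr <= -bound:
            return False, elo_from_score(total / games)
    return False, elo_from_score(total / games)  # cap reached: count as fail


def selection_bias(true_elo=1.0, trials=2000, seed=1):
    """Return (number of passing runs, mean naive estimate among them)."""
    rng = random.Random(seed)
    passed = [est for ok, est in
              (run_sprt(true_elo, rng) for _ in range(trials)) if ok]
    mean = sum(passed) / len(passed) if passed else float("nan")
    return len(passed), mean


if __name__ == "__main__":
    count, mean_estimate = selection_bias(true_elo=1.0)
    print(f"true gain +1.00, {count} runs passed, "
          f"mean estimate among them {mean_estimate:+.2f}")
```

In this model a run can only pass when its observed score lies above the midpoint between the two hypotheses, so every passing run reports more than about 2.5 ELO even though the true gain was set to 1. Summing such estimates over accepted patches therefore overstates the total, which is the bias the post points out.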

JJJ · Post by **JJJ** » Thu Aug 31, 2017 1:43 am

Why don't you test something like this:

Remove all green patches and test a version with only the non-functional patches, then do the opposite: remove all non-functional patches and test a version with only the green patches.

At least you will know better, and quickly, whether the problem comes from there or not.

The many green patches should have added at least +20 ELO; that is not the case. Even 10 ELO would have been nice, but almost 0 looks wrong to me. I think there was the same problem just after Stockfish 8, when many green patches did not bring much of an ELO increase for a while.
