Observation from SF development

Jouni
Posts: 3293
Joined: Wed Mar 08, 2006 8:15 pm

Observation from SF development

Post by Jouni »

After 17.8. we have these LTC patches until 28.8.:

Total: 30893 W: 5154 L: 4910 D: 20829 Elo +2.74

Total: 36203 W: 6046 L: 5781 D: 24376 Elo +2.54

Total: 41084 W: 6769 L: 6680 D: 27635 Elo +0.75

Total: 13655 W: 2294 L: 2099 D: 9262 Elo +4.96

Wow nice >10 ELO gain? No, regression test gave just 1,0 ELO :o . Also in NCM after more than 100 000 games no visible progress.
Jouni
Dann Corbit
Posts: 12542
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Observation from SF development

Post by Dann Corbit »

There are error bars, of course, on all of the measurements.
And the measurements are really valid only for the exact conditions of the tests.

But if you look at Pohl's site, you will see a relentless and merciless march towards the stars.
Eventually, they would reach infinity.

So the numbers may not add up like cord wood and create the giant pile of Elo instantly.
But the technique clearly does work.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.
Uri Blass
Posts: 10309
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Observation from SF development

Post by Uri Blass »

Jouni wrote: Thu Aug 30, 2018 4:46 pm After 17.8. we have these LTC patches until 28.8.:

Total: 30893 W: 5154 L: 4910 D: 20829 Elo +2.74

Total: 36203 W: 6046 L: 5781 D: 24376 Elo +2.54

Total: 41084 W: 6769 L: 6680 D: 27635 Elo +0.75

Total: 13655 W: 2294 L: 2099 D: 9262 Elo +4.96

Wow nice >10 ELO gain? No, regression test gave just 1,0 ELO :o . Also in NCM after more than 100 000 games no visible progress.
Numbers are wrong.

If you want unbiased estimates, you should use a fixed number of games rather than SPRT.
Many patches fail at long time control, and I guess most of the patches that passed long time control would fail if they were tested again.

Imagine that every patch is worth exactly 0 Elo and you test with SPRT.
If you test enough patches, some are going to pass with a biased estimate of +2 or even +4 Elo.

After enough time you can easily add up 100 Elo this way, yet a 40000-game test shows no visible progress.

It would be more interesting to test every patch with 40000 games after it passes and use that result as the estimate. Negative numbers must be counted too: if you want an unbiased estimate, you are not allowed to revert a patch that scored below 50% over 40000 games after it had already passed SPRT.

I guess we would see a lower Elo sum that way, even if each patch is always tested against the previous version.
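The thought experiment above can be tried out with a rough simulation. This is only a sketch under stated assumptions: the LLR uses a normal-approximation formula (not necessarily what Fishtest computes), the draw ratio of 0.67 and the [0,4] bounds are taken from the test stats quoted in this thread, and the names `run_sprt` and `simulate` are made up for illustration.

```python
import math
import random

def expected_score(elo):
    """Logistic expected score for an Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def elo_from_score(score):
    """Inverse of expected_score."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def run_sprt(rng, true_elo=0.0, elo0=0.0, elo1=4.0, draw_ratio=0.67,
             alpha=0.05, beta=0.05, batch=500, max_games=200_000):
    """Play batches until the approximate LLR crosses a bound.

    Returns (passed, games, naive_elo). The LLR is a normal approximation,
    which is an assumption of this sketch, not Fishtest's exact formula.
    """
    lower = math.log(beta / (1.0 - alpha))   # about -2.94
    upper = math.log((1.0 - beta) / alpha)   # about +2.94
    s0, s1 = expected_score(elo0), expected_score(elo1)
    p_win = expected_score(true_elo) - draw_ratio / 2.0
    weights = [p_win, draw_ratio, 1.0 - p_win - draw_ratio]
    games = 0
    total = total_sq = 0.0
    while games < max_games:
        outcomes = rng.choices("WDL", weights=weights, k=batch)
        wins, draws = outcomes.count("W"), outcomes.count("D")
        games += batch
        total += wins + 0.5 * draws       # sum of game scores
        total_sq += wins + 0.25 * draws   # sum of squared game scores
        mean = total / games
        var = total_sq / games - mean * mean
        llr = games * (s1 - s0) * (2.0 * mean - s0 - s1) / (2.0 * var)
        if llr >= upper:
            return True, games, elo_from_score(mean)
        if llr <= lower:
            return False, games, elo_from_score(mean)
    return False, games, elo_from_score(mean)  # give up: treat as failed

def simulate(num_patches=200, seed=2018):
    """Test num_patches patches that are all worth exactly 0 Elo."""
    rng = random.Random(seed)
    return [run_sprt(rng) for _ in range(num_patches)]

results = simulate()
passed_elos = [elo for passed, _, elo in results if passed]
print(f"{len(passed_elos)} of {len(results)} zero-Elo patches passed")
if passed_elos:
    print(f"sum of their naive Elo estimates: {sum(passed_elos):+.2f}")
```

In this model a patch can only pass while its running score is above the midpoint between the elo0 and elo1 scores, so every passing patch reports a naive estimate of roughly +2 Elo or more, even though the true gain is zero by construction.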

Uri
Jouni
Posts: 3293
Joined: Wed Mar 08, 2006 8:15 pm

Re: Observation from SF development

Post by Jouni »

BTW latest regression test shows regression: -1,7 ELO. So the trend to infinity is slowing down :) .
Jouni
Ajedrecista
Posts: 1971
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Observation from SF development.

Post by Ajedrecista »

Hello Jouni:
Jouni wrote: Thu Aug 30, 2018 4:46 pm After 17.8. we have these LTC patches until 28.8.:

Total: 30893 W: 5154 L: 4910 D: 20829 Elo +2.74

Total: 36203 W: 6046 L: 5781 D: 24376 Elo +2.54

Total: 41084 W: 6769 L: 6680 D: 27635 Elo +0.75

Total: 13655 W: 2294 L: 2099 D: 9262 Elo +4.96

Wow nice >10 ELO gain? No, regression test gave just 1,0 ELO :o . Also in NCM after more than 100 000 games no visible progress.
As Uri already said, error bars exist, and the numbers are wrong simply because Elo estimates from SPRT tests involve different math than Elo estimates from matches with a fixed number of games.

If those four tests had been fixed-length matches, I would get your Elo estimates; for the 95% confidence intervals, I get +0.53 to +4.96, +0.50 to +4.59, -1.17 to +2.67 and +1.66 to +8.27, respectively. Please compare these intervals with the correct ones shown below.
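For anyone who wants to redo the fixed-length calculation, here is a minimal sketch. It assumes the usual logistic Elo formula and a normal approximation for the 95% interval on the score; the function name `fixed_match_elo` is only illustrative, and this is not necessarily the exact method used for the figures in this post.

```python
import math

def elo_from_score(score):
    """Logistic Elo difference for a score fraction in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)

def fixed_match_elo(wins, losses, draws, z=1.959964):
    """Return (elo, low, high) treating the result as a fixed-length match.

    z = 1.959964 is the two-sided 95% quantile of the normal distribution.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
    var = (wins + 0.25 * draws) / n - score * score
    half_width = z * math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - half_width),
            elo_from_score(score + half_width))

# W, L, D of the four LTC tests quoted in this thread.
tests = [(5154, 4910, 20829), (6046, 5781, 24376),
         (6769, 6680, 27635), (2294, 2099, 9262)]

for w, l, d in tests:
    elo, low, high = fixed_match_elo(w, l, d)
    print(f"W: {w} L: {l} D: {d}  Elo {elo:+.2f} [{low:+.2f}, {high:+.2f}]")
```

The point estimates reproduce the +2.74, +2.54, +0.75 and +4.96 from the original post. The intervals are only valid if the number of games was fixed in advance, which is not the case for an SPRT run.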

Michel van den Bergh did the corresponding math for Elo estimates from SPRT tests, and anyone can see the results now just by clicking on either of the two places that I mark in bold:
18-08-26 Viz tweak_se_malus diff

LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 13655 W: 2294 L: 2099 D: 9262


sprt @ 60+0.6 th 1

LTC now...
That is, click either on the box with the SPRT stats or on the word 'SPRT'.

Here are the results:

http://tests.stockfishchess.org/html/li ... 02bdba9914

TC      60+0.6
SPRT    elo0: 0.00  alpha: 0.05  elo1: 5.00  beta: 0.05
LLR     2.96 [-2.94,2.94] (accepted)
Elo     2.30 [-0.17,4.63] (95%)
LOS     96.6%
Games   30893 [w:16.7%, l:15.9%, d:67.4%]
http://tests.stockfishchess.org/html/li ... 02bdbadfa9

TC      60+0.6
SPRT    elo0: 0.00  alpha: 0.05  elo1: 5.00  beta: 0.05
LLR     2.95 [-2.94,2.94] (accepted)
Elo     2.10 [-0.23,4.27] (95%)
LOS     96.2%
Games   36203 [w:16.7%, l:16.0%, d:67.3%]
http://tests.stockfishchess.org/html/li ... 02bdbb1b85

TC      60+0.6
SPRT    elo0: -3.00  alpha: 0.05  elo1: 1.00  beta: 0.05
LLR     2.96 [-2.94,2.94] (accepted)
Elo     0.39 [-1.72,2.41] (95%)
LOS     64.7%
Games   41084 [w:16.5%, l:16.3%, d:67.3%]
http://tests.stockfishchess.org/html/li ... 02bdbb8038

TC      60+0.6
SPRT    elo0: 0.00  alpha: 0.05  elo1: 4.00  beta: 0.05
LLR     2.95 [-2.94,2.94] (accepted)
Elo     4.60 [1.19,7.96] (95%)
LOS     99.6%
Games   13655 [w:16.8%, l:15.4%, d:67.8%]
Another important issue is that Elo gains from different tests are not additive. Fishtest is full of examples: just take all the patches between two consecutive 40000-game regression/progression matches and you will see that the difference between these 40000-game matches is usually less than the sum of the gains of the intermediate patches. Uri said that as well.
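As a quick arithmetic check with the numbers already posted in this thread (nothing new is measured here, and the variable names are just for illustration), the per-patch estimates can be summed and set against the regression result that Jouni reported:

```python
# Per-patch Elo estimates quoted in this thread for the four LTC tests.
naive_elo = [2.74, 2.54, 0.75, 4.96]   # computed as if fixed-length matches
sprt_elo = [2.30, 2.10, 0.39, 4.60]    # SPRT-aware estimates from Fishtest
regression_elo = 1.0                   # regression test result from the first post

naive_sum = round(sum(naive_elo), 2)
sprt_sum = round(sum(sprt_elo), 2)

print(f"sum of naive estimates:      {naive_sum:+.2f}")
print(f"sum of SPRT-aware estimates: {sprt_sum:+.2f}")
print(f"regression test:             {regression_elo:+.2f}")
```

Both sums stay far above the measured regression gain, so correcting each estimate for SPRT does not by itself make the gains add up.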

Regards from Spain.

Ajedrecista.