Observation from SF development

Jouni · Post by **Jouni** » Thu Aug 30, 2018 4:46 pm

After 17.8. we have these LTC patches until 28.8.:

Total: 30893 W: 5154 L: 4910 D: 20829 Elo +2.74

Total: 36203 W: 6046 L: 5781 D: 24376 Elo +2.54

Total: 41084 W: 6769 L: 6680 D: 27635 Elo +0.75

Total: 13655 W: 2294 L: 2099 D: 9262 Elo +4.96

Wow nice >10 ELO gain? No, regression test gave just 1,0 ELO. Also in NCM after more than 100 000 games no visible progress.

Dann Corbit · Post by **Dann Corbit** » Thu Aug 30, 2018 5:39 pm

There are error bars, of course, on all of the measurements.
And the measurements are really valid only for the exact conditions of the tests.

But if you look at Pohl's site, you will see a relentless and merciless march towards the stars.
Eventually, they would reach infinity.

So the numbers may not add up like cord wood and create the giant pile of Elo instantly.
But the technique clearly does work.

Uri Blass · Post by **Uri Blass** » Thu Aug 30, 2018 8:41 pm

Jouni wrote: ↑Thu Aug 30, 2018 4:46 pm After 17.8. we have these LTC patches until 28.8.:

Total: 30893 W: 5154 L: 4910 D: 20829 Elo +2.74

Total: 36203 W: 6046 L: 5781 D: 24376 Elo +2.54

Total: 41084 W: 6769 L: 6680 D: 27635 Elo +0.75

Total: 13655 W: 2294 L: 2099 D: 9262 Elo +4.96

Wow nice >10 ELO gain? No, regression test gave just 1,0 ELO . Also in NCM after more than 100 000 games no visible progress.

Numbers are wrong.

If you want unbiased estimates you should use fixed number of games and not SPRT.
They have many patches that fail at long time control and I guess most of the patches that passed long time control are going to fail if they test again.

Imagine that every patch is 0 elo improvement and you use SPRT.
If you test enough patches some are going to pass with +2 elo or even +4 elo biased estimate.

After enough time you can easily get 100 elo, but when you test with 40000 games there is no visible progress.

It may be more interesting if they test every patch with 40000 games after it passes and use that result as the estimate (including negative numbers if they get them; and if you want an unbiased estimate, you are not allowed to revert patches that score less than 50% in the 40000 games after they have already passed SPRT).

I guess we are going to see a lower elo sum even if they test always against the previous version.

Uri
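Uri's thought experiment (every patch is worth exactly 0 Elo, yet some pass SPRT with a positive estimate) can be tried with a small simulation. This is only a sketch under simplifying assumptions that are not from the thread: a fixed draw rate of 67%, SPRT bounds read as plain logistic Elo, a stopping bound of ±2.94, and a hypothetical helper named `run_sprt`. It is not Fishtest's actual implementation.

```python
import math
import random

def expected_score(elo):
    """Logistic expected score for a given Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def elo_from_score(s):
    """Inverse of expected_score."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def run_sprt(true_elo, elo0, elo1, draw_rate, rng, bound=2.94):
    """Play simulated games until the LLR leaves (-bound, bound).

    Returns (passed, naive_elo), where naive_elo is the plain
    fixed-games style estimate computed from the final W/L/D.
    Assumption: draw rate is the same under every hypothesis.
    """
    def win_loss(elo):
        s = expected_score(elo)
        return s - draw_rate / 2.0, 1.0 - s - draw_rate / 2.0

    pw, pl = win_loss(true_elo)
    pw0, pl0 = win_loss(elo0)
    pw1, pl1 = win_loss(elo1)
    win_step = math.log(pw1 / pw0)    # LLR change after a win
    loss_step = math.log(pl1 / pl0)   # LLR change after a loss (negative)

    wins = losses = games = 0
    llr = 0.0
    while -bound < llr < bound:
        games += 1
        r = rng.random()
        if r < pw:
            wins += 1
            llr += win_step
        elif r < pw + pl:
            losses += 1
            llr += loss_step
        # A draw is equally likely under H0 and H1, so the LLR is unchanged.

    draws = games - wins - losses
    score = (wins + 0.5 * draws) / games
    return llr >= bound, elo_from_score(score)

rng = random.Random(2018)  # fixed seed so the run is repeatable
results = [run_sprt(0.0, 0.0, 5.0, 0.67, rng) for _ in range(200)]
passed = [elo for ok, elo in results if ok]

print(f"{len(passed)} of {len(results)} zero-Elo patches passed")
if passed:
    print(f"mean naive Elo of the passers: {sum(passed) / len(passed):+.2f}")
```

In this model a patch can only pass when it has more wins than losses, so every passer shows a positive naive estimate even though its true value is zero. Summing those estimates is the biased total Uri describes.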

Jouni · Post by **Jouni** » Sat Sep 01, 2018 9:34 am

BTW latest regression test shows regression: -1,7 ELO. So the trend to infinity is slowing down.

Ajedrecista · Post by **Ajedrecista** » Sat Sep 01, 2018 8:39 pm

Hello Jouni:

Jouni wrote: ↑Thu Aug 30, 2018 4:46 pm After 17.8. we have these LTC patches until 28.8.:

Total: 30893 W: 5154 L: 4910 D: 20829 Elo +2.74

Total: 36203 W: 6046 L: 5781 D: 24376 Elo +2.54

Total: 41084 W: 6769 L: 6680 D: 27635 Elo +0.75

Total: 13655 W: 2294 L: 2099 D: 9262 Elo +4.96

Wow nice >10 ELO gain? No, regression test gave just 1,0 ELO . Also in NCM after more than 100 000 games no visible progress.

As Uri already said, error bars exist and numbers are wrong just because Elo estimates from SPRT tests involve different math than Elo estimates from fixed number of games matches.

If those four tests were fixed game matches, I get your Elo estimates; for 95% confidence intervals, I get +0.53 to +4.96, +0.50 to +4.59, -1.17 to +2.67 and +1.66 to +8.27, respectively. Please compare these intervals with the correct ones shown below.
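For anyone who wants to redo the fixed-games arithmetic, here is a minimal sketch. It assumes the usual logistic Elo formula and a normal approximation for the 95% interval (z ≈ 1.96); the function names are made up for illustration and the exact method used for the figures above is not stated in the thread.

```python
import math

def elo_from_score(s):
    """Logistic Elo difference corresponding to an expected score s."""
    return -400.0 * math.log10(1.0 / s - 1.0)

def fixed_games_elo(wins, losses, draws, z=1.959964):
    """Return (elo, low, high) treating the result as a fixed-length match.

    Assumption: games are independent and the mean score is roughly
    normal, so the interval is score +/- z * standard error.
    """
    n = wins + losses + draws
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the outcome (win = 1, draw = 0.5, loss = 0).
    var = (wins + 0.25 * draws) / n - score ** 2
    half_width = z * math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - half_width),
            elo_from_score(score + half_width))

# W, L, D of the four LTC tests quoted in the thread.
tests = [
    (5154, 4910, 20829),
    (6046, 5781, 24376),
    (6769, 6680, 27635),
    (2294, 2099, 9262),
]

for w, l, d in tests:
    elo, low, high = fixed_games_elo(w, l, d)
    print(f"W: {w} L: {l} D: {d}  Elo {elo:+.2f} [{low:+.2f}, {high:+.2f}]")
```

These intervals are only valid if the number of games was fixed in advance, which is exactly what does not hold for an SPRT run.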

Michel van den Bergh did the corresponding math for Elo estimates from SPRT tests, and anyone can see them now just by clicking on either of the two places that I write in bold:

18-08-26 Viz tweak_se_malus diff

LLR: 2.95 (-2.94,2.94) [0.00,4.00]
Total: 13655 W: 2294 L: 2099 D: 9262

sprt @ 60+0.6 th 1

LTC now...

That is, over the box with the SPRT stats or over the word 'SPRT'.

Here are the results:

http://tests.stockfishchess.org/html/li ... 02bdba9914

TC      60+0.6
SPRT    elo0: 0.00  alpha: 0.05  elo1: 5.00  beta: 0.05
LLR     2.96 [-2.94,2.94] (accepted)
Elo     2.30 [-0.17,4.63] (95%)
LOS     96.6%
Games   30893 [w:16.7%, l:15.9%, d:67.4%]

http://tests.stockfishchess.org/html/li ... 02bdbadfa9

TC      60+0.6
SPRT    elo0: 0.00  alpha: 0.05  elo1: 5.00  beta: 0.05
LLR     2.95 [-2.94,2.94] (accepted)
Elo     2.10 [-0.23,4.27] (95%)
LOS     96.2%
Games   36203 [w:16.7%, l:16.0%, d:67.3%]

http://tests.stockfishchess.org/html/li ... 02bdbb1b85

TC      60+0.6
SPRT    elo0: -3.00  alpha: 0.05  elo1: 1.00  beta: 0.05
LLR     2.96 [-2.94,2.94] (accepted)
Elo     0.39 [-1.72,2.41] (95%)
LOS     64.7%
Games   41084 [w:16.5%, l:16.3%, d:67.3%]

http://tests.stockfishchess.org/html/li ... 02bdbb8038

TC      60+0.6
SPRT    elo0: 0.00  alpha: 0.05  elo1: 4.00  beta: 0.05
LLR     2.95 [-2.94,2.94] (accepted)
Elo     4.60 [1.19,7.96] (95%)
LOS     99.6%
Games   13655 [w:16.8%, l:15.4%, d:67.8%]

Another important issue is that Elo gains from different tests are not additive, and Fishtest is full of examples. Just take all the patches between two consecutive 40000-game regression/progression matches and you will see that the difference between these 40000-game matches is usually less than the sum of the gains of the intermediate patches. Again, Uri already said that.

Regards from Spain.

Ajedrecista.
