Stockfish testing at STC and LTC: one question

Jouni
Posts: 3278
Joined: Wed Mar 08, 2006 8:15 pm

Stockfish testing at STC and LTC: one question

Post by Jouni »

It's usual that an excellent patch gives +4 Elo at STC but only +2 at LTC. Does this indicate that at the 360+3.6 level we probably get NOTHING? And maybe even a regression at tournament level!
Jouni
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Stockfish testing at STC and LTC: one question

Post by cdani »

Who knows; without more games it is impossible to tell. When I want to be more sure, I just extend the test. Flexibility in testing methods is..., well, necessary.
Last edited by cdani on Tue Sep 19, 2017 5:40 pm, edited 1 time in total.
jhellis3
Posts: 546
Joined: Sat Aug 17, 2013 12:36 am

Re: Stockfish testing at STC and LTC: one question

Post by jhellis3 »

Yes, this is why Stockfish has lost hundreds of Elo over the last couple of years and has had miserable results at TCEC.
cdani
Posts: 2204
Joined: Sat Jan 18, 2014 10:24 am
Location: Andorra

Re: Stockfish testing at STC and LTC: one question

Post by cdani »

jhellis3 wrote:Yes, this is why Stockfish has lost hundreds of Elo over the last couple of years and has had miserable results at TCEC.
Not at all, of course, but something more can always be done.
Houdini
Posts: 1471
Joined: Tue Mar 16, 2010 12:00 am

Re: Stockfish testing at STC and LTC: one question

Post by Houdini »

Jouni wrote:It's usual that an excellent patch gives +4 Elo at STC but only +2 at LTC. Does this indicate that at the 360+3.6 level we probably get NOTHING? And maybe even a regression at tournament level!
The amount of noise in engine testing is such that it's nearly impossible to extrapolate the results to longer TC.
The error margins are very big compared to the difference in results.
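
To put a rough number on those error margins, here is a minimal sketch that turns a win/draw/loss count into an Elo estimate with a 95% interval. It assumes the usual logistic Elo formula and a normal approximation on the mean score; the game counts are invented for illustration and do not come from any real test.

```python
import math


def elo_from_score(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_with_margin(wins, draws, losses, z=1.96):
    """Return (elo, low, high) using a normal approximation on the mean score."""
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    # Per-game variance of the result, where a game scores 1, 0.5 or 0 points.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    half_width = z * math.sqrt(var / n)
    return (elo_from_score(score),
            elo_from_score(score - half_width),
            elo_from_score(score + half_width))


# Hypothetical 20,000-game result with 60% draws (numbers invented).
elo, low, high = elo_with_margin(wins=4115, draws=12000, losses=3885)
print(f"{elo:+.1f} Elo, 95% interval [{low:+.1f}, {high:+.1f}]")
```

Under these assumptions, even 20,000 games leave an interval of about ±3 Elo around a +4 estimate, which is as large as the STC-versus-LTC difference being discussed.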
Evert
Posts: 2929
Joined: Sat Jan 22, 2011 12:42 am
Location: NL

Re: Stockfish testing at STC and LTC: one question

Post by Evert »

How do you get those Elo estimates? Elo estimates based on the SPRT test runs are not reliable. All I'm seeing from the numbers you quote is an increased draw rate with longer time control, which I think is expected.
mjlef
Posts: 1494
Joined: Thu Mar 30, 2006 2:08 pm

Re: Stockfish testing at STC and LTC: one question

Post by mjlef »

Jouni wrote:It's usual that an excellent patch gives +4 Elo at STC but only +2 at LTC. Does this indicate that at the 360+3.6 level we probably get NOTHING? And maybe even a regression at tournament level!
Take a look at draw percentages. At longer time controls they increase a lot. For example, in CCRL a typical draw percentage for the stronger programs is 40% at 40/4, but at 40/40 it is around 60%. You just get more draws as programs search deeper and play better. So a contraction from 4 Elo to 2 Elo at a longer time control is quite normal.
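
The effect of the draw rate can be sketched with a few lines of arithmetic. This is only an illustration under one assumption: the share of decisive games won by the patched engine stays the same while the draw rate rises from 40% to 60%. The value of `DECISIVE_WIN_SHARE` is hypothetical, picked so that the 40% case comes out near +4 Elo.

```python
import math


def elo_from_score(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def elo_at_draw_rate(draw_rate, decisive_win_share):
    """Elo implied when `draw_rate` of the games are drawn and the remaining
    decisive games are won with probability `decisive_win_share`."""
    score = 0.5 * draw_rate + (1.0 - draw_rate) * decisive_win_share
    return elo_from_score(score)


# Assumption: the same edge in decisive games at both time controls.
DECISIVE_WIN_SHARE = 0.5096

for draw_rate in (0.40, 0.60):
    elo = elo_at_draw_rate(draw_rate, DECISIVE_WIN_SHARE)
    print(f"draw rate {draw_rate:.0%}: {elo:+.2f} Elo")
```

With the edge in decisive games held fixed, the measured Elo shrinks in proportion to the share of decisive games, so a higher draw rate alone accounts for a good part of the contraction.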

It is very hard to get enough data at very long time controls to prove whether a change is good or not. Anything more than a few seconds per move just takes too long. I really appreciate the in-between lists like IPON's 5'+3". It is a reasonable attempt to get enough games to say something with reasonable error margins. Perhaps as computers get cheaper on the cloud we can test at a much longer time control.
Uri Blass
Posts: 10267
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: Stockfish testing at STC and LTC: one question

Post by Uri Blass »

Jouni wrote:It's usual that an excellent patch gives +4 Elo at STC but only +2 at LTC. Does this indicate that at the 360+3.6 level we probably get NOTHING? And maybe even a regression at tournament level!
It is usual that the Stockfish team has no interest in how much Elo they gain; otherwise they could do better by using a fixed number of games.

We know almost nothing about the Elo improvement of a patch from the results of SPRT.

A performance of +4 Elo when a patch passes SPRT at STC means nothing, because if you test the patch many times it may fail SPRT in some of the cases and also give 0 or 1 Elo, so the average result is clearly less than 4 Elo.
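
This selection effect can be illustrated with a small simulation. It is a simplified sketch, not the actual fishtest procedure: it assumes a trinomial win/draw/loss model with a fixed 60% draw rate, logistic Elo, SPRT hypotheses of 0 and 5 Elo with alpha = beta = 0.05, and a patch whose true gain is 2 Elo. All of those numbers are assumptions chosen for illustration.

```python
import math
import random
import statistics


def score_from_elo(elo):
    """Expected score for a logistic Elo difference."""
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))


def elo_from_score(score):
    """Logistic Elo difference implied by an expected score in (0, 1)."""
    return -400.0 * math.log10(1.0 / score - 1.0)


def win_loss_probs(elo, draw_rate):
    """Win and loss probabilities for a given Elo gap and fixed draw rate."""
    score = score_from_elo(elo)
    return score - draw_rate / 2.0, 1.0 - score - draw_rate / 2.0


def run_sprt(true_elo, elo0, elo1, draw_rate, rng, alpha=0.05, beta=0.05):
    """Play simulated games until the SPRT stops; return (passed, observed_elo)."""
    lower = math.log(beta / (1.0 - alpha))
    upper = math.log((1.0 - beta) / alpha)
    p_win, p_loss = win_loss_probs(true_elo, draw_rate)
    w0, l0 = win_loss_probs(elo0, draw_rate)
    w1, l1 = win_loss_probs(elo1, draw_rate)
    llr_win, llr_loss = math.log(w1 / w0), math.log(l1 / l0)
    wins = losses = games = 0
    llr = 0.0
    while lower < llr < upper:
        games += 1
        r = rng.random()
        if r < p_win:
            wins += 1
            llr += llr_win
        elif r < p_win + p_loss:
            losses += 1
            llr += llr_loss
        # Draws are equally likely under both hypotheses here, so they add 0.
    score = 0.5 + (wins - losses) / (2.0 * games)
    return llr >= upper, elo_from_score(score)


def simulate(runs=100, true_elo=2.0, seed=1):
    """Repeat the same SPRT many times; split observed Elo by pass/fail."""
    rng = random.Random(seed)
    results = [run_sprt(true_elo, 0.0, 5.0, 0.6, rng) for _ in range(runs)]
    passed = [elo for ok, elo in results if ok]
    failed = [elo for ok, elo in results if not ok]
    return passed, failed


passed, failed = simulate()
print(f"passed {len(passed)} of {len(passed) + len(failed)} runs")
if passed:
    print(f"mean observed Elo of passing runs: {statistics.mean(passed):+.2f}")
if failed:
    print(f"mean observed Elo of failing runs: {statistics.mean(failed):+.2f}")
```

In this model a run can only stop at the upper bound when the observed score is above the midpoint of the two hypotheses, so the Elo shown by a passing run is always higher than the true 2 Elo of the simulated patch. A fixed-length match has no such stopping rule and therefore no such bias.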
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: Stockfish testing at STC and LTC: one question

Post by jdart »

I have observed myself that STC results appear to be a lot noisier than LTC results. So a positive STC result is a bad predictor of what LTC or real tournament results will be. This is a bit surprising because for years, starting with Rybka, engines were using hyper-bullet games for testing. There is some validity to that method because many got a good Elo gain from it. But it is not the best or most reliable method. It is a way to shortcut testing at real time controls, which would require a huge number of processor cores to perform in a reasonable time period.

--Jon
Dann Corbit
Posts: 12537
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: Stockfish testing at STC and LTC: one question

Post by Dann Corbit »

jdart wrote:I have observed myself that STC results appear to be a lot noisier than LTC results. So a positive STC result is a bad predictor of what LTC or real tournament results will be. This is a bit surprising because for years, starting with Rybka, engines were using hyper-bullet games for testing. There is some validity to that method because many got a good Elo gain from it. But it is not the best or most reliable method. It is a way to shortcut testing at real time controls, which would require a huge number of processor cores to perform in a reasonable time period.

--Jon
When you test with a certain set of conditions, the results are totally valid for exactly those conditions. The results may or may not translate to another set of conditions.

Generally speaking, things that work well at ultra-high speed will work well at other speeds too. That is why the model tends to work and Stockfish is an extremely strong engine.

On the other hand, they are tuning SF for high-speed blitz games, so that is what they will achieve.
But I think every other engine is doing the same thing, so it really won't make any difference anyway.
Besides which, nobody has the resources to test at tournament time control.
Taking ideas is not a vice, it is a virtue. We have another word for this. It is called learning.
But sharing ideas is an even greater virtue. We have another word for this. It is called teaching.