
testing consistency

Posted: Sun Dec 16, 2018 3:58 pm
by jdart
Recently I have been getting inconsistent results from testing at various time controls and conditions.

For example, one candidate change gave these results:

SPRT against master (1 opponent), 60+0.6 time control: significant + (H1) after about 3800 games.
Hyper-bullet, 40000 games against 5 opponents: -2 ELO [-4,4] relative to master branch.
3:0+1 blitz, 4000 games against 5 opponents: -9 ELO [-8,7] relative to master branch.

So in this case, at least, self-play testing at 60+0.6 doesn't seem to predict performance against other opponents. The hyper-bullet and blitz results also differ from each other. Of course this might be expected, since the test conditions are not the same. Still, if it occurs frequently it would be concerning, because it's common to use 60+0.6 SPRT self-play as a proxy for performance against other opponents at longer time controls. I am wondering if others have had a similar experience.

--Jon

Re: testing consistency

Posted: Sun Dec 16, 2018 4:27 pm
by tomitank
jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls and conditions. [...]
(I think) improvements below 10 Elo points are unreliable to measure.
Fabien told me this once, and I have to say he was right.
Obviously it's an exaggeration, but it's not worth spending a lot of time on changes that don't bring enough benefit.

Re: testing consistency

Posted: Sun Dec 16, 2018 5:22 pm
by Ferdy
jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls and conditions. [...]
When you say relative to master, did you also run matches between the master and those same 5 opponents?

Re: testing consistency

Posted: Sun Dec 16, 2018 5:42 pm
by jdart
Ferdy wrote: Sun Dec 16, 2018 5:22 pm When you say relative to master, did you also run matches between the master and those same 5 opponents?
Yes, when I give relative ELOs for those matches, they are compared to the results the master branch had.

--Jon

Re: testing consistency

Posted: Mon Dec 17, 2018 11:26 am
by Ferdy
jdart wrote: Sun Dec 16, 2018 5:42 pm
When you say relative to master, did you also run matches between the master and those same 5 opponents?
Yes, when I give relative ELOs for those matches, they are compared to the results the master branch had.

--Jon
Since the master performed better than the candidate against those 5 common opponents, I would keep the master as the best version. The performance against the 5 opponents carries more weight; after all, in the real world the engine is matched against different opponents. Alternatively, extend the test with more games or add more opponents.

Re: testing consistency

Posted: Tue Dec 25, 2018 8:56 pm
by Laskos
jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls and conditions. [...]
What were H1 and H0?
All might be consistent within 2 sigma.
Comparing the candidate to the master against the same opponents, the error of the difference is about 1.41 times (sqrt(2)) the error of each individual run.
So hyper-bullet 95% confidence result is -2 +/- 4 Elo points or so.
Blitz 95% confidence result is -9 +/- 11 Elo points or so.
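
As a rough sketch of where that 1.41 factor comes from (assuming the candidate and master gauntlets are independent runs with roughly equal error bars; the 3-Elo margins below are hypothetical):

import math

# When the candidate and the master are each measured against the same set of
# opponents, the errors of the two runs combine in quadrature for the
# candidate-minus-master difference; with equal errors that is the sqrt(2) ~ 1.41 factor.
def combined_error(err_candidate, err_master):
    return math.sqrt(err_candidate ** 2 + err_master ** 2)

# Hypothetical equal 95% margins of 3 Elo for each gauntlet:
print(combined_error(3.0, 3.0))   # about 4.24, i.e. 1.41 x 3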

Still, I would probably go for rejecting this patch, but again, what were H1 and H0 for the test to accept H1 in only about 3800 games? And what alpha and beta?

Re: testing consistency

Posted: Wed Dec 26, 2018 3:16 am
by jdart
I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py

The SPRT function is called with elo1=-1.5, elo2=4.5, alpha=0.05, beta=0.05.
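
For reference, a minimal sketch of how parameters like these typically turn into a stopping rule, using the common normal-approximation log-likelihood ratio. This is only an illustration, not necessarily what monitor.py does; I use the more common elo0/elo1 naming, and the W/D/L record in the example is made up:

import math

def elo_to_score(elo):
    # Expected score for a given Elo difference (logistic model).
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    # Approximate log-likelihood ratio for H0: elo = elo0 vs H1: elo = elo1,
    # using a normal approximation to the per-game score distribution.
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return (s1 - s0) * (2.0 * mean - s0 - s1) * n / (2.0 * var)

def sprt_bounds(alpha, beta):
    # Stop and accept H0 when LLR <= lower, accept H1 when LLR >= upper.
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)

lower, upper = sprt_bounds(0.05, 0.05)   # roughly -2.94 and +2.94
# Hypothetical 3800-game record (not the actual test data):
llr = sprt_llr(wins=1060, draws=1800, losses=940, elo0=-1.5, elo1=4.5)
print(lower, upper, llr)                 # llr is about 3.4 here, so H1 would be accepted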

--Jon

Re: testing consistency

Posted: Wed Dec 26, 2018 10:07 am
by Laskos
jdart wrote: Wed Dec 26, 2018 3:16 am I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py

The SPRT function is called with elo1=-1.5, elo2=4.5, alpha=0.05, beta=0.05.

--Jon
So, as an estimate, since you mixed an SPRT result with confidence-interval results, in terms of 95% confidence you had something like:
7 +/- 7 Elo points self-play bullet
-2 +/- 4 Elo points hyper-bullet against foreign opponents
-9 +/- 11 Elo points blitz against foreign opponents.

So it is sort of consistent within 2 SD, albeit with some bad luck, though nothing outrageous. All in all, I would reject that patch, as the results against foreign opponents are more reliable (but 4 times more expensive).
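
As a rough sketch of how a game record translates into such an Elo estimate with a 95% margin (logistic model, normal approximation; the W/D/L record below is hypothetical, picked only to land near the 7 +/- 7 figure):

import math

def elo_and_margin(wins, draws, losses):
    # Elo point estimate and rough 95% margin from a W/D/L record
    # (logistic model, normal approximation of the score distribution).
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    # Propagate the score error through the slope of the logistic Elo curve.
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    margin95 = 1.96 * math.sqrt(var / n) * slope
    return elo, margin95

# Hypothetical 3800-game record, not the actual data:
print(elo_and_margin(wins=838, draws=2200, losses=762))   # roughly (7.0, 7.2)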

Re: testing consistency

Posted: Wed Dec 26, 2018 4:54 pm
by jdart
I agree and did reject it. I am also now using longer TC tests more often for validation.

--Jon