Recently I have been getting inconsistent results from testing at various time controls/conditions.
For example, one candidate change gave these results:
SPRT against master (1 opponent), 60+0.6 time control: significant + (H1) after about 3800 games.
Hyper-bullet, 40000 games against 5 opponents: -2 ELO [-4,4] relative to master branch.
3:0+1 blitz, 4000 games against 5 opponents: -9 ELO [-8,7] relative to master branch
So in this case at least, self-play testing at 60+0.6 doesn't seem to predict performance against other opponents. The hyper-bullet and blitz results also differ from each other. Of course this might be expected, since the test conditions are not the same. Still, if this occurs frequently it would be concerning, because it's common to use 60+0.6 SPRT as a proxy for performance against other opponents at longer time controls. I am wondering if others have had a similar experience.
--Jon
testing consistency
Moderators: hgm, Rebel, chrisw
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
- Posts: 276
- Joined: Sat Mar 04, 2017 12:24 pm
- Location: Hungary
Re: testing consistency
jdart wrote: ↑Sun Dec 16, 2018 3:58 pm Recently I am getting inconsistent results from testing at various time controls/conditions. [...]
(I think) improvements below 10 Elo points are unreliable.
Fabien told me this once and I have to say he was right.
Obviously it's an exaggeration, but there is no point in killing a lot of time on things that do not bring enough benefit.
- Posts: 4833
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: testing consistency
jdart wrote: ↑Sun Dec 16, 2018 3:58 pm Recently I am getting inconsistent results from testing at various time controls/conditions. [...]
When you say relative to master, did you also run matches between the master and those same 5 opponents?
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: testing consistency
When you say relative to master, did you also run matches between the master and those same 5 opponents?
Yes, when I give relative Elos for those matches, they are compared against the results the master branch had.
--Jon
- Posts: 4833
- Joined: Sun Aug 10, 2008 3:15 pm
- Location: Philippines
Re: testing consistency
Since the master performed better than the candidate against those 5 common opponents, I prefer to keep the master as the best version. The performance against 5 opponents carries more weight; after all, in the real world the engine is matched against different opponents. Alternatively, extend the test with more games or add more opponents.
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: testing consistency
jdart wrote: ↑Sun Dec 16, 2018 3:58 pm Recently I am getting inconsistent results from testing at various time controls/conditions. [...]
What were H1 and H0?
All might be consistent within 2 sigma.
When comparing the candidate to master via their results against the same opponents, the errors of the two measurements combine, giving about 1.41 (sqrt(2)) times the error of each.
So the hyper-bullet 95% confidence result is -2 +/- 4 Elo points or so.
The blitz 95% confidence result is -9 +/- 11 Elo points or so.
Still, I would probably go for rejecting this patch, but again, what were H1 and H0 to get a positive in only 3800 games? Alpha, beta?
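For intuition, the 1.41 factor and the quoted margins can be reproduced with a quick back-of-the-envelope script (a sketch: the ~40% draw ratio, the normal approximation, and a near-50% score are assumptions, not figures from the thread):

```python
import math

Z95 = 1.96                                   # two-sided 95% confidence
ELO_PER_SCORE = 400 / math.log(10) / 0.25    # ~695 Elo per unit of score near 50%

def elo_margin(n_games, draw_ratio=0.4):
    """95% confidence half-width in Elo for one match scoring near 50%.
    Per-game score variance is 0.25 * (1 - draw_ratio) when wins == losses."""
    se_score = math.sqrt(0.25 * (1 - draw_ratio) / n_games)
    return Z95 * ELO_PER_SCORE * se_score

def combined_margin(m1, m2):
    """Half-width for the *difference* of two independent measurements;
    with equal margins this is sqrt(2) ~ 1.41 times either one."""
    return math.sqrt(m1 ** 2 + m2 ** 2)

# Candidate and master each measured against the same 5 opponents:
hyper = combined_margin(elo_margin(40000), elo_margin(40000))  # ~3.7 Elo
blitz = combined_margin(elo_margin(4000), elo_margin(4000))    # ~11.8 Elo
```

Under those assumptions the combined half-widths come out close to the -2 +/- 4 and -9 +/- 11 figures quoted above.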
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: testing consistency
I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py
The sprt function is called with elo1=-1.5, elo2=4.5, alpha=0.05, beta=0.05.
--Jon
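For reference, the stopping rule those parameters imply can be sketched as a plain Wald SPRT (a generic sketch using a normal approximation to the trinomial result distribution; the linked monitor.py may compute its LLR differently):

```python
import math

def sprt_bounds(alpha=0.05, beta=0.05):
    """Wald's SPRT bounds on the log-likelihood ratio: stop and accept H1
    when LLR >= upper, accept H0 when LLR <= lower; otherwise keep playing."""
    return math.log(beta / (1 - alpha)), math.log((1 - beta) / alpha)

def expected_score(elo):
    """Logistic expected score for a given Elo advantage."""
    return 1 / (1 + 10 ** (-elo / 400))

def llr(wins, losses, draws, elo0=-1.5, elo1=4.5):
    """Approximate LLR of H1 (elo = elo1) vs H0 (elo = elo0), treating the
    per-game score as normal with the observed variance."""
    n = wins + losses + draws
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1 - mean) ** 2 + draws * (0.5 - mean) ** 2
           + losses * mean ** 2) / n
    p0, p1 = expected_score(elo0), expected_score(elo1)
    return n * (p1 - p0) * (2 * mean - p0 - p1) / (2 * var)

lower, upper = sprt_bounds()   # ~(-2.94, +2.94) with alpha = beta = 0.05
```

With alpha = beta = 0.05 the test stops as soon as the running LLR crosses about +/-2.94; a narrow (elo0, elo1) window like (-1.5, 4.5) is what lets a small true gain trigger H1 after only a few thousand games.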
- Posts: 10948
- Joined: Wed Jul 26, 2006 10:21 pm
- Full name: Kai Laskos
Re: testing consistency
jdart wrote: ↑Wed Dec 26, 2018 3:16 am I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py [...]
So, as an estimate, since SPRT parameters and confidence margins were mixed up here, in terms of 95% confidence you had something like:
7 +/- 7 Elo points self-play bullet
-2 +/- 4 Elo points hyper-bullet against foreign opponents
-9 +/- 11 Elo points bullet against foreign opponents.
So it is sort of consistent within 2 SD, albeit with some bad luck, though not outrageously bad luck. All in all, I would reject that patch, as the results against foreign opponents are more reliable (but 4 times more expensive).
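The "2 SD" reasoning can be made concrete with a small helper (a sketch; each quoted margin is treated as the half-width of an independent 95% interval, i.e. 1.96 standard deviations):

```python
import math

def sigma_distance(est1, margin1, est2, margin2, z=1.96):
    """Number of combined standard deviations separating two Elo estimates,
    where each margin is a 95% half-width (z * SD)."""
    sd = math.sqrt((margin1 / z) ** 2 + (margin2 / z) ** 2)
    return abs(est1 - est2) / sd

# Estimates quoted above (Elo, 95% half-width):
d1 = sigma_distance(7, 7, -2, 4)    # self-play vs hyper-bullet, ~2.2 sigma
d2 = sigma_distance(7, 7, -9, 11)   # self-play vs blitz, ~2.4 sigma
d3 = sigma_distance(-2, 4, -9, 11)  # hyper-bullet vs blitz, ~1.2 sigma
```

Two of the gaps sit just past 2 sigma, which matches the "some bad luck, but not outrageous" reading.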
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: testing consistency
I agree and did reject it. I am also now using longer TC tests more often for validation.
--Jon