testing consistency

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

testing consistency

Post by jdart »

Recently I have been getting inconsistent results from testing at various time controls/conditions.

For example, one candidate change gave these results:

SPRT against master (1 opponent), 60+0.6 time control: significant + (H1) after about 3800 games.
Hyper-bullet, 40000 games against 5 opponents: -2 Elo [-4,4] relative to master branch.
3:0+1 blitz, 4000 games against 5 opponents: -9 Elo [-8,7] relative to master branch.

So in this case, at least, self-play testing at 60+0.6 doesn't seem to predict performance against other opponents. The hyper-bullet and blitz results also differ from each other. Of course some of this might be expected, since the test conditions are not the same. Still, if it happens frequently it would be concerning, because it's common to use 60+0.6 SPRT as a proxy for performance against other opponents at longer time controls. I am wondering if others have had a similar experience.

--Jon
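For reference, relative-Elo figures with error brackets like those above can be computed from raw match scores roughly as follows. This is a minimal sketch under the usual logistic Elo model; the function names are illustrative, not taken from Arasan's actual tooling:

Code: Select all

import math

def score_to_elo(score):
    # Elo difference implied by a match score fraction (logistic model)
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_elo_95(wins, losses, draws):
    # Elo estimate and approximate 95% margins for one match result,
    # using a normal approximation to the trinomial score distribution
    n = wins + losses + draws
    m = (wins + 0.5 * draws) / n          # mean score per game
    var = (wins * (1.0 - m) ** 2 + losses * m ** 2
           + draws * (0.5 - m) ** 2) / n  # per-game score variance
    sd = math.sqrt(var / n)               # standard error of the mean
    elo = score_to_elo(m)
    lo = score_to_elo(m - 1.96 * sd) - elo
    hi = score_to_elo(m + 1.96 * sd) - elo
    return elo, lo, hi

For scale, -2 Elo corresponds to a score of about 49.7%, so differences like these sit only a few tenths of a percent away from an even score.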
tomitank
Posts: 276
Joined: Sat Mar 04, 2017 12:24 pm
Location: Hungary

Re: testing consistency

Post by tomitank »

jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls/conditions. [...]
(I think) gains below 10 Elo points are unreliable to measure.
Fabien told me this once, and I have to say he was right.
Obviously it's an exaggeration, but there is no need to kill a lot of time on changes that do not bring enough benefit.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: testing consistency

Post by Ferdy »

jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls/conditions. [...]
When you say relative to master, did you also run matches between the master and the candidate's 5 opponents?
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: testing consistency

Post by jdart »

When you say relative to master, did you also run matches between the master and the candidate's 5 opponents?
Yes, the relative Elos I give for those matches are compared to the results the master branch had against the same opponents.

--Jon
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: testing consistency

Post by Ferdy »

jdart wrote: Sun Dec 16, 2018 5:42 pm
When you say relative to master, did you also run matches between the master and the candidate's 5 opponents?
Yes, the relative Elos I give for those matches are compared to the results the master branch had against the same opponents.

--Jon
Since the master performed better than the candidate against those 5 common opponents, I would keep the master as the best version. The performance against the 5 opponents has more weight; after all, in the real world the engine is matched against different opponents. Alternatively, extend the test with more games or more opponents.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: testing consistency

Post by Laskos »

jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls/conditions. [...]
What were H1 and H0?
All might be consistent within 2 sigma.
When comparing to master via results against the same opponents, the two independent measurement errors add in quadrature, so the error of the difference is sqrt(2), or about 1.41, times the error of each.
So the hyper-bullet 95% confidence result is -2 +/- 4 Elo points or so.
The blitz 95% confidence result is -9 +/- 11 Elo points or so.

Still, I would probably go for rejecting this patch. But again, what were H1 and H0 for the test to give a positive in only 3800 games? And what alpha and beta?
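The 1.41 factor is just the two independent gauntlet errors added in quadrature. A minimal sketch of that arithmetic (the function name is illustrative):

Code: Select all

import math

def relative_error(err_candidate, err_master):
    # standard error of (candidate Elo - master Elo) when both are
    # measured in independent gauntlets against the same opponents
    return math.sqrt(err_candidate ** 2 + err_master ** 2)

# e.g. two gauntlets each measured to +/- 8 Elo:
# relative_error(8.0, 8.0)  ->  about 11.3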
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: testing consistency

Post by jdart »

I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py

The Sprt function is called with elo1=-1.5, elo2=4.5,
alpha=0.05, beta=0.05.

--Jon
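For context, an SPRT with those parameters keeps adding games until a log-likelihood ratio crosses one of two stopping bounds. Below is a minimal sketch of the common normal-approximation form of the test; it is illustrative, not the actual code in monitor.py, and it names the hypotheses elo0/elo1 where the script above uses elo1/elo2:

Code: Select all

import math

def elo_to_score(elo):
    # expected score under the logistic Elo model
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, losses, draws, elo0, elo1):
    # log-likelihood ratio of H1 (elo = elo1) vs H0 (elo = elo0),
    # normal approximation over the trinomial game results
    n = wins + losses + draws
    if wins == 0 or losses == 0:
        return 0.0
    m = (wins + 0.5 * draws) / n          # mean score per game
    var = (wins * (1.0 - m) ** 2 + losses * m ** 2
           + draws * (0.5 - m) ** 2) / n  # per-game score variance
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * m - s0 - s1) / (2.0 * var)

def sprt_status(wins, losses, draws,
                elo0=-1.5, elo1=4.5, alpha=0.05, beta=0.05):
    # 'H1' = accept the change, 'H0' = reject, else keep playing
    llr = sprt_llr(wins, losses, draws, elo0, elo1)
    if llr >= math.log((1.0 - beta) / alpha):
        return 'H1'
    if llr <= math.log(beta / (1.0 - alpha)):
        return 'H0'
    return 'continue'

With alpha = beta = 0.05 the stopping bounds are +/- ln(19), about +/- 2.94, and with a wide (-1.5, 4.5) window a moderate lucky streak can push the LLR over the upper bound within a few thousand games, which fits the ~3800-game stop.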
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: testing consistency

Post by Laskos »

jdart wrote: Wed Dec 26, 2018 3:16 am I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py

The Sprt function is called with elo1=-1.5, elo2=4.5,
alpha=0.05, beta=0.05.

--Jon
So, as an estimate (you mixed an SPRT result in with confidence-margin results), in terms of 95% confidence you had something like:
7 +/- 7 Elo points in self-play bullet
-2 +/- 4 Elo points in hyper-bullet against foreign opponents
-9 +/- 11 Elo points in blitz against foreign opponents

So it is sort of consistent within 2 SD, although with some bad luck, if not outrageous bad luck. All in all, I would reject that patch, as the results against foreign opponents are more reliable (but about 4 times more expensive to obtain).
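One way to make the "consistent within 2 SD" check concrete is a z-score on the difference of two estimates. A minimal sketch using the figures above (it treats the estimates as independent, which is only approximately true here since the self-play test involves the master directly):

Code: Select all

import math

def z_score(elo_a, margin_a, elo_b, margin_b):
    # z-score of the difference between two Elo estimates,
    # given their 95% (~1.96 sigma) margins
    sd_a, sd_b = margin_a / 1.96, margin_b / 1.96
    return (elo_a - elo_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)

# self-play (7 +/- 7) vs. hyper-bullet gauntlet (-2 +/- 4):
# z_score(7, 7, -2, 4)  ->  about 2.2, right at the edge of 2 SD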
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: testing consistency

Post by jdart »

I agree and did reject it. I am also now using longer TC tests more often for validation.

--Jon