testing consistency

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

testing consistency

Post by jdart »

Recently I have been getting inconsistent results from testing at various time controls/conditions.

For example, one candidate change gave these results:

SPRT against master (1 opponent), 60+0.6 time control: significant + (H1) after about 3800 games.
Hyper-bullet, 40000 games against 5 opponents: -2 Elo [-4,4] relative to master branch.
3:0+1 blitz, 4000 games against 5 opponents: -9 Elo [-8,7] relative to master branch.

So in this case, at least, self-play testing at 60+0.6 doesn't seem to predict performance against other opponents. The hyper-bullet and blitz results also differ from each other. Of course some of this might be expected, since the test conditions are not the same. Still, if it happens frequently it would be concerning, because it's common to use 60+0.6 SPRT as a proxy for performance against other opponents at longer time controls. I am wondering if others have had a similar experience.

--Jon
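For reference, relative-Elo figures with error brackets like those above can be computed from raw match scores roughly as follows. This is a minimal sketch under the usual logistic Elo model; the function names are illustrative, not taken from Arasan's actual tooling:

Code: Select all

import math

def score_to_elo(score):
    # Elo difference implied by a match score fraction (logistic model)
    return -400.0 * math.log10(1.0 / score - 1.0)

def match_elo_95(wins, losses, draws):
    # Elo estimate and approximate 95% margins for one match result,
    # using a normal approximation to the trinomial score distribution
    n = wins + losses + draws
    m = (wins + 0.5 * draws) / n          # mean score per game
    var = (wins * (1.0 - m) ** 2 + losses * m ** 2
           + draws * (0.5 - m) ** 2) / n  # per-game score variance
    sd = math.sqrt(var / n)               # standard error of the mean
    elo = score_to_elo(m)
    lo = score_to_elo(m - 1.96 * sd) - elo
    hi = score_to_elo(m + 1.96 * sd) - elo
    return elo, lo, hi

For scale, -2 Elo corresponds to a score of about 49.7%, so differences like these sit only a few tenths of a percent away from an even score.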
tomitank
Posts: 276
Joined: Sat Mar 04, 2017 12:24 pm
Location: Hungary

Re: testing consistency

Post by tomitank »

jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls/conditions. [...]
(I think) gains below 10 Elo points are unreliable to measure.
Fabien told me this once, and I have to say he was right.
Obviously it's an exaggeration, but there is no need to kill a lot of time on changes that do not bring enough benefit.
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: testing consistency

Post by Ferdy »

jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls/conditions. [...]
When you say relative to master, did you also run matches between the master and the candidate's 5 opponents?
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: testing consistency

Post by jdart »

When you say relative to master, did you also run matches between the master and the candidate's 5 opponents?
Yes, the relative Elos I give for those matches are compared to the results the master branch had against the same opponents.

--Jon
Ferdy
Posts: 4833
Joined: Sun Aug 10, 2008 3:15 pm
Location: Philippines

Re: testing consistency

Post by Ferdy »

jdart wrote: Sun Dec 16, 2018 5:42 pm
When you say relative to master, did you also run matches between the master and the candidate's 5 opponents?
Yes, the relative Elos I give for those matches are compared to the results the master branch had against the same opponents.

--Jon
Since the master performed better than the candidate against those 5 common opponents, I would keep the master as the best version. The performance against the 5 opponents has more weight; after all, in the real world the engine is matched against different opponents. Alternatively, extend the test with more games or more opponents.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: testing consistency

Post by Laskos »

jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls/conditions. [...]
What were H1 and H0?
All might be consistent within 2 sigma.
When comparing to master via results against the same opponents, the two independent measurement errors add in quadrature, so the error of the difference is sqrt(2), or about 1.41, times the error of each.
So the hyper-bullet 95% confidence result is -2 +/- 4 Elo points or so.
The blitz 95% confidence result is -9 +/- 11 Elo points or so.

Still, I would probably go for rejecting this patch. But again, what were H1 and H0 for the test to give a positive in only 3800 games? And what alpha and beta?
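The 1.41 factor is just the two independent gauntlet errors added in quadrature. A minimal sketch of that arithmetic (the function name is illustrative):

Code: Select all

import math

def relative_error(err_candidate, err_master):
    # standard error of (candidate Elo - master Elo) when both are
    # measured in independent gauntlets against the same opponents
    return math.sqrt(err_candidate ** 2 + err_master ** 2)

# e.g. two gauntlets each measured to +/- 8 Elo:
# relative_error(8.0, 8.0)  ->  about 11.3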
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: testing consistency

Post by jdart »

I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py

The Sprt function is called with elo1=-1.5, elo2=4.5,
alpha=0.05, beta=0.05.

--Jon
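For context, an SPRT with those parameters keeps adding games until a log-likelihood ratio crosses one of two stopping bounds. Below is a minimal sketch of the common normal-approximation form of the test; it is illustrative, not the actual code in monitor.py, and it names the hypotheses elo0/elo1 where the script above uses elo1/elo2:

Code: Select all

import math

def elo_to_score(elo):
    # expected score under the logistic Elo model
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, losses, draws, elo0, elo1):
    # log-likelihood ratio of H1 (elo = elo1) vs H0 (elo = elo0),
    # normal approximation over the trinomial game results
    n = wins + losses + draws
    if wins == 0 or losses == 0:
        return 0.0
    m = (wins + 0.5 * draws) / n          # mean score per game
    var = (wins * (1.0 - m) ** 2 + losses * m ** 2
           + draws * (0.5 - m) ** 2) / n  # per-game score variance
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return n * (s1 - s0) * (2.0 * m - s0 - s1) / (2.0 * var)

def sprt_status(wins, losses, draws,
                elo0=-1.5, elo1=4.5, alpha=0.05, beta=0.05):
    # 'H1' = accept the change, 'H0' = reject, else keep playing
    llr = sprt_llr(wins, losses, draws, elo0, elo1)
    if llr >= math.log((1.0 - beta) / alpha):
        return 'H1'
    if llr <= math.log(beta / (1.0 - alpha)):
        return 'H0'
    return 'continue'

With alpha = beta = 0.05 the stopping bounds are +/- ln(19), about +/- 2.94, and with a wide (-1.5, 4.5) window a moderate lucky streak can push the LLR over the upper bound within a few thousand games, which fits the ~3800-game stop.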
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: testing consistency

Post by Laskos »

jdart wrote: Wed Dec 26, 2018 3:16 am I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py

The Sprt function is called with elo1=-1.5, elo2=4.5,
alpha=0.05, beta=0.05.

--Jon
So, as an estimate (you mixed an SPRT result in with confidence-margin results), in terms of 95% confidence you had something like:
7 +/- 7 Elo points in self-play bullet
-2 +/- 4 Elo points in hyper-bullet against foreign opponents
-9 +/- 11 Elo points in blitz against foreign opponents

So it is sort of consistent within 2 SD, although with some bad luck, if not outrageous bad luck. All in all, I would reject that patch, as the results against foreign opponents are more reliable (but about 4 times more expensive to obtain).
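One way to make the "consistent within 2 SD" check concrete is a z-score on the difference of two estimates. A minimal sketch using the figures above (it treats the estimates as independent, which is only approximately true here since the self-play test involves the master directly):

Code: Select all

import math

def z_score(elo_a, margin_a, elo_b, margin_b):
    # z-score of the difference between two Elo estimates,
    # given their 95% (~1.96 sigma) margins
    sd_a, sd_b = margin_a / 1.96, margin_b / 1.96
    return (elo_a - elo_b) / math.sqrt(sd_a ** 2 + sd_b ** 2)

# self-play (7 +/- 7) vs. hyper-bullet gauntlet (-2 +/- 4):
# z_score(7, 7, -2, 4)  ->  about 2.2, right at the edge of 2 SD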
jdart
Posts: 4366
Joined: Fri Mar 10, 2006 5:23 am
Location: http://www.arasanchess.org

Re: testing consistency

Post by jdart »

I agree and did reject it. I am also now using longer TC tests more often for validation.

--Jon