
testing consistency

Posted: Sun Dec 16, 2018 3:58 pm
by jdart
Recently I have been getting inconsistent results from testing at various time controls and conditions.

For example, one candidate change gave these results:

SPRT against master (1 opponent), 60+0.6 time control: significant + (H1) after about 3800 games.
Hyper-bullet, 40000 games against 5 opponents: -2 ELO [-4,4] relative to master branch.
3:0+1 blitz, 4000 games against 5 opponents: -9 ELO [-8,7] relative to master branch.

So in this case, at least, self-play testing at 60+0.6 doesn't seem to predict performance against other opponents. The hyper-bullet and blitz results also differ from each other. Of course this might be expected, since the test conditions are not the same. Still, if it occurs frequently it would be concerning, because it's common to use 60+0.6 SPRT self-play as a proxy for performance against other opponents at longer time controls. I am wondering if others have had a similar experience.

--Jon

Re: testing consistency

Posted: Sun Dec 16, 2018 4:27 pm
by tomitank
jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls and conditions. [...]
(I think) improvements below 10 Elo points are unreliable to measure.
Fabien told me this once, and I have to say he was right.
Obviously it's an exaggeration, but it's not worth spending a lot of time on changes that don't bring enough benefit.

Re: testing consistency

Posted: Sun Dec 16, 2018 5:22 pm
by Ferdy
jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls and conditions. [...]
When you say relative to master, did you also run matches between the master and those same 5 opponents?

Re: testing consistency

Posted: Sun Dec 16, 2018 5:42 pm
by jdart
Ferdy wrote: Sun Dec 16, 2018 5:22 pm When you say relative to master, did you also run matches between the master and those same 5 opponents?
Yes, when I give relative ELOs for those matches, they are compared to the results the master branch had.

--Jon

Re: testing consistency

Posted: Mon Dec 17, 2018 11:26 am
by Ferdy
jdart wrote: Sun Dec 16, 2018 5:42 pm
When you say relative to master, did you also run matches between the master and those same 5 opponents?
Yes, when I give relative ELOs for those matches, they are compared to the results the master branch had.

--Jon
Since the master performed better than the candidate against those 5 common opponents, I would keep the master as the best version. The performance against the 5 opponents carries more weight; after all, in the real world the engine is matched against different opponents. Alternatively, extend the test with more games or add more opponents.

Re: testing consistency

Posted: Tue Dec 25, 2018 8:56 pm
by Laskos
jdart wrote: Sun Dec 16, 2018 3:58 pm Recently I have been getting inconsistent results from testing at various time controls and conditions. [...]
What were H1 and H0?
All might be consistent within 2 sigma.
Comparing the candidate to the master against the same opponents, the error of the difference is about 1.41 times (sqrt(2)) the error of each individual run.
So hyper-bullet 95% confidence result is -2 +/- 4 Elo points or so.
Blitz 95% confidence result is -9 +/- 11 Elo points or so.
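
As a rough sketch of where that 1.41 factor comes from (assuming the candidate and master gauntlets are independent runs with roughly equal error bars; the 3-Elo margins below are hypothetical):

import math

# When the candidate and the master are each measured against the same set of
# opponents, the errors of the two runs combine in quadrature for the
# candidate-minus-master difference; with equal errors that is the sqrt(2) ~ 1.41 factor.
def combined_error(err_candidate, err_master):
    return math.sqrt(err_candidate ** 2 + err_master ** 2)

# Hypothetical equal 95% margins of 3 Elo for each gauntlet:
print(combined_error(3.0, 3.0))   # about 4.24, i.e. 1.41 x 3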

Still, I would probably go for rejecting this patch, but again, what were H1 and H0 for the test to accept H1 in only about 3800 games? And what alpha and beta?

Re: testing consistency

Posted: Wed Dec 26, 2018 3:16 am
by jdart
I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py

The SPRT function is called with elo1=-1.5, elo2=4.5, alpha=0.05, beta=0.05.
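
For reference, a minimal sketch of how parameters like these typically turn into a stopping rule, using the common normal-approximation log-likelihood ratio. This is only an illustration, not necessarily what monitor.py does; I use the more common elo0/elo1 naming, and the W/D/L record in the example is made up:

import math

def elo_to_score(elo):
    # Expected score for a given Elo difference (logistic model).
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def sprt_llr(wins, draws, losses, elo0, elo1):
    # Approximate log-likelihood ratio for H0: elo = elo0 vs H1: elo = elo1,
    # using a normal approximation to the per-game score distribution.
    n = wins + draws + losses
    mean = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - mean) ** 2
           + draws * (0.5 - mean) ** 2
           + losses * (0.0 - mean) ** 2) / n
    s0, s1 = elo_to_score(elo0), elo_to_score(elo1)
    return (s1 - s0) * (2.0 * mean - s0 - s1) * n / (2.0 * var)

def sprt_bounds(alpha, beta):
    # Stop and accept H0 when LLR <= lower, accept H1 when LLR >= upper.
    return math.log(beta / (1.0 - alpha)), math.log((1.0 - beta) / alpha)

lower, upper = sprt_bounds(0.05, 0.05)   # roughly -2.94 and +2.94
# Hypothetical 3800-game record (not the actual test data):
llr = sprt_llr(wins=1060, draws=1800, losses=940, elo0=-1.5, elo1=4.5)
print(lower, upper, llr)                 # llr is about 3.4 here, so H1 would be accepted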

--Jon

Re: testing consistency

Posted: Wed Dec 26, 2018 10:07 am
by Laskos
jdart wrote: Wed Dec 26, 2018 3:16 am I am using this script: https://github.com/jdart1/arasan-chess/ ... monitor.py

The SPRT function is called with elo1=-1.5, elo2=4.5, alpha=0.05, beta=0.05.

--Jon
So, as an estimate, since you mixed an SPRT result with confidence-interval results, in terms of 95% confidence you had something like:
7 +/- 7 Elo points self-play bullet
-2 +/- 4 Elo points hyper-bullet against foreign opponents
-9 +/- 11 Elo points blitz against foreign opponents.

So it is sort of consistent within 2 SD, albeit with some bad luck, though nothing outrageous. All in all, I would reject that patch, as the results against foreign opponents are more reliable (but 4 times more expensive).
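
As a rough sketch of how a game record translates into such an Elo estimate with a 95% margin (logistic model, normal approximation; the W/D/L record below is hypothetical, picked only to land near the 7 +/- 7 figure):

import math

def elo_and_margin(wins, draws, losses):
    # Elo point estimate and rough 95% margin from a W/D/L record
    # (logistic model, normal approximation of the score distribution).
    n = wins + draws + losses
    score = (wins + 0.5 * draws) / n
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / n
    elo = -400.0 * math.log10(1.0 / score - 1.0)
    # Propagate the score error through the slope of the logistic Elo curve.
    slope = 400.0 / (math.log(10.0) * score * (1.0 - score))
    margin95 = 1.96 * math.sqrt(var / n) * slope
    return elo, margin95

# Hypothetical 3800-game record, not the actual data:
print(elo_and_margin(wins=838, draws=2200, losses=762))   # roughly (7.0, 7.2)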

Re: testing consistency

Posted: Wed Dec 26, 2018 4:54 pm
by jdart
I agree and did reject it. I am also now using longer TC tests more often for validation.

--Jon