I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.
But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:
GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%
but at 40/4 time control (40 games):
GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%
This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?
fast game testing
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
-
- Posts: 613
- Joined: Sun Jan 18, 2009 7:03 am
Re: fast game testing
jdart wrote:
but at 40/4 time control (40 games):
GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%

40 games means nothing, that's the problem. I often see results like this after 40 games in self-play, but after a couple of thousand games the result gets close to level.
Joona Kiiski
-
- Posts: 2559
- Joined: Fri Nov 26, 2010 2:00 pm
- Location: Czech Republic
- Full name: Martin Sedlak
Re: fast game testing
My guess is that 40 games are not enough. Try to play 200 and see if it holds.
Martin
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: fast game testing
Well, that is the obvious answer, but also if you look at a rating list such as
http://www.husvankempen.de/nunn/40_4_Ra ... liste.html
this is based on 40/4 and Arasan is +60 elo above GNU Chess. In fact it should be more because Arasan had 2 cores for my test run. So the fast time control games do not match this.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: fast game testing
jdart wrote:
I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.
But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:
GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%
but at 40/4 time control (40 games):
GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%
This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?

This has been noticed for many years. A few do not believe it happens, but the time control can absolutely affect the final results. If you go too far, the clock becomes a major issue of course, as many programs have problems at very short time controls and begin to lose on time, even in winning positions. But in normal programs, where you play A vs B, you will see disparities in long vs short games. A good while back (when we first started cluster testing) I discovered that Crafty would absolutely crush Fruit at very fast time controls (even though zero games were lost on time) but in longer games, like 60+60, Fruit would beat Crafty convincingly.
However, that is not what I really care about. What I am interested in is whether Crafty version X+1 is better than Crafty version X. And there, almost any time control will work if I just compare the Elo of the two versions. If X+1 is better, it will have a better rating as it crushes Fruit at short time controls, and it will have a better rating as it gets crushed by Fruit at long time controls.
If your goal is to compare two different programs, then this becomes more important. And I should add there are a number of cases I have found where even comparing X to X+1 can get one answer at short games, another answer at long games (X is better at fast games, X+1 at longer games, or vice-versa).
Best testing is at the time control you intend to play. But that is impractical unless you have 64K node clusters or something equally huge. I don't have anything approaching that. 30K games at 60+60 takes a week or more.
-
- Posts: 20943
- Joined: Mon Feb 27, 2006 7:30 pm
- Location: Birmingham, AL
Re: fast game testing
jdart wrote:
Well, that is the obvious answer, but also if you look at a rating list such as
http://www.husvankempen.de/nunn/40_4_Ra ... liste.html
this is based on 40/4 and Arasan is +60 elo above GNU Chess. In fact it should be more because Arasan had 2 cores for my test run. So the fast time control games do not match this.

This is dangerous. You should NEVER look at the 40 game results. The error bar is huge. I don't look until the 3-4K game level, and even then the results can change by 30K games...
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: fast game testing
The rating list I quoted is based on 1000 games. The rating difference between Arasan and GnuChess exceeds the error bar.
-
- Posts: 1971
- Joined: Wed Jul 13, 2011 9:04 pm
- Location: Madrid, Spain.
Error bars for this couple of matches.
Hello Jon:
jdart wrote:
I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.
But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:
GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%
but at 40/4 time control (40 games):
GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%
This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?

I agree with everyone here: more games are needed to narrow the uncertainties of the match.
I post for you the uncertainties that I got, using your match statistics (wins, draws and losses); this method of my own is neither BayesElo nor EloStat, but it should give similar results.
I hope there are no typos in the calculations below. With those forty games, the error bar is more or less |+98.42| + |-83.84| ~ 182.26 ~ 2·|<e>|, which is huge. With 200 games, keeping the draw ratio of 35% and the score of 65%: <e> ~ ± 39.82 and K ~ 563.11, if I am not wrong. IMHO, the obvious solution (playing more games) is the right one. It depends on the error bar you want. I wish you good luck with Arasan!
(Referred to the first engine; I assume that is Arasan):
n = number of games
w = number of wins
l = number of losses
d = number of draws
D = draw ratio
mu = relative score
R = rating of the second engine (I assume that is GNU)
rd = rating difference
sd = standard deviation
rd(+) = upper rating difference
rd(-) = lower rating difference
e(+) = uncertainty between rd and rd(+)
e(-) = uncertainty between rd and rd(-)
<e> = average uncertainty
n = w + l + d
D = d/n
mu = (w + d/2)/n
1 - mu = (d/2 + l)/n
rd = 400·log[mu/(1 - mu)]
sd = sqrt{(1/n)·[mu·(1 - mu) - D/4]}
rd(+) = 400·log{[mu + (1.96)·sd]/[1 - mu - (1.96)·sd]}
rd(-) = 400·log{[mu - (1.96)·sd]/[1 - mu + (1.96)·sd]}
e(+) = [rd(+)] - rd > 0
e(-) = [rd(-)] - rd < 0
<e> = ±[|e(+)| + |e(-)|]/2 = ±{[e(+)] - [e(-)]}/2 = ± 200·log{[mu + (1.96)·sd]·[1 - mu + (1.96)·sd]/([mu - (1.96)·sd]·[1 - mu - (1.96)·sd])}
K = |<e>·sqrt(n)|
K is a 'sanity check': for 95% confidence (~ 1.96-sigma), usual values are between 550 and 600 most of the time (but not always), in my limited experience. That does not mean that K cannot be less than 550 or greater than 600.
Rating interval (with 1.96-sigma confidence ~ 95% confidence): ]R + rd(-), R + rd(+)[
(Calculations have been done with a Casio calculator, so they may contain errors and rounding).
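Since K = |<e>·sqrt(n)| stays in a fairly narrow band, it can be turned around to get a rough idea of how many games a given error bar costs: n ~ (K/<e>)^2. The sketch below does only that. The function name `games_needed` and the default K = 575 (the middle of the 550-600 range mentioned above) are assumptions made for illustration, and the real K depends on the score and the draw ratio, so treat the output as a ballpark figure.

```python
import math

def games_needed(target_error_elo, K=575.0):
    """Rough number of games for a ~95% error bar of +/- target_error_elo.

    Inverts K = |<e> * sqrt(n)|. K=575 is an assumed typical value, not a constant.
    """
    if target_error_elo <= 0:
        raise ValueError("target_error_elo must be positive")
    return math.ceil((K / target_error_elo) ** 2)

if __name__ == "__main__":
    for e in (30, 10, 5):
        print(f"+/- {e} Elo: about {games_needed(e)} games")
```

Halving the error bar costs four times as many games, which is why a 40-game match says so little.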
==================================================================================================
n = 400:
197.5 - 202.5 (+149 -154 = 97)
rd ~ -4.34
(1.96)·sd ~ 4.26425%
(1.96)·n·sd ~ 17.057 points
rd(+) ~ +25.33 ; e(+) ~ +29.68
rd(-) ~ -34.08 ; e(-) ~ -29.74
<e> ~ ± 29.71 ; K = |<e>·sqrt(n)| ~ 594.16
[Rating interval (with 1.96-sigma confidence ~ 95% confidence)] ~ ]R - 34.08, R + 25.33[
==================================================================================================
n = 40:
26 - 14 (+19 -7 = 14)
rd ~ +107.54
(1.96)·sd ~ 11.5955%
(1.96)·n·sd ~ 4.6382 points
rd(+) ~ +205.96 ; e(+) ~ +98.42
rd(-) ~ +23.69 ; e(-) ~ -83.84
<e> ~ ± 91.13 ; K = |<e>·sqrt(n)| ~ 576.38
[Rating interval (with 1.96-sigma confidence ~ 95% confidence)] ~ ]R + 23.69, R + 205.96[
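For anyone who would rather not redo this on a calculator, here is a minimal Python sketch of the same formulas (log is base 10, and 1.96 is the 95% factor). The function name `elo_error_bars` and the dictionary keys are illustrative choices, not anything from BayesElo or EloStat, and the code assumes mu ± 1.96·sd stays strictly between 0 and 1; for very lopsided or very short matches the logarithm is undefined.

```python
import math

def elo_error_bars(wins, losses, draws, z=1.96):
    """Rating difference and error bars from the first engine's point of view.

    Follows the formulas listed above; z=1.96 corresponds to ~95% confidence.
    Raises ValueError if mu +/- z*sd falls outside the open interval (0, 1).
    """
    n = wins + losses + draws
    D = draws / n                                   # draw ratio
    mu = (wins + draws / 2) / n                     # relative score
    sd = math.sqrt((mu * (1 - mu) - D / 4) / n)     # std. deviation of the score
    if not (0 < mu - z * sd and mu + z * sd < 1):
        raise ValueError("score interval leaves (0, 1); too few games or too one-sided")
    rd = 400 * math.log10(mu / (1 - mu))
    rd_plus = 400 * math.log10((mu + z * sd) / (1 - mu - z * sd))
    rd_minus = 400 * math.log10((mu - z * sd) / (1 - mu + z * sd))
    e_avg = (rd_plus - rd_minus) / 2                # average uncertainty <e>
    return {
        "rd": rd,
        "rd_plus": rd_plus,
        "rd_minus": rd_minus,
        "e_plus": rd_plus - rd,
        "e_minus": rd_minus - rd,
        "e_avg": e_avg,
        "K": e_avg * math.sqrt(n),
    }

if __name__ == "__main__":
    for w, l, d in ((149, 154, 97), (19, 7, 14)):
        r = elo_error_bars(w, l, d)
        print(f"+{w} -{l} ={d}: rd {r['rd']:+.2f}, "
              f"interval [{r['rd_minus']:+.2f}, {r['rd_plus']:+.2f}], K {r['K']:.2f}")
```

Run on the two matches above, it should agree with the hand-calculated values to within rounding.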
Regards from Spain.
Ajedrecista.
-
- Posts: 171
- Joined: Wed Dec 28, 2011 8:44 pm
- Location: United States
Re: fast game testing
jdart wrote:
I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.
But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:
GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%
but at 40/4 time control (40 games):
GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%
This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?

I noticed the same thing when I went from 60+1 second games to 6+0.1 second games for testing EXchess. Arasan was (is) one of my regular testing opponents, and it went from beating EXchess handily to being slightly worse at the faster time controls. I was using xboard, so it was easy to see that Arasan was being *very* conservative in its time usage in the 6+0.1 games... it essentially did not eat into the 6 seconds remaining at all and just worked with the increment.
- Dan
-
- Posts: 4367
- Joined: Fri Mar 10, 2006 5:23 am
- Location: http://www.arasanchess.org
Re: fast game testing
Ok, maybe I need to work on that. I haven't till recently run it at insanely fast time controls. It does fine on the chess servers at normal blitz levels.
--Jon