fast game testing

jdart · Post by **jdart** » Sun Jan 08, 2012 3:41 pm

I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.

But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:

GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%

but at 40/4 time control (40 games):

GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%

This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?

zamar · Post by **zamar** » Sun Jan 08, 2012 3:47 pm

jdart wrote: but at 40/4 time control (40 games):

GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%

40 games means nothing; that's the problem. I often see results like this after 40 games in self-play, but after a couple of thousand games the result gets close to level.

mar · Post by **mar** » Sun Jan 08, 2012 3:47 pm

My guess is that 40 games are not enough. Try to play 200 and see if it holds.

Martin

jdart · Post by **jdart** » Sun Jan 08, 2012 4:11 pm

Well, that is the obvious answer, but also if you look at a rating list such as

http://www.husvankempen.de/nunn/40_4_Ra ... liste.html

this is based on 40/4 and Arasan is +60 Elo above GNU Chess. In fact the gap should be larger, because Arasan had 2 cores for my test run. So the fast time control games do not match this.
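To put the rating-list figure on the same scale as the match percentages, the standard logistic Elo formula converts a rating gap into an expected score. The following is only a minimal sketch: the function name `expected_score` is illustrative, and the 60 Elo gap is simply the figure quoted above, not a new measurement.

```python
def expected_score(elo_diff: float) -> float:
    """Expected score of the higher-rated side under the logistic Elo model."""
    return 1.0 / (1.0 + 10.0 ** (-elo_diff / 400.0))


# Gap quoted from the 40/4 rating list (Arasan over GNU Chess).
list_expectation = expected_score(60)

# Observed scores from the two matches in this thread: (wins + draws/2) / games.
fast_score = (149 + 97 / 2) / 400  # 10+0.1 match
slow_score = (19 + 14 / 2) / 40    # 40/4 match

print(f"rating list implies {list_expectation:.1%}")
print(f"10+0.1 match scored {fast_score:.1%}")
print(f"40/4 match scored   {slow_score:.1%}")
```

A 60 Elo gap corresponds to an expected score of roughly 58.5%, which lies between the two match results.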

bob · Post by **bob** » Sun Jan 08, 2012 5:30 pm

jdart wrote:I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.

But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:

GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%

but at 40/4 time control (40 games):

GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%

This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?

This has been noticed for many years. A few do not believe it happens, but the time control can absolutely affect the final results. If you go too far, the clock becomes a major issue of course, as many programs have problems at very short time controls and begin to lose on time, even in winning positions. But even with normal programs, when you play A vs B you will see disparities between long and short games. A good while back (when we first started cluster testing) I discovered that Crafty would absolutely crush Fruit at very fast time controls (even though zero games were lost on time), but in longer games, like 60+60, Fruit would beat Crafty convincingly.

However, that is not what I really care about. What I am interested in is whether Crafty version X+1 is better than Crafty version X. And there, almost any time control will work if I just compare the Elo of the two versions. If X+1 is better, it will have a better rating as it crushes Fruit at short time controls, and it will have a better rating as it gets crushed by Fruit at long time controls.

If your goal is to compare two different programs, then this becomes more important. And I should add there are a number of cases I have found where even comparing X to X+1 can get one answer at short games, another answer at long games (X is better at fast games, X+1 at longer games, or vice-versa).

Best testing is at the time control you intend to play. But that is impractical unless you have 64K node clusters or something equally huge. I don't have anything approaching that. 30K games at 60+60 takes a week or more.

bob · Post by **bob** » Sun Jan 08, 2012 5:32 pm

jdart wrote: Well, that is the obvious answer, but also if you look at a rating list such as

http://www.husvankempen.de/nunn/40_4_Ra ... liste.html

this is based on 40/4 and Arasan is +60 Elo above GNU Chess. In fact the gap should be larger, because Arasan had 2 cores for my test run. So the fast time control games do not match this.

This is dangerous. You should NEVER look at the 40 game results. The error bar is huge. I don't look until the 3-4K game level, and even then the results can change by 30K games...

jdart · Post by **jdart** » Sun Jan 08, 2012 5:43 pm

The rating list I quoted is based on 1000 games. The rating difference between Arasan and GnuChess exceeds the error bar.

Ajedrecista · Post by **Ajedrecista** » Sun Jan 08, 2012 5:52 pm

Hello Jon:

jdart wrote:I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.

But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:

GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%

but at 40/4 time control (40 games):

GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%

This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?

I agree with all the people here: more games are needed for narrowing the uncertainties of the match.

I post for you the uncertainties that I got, using your match statistics (wins, draws and losses); my own method is neither BayesElo nor EloStat, but it should give similar results:

(Referred to the first engine; I assume that is Arasan):

n = number of games
w = number of wins
l = number of losses
d = number of draws
D = draw ratio
mu = relative score
R = rating of the second engine (I assume that is GNU)
rd = rating difference
sd = standard deviation
rd(+) = upper rating difference
rd(-) = lower rating difference
e(+) = uncertainty between rd and rd(+)
e(-) = uncertainty between rd and rd(-)
<e> = average uncertainty

n = w + l + d
D = d/n
mu = (w + d/2)/n
1 - mu = (d/2 + l)/n

rd = 400·log[mu/(1 - mu)]
sd = sqrt{(1/n)·[mu·(1 - mu) - D/4]}
rd(+) = 400·log{[mu + (1.96)·sd]/[1 - mu - (1.96)·sd]}
rd(-) = 400·log{[mu - (1.96)·sd]/[1 - mu + (1.96)·sd]}
e(+) = [rd(+)] - rd > 0
e(-) = [rd(-)] - rd < 0
<e> = ±[|e(+)| + |e(-)|]/2 = ±{[e(+)] - [e(-)]}/2 = ± 200·log{[mu + (1.96)·sd]·[1 - mu + (1.96)·sd]/([mu - (1.96)·sd]·[1 - mu - (1.96)·sd])}
K = |<e>·sqrt(n)|

K is a 'sanity check': usual values (most of the time, but not always) for 95% confidence (~ 1.96-sigma) are between 550 and 600, in my limited experience. This does not mean that K cannot be less than 550 or greater than 600.

Rating interval (with 1.96-sigma confidence ~ 95% confidence): ]R + rd(-), R + rd(+)[

(Calculations were done with a Casio calculator, so they may contain errors and rounding differences.)
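The formulas above translate directly into a few lines of Python. This is a sketch under stated assumptions: `log` is taken to be the base-10 logarithm (as in the usual Elo formula), and the function and key names (`error_bars`, `rd_hi`, `rd_lo`, and so on) are illustrative, not from any existing tool. It is not a replacement for BayesElo or EloStat.

```python
import math


def error_bars(wins: int, losses: int, draws: int, z: float = 1.96) -> dict:
    """Rating difference and approximate bounds from a win/loss/draw record.

    z = 1.96 corresponds to roughly 95% confidence.
    """
    n = wins + losses + draws
    draw_ratio = draws / n
    mu = (wins + draws / 2) / n
    # Standard deviation of the mean score: sqrt{(1/n)·[mu·(1 - mu) - D/4]}.
    sd = math.sqrt((mu * (1 - mu) - draw_ratio / 4) / n)
    if not (0 < mu - z * sd and mu + z * sd < 1):
        raise ValueError("too few games: a score bound leaves the open interval (0, 1)")

    def elo(score: float) -> float:
        # rd = 400·log[score / (1 - score)], with log taken as base 10.
        return 400 * math.log10(score / (1 - score))

    rd = elo(mu)
    rd_hi = elo(mu + z * sd)
    rd_lo = elo(mu - z * sd)
    avg_e = (rd_hi - rd_lo) / 2
    return {
        "n": n, "mu": mu, "sd": sd,
        "rd": rd, "rd_hi": rd_hi, "rd_lo": rd_lo,
        "e_hi": rd_hi - rd, "e_lo": rd_lo - rd,
        "avg_e": avg_e, "K": abs(avg_e) * math.sqrt(n),
    }


# The two matches from this thread, as (wins, losses, draws) for the first engine.
for record in [(149, 154, 97), (19, 7, 14)]:
    r = error_bars(*record)
    print(f"n={r['n']}: rd={r['rd']:+.2f}  "
          f"rd(-)={r['rd_lo']:+.2f}  rd(+)={r['rd_hi']:+.2f}  "
          f"<e>=±{r['avg_e']:.2f}  K={r['K']:.1f}")
```

The sketch should agree with the hand-calculated figures in this post to within rounding.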

==================================================================================================

n = 400:

197.5 - 202.5 (+149 -154 = 97)
rd ~ -4.34
(1.96)·sd ~ 4.26425%
(1.96)·n·sd ~ 17.057 points

rd(+) ~ +25.33 ; e(+) ~ +29.68
rd(-) ~ -34.08 ; e(-) ~ -29.74
<e> ~ ± 29.71 ; K = |<e>·sqrt(n)| ~ 594.16

[Rating interval (with 1.96-sigma confidence ~ 95% confidence)] ~ ]R - 34.08, R + 25.33[

==================================================================================================

n = 40:

26 - 14 (+19 -7 = 14)
rd ~ +107.54
(1.96)·sd ~ 11.5955%
(1.96)·n·sd ~ 4.6382 points

rd(+) ~ +205.96 ; e(+) ~ +98.42
rd(-) ~ +23.69 ; e(-) ~ -83.84
<e> ~ ± 91.13 ; K = |<e>·sqrt(n)| ~ 576.38

[Rating interval (with 1.96-sigma confidence ~ 95% confidence)] ~ ]R + 23.69, R + 205.96[

I hope there are no typos in these calculations. With those forty games, the error bar is more or less |+98.42| + |-83.84| ~ 182.26 ~ 2·|<e>|, which is huge. With 200 games, keeping the draw ratio of 35% and the score of 65%: <e> ~ ± 39.82 and K ~ 563.11, if I am not mistaken. IMHO, the obvious solution (playing more games) is the right one. It depends on the error bar you want. I wish you good luck with Arasan!
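Since the number of games depends on the error bar you want, the same formulas can be turned around to estimate how many games are needed. The following is a rough sketch, assuming the score and draw ratio stay fixed as the match grows (which a real match will not do exactly); `avg_error` and `games_needed` are hypothetical helper names, and `log` is again taken as base 10.

```python
import math


def avg_error(n: int, mu: float, draw_ratio: float, z: float = 1.96) -> float:
    """Average uncertainty <e> in Elo for n games at a fixed score and draw ratio."""
    sd = math.sqrt((mu * (1 - mu) - draw_ratio / 4) / n)
    if mu - z * sd <= 0 or mu + z * sd >= 1:
        return math.inf  # too few games for the formula to apply
    hi = 400 * math.log10((mu + z * sd) / (1 - mu - z * sd))
    lo = 400 * math.log10((mu - z * sd) / (1 - mu + z * sd))
    return (hi - lo) / 2


def games_needed(target: float, mu: float, draw_ratio: float,
                 step: int = 10, limit: int = 1_000_000) -> int:
    """Smallest multiple of `step` games for which <e> is at most `target` Elo."""
    for n in range(step, limit + 1, step):
        if avg_error(n, mu, draw_ratio) <= target:
            return n
    raise ValueError("target not reached within the game limit")


# Score of 65% and draw ratio of 35%, as in the 40-game match.
for target in (40, 20, 10):
    n = games_needed(target, mu=0.65, draw_ratio=0.35)
    print(f"<e> <= ±{target} Elo needs about {n} games")
```

Because <e> shrinks roughly as 1/sqrt(n), halving the error bar takes about four times as many games.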

Regards from Spain.

Ajedrecista.

dchoman · Post by **dchoman** » Sun Jan 08, 2012 5:59 pm

jdart wrote:I have started running matches at fast time control (10 sec + 0.1 sec increment) for testing and tuning purposes. I am not yet doing a huge quantity of games but have done 2-3 thousand overnight, using cutechess-cli. I am getting depth 10-14 typically at this time control, enough to have multithreading useful if enabled.

But I am a bit bothered because I get significantly different match results at longer time control. For example: a recent run at 10+0.1:

GNU Chess 5.07.170.5b_TCEC : +149 -154 =97 400 total 49.375%

but at 40/4 time control (40 games):

GNU Chess 5.07.170.5b_TCEC : +19 -7 =14 40 total 65%

This could just be fluctuation but it is a big difference. I am not getting any losses on time. So, is my program just bad at short time control, or is this sort of thing usual?

I noticed the same thing when I went from 60+1 second games to 6+0.1 second games for testing EXchess. Arasan was (is) one of my regular testing opponents, and it went from beating EXchess handily to being slightly worse at the faster time controls. I was using xboard, so it was easy to see that Arasan was being *very* conservative in its time usage in the 6+0.1 games: it essentially did not eat into the 6 seconds remaining at all and just worked with the increment.

- Dan

jdart · Post by **jdart** » Sun Jan 08, 2012 6:23 pm

Ok, maybe I need to work on that. Until recently I hadn't run it at insanely fast time controls. It does fine on the chess servers at normal blitz levels.

--Jon
