Precision of cutechess-cli

Carbec
Posts: 162
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Precision of cutechess-cli

Post by Carbec »

Hi,
I run gauntlet tournaments to check whether a modification is OK, like everybody else, I think. But now I wonder whether the reported Elo is valid or not. For example, after:

1000 games : elo = 36 +- 19 , so 17-55
2000 games : elo = 29 +- 13 , so 16-42
3000 games : elo = 26 +- 11 , so 15-37
4000 games : elo = 22 +- 9  , so 13-31

As you can see, the lower bound of the interval is decreasing, while one would expect it to increase. I stopped the tournament at 4000 games, as it takes a very long time on my PC. But what would happen later?
Do you see the same behavior?
Thanks for the info.

Here is a picture of the evolution:

[Image: evolution of the Elo estimate and its error bounds over the 4000 games]

Philippe
Ajedrecista
Posts: 2099
Joined: Wed Jul 13, 2011 9:04 pm
Location: Madrid, Spain.

Re: Precision of cutechess-cli.

Post by Ajedrecista »

Hello Philippe:

I do not see any problem. Keep in mind that the central score is also dropping with each 1000-game group. More or less:

Code: Select all

Games 1 to 1000: +36 Elo ~ 55.16% (let's say 551.5 vs 448.5).

Games 1 to 2000: +29 Elo ~ 54.16% (let's say 1083.5 vs 916.5).
Discounting games 1 to 1000:
Games 1001 to 2000 circa 532 vs 468 ~ +22 Elo.

Games 1 to 3000: +26 Elo ~ 53.73% (let's say 1612 vs 1388).
Discounting games 1 to 2000:
Games 2001 to 3000 circa 528.5 vs 471.5 ~ +20 Elo.

Games 1 to 4000: +22 Elo ~ 53.16% (let's say 2126.5 vs 1873.5).
Discounting games 1 to 3000:
Games 3001 to 4000 circa 514.5 vs 485.5 ~ +10 Elo.
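
To double-check these segment numbers, here is a small Python sketch (my own helper functions, not part of cutechess-cli) that converts Elo to the expected score with µ = 1/(1 + 10^(-Elo/400)), differences the cumulative points, and converts each 1000-game segment back to Elo:

Code: Select all

import math

def elo_to_score(elo):
    # Expected score for an Elo difference: µ = 1/(1 + 10^(-Elo/400)).
    return 1.0 / (1.0 + 10.0 ** (-elo / 400.0))

def score_to_elo(mu):
    # Inverse transformation: Elo = 400*log10[µ/(1 - µ)].
    return 400.0 * math.log10(mu / (1.0 - mu))

# Cumulative Elo reported after each 1000-game block.
checkpoints = [(1000, 36), (2000, 29), (3000, 26), (4000, 22)]

prev_points, prev_n = 0.0, 0
for n, elo in checkpoints:
    points = n * elo_to_score(elo)  # cumulative points after n games
    segment_mu = (points - prev_points) / (n - prev_n)
    print(f"Games {prev_n + 1} to {n}: ~{score_to_elo(segment_mu):+.0f} Elo")
    prev_points, prev_n = points, n

It prints approximately +36, +22, +20 and +10 Elo for the four segments, matching the figures above.
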
I do not know whether cutechess-cli uses the trinomial model or the pentanomial model to compute error bars. I am going to use the simpler trinomial model, where adding one lost game lowers the central score by more than it reduces the sample standard deviation:

Code: Select all

µ_n     = points/n
µ_(n+1) = points/(n + 1)   (game n+1 is a loss, so the points do not change)

points = n*µ_n = (n + 1)*µ_(n+1)
µ_(n+1)     = [n/(n + 1)]*µ_n
1 - µ_(n+1) = [n*(1 - µ_n) + 1]/(n + 1)

µ_(n+1) - µ_n = points*[1/(n + 1) - 1/n] = -points/[n*(n + 1)] < 0

95% confidence interval: ± 1.96 sample standard deviations.
1.96*s_n     = 1.96*sqrt{[  µ_n  *(1 - µ_n)     - 0.25*draws/n]      /(n - 1)}
1.96*s_(n+1) = 1.96*sqrt({µ_(n+1)*[1 - µ_(n+1)] - 0.25*draws/(n + 1)}/   n)

1.96*s_(n+1) = 1.96*sqrt[( {[n/(n + 1)]*µ_n}*{[n*(1 - µ_n) + 1]/(n + 1)} - 0.25*draws/(n + 1) )/n]

1.96*s_(n+1) - 1.96*s_n can be solved analytically, but the result is ugly and not trivial.

Compare (µ_n - 1.96*s_n) vs [µ_(n+1) - 1.96*s_(n+1)]:

µ_n     - 1.96*s_n     = µ_n             - 1.96*sqrt{[µ_n*(1 - µ_n) - 0.25*draws/n]/(n - 1)}
µ_(n+1) - 1.96*s_(n+1) = [n/(n + 1)]*µ_n - [1.96/(n + 1)]*sqrt{µ_n*[n*(1 - µ_n) + 1] - [(n + 1)/n]*0.25*draws}

[µ_(n+1) - 1.96*s_(n+1)] - (µ_n - 1.96*s_n) can be solved analytically, but the result is ugly and not trivial.
I hope there are no typos.
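
These formulas are short enough to implement directly. Here is a Python sketch of the trinomial interval (my own code and naming; I do not know what cutechess-cli does internally):

Code: Select all

import math

def trinomial_interval(wins, draws, losses, z=1.96):
    # 95% confidence interval for the score under the trinomial model:
    # µ = (wins + draws/2)/n and s = sqrt{[µ*(1 - µ) - 0.25*draws/n]/(n - 1)}.
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n
    s = math.sqrt((mu * (1.0 - mu) - 0.25 * draws / n) / (n - 1))
    return mu - z * s, mu, mu + z * s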

Let me give a simple numerical example to show what I mean. I work only with scores, since Elo is just an additional transformation 400*log10[µ/(1 - µ)]:

Code: Select all

Rounded to six decimal places:

wins = 390 , draws = 320 , losses = 290 ; n = 1000
wins = 390 , draws = 320 , losses = 291 ; n = 1001

µ(n = 1000) = (390 + 320/2)/1000 = 550/1000 = 0.55
µ(n = 1001) = (390 + 320/2)/1001 = 550/1001 ~ 0.549451
µ(n = 1001) < µ(n = 1000)

s = sqrt{[µ*(1 - µ) - 0.25*draws/n]/(n - 1)}

1.96*s(n = 1000) = 1.96*sqrt{[(550/1000)*(1 - 550/1000) - 0.25*320/1000]/999}  ~ 0.025379
1.96*s(n = 1001) = 1.96*sqrt{[(550/1001)*(1 - 550/1001) - 0.25*320/1001]/1000} ~ 0.025377
s(n = 1001) < s(n = 1000)

µ(n = 1000) - 1.96*s(n = 1000) = 550/1000 - 1.96*sqrt{[(550/1000)*(1 - 550/1000) - 0.25*320/1000]/999}  ~ 0.524621
µ(n = 1001) - 1.96*s(n = 1001) = 550/1001 - 1.96*sqrt{[(550/1001)*(1 - 550/1001) - 0.25*320/1001]/1000} ~ 0.524074
µ(n = 1001) - 1.96*s(n = 1001) < µ(n = 1000) - 1.96*s(n = 1000)
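
Feeding both records to the trinomial_interval sketch from above reproduces these values:

Code: Select all

lo1, mu1, hi1 = trinomial_interval(390, 320, 290)  # n = 1000
lo2, mu2, hi2 = trinomial_interval(390, 320, 291)  # n = 1001, one extra loss

print(f"n = 1000: µ = {mu1:.6f} , lower bound = {lo1:.6f}")  # 0.550000 , 0.524621
print(f"n = 1001: µ = {mu2:.6f} , lower bound = {lo2:.6f}")  # 0.549451 , 0.524074
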
The same happens as other games are played: a win acts on the upper bound just like the loss here acts on the lower bound, while the effect of draws is even smaller. As more games are played, the score should balance itself (some wins and some losses), causing only small variations of the score in the long run, while the standard deviation keeps decreasing slowly at a rate of O[1/sqrt(games)] (big O notation); eventually upper_bound(many games) < upper_bound(few games) and lower_bound(many games) > lower_bound(few games), which is what you expect.
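
To watch this convergence happen, one can simulate a long match with fixed result probabilities and print the interval every 1000 games. This sketch reuses the trinomial_interval helper from above (the win and draw probabilities are made-up values giving a true score of 0.53):

Code: Select all

import random

random.seed(1)
p_win, p_draw = 0.37, 0.32  # assumed probabilities; true score = 0.37 + 0.32/2 = 0.53
wins = draws = losses = 0

for game in range(1, 8001):
    r = random.random()
    if r < p_win:
        wins += 1
    elif r < p_win + p_draw:
        draws += 1
    else:
        losses += 1
    if game % 1000 == 0:
        lo, mu, hi = trinomial_interval(wins, draws, losses)
        print(f"{game:5d} games: score in [{lo:.4f} , {hi:.4f}]")

The printed interval width shrinks roughly as 1/sqrt(games), while the central score wobbles less and less around 0.53.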

Your graph is exactly what I would expect. The annoying thing you see (not that annoying after all) is that the modification started too well and then sank toward its true value. It is like checking whether a coin is fair: if you flip HHHTTHTHTT, heads starts quite well (3-0), which is perfectly possible (12.5% probability, or 25% probability of a 3-0 start for either heads or tails), and then the result balances itself with more flips.

Good luck with your development! :-)

Regards from Spain.

Ajedrecista.
Carbec
Posts: 162
Joined: Thu Jan 20, 2022 9:42 am
Location: France
Full name: Philippe Chevalier

Re: Precision of cutechess-cli

Post by Carbec »

Hi,

Thanks for your answer. I'm not a statistician, so I have some difficulty understanding it.
But I read in another post about the "restart" option of cutechess. I then activated this option
and reran the tournament, this time with 8000 games. The results are interesting:
1000 : 35 +- 19 : 16 - 54
2000 : 31 +- 13 : 18 - 44
3000 : 30 +- 11 : 19 - 41
4000 : 29 +- 9 : 20 - 38
5000 : 29 +- 8 : 21 - 37
6000 : 28 +- 8 : 20 - 36
7000 : 29 +- 7 : 22 - 36
8000 : 27 +- 7 : 20 - 34
The decrease I saw without the option "restart=on" has disappeared.
I suppose there is a little bug in my code that caused the slow decrease in performance.
The new evolution looks like this:
[Image: evolution of the Elo estimate and its error bounds over the 8000 games]

Philippe