Hello Philippe:
I do not see any problem. Keep in mind that the central score is also dropping in each 1000-game group. More or less:
Code: Select all
Games 1 to 1000: +36 Elo ~ 55.16% (let's say 551.5 vs 448.5).
Games 1 to 2000: +29 Elo ~ 54.16% (let's say 1083.5 vs 916.5).
Discounting games 1 to 1000:
Games 1001 to 2000 circa 532 vs 468 ~ +22 Elo.
Games 1 to 3000: +26 Elo ~ 53.73% (let's say 1612 vs 1388).
Discounting games 1 to 2000:
Games 2001 to 3000 circa 528.5 vs 471.5 ~ +20 Elo.
Games 1 to 4000: +22 Elo ~ 53.16% (let's say 2126.5 vs 1873.5).
Discounting games 1 to 3000:
Games 3001 to 4000 circa 514.5 vs 485.5 ~ +10 Elo.
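If it helps, here is a minimal Python sketch of that arithmetic (the score/Elo conversion and the subtraction of cumulative totals to get per-interval estimates); the function names are mine and the numbers are the same illustrative ones as above, not real cutechess-cli output:
Code: Select all
import math

def elo_from_score(mu):
    # Elo difference corresponding to a score fraction mu in (0, 1).
    return 400.0 * math.log10(mu / (1.0 - mu))

# Cumulative points after 1000, 2000, 3000 and 4000 games (illustrative numbers).
cumulative = [(1000, 551.5), (2000, 1083.5), (3000, 1612.0), (4000, 2126.5)]

previous_games, previous_points = 0, 0.0
for games, points in cumulative:
    # Per-interval score = cumulative points minus the previous cumulative total.
    interval_games = games - previous_games
    interval_points = points - previous_points
    print(f"Games {previous_games + 1} to {games}: "
          f"{elo_from_score(interval_points / interval_games):+.0f} Elo "
          f"(cumulative: {elo_from_score(points / games):+.0f} Elo)")
    previous_games, previous_points = games, points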
I do not know if cutechess-cli uses the trinomial model or the pentanomial model to compute error bars. I am going to use the simpler trinomial model, in which adding one losing game lowers the central score by more than it reduces the sample standard deviation:
Code: Select all
µ_n = points/n
µ_(n+1) = points/(n + 1)
points = n*µ_n = (n + 1)*µ_(n+1)
µ_(n+1) = [n/(n + 1)]*µ_n
1 - µ_(n+1) = 1 - [n/(n + 1)]*µ_n = [n + 1 - n*µ_n]/(n + 1)
µ_(n+1) - µ_n = points*[1/(n + 1) - 1/n] = -points/[n*(n+1)] < 0
95% confidence interval: ± 1.96 sample standard deviations.
1.96*s_n = 1.96*sqrt{[ µ_n *(1 - µ_n) - 0.25*draws/n] /(n - 1)}
1.96*s_(n+1) = 1.96*sqrt({µ_(n+1)*[1 - µ_(n+1)] - 0.25*draws/(n + 1)}/ n)
1.96*s_(n+1) = 1.96*sqrt[( {[n/(n + 1)]*µ_n}*{[n + 1 - n*µ_n]/(n + 1)} - 0.25*draws/(n + 1) )/n]
1.96*s_(n+1) - 1.96*s_n can be computed analytically, but the result is ugly and not trivial.
Compare (µ_n - 1.96*s_n) vs [µ_(n+1) - 1.96*s_(n+1)]:
µ_n - 1.96*s_n = µ_n - 1.96*sqrt{[ µ_n *(1 - µ_n) - 0.25*draws/n] /(n - 1)}
µ_(n+1) - 1.96*s_(n+1) = [n/(n + 1)]*µ_n - [1.96/(n + 1)]*sqrt{µ_n*(n + 1 - n*µ_n) - [(n + 1)/n]*0.25*draws}
[µ_(n+1) - 1.96*s_(n+1)] - (µ_n - 1.96*s_n) can be computed analytically, but the result is ugly and not trivial.
I hope there are no typos.
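One way to guard against that kind of typo is to check the closed form numerically against the direct definition. A minimal Python sketch, where the counts 420/300/280 are made up just to exercise the algebra:
Code: Select all
import math

def lower_bound(wins, draws, losses, z=1.96):
    # mu - z*s computed directly from the trinomial definition.
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n
    s = math.sqrt((mu * (1.0 - mu) - 0.25 * draws / n) / (n - 1))
    return mu - z * s

wins, draws, losses = 420, 300, 280  # made-up counts, only to exercise the algebra
n = wins + draws + losses
mu_n = (wins + 0.5 * draws) / n

# Direct computation after adding one extra loss (n -> n + 1 games).
direct = lower_bound(wins, draws, losses + 1)

# Closed form from the derivation above, written only in terms of mu_n, n and draws.
closed = (n / (n + 1)) * mu_n - (1.96 / (n + 1)) * math.sqrt(
    mu_n * (n + 1 - n * mu_n) - ((n + 1) / n) * 0.25 * draws)

print(direct, closed)  # both ~0.543856, so the closed form agrees with the definition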
I will give a simple numerical example to illustrate what I mean. I only work with scores, since Elo is just an additional transformation 400*log10[µ/(1 - µ)]:
Code: Select all
Rounded to 0.000001:
wins = 390 , draws = 320 , losses = 290 ; n = 1000
wins = 390 , draws = 320 , losses = 291 ; n = 1001
µ(n = 1000) = (390 + 320/2)/1000 = 550/1000 = 0.55
µ(n = 1001) = (390 + 320/2)/1001 = 550/1001 ~ 0.549451
µ(n = 1001) < µ(n = 1000)
s = sqrt{[µ*(1 - µ) - 0.25*draws/n]/(n - 1)}
1.96*s(n = 1000) = 1.96*sqrt{[(550/1000)*(1 - 550/1000) - 0.25*320/1000]/999} ~ 0.025379
1.96*s(n = 1001) = 1.96*sqrt{[(550/1001)*(1 - 550/1001) - 0.25*320/1001]/1000} ~ 0.025377
s(n = 1001) < s(n = 1000)
µ(n = 1000) - 1.96*s(n = 1000) = 550/1000 - 1.96*sqrt{[(550/1000)*(1 - 550/1000) - 0.25*320/1000]/999} ~ 0.524621
µ(n = 1001) - 1.96*s(n = 1001) = 550/1001 - 1.96*sqrt{[(550/1001)*(1 - 550/1001) - 0.25*320/1001]/1000} ~ 0.524074
µ(n = 1001) - 1.96*s(n = 1001) < µ(n = 1000) - 1.96*s(n = 1000)
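For completeness, a short Python sketch that reproduces these numbers straight from the definitions (same counts as above; the helper name is mine):
Code: Select all
import math

def trinomial_stats(wins, draws, losses, z=1.96):
    # Central score mu and z times the trinomial sample standard deviation.
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n
    half = z * math.sqrt((mu * (1.0 - mu) - 0.25 * draws / n) / (n - 1))
    return mu, half

for losses in (290, 291):  # n = 1000, then the same result plus one extra loss
    mu, half = trinomial_stats(390, 320, losses)
    print(f"n = {390 + 320 + losses}: mu ~ {mu:.6f}, "
          f"1.96*s ~ {half:.6f}, lower bound ~ {mu - half:.6f}")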
The same happens as other games are played: a win acts on the upper bound just as the loss here acts on the lower bound, while the effect of a draw is even smaller. As more games are played, the score should balance itself out (some wins and some losses), causing only small variations of the score in the long run, while the standard deviation keeps decreasing slowly at a rate of O[1/sqrt(games)] (big O notation); eventually upper_bound(many games) < upper_bound(few games) and lower_bound(many games) > lower_bound(few games), which is what you expect.
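A small Python sketch of that O[1/sqrt(games)] shrinking, keeping the win/draw/loss proportions fixed and only growing the sample size (the counts are just the earlier illustrative ones scaled up):
Code: Select all
import math

def half_width(wins, draws, losses, z=1.96):
    # z times the trinomial sample standard deviation of the mean score.
    n = wins + draws + losses
    mu = (wins + 0.5 * draws) / n
    return z * math.sqrt((mu * (1.0 - mu) - 0.25 * draws / n) / (n - 1))

# Keep the win/draw/loss proportions fixed and only grow the sample size:
# the 95% half-width shrinks roughly like 1/sqrt(games),
# so quadrupling the number of games roughly halves the error bar.
for scale in (1, 4, 16):
    w, d, l = 390 * scale, 320 * scale, 290 * scale
    print(f"n = {w + d + l:6d} games: ±{half_width(w, d, l):.6f}")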
Your graph is exactly what I would expect. The annoying thing you see (not that annoying after all) is that the modification started too well and later sank towards its true value. It is like checking whether a coin is fair (heads and tails) and getting HHHTTHTHTT: heads started quite well (3-0), although that is perfectly possible (a 12.5% probability, or 25% for getting either 3-0 of heads or 3-0 of tails); then the result balanced itself out with more flips.
Good luck with your development!
Regards from Spain.
Ajedrecista.