mcostalba wrote: I am really clueless in this field of statistical calculations.

Nevertheless, I noticed some things when looking at matches between engines A and B that I'd like the experts to comment on:

1) The current algorithms to calculate Elo, LOS, etc. use only the "final" result: the numbers of wins, losses, and draws.

2) Experimenting with testing at different TCs, I noticed that at fast TC the results are more "volatile" than at longer TC. At a fast TC, a +10 Elo lead after 1000 games can easily be reverted over the next 10K games, whereas at a long TC, already after 300 games the current leader has a good chance of still being the winner even after 10K games.

3) In all the Elo calculations, the probability of the stronger engine winning a match is assumed to be independent of the TC used. This, IMHO, is a flaw.
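For reference, the LOS usually computed from only the final score (point 1 above) is just a normal-approximation tail probability over the decisive games; draws cancel out of the win/loss difference. A minimal Python sketch of that standard formula:

```python
import math

def los(wins: int, losses: int) -> float:
    """Likelihood of superiority from the final counts alone.

    Normal approximation: draws drop out of the win/loss difference,
    so only decisive games enter the formula.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.5  # no decisive games, no evidence either way
    return 0.5 * (1.0 + math.erf((wins - losses) / math.sqrt(2.0 * decisive)))

print(los(60, 40))  # about 0.977
print(los(50, 50))  # exactly 0.5
```

Note that nothing in this formula depends on the TC or on the order of the results, which is exactly the limitation the points above are getting at.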

My understanding is that in any match there is a noise level that alters the natural result (i.e. that the stronger engine wins). The reason we need a lot of games is to average out this noise, which is assumed to have zero mean. But I also think that this noise is **not** independent of the TC used.

When playing a match we have much more information than the final result: we have the full series of single-game results. My understanding is that by analyzing the sequence of individual results, a "variance" or noise-level estimate could be calculated. That noise level could then be used to reach the final LOS: my guess is that at a long TC fewer games are needed than at a very fast TC (note that the total time could be the same, or even longer, for the long TC) to reach a given LOS with a given accuracy.

So my final question is: has anybody ever considered modelling the single-game result noise using the game series, and then using it to calculate the LOS?
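One way the question above could be approached, sketched under my own assumptions rather than as an established method: resample the observed game series with replacement (a bootstrap) and count how often the resampled score difference stays positive. The spread then comes from the empirical per-game variance of the actual series, not from the binomial variance implied by the final counts alone:

```python
import random

def bootstrap_los(results, n_resamples=10_000, seed=1):
    """Estimate LOS by bootstrapping the sequence of single-game results.

    `results` is the game series from engine A's point of view:
    +1 win, 0 draw, -1 loss. Each resample draws len(results) games
    with replacement; we count how often the resampled total score
    is positive (exact ties split evenly).
    """
    rng = random.Random(seed)
    n = len(results)
    better = 0.0
    for _ in range(n_resamples):
        total = sum(rng.choice(results) for _ in range(n))
        if total > 0:
            better += 1.0
        elif total == 0:
            better += 0.5
    return better / n_resamples

# Hypothetical series: 60 wins, 40 losses, 100 draws for engine A.
series = [1] * 60 + [-1] * 40 + [0] * 100
print(bootstrap_los(series))
```

As written this still treats games as independent, so it mostly reproduces the count-based LOS; the interesting extension, per the question, would be resampling in a way that preserves any TC-dependent correlation in the series.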

I also had the impression that there is a "noise" in the results that is stronger at very fast time controls. The fact that it diminishes at longer time controls suggests that it is an overlaying noise, and therefore *not* related to the playing strength of the engines. As such it may be Gaussian noise, or partly non-Gaussian (as in experiments with systematic error), but the Elo model does not apply to it either way. Elo also does not take this kind of noise into consideration, as far as I know; at least I can't remember reading much about it in Arpad Elo's book, though it has been rather a long time since I read it.

I am not sure there is much you can do by studying subsamples of the whole dataset (*). Maybe subsampling could be used to estimate how strong this (still hypothetical, I think) overlaying noise is. I am not sure it would change the actual calculation of the Elo numbers. It is just a hunch, but a more complex model that takes this noise into consideration might only demand *more* games for a given confidence level or needed LOS. But I think Rémi Coulom or Harm Geert would be more capable of answering those questions.
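One cheap way to probe for such an overlaying noise via subsamples (again my own sketch, not a method from the literature): split the game series into fixed-size blocks and compare the variance of the block scores against what the pooled per-game variance would predict for independent games. A ratio well above 1 would hint at slowly varying extra noise (e.g. drifting machine load) on top of plain game-to-game randomness:

```python
import statistics

def excess_block_variance(results, block_size=50):
    """Ratio of observed block-mean variance to the i.i.d. prediction.

    `results` is the game series (+1 win, 0 draw, -1 loss). Under
    independent games the variance of a block mean should be the
    pooled per-game variance divided by the block size; a ratio
    much larger than 1 suggests correlated, 'overlaying' noise.
    """
    blocks = [results[i:i + block_size]
              for i in range(0, len(results) - block_size + 1, block_size)]
    block_means = [sum(b) / block_size for b in blocks]
    observed = statistics.pvariance(block_means)
    expected = statistics.pvariance(results) / block_size  # i.i.d. prediction
    return observed / expected

# A series that drifts (all wins, then all losses) shows a large ratio;
# a series with no drift stays at or below 1.
print(excess_block_variance([1] * 100 + [-1] * 100))
print(excess_block_variance([1, -1] * 100))
```

This is essentially the batch-means idea from simulation analysis; it would not by itself change the Elo numbers, only the error bars, which fits the hunch above that the noisier model demands more games, not fewer.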

Regards, Eelco

(*): There was an article in Scientific American, maybe ten years ago or more, that covered these kinds of techniques without too much math. If I had a reference to a web-based version I would give it, but it was from a time when there was not so much to be found on the net. I vaguely remember the article also mentioned something like "bootstrapping" in this context, but I don't remember the specifics.