Can you define "real error bar"? To compute that you need two things. (1) an INFINITE number of games so you know the absolute truth about the ratings and (2) a large sample to compare. Since you don't have (1) the "real error bar" is meaningless.
Ozymandias wrote:
The first part is covered with more than 8 million unique starting positions. More than 5 million already tested, about 3 million left.
bob wrote:
This is an "independent trials" statistical analysis. If you just stuff the same game into the mix 100K times, it is pretty obvious that Elo calculations will see the error bar drop to the 1-2 range. But it is also pretty obvious that the result will be wrong, because that would not be 100K independent trials.
This means that duplicate games are NFG. Or if you use starting positions, and some of them are pre-determined wins for black or white, or draws, then you have fewer independent trials, and while a program like BayesElo will give you a small error bar, it will be completely wrong.
The second part could obviously be a problem for engine rating, because about 5% of the games finish before 10 ply of the starting position. But that's exactly what I'm trying to filter out, bad opening lines. The fact that I'm not getting as accurate a rating as I could, for engines, isn't an awful problem, because I only need to know if a new one is clearly (10 ELO) better or worse. It'd be nice to have finer grain, but that's it.
As I said, I don't even run simulations under Ordo, to find out the error bar, because I'm going to find out what the real one is anyway (about +/- 4 for the minimum 2.5 mill).
bob wrote:
You are doing something badly wrong somewhere. 2.5 M games should not have a +/- 4 Elo error bar by any known method of calculation I know of. The more common problem is that the REAL error bar is larger than the reported error bar because of duplicate games or openings...
As an example, I'm looking at the last two updates, where the addition to the roster is SugaR 2.0. After the initial 830k games, exceptionally low, it got a rating of XX53. The subsequent burst of the usual 2.5 mill, where it performed at XX49, brought the current rating to XX50, after 3.3 million games. That translates to a 3 ELO point drop after the initial run.
Now, to sampling theory. See the central limit theorem from probability and statistics first. If you take a random sample from a large population, the mean of that sample will be within some error bar of the mean of the entire population. And in fact, when you take many such random samples from that population, the samples will be normally distributed about the population mean.
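To make that concrete, here is a small simulation sketch. Everything in it is made up for illustration: the win/draw/loss probabilities, the sample sizes, and the function name `simulate_coverage` are my assumptions, not anything measured. It draws many independent samples of games from one fixed "population" and counts how often the sample mean lands within two sigma of the population mean.

```python
import math
import random

def simulate_coverage(n_samples=2000, games_per_sample=400, seed=1):
    """Fraction of sample means that fall within two sigma of the true mean."""
    rng = random.Random(seed)
    outcomes = [1.0, 0.5, 0.0]     # win, draw, loss from one side's view
    weights = [0.35, 0.40, 0.25]   # assumed (made-up) outcome probabilities

    # Population mean and per-game variance, computed exactly from the weights.
    true_mean = sum(o * w for o, w in zip(outcomes, weights))
    true_var = sum(w * (o - true_mean) ** 2 for o, w in zip(outcomes, weights))

    # Standard deviation of the mean of one sample of independent games.
    sigma = math.sqrt(true_var / games_per_sample)

    inside = 0
    for _ in range(n_samples):
        sample = rng.choices(outcomes, weights=weights, k=games_per_sample)
        sample_mean = sum(sample) / games_per_sample
        if abs(sample_mean - true_mean) <= 2 * sigma:
            inside += 1
    return inside / n_samples

if __name__ == "__main__":
    print(simulate_coverage())
```

With independent draws like these, the printed fraction should come out close to 0.95. The whole point is that the draws are independent; that is the one thing the simulation guarantees and a real set of games may not.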
You have two values of interest. Variance from mean (how far do the samples vary from the total mean) and confidence interval (how confident are you that the mean of your samples is within some fixed error bar of the actual mean.)
We generally use 95% confidence, which means that 95% of the matches we play lie within the two-sigma confidence interval as given by basic probability. That confidence interval gets larger with fewer games, smaller with more games. And nothing else matters. However, if you look at the central limit theorem closely, there are caveats:
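For a feel of the numbers, here is a rough sketch of how the 95% margin shrinks with the number of games. It rests on assumptions I am adding for illustration: a single head-to-head match between two evenly matched programs, fully independent games, the usual logistic Elo curve, and a hypothetical `draw_ratio` parameter. A rating list with many engines splits its games across many pairings, so its per-engine margin will be wider than this.

```python
import math

def elo_margin_95(n_games, draw_ratio=0.0):
    """Approximate 95% margin in Elo for a match between equal engines.

    Assumes independent games and an expected score of 50%.
    """
    # Per-game score variance when wins and losses are equally likely.
    var_per_game = (1.0 - draw_ratio) / 4.0
    # Standard error of the mean score over n_games independent games.
    se_score = math.sqrt(var_per_game / n_games)
    # Slope of the logistic Elo curve at a 50% score: 400 / (ln(10) * 0.25).
    elo_per_score = 1600.0 / math.log(10)
    return 1.96 * elo_per_score * se_score

if __name__ == "__main__":
    for n in (1_000, 100_000, 2_500_000):
        print(n, round(elo_margin_95(n), 2))
```

Under those assumptions, 2.5 million independent games give a margin of roughly +/- 0.4 Elo, and a higher draw ratio makes it smaller still.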
The most important is that the observations (samples, games, whatever) have to be independent. I.e., if you only get heads because the coin has a bias, then the samples are not independent. If the samples are from different populations (programs a, b and c) vs (d, e and f) then it doesn't hold.
With chess we have multiple issues to deal with. (1) The games between any two opponents have to be different. Repeated games offer no new information, but they artificially shrink the confidence interval, since the calculation assumes an independence that is not there. (2) Two programs where one is so much stronger than the other that it wins every game. All you can conclude from that is that the stronger program is strong enough to win every game; it tells you nothing about the stronger program against other programs that are in the population but not in that sample.
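Point (1) is easy to demonstrate. The sketch below uses ten made-up game results (not real data) and a helper I am calling `mean_and_se`; it computes the mean score and its standard error, then does the same after copying those ten games 100 times.

```python
import math

def mean_and_se(scores):
    """Mean score and its standard error, treating every game as independent."""
    n = len(scores)
    mean = sum(scores) / n
    var = sum((s - mean) ** 2 for s in scores) / (n - 1)
    return mean, math.sqrt(var / n)

# Ten made-up results: 1 = win, 0.5 = draw, 0 = loss.
unique_games = [1, 0, 0.5, 1, 1, 0, 0.5, 1, 0, 1]

# The same ten games stuffed into the mix 100 times over.
padded_games = unique_games * 100

mean_unique, se_unique = mean_and_se(unique_games)
mean_padded, se_padded = mean_and_se(padded_games)

if __name__ == "__main__":
    print("unique:", mean_unique, se_unique)
    print("padded:", mean_padded, se_padded)
```

The mean does not move, because there is no new information, yet the reported standard error drops by about a factor of ten. The formula has no way of knowing that the 1000 "games" are really only ten.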
There is no "actual error bar" when it is defined as "the distance between the mean of the observations and the mean of the totality of all games", since the latter is unknown.
This is probability, not a discrete math problem with a closed solution with one final answer.