bob wrote:This is an "independent trials" statistical analysis. If you just stuff the same game into the mix 100K times, it is pretty obvious that Elo calculations will see the error bar drop to the 1-2 range. But it is also pretty obvious that the result will be wrong, because that would not be 100K independent trials.
This means that duplicate games are NFG. Or if you use starting positions, and some of them are pre-determined wins for black or white, or draws, then you have fewer independent trials, and while a program like BayesElo will give you a small error bar, it will be completely wrong.
The first part is covered with more than 8 million unique starting positions. More than 5 million already tested, about 3 million left.
The second part could obviously be a problem for engine rating, because about 5% of the games finish before 10 ply of the starting position. But that's exactly what I'm trying to filter out, bad opening lines. The fact that I'm not getting as accurate a rating as I could, for engines, isn't an awful problem, because I only
need to know if a new one is clearly (10 ELO) better or worse. It'd be nice to have finer grain, but that's it.
bob wrote:You are doing something badly wrong somewhere. 2.5 M games should not have a +/- 4 Elo error bar by any known method of calculation I know of. The more common problem is that the REAL error bar is larger than the reported error bar because of duplicate games or openings...
As I said, I don't even run simulations under Ordo, to find out the error bar, because I'm going to find out what the real one is anyway (about +/- 4 for the minimum 2.5 mill).
As an example, I'm looking at the last two updates, where the addition to the roster is SugaR 2.0. After the initial 830k games, exceptionally low, it got a rating of XX53. The subsequent burt of the usual 2.5 mill, where it performed at XX49, brought the current rating to XX50, after 3.3 million games. That translates to a 3 ELO point drop after the initial run.