Robert Hyatt wrote:
You have to look at all the data. For example, look at the average opponent rating for cheng 4cpu vs cheng 1cpu. 1cpu played against an opponent average about 50 Elo stronger than cheng 4cpu. What would you expect that to cause? Make cheng 1cpu look weaker? You can't compare Elo numbers between partially or fully disjoint sets of opponents...

Michel wrote:
Why not? As long as the graph is connected the comparison is fine.
If A plays B and B plays C and C plays D you can still compare A and D. The comparison via intermediate engines just blows up the error bars.

bob wrote:
That was my point. If the average rating of player A's opponents is X, and the average rating of player B's opponents is X+50, it is going to be VERY difficult to compare their ratings with any accuracy and use the resulting Elo numbers to predict the outcome between the two versions. The two versions of the original program are different, the average opponents are different, WHICH is responsible for the Elo gain or loss?
So in that specific CEGT comparison, the error bars are not +/-26. They are more like +/- 75...
In this case, saying A is +130 better than B is quite inaccurate. It is most likely better, to be sure. But how much better is much harder to determine without more data points.

Bob, the error bars are those shown by the rating list, which uses one of the tools like Ordo or BayesElo. Several days ago I took an extreme case to check the usual logistic ELO model, and it worked quite well even with extreme values: Komodo 9.2 versus the much weaker SOS 3, then SOS 3 versus the still much weaker Bikjump, a "beginner" engine. The aim was to get the rating of the top engine Komodo 9.2 against the "beginner" engine Bikjump via the "intermediate" engine SOS 3:
Score of Komodo 9.2 vs SOS 3: 9865 - 32 - 103 [0.992] 10000
ELO difference: 830
Score of SOS 3 vs Bikjump: 6863 - 636 - 514 [0.889] 8013
ELO difference: 361
Chaining the two results through the intermediate engine SOS 3, the ELO model predicts that Komodo 9.2 is stronger than Bikjump by 830 + 361 = 1191 ELO points. The "real" difference, measured in a direct match, is:
Score of Komodo 9.2 vs Bikjump: 19967 - 6 - 27 [0.999] 20000
ELO difference: 1204
The prediction is only 13 ELO points off over a span of about 1200 ELO points. The error is partly statistical and partly due to the ELO model, but it shows that the model is not bad even in such an extreme case.
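For anyone who wants to redo the arithmetic, here is a minimal Python sketch of the logistic conversion from a win-loss-draw score to an ELO difference, applied to the three matches above. The function names are my own, and the standard-error estimate is an assumption on my part (a simple delta-method approximation that treats games as independent), not what Ordo or BayesElo compute internally. The error of the chained estimate is taken as the two match errors added in quadrature, which is the sense in which going through an intermediate engine widens the error bars.

```python
import math

def elo_diff(wins, losses, draws):
    """ELO difference implied by a match score under the logistic model."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    return 400.0 * math.log10(score / (1.0 - score))

def elo_stderr(wins, losses, draws):
    """Rough standard error of elo_diff (assumption: delta method,
    independent games with outcomes 1, 0.5 and 0)."""
    games = wins + losses + draws
    score = (wins + 0.5 * draws) / games
    # Per-game variance of the outcome around the mean score.
    var = (wins * (1.0 - score) ** 2
           + draws * (0.5 - score) ** 2
           + losses * (0.0 - score) ** 2) / games
    se_score = math.sqrt(var / games)
    # d(elo)/d(score) = 400 / (ln(10) * score * (1 - score))
    slope = 400.0 / (math.log(10) * score * (1.0 - score))
    return se_score * slope

# (wins, losses, draws) taken from the match results quoted above.
komodo_sos = (9865, 32, 103)
sos_bikjump = (6863, 636, 514)
komodo_bikjump = (19967, 6, 27)

# Chained estimate: add the two differences, add the errors in quadrature.
chained = elo_diff(*komodo_sos) + elo_diff(*sos_bikjump)
chained_se = math.hypot(elo_stderr(*komodo_sos), elo_stderr(*sos_bikjump))

# Direct measurement from the head-to-head match.
direct = elo_diff(*komodo_bikjump)
direct_se = elo_stderr(*komodo_bikjump)

print(f"chained: {chained:.0f} (std. error about {chained_se:.0f})")
print(f"direct:  {direct:.0f} (std. error about {direct_se:.0f})")
print(f"gap:     {direct - chained:.0f}")
```

Note that near a 99.9% score the slope of the logistic curve is very steep, so a handful of extra draws or losses moves the direct figure by many ELO points. That is worth keeping in mind when reading the gap between the chained and the direct result.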