I still agree with H.G.Muller
Norm Pollock wrote:
I think you answered it yourself when you said:
hgm wrote:
I can understand why you don't want the list to grow to unwieldy proportions, including all kinds of obsolete engine versions with poorly known ratings.
Spock wrote:
The list of killed engines was introduced about a month ago
The total number of games killed is currently about 2,200
We try to ensure all engines on the list get at least 200 games. If a new engine version comes out quickly, and the old version only has a small number of games, then either we commit to getting it up to 200 games as well as the new version, or "kill" the old one.... As you say, the list can quickly get out of control if we don't take steps to tidy it up
But I seriously question the statistical wisdom of removing their games from the database. These games still contain information, which BayesElo would extract, that is useful for narrowing down the ratings of the other engines that played them.
Example:
Say I have engines A and B, and each of them plays two games against each of the engines C1, C2, ... C200. Say A and B both score 50% in these gauntlets.
A and B then each have 400 games, and there is good evidence that they are equally strong: statistically about as good as if they had played 200 games against each other, but without the systematic error that would result from playing the same opponent too often. All the engines C1, ... C200 would have only played 4 games, though, and their ratings are hardly known at all.
But 'killing' these C engines would leave the relative strength of A and B totally undefined. It would be equivalent in terms of accuracy loss to removing 200 games between the two of them, without need or reason.
An extreme example, perhaps, to make it very obvious. But the effect will always be there, no matter how small the fraction of games thrown away is compared to the total. Game for game, these games still carry about 25% as much information as games between 'alive' engines.
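To put rough numbers on the A/B/C example, here is a minimal sketch, not BayesElo's actual computation. It assumes a plain logistic Elo model with no draws and treats all engines as equally strong, so every game carries the same Fisher information. Under those assumptions the information matrix is a graph Laplacian weighted by game counts, and the precision of the A-B rating difference can be expressed as an equivalent number of direct A-vs-B games. The function names are hypothetical, chosen only for this illustration.

```python
import math

def solve_linear(matrix, rhs):
    """Solve matrix * x = rhs by Gaussian elimination; return None if singular."""
    n = len(matrix)
    aug = [row[:] + [value] for row, value in zip(matrix, rhs)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-9:
            return None
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            if factor != 0.0:
                for c in range(col, n + 1):
                    aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(aug[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (aug[r][n] - tail) / aug[r][r]
    return x

def equivalent_direct_games(n_players, games, a, b):
    """Number of direct a-vs-b games that would pin down the a-b rating
    difference as well as the given pairings do.

    games is a list of (player_i, player_j, game_count) tuples.
    Returns 0.0 when the system is singular, i.e. when some player has no
    chain of games to b (this sketch does not check which player that is).
    """
    # Information matrix in units of "one game": a weighted graph Laplacian.
    lap = [[0.0] * n_players for _ in range(n_players)]
    for i, j, count in games:
        lap[i][i] += count
        lap[j][j] += count
        lap[i][j] -= count
        lap[j][i] -= count
    # Ratings are only defined up to a constant, so pin b and drop its row/column.
    keep = [k for k in range(n_players) if k != b]
    reduced = [[lap[r][c] for c in keep] for r in keep]
    rhs = [1.0 if k == a else 0.0 for k in keep]
    x = solve_linear(reduced, rhs)
    if x is None:
        return 0.0
    # x[a] is the variance of (a - b) in per-game units; invert it.
    return 1.0 / x[keep.index(a)]

def elo_standard_error(equiv_games):
    """Approximate standard error (in Elo) of the rating difference."""
    if equiv_games == 0:
        return math.inf
    # Fisher information of one decisive game between equal players (assumption).
    info_per_game = 0.25 * (math.log(10) / 400) ** 2
    return 1.0 / math.sqrt(equiv_games * info_per_game)

A, B = 0, 1
opponents = range(2, 202)  # C1 ... C200
gauntlet = [(A, c, 2) for c in opponents] + [(B, c, 2) for c in opponents]

star = equivalent_direct_games(202, gauntlet, A, B)       # gauntlets kept
direct = equivalent_direct_games(2, [(A, B, 200)], A, B)  # 200 head-to-head games
killed = equivalent_direct_games(2, [], A, B)             # C engines removed

print(star, direct, killed, elo_standard_error(star))
```

Under these assumptions the 800 gauntlet games come out equivalent to about 200 direct games, which is where the roughly 25% per game figure comes from, and removing the C engines leaves no information at all about A versus B.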
"All the engines C1, ... C200 would have only played 4 games, though, and their ratings are hardly known at all."
Their tentative Elo ratings will be based upon the standard initial Elo value that all engines start from, which is supplied by the user, and 4 games. Possibly very inaccurate Elo ratings. These ratings will then influence A's and B's Elo ratings, and then have a ripple effect until all engines in the cluster are affected.
I would not have confidence in such ratings. A chain is as weak as the weakest link, and in this case, having 200 weak elo ratings (weak in terms of reliability) is like having 200 weak links. Not good.
The point is that the best rating estimate should not ignore data from engines that played more than one opponent and got different results against them.
Even if engine X has played only 2 games, beating Y and losing to Z, ignoring those games is not fair to Y and Z, because the results can help to find the relative difference between Y and Z.
I think that the games of engine X should carry less weight than the games of engines that have played more games. Maybe these games are counterproductive if you use the default way of calculating ratings, but there should be some productive way to calculate ratings that does not totally ignore them.
Uri