Hi,
I have just noticed this thread, and don't have time to read it all. I'll do it later.
Maybe someone already said this, but you have to understand something very important about the confidence intervals of bayeselo: bayeselo has 3 different methods for computing confidence intervals. The default algorithm is the fastest, but not the most accurate: it acts as if the estimated ratings of the opponents were their true ratings. This will underestimate the uncertainty in general, and particularly for a program that played a lot of games against opponents who played only a few.
The most accurate method computes the whole covariance matrix. It has a cost quadratic in the number of players, so I do not use it as the default, because it would take too much time and memory when rating thousands of players.
So, if you are rating a few players and would like better confidence intervals, you should run the "covariance" command in bayeselo, after "mm".
Also, it is important to understand that a confidence interval in general has little meaning, because elo ratings are relative. They are not estimated independently: there is some covariance in the uncertainty. That is to say, you cannot estimate the probability that one program is stronger than another by just looking at their confidence intervals. That is why I made the "los" command (for "likelihood of superiority"); "los" is a more meaningful indicator. In order to compare two programs, instead of running two separate experiments with two separate PGN files and looking at the confidence intervals, it is a lot better to have only one PGN file with all the games, and compare the two programs with the los command.
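For example, a session could look something like this (the PGN file name is just a placeholder for your own collection of games):

    readpgn allgames.pgn
    elo
    mm
    covariance
    ratings
    los
    x

"readpgn" loads the games, "elo" enters the rating module, "mm" computes the maximum a posteriori ratings, "covariance" makes the confidence intervals use the full covariance matrix, and "ratings" and "los" print the rating table and the likelihood-of-superiority matrix.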
Rémi
-
Sven
Re: more on engine testing
Hi Rémi,
I have two questions regarding BayesElo. The key point is, when comparing two engine versions A and A': do you get the same rating difference between A and A' regardless of whether you only include games of A/A' vs. some opponents, or you also add games between these opponents?
Take the following four scenarios, each resulting in a certain set of games being rated:
Sc.1) Player A meets K opponents B1 .. BK, N games each match, with equally distributed colors, so a total of K * N games gets rated. K > 1, N > 0 but "large enough to avoid trivial cases".
Sc.2) Player A plays a round-robin tournament together with K other players B1 .. BK, again with N games each match and equally distributed colors, so a total of (K * (K-1) / 2) * N games gets rated. K > 1, N > 0 as above. All games of A are exactly copied from scenario 1, so we only add games between B1 .. BK.
Sc.3) Exchange A by a different version A', and let A' play against B1 .. BK the same way as in Sc.1.
Sc.4) Modify the set of games rated in Sc.2 by exchanging all games of A by all games of A' as played in Sc.3.
Now here are my questions.
Q1: Does Sc.2 produce a different relative rating for player A than Sc.1 when using BayesElo, or is the rating of A independent from games between opponents?
Q2: If Q1 is answered with "yes, produces different rating" [edited], is the rating difference between A' in Sc.3 and A in Sc.1 the same as between A' in Sc.4 and A in Sc.2?
There were different opinions about it within this thread, so I am also interested in what the BayesElo expert thinks about this.
Sven
-
hgm
Re: more on engine testing
Exactly!
Uri Blass wrote: It does not seem to me logical.
If the rating has a 51% probability of being 1500 and a 49% probability of being 1600, then I think that 1549 is a better estimate than 1500, which is the maximum-likelihood rating (the example is simple, and of course in practice a rating is a continuous variable, not a discrete one).
You hit upon the core of a slumbering discussion I have with Rémi. Of course the case you quote as an example (1500 or 1600, with no probability in between) does not occur in practice. The rating model uses a smooth, monotonic function for score vs. Elo difference, and that means that the likelihoods will be smooth, single-peaked functions of the rating differences.
But in some very common cases (like Swiss tournaments), the likelihoods can be very skewed distributions: their maximum can be quite far from their average, due to asymmetry. And if you use the maximum-likelihood estimate for the ratings in such a case, the expected average error, as well as the expected maximum error, in the score you predict with the aid of those ratings will be much larger than if you had taken the expectation value of the ratings under the likelihood function. So BayesElo performs like crap when you calculate ratings in such a case (e.g. the ChessWar Promo division).
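As a small numerical illustration of that gap (the 9-1 sample, the grid limits and the flat prior on the rating difference are arbitrary assumptions for this sketch, not anything BayesElo actually does):

    import numpy as np

    def expected_score(d):
        # Logistic Elo curve: expected score at rating difference d
        return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

    wins, losses = 9, 1                        # a small, lopsided sample
    d = np.linspace(-1000.0, 1500.0, 20001)    # grid over the rating difference
    s = expected_score(d)

    # Posterior with a flat prior on d (arbitrary choice, truncated to the grid)
    posterior = s ** wins * (1.0 - s) ** losses
    posterior /= posterior.sum()

    d_map = d[np.argmax(posterior)]            # maximum a posteriori
    d_mean = float((d * posterior).sum())      # posterior expectation
    print(f"MAP = {d_map:.0f} Elo, mean = {d_mean:.0f} Elo")

With a sample this lopsided, the distribution has a long tail towards large differences, so the mean lands well above the maximum; with many games and a score near 50% the two would nearly coincide.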
It is not easy to formulate the problem in such a way that an exact mathematical solution is possible. Bayesian analysis is always tricky: it must assume a prior likelihood for the ratings. But how to assign that? In a sense the rating difference between two players is an arbitrary derived quantity; the fundamental quantity is the win probability. So would you take a prior likelihood that is flat on the score scale, or on the rating-difference scale? That is not the same, as expected scores are not linear functions of the rating difference. For more than two players it is not obvious what to do at all: for 4 players, there are 6 win probabilities, but only 4 ratings, and only 3 independent rating differences. So the space of ratings is embedded as a contorted manifold in the higher-dimensional space of win probabilities, and it is no longer obvious what a 'flat probability distribution' means on such a non-flat space.
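To make the two-player part of that concrete (only a sketch; the logistic curve is the usual Elo model and the sample points are arbitrary): a prior that is flat on the score scale corresponds, on the rating-difference scale, to the Jacobian density ln(10)/400 * s * (1 - s), which is very far from flat:

    import math

    def expected_score(d):
        # Expected score as a function of the Elo difference (logistic curve)
        return 1.0 / (1.0 + 10.0 ** (-d / 400.0))

    for d in (0, 200, 400, 600):
        s = expected_score(d)
        # Density on the d scale implied by a flat prior on the s scale
        density = math.log(10.0) / 400.0 * s * (1.0 - s)
        print(f"d = {d:3d}  s = {s:.3f}  implied prior density on d = {density:.6f}")

So a 'flat' prior on one scale is strongly peaked around 0 on the other, and for more than two players the choice becomes even murkier, as described above.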
-
Rémi Coulom
Re: more on engine testing
No. The games between these opponents will have an effect on their ratings, which in turn has an effect on the ratings of A and A'.
Sven Schüle wrote: Hi Rémi,
I have two questions regarding BayesElo. The key point is, when comparing two engine versions A and A': do you get the same rating difference of A and A' regardless whether you only include games of A/A' vs. some opponents, or you also add games between these opponents?
Yes, different rating, in general.
Take the following four scenarios, each resulting in a certain set of games being rated:
Sc.1) Player A meets K opponents B1 .. BK, N games each match, with equally distributed colors, so a total of K * N games gets rated. K > 1, N > 0 but "large enough to avoid trivial cases".
Sc.2) Player A plays a round-robin tournament together with K other players B1 .. BK, again with N games each match and equally distributed colors, so a total of (K * (K-1) / 2) * N games gets rated. K > 1, N > 0 as above. All games of A are exactly copied from scenario 1, so we only add games between B1 .. BK.
Sc.3) Exchange A by a different version A', and let A' play against B1 .. BK the same way as in Sc.1.
Sc.4) Modify the set of games rated in Sc.2 by exchanging all games of A by all games of A' as played in Sc.3.
Now here are my questions.
Q1: Does Sc.2 produce a different relative rating for player A than Sc.1 when using BayesElo, or is the rating of A independent from games between opponents?
No, for the same reason.
Q2: If Q1 is answered with "yes, produces different rating" [edited], is the rating difference between A' in Sc.3 and A in Sc.1 the same as between A' in Sc.4 and A in Sc.2?
I repeat that the most important idea is that, if you wish to compare A and A', you should evaluate their ratings from one single PGN file that contains all the games both played. And the graph of players linked by games should be connected. It makes no sense to compare two ratings obtained from two separate subsets of games.
Rémi
-
Rémi Coulom
Re: more on engine testing
I also agree that computing the expected rating would be good. But I don't know how to do it. When the number of games is high and the winning rate is not close to 0% or 100%, there is virtually no difference between the expected rating and the maximum a posteriori.
hgm wrote: Exactly!
Uri Blass wrote: It does not seem to me logical.
If the rating has a 51% probability of being 1500 and a 49% probability of being 1600, then I think that 1549 is a better estimate than 1500, which is the maximum-likelihood rating (the example is simple, and of course in practice a rating is a continuous variable, not a discrete one).
You hit upon the core of a slumbering discussion I have with Rémi. Of course the case you quote as an example (1500 or 1600, with no probability in between) does not occur in practice. The rating model uses a smooth, monotonic function for score vs. Elo difference, and that means that the likelihoods will be smooth, single-peaked functions of the rating differences.
Rémi
-
Rémi Coulom
Re: more on engine testing
I think I will explain this a little more: it is the root of many misunderstandings in this thread, so it deserves some more details.
Rémi Coulom wrote: It makes no sense to compare two ratings obtained from two separate subsets of games.
First, Elo ratings are not absolute. They indicate relative strength. They can be translated by any constant value, since the probabilities of winning are computed as a function of the difference between Elo ratings. In Bayeselo, ratings are normalized by translating them so that the average is zero.
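For instance (made-up numbers, only to illustrate the translation invariance):

    def expected_score(r_a, r_b):
        # The winning probability depends only on the rating difference
        return 1.0 / (1.0 + 10.0 ** (-(r_a - r_b) / 400.0))

    shift = 137.0   # any constant
    print(expected_score(2600.0, 2520.0))                  # about 0.613
    print(expected_score(2600.0 + shift, 2520.0 + shift))  # same value: the shift cancels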
This means that if you play two sets of games with the same pairing scheme, like Bob did, the ratings should converge to the same values as the number of games goes to infinity. So, in a way, the comparison of the ratings from those two experiments does make sense.
But the confidence interval given by Bayeselo is not at all a confidence interval for the limit rating that would be obtained when the number of games goes to infinity. If you wish to get an estimate of this limit, and thus make comparisons between two different sets of games played in similar conditions, you must also add the uncertainty in the estimate of the constant that was used to translate the ratings.
This additional uncertainty should be of similar magnitude to the uncertainty of the players who played few games.
So I hope it is clear now why you should put all the games in the same PGN, and use los to estimate the significance of the comparison.
Rémi
-
Sven
Re: more on engine testing
Thanks, Rémi, for your clear statement.
Rémi Coulom wrote: No. The games between these opponents will have an effect on their ratings, which in turn has an effect on the ratings of A and A'.
Sven Schüle wrote: Hi Rémi,
I have two questions regarding BayesElo. The key point is, when comparing two engine versions A and A': do you get the same rating difference of A and A' regardless whether you only include games of A/A' vs. some opponents, or you also add games between these opponents?
[...]
I repeat that the most important idea is that, if you wish to compare A and A', you should evaluate their ratings from one single PGN file that contains all the games both played. And the graph of players linked by games should be connected. It makes no sense to compare two ratings obtained from two separate subsets of games.
An interesting question would of course be _why_ the inter-opponent games influence the ratings of A and A' (Uri, for instance, wrote that he thought the opposite was true), but most important for me is the conclusion from your statement, perhaps as advice for testing. I'll try to approach it now.
What I learn from your statement is that rating only the games from a gauntlet of an engine A against some opponents may give results that are less reliable than rating these games together with games between the opponents.
So one of the first steps would be to select a fixed set of opponents and let them play a round-robin. The number of games per match, here and in all further steps, is chosen "high enough"; I intentionally leave open what exactly this means.
Then start testing versions of engine A. Play games of A against the selected opponents, and add these to the existing PGN file.
Then, at some point there is a version A'. Let it play games against the same opponents, not necessarily (but optionally) including also A.
Now calculate ratings. At this point, A and A' can be compared quite reliably.
The same can be repeated with further versions A'', A''' and so on, by always adding games of the newest version against the same opponents to the existing PGN file and then repeating the Elo calculation.
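Concretely, the bookkeeping could look something like this (the file names are only placeholders, and I assume readpgn can be called repeatedly to accumulate games; if not, the files can simply be concatenated into one PGN first):

    readpgn opponents_round_robin.pgn
    readpgn A_vs_opponents.pgn
    readpgn A1_vs_opponents.pgn
    elo
    mm
    covariance
    ratings
    los

Every time a new version has finished its games, its PGN is added to the list and the whole calculation is redone.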
(Btw, this method appears much simpler and also better to me than my own proposal, since you need only one PGN file for everything, and since the games of A' are also used for the "new" rating of A.)
This should produce results that say more about changes in playing strength between versions of A than the method that omits the inter-opponent games.
At least this is what I understood from your post, do you agree?
If you do, I expect that it may have some impact (a positive one, I hope!) on the testing process of some engine authors who might have relied on the simpler method up to now. I would also propose to Bob that he tries to apply this method, and that he comes back with data showing whether his current observations are still present with that "hopefully improved" evaluation method.
Personally I want to add that I'm still not sure whether the effect Bob describes will really disappear this way. I just propose that he tries this direction, as one possible way to find an explanation.
Sven
-
hgm
Re: more on engine testing
This is very tricky: sometimes there are 'hidden massacres' in the pairing network:
Rémi Coulom wrote: When the number of games is high and the winning rate is not close to 0% or 100%, there is virtually no difference between expected rating and maximum a posteriori.
Suppose you have 200 players. A group of 100 of them plays a complete round robin (i.e. 99 games each), and they turn out to be all about equally strong, so the scores are normally distributed with sigma ~4, and all players had results between 38 and 61 (+/-3 sigma should be enough to catch all 100 players, as the probability that a player exceeds that by chance is <1%). The same is true for the other group of 100 players.
But now suppose that each of the players from group A has played one game against a player of group B (and vice versa), and that all the A players won that game. (This is pretty much what happens in the first round of a Swiss tournament, seeded by rating.)
That one extra game hardly affects the results of the individual players: they now each have 100 games, and for the A players the score was between 39% and 62%, for the B players between 38% and 61%. All very far from 100%. But B as a group was slaughtered by the A players with a 100% score!
This case is totally mishandled by BayesElo. It will not succeed in getting the Elo difference between the A and B group averages anywhere near correct, while the 100 games played between the two groups should be enough to give a good clue (or, in this case, a good lower limit for the difference).
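A back-of-the-envelope version of that lower limit (assuming the usual logistic Elo curve and treating the 100 inter-group games as independent; the 5% plausibility threshold is an arbitrary choice):

    import math

    def elo_diff(score):
        # Invert the logistic Elo curve: rating difference implied by an expected score
        return 400.0 * math.log10(score / (1.0 - score))

    # Group A won all 100 inter-group games. If the per-game expected score of an
    # A player against a B player is s, the chance of a 100-0 sweep is s**100.
    # Requiring that chance to be at least 5% gives a rough lower bound on the
    # group rating difference.
    s_min = 0.05 ** (1.0 / 100.0)
    print(f"s_min = {s_min:.4f}, group difference at least ~{elo_diff(s_min):.0f} Elo")

With these numbers the bound comes out at roughly +600 Elo, far beyond anything the within-group results alone could suggest, which is exactly the information that gets lost.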
-
bob
Re: more on engine testing
I was not contesting your computations. I was trying to show that matches between computers exhibit some unusual variability, and that even 25,000-game matches can produce completely different Elo estimates...
Rémi Coulom wrote: Hi,
I have just noticed this thread, and don't have time to read it all. I'll do it later.
Maybe someone already said this, but you have to understand something very important about the confidence intervals of bayeselo: bayeselo has 3 different methods for computing confidence intervals. The default algorithm is the fastest, but not the most accurate: it acts as if the estimated ratings of the opponents were their true ratings. This will underestimate the uncertainty in general, and particularly for a program that played a lot of games against opponents who played only a few.
The most accurate method computes the whole covariance matrix. It has a cost quadratic in the number of players, so I do not use it as the default, because it would take too much time and memory when rating thousands of players.
So, if you are rating a few players and would like better confidence intervals, you should run the "covariance" command in bayeselo, after "mm".
Also, it is important to understand that a confidence interval in general has little meaning, because elo ratings are relative. They are not estimated independently: there is some covariance in the uncertainty. That is to say, you cannot estimate the probability that one program is stronger than another by just looking at their confidence intervals. That is why I made the "los" command (for "likelihood of superiority"); "los" is a more meaningful indicator. In order to compare two programs, instead of running two separate experiments with two separate PGN files and looking at the confidence intervals, it is a lot better to have only one PGN file with all the games, and compare the two programs with the los command.
Rémi
My issue is "how many games will it take to get a _very_ accurate estimation of a rating so that I can determine whether a modest change is good or bad? And more recently this has become not only can I play enough games, but is it even possible to measure small changes since there is so much randomness in the games played between computers?
-
bob
Re: more on engine testing
Let me make sure I understand your "graph" comment.
I am playing matches between Crafty and 5 different opponents, a total of roughly 25,000 games. I do use one large PGN file, but it would be _very_ difficult (if not impossible) to properly order the games into the sequence in which they were physically played. What I end up doing is creating small PGN files of the form matchX.Y.Z, where X is the number of the opponent, Y is the position number, and Z is the number of the individual match. To explain Z: I run 40 positions, 4 games per opponent, and then 32 of those tests. Z is a number between 1 and 32.
I may have Y and Z backward, but that is most likely irrelevant here. When I combine these into a single PGN file, the original files are ordered lexically by filename, which is probably not the order the games were actually played in. But when I run a second run with Crafty' (rather than Crafty), the ordering is at least consistent between the two different tests.
Should I combine the two PGN collections into one to run the test? Or is running two separate tests and comparing the Elo the solution? Something tells me that combining will produce different results than two separate tests.
edit:
Another question: do you believe (and I can/will test this for my own benefit) that the results would be better analyzed if I did the following:
Use Crafty and Crafty'. Rather than a separate test for each, combine Crafty and Crafty' plus the 5 opponents, play a full round-robin among all 7 players, and compare the final ratings for Crafty/Crafty'? Do you believe that fewer games will be needed? For example, in the two ugly results I posted at the top of this thread (the only two big runs in the post where the ratings were significantly different), that is 40 positions, times 4 games per position, or 160 games total for a single "mini-match" between two opponents. With a mix of 7 opponents, this requires 720 mini-matches to play everyone against everyone, or a total of 115,200 games. I can clearly save by only playing 2 games per position to alternate colors, but that is still almost 60,000 games.
It seems to be important to use this many opponents, if not more, so the only way to make it tractable for the non-cluster user would be to use fewer starting positions, which also would seem to leave gaps in the testing results.