The IPON BayesElo mystery solved.

lkaufman · Post by **lkaufman** » Tue Jan 03, 2012 6:17 pm

With the help of Mark Watkins, I can now present the solution to the question of why the IPON ratings for top engines keep coming out significantly lower than the average of the performance ratings.
The answer is pretty clear. BayesElo (which IPON uses) has a parameter called "drawelo". I assume (unless Ingo says otherwise) that IPON uses the default value. This default value was taken from a study of a database with a rather low percentage of draws. The frequency of draws in the IPON games is substantially higher than the frequency that is implied by the default value. The drawelo parameter would have to be much higher to reflect the actual draw percentage in the IPON data. The consequence of using a too-low value for "drawelo" is that the ratings get contracted towards the mean. This is exactly what we are observing. Mystery solved!
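To make the link between "drawelo" and draw frequency concrete, here is a minimal sketch. It assumes the three-outcome model usually attributed to BayesElo, in which a win has probability f(delta - drawelo), a loss has probability f(-delta - drawelo), f is the logistic Elo curve, and the draw takes the remainder. The function names are hypothetical, and the White-advantage term is set to zero for simplicity.

```python
import math

def bayeselo_probs(delta, drawelo, advantage=0.0):
    """Win/draw/loss probabilities for the first player under the assumed model."""
    p_win = 1.0 / (1.0 + 10.0 ** ((-delta - advantage + drawelo) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((delta + advantage + drawelo) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def drawelo_for_draw_rate(draw_rate):
    """drawelo that yields the given draw rate between equal players (advantage = 0)."""
    return 400.0 * math.log10((1.0 + draw_rate) / (1.0 - draw_rate))

# Draw rate between equally rated engines for a few drawelo values.
for d in (50, 100, 150, 200, 250):
    _, p_draw, _ = bayeselo_probs(0.0, d)
    print(f"drawelo={d:3d}  draw rate at equal strength={p_draw:.3f}")

# Inverse question: which drawelo corresponds to an observed draw rate?
for rate in (0.25, 0.35, 0.45):
    print(f"draw rate={rate:.2f}  implied drawelo={drawelo_for_draw_rate(rate):.1f}")
```

Under this model a higher observed draw rate maps to a higher drawelo, which is the direction of the mismatch described above. The sketch does not by itself demonstrate the compression of the ratings.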
This means that all the talk about performance ratings being meaningless or "prior" being important (I'm guilty on that one) was nonsense. If "drawelo" actually matched the data, averaging the performance ratings would come quite close to predicting the final rating.
So now the question is whether IPON should "fix" the problem by using a drawelo value that corresponds to the data. It's a matter of opinion, but my opinion would be to leave things as they are. The reasons for this are twofold:
1. BayesElo and EloStat would be much farther apart in general if not for the drawelo problem. The reason is that EloStat compresses the ratings for a completely different reason, having to do with the incorrectness of averaging ratings. Purely by coincidence, the compression of BayesElo ratings caused by using the drawelo default is roughly the same as the compression caused by using EloStat (given the IPON data), so huge disparities in general are avoided by this "error".
2. Engine vs. engine ratings overstate rating differences in terms of how the engines would perform against humans. The artificial compression caused by using the drawelo default accidentally makes the ratings more realistic (relative to one another) in terms of how they would perform against the top human players.

So, although the use of the default is an "error", I say "leave it alone"! Presumably the above also applies to CCRL and to any rating groups that use BayesElo with default values.

Larry

ThatsIt · Post by **ThatsIt** » Tue Jan 03, 2012 8:03 pm

There is no mystery at all.

During an IPON-run you see the following (example):

Engine A vs Engine B (ELO 2755) 60.0-40.0 perf=2825
Engine A vs Engine C (ELO 2705) 82.0-18.0 perf=2968
Engine A vs Engine D (ELO 2815) 62.0-38.0 perf=2900
Engine A vs Engine E (ELO 2800) 65.0-35.0 perf=2908
Engine A vs Engine F (ELO 2680) 85.0-15.0 perf=2981

If you make the mistake of calculating:
(2825+2968+2900+2908+2981) / 5
you get ELO 2916.

But the correct calculation is:
ELO (2755+2705+2815+2800+2680) / 5 = ELO 2751 (average)
Perf. = 60+82+62+65+85 = 354 out of 500 games = 70.8% = ELO 2905.

The more results above ~80% (or below ~20%) you get in such matches,
the larger the discrepancy will be.
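
The two figures can be reproduced with a short sketch. It assumes the logistic Elo curve (400 * log10(p / (1 - p))) and 100 games per match, which is what the 354 out of 500 total implies. The variable names are illustrative only.

```python
import math

def perf_diff(score_fraction):
    """Elo difference implied by a score fraction under the logistic Elo curve."""
    return 400.0 * math.log10(score_fraction / (1.0 - score_fraction))

# (opponent rating, points scored out of 100 games), from the example above
results = [(2755, 60.0), (2705, 82.0), (2815, 62.0), (2800, 65.0), (2680, 85.0)]
games_per_match = 100

# Method 1: one performance rating per match, then average them.
per_match = [r + perf_diff(p / games_per_match) for r, p in results]
mean_of_perfs = sum(per_match) / len(per_match)

# Method 2: average the opponents, pool the score, convert once.
avg_opp = sum(r for r, _ in results) / len(results)
total_score = sum(p for _, p in results) / (games_per_match * len(results))
pooled_perf = avg_opp + perf_diff(total_score)

print("per-match performances:", [round(x) for x in per_match])
print("mean of performances:  ", round(mean_of_perfs, 1))
print("pooled performance:    ", round(pooled_perf, 1))
```

The gap between the two methods comes from the curvature of the Elo curve: a score near 80% or above converts to a disproportionately large rating difference, so averaging after the conversion gives a higher number than converting the averaged score.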

Best wishes,
G.S.

Uri Blass · Post by **Uri Blass** » Tue Jan 03, 2012 8:49 pm

ThatsIt wrote:There is no mystery at all.

During an IPON-run you see the following (example):
Engine A vs Engine B (ELO 2755) 60.0-40.0 perf=2825
Engine A vs Engine C (ELO 2705) 82.0-18.0 perf=2968
Engine A vs Engine D (ELO 2815) 62.0-38.0 perf=2900
Engine A vs Engine E (ELO 2800) 65.0-35.0 perf=2908
Engine A vs Engine F (ELO 2680) 85.0-15.0 perf=2981
If you make the mistake of calculating:
(2825+2968+2900+2908+2981) / 5
you get ELO 2916.

But the correct calculation is:
ELO (2755+2705+2815+2800+2680) / 5 = ELO 2751 (average)
Perf. = 60+82+62+65+85 = 354 out of 500 games = 70.8% = ELO 2905.

The more results above ~80% (or below ~20%) you get in such matches,
the larger the discrepancy will be.

Best wishes,
G.S.

The last calculation is not correct.

Scoring 354 out of 500 against a fixed rating of 2751 is a performance of 2905.

When the opponents are of different levels, it is clearly harder to score 354 out of 500.

For example, imagine that 300 games are against an opponent rated 2905 and 200 games are against an opponent rated 2520, so the average of the opponents is 2751.

A 2905 player is expected to score 150 out of 300 in the first 300 games, and it is impossible for him to score 204 out of 200 in the rest of the games.

It means that, in practice, 354 out of 500 against the opponents in my example is a performance clearly higher than 2905.
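
One way to check this numerically is to look for the rating whose total expected score against the actual opponents equals the observed score. The sketch below does that by bisection, assuming the logistic Elo curve. `performance_rating` is a hypothetical helper written for this illustration, not a function from EloStat or BayesElo.

```python
def expected(rating, opponent):
    """Expected score per game under the logistic Elo curve."""
    return 1.0 / (1.0 + 10.0 ** ((opponent - rating) / 400.0))

def performance_rating(opponents, total_score, lo=0.0, hi=6000.0):
    """Rating whose expected total against `opponents` equals `total_score`.

    opponents: list of (opponent_rating, number_of_games).
    The expected total is increasing in the rating, so bisection works.
    """
    for _ in range(100):
        mid = (lo + hi) / 2.0
        exp_total = sum(n * expected(mid, r) for r, n in opponents)
        if exp_total < total_score:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2.0

# The two-opponent example: same average opponent (2751), same 354/500.
two_opponents = [(2905, 300), (2520, 200)]
# The five-opponent IPON-style example, 100 games each.
five_opponents = [(2755, 100), (2705, 100), (2815, 100), (2800, 100), (2680, 100)]

print("two-opponent example: ", round(performance_rating(two_opponents, 354.0), 1))
print("five-opponent example:", round(performance_rating(five_opponents, 354.0), 1))
```

In the two-opponent example a 2905 player expects well under 354 points, so the solved rating has to come out above 2905, in line with the argument above.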

Laskos · Post by **Laskos** » Tue Jan 03, 2012 8:51 pm

ThatsIt wrote:There is no mystery at all.

During an IPON-run you see the following (example):
Engine A vs Engine B (ELO 2755) 60.0-40.0 perf=2825
Engine A vs Engine C (ELO 2705) 82.0-18.0 perf=2968
Engine A vs Engine D (ELO 2815) 62.0-38.0 perf=2900
Engine A vs Engine E (ELO 2800) 65.0-35.0 perf=2908
Engine A vs Engine F (ELO 2680) 85.0-15.0 perf=2981
If you make the mistake of calculating:
(2825+2968+2900+2908+2981) / 5
you get ELO 2916.

But the correct calculation is:
ELO (2755+2705+2815+2800+2680) / 5 = ELO 2751 (average)
Perf. = 60+82+62+65+85 = 354 out of 500 games = 70.8% = ELO 2905.

The more results above ~80% (or below ~20%) you get in such matches,
the larger the discrepancy will be.

Best wishes,
G.S.

LOL

You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number. Maybe I will try to do some simple cases rigorously using trinomials with "Mathematica", to see what happens, but what Larry said makes some sense.

Kai

ThatsIt · Post by **ThatsIt** » Tue Jan 03, 2012 9:05 pm

Hi Uri!

The issue is the discrepancy, no more, no less!
And the calculation is correct:

Wins   = 300
Draws  = 108
Losses = 92
Av.Op. Elo = 2751

Result     : 354.0/500 (+300,=108,-92)
Perf.      : 70.8 %
Margins    :
 68 %      : (+  1.7,-  1.8 %) -> [ 69.0, 72.5 %]
 95 %      : (+  3.3,-  3.5 %) -> [ 67.3, 74.1 %]
 99.7 %    : (+  5.0,-  5.4 %) -> [ 65.4, 75.8 %]

Elo        : 2905
Margins    :
 68 %      : (+ 15,- 15) -> [2890,2919]
 95 %      : (+ 29,- 29) -> [2876,2934]
 99.7 %    : (+ 44,- 43) -> [2862,2949]
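
The 68 % line of this output can be approximated with a plain normal approximation. The sketch below treats each game as an independent draw from the observed win/draw/loss frequencies and converts the interval ends with the logistic Elo curve. This is an assumption about how such margins can be estimated, not a statement of the method the tool itself uses, whose margins are slightly asymmetric.

```python
import math

wins, draws, losses = 300, 108, 92
avg_opp = 2751

n = wins + draws + losses
score = (wins + 0.5 * draws) / n  # 0.708

# Per-game variance of the score (win = 1, draw = 0.5, loss = 0).
variance = (wins * 1.0 + draws * 0.25) / n - score ** 2
std_err = math.sqrt(variance / n)  # one-sigma error of the mean score

def to_elo(p):
    """Performance rating for score fraction p against the average opponent."""
    return avg_opp + 400.0 * math.log10(p / (1.0 - p))

low, high = to_elo(score - std_err), to_elo(score + std_err)
print(f"score = {100 * score:.1f} %  +/- {100 * std_err:.2f} % (one sigma)")
print(f"Elo   = {to_elo(score):.0f}  one-sigma range [{low:.0f}, {high:.0f}]")
```

Note that the draw count enters through the variance: with the same 70.8 % but fewer draws, the margins would be wider.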

@ Kai = no need to answer such a post.

Best wishes,
G.S.

IGarcia · Post by **IGarcia** » Tue Jan 03, 2012 9:31 pm

Kai Laskos wrote:
You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number. Maybe I will try to do some simple cases rigorously using trinomials with "Mathematica", to see what happens, but what Larry said makes some sense.

Kai

You are the typical guy speaking about something you think you understand, and embarrassing yourself while doing it. Are you a politician?

@Uri:
You have a point, but because the number of games is always the same against all engines, there is no problem in averaging all the opponents' Elo (not the performance!) and then combining that with the results to get the overall performance Elo.

I don't know BayesElo and I don't know that "drawelo"... probably there is a mistake in the algorithm. But the most common problem is people averaging performance.

Laskos · Post by **Laskos** » Tue Jan 03, 2012 9:37 pm

IGarcia wrote:
Kai Laskos wrote:
You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number. Maybe I will try to do some simple cases rigorously using trinomials with "Mathematica", to see what happens, but what Larry said makes some sense.

Kai
You are the typical guy speaking about something you think you understand, and embarrassing yourself while doing it. Are you a politician?

@Uri:
You have a point, but because the number of games is always the same against all engines, there is no problem in averaging all the opponents' Elo (not the performance!) and then combining that with the results to get the overall performance Elo.

I don't know BayesElo and I don't know that "drawelo"... probably there is a mistake in the algorithm. But the most common problem is people averaging performance.

Sorry, when things are really hard, and I see you or that other guy making a joke of a derivation, I can only say that you are both free to speak here; it's a free forum, for all sorts of folks like you. At least I didn't say it's clear to me what happens.

Kai

IGarcia · Post by **IGarcia** » Tue Jan 03, 2012 9:45 pm

Laskos wrote: Sorry, when the things are really hard, and I see you or that other guy making a joke of a derivation, I can only say that you are both free to speak here, it's a free forum, for all sorts of folks like you.

Kai

Sure, it's free to speak (write). Still, it is hard to read:

Laskos wrote: "You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number"

as a correct answer to a problem. Please don't take it personally.

Regards.

Laskos · Post by **Laskos** » Tue Jan 03, 2012 9:50 pm

IGarcia wrote:
Laskos wrote: Sorry, when the things are really hard, and I see you or that other guy making a joke of a derivation, I can only say that you are both free to speak here, it's a free forum, for all sorts of folks like you.

Kai
Sure, it's free to speak (write). Still, it is hard to read:

Laskos wrote: "You are adding some apples and oranges there, then divide them by grapefruits, getting the "correct" and "fruitless" number"
as a correct answer to a problem. Please don't take it personally.

Regards.

Why is it hard to read? One cannot simply add numbers there; one has to make convolutions of trinomial distributions and use cumulative distribution functions. You (or that other guy) didn't even give the number of draws, which is important too. I don't know the answer, but I know that even using "Mathematica", the rigorous result is not straightforward.

Kai

hgm · Post by **hgm** » Tue Jan 03, 2012 10:06 pm

I don't get it. The drawValue shouldn't affect the ratings in BayesElo, should it? Given a certain rating difference x, one can calculate the probability for a draw, and it will be higher if drawValue is higher (because it will be equal to F(x+drawValue) - F(x-drawValue), where F is the cumulative Elo distribution). But unless drawValue is ridiculously large, the shape of this draw probability distribution is practically independent of it, as the expression is a quite accurate estimate for 2*drawValue*(d/dx)F(x), i.e. proportional to the Bell-shaped Elo curve itself.
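
The approximation in the previous paragraph can be checked numerically. This sketch assumes F is the logistic Elo curve (its derivative is the bell-shaped curve referred to above) and compares the exact draw probability F(x+drawValue) - F(x-drawValue) with 2*drawValue*F'(x). The function names are illustrative.

```python
import math

def F(x):
    """Cumulative (logistic) Elo curve: expected score at rating difference x."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))

def dF(x):
    """Derivative of F, the bell-shaped density."""
    p = F(x)
    return math.log(10.0) / 400.0 * p * (1.0 - p)

def draw_exact(x, draw_value):
    return F(x + draw_value) - F(x - draw_value)

def draw_approx(x, draw_value):
    return 2.0 * draw_value * dF(x)

for draw_value in (50.0, 100.0, 200.0):
    print(f"drawValue = {draw_value:.0f}")
    for x in (0.0, 100.0, 200.0, 400.0):
        exact = draw_exact(x, draw_value)
        approx = draw_approx(x, draw_value)
        # Shape check: draw probability relative to its value at x = 0.
        shape = exact / draw_exact(0.0, draw_value)
        print(f"  x={x:5.0f}  exact={exact:.4f}  approx={approx:.4f}  "
              f"exact/exact(0)={shape:.3f}  F'(x)/F'(0)={dF(x) / dF(0.0):.3f}")
```

Comparing the last two columns across drawValue settings shows how far the shape of the draw curve actually depends on the parameter.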

In Bayesian analysis, only the shape of the curve is important, and the absolute magnitude is divided out. If it sees there is a draw game, no matter how small the probability of draw games in general, it will always be taken as evidence that the ratings are close (+/- the SD of the Elo curve). That it judges the draw probability quite small factors out.

That BayesElo tends to strongly compress the rating scale with the standard prior (of 2 draws) if the group has a very wide Elo range (e.g. spread over >1000 Elo) is well known. You only need a single game between a top and a bottom engine, and it will never believe that their difference can be anywhere near 1000 Elo, as it counts two draws between them, and this would get an astronomically small probability when the players are >1000 Elo apart, as most Elo models have exponential tails. It would rather believe that all other differences along the scale are mostly due to luck than believe those two virtual draws between such widely separated players!
