EloStat, Bayeselo and Ordo

hgm · Post by **hgm** » Tue Jun 26, 2012 8:44 am

michiguel wrote:It is what I asked up in the thread
"How is it explained that you copy the same results 4 times and obtained different results? that is not related to draws or anything like that. "

Kai reported that, and it looks like a drawback. I was curious why.

No, it is not a drawback, but the harsh reality of statistical life. If someone scores 3 out of 4, in most cases this is simply due to luck, and attaching too much meaning to it will significantly overrate him. But if he scores 300 out of 400, you can be pretty sure it is because he was stronger.

Rémi Coulom · Post by **Rémi Coulom** » Tue Jun 26, 2012 10:07 am

hgm wrote:Sorry, I am mixing things up. It was Edmund who gave the link above, and also compiled the data. In particular the graph in

http://talkchess.com/forum/viewtopic.ph ... 76&t=42729

I remember that discussion, but it did not compare the prediction ability of different models.

Rémi

hgm · Post by **hgm** » Tue Jun 26, 2012 10:31 am

It compared the correctness of the models, which should be the same thing. If the curve used by a model correctly describes the WDL probabilities, it follows by pure math how well it will predict.

Ozymandias · Post by **Ozymandias** » Tue Jun 26, 2012 1:31 pm

Did you check the Deloitte/FIDE Chess Rating Challenge?

hgm · Post by **hgm** » Tue Jun 26, 2012 2:22 pm

No, but I don't expect it to be of much interest. Determining ratings of humans is a completely different game, because their ratings vary in time.

michiguel · Post by **michiguel** » Tue Jun 26, 2012 5:48 pm

hgm wrote:
michiguel wrote:It is what I asked up in the thread
"How is it explained that you copy the same results 4 times and obtained different results? that is not related to draws or anything like that. "

Kai reported that, and it looks like a drawback. I was curious why.
No, it is not a drawback, but the harsh reality of statistical life. If someone scores 3 out of 4, in most cases this is simply due to luck, and attaching too much meaning to it will significantly overrate him. But if he scores 300 out of 400, you can be pretty sure it is because he was stronger.

No, it is a drawback. If you score 3/4, the measure should give exactly the same value as 300/400. The error of the measure should be different.

Miguel

hgm · Post by **hgm** » Tue Jun 26, 2012 6:21 pm

michiguel wrote:No, it is a drawback. If you score 3/4, the measure should give exactly the same value as 300/400. The error of the measure should be different.

Unfortunately hard math tells us that is not true. Unless you are prepared to accept asymmetric error bars, which is the same thing as admitting you are using a faulty average. Calculate it yourself, if you don't believe it. The likelihood for heads probability p in a coin flip after observing 3 heads is (unnormalized) p^3 * (1-p). The expectation value of this distribution is NOT 3/4, but 2/3.

Laskos · Post by **Laskos** » Tue Jun 26, 2012 11:11 pm

hgm wrote:
michiguel wrote:No, it is a drawback. If you score 3/4, the measure should give exactly the same value as 300/400. The error of the measure should be different.
Unfortunately hard math tells us that is not true. Unless you are prepared to accept asymmetric error bars, which is the same thing as admitting you are using a faulty average. Calculate it yourself, if you don't believe it. The likelihood for heads probability p in a coin flip after observing 3 heads is (unnormalized) p^3 * (1-p). The expectation value of this distribution is NOT 3/4, but 2/3.

Yes, but this wouldn't explain some large differences for 90 and 360 games. By the way, if I am not wrong, the result is not exactly 2/3, more like 0.69. And the maximum likelihood is still at 3/4. As for 90 and 360, the results are 0.747 and 0.749, which would be hardly visible in ratings, which, it seems, are compressed in Bayeselo by the artificial draw rules and priors.

Kai

hgm · Post by **hgm** » Wed Jun 27, 2012 7:29 am

Laskos wrote:Yes, but this wouldn't explain some large differences for 90 and 360 games. By the way, if I am not wrong, the result is not exactly 2/3, more like 0.69. And the maximum likelihood is still at 3/4. As for 90 and 360, the results are 0.747 and 0.749, which would be hardly visible in ratings, which, it seems, are compressed in Bayeselo by the artificial draw rules and priors.

INT from 0 to 1 p^3 (1-p) dp = [ 1/4 p^4 - 1/5 p^5 ] from 0 to 1 = 1/4 - 1/5 = 1/20

INT from 0 to 1 p^4 (1-p) dp = [ 1/5 p^5 - 1/6 p^6 ] from 0 to 1 = 1/5 - 1/6 = 1/30

E(p) = (1/30) / (1/20) = 2/3
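For anyone who wants to check these integrals without doing them by hand, here is a minimal sketch in exact rational arithmetic. It assumes a uniform prior on p, which is what using the bare likelihood as a distribution amounts to. The function names `beta_integral` and `posterior_mean` are made up for this illustration and are not taken from any rating program.

```python
from fractions import Fraction
from math import comb


def beta_integral(a, b):
    """Exact value of INT from 0 to 1 of p^a (1-p)^b dp = a! b! / (a+b+1)!"""
    return Fraction(1, (a + b + 1) * comb(a + b, a))


def posterior_mean(wins, losses):
    """Expectation of p under the unnormalized density p^wins (1-p)^losses."""
    return beta_integral(wins + 1, losses) / beta_integral(wins, losses)


small = posterior_mean(3, 1)      # 2/3
large = posterior_mean(300, 100)  # 301/402, about 0.7488
```

Under this assumption the expectation works out to (wins + 1) / (wins + losses + 2), so 3 out of 4 and 300 out of 400 share the same maximum-likelihood value of 3/4 but not the same average.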

The maximum likelihood is indeed at 3/4, but that is not what is significant for making accurate predictions. The prediction error is minimal in the sense of least squares only when you predict the average.
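The least-squares remark can be checked on the same 3-out-of-4 example. This is only a sketch under the assumption of a uniform prior on p, with "prediction error" taken to mean the expected squared difference between the guess and the true p; `expected_squared_error` is a name invented for this illustration.

```python
from fractions import Fraction
from math import comb


def beta_integral(a, b):
    """Exact value of INT from 0 to 1 of p^a (1-p)^b dp = a! b! / (a+b+1)!"""
    return Fraction(1, (a + b + 1) * comb(a + b, a))


def expected_squared_error(guess, wins=3, losses=1):
    """E[(p - guess)^2] under the unnormalized density p^wins (1-p)^losses."""
    norm = beta_integral(wins, losses)
    m1 = beta_integral(wins + 1, losses) / norm  # E[p]
    m2 = beta_integral(wins + 2, losses) / norm  # E[p^2]
    return m2 - 2 * guess * m1 + guess ** 2


err_mean = expected_squared_error(Fraction(2, 3))  # guess the average
err_ml = expected_squared_error(Fraction(3, 4))    # guess the ML value
```

With these assumptions the average gives 2/63 and the maximum-likelihood value gives 13/336, so guessing 3/4 costs an extra (3/4 - 2/3)^2 = 1/144 in expected squared error.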

As for the difference between 90 and 360 games I would not know, without further looking into what games exactly these were. I suppose this was not a plain match between two opponents, because in that case indeed the effect of the prior should be negligible. Note furthermore that the prior can be switched off in BayesElo.

There is nothing 'artificial' about the 'draw rules' (assuming you mean double-counting of draws). This aspect of the model was confirmed by analysis of the actual computer data. The likelihood of a single draw is indeed equal to that of one win plus one loss within a reasonable accuracy. Any analysis that does not take account of that fact just sucks, in the sense that it expands the rating scale, predicting too extreme results between the top and bottom dwellers.
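To make the "one draw equals one win plus one loss" statement concrete, here is a small sketch. It assumes the logistic draw model usually attributed to BayesElo, where P(win) = f(delta - elo_draw), P(loss) = f(-delta - elo_draw) and the draw gets the remainder; the value elo_draw = 100 and all function names are placeholders chosen for this illustration, not settings read from the program.

```python
def f(x):
    """Logistic Elo curve: expected score for a rating advantage x."""
    return 1.0 / (1.0 + 10.0 ** (-x / 400.0))


def wdl(delta, elo_draw=100.0):
    """Win/draw/loss probabilities for rating difference delta (assumed model)."""
    win = f(delta - elo_draw)
    loss = f(-delta - elo_draw)
    return win, 1.0 - win - loss, loss


def draw_ratio(delta, elo_draw=100.0):
    """P(draw) / (P(win) * P(loss)); constant in delta under this model."""
    win, draw, loss = wdl(delta, elo_draw)
    return draw / (win * loss)


ratios = [draw_ratio(d) for d in (0.0, 100.0, 300.0, 600.0)]
```

Under that assumed model the ratio equals 10^(2 * elo_draw / 400) - 1 for every rating difference, so as a function of the ratings the likelihood of a draw is proportional to that of one win times one loss. Whether real game data follow this curve is a separate, empirical question.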

Laskos · Post by **Laskos** » Wed Jun 27, 2012 11:32 am

hgm wrote:
Laskos wrote:Yes, but this wouldn't explain some large differences for 90 and 360 games. By the way, if I am not wrong, the result is not exactly 2/3, more like 0.69. And the maximum likelihood is still at 3/4. As for 90 and 360, the results are 0.747 and 0.749, which would be hardly visible in ratings, which, it seems, are compressed in Bayeselo by the artificial draw rules and priors.
INT from 0 to 1 p^3 (1-p) dp = [ 1/4 p^4 - 1/5 p^5 ] from 0 to 1 = 1/4 - 1/5 = 1/20

INT from 0 to 1 p^4 (1-p) dp = [ 1/5 p^5 - 1/6 p^6 ] from 0 to 1 = 1/5 - 1/6 = 1/30

E(p) = (1/30) / (1/20) = 2/3

The maximum likelihood is indeed at 3/4, but that is not what is significant for making accurate predictions. The prediction error is minimal in the sense of least squares only when you predict the average.

As for the difference between 90 and 360 games I would not know, without further looking into what games exactly these were. I suppose this was not a plain match between two opponents, because in that case indeed the effect of the prior should be negligible. Note furthermore that the prior can be switched off in BayesElo.

There is nothing 'artificial' about the 'draw rules' (assuming you mean double-counting of draws). This aspect of the model was confirmed by analysis of the actual computer data. The likelihood of a single draw is indeed equal to that of one win plus one loss within a reasonable accuracy. Any analysis that does not take account of that fact just sucks, in the sense that it expands the rating scale, predicting too extreme results between the top and bottom dwellers.

Sorry, it's 2/3 indeed, and the max. likelihood is 3/4. I hope the other two numbers are correct; they show that for some 90 to 360 games the effect is pretty irrelevant. Probably the claim that 1 draw equals 1 win plus 1 loss was badly derived from the computer data. How do you explain the blue dots compared to the green line in the second plot of Edmund, and the other empirical data? http://talkchess.com/forum/viewtopic.ph ... =&start=10
You have to admit that this draw rule is necessarily an approximation which may compress the predictions (and does).

Kai
