EloStat, Bayeselo and Ordo
Moderators: Harvey Williamson, bob, hgm
 hgm
 Posts: 24935
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
 Contact:
Re: EloStat, Bayeselo and Ordo
michiguel wrote: It is what I asked up in the thread:
"How is it explained that you copy the same results 4 times and obtain different results? That is not related to draws or anything like that."
Kai reported that, and it looks like a drawback. I was curious why.
No, it is not a drawback, but the harsh reality of statistical life. If someone scores 3 out of 4, in most cases this is simply due to luck, and attaching too much meaning to it will significantly overrate him. But if he scores 300 out of 400, you can be pretty sure it is because he was stronger.

 Posts: 436
 Joined: Mon Apr 24, 2006 6:06 pm
 Contact:
Re: EloStat, Bayeselo and Ordo
hgm wrote: Sorry, I am mixing things up. It was Edmund who gave the link above, and also compiled the data. In particular the graph in
http://talkchess.com/forum/viewtopic.ph ... 76&t=42729
I remember that discussion, but it did not compare the prediction ability of different models.
Rémi
 hgm
 Posts: 24935
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
 Contact:
Re: EloStat, Bayeselo and Ordo
It compared the correctness of the models, which should be the same thing. If the curve used by a model correctly describes the WDL probabilities, it follows by pure math how well it will predict.
 Ozymandias
 Posts: 1222
 Joined: Sun Oct 25, 2009 12:30 am
Re: EloStat, Bayeselo and Ordo
Did you check the Deloitte/FIDE Chess Rating Challenge?
 hgm
 Posts: 24935
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
 Contact:
Re: EloStat, Bayeselo and Ordo
No, but I don't expect it to be of much interest. Determining ratings of humans is a completely different game, because their ratings vary in time.
Re: EloStat, Bayeselo and Ordo
hgm wrote: No, it is not a drawback, but the harsh reality of statistical life. If someone scores 3 out of 4, in most cases this is simply due to luck, and attaching too much meaning to it will significantly overrate him. But if he scores 300 out of 400, you can be pretty sure it is because he was stronger.
No, it is a drawback. If you score 3/4, the measure should give exactly the same value as 300/400. The error of the measure should be different.
Miguel
 hgm
 Posts: 24935
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
 Contact:
Re: EloStat, Bayeselo and Ordo
michiguel wrote: No, it is a drawback. If you score 3/4, the measure should give exactly the same value as 300/400. The error of the measure should be different.
Unfortunately hard math tells us that is not true, unless you are prepared to accept asymmetric error bars, which is the same thing as admitting you are using a faulty average. Calculate it yourself, if you don't believe it. The likelihood for heads probability p in a coin flip after observing 3 heads in 4 flips is (unnormalized) p^3 * (1-p). The expectation value of this distribution is NOT 3/4, but 2/3.
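For anyone who wants to check this without doing the integrals by hand, here is a minimal sketch in Python. It assumes a flat prior on p and a plain win/loss outcome (no draws), which is my reading of the coin-flip example. The helper names `beta_integral` and `posterior_mean` are made up for illustration and come from no rating tool.

```python
from fractions import Fraction
from math import factorial

def beta_integral(a, b):
    """Exact value of INT from 0 to 1 of p^a (1-p)^b dp = a! b! / (a+b+1)!."""
    return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))

def posterior_mean(wins, losses):
    """Mean of the normalized likelihood p^wins (1-p)^losses (flat prior assumed)."""
    return beta_integral(wins + 1, losses) / beta_integral(wins, losses)

for wins, losses in [(3, 1), (300, 100)]:
    games = wins + losses
    mean = posterior_mean(wins, losses)
    print(f"{wins}/{games}: max. likelihood = {wins / games:.4f}, mean = {float(mean):.4f}")
```

Under these assumptions the mean works out to (wins + 1) / (games + 2), so the same 75% score gives a mean further below 3/4 for 4 games than for 400 games.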
Re: EloStat, Bayeselo and Ordo
hgm wrote: Unfortunately hard math tells us that is not true, unless you are prepared to accept asymmetric error bars, which is the same thing as admitting you are using a faulty average. Calculate it yourself, if you don't believe it. The likelihood for heads probability p in a coin flip after observing 3 heads in 4 flips is (unnormalized) p^3 * (1-p). The expectation value of this distribution is NOT 3/4, but 2/3.
Yes, but this wouldn't explain some large differences for 90 and 360 games. By the way, if I am not wrong, the result is not exactly 2/3, more like 0.69. And the maximum likelihood is still at 3/4. As for 90 and 360, the results are 0.747 and 0.749, which would be hardly visible in ratings, which, it seems, are compressed in Bayeselo by the artificial draw rules and priors.
Kai
 hgm
 Posts: 24935
 Joined: Fri Mar 10, 2006 9:06 am
 Location: Amsterdam
 Full name: H G Muller
 Contact:
Re: EloStat, Bayeselo and Ordo
Laskos wrote: Yes, but this wouldn't explain some large differences for 90 and 360 games. By the way, if I am not wrong, the result is not exactly 2/3, more like 0.69. And the maximum likelihood is still at 3/4. As for 90 and 360, the results are 0.747 and 0.749, which would be hardly visible in ratings, which, it seems, are compressed in Bayeselo by the artificial draw rules and priors.
INT from 0 to 1 p^3 (1-p) dp = [ 1/4 p^4 - 1/5 p^5 ] from 0 to 1 = 1/4 - 1/5 = 1/20
INT from 0 to 1 p^4 (1-p) dp = [ 1/5 p^5 - 1/6 p^6 ] from 0 to 1 = 1/5 - 1/6 = 1/30
E(p) = (1/30) / (1/20) = 2/3
The maximum likelihood is indeed at 3/4, but that is not what is significant for making accurate predictions. The prediction error is minimal in the sense of least squares only when you predict the average.
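The least-squares remark can be checked with exact fractions. The sketch below assumes the same 3-out-of-4 coin-flip likelihood with a flat prior. It compares the expected squared error of guessing the mean (2/3) with that of guessing the maximum-likelihood value (3/4). The function names `moment` and `expected_squared_error` are illustrative only.

```python
from fractions import Fraction
from math import factorial

def moment(k, wins=3, losses=1):
    """k-th moment of p under the normalized likelihood p^wins (1-p)^losses."""
    def integral(a, b):
        # INT from 0 to 1 of p^a (1-p)^b dp = a! b! / (a+b+1)!
        return Fraction(factorial(a) * factorial(b), factorial(a + b + 1))
    return integral(wins + k, losses) / integral(wins, losses)

def expected_squared_error(guess):
    """E[(p - guess)^2] = E[p^2] - 2*guess*E[p] + guess^2."""
    return moment(2) - 2 * guess * moment(1) + guess ** 2

mean_error = expected_squared_error(Fraction(2, 3))  # guessing the average
mle_error = expected_squared_error(Fraction(3, 4))   # guessing the max. likelihood
print(f"mean: {mean_error}, max. likelihood: {mle_error}")
```

The two errors differ by exactly (3/4 - 2/3)^2, the squared distance between the guess and the mean, so any guess other than the mean does worse in this sense.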
As for the difference between 90 and 360 games I would not know, without further looking into what games exactly these were. I suppose this was not a plain match between two opponents, because in that case indeed the effect of the prior should be negligible. Note furthermore that the prior can be switched off in BayesElo.
There is nothing 'artificial' about the 'draw rules' (assuming you mean double-counting of draws). This aspect of the model was confirmed by analysis of the actual computer data. The likelihood of a single draw is indeed equal to that of one win plus one loss within a reasonable accuracy. Any analysis that does not take account of that fact just sucks, in the sense that it expands the rating scale, predicting too extreme results between the top and bottom dwellers.
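To make the "one draw equals one win plus one loss" statement concrete, here is a rough sketch. It assumes a logistic win/loss curve with a fixed draw offset, which is how I understand this kind of model; the thread itself does not spell out the formula. The function `wdl_probabilities` and the value `draw_elo = 100.0` are hypothetical and not taken from any program.

```python
def wdl_probabilities(elo_diff, draw_elo):
    """Win/draw/loss probabilities for a logistic model with a draw offset (assumed form)."""
    def f(x):
        return 1.0 / (1.0 + 10.0 ** (-x / 400.0))
    win = f(elo_diff - draw_elo)
    loss = f(-elo_diff - draw_elo)
    return win, 1.0 - win - loss, loss

draw_elo = 100.0  # hypothetical draw offset, for illustration only
ratios = []
for elo_diff in (0.0, 100.0, 300.0):
    win, draw, loss = wdl_probabilities(elo_diff, draw_elo)
    ratios.append(draw / (win * loss))
    print(f"diff {elo_diff:5.0f}: W={win:.3f} D={draw:.3f} L={loss:.3f} D/(W*L)={ratios[-1]:.4f}")
```

In this assumed model the ratio D/(W*L) does not depend on the rating difference, so as a function of the ratings a draw has the same likelihood as one win plus one loss, up to a constant factor. Whether real game data follow that curve is the empirical question argued about here.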
Re: EloStat, Bayeselo and Ordo
hgm wrote: INT from 0 to 1 p^3 (1-p) dp = [ 1/4 p^4 - 1/5 p^5 ] from 0 to 1 = 1/4 - 1/5 = 1/20
INT from 0 to 1 p^4 (1-p) dp = [ 1/5 p^5 - 1/6 p^6 ] from 0 to 1 = 1/5 - 1/6 = 1/30
E(p) = (1/30) / (1/20) = 2/3
The maximum likelihood is indeed at 3/4, but that is not what is significant for making accurate predictions. The prediction error is minimal in the sense of least squares only when you predict the average.
As for the difference between 90 and 360 games I would not know, without further looking into what games exactly these were. I suppose this was not a plain match between two opponents, because in that case indeed the effect of the prior should be negligible. Note furthermore that the prior can be switched off in BayesElo.
There is nothing 'artificial' about the 'draw rules' (assuming you mean double-counting of draws). This aspect of the model was confirmed by analysis of the actual computer data. The likelihood of a single draw is indeed equal to that of one win plus one loss within a reasonable accuracy. Any analysis that does not take account of that fact just sucks, in the sense that it expands the rating scale, predicting too extreme results between the top and bottom dwellers.
Sorry, it's 2/3 indeed, and the max. likelihood is 3/4. I hope the other two numbers are correct; they show that for some 90 to 360 games the effect is pretty irrelevant. Probably the rule that 1 draw equals 1 win plus 1 loss was badly analysed from the computer data. How do you explain the blue dots compared to the green line in the second plot of Edmund and other empirical data? http://talkchess.com/forum/viewtopic.ph ... =&start=10
You have to admit that this draw rule is necessarily an approximation which may compress the predictions (and does).
Kai