New rating system research

vladstamate
Posts: 161
Joined: Thu Jan 08, 2009 9:06 pm
Location: San Francisco, USA

New rating system research

Post by vladstamate »

Found this today:

http://games.slashdot.org/story/10/08/0 ... e-Over-Elo


""Less than 24 hours ago, Jeff Sonas, the creator of the Chessmetrics rating system, launched a competition to find a chess rating algorithm that performs better than the official Elo rating system. The competition requires entrants to build their rating systems based on the results of more than 65,000 historical chess games. Entrants then test their algorithms by predicting the results of another 7,809 games. Already three teams have managed create systems that make more accurate predictions than the official Elo approach. It's not a surprise that Elo has been outdone — after all, the system was invented half a century ago before we could easily crunch large amounts of historical data. However, it is a big surprise that Elo has been bettered so quickly!""


Regards,
Vlad.
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: New rating system research

Post by Gian-Carlo Pascutto »

In fact, at this time, there are 25 submissions and only one has outperformed Elo.

This is quite bad, considering that considerable research into better systems already existed (Glicko, for example).

That said, surely it must be possible to do much better now.
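
For context, the Elo prediction any submission has to beat boils down to the standard expected-score formula. A minimal sketch in Python (the competition's exact benchmark implementation isn't given here, so treat this as illustrative):

def elo_expected(r_a, r_b):
    # Expected score for player A against player B under the Elo model.
    return 1.0 / (1.0 + 10.0 ** ((r_b - r_a) / 400.0))

print(elo_expected(2600, 2500))  # ~0.64: a 100-point edge is worth about 64%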
Dann Corbit
Posts: 12542
Joined: Wed Mar 08, 2006 8:57 pm
Location: Redmond, WA USA

Re: New rating system research

Post by Dann Corbit »

vladstamate wrote:Found this today:

http://games.slashdot.org/story/10/08/0 ... e-Over-Elo


""Less than 24 hours ago, Jeff Sonas, the creator of the Chessmetrics rating system, launched a competition to find a chess rating algorithm that performs better than the official Elo rating system. The competition requires entrants to build their rating systems based on the results of more than 65,000 historical chess games. Entrants then test their algorithms by predicting the results of another 7,809 games. Already three teams have managed create systems that make more accurate predictions than the official Elo approach. It's not a surprise that Elo has been outdone — after all, the system was invented half a century ago before we could easily crunch large amounts of historical data. However, it is a big surprise that Elo has been bettered so quickly!""


Regards,
Vlad.
This link is quite old, but it is interesting and on the same topic:
http://www.ratingtheory.com/

Here is a program that demonstrates his methods:
http://cap.connx.com/tournament_software/prog10.cpp
QED
Posts: 60
Joined: Thu Nov 05, 2009 9:53 pm

Re: New rating system research

Post by QED »

Which Elo? The one computed by EloStat, or the one computed by BayesElo?
Uri Blass
Posts: 10315
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: New rating system research

Post by Uri Blass »

Gian-Carlo Pascutto wrote:In fact, at this time, there are 25 submissions and only one has outperformed Elo.

This is quite bad, considering that considerable research into better systems already existed (Glicko, for example).

That said, surely it must be possible to do much better now.
I see no reason to hurry.
Submissions must be made by 2:00 am on Monday, 15 November 2010, and I do not see why people would hand in their entries more than three months before the deadline.

I am surprised that there are already so many submissions.

Uri
CRoberson
Posts: 2056
Joined: Mon Mar 13, 2006 2:31 am
Location: North Carolina, USA

Re: New rating system research

Post by CRoberson »

Gian-Carlo Pascutto wrote:In fact, at this time, there are 25 submissions and only one has outperformed Elo.

This is quite bad, considering that considerable research into better systems already existed (Glicko, for example).

That said, surely it must be possible to do much better now.
There are two Glicko systems (1 and 2). Glicko 1 is not good: it leaves out an important concept that was obvious when it first went live on FICS, and Glicko 2 attempts to fix the issue. Glicko 1 leaves out the concept of "practice makes perfect"; it only uses the concept of "practice makes you consistent": the more you play, the less your rating can change.

On FICS (years ago, and maybe still), this means you can play until the RD value gets small, then play for a month on any server other than FICS. During that time your FICS RD value will increase, because lack of practice increases the assumed variance and decreases consistency. Once the FICS RD value is high, switch back to FICS.

If you don't believe me, look at the math yourself on Dr. Glickman's web pages, or (for more fun) put a weak program on FICS and run it until the RD value is very low, then immediately run a much stronger one for a long time. Actually, don't do that: your program will not get a better rating, and it will pull the other ratings down, since Glicko 1 is not a zero-sum system.

Dr. Glickman figured this out later and created Glicko 2.
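
A minimal sketch of the loophole described above, using the Glicko-1 pre-period RD formula from Glickman's paper. The choice of c (RD drifting from 50 back to 350 over 100 idle periods) is an assumption for this example, not FICS's actual setting:

import math

RD_MAX = 350.0  # the RD of a brand-new, unrated player
# Assumed for illustration: RD returns from 50 to 350 after 100 idle periods.
C = math.sqrt((RD_MAX ** 2 - 50.0 ** 2) / 100.0)

def pre_period_rd(rd, idle_periods):
    # Glicko-1: rating uncertainty grows with inactivity, capped at RD_MAX.
    return min(math.sqrt(rd ** 2 + C ** 2 * idle_periods), RD_MAX)

rd = 50.0  # a well-established, "consistent" rating
for month in range(1, 7):
    rd = pre_period_rd(rd, 1)
    print(month, round(rd, 1))  # RD climbs back up, so the next few results
                                # can move the rating far more than they should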
Gian-Carlo Pascutto
Posts: 1243
Joined: Sat Dec 13, 2008 7:00 pm

Re: New rating system research

Post by Gian-Carlo Pascutto »

The question is not whether Glicko 1 is perfect. If it were, there would be no Glicko 2.

The relevant question here is whether it is better than Elo in predictive power. It would be interesting to see the result of Glicko 1 on the dataset (perhaps with a lower bound on RD, which Glickman suggested afterwards).

Your example of swapping a weak computer for a strong one is not a very good one. They are effectively two separate players (if not, there can be no concept of one being weak and the other strong!). Human strength changes much more smoothly than that, so the problem isn't as pronounced there.
Uri Blass
Posts: 10315
Joined: Thu Mar 09, 2006 12:37 am
Location: Tel-Aviv Israel

Re: New rating system research

Post by Uri Blass »

Gian-Carlo Pascutto wrote:The question is not whether Glicko 1 is perfect. If it were, there would be no Glicko 2.

The relevant question here is whether it is better than Elo in predictive power. It would be interesting to see the result of Glicko 1 on the dataset (perhaps with a lower bound on RD, which Glickman suggested afterwards).

Your example of swapping a weak computer for a strong one is not a very good one. They are effectively two separate players (if not, there can be no concept of one being weak and the other strong!). Human strength changes much more smoothly than that, so the problem isn't as pronounced there.
Of course I expect something better than Elo; Elo has weaknesses.

Elo cannot predict a case in which player A scores 100% with White and 0% with Black.

That is an extreme case which of course does not happen, but there are players who score better than their rating with one color and worse than their rating with the other.

There are many other things Elo cannot predict, and the competition is not about finding a new rating system but about predicting the results better.

A better rating for the players can help, but it is only a tool and not the target, and it is possible to use more than just the players' ratings to predict the results.

Uri
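
One simple illustration of the color point: fold a global first-move advantage and per-player color offsets into the Elo expectation. The +35 White bonus and the offsets below are made-up numbers for the sketch, not fitted values:

def expected_white_score(r_w, r_b, white_bonus=35.0, off_w=0.0, off_b=0.0):
    # Elo expectation with a global White advantage plus per-player color
    # offsets (positive = stronger with that color than the rating suggests).
    diff = (r_w + white_bonus + off_w) - (r_b + off_b)
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))

# Two 2500s: White overperforms with White, Black underperforms with Black.
print(expected_white_score(2500.0, 2500.0, off_w=20.0, off_b=-20.0))  # ~0.61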
Don
Posts: 5106
Joined: Tue Apr 29, 2008 4:27 pm

Re: New rating system research

Post by Don »

Uri Blass wrote:
Gian-Carlo Pascutto wrote:The question is not whether Glicko 1 is perfect. If it were, there would be no Glicko 2.

The relevant question here is whether it is better than Elo in predictive power. It would be interesting to see the result of Glicko 1 on the dataset (perhaps with a lower bound on RD, which Glickman suggested afterwards).

Your example of swapping a weak computer for a strong one is not a very good one. They are effectively two separate players (if not, there can be no concept of one being weak and the other strong!). Human strength changes much more smoothly than that, so the problem isn't as pronounced there.
Of course I expect something better than Elo; Elo has weaknesses.

Elo cannot predict a case in which player A scores 100% with White and 0% with Black.

That is an extreme case which of course does not happen, but there are players who score better than their rating with one color and worse than their rating with the other.

There are many other things Elo cannot predict, and the competition is not about finding a new rating system but about predicting the results better.

A better rating for the players can help, but it is only a tool and not the target, and it is possible to use more than just the players' ratings to predict the results.

Uri
I'm not sure anything substantially better exists, because the basic Elo formulation is a sound mathematical model of playing strength.

So if this is about tweaking things, it's surely possible to change the constants in the formula and to consider many other details, such as color, how active the players are, and so on.

I think one thing that might boost the predictive ability a bit is to consider the history of the player. Some players are much more consistent than others, and some do much better with one color than the other (I knew a player who believed that as White he should always try to win, while as Black he tended to be eager for a draw).

In reality, Elo is one-dimensional (a single number represents your playing ability), but humans and computers are multi-dimensional, so if that could somehow be captured, it might be possible to make the rating system more accurate. I would think such a sophisticated system would need many more games to sample properly, though, and I have no idea how one would construct it, or even how to identify what the other dimensions of playing strength are and how they interact.

If that could be done, then we could compare computers and humans without having to do rating adjustments, because it's well known that computer vs. computer play produces different results than human vs. computer play.
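
A toy version of that idea: keep the rating difference as the main feature, but let a model learn weights for extra "dimensions". The features and data below are hypothetical, and this is plain logistic regression, not a claim about what any actual competition entry does:

import numpy as np
from sklearn.linear_model import LogisticRegression

# One row per game: [elo_diff, volatility_diff, recent_games_diff]
# (hypothetical features; draws would need a different target encoding).
X = np.array([
    [ 100.0,  0.2,  50.0],
    [ -50.0, -0.1, -20.0],
    [ 200.0,  0.0, 100.0],
    [-150.0,  0.3, -80.0],
])
y = np.array([1, 0, 1, 0])  # 1 = White won

model = LogisticRegression().fit(X, y)
print(model.predict_proba([[30.0, 0.1, 10.0]])[:, 1])  # estimated P(White wins)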
bob
Posts: 20943
Joined: Mon Feb 27, 2006 7:30 pm
Location: Birmingham, AL

Re: New rating system research

Post by bob »

Don wrote:
Uri Blass wrote:
Gian-Carlo Pascutto wrote:The question is not whether Glicko 1 is perfect. If it were, there would be no Glicko 2.

The relevant question here is whether it is better than Elo in predictive power. It would be interesting to see the result of Glicko 1 on the dataset (perhaps with a lower bound on RD, which Glickman suggested afterwards).

Your example of swapping a weak computer for a strong one is not a very good one. They are effectively two separate players (if not, there can be no concept of one being weak and the other strong!). Human strength changes much more smoothly than that, so the problem isn't as pronounced there.
Of course I expect something better than Elo; Elo has weaknesses.

Elo cannot predict a case in which player A scores 100% with White and 0% with Black.

That is an extreme case which of course does not happen, but there are players who score better than their rating with one color and worse than their rating with the other.

There are many other things Elo cannot predict, and the competition is not about finding a new rating system but about predicting the results better.

A better rating for the players can help, but it is only a tool and not the target, and it is possible to use more than just the players' ratings to predict the results.

Uri
I'm not sure anything substantially better exists, because the basic Elo formulation is a sound mathematical model of playing strength.

So if this is about tweaking things, it's surely possible to change the constants in the formula and to consider many other details, such as color, how active the players are, and so on.

I think one thing that might boost the predictive ability a bit is to consider the history of the player. Some players are much more consistent than others, and some do much better with one color than the other (I knew a player who believed that as White he should always try to win, while as Black he tended to be eager for a draw).

In reality, Elo is one-dimensional (a single number represents your playing ability), but humans and computers are multi-dimensional, so if that could somehow be captured, it might be possible to make the rating system more accurate. I would think such a sophisticated system would need many more games to sample properly, though, and I have no idea how one would construct it, or even how to identify what the other dimensions of playing strength are and how they interact.

If that could be done, then we could compare computers and humans without having to do rating adjustments, because it's well known that computer vs. computer play produces different results than human vs. computer play.
I think the main weakness of Elo, addressed by the Glicko approach, is that it was designed around relatively infrequent play. A human might play in a tournament every few months; a GM might play more often, maybe once a month. Today's players are radically different, particularly computers, which can play tens of thousands of games per hour. The Elo system has to respond to strength changes, whereas the Glicko system responds more and more slowly the more frequently you play, on the grounds that nobody's real strength slides up and down by 200+ points over a 24-hour period, even though we see ratings do exactly that on chess servers all the time.

So other than the "less change as you play more games in a given period of time" behavior, I have not seen any drawback, assuming you want a single number that estimates playing strength relative to other players in the same pool. You could factor in health, endurance, mood, etc., and end up with a system too complex to use.
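
To make the "less change as you play more" point concrete, here is a minimal sketch of the Glicko-1 single-game update (formulas per Glickman's paper; the ratings and RDs are made up). The same upset win moves a high-RD rating by hundreds of points but a low-RD rating by only a handful:

import math

Q = math.log(10) / 400.0

def g(rd):
    # Attenuation factor: the more uncertain the opponent's rating,
    # the less weight the result carries.
    return 1.0 / math.sqrt(1.0 + 3.0 * (Q * rd / math.pi) ** 2)

def glicko_update(r, rd, r_opp, rd_opp, score):
    # One-game Glicko-1 update; returns (new_rating, new_rd).
    e = 1.0 / (1.0 + 10.0 ** (-g(rd_opp) * (r - r_opp) / 400.0))
    d2 = 1.0 / (Q * Q * g(rd_opp) ** 2 * e * (1.0 - e))
    denom = 1.0 / rd ** 2 + 1.0 / d2
    return r + (Q / denom) * g(rd_opp) * (score - e), math.sqrt(1.0 / denom)

# Same upset win at 1500 vs. a 1700, once as a newcomer, once established:
for rd in (350.0, 50.0):
    print(rd, glicko_update(1500.0, rd, 1700.0, 50.0, 1.0))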