1 draw=1 win + 1 loss (always!)

Discussion of chess software programming and technical issues.

Moderators: hgm, Rebel, chrisw

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: 1 draw=1 win + 1 loss (always!)

Post by Michel »

hgm wrote: It is true that one model compresses the rating scale compared to another, but that should not matter. Models that differ in how they weight draws relative to wins and losses should eventually (in the limit of a large number of games) provide the same ratings, except for the scaling.

This is true for small Elo differences but not for large ones. For example, LogisticElo and BayesElo differ by a constant for large Elo differences, not by a scale factor.

For two players there is a function that converts one type of Elo into another. But for more than two players this no longer works, at least not mathematically: the conversion function depends on the game population.

For example, for the LogisticElo/BayesElo conversion there seems to be a "heuristic" scale factor which is slightly different from the theoretical one for two players.
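
To make the constant-versus-scale-factor distinction concrete, here is a minimal Python sketch. It assumes the Rao-Kupper form that bayeselo uses, with an illustrative drawElo of 200 (an arbitrary choice, not a fitted value); it converts the expected score predicted by BayesElo back into the logistic Elo difference that would give the same score and prints the gap between the two scales.

Code: Select all

import math

DRAW_ELO = 200.0  # assumed Rao-Kupper/BayesElo draw parameter (illustrative only)

def bayeselo_expected_score(d):
    """Expected score of the stronger player at BayesElo difference d."""
    p_win  = 1.0 / (1.0 + 10.0 ** ((DRAW_ELO - d) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((DRAW_ELO + d) / 400.0))
    return p_win + 0.5 * (1.0 - p_win - p_loss)

def logistic_elo_from_score(s):
    """Logistic Elo difference that would produce expected score s."""
    return 400.0 * math.log10(s / (1.0 - s))

for d in (50, 100, 200, 400, 800, 1200):
    offset = d - logistic_elo_from_score(bayeselo_expected_score(d))
    print(f"BayesElo diff {d:5}:  BayesElo - LogisticElo = {offset:6.1f}")

For small differences the gap grows roughly in proportion to the difference (a scale factor), while for large differences it levels off at a constant shift, which is the behaviour described above.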
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: 1 draw=1 win + 1 loss (always!)

Post by hgm »

If the tails of the rating curve are different, the compression of the scale should depend on the statistics of the pairings, i.e. whether you predominantly pair players with very close ratings or with very large rating differences. Perhaps I should have made the caveat that the large-number-of-games limit should be taken under conditions where there isn't an unnatural bias in the selection of opponents (such as pairing a large fraction of the players only with far stronger players, and never with weaker ones). In any case I think it would be reasonable to require symmetric pairing (i.e. equally often against players that are X Elo stronger as against players that are X Elo weaker). I don't think you could make meaningful ratings when you strongly violate that assumption.

But my point was that none of that matters much. For deciding how many wins+losses are equivalent to one draw, you are only comparing P_win(D_Elo)*P_loss(D_Elo) against P_draw(D_Elo), that is, all quantities taken at the same D_Elo. So you are not in any way sensitive to distortions of the Elo scale.
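
As a minimal illustration of that point, here is a Python sketch assuming, purely for the sake of example, a Davidson-type draw model with a hypothetical draw parameter nu = 0.7. It collects the pairs (P_draw, P_win*P_loss) for a handful of matchups and fits P_win*P_loss = C*P_draw^N; the Elo labels attached to the matchups never enter the fit, so distorting the Elo scale cannot change it.

Code: Select all

import numpy as np

NU = 0.7  # hypothetical Davidson draw parameter, chosen only for illustration

def davidson_probs(d):
    """(P_win, P_draw, P_loss) at Elo-style difference d under a Davidson model."""
    g1, g2 = 10.0 ** (d / 400.0), 1.0
    denom = g1 + g2 + NU * np.sqrt(g1 * g2)
    return g1 / denom, NU * np.sqrt(g1 * g2) / denom, g2 / denom

diffs = [0.0, 50.0, 100.0, 200.0, 300.0, 400.0]   # relabelling these changes nothing below
p_win, p_draw, p_loss = np.array([davidson_probs(d) for d in diffs]).T

# Fit log(P_win*P_loss) = log(C) + N*log(P_draw); only the probability triples are used.
N, logC = np.polyfit(np.log(p_draw), np.log(p_win * p_loss), 1)
print(f"fitted N = {N:.3f}, C = {np.exp(logC):.3f}")  # Davidson gives N = 2 and C = 1/nu^2 exactly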
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: 1 draw=1 win + 1 loss (always!)

Post by Michel »

It seems there are two questions.

(1) What is the correct Elo model? In other words, what is the function F such that the expected value of the FIDE score between players 1 and 2 is F(elo2 - elo1)? This question is only meaningful for more than two players.

(2) What is the correct draw model? I.e., how do you predict the number of draws between players 1 and 2, knowing P(win) and P(loss)? In contrast to (1), this is a question involving only two players.

Unfortunately it seems difficult to validate (1) without making an assumption about (2), since (1) only tells you the expected score, not its distribution. So despite appearances, (1) and (2) are intertwined.

Of course, for small Elo differences this is all irrelevant. One may assume that F is linear (with constant term 0.5) and that the draw ratio is constant (the draw ratio only affects the variance, which is a second-order effect). Since the proportionality factor in F can be scaled away, one obtains a unique theory depending on a single parameter (the draw ratio).

This is why the Elo model is irrelevant for engine testing, provided you only test engines that are close in strength.
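
A small sketch of that linearisation (again assuming a Rao-Kupper form, with two illustrative drawElo values): near equal strength the expected score is linear in the Elo difference, the slope is the proportionality factor that can be scaled away, and the draw ratio survives only in the per-game variance of the result.

Code: Select all

import math

def probs(d, draw_elo):
    """(P_win, P_draw, P_loss) under a Rao-Kupper model with the given drawElo."""
    p_win  = 1.0 / (1.0 + 10.0 ** ((draw_elo - d) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((draw_elo + d) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

for draw_elo in (100.0, 300.0):
    p_win, p_draw, p_loss = probs(0.0, draw_elo)
    s_plus = sum(p * x for p, x in zip(probs(1.0, draw_elo), (1.0, 0.5, 0.0)))
    slope = s_plus - 0.5                # ds/dD at D = 0, per Elo point (numerical)
    var = p_win + 0.25 * p_draw - 0.25  # per-game variance of the result at D = 0
    print(f"drawElo={draw_elo:5.0f}: draw ratio={p_draw:.2f}, slope={slope:.5f}, "
          f"variance={var:.3f} (= (1-draws)/4 = {(1.0 - p_draw) / 4.0:.3f})")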
User avatar
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: 1 draw=1 win + 1 loss (always!)

Post by hgm »

Actually, the question of how many wins+losses are equivalent (in predictive power) to one draw requires more than two players to answer, because the absolute probability of a draw does not play a role in a maximum-likelihood prediction. So you want to fit P_win*P_loss by C*P_draw^N, which has two parameters, C and N, that cannot both be determined from the single equation you would get from two players. You would need at least two pairings (e.g. one with Delta_Elo = 0, the other with Delta_Elo = 250).
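
A minimal sketch of that solve, assuming the probabilities come from a Rao-Kupper model with an illustrative drawElo of 200: the two pairings Delta_Elo = 0 and Delta_Elo = 250 give two equations log(P_win*P_loss) = log(C) + N*log(P_draw), which pin down both unknowns (a single pairing would not). For this particular model P_win*P_loss happens to be exactly proportional to P_draw, so the solve returns N = 1, the value claimed in the thread title; a Davidson-type model would instead return N = 2.

Code: Select all

import math

DRAW_ELO = 200.0  # illustrative Rao-Kupper draw parameter

def probs(d):
    p_win  = 1.0 / (1.0 + 10.0 ** ((DRAW_ELO - d) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((DRAW_ELO + d) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

(w0, d0, l0), (w1, d1, l1) = probs(0.0), probs(250.0)

# Two equations log(P_win*P_loss) = log(C) + N*log(P_draw), solved by elimination.
N = (math.log(w1 * l1) - math.log(w0 * l0)) / (math.log(d1) - math.log(d0))
C = (w0 * l0) / d0 ** N
print(f"N = {N:.3f}, C = {C:.3f}")  # Rao-Kupper gives N = 1 and C = 1/(10**(DRAW_ELO/200) - 1)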

Btw, the correct formulation for (2) is "to predict P(draw) from the expected score P(win) + 0.5*P(draw)". (Or, equivalently, from the 'score excess' P(win) - P(loss).) Otherwise taking 1 - P(win) - P(loss) would be too easy. (Just nitpicking...)

Elo lists derived from games between nearly equally strong players only have a very poorly defined scale. I once calculated that games between players that differ by ~300 Elo (I think; or was it 450?) would contribute the most to determining the scale. (But I think this was for a Gaussian model: just mapping the standard deviation of the result through the Erf onto a standard deviation of Elo and dividing it by the Elo difference. With a higher Elo difference the results carry less error, because the score approaches the predictable 1-0, but the Elo curve gets so flat that this maps to a very large error in Elo.)
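
A rough Python sketch of that back-of-the-envelope calculation, under stated assumptions: a pure win/loss Gaussian model, a per-game spread of 200*sqrt(2) Elo (a conventional but arbitrary choice), and the per-game error on the scale taken as sqrt(s(1-s)) mapped through the slope of the score curve and divided by the Elo difference. The minimum lands in the few-hundred-Elo region mentioned above; the exact number depends entirely on the assumed spread.

Code: Select all

import math

SIGMA = 200.0 * math.sqrt(2.0)   # assumed spread of the Gaussian model, in Elo

def phi(x):      # standard normal density
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def Phi(x):      # standard normal CDF
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def relative_scale_error(d):
    s = Phi(d / SIGMA)                                              # expected score at difference d
    elo_error = math.sqrt(s * (1.0 - s)) / (phi(d / SIGMA) / SIGMA) # per-game result error mapped to Elo
    return elo_error / d                                            # error relative to the difference itself

best = min(range(50, 1501, 10), key=relative_scale_error)
print(f"most informative Elo difference for the scale: about {best} Elo")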
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: 1 draw=1 win + 1 loss (always!)

Post by Daniel Shawul »

1) For chess program ratings:
Using game length may increase the rating of programs that never resign (or, if the interface adjudicates games based on evaluations, it will increase the rating of programs that never show a very bad evaluation).
Well, it was just a suggestion of the kind of thing you can incorporate in rating estimation. Estimating the true strength of players as fast as possible requires you to look at many factors other than just the end result. I guess that for 'honest' computer evaluations within ±1 pawn, the GUI can force a draw even if the engines don't resign. This is used in practice, so the idea is workable.
If you want to use the PGN of the games and not only the results, then it is better to use computer analysis of the games to calculate ratings, so that both players can earn rating points if they played better than their rating, and both players can lose rating points if they played worse than their rating, based on computer analysis.

Note that I do not like this idea, because there is a problem with calculating the ratings of the strong programs this way. For example, if you use Houdini to analyze Houdini's games it may increase Houdini's rating, and if you want accurate results from computer analysis you may need significantly more time to analyze the games than was used to play them.
I guess you could rate the quality of moves by analyzing, say, the top 5 with Houdini, and credit a player for its move quality by giving it more rating even when the end result is a loss. Note that in the end, whatever model you construct will be tested for predictive power on games that are not Houdini's. Clearly Houdini is going to be helped by this rating system, because its moves will always rank first, even in the games it loses. So you have to weigh that against the end result. At the end of the day, the model cannot be perfect unless the engine is perfect. This seems rather complicated, but game length seems workable as a way to account for 'strong' wins. This is along the lines of BayesElo's improvement over EloStat: with the Bayesian approach, 10-0 is different from 1-0.

2) For human chess ratings:
I am against all these ideas in human-human games because they encourage cheating.
Two players can simply prepare their game at home and earn rating points from their draw if you use computer analysis to calculate ratings.

I am also against the idea that two draws do not count the same as a win plus a loss for the rating or ranking of humans, because I think this idea also encourages cheating (if a win plus a loss is not equal to two draws, then players of equal strength have a motivation to fix their result before the game so they get more out of their expected 50% score).
Note that if the players are of equal strength, they will have an equal number of wins and losses, so their rating difference will always be 0. But if one of them gets a win, e.g. W+3D instead of 2W+D, the ratings will differ depending on the draw model used. More draws are considered indicative of equality, so W+3D will result in a smaller rating difference.
Daniel Shawul
Posts: 4185
Joined: Tue Mar 14, 2006 11:34 am
Location: Ethiopia

Re: 1 draw=1 win + 1 loss (always!)

Post by Daniel Shawul »

Daniel Shawul wrote: Note that if the players are of equal strength, they will have an equal number of wins and losses, so their rating difference will always be 0. But if one of them gets a win, e.g. W+3D instead of 2W+D, the ratings will differ depending on the draw model used. More draws are considered indicative of equality, so W+3D will result in a smaller rating difference.
To demonstrate this point I plotted the posterior probability of the rating difference after W+3D and after 2W+D for all draw models. Right now bayeselo (RK, Rao-Kupper) assigns a smaller Elo difference for W+3D, and the same holds for GD (Glenn-David). DV (Davidson) will of course consider them equal.
[Image: posterior probability of the rating difference after W+3D and after 2W+D under the different draw models]
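
For readers who want to reproduce the qualitative behaviour, here is a minimal Python sketch (not the code behind the plot above). It assumes the two results are read as four-game samples, 1W+3D+0L and 2W+1D+1L, both scoring 2.5/4, and uses illustrative draw parameters (drawElo = 200 for Rao-Kupper, nu = 1.0 for Davidson); it then grid-searches the maximum-likelihood Elo difference under each model.

Code: Select all

import math

def rao_kupper(d, draw_elo=200.0):
    """(P_win, P_draw, P_loss) under the Rao-Kupper model used by bayeselo."""
    p_win  = 1.0 / (1.0 + 10.0 ** ((draw_elo - d) / 400.0))
    p_loss = 1.0 / (1.0 + 10.0 ** ((draw_elo + d) / 400.0))
    return p_win, 1.0 - p_win - p_loss, p_loss

def davidson(d, nu=1.0):
    """(P_win, P_draw, P_loss) under the Davidson draw model."""
    g = 10.0 ** (d / 400.0)
    denom = g + 1.0 + nu * math.sqrt(g)
    return g / denom, nu * math.sqrt(g) / denom, 1.0 / denom

def mle_diff(model, wins, draws, losses):
    """Elo difference maximizing the multinomial likelihood, by 1-Elo grid search."""
    def loglik(d):
        p_win, p_draw, p_loss = model(d)
        return (wins * math.log(p_win) + draws * math.log(p_draw)
                + losses * math.log(p_loss))
    return max(range(-400, 401), key=loglik)

for name, model in (("Rao-Kupper", rao_kupper), ("Davidson  ", davidson)):
    print(name, " 1W+3D+0L ->", mle_diff(model, 1, 3, 0), "Elo;",
          " 2W+1D+1L ->", mle_diff(model, 2, 1, 1), "Elo")

With these (assumed) parameters the Rao-Kupper fit gives a visibly smaller difference for the draw-heavy sample, while the Davidson fit treats the two samples identically.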
User avatar
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: 1 draw=1 win + 1 loss (always!)

Post by Laskos »

I linked this thread, together with some empirical data, in the other sub-forum:
http://www.talkchess.com/forum/viewtopi ... 5&start=90