Name for elo without draws?

Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Name for elo without draws?

Post by Michel »

The Davidson model equates two draws to one win + one loss, so effectively observing draws has exactly the same consequences for the likelihoods as observing wins and losses.
Isn't this only true if you assume that C is a hard constant?

In practice C is unknown and in the Bayesian world it would come with its own prior. So the parameter space is two dimensional and you may just as well use p(w,d) as prior.

It seems to me that asymptotically the actual observations ultimately wipe out whatever information concerning LOS the prior might provide. Except of course that you can never get rid of hard zeros, which would occur if you assume C to be a hard constant.

But Cromwell's principle says you should avoid this:

I beseech you, in the bowels of Christ, think it possible that you may be mistaken.
Ideas=science. Simplification=engineering.
Without ideas there is nothing to simplify.
hgm
Posts: 27790
Joined: Fri Mar 10, 2006 10:06 am
Location: Amsterdam
Full name: H G Muller

Re: Name for elo without draws?

Post by hgm »

When C is not a given constant, the Davidson model is not really a model at all, and only states there are draw, win and loss probabilities that could be anything (provided they sum to 1).

Asymptotically, the numbers of W/D/L speak for themselves. The observational factors that multiply the prior will approach Gaussians that contract to delta-functions as the number of tries approaches infinity. If the logarithmic derivative of the prior is finite everywhere (which requires absence of zeros) the delta-functions cannot be distorted.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Name for elo without draws?

Post by Michel »

hgm wrote:
When C is not a given constant, the Davidson model is not really a model at all, and only states there are draw, win and loss probabilities that could be anything (provided they sum to 1).
That is what I wrote....

hgm wrote:
Asymptotically, the numbers of W/D/L speak for themselves. The observational factors that multiply the prior will approach Gaussians that contract to delta-functions as the number of tries approaches infinity. If the logarithmic derivative of the prior is finite everywhere (which requires absence of zeros) the delta-functions cannot be distorted.
By asymptotically I mean of course some type of expansion in powers of 1/sqrt(N). The first term should not depend on the number of draws but the higher order terms will for a general prior.
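
As a concrete illustration of how the prior enters, here is a minimal sketch for one particular choice: a uniform prior over the (win, draw, loss) probabilities. That prior is an assumption made for the example, not something proposed in the thread, and `los_uniform_prior` is a hypothetical helper name. Under this prior the posterior of w/(w+l) is Beta(W+1, L+1), so the LOS comes out exactly independent of the number of draws, which fits the remark that any draw dependence has to come from the prior.

```python
from math import comb


def los_uniform_prior(wins, losses, draws=0):
    """P(win prob > loss prob | data) under a uniform prior on (w, d, l).

    Assumption: uniform Dirichlet(1, 1, 1) prior. Then w/(w+l) has a
    Beta(wins + 1, losses + 1) posterior, and `draws` drops out entirely.
    """
    n = wins + losses + 1
    # P(Beta(W+1, L+1) > 1/2) equals P(Binomial(n, 1/2) <= W).
    favourable = sum(comb(n, k) for k in range(wins + 1))
    return favourable / 2 ** n


# Hypothetical counts, chosen only to show that draws do not matter here.
print(los_uniform_prior(60, 40, draws=0))
print(los_uniform_prior(60, 40, draws=900))
```

Both calls print the same value, because `draws` is accepted only to make the independence explicit.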
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Name for elo without draws?

Post by Michel »

So if we are testing two engines against the null hypothesis H0: elo_diff=0 and the result is (W,L,D), then conditional on W+L the number of wins W follows a binomial distribution with both outcomes equally likely, and we can use it to compute a p-value which may be used to reject the null hypothesis.
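
A minimal sketch of that conditional test, assuming a one-sided alternative (engine 1 is stronger) and using only the decisive games. The function name and the example counts are hypothetical.

```python
from math import comb


def binomial_p_value(wins, losses):
    """One-sided p-value P(X >= wins) for X ~ Binomial(wins + losses, 1/2).

    Draws are ignored on purpose: under H0 (no strength difference) each
    decisive game is a fair coin flip, whatever the draw rate is.
    """
    n = wins + losses
    tail = sum(comb(n, k) for k in range(wins, n + 1))
    return tail / 2 ** n


# Hypothetical result: 60 wins, 40 losses, any number of draws.
print(binomial_p_value(60, 40))
```

A small p-value is evidence against H0; a large one says nothing in favour of it.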

Presumably this does not work for H0:elo_diff=elo0 versus H1:elo_diff=elo1, since elo is inherently influenced by draws.

wilo=0 is equivalent to elo=0 but for other values the relation depends on the draw ratio.

But I guess it would work for H0:wilo_diff=wilo0 against H1:wilo_diff=wilo1.

Now the question is: do wilo's really work? More precisely is

wilo(engine1,engine3)

a function of

wilo(engine1,engine2) and wilo(engine2,engine3) ???
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Name for elo without draws?

Post by Laskos »

Michel wrote:
Now the question is: do wilo's really work? More precisely is

wilo(engine1,engine3)

a function of

wilo(engine1,engine2) and wilo(engine2,engine3) ???
Well, we don't know whether ELO works either. Logistic and Gaussian are good empirical candidates; I don't think it's proven which is better, for computer chess at least. My flimsy results suggest that Gaussian fits better for simple additivity, but nothing is certain.
Let's take the quantity w/(l+w). It's very possible that close to 0.5 linearity holds, i.e. 52% and then again 52% gives 54%. The tails are very possibly not heavy. To verify this, one has to check that 0.9 and 0.9 gives something of the order of 0.99 or larger, and not 0.95. This is very plausible, and we are again left with logistic and Gaussian as good candidates for simple additivity of wilos. I could test a bit with some real engines, if I find time.
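
To make the "52% and 52%" and "0.9 and 0.9" checks concrete, here is a small sketch of what simple additivity predicts under the two candidate curves. The function names are hypothetical; "adding" two gaps means multiplying odds for the logistic and adding normal quantiles for the Gaussian.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def compose_logistic(p, q):
    """Score for the sum of two rating gaps under a logistic curve."""
    odds = (p / (1 - p)) * (q / (1 - q))  # log-odds add, so odds multiply
    return odds / (1 + odds)


def compose_gaussian(p, q):
    """Score for the sum of two rating gaps under a Gaussian curve."""
    z = _STD_NORMAL.inv_cdf(p) + _STD_NORMAL.inv_cdf(q)  # quantiles add
    return _STD_NORMAL.cdf(z)


for p in (0.52, 0.9):
    print(p, compose_logistic(p, p), compose_gaussian(p, p))
```

Both curves give roughly 0.54 for two 52% steps, while for two 0.9 steps the logistic gives about 0.988 and the Gaussian about 0.995, so the tails are where the two candidates separate.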
Adam Hair
Posts: 3226
Joined: Wed May 06, 2009 10:31 pm
Location: Fuquay-Varina, North Carolina

Re: Name for elo without draws?

Post by Adam Hair »

I have most of the necessary data already from tests I ran last year. I just need to generate more games so that the tails can be inspected.
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Name for elo without draws?

Post by Laskos »

Adam Hair wrote:
I have most of the necessary data already from tests I ran last year. I just need to generate more games so that the tails can be inspected.
This morning it occurred to me that by series expansion of scores using a draw model for ELO I can see whether WILOs are well behaved and obey a similar relation of transitivity and additivity. I took the Davidson draw model, as it is already fairly well verified for computer chess.

The score in the ELO model (Davidson):

score_elo=(w+C*sqrt(w*l)/2)/(w+l+C*sqrt(w*l))
The score in WILO:

score_wilo=w/(w+l)
1/ Focus on a close to equal result in both: score_elo ~ score_wilo ~ 0.5, w~l. Series expansion of (score_wilo - 0.5)/(score_elo - 0.5) for w around l:

(score_wilo - 0.5)/(score_elo - 0.5) = (1 + C/2) - C * (w-l)^2/(16*l^2) + O((w-l)^3)
Around w=l the linearity and additivity of score_wilo are assured (if they hold for score_elo); there is only a factor of proportionality of (1 + C/2) between them. Remember, C is from the Davidson model. It is interesting to note that the term linear in (w-l) vanishes.

2/ Focus on l->0. Series expansion of (1 - score_wilo)/(1 - score_elo) for l around 0:

(1 - score_wilo)/(1 - score_elo) = 2*sqrt(l/w)/C + l*(2*C^2 - 4)/(C^2*w) + O(l^(3/2))
For l->0, score_wilo will be closer to 1 than score_elo, so the tails are not heavy. The factor of proportionality 2*sqrt(l/w)/C corresponds to a shift in the argument of whichever light-tailed distribution function is chosen empirically.
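
The tail expansion can be checked numerically in the same spirit. This is a sketch only: the function names and C = 1.0 are assumptions, and the two-term series is the one written in (2/).

```python
from math import sqrt


def tail_ratio(w, l, c):
    """Exact (1 - score_wilo) / (1 - score_elo) in the Davidson model."""
    d = c * sqrt(w * l)
    one_minus_wilo = l / (w + l)
    one_minus_elo = (l + d / 2) / (w + l + d)
    return one_minus_wilo / one_minus_elo


def tail_series(w, l, c):
    """First two terms of the expansion around l = 0."""
    return 2 * sqrt(l / w) / c + l * (2 * c ** 2 - 4) / (c ** 2 * w)


C = 1.0  # hypothetical draw parameter
for l in (1e-2, 1e-3, 1e-4):
    print(l, tail_ratio(1.0, l, C), tail_series(1.0, l, C))
```

As l shrinks the two columns converge, and the ratio stays below 1, meaning 1 - score_wilo is smaller than 1 - score_elo.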



Having these nice features at hand, I tested a real-world engine at some ridiculously fast time controls, just to have many games. I chose Shredder, because it obeys the go nodes command very precisely, for the desired rating intervals.

a/ Using logistic for both ELO and WILO on a shorter, pretty linear interval. It doesn't really matter here whether Gaussian or logistic is used. Results are listed as wins - losses - draws [score] games:

Score of 1300 vs 1000: 520 - 296 - 184  [0.612] 1000
ELO difference: 79
WILO difference: 98

Score of 1600 vs 1300: 497 - 308 - 195  [0.595] 1000
ELO difference: 66
WILO difference: 83

Score of 1600 vs 1000: 602 - 222 - 176  [0.690] 1000
ELO difference: 139
WILO difference: 174
Transitivity is similar on this interval in both ELO and WILO, leaving aside statistical errors. This is to be expected from the factor of proportionality in (1/).
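
For reference, a sketch of how such figures can be computed from a W - L - D line with the logistic curve. The 400*log10 scale is the usual Elo convention and is assumed here; the function names are hypothetical.

```python
from math import log10


def logistic_diff(p):
    """Rating difference implied by an expected score p (logistic curve)."""
    return 400 * log10(p / (1 - p))


def elo_and_wilo(wins, losses, draws):
    """Return (ELO difference, WILO difference) for one match result."""
    score = (wins + draws / 2) / (wins + losses + draws)  # draws count half
    decisive = wins / (wins + losses)                     # draws discarded
    return logistic_diff(score), logistic_diff(decisive)


matches = {
    "1300 vs 1000": (520, 296, 184),
    "1600 vs 1300": (497, 308, 195),
    "1600 vs 1000": (602, 222, 176),
}
for name, wld in matches.items():
    elo, wilo = elo_and_wilo(*wld)
    print(f"{name}: ELO {elo:.0f}, WILO {wilo:.0f}")
```

The output should land close to the figures quoted above. Additivity can then be judged by comparing the sum of the first two rows (79 + 66 = 145 for ELO, 98 + 83 = 181 for WILO) with the third row.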

b/ Trickier tails. Now the differences between distributions are important. Using Gaussian and logistic for ELO and WILO:

Score of 2000 vs 500: 1770 - 92 - 138  [0.919] 2000
Logistic ELO difference: 422
Logistic WILO difference: 514
Gaussian ELO difference: 396
Gaussian WILO difference: 467

Score of 10000 vs 2000: 1810 - 52 - 138  [0.940] 2000
Logistic ELO difference: 476
Logistic WILO difference: 617
Gaussian ELO difference: 440
Gaussian WILO difference: 541

Score of 10000 vs 500: 3969 - 6 - 25  [0.995] 4000
Logistic ELO difference: 933
Logistic WILO difference: 1128  
Gaussian ELO difference: 736
Gaussian WILO difference: 839
In this case, logistic seems to fit both ELO and WILO a bit better than Gaussian, and the shift from (2/) is a fairly well behaved factor of transformation between WILOs and ELOs for large values too.
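
A sketch of the Gaussian and logistic conversions side by side, for anyone who wants to redo the tail numbers. The Gaussian scale of 200*sqrt(2) rating points per standard deviation is an assumption (the post does not say which scale was used), although it appears to give values close to the Gaussian figures quoted above. The function names are hypothetical.

```python
from math import log10
from statistics import NormalDist

GAUSS_SCALE = 200 * 2 ** 0.5  # assumed scale, not stated in the post


def logistic_diff(p):
    """Rating difference for expected score p under the logistic curve."""
    return 400 * log10(p / (1 - p))


def gaussian_diff(p):
    """Rating difference for expected score p under the Gaussian curve."""
    return GAUSS_SCALE * NormalDist().inv_cdf(p)


def differences(wins, losses, draws):
    """All four differences for one W - L - D result."""
    score = (wins + draws / 2) / (wins + losses + draws)
    decisive = wins / (wins + losses)
    return {
        "logistic ELO": logistic_diff(score),
        "logistic WILO": logistic_diff(decisive),
        "gaussian ELO": gaussian_diff(score),
        "gaussian WILO": gaussian_diff(decisive),
    }


for wld in [(1770, 92, 138), (1810, 52, 138), (3969, 6, 25)]:
    print(wld, {k: round(v) for k, v in differences(*wld).items()})
```

Small deviations from the quoted numbers can come from rounding the score before converting it.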
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Name for elo without draws?

Post by Michel »

It is an interesting question how to test an elo model....

Here is what I see.

(1) Take an elo model.
(2) Take a game database.
(3) Estimate elo's from the database using the model.
(4) Generate "predictions" using estimated elo.
(5) Compute chi^2. One needs to determine the number of degrees of freedom (dof) (the chi^2 distribution depends on it). It is #results-#estimated elo's+1 (the +1 is because only elo differences matter). If other quantities are also estimated (e.g. draw_elo) then this changes dof accordingly.
(6) Check p value for computed chi^2 value.
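
Steps (4) and (5) can be sketched as follows. This is a simplified illustration, not the full procedure: it assumes a logistic model on decisive games only (so draws and draw_elo are left out), it takes the estimated ratings from step (3) as given, and it counts one result per pairing when computing dof. All names are hypothetical.

```python
def expected_win_fraction(rating_a, rating_b):
    """Logistic prediction for A's share of the decisive games."""
    return 1 / (1 + 10 ** ((rating_b - rating_a) / 400))


def chi_square(pairings, ratings):
    """Pearson chi^2 of observed vs predicted wins.

    pairings: list of (a, b, wins_a, wins_b), decisive games only.
    ratings:  dict of estimated ratings, assumed to come from step (3).
    """
    total = 0.0
    for a, b, wins_a, wins_b in pairings:
        n = wins_a + wins_b
        p = expected_win_fraction(ratings[a], ratings[b])
        total += (wins_a - n * p) ** 2 / (n * p)
        total += (wins_b - n * (1 - p)) ** 2 / (n * (1 - p))
    return total


def degrees_of_freedom(pairings, ratings):
    """#results - #estimated elo's + 1, with one result per pairing."""
    return len(pairings) - len(ratings) + 1


# Hypothetical data and ratings, for illustration only.
pairings = [("A", "B", 60, 40), ("B", "C", 55, 45), ("A", "C", 70, 30)]
ratings = {"A": 100.0, "B": 30.0, "C": 0.0}
print(chi_square(pairings, ratings), degrees_of_freedom(pairings, ratings))
```

The p-value in step (6) is the upper tail of the chi^2 distribution at this value with this dof, which a statistics library such as scipy (`scipy.stats.chi2.sf`) can supply.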

Comparing the results for different elo models may give some indication of how good they are.

Other suggestions?
Laskos
Posts: 10948
Joined: Wed Jul 26, 2006 10:21 pm
Full name: Kai Laskos

Re: Name for elo without draws?

Post by Laskos »

Michel wrote:
It is an interesting question how to test an elo model....

Here is what I see.

(1) Take an elo model.
(2) Take a game data base.
(3) Estimate elo's from data base using model.
(4) Generate "predictions" using estimated elo.
(5) Compute chi^2. One needs to determine the number of degrees of freedom (the chi^2 distribution depends on it). These are #results-#estimated elo's+1 (the +1 is because only elo differences matter).
(6) Check p value for computed chi^2 value.

Comparing the results for different elo model's may give some indication how good they are.

Other suggestions?
There is a small problem with most databases: there are few games between engines separated by >99% score. Adding smaller intervals ruins the model-selection resolution.
Michel
Posts: 2272
Joined: Mon Sep 29, 2008 1:50 am

Re: Name for elo without draws?

Post by Michel »

Laskos wrote:
There is a small problem with most databases: there are few games between engines separated by >99% score. Adding smaller intervals ruins the model-selection resolution.
What do you mean by adding intervals?

I think my proposal was not theoretically correct and should be amended. p-values can only be used as evidence against a null hypothesis, never as evidence for a null hypothesis (a basic statistical error :-( ).

When comparing two models it seems more logical to compute their generalized likelihood ratio. Unfortunately I am not entirely sure about the distribution of the resulting test statistic. Wilks's theorem does not appear to apply in this situation.

https://en.wikipedia.org/wiki/Likelihood-ratio_test

Anyway assuming this can be solved, for a given level of significance the test would have three outcomes: accept model 1, accept model 2, or no conclusion.
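
The test statistic itself is easy to write down, even though its distribution is the open question. Below is a minimal sketch comparing a logistic and a Gaussian model on decisive games only. It assumes the ratings have already been fitted separately under each model (the maximisation step is not shown), uses 200*sqrt(2) as an assumed Gaussian scale, and all names are hypothetical. No significance thresholds are given, because they depend on the unknown distribution.

```python
from math import log
from statistics import NormalDist

_GAUSS = NormalDist(0.0, 200 * 2 ** 0.5)  # assumed Gaussian scale


def logistic_p(diff):
    """Win probability for a rating gap under the logistic model."""
    return 1 / (1 + 10 ** (-diff / 400))


def gaussian_p(diff):
    """Win probability for a rating gap under the Gaussian model."""
    return _GAUSS.cdf(diff)


def log_likelihood(pairings, ratings, win_prob):
    """Log-likelihood of decisive results (a, b, wins_a, wins_b)."""
    total = 0.0
    for a, b, wins_a, wins_b in pairings:
        p = win_prob(ratings[a] - ratings[b])
        total += wins_a * log(p) + wins_b * log(1 - p)
    return total


def log_likelihood_ratio(pairings, logistic_ratings, gaussian_ratings):
    """Positive values favour the logistic model, negative the Gaussian."""
    return (log_likelihood(pairings, logistic_ratings, logistic_p)
            - log_likelihood(pairings, gaussian_ratings, gaussian_p))


# Hypothetical data and fitted ratings, for illustration only.
pairings = [("A", "B", 60, 40), ("B", "C", 55, 45), ("A", "C", 70, 30)]
fit_logistic = {"A": 100.0, "B": 30.0, "C": 0.0}
fit_gaussian = {"A": 95.0, "B": 30.0, "C": 0.0}
print(log_likelihood_ratio(pairings, fit_logistic, fit_gaussian))
```

Because the two models are not nested, the usual chi^2 calibration from Wilks's theorem cannot be assumed; simulating the statistic under each model would be one way to calibrate it.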