Adam Hair wrote: Laskos wrote: Michel wrote:
Now the question is: do wilo's really work? More precisely is
wilo(engine1,engine3)
a function of
wilo(engine1,engine2) and wilo(engine2,engine3) ???
Well, we don't know whether ELO works. Logistic and Gaussian are good empirical candidates, I don't think it's proven which is better, for computer chess at least. My flimsy results suggest that Gaussian fits better for simple additivity, but nothing for sure.
Let's take the quantity w/(l+w). It's very possible that close to 0.5 the linearity is verified, i.e. 52% and then again 52% gives 54%. The distribution is very possibly not heavy-tailed. To verify this one has to check that 0.9 and 0.9 gives something of the order of 0.99 or larger, and not 0.95. Very plausible, and we are left again with logistic and Gaussian as good candidates for simple additivity of wilos. I could test a bit with some real engines, if I find time.
I have most of the necessary data already from tests I ran last year. I just need to generate more games so that the tails can be inspected.
This morning it occurred to me that, by series expansion of the scores using a draw model for ELO, I can see whether WILOs are well behaved and obey similar relations of transitivity and additivity. I took the Davidson draw model, as it is already fairly well verified for computer chess.
The score in the ELO model (Davidson):
Code:
score_elo = (w + C*sqrt(w*l)/2)/(w + l + C*sqrt(w*l))
The score in WILO, which simply drops the draws (the w/(l+w) quantity from above):
Code:
score_wilo = w/(w + l)
1/ Focus on nearly equal results, where score_elo ~ score_wilo ~ 0.5 and w ~ l. Series expansion of (score_wilo - 0.5)/(score_elo - 0.5) for w around l:
Code:
(score_wilo - 0.5)/(score_elo - 0.5) = (1 + C/2) - C*(w-l)^2/(16*l^2) + O((w-l)^3)
Around w=l the linearity and additivity of score_wilo are assured (if they hold for score_elo); the two differ only by a factor of proportionality of (1 + C/2). Remember, C is the draw parameter of the Davidson model. It is interesting to note that the term linear in (w-l) vanishes, so the correction to the constant factor is only of second order.
2/ Focus on l->0. Series expansion of (1 - score_wilo)/(1 - score_elo) for l around 0:
Code:
(1 - score_wilo)/(1 - score_elo) = 2*sqrt(l/w)/C + l*(2*C^2 - 4)/(C^2*w) + O(l^(3/2))
For l->0 score_wilo will be closer to 1 than score_elo, which assures that the tails are not heavy. The exact form of the factor of proportionality 2*sqrt(l/w)/C is related to a shift in the argument of the empirically chosen light-tailed distribution function.
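The tail expansion in 2/ can be checked the same way. This is a sketch under the same assumptions (score_wilo = w/(w+l), illustrative C = 0.5, hypothetical function names); it compares the exact ratio with the two-term series for shrinking l.

```python
import math

def tail_ratio(w, l, C):
    """Exact (1 - score_wilo)/(1 - score_elo) under the Davidson model."""
    d = C * math.sqrt(w * l)                   # modeled draws
    one_minus_wilo = l / (w + l)
    one_minus_elo = (l + d / 2) / (w + l + d)
    return one_minus_wilo / one_minus_elo

def tail_series(w, l, C):
    """First two terms of the expansion around l = 0."""
    return 2 * math.sqrt(l / w) / C + l * (2 * C**2 - 4) / (C**2 * w)

C = 0.5   # illustrative Davidson parameter, not fitted to any data
w = 1.0
for l in (1e-2, 1e-4, 1e-6):
    exact = tail_ratio(w, l, C)
    approx = tail_series(w, l, C)
    print(l, exact, approx, exact - approx)
```

The agreement only becomes good once sqrt(l/w) is small compared with C, so for matches with very few draws the leading term alone can be a rough guide.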
Having these nice features at hand, I tested a real-world engine at some ridiculously fast time controls, just to have many games. I chose Shredder because it obeys the go nodes command very precisely, which makes it easy to get the desired rating intervals.
a/ Using logistic for both ELO and WILO on a shorter, fairly linear interval. Here it doesn't really matter whether Gaussian or logistic is used:
Code:
Score of 1300 vs 1000: 520 - 296 - 184 [0.612] 1000
ELO difference: 79
WILO difference: 98
Score of 1600 vs 1300: 497 - 308 - 195 [0.595] 1000
ELO difference: 66
WILO difference: 83
Score of 1600 vs 1000: 602 - 222 - 176 [0.690] 1000
ELO difference: 139
WILO difference: 174
Transitivity is similar on this interval in both ELO and WILO, leaving aside statistical errors. This is to be expected from the factor of proportionality (1 + C/2) found in 1/.
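For anyone who wants to redo the arithmetic in a/, here is a minimal sketch. It assumes the standard logistic conversion 400*log10(p/(1-p)), with p = (w + d/2)/n for ELO and p = w/(w+l) for WILO; the helper names are invented for this example. It should land within about a point of the figures listed above.

```python
import math

def logistic_diff(p):
    """Rating difference implied by expected score p under the logistic model."""
    return 400 * math.log10(p / (1 - p))

def elo_and_wilo(wins, losses, draws):
    """Logistic ELO (draws count half) and WILO (draws discarded) differences."""
    games = wins + losses + draws
    elo = logistic_diff((wins + draws / 2) / games)
    wilo = logistic_diff(wins / (wins + losses))
    return elo, wilo

# wins, losses, draws as listed in a/
matches = {
    "1300 vs 1000": (520, 296, 184),
    "1600 vs 1300": (497, 308, 195),
    "1600 vs 1000": (602, 222, 176),
}
results = {name: elo_and_wilo(*wld) for name, wld in matches.items()}
for name, (elo, wilo) in results.items():
    print(f"{name}: ELO {elo:.1f}  WILO {wilo:.1f}  ratio {wilo / elo:.3f}")

# Transitivity check: sum of the two short legs minus the direct result
elo_gap = results["1300 vs 1000"][0] + results["1600 vs 1300"][0] - results["1600 vs 1000"][0]
wilo_gap = results["1300 vs 1000"][1] + results["1600 vs 1300"][1] - results["1600 vs 1000"][1]
print(f"ELO gap {elo_gap:.1f}, WILO gap {wilo_gap:.1f}")
```

The WILO/ELO ratio printed for each match can be compared with 1 + C/2, where C is estimated from that match as draws/sqrt(wins*losses).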
b/ Trickier tails. Now the differences between distributions are important. Using Gaussian and logistic for ELO and WILO:
Code:
Score of 2000 vs 500: 1770 - 92 - 138 [0.919] 2000
Logistic ELO difference: 422
Logistic WILO difference: 514
Gaussian ELO difference: 396
Gaussian WILO difference: 467
Score of 10000 vs 2000: 1810 - 52 - 138 [0.940] 2000
Logistic ELO difference: 476
Logistic WILO difference: 617
Gaussian ELO difference: 440
Gaussian WILO difference: 541
Score of 10000 vs 500: 3969 - 6 - 25 [0.995] 4000
Logistic ELO difference: 933
Logistic WILO difference: 1128
Gaussian ELO difference: 736
Gaussian WILO difference: 839
In this case, logistic seems to fit both ELO and WILO a bit better than Gaussian, and the shift from 2/ is a fairly well-behaved factor of transformation between WILOs and ELOs for large values too.
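The additivity comparison in b/ can be sketched as follows. The logistic conversion is the standard 400*log10(p/(1-p)). The Gaussian scale of 200*sqrt(2) is an assumption on my part (the classical Elo choice); it appears to reproduce the Gaussian figures above to within a point or two. Function and variable names are hypothetical.

```python
import math
from statistics import NormalDist

def logistic_diff(p):
    """Rating difference from expected score p, logistic model."""
    return 400 * math.log10(p / (1 - p))

def gaussian_diff(p):
    """Rating difference from p, Gaussian model (assumed scale 200*sqrt(2))."""
    return 200 * math.sqrt(2) * NormalDist().inv_cdf(p)

def scores(wins, losses, draws):
    """Return (ELO score with draws counted half, WILO score with draws dropped)."""
    return (wins + draws / 2) / (wins + losses + draws), wins / (wins + losses)

# wins, losses, draws as listed in b/
legs = [(1770, 92, 138), (1810, 52, 138)]   # 2000 vs 500, 10000 vs 2000
direct = (3969, 6, 25)                      # 10000 vs 500

gaps = {}
for model_name, diff in (("logistic", logistic_diff), ("gaussian", gaussian_diff)):
    for idx, kind in enumerate(("ELO", "WILO")):
        summed = sum(diff(scores(*leg)[idx]) for leg in legs)
        whole = diff(scores(*direct)[idx])
        gaps[(model_name, kind)] = summed - whole
        print(f"{model_name:8s} {kind:4s} legs {summed:7.1f} direct {whole:7.1f} gap {summed - whole:+7.1f}")
```

A gap near zero means the two legs add up to the direct result, so the smaller gaps point to the model that is closer to additive on these particular matches.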