Empirically 1 win + 1 loss ~ 2 draws
Posted: Tue Jun 24, 2014 8:36 pm
There are databases of computer chess games that could confirm this too, but they are messy: each datapoint carries large uncertainty, and the Elo span in direct matches is usually small, so the shorter (Rao-Kupper) or longer (Davidson) tails of the distributions are barely tested. I therefore performed some more clinical tests, using only 4 engines in 5 runs and 9-11 datapoints per run, but with many games (1000 games per datapoint).
To summarize the results up front: the five fitted values of a are {1.955, 1.639, 1.941, 2.476, 2.167}, with an average of a = 2.036. That is very close to the Davidson model (a = 2), while the Rao-Kupper model (a = 1) is ruled out. The details follow.
There are clearly two competing models:
Code:
Rao-Kupper:
d = C*w*(1 - w - d)
d -> (C*w - C*w^2)/(1 + C*w)
Code:
Davidson:
d^2 = C*w*(1 - w - d)
d -> 1/2 (-C*w + Sqrt[C]*Sqrt[w]*Sqrt[4 - 4 w + C*w])
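Both closed-form solutions for the draw rate can be checked numerically against their defining equations; a quick sketch (the values of C and w are arbitrary, chosen only for illustration):

```python
# Verify the closed-form draw rates for both models against
# their defining equations, at arbitrary example values of C and w.
from math import sqrt

C, w = 1.0, 0.3  # illustrative values, not fitted ones

# Rao-Kupper: d = C*w*(1 - w - d)  =>  d = (C*w - C*w^2) / (1 + C*w)
d_rk = (C * w - C * w**2) / (1 + C * w)
assert abs(d_rk - C * w * (1 - w - d_rk)) < 1e-12

# Davidson: d^2 = C*w*(1 - w - d)
d_dav = 0.5 * (-C * w + sqrt(C) * sqrt(w) * sqrt(4 - 4 * w + C * w))
assert abs(d_dav**2 - C * w * (1 - w - d_dav)) < 1e-12

print(round(d_rk, 4), round(d_dav, 4))  # prints: 0.1615 0.3322
```

Note that at the same C and w the Davidson model predicts roughly twice the draw rate, which is exactly the "2 draws" intuition in the title.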
The model to fit the data is

Code:
d^a = C*w*(1 - w - d)
where a and C are now variables fitted to the datapoints; a=1 gives the Rao-Kupper model (used in BayesElo) and a=2 the Davidson model.
The Elo span in matches is 1000-1500 Elo points, so tails are visible.
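For general a the model equation has no simple closed form for d, but it can be solved numerically, since d^a + C*w*d - C*w*(1-w) is increasing in d. A minimal bisection sketch (C and w are arbitrary illustrative values):

```python
# Solve d^a = C*w*(1 - w - d) for the draw fraction d by bisection.
# f(d) = d^a + C*w*d - C*w*(1 - w) is increasing in d, negative at
# d = 0 and positive at d = 1 - w, so exactly one root lies between.
def draw_rate(a, C, w):
    lo, hi = 0.0, 1.0 - w
    for _ in range(60):  # 60 halvings shrink the interval below 1e-17
        mid = (lo + hi) / 2
        if mid**a - C * w * (1 - w - mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# Sanity check: a = 1 reproduces the Rao-Kupper closed form.
C, w = 1.0, 0.3
assert abs(draw_rate(1, C, w) - (C*w - C*w**2) / (1 + C*w)) < 1e-9
```

With a = 2 the same routine reproduces the Davidson closed form, and fractional values of a interpolate between the two models.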
Code:
Anchor: SF depth 6
Komodo depths:
K1 : 1000 (+986,= 12,- 2), 99.2 %
K2 : 1000 (+930,= 60,- 10), 96.0 %
K3 : 1000 (+837,= 95,- 68), 88.4 %
K4 : 1000 (+663,=215,-122), 77.0 %
K5 : 1000 (+488,=207,-305), 59.2 %
K6 : 1000 (+192,=248,-560), 31.6 %
K7 : 1000 (+ 84,=168,-748), 16.8 %
K8 : 1000 (+ 18,=108,-874), 7.2 %
K9 : 1000 (+ 14,= 46,-940), 3.7 %
K10 : 1000 (+ 2,= 26,-972), 1.5 %
model: d^a = C*w*(1-d-w)
Least Squares:
a = 1.955
C = 0.426
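The exact least-squares objective is not stated; one plausible reading (an assumption, not necessarily the one used above) minimizes the squared error in the observed draw fraction, solving the model equation for d at each datapoint. A self-contained sketch on this first run, which at least reproduces the qualitative conclusion that a = 2 fits the data better than a = 1:

```python
# Compare Rao-Kupper (a=1) and Davidson (a=2) fits on the first run
# (anchor SF depth 6 vs Komodo K1..K10), counts per 1000 games.
# Assumed objective: sum of squared errors in the draw fraction.

def draw_rate(a, C, w):
    """Solve d^a = C*w*(1 - w - d) for d by bisection."""
    lo, hi = 0.0, 1.0 - w
    for _ in range(60):
        mid = (lo + hi) / 2
        if mid**a - C * w * (1 - w - mid) < 0:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

# (wins, draws, losses) per 1000 games, from the table above
data = [(986, 12, 2), (930, 60, 10), (837, 95, 68), (663, 215, 122),
        (488, 207, 305), (192, 248, 560), (84, 168, 748),
        (18, 108, 874), (14, 46, 940), (2, 26, 972)]
points = [(w / 1000, d / 1000) for w, d, _ in data]

def sse(a, C):
    return sum((draw_rate(a, C, w) - d) ** 2 for w, d in points)

# Crude 1-D grid search over C for each fixed a.
grid = [i / 100 for i in range(5, 301)]
best = {a: min(sse(a, C) for C in grid) for a in (1, 2)}
assert best[2] < best[1]  # Davidson (a=2) beats Rao-Kupper (a=1) here
```

The fitted a and C under this objective need not match the values quoted above exactly, since a different residual weighting shifts the optimum, but the ordering of the two models is robust.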
Anchor: IvanHoe depth 5
Komodo depths:
K1 : 1000 (+982,= 14,- 4), 98.9 %
K2 : 1000 (+946,= 39,- 15), 96.5 %
K3 : 1000 (+834,= 99,- 67), 88.3 %
K4 : 1000 (+630,=214,-156), 73.7 %
K5 : 1000 (+440,=249,-311), 56.5 %
K6 : 1000 (+173,=248,-579), 29.7 %
K7 : 1000 (+ 55,=169,-776), 14.0 %
K8 : 1000 (+ 19,=102,-879), 7.0 %
K9 : 1000 (+ 4,= 34,-962), 2.1 %
model: d^a = C*w*(1-d-w)
Least Squares:
a = 1.639
C = 0.818
Anchor: SF depth 5
Houdini depths:
H1 : 1000 (+907,= 79,- 14), 94.7 %
H2 : 1000 (+815,=148,- 37), 88.9 %
H3 : 1000 (+594,=293,-113), 74.1 %
H4 : 1000 (+403,=375,-222), 59.1 %
H5 : 1000 (+203,=413,-384), 40.9 %
H6 : 1000 (+ 86,=312,-602), 24.2 %
H7 : 1000 (+ 30,=243,-727), 15.2 %
H8 : 1000 (+ 18,=191,-791), 11.3 %
H9 : 1000 (+ 3,= 95,-902), 5.1 %
model: d^a = C*w*(1-d-w)
Least Squares:
a = 1.941
C = 1.771
Anchor: Komodo depth 5
Houdini depths:
H1 : 1000 (+928,= 67,- 5), 96.2 %
H2 : 1000 (+878,=106,- 16), 93.1 %
H3 : 1000 (+724,=211,- 65), 83.0 %
H4 : 1000 (+524,=301,-175), 67.5 %
H5 : 1000 (+310,=396,-294), 50.8 %
H6 : 1000 (+139,=385,-476), 33.1 %
H7 : 1000 (+ 54,=327,-619), 21.8 %
H8 : 1000 (+ 22,=252,-726), 14.8 %
H9 : 1000 (+ 5,=113,-882), 6.2 %
model: d^a = C*w*(1-d-w)
Least Squares:
a = 2.476
C = 0.931
Anchor: Houdini depth 5
SF depths:
S1 : 1000 (+925,= 72,- 3), 96.1 %
S2 : 1000 (+883,=109,- 8), 93.8 %
S3 : 1000 (+762,=208,- 30), 86.6 %
S4 : 1000 (+620,=311,- 69), 77.5 %
S5 : 1000 (+405,=394,-201), 60.2 %
S6 : 1000 (+241,=380,-379), 43.1 %
S7 : 1000 (+117,=315,-568), 27.5 %
S8 : 1000 (+ 41,=235,-724), 15.8 %
S9 : 1000 (+ 5,=143,-852), 7.6 %
S10 : 1000 (+ 7,= 72,-921), 4.3 %
S11 : 1000 (+ 0,= 49,-951), 2.5 %
model: d^a = C*w*(1-d-w)
Least Squares:
a = 2.167
C = 1.457
All in all, the Davidson model, P(D|E) = C*Sqrt[P(W|E)*P(L|E)], which assumes 1 win + 1 loss = 2 draws, fits the empirical data much better than the Rao-Kupper model used in BayesElo, P(D|E) = C*P(W|E)*P(L|E), where 1 win + 1 loss = 1 draw. So it is not advisable to use BayesElo for computer chess ratings. Ordo assumes 1 win + 1 loss = 2 draws, so it would be advisable to use Ordo instead of BayesElo for computer chess ratings.
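The slogans in the two model names can be made concrete in likelihood terms: under Davidson, P(D)^2 is proportional to P(W)*P(L), so two draws carry the same evidential weight as one win plus one loss; under Rao-Kupper, P(D) itself is proportional to P(W)*P(L), so a single draw already equals a win-loss pair. A minimal numeric illustration (probabilities are arbitrary and left unnormalized for simplicity):

```python
# Illustrate "1 win + 1 loss = 2 draws" (Davidson) versus
# "1 win + 1 loss = 1 draw" (Rao-Kupper) in likelihood terms.
from math import sqrt, isclose

pw, pl, C = 0.4, 0.1, 1.0  # arbitrary illustrative values

# Davidson: P(D) = C * Sqrt[P(W) * P(L)]
pd_davidson = C * sqrt(pw * pl)
# likelihood of 2 draws equals that of 1 win + 1 loss (times C^2)
assert isclose(pd_davidson**2, C**2 * pw * pl)

# Rao-Kupper: P(D) = C * P(W) * P(L)
pd_rk = C * pw * pl
# likelihood of 1 draw already equals 1 win + 1 loss (times C)
assert isclose(pd_rk, C * pw * pl)
```

This is why a rating model built on the wrong draw assumption systematically misweights drawn games, and why the choice between BayesElo and Ordo matters at the high draw rates seen in computer chess.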