The result does not depend on the ELO model or on the draw model. Some draw models can be adapted to describe the new counting. I used both the usual balanced openings and unbalanced ones, to try to maximize the statistical significance of a match for a given number of games. If a 1-0, 0-1 result shows up on the same opening played with colors reversed, I interpret this position as having ELO_Opening much larger than ELO_Diff.
If both ELO_Opening and ELO_Draw are much larger than ELO_Diff, then, for example, the Rao-Kupper P(Win) = L(ELO_Diff + ELO_Opening - ELO_Draw) can be rearranged by Taylor expanding around ELO_Diff = 0. Then P(New_Draw) = P(Draw) plus two equal terms, one subtracted from P(Win) and one from P(Loss). However, nice properties like P(New_Draw) ~ P(New_Win)*P(New_Loss) will probably be lost. The draw rate should be sufficiently large for unbalanced openings to have a beneficial effect, because I took ELO_Draw to be large. Also, the ELO difference between the engines shouldn't be too large. In the following I am using the Davidson model, for better agreement with empirical data and in order to preserve the "effective" number of games and the ELO difference. The result is in fact independent of the ELO model and the draw model. Therefore 1 Win + 1 Loss on the same opening (played with colors reversed) are transformed into 2 Draws.
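To make the counting rule concrete, here is a minimal sketch of how it could be applied to a list of game pairs. The function name `recount` and the 'W'/'L'/'D' encoding (results seen from one engine's side) are my own assumptions for illustration, not taken from any testing tool.

```python
def recount(pairs):
    """Apply the new counting to a list of opening pairs.

    Each pair holds the two results on one opening (colors reversed),
    from the point of view of the same engine: 'W', 'L' or 'D'.
    A pair with one win and one loss is counted as two draws.
    """
    wins = losses = draws = 0
    for pair in pairs:
        if sorted(pair) == ["L", "W"]:
            # 1 Win + 1 Loss on the same opening -> 2 Draws
            draws += 2
        else:
            wins += pair.count("W")
            losses += pair.count("L")
            draws += pair.count("D")
    return wins, losses, draws


# Hypothetical toy input: four opening pairs, eight games
example = [("W", "L"), ("W", "W"), ("D", "L"), ("L", "W")]
print(recount(example))  # (2, 1, 5)
```

All other pairs (Win-Draw, Draw-Draw, two wins for the same engine, and so on) are left as they are; only the 1 Win + 1 Loss pairs change.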
I played each opening pair only once, to avoid the more delicate question of what to do when, say, the same opening appears in two pairs, one Win-Win and the other Loss-Loss from the same side. But I don't think this is a big practical problem, so openings can be chosen at will; each only needs to be played twice, once from each side.
Each data point is a 4000-game match between two closely related recent Stockfish versions at a time control of 10 s + 0.1 s.
Unbalance: 0.0
Score of SF2 vs SF1: 1046 - 530 - 2424 [0.565] 4000
ELO difference: 45
Significance (SD): 13.0
Win-Win games: 440
New count: +826 -310 =2864
New Significance (SD): 15.3

Unbalance: 0.4
Score of SF2 vs SF1: 1157 - 590 - 2253 [0.571] 4000
ELO difference: 50
Win-Win games: 586
New count: +864 -297 =2839
New Significance (SD): 16.6

Unbalance: 0.6
Score of SF2 vs SF1: 1294 - 692 - 2014 [0.575] 4000
ELO difference: 53
Win-Win games: 776
New count: +906 -304 =2790
New Significance (SD): 17.3

Unbalance: 0.8
Score of SF2 vs SF1: 1384 - 830 - 1786 [0.569] 4000
ELO difference: 48
Win-Win games: 1070
New count: +849 -295 =2856
New Significance (SD): 16.4

Unbalance: 1.0
Score of SF2 vs SF1: 1490 - 962 - 1548 [0.566] 4000
ELO difference: 46
Win-Win games: 1308
New count: +836 -308 =2856
New Significance (SD): 15.6

Unbalance: 1.4
Score of SF2 vs SF1: 1658 - 1221 - 1121 [0.555] 4000
ELO difference: 38
Win-Win games: 1940
New count: +688 -251 =3061
New Significance (SD): 14.3

Unbalance: 1.8
Score of SF2 vs SF1: 1807 - 1502 - 691 [0.538] 4000
ELO difference: 27
Win-Win games: 2606
New count: +504 -199 =3297
New Significance (SD): 11.5
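As a cross-check, the figures above can be reproduced from the raw scores. The sketch below assumes that the significance is (Wins - Losses) / sqrt(Wins + Losses); that formula is my assumption, chosen because it matches the listed values, and the function names are only illustrative. Each 1 Win + 1 Loss pair takes one win and one loss away and adds two draws.

```python
import math


def significance(wins, losses):
    # Assumed formula: score difference in units of its standard deviation
    return (wins - losses) / math.sqrt(wins + losses)


def new_count(wins, losses, draws, win_win_games):
    # win_win_games counts games, so the number of cancelled pairs is half of it
    pairs = win_win_games // 2
    return wins - pairs, losses - pairs, draws + win_win_games


# Unbalance 0.0 row: 1046 - 530 - 2424, Win-Win games: 440
w, l, d = 1046, 530, 2424
nw, nl, nd = new_count(w, l, d, 440)

print(round(significance(w, l), 1))    # 13.0
print(nw, nl, nd)                      # 826 310 2864
print(round(significance(nw, nl), 1))  # 15.3
```

The score difference (Wins - Losses) is untouched by the recount, while Wins + Losses shrinks, which is where the gain in significance comes from.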
The benefit of this "new counting of draws" seems unequivocal (confidence above 3 standard deviations, with all the data pointing the same way), and by itself it brings the number of games needed for the same statistical significance down by some 30%. Combined with unbalanced openings, the number of required games is reduced to almost half.
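The savings in games can be estimated from the significance values alone, assuming the significance grows like the square root of the number of games when the per-game statistics stay the same. A rough sketch under that assumption (the function name is illustrative):

```python
def games_fraction(old_sd, new_sd):
    # Fraction of games needed to reach the old significance,
    # assuming significance scales as sqrt(number of games)
    return (old_sd / new_sd) ** 2


# New counting alone, balanced openings (Unbalance 0.0): 13.0 -> 15.3
print(round(games_fraction(13.0, 15.3), 2))  # 0.72

# New counting plus the best unbalanced setting (Unbalance 0.6): 13.0 -> 17.3
print(round(games_fraction(13.0, 17.3), 2))  # 0.56
```

So roughly 72% of the games are needed with the new counting alone, and roughly 56% when it is combined with Unbalance 0.6.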