The result is independent of both the ELO model and the draw model, although some draw models can be adapted to describe the new counting. I used both the usual balanced openings and unbalanced ones, to try to maximize the statistical significance of a match for a fixed number of games. If a 1-0, 0-1 result shows up on the same opening (played with colors reversed), I interpret that position as having ELO_Opening much larger than ELO_Diff.
If both ELO_Opening and ELO_Draw are much larger than ELO_Diff, then a model such as Rao-Kupper, with P(Win) = L(ELO_Diff + ELO_Opening - ELO_Draw), can be rearranged by Taylor expanding around ELO_Diff = 0. Then P(New_Draw) = P(Draw) plus two equal terms, one subtracted from P(Win) and one from P(Loss). However, nice properties like P(New_Draw) ~ P(New_Win)*P(New_Loss) will probably be lost. The draw rate should be sufficiently large for unbalanced openings to have a beneficial effect, because I assumed ELO_Draw to be large. Also, the ELO difference between the engines shouldn't be too large. In the following I use the Davidson model, for better conformity to empirical data and in order to preserve the "effective" number of games and the ELO difference, but the result is in fact independent of the ELO model and draw model. The rule is therefore: 1 Win + 1 Loss on the same opening (played with colors reversed) is counted as 2 Draws.
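The recounting rule can be sketched in a few lines of Python. This is only an illustration, not the script used for the matches: the function name `recount_pairs` and the 'W'/'L'/'D' encoding (results seen from one engine's point of view) are assumptions made here.

```python
def recount_pairs(pairs):
    """Apply the new counting to a list of opening pairs.

    Each pair holds the two results on one opening (colors reversed),
    both seen from the same engine: 'W', 'L' or 'D'.
    Hypothetical helper, not the author's actual tooling.
    """
    counts = {"W": 0, "L": 0, "D": 0}
    for first, second in pairs:
        if {first, second} == {"W", "L"}:
            # One win and one loss on the same opening: count as 2 draws.
            counts["D"] += 2
        else:
            # Every other pair is counted game by game, as usual.
            counts[first] += 1
            counts[second] += 1
    return counts["W"], counts["L"], counts["D"]


pairs = [("W", "L"), ("W", "W"), ("D", "L"), ("L", "W")]
print(recount_pairs(pairs))  # (2, 1, 5)
```

The total number of games is unchanged and wins and losses drop by the same amount, so the score percentage stays the same; only the win/loss/draw split moves.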
I played each opening pair of games only once, to avoid the more delicate question of what to do when, say, the same opening appears in two pairs, one Win-Win and the other Loss-Loss from the same side. I don't think this is a big practical problem, though, so openings can be chosen at will; each one only needs to be played twice, once from each side.
Each data point is a 4000-game match between two closely related recent Stockfish versions at a 10''+0.1'' time control.
Unbalance: 0.0
Score of SF2 vs SF1: 1046 - 530 - 2424 [0.565] 4000
ELO difference: 45
Significance (SD): 13.0
Win-Win games: 440
New count: +826 -310 =2864
New Significance (SD): 15.3

Unbalance: 0.4
Score of SF2 vs SF1: 1157 - 590 - 2253 [0.571] 4000
ELO difference: 50
Win-Win games: 586
New count: +864 -297 =2839
New Significance (SD): 16.6

Unbalance: 0.6
Score of SF2 vs SF1: 1294 - 692 - 2014 [0.575] 4000
ELO difference: 53
Win-Win games: 776
New count: +906 -304 =2790
New Significance (SD): 17.3

Unbalance: 0.8
Score of SF2 vs SF1: 1384 - 830 - 1786 [0.569] 4000
ELO difference: 48
Win-Win games: 1070
New count: +849 -295 =2856
New Significance (SD): 16.4

Unbalance: 1.0
Score of SF2 vs SF1: 1490 - 962 - 1548 [0.566] 4000
ELO difference: 46
Win-Win games: 1308
New count: +836 -308 =2856
New Significance (SD): 15.6

Unbalance: 1.4
Score of SF2 vs SF1: 1658 - 1221 - 1121 [0.555] 4000
ELO difference: 38
Win-Win games: 1940
New count: +688 -251 =3061
New Significance (SD): 14.3

Unbalance: 1.8
Score of SF2 vs SF1: 1807 - 1502 - 691 [0.538] 4000
ELO difference: 27
Win-Win games: 2606
New count: +504 -199 =3297
New Significance (SD): 11.5
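The figures above can be cross-checked with a short script. Two things in it are assumptions inferred from the numbers rather than stated in the post: that "Win-Win games" counts the games belonging to pairs that get recounted as draws (so half of them come off the wins and half off the losses), and that significance is computed as (W - L) / sqrt(W + L). Both reproduce the posted values for every row listed; the function names are made up for this sketch.

```python
import math


def new_count(wins, losses, draws, recounted_games):
    # Assumption: the recounted games split evenly into wins and losses,
    # and all of them become draws.
    half = recounted_games // 2
    return wins - half, losses - half, draws + recounted_games


def significance(wins, losses):
    # Assumption: significance in standard deviations, ignoring draws.
    return (wins - losses) / math.sqrt(wins + losses)


# (unbalance, wins, losses, draws, "Win-Win games") copied from the results.
rows = [
    (0.0, 1046, 530, 2424, 440),
    (0.4, 1157, 590, 2253, 586),
    (0.6, 1294, 692, 2014, 776),
    (0.8, 1384, 830, 1786, 1070),
    (1.0, 1490, 962, 1548, 1308),
    (1.4, 1658, 1221, 1121, 1940),
    (1.8, 1807, 1502, 691, 2606),
]

for unbalance, w, l, d, ww in rows:
    nw, nl, nd = new_count(w, l, d, ww)
    print(f"{unbalance:.1f}: +{nw} -{nl} ={nd}  new SD {significance(nw, nl):.1f}")
```

Under these assumptions W - L is untouched by the recount while W + L shrinks, which is why the new significance is always at least as large as the old one.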
The benefit of this "new counting of draws" seems unequivocal (above 3 standard deviations of confidence, taking all the data together), and on its own it brings the number of games down by some 30% for the same statistical significance. Combined with unbalanced openings, the number of required games is reduced to almost half.
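The savings quoted here follow from how significance scales with match length: for a fixed per-game distribution, significance grows like the square root of the number of games, so the games needed for a given significance scale as 1/SD^2. A minimal sketch using the SD values from the results (the function name is hypothetical):

```python
def relative_games_needed(sd_old, sd_new):
    # Fraction of the original number of games needed to reach the same
    # significance, given the significance each method gets from one match.
    return (sd_old / sd_new) ** 2


# Balanced openings, old count (13.0) vs new count (15.3).
print(f"{relative_games_needed(13.0, 15.3):.2f}")  # 0.72

# Balanced openings, old count (13.0) vs unbalance 0.6 with new count (17.3).
print(f"{relative_games_needed(13.0, 17.3):.2f}")  # 0.56
```

So the new counting alone needs about 72% of the games, and together with unbalance 0.6 about 56%, in line with the "some 30%" and "almost half" figures.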