Counting 1 win + 1 loss as 2 draws

Laskos · Post by **Laskos** » Tue Dec 15, 2015 7:09 am

I tried to calculate the statistical significance of a match between two engines, (Win-Loss)/sqrt(Win+Loss), by counting Wins and Losses pairwise according to the opening positions used (each opening played side and reversed).
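As a minimal sketch of the significance measure named above, the formula can be written as a small function. The function name `significance` is an illustrative choice, not something from the original test setup, and the sample numbers are the balanced-opening score reported further down in this post.

```python
import math


def significance(wins: int, losses: int) -> float:
    """Return (W - L) / sqrt(W + L), in standard deviations.

    Draws do not enter the formula. Returns 0.0 when there are no
    decisive games, to avoid dividing by zero.
    """
    decisive = wins + losses
    if decisive == 0:
        return 0.0
    return (wins - losses) / math.sqrt(decisive)


# Balanced-opening match from this post: +1046 -530 =2424
print(round(significance(1046, 530), 1))
```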

The result is independent of the ELO model and of the draw model. Some draw models can be adapted to describe the new counting. I used both the usual balanced openings and unbalanced ones, trying to maximize the statistical significance of a match for the same number of games. If a 1-0 0-1 result shows up on the same opening (side and reversed), I interpret this position as having ELO_Opening much larger than ELO_Diff.

If both ELO_Opening and ELO_Draw are much larger than ELO_Diff, then, for example, the Rao-Kupper P(Win) = L(ELO_Diff + ELO_Opening - ELO_Draw) can be rearranged by Taylor expanding around ELO_Diff = 0, and P(New_Draw) = P(Draw) plus 2 equal terms, one subtracted from P(Win) and one from P(Loss). However, nice properties like P(New_Draw) ~ P(New_Win)*P(New_Loss) will probably be lost. The draw rate should be sufficiently large for unbalanced openings to have a beneficial effect, because I assumed ELO_Draw to be large. Also, the ELO difference between the engines shouldn't be too large.

In the following I am using the Davidson model, for better conformity to empirical data and in order to preserve the "effective" number of games and the ELO difference. The result is in fact independent of the ELO model and draw model. Therefore 1 Win + 1 Loss on the same opening (side and reversed) is transformed into 2 Draws.
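The counting rule itself can be sketched in a few lines. This is only an illustration under stated assumptions: results are given per opening pair from the tested engine's point of view as 'W', 'L' or 'D', and the function name `recount_pairs` and the sample data are hypothetical.

```python
def recount_pairs(pairs):
    """Recount a match played in opening pairs (side and reversed).

    `pairs` is an iterable of (result_game1, result_game2), each 'W',
    'L' or 'D' from the tested engine's point of view. A pair holding
    one win and one loss is counted as two draws; every other pair is
    counted game by game as usual.
    """
    wins = losses = draws = 0
    for first, second in pairs:
        if {first, second} == {"W", "L"}:
            draws += 2  # 1 win + 1 loss on the same opening -> 2 draws
            continue
        for result in (first, second):
            if result == "W":
                wins += 1
            elif result == "L":
                losses += 1
            else:
                draws += 1
    return wins, losses, draws


# Hypothetical six-opening match, for illustration only
sample = [("W", "L"), ("W", "D"), ("L", "W"), ("D", "D"), ("W", "W"), ("L", "D")]
print(recount_pairs(sample))
```

The total number of games is unchanged by the recount, and so is Win minus Loss; only the split between decisive games and draws moves.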

I played each opening pair only once, to avoid the more delicate question of what to do when, say, the same opening appears in two pairs, one Win-Win and another Loss-Loss from the same side. But I don't think this is a big practical problem, so openings can be chosen at will; they only need to be played twice (side and reversed).

For each datapoint, a 4000-game match between two related recent Stockfish versions is used, at a 10''+0.1'' time control.

Unbalance: 0.0
Score of SF2 vs SF1: 1046 - 530 - 2424  [0.565] 4000
ELO difference: 45
Significance (SD): 13.0
Win-Win games: 440
New count: +826 -310 =2864
New Significance (SD): 15.3

Unbalance: 0.4
Score of SF2 vs SF1: 1157 - 590 - 2253  [0.571] 4000
ELO difference: 50
Win-Win games: 586
New count: +864 -297 =2839
New Significance (SD): 16.6

Unbalance: 0.6
Score of SF2 vs SF1: 1294 - 692 - 2014  [0.575] 4000
ELO difference: 53
Win-Win games: 776
New count: +906 -304 =2790
New Significance (SD): 17.3

Unbalance: 0.8
Score of SF2 vs SF1: 1384 - 830 - 1786  [0.569] 4000
ELO difference: 48
Win-Win games: 1070
New count: +849 -295 =2856
New Significance (SD): 16.4

Unbalance: 1.0
Score of SF2 vs SF1: 1490 - 962 - 1548  [0.566] 4000
ELO difference: 46
Win-Win games: 1308
New count: +836 -308 =2856
New Significance (SD): 15.6

Unbalance: 1.4
Score of SF2 vs SF1: 1658 - 1221 - 1121  [0.555] 4000
ELO difference: 38
Win-Win games: 1940
New count: +688 -251 =3061
New Significance (SD): 14.3

Unbalance: 1.8
Score of SF2 vs SF1: 1807 - 1502 - 691  [0.538] 4000
ELO difference: 27
Win-Win games: 2606
New count: +504 -199 =3297
New Significance (SD): 11.5
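
The "New count" lines can be rechecked from the aggregate numbers alone. The sketch below assumes that "Win-Win games" is the number of games belonging to 1-0 0-1 opening pairs, so that half of them are wins and half are losses for SF2; the arithmetic of the reported figures is consistent with that reading. The function names are illustrative.

```python
import math


def new_count(wins, losses, draws, cancelled_games):
    """Turn the games of 1-0 0-1 pairs into draws.

    `cancelled_games` is assumed to be the number of games in such
    pairs: half of them are wins, half are losses.
    """
    half = cancelled_games // 2
    return wins - half, losses - half, draws + cancelled_games


def significance(wins, losses):
    """(W - L) / sqrt(W + L), in standard deviations."""
    return (wins - losses) / math.sqrt(wins + losses)


# (unbalance, wins, losses, draws, "Win-Win games") as reported above
matches = [
    (0.0, 1046, 530, 2424, 440),
    (0.4, 1157, 590, 2253, 586),
    (0.6, 1294, 692, 2014, 776),
    (0.8, 1384, 830, 1786, 1070),
    (1.0, 1490, 962, 1548, 1308),
    (1.4, 1658, 1221, 1121, 1940),
    (1.8, 1807, 1502, 691, 2606),
]

for unbalance, w, l, d, cancelled in matches:
    nw, nl, nd = new_count(w, l, d, cancelled)
    print(unbalance, f"+{nw} -{nl} ={nd}", round(significance(nw, nl), 1))
```

Run on the seven matches, this reproduces the reported "New count" and "New Significance" values.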

The benefit of this "new counting of draws" seems unequivocal (above 3 standard deviations confidence corroborating all the data), and it alone brings the number of games down by some 30% for the same statistical significance. Combined with unbalanced openings, the number of required games is reduced to almost a half.
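
The savings in games can be estimated from the significance values, assuming the win, loss and draw rates stay fixed so that (Win-Loss)/sqrt(Win+Loss) grows like the square root of the number of games. Under that assumption the number of games needed for a given significance scales as the inverse square of the significance reached in a fixed-length match. The helper name `games_fraction` is illustrative.

```python
def games_fraction(z_old: float, z_new: float) -> float:
    """Fraction of games needed with the new method to match the old
    significance, assuming significance grows like sqrt(games)."""
    return (z_old / z_new) ** 2


# Balanced openings, old counting (13.0 SD) vs new counting (15.3 SD)
saving_counting = 1.0 - games_fraction(13.0, 15.3)

# Old counting on balanced openings (13.0 SD) vs new counting at
# unbalance 0.6 (17.3 SD), the best datapoint in the table
saving_combined = 1.0 - games_fraction(13.0, 17.3)

print(f"{saving_counting:.0%} fewer games from the new counting alone")
print(f"{saving_combined:.0%} fewer games combined with unbalanced openings")
```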

stegemma · Post by **stegemma** » Wed Dec 16, 2015 8:23 pm

The statistical part of your post is too complex for me, but the subject makes me think that I would prefer 1 win + 1 loss against Kasparov to 2 draws (and I'm sorry that I could never have... 2 losses!!!). This means that a draw is a draw and can't be compared with a win/loss... IMHO.

Laskos · Post by **Laskos** » Wed Dec 16, 2015 9:54 pm

stegemma wrote: The statistical part of your post is too complex for me, but the subject makes me think that I would prefer 1 win + 1 loss against Kasparov to 2 draws (and I'm sorry that I could never have... 2 losses!!!). This means that a draw is a draw and can't be compared with a win/loss... IMHO.

Well, rating calculators like BayesELO implicitly assume such or similar things. The significance here refers to the significance of the result in rejecting the null hypothesis that the engines are equal in strength. To reject the hypothesis that you are as strong as Kasparov I wouldn't need any games between you two.

stegemma · Post by **stegemma** » Wed Dec 16, 2015 11:06 pm

Laskos wrote: [...] To reject the hypothesis that you are as strong as Kasparov I wouldn't need any games between you two.

That's why we'll never play together...

Zenmastur · Post by **Zenmastur** » Thu Dec 17, 2015 12:47 am

Laskos wrote: [...] The benefit of this "new counting of draws" seems unequivocal (above 3 standard deviations confidence corroborating all the data), and it alone brings the number of games down by some 30% for the same statistical significance. Combined with unbalanced openings, the number of required games is reduced to almost a half.

So, one of the consequences of what you are saying is that the number of games to test a patch could be reduced by 30% if you use openings that are un-balanced, is that correct?

You seem to be making a few other significant points but the explanation you gave isn't clear enough for me to discern exactly what they are. Could you elaborate a little more?

Thanks and regards,

Forrest

Laskos · Post by **Laskos** » Thu Dec 17, 2015 3:06 pm

Zenmastur wrote:
So, one of the consequences of what you are saying is that the number of games to test a patch could be reduced by 30% if you use openings that are un-balanced, is that correct?

You seem to be making a few other significant points but the explanation you gave isn't clear enough for me to discern exactly what they are. Could you elaborate a little more?

Thanks and regards,

Forrest

The number of games can be reduced by 30% if one treats a 1-0 0-1 result from the same opening (side and reversed) as 2 draws. In fact one can treat it as 1 draw, discard it totally, etc.; the result is independent of the draw model if our goal is to reject the null hypothesis that the engines are equal. The soundness of this procedure is speculative; one probably must experiment, as it's hard to build a consistent theory about it. The experiment might be something like checking that a certain theoretical p-value for a given observed result using the new counting is verified experimentally in many trials. The empirical degree of correlation between 'side result' and 'reversed result' is an important factor which determines the soundness of this counting.
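
One way such an experiment could look, done in simulation rather than with real engines, is sketched below. Everything in it is an assumption made for illustration: two identical engines (so the null hypothesis is true), a single opening with fixed White-win and Black-win probabilities, and independent results for the two games of a pair. Real engine pairs may show a different 'side result'-'reversed result' correlation, so this is a toy model, not a verdict on the method.

```python
import math
import random


def play_pair(rng, p_white_win, p_black_win):
    """Simulate one opening pair between two identical engines.

    Returns (net, decisive) for engine A under the new counting:
    net = wins minus losses, decisive = games still counted as
    wins or losses after 1 win + 1 loss is turned into 2 draws.
    """
    def game(a_is_white):
        u = rng.random()
        if u < p_white_win:
            return 1 if a_is_white else -1
        if u < p_white_win + p_black_win:
            return -1 if a_is_white else 1
        return 0  # draw

    first, second = game(True), game(False)
    if first + second == 0:  # two draws, or 1 win + 1 loss
        return 0, 0
    return first + second, abs(first) + abs(second)


def false_positive_rate(trials, pairs, p_white_win, p_black_win,
                        z_crit=1.96, seed=1):
    """Share of simulated matches in which |z| exceeds z_crit although
    the engines are equal, using z = (W - L) / sqrt(W + L) on the
    recounted result."""
    rng = random.Random(seed)
    rejections = 0
    for _trial in range(trials):
        net = decisive = 0
        for _pair in range(pairs):
            n, d = play_pair(rng, p_white_win, p_black_win)
            net += n
            decisive += d
        z = net / math.sqrt(decisive) if decisive else 0.0
        if abs(z) > z_crit:
            rejections += 1
    return rejections / trials


# Nominal two-sided rate for z_crit = 1.96 is about 5%
print(false_positive_rate(trials=500, pairs=300,
                          p_white_win=0.25, p_black_win=0.10))
```

Comparing the printed rate with the nominal 5% for different win probabilities, or for a mix of openings, is the kind of check the paragraph above has in mind.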

The advantage of unbalanced openings is hard to confirm with much confidence. Yesterday I tried 10,000 games for each of 2 datapoints, one using balanced openings and the other unbalanced ones, and it is still 2 sigma as before. Also, it seems to depend on the draw ratio, which is assumed to be high. As of now, with a 60%-65% draw ratio, the unbalanced openings seem to reduce the number of required games by a further 10%-20%.
