All we need to judge the statistical significance of a match result is the Elo difference divided by its standard deviation. In fact not even that: (win_ratio - loss_ratio)/sigma is enough.
Here is a short derivation that works very well for fairly closely matched engines (it can be generalized rigorously). Up to a result mismatch of about 60%/40%, sigma is very close to sqrt(win_ratio + loss_ratio)/sqrt(N), where N is the total number of games.
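To spell out where that approximation comes from, here is a sketch of the per-game variance. It assumes each game is scored as an independent variable X equal to +1 for a win, 0 for a draw and -1 for a loss; w and l are shorthand for win_ratio and loss_ratio.

```latex
\begin{aligned}
E[X]   &= w - l, \qquad E[X^2] = w + l,\\
\operatorname{Var}(X) &= E[X^2] - E[X]^2 = (w + l) - (w - l)^2,\\
\sigma &= \sqrt{\frac{(w + l) - (w - l)^2}{N}}
        \;\approx\; \sqrt{\frac{w + l}{N}}
        \quad\text{when } (w - l)^2 \ll w + l .
\end{aligned}
```

Even at 60%/40% with no draws, (w - l)^2/(w + l) is only 0.04, so dropping that term changes sigma by about 2%.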
So (win_ratio - loss_ratio)/sigma = N*(win_ratio - loss_ratio)/sqrt(N*[win_ratio + loss_ratio]) = (Wins - Losses)/sqrt(Wins + Losses), where Wins = N*win_ratio and Losses = N*loss_ratio are the numbers of won and lost games.
This gives the simple expression:
(Wins - Losses)/sqrt(Wins + Losses)
Example: a TCEC Superfinal result of +9 -2 =89 (9 wins, 2 losses, 89 draws).
Rigorously:
N=100
win_ratio - loss_ratio = w - l = (9-2)/100 = 0.07
sigma = sqrt(w*(1-w)+l*(1-l)+2*w*l)/sqrt(N) = 0.032419
(w-l)/sigma = 2.159
The given simple expression:
(Wins - Losses)/sqrt(Wins + Losses) = (9-2)/sqrt(9+2) = 2.111
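The two calculations above can be reproduced with a few lines of Python. This is only a sketch: the function names z_simple and z_rigorous are made up for illustration, and the rigorous version simply uses the sigma formula quoted above.

```python
import math

def z_simple(wins, losses):
    """Shortcut: (Wins - Losses) / sqrt(Wins + Losses). Draws are not needed."""
    return (wins - losses) / math.sqrt(wins + losses)

def z_rigorous(wins, losses, draws):
    """(w - l) / sigma using the full per-game variance."""
    n = wins + losses + draws
    w = wins / n
    l = losses / n
    sigma = math.sqrt(w * (1 - w) + l * (1 - l) + 2 * w * l) / math.sqrt(n)
    return (w - l) / sigma

# TCEC Superfinal example: +9 -2 =89
print(round(z_rigorous(9, 2, 89), 3))  # 2.159
print(round(z_simple(9, 2), 3))        # 2.111
```

Note that the shortcut needs at least one decisive game, otherwise it divides by zero.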
The interpretation is the following: the result of the match is a bit more than 2 standard deviations away from perfect equality, so the stronger engine has a LOS (likelihood of superiority) a bit above 97.7%, which is the value at exactly 2 standard deviations. When the result is 5-6 standard deviations away, the LOS is so close to 100% that it is awkward even to write down, but the statistical significance of 5-6 standard deviations is clear.
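For anyone who wants the actual LOS number, here is a minimal sketch that converts the number of standard deviations into a LOS. It assumes the normal approximation, which is my assumption here and not something derived above; the function names are made up for illustration.

```python
import math

def los_from_z(z):
    """LOS as the standard normal CDF evaluated at z standard deviations."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def one_minus_los(z):
    """Complement of the LOS, computed with erfc so tiny values keep precision."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

def los(wins, losses):
    """LOS from the shortcut (Wins - Losses) / sqrt(Wins + Losses)."""
    return los_from_z((wins - losses) / math.sqrt(wins + losses))

print(los_from_z(2.0))     # value at exactly 2 standard deviations
print(los(9, 2))           # TCEC example, +9 -2
print(one_minus_los(5.0))  # what remains at 5 standard deviations
```

For the +9 -2 example this comes out at roughly 98%. At 5 standard deviations the complement is of the order of 1e-7, which is why it is easier to quote the number of standard deviations than the LOS itself.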
Even with this few games, the expression works very well. For something like Fishtest patches it will work almost perfectly, as the Elo differences are small and the number of games is large.
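How close the shortcut gets can be checked directly: the ratio of the simple expression to the rigorous one is sqrt(1 - (w - l)^2/(w + l)). A small sketch follows; the function name shortcut_ratio is made up, and the second set of counts is a purely hypothetical Fishtest-like result invented for illustration, not real data.

```python
import math

def shortcut_ratio(wins, losses, draws):
    """Ratio of (Wins - Losses)/sqrt(Wins + Losses) to the rigorous (w - l)/sigma."""
    n = wins + losses + draws
    w = wins / n
    l = losses / n
    return math.sqrt(1.0 - (w - l) ** 2 / (w + l))

# TCEC example from above: +9 -2 =89
print(shortcut_ratio(9, 2, 89))

# Hypothetical Fishtest-like counts (made up): +10200 -10000 =19800
print(shortcut_ratio(10200, 10000, 19800))
```

The first ratio is about 0.977, matching 2.111/2.159. The second is within 0.01% of 1, since the win and loss ratios are almost equal.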