A simple expression
Posted: Thu Dec 10, 2015 12:15 am
The expression (Wins - Losses)/sqrt(Wins + Losses) describes the result of a match between two engines very well. In fact, it does so better than the Elo difference and standard deviation combined.
All we need for the statistical significance of the result is the Elo difference divided by its standard deviation. In fact not even that: (win_ratio - loss_ratio)/sigma is enough, where sigma is the standard deviation of win_ratio - loss_ratio.
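Here is a short derivation that works very well for closely matched engines (it can be generalized rigorously). The exact per-game variance is (win_ratio + loss_ratio) - (win_ratio - loss_ratio)^2, and the squared term is small when the engines are close. So, up to a result of about 60%/40%, sigma is very close to sqrt(win_ratio + loss_ratio)/sqrt(N), where N is the total number of games.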
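To get a feel for how good this approximation of sigma is, here is a minimal Python sketch. The function names and the win/loss ratios in `cases` are my own illustrative assumptions, not data from any real match; the remaining games in each case are assumed to be draws.

```python
import math

def sigma_exact(w, l, n):
    """Standard deviation of (win_ratio - loss_ratio) over n games."""
    # Each game contributes +1, 0 or -1, so the per-game variance is
    # E[X^2] - E[X]^2 = (w + l) - (w - l)**2.
    return math.sqrt((w + l) - (w - l) ** 2) / math.sqrt(n)

def sigma_approx(w, l, n):
    """Approximation that drops the small (w - l)**2 term."""
    return math.sqrt(w + l) / math.sqrt(n)

# Hypothetical (win_ratio, loss_ratio) pairs, chosen only for illustration.
cases = [(0.26, 0.24), (0.35, 0.25), (0.40, 0.20), (0.60, 0.40)]

for w, l in cases:
    score = w + 0.5 * (1.0 - w - l)  # score fraction, draws count as half
    ratio = sigma_approx(w, l, 1000) / sigma_exact(w, l, 1000)
    print(f"score = {score:.2f}  approx/exact sigma = {ratio:.4f}")
```

The ratio does not depend on N, and in these cases the approximate sigma stays within a few percent of the exact one, even at a 60% score.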
So (win_ratio - loss_ratio)/sigma = N*(win_ratio - loss_ratio)/sqrt(N*[win_ratio + loss_ratio]) = (Wins - Losses)/sqrt(Wins + Losses), where Wins = N*win_ratio and Losses = N*loss_ratio.
This expression, (Wins - Losses)/sqrt(Wins + Losses), is independent of the number of draws and is simply the number of standard deviations by which the result deviates from perfect 50% equality. To me it seems a better description of the result than Elo and error margins, and better than LOS (likelihood of superiority), because LOS gets very close to 1 above 3-4 standard deviations. I don't know how this utterly simple expression evaded me; I guess some people here already knew it.
Example TCEC Superfinal result: +9 -2 =89
Rigorously:
N=100
win_ratio - loss_ratio = (9-2)/100 = 0.07
sigma = sqrt(w*(1-w)+l*(1-l)+2*w*l)/sqrt(N) = 0.032419, with w = win_ratio = 0.09 and l = loss_ratio = 0.02
(w-l)/sigma = 2.159
The given simple expression:
(Wins - Losses)/sqrt(Wins + Losses) = (9-2)/sqrt(9+2) = 2.111
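For anyone who wants to check the arithmetic, the numbers above can be reproduced with a few lines of Python. This is just a sketch of the calculation as written in this post; the variable names are my own.

```python
import math

# TCEC Superfinal result quoted above: +9 -2 =89
wins, losses, draws = 9, 2, 89
n = wins + losses + draws
w, l = wins / n, losses / n

# Rigorous value, using the sigma formula given in the post.
sigma = math.sqrt(w * (1 - w) + l * (1 - l) + 2 * w * l) / math.sqrt(n)
z_rigorous = (w - l) / sigma

# Simple expression: the number of draws is not an input at all.
z_simple = (wins - losses) / math.sqrt(wins + losses)

print(f"sigma      = {sigma:.6f}")
print(f"rigorous z = {z_rigorous:.3f}")
print(f"simple z   = {z_simple:.3f}")
```

This prints 0.032419, 2.159 and 2.111, matching the values above.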
The interpretation is the following: the result of the match is a bit more than 2 standard deviations away from perfect equality, or equivalently the stronger engine has a LOS of a bit above 97.7%. When the result is 5-6 standard deviations away, the LOS is so close to 1 that it is hard even to write down, but the statistical significance of 5-6 standard deviations is clear.
Even with so few games, the expression works very well. For something like Fishtest patches it will work almost perfectly, as the Elo differences are small and the number of games is large.
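The link between standard deviations and LOS can be sketched as follows, assuming the usual normal approximation where LOS is the one-sided normal CDF of the number of standard deviations. The function name `los_from_z` is my own, not from any engine-testing tool.

```python
import math

def los_from_z(z):
    """One-sided normal CDF: the chance the leading engine is really stronger."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# 2.111 is the simple expression for the +9 -2 =89 result above.
for z in (2.0, 2.111, 3.0, 4.0, 5.0, 6.0):
    print(f"{z:.3f} standard deviations -> LOS = {los_from_z(z):.12f}")
```

Above 4 or so standard deviations the LOS is a long string of nines, which is why the number of standard deviations is the more readable figure.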