TalkChess.com

Posted: **Thu Dec 10, 2015 12:15 am**

It describes very well the result of an engine match between 2 engines. In fact, better than ELO difference and standard deviation combined.

All we need for statistical significance of the result is ELO difference over standard deviation. And in fact not even that, but (win_ratio - loss_ratio)/sigma.

A short derivation that works very well for closely matched engines (it can be generalized rigorously): up to a score split of about 60%/40%, sigma, the standard deviation of win_ratio - loss_ratio, is very close to sqrt(win_ratio + loss_ratio)/sqrt(N), where N is the total number of games.

So (win_ratio - loss_ratio)/sigma = N*(win_ratio - loss_ratio)/sqrt(N*[win_ratio + loss_ratio]) = (Wins - Losses)/sqrt(Wins + Losses), where Wins = N*win_ratio and Losses = N*loss_ratio are the numbers of won and lost games.
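For reference, the approximation for sigma can be checked by treating each game as a variable X equal to +1 for a win, 0 for a draw and -1 for a loss. This is only a sketch assuming independent games; X, w, l and d are shorthand introduced here, with w, l and d the win, loss and draw ratios.

```latex
\begin{align*}
E[X] &= w - l, \qquad E[X^2] = w + l, \\
\operatorname{Var}(X) &= w + l - (w - l)^2, \\
\sigma &= \sqrt{\frac{w + l - (w - l)^2}{N}}
        \approx \sqrt{\frac{w + l}{N}}
        \quad \text{when } (w - l)^2 \ll w + l .
\end{align*}
```

For a 60%/40% result with no draws, (w - l)^2 = 0.04 against w + l = 1, so dropping that term changes sigma by about 2%.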

This

(Wins - Losses)/sqrt(Wins + Losses)

is independent of the number of draws and is simply the number of standard deviations by which the result is off perfect 50% equality. To me it seems a better description of the result than ELO and error margins, and better than LOS, because LOS gets very close to 1 above 3-4 standard deviations. I don't know how this utterly simple expression evaded me; I guess some people here knew it.

Example TCEC Superfinal result: +9 -2 =89

Rigorously:
N=100
win_ratio - loss_ratio = (9-2)/100 = 0.07
sigma = sqrt(w*(1-w)+l*(1-l)+2*w*l)/sqrt(N) = 0.032419, with w = win_ratio = 0.09 and l = loss_ratio = 0.02
(w-l)/sigma = 2.159

The given simple expression:
(Wins - Losses)/sqrt(Wins + Losses) = (9-2)/sqrt(9+2) = 2.111
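The two computations above can be reproduced with a small C++ sketch. The function names simple_z and rigorous_z are made up for this illustration and are not from any engine-testing tool.

```cpp
#include <cassert>
#include <cmath>

// Simple expression: (W - L) / sqrt(W + L), which ignores draws entirely.
double simple_z(int wins, int losses) {
    assert(wins + losses > 0);  // undefined if every game was drawn
    return (wins - losses) / std::sqrt(static_cast<double>(wins + losses));
}

// "Rigorous" version: sigma is estimated from the observed win and loss ratios.
double rigorous_z(int wins, int losses, int draws) {
    const double n = wins + losses + draws;
    assert(n > 0);
    const double w = wins / n;    // win_ratio
    const double l = losses / n;  // loss_ratio
    const double sigma =
        std::sqrt(w * (1 - w) + l * (1 - l) + 2 * w * l) / std::sqrt(n);
    assert(sigma > 0);  // sigma is 0 for all draws, all wins or all losses
    return (w - l) / sigma;
}
```

Calling simple_z(9, 2) and rigorous_z(9, 2, 89) gives the two values quoted for the Superfinal example.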

The interpretation is the following: the result of the match is a bit above 2 standard deviations off perfect equality, i.e. the stronger engine has a LOS of a bit above 97.7%. When the result is 5-6 standard deviations off, the LOS is hard even to write down, but the statistical significance of 5-6 standard deviations is clear.

Even with so few games, the expression works very well. For something like Fishtest patches it will work almost perfectly, as the ELO differences are small and the number of games is large.

Posted: **Thu Dec 10, 2015 2:49 am**

Nice!

Your formula is in fact statistically correct from a frequentist point of view.

Your formula computes how many standard deviations W-L is from zero under the null hypothesis w=l (equal strength). So sigma is computed under the null hypothesis w=l (using that w+l=1-d, and d is estimated from the sample).

So under the null hypothesis and assuming normal approximation (W-L)/sqrt(W+L) is normally distributed with expectation value zero and standard deviation 1. So you can convert it to a p-value (which is the frequentist version of LOS).

Posted: **Thu Dec 10, 2015 3:23 am**

Another interpretation is that if we discard draws then we are left with W+L Bernoulli trials with equal probabilities for the alternatives (under the null hypothesis). From this one obtains that (W-L)/sqrt(W+L) has standard deviation precisely 1 (always under the null hypothesis).
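Written out, and assuming the n = W + L decisive games are independent with each one a win with probability 1/2 under the null hypothesis (n is shorthand introduced here), the argument is roughly:

```latex
\begin{align*}
W &\sim \mathrm{Binomial}\!\left(n, \tfrac{1}{2}\right), \qquad L = n - W, \\
W - L &= 2W - n, \qquad E[W - L] = 0, \\
\operatorname{Var}(W - L) &= 4 \operatorname{Var}(W) = 4 \cdot \frac{n}{4} = n, \\
\frac{W - L}{\sqrt{W + L}} &= \frac{2W - n}{\sqrt{n}}
  \quad \text{has mean } 0 \text{ and standard deviation } 1 .
\end{align*}
```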

Posted: **Thu Dec 10, 2015 9:46 am**

I compute LOS as

.5 + .5 * std::erf((wins-losses)/std::sqrt(2.0*(wins+losses)))

That basically means I use the same simple formula and then transform it to a p-value.
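As a self-contained sketch, the same computation can be split into the simple expression and the normal CDF. The function names below are chosen for this illustration only.

```cpp
#include <cassert>
#include <cmath>

// Standard normal cumulative distribution function, Phi(z).
double normal_cdf(double z) {
    return 0.5 * (1.0 + std::erf(z / std::sqrt(2.0)));
}

// The simple expression: number of standard deviations away from equality.
double z_score(int wins, int losses) {
    assert(wins + losses > 0);  // undefined if every game was drawn
    return (wins - losses) / std::sqrt(static_cast<double>(wins + losses));
}

// LOS as the one-sided normal probability for the observed z.
double los(int wins, int losses) {
    return normal_cdf(z_score(wins, losses));
}
```

Keeping z_score separate makes it easy to report the number of standard deviations directly and to convert to a LOS only when needed.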

Posted: **Thu Dec 10, 2015 3:31 pm**

Michel wrote:Nice!

Your formula is in fact statistically correct from a frequentist point of view.

Your formula computes how many standard deviations W-L is from zero under the null hypothesis w=l (equal strength). So sigma is computed under the null hypothesis w=l (using that w+l=1-d, and d is estimated from the sample).

So under the null hypothesis and assuming normal approximation (W-L)/sqrt(W+L) is normally distributed with expectation value zero and standard deviation 1. So you can convert it to a p-value (which is the frequentist version of LOS).

Thanks for this frequentist perspective. I actually don't feel that the use of p-values in medical and social sciences is very sound; I prefer the "five sigma" of physics. A p-value threshold of 0.05 or 0.01 can quickly give a large Type I error, so much so that with research groups seeking to prove miracles, some false miracles, out of many possible false miracles, will indeed appear to be proven from time to time with a very high likelihood (the Type I error explodes). The methodology assumes that one doesn't seek to prove miracles. That's why I prefer this "number of standard deviations" to the hardly intuitive p-value, even knowing that all of these, used as a "stopping rule", have a theoretically unbounded Type I error.

Posted: **Thu Dec 10, 2015 3:37 pm**

AlvaroBegue wrote:I compute LOS as

.5 + .5 * std::erf((wins-losses)/std::sqrt(2.0*(wins+losses)))

That basically means I use the same simple formula and then transform it to a p-value.

Yes, I also used this, somewhat blindly; from now on I will mainly use the quantity inside erf.

Posted: **Thu Dec 10, 2015 3:43 pm**

Laskos wrote:
Thanks for this frequentist perspective. I actually don't feel that the use of p-values in medical and social sciences is very sound; I prefer the "five sigma" of physics. A p-value threshold of 0.05 or 0.01 can quickly give a large Type I error, so much so that with research groups seeking to prove miracles, some false miracles, out of many possible false miracles, will indeed appear to be proven from time to time with a very high likelihood (the Type I error explodes). The methodology assumes that one doesn't seek to prove miracles. That's why I prefer this "number of standard deviations" to the hardly intuitive p-value, even knowing that all of these, used as a "stopping rule", have a theoretically unbounded Type I error.

Both p-values and t-values are perfectly easy to understand. The p-value means that, if the change you are exploring were actually a no-op, you would expect this number to be drawn from a uniform distribution in the interval [0,1]. If the number you get is 0.9999997, you might think this is too much of a coincidence. The t-value means that, if the change you are exploring were actually a no-op, you would expect this number to follow a standard normal distribution. If the number you get is 5.0, you would be just as convinced that this is too much of a coincidence as you were with the p-value.

Posted: **Thu Dec 10, 2015 3:48 pm**

AlvaroBegue wrote:
Both p-values and t-values are perfectly easy to understand. The p-value means that, if the change you are exploring were actually a no-op, you would expect this number to be drawn from a uniform distribution in the interval [0,1]. If the number you get is 0.9999997, you might think this is too much of a coincidence. The t-value means that, if the change you are exploring were actually a no-op, you would expect this number to follow a standard normal distribution. If the number you get is 5.0, you would be just as convinced that this is too much of a coincidence as you were with the p-value.

Yes, they are the same thing, but I rarely see a p-value of 0.99999999993 in medical or social sciences, while in physics a t-value of the same significance appears all the time (and it's easier to write as "6"). Also, do you know how the Type I error behaves when a p-value of 0.05 is used as a stopping rule? I simulated it; this stopping rule is practically worthless in engine testing.
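A simulation of this kind could look roughly like the sketch below. It is not the simulation referred to in the post. Everything in it is an assumption made for illustration: the function name, the draw ratio, the first game at which checking starts, and the use of a two-sided threshold such as 1.96 as the equivalent of p = 0.05. It plays matches between two equal engines and counts how often checking after every game wrongly declares a winner.

```cpp
#include <cassert>
#include <cmath>
#include <cstdlib>
#include <random>

// Fraction of simulated matches between two EQUAL engines in which the rule
// "stop as soon as |W - L| / sqrt(W + L) exceeds z_stop" declares a winner.
// All parameters are assumptions of this sketch, not values from the thread.
double false_positive_rate(int max_games, int first_check, double draw_ratio,
                           double z_stop, int trials, unsigned seed) {
    assert(draw_ratio >= 0.0 && draw_ratio < 1.0 && trials > 0);
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> unif(0.0, 1.0);
    const double win_prob = (1.0 - draw_ratio) / 2.0;  // null hypothesis: w == l
    int false_positives = 0;
    for (int t = 0; t < trials; ++t) {
        int wins = 0, losses = 0;
        for (int game = 1; game <= max_games; ++game) {
            const double u = unif(rng);
            if (u < win_prob) ++wins;
            else if (u < 2.0 * win_prob) ++losses;
            // otherwise the game is a draw
            if (game < first_check || wins + losses == 0) continue;
            const double z = std::abs(wins - losses) /
                             std::sqrt(static_cast<double>(wins + losses));
            if (z > z_stop) { ++false_positives; break; }  // stopping rule fires
        }
    }
    return static_cast<double>(false_positives) / trials;
}
```

Comparing a run that checks only once at the end (first_check equal to max_games) with a run that checks after every game shows how much the repeated checks inflate the Type I error.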

Posted: **Thu Dec 10, 2015 3:52 pm**

Laskos wrote:
Yes, they are the same thing, but I rarely see a p-value of 0.99999999993 in medical or social sciences, while in physics a t-value of the same significance appears all the time (and it's easier to write as "6").

You won't see 0.99999999993 in chess engine tests either, unless you have time for billions of games. Perhaps that's why the p-value scale is commonly used both in chess and in medical science.

Posted: **Thu Dec 10, 2015 3:54 pm**

AlvaroBegue wrote:
You won't see 0.99999999993 in chess engine tests either, unless you have time for billions of games. Perhaps that's why the p-value scale is commonly used both in chess and in medical science.

That sort of LOS appears all the time in Fishtest regressions, they just write it as "100%".

A simple expression
