Some properties of the Type I error in p-value stopping ru

Laskos · Post by **Laskos** » Tue Mar 01, 2016 1:15 pm

Let's take the key t-value for chess match results: t = (W-L)/sqrt(W+L) = (2*W-N)/sqrt(N), where W and L are the numbers of wins and losses and the integer N = W+L is the scaling parameter. If the null hypothesis is assumed true (W=L=N/2 in expectation), then under the normal approximation this quantity is normally distributed with mean zero and standard deviation 1, so it converts directly to a p-value. We would like a p-value stopping rule that rejects the null hypothesis H0: W=L with a known Type I error (incorrect rejection of a true null hypothesis H0). Although SPRT is the closest thing to the best solution, many people still use p-values (confidence intervals) to decide "superiority" (W<>L, i.e. rejection of H0). For example, people who stop at 2 standard deviations in Bayeselo (t=2) are in fact stopping at a certain p-value. Methodologically, I also assume that people stop as soon as they see the p-value drop below some threshold. This stopping rule has UNBOUNDED Type I error: in the limit N->infinity, the Type I error is 100%.
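As a small illustration of the quantity above, the sketch below computes the t-value from wins and losses and converts it to a p-value under the normal approximation. The function names `t_value` and `p_value` are made up for this example, and the two-sided convention is an assumption (it is consistent with the p-value column quoted later in the thread, where t=2 corresponds to about 4.55%).

```python
import math

def t_value(wins, losses):
    """t = (W - L) / sqrt(W + L); only wins and losses enter."""
    return (wins - losses) / math.sqrt(wins + losses)

def p_value(t):
    """Two-sided p-value of t under a standard normal null distribution."""
    return math.erfc(abs(t) / math.sqrt(2.0))

# Hypothetical match result: 550 wins, 450 losses.
t = t_value(550, 450)
print(f"t = {t:.3f}, p = {p_value(t):.4%}")
```

Stopping "as soon as p drops below a threshold" then means evaluating `p_value(t_value(W, L))` after every game and halting at the first crossing, which is the rule examined in the rest of the post.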

Practically, theoretical considerations aside, the scale of the problem (N) is bounded, so this stopping rule can still be considered on some finite range, but it is important to control its Type I error. My experiment starts here. I don't know a theoretical derivation for this quantity, so I ran simulations. First observation: doubling the scale N adds a roughly constant Type I error:

   N     Type I error for t=2
           from N to 2*N

  500       7.92%
 1000       7.96%
 2000       7.87%
 4000       7.96%
 8000       7.90%

Constant within error margins. Therefore, the Type I error grows logarithmically in N, which confirms that it is unbounded. The total Type I error for N: 500->16000 in this case (t=2) is 33.8%. But because the Type I error is logarithmic in N, the stopping rule has some use for finite N. If the error per doubling times log2(N) is appreciably smaller than 1, then the Type I error is controlled, though there is a trade-off between smaller error and the necessary effort. The stopping rule is far from optimal, but at least it can be used soundly.
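The post does not spell out the simulation protocol, so the following is only a sketch under one possible reading: a run counts toward the "Type I error from N to 2*N" if |t| is still below the threshold at game N and reaches it at some game in (N, 2*N]. Under that reading the per-doubling errors compound multiplicatively, and compounding the five values from the table gives the quoted 33.8% for 500->16000. The names `doubling_error` and `total_error` are invented for this example, and every game is assumed to be a win or a loss with probability 1/2.

```python
import math
import random

def doubling_error(n, t, trials, seed=0):
    """Estimate P(|t-value| reaches t in (n, 2n] | below t at game n) under H0."""
    rng = random.Random(seed)
    eligible = stopped = 0
    for _ in range(trials):
        # Play the first n games; s is wins minus losses.
        s = sum(1 if rng.random() < 0.5 else -1 for _ in range(n))
        if abs(s) >= t * math.sqrt(n):
            continue  # already at the threshold at game n: not counted
        eligible += 1
        for games in range(n + 1, 2 * n + 1):
            s += 1 if rng.random() < 0.5 else -1
            if abs(s) >= t * math.sqrt(games):
                stopped += 1  # false positive: H0 is true but we stopped
                break
    return stopped / eligible if eligible else float("nan")

def total_error(per_doubling):
    """Compound conditional per-doubling errors into a total Type I error."""
    survive = 1.0
    for e in per_doubling:
        survive *= 1.0 - e
    return 1.0 - survive

# Per-doubling values from the table above (t=2, N = 500 ... 8000).
table = [0.0792, 0.0796, 0.0787, 0.0796, 0.0790]
print(f"total 500->16000: {total_error(table):.1%}")  # 33.8%

# Small, fast run; the N values in the post need far more trials to pin down.
print(f"simulated, N=100, t=2: {doubling_error(100, 2.0, 2000):.2%}")
```

The simulated figure need not match the table exactly, since the original protocol (starting point, number of trials, treatment of runs already past the threshold) is not given.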

Second observation: the Type I error per doubling seems to follow the quantity Exp(-t^2/2), closely for t >= 4 and falling increasingly below it for smaller t. So the error drops quickly to small values as the t-value increases:

t-value  Type I error   Exp(-t^2/2)     p-value
         per doubling

5.0         0.00040%      0.00037%      0.000057%  
4.5         0.0044%       0.0040%       0.00068%
4.0         0.033%        0.034%        0.0063%  
3.5         0.18%         0.219%        0.046%
3.0         0.81%         1.11%         0.27%
2.5         2.66%         4.39%         1.24%
2.0         7.92%        13.53%         4.55%
1.5        20.23%        32.47%        13.36% 
1.0        49.08%        60.65%        31.73%
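To see how closely the simulated column tracks Exp(-t^2/2), one can recompute that column and take the ratio. This is a minimal sketch: the simulated values are copied from the table above (they cannot be recomputed here), and the helper names `gauss_factor` and `ratio` are made up for this example.

```python
import math

# (t, simulated Type I error per doubling in %), copied from the table above.
rows = [(5.0, 0.00040), (4.5, 0.0044), (4.0, 0.033), (3.5, 0.18), (3.0, 0.81),
        (2.5, 2.66), (2.0, 7.92), (1.5, 20.23), (1.0, 49.08)]

def gauss_factor(t):
    """Exp(-t^2/2), the quantity the per-doubling error is compared with."""
    return math.exp(-t * t / 2.0)

def ratio(t, simulated_percent):
    """Simulated per-doubling error divided by Exp(-t^2/2)."""
    return (simulated_percent / 100.0) / gauss_factor(t)

for t, simulated in rows:
    print(f"t={t:3.1f}  Exp(-t^2/2)={100 * gauss_factor(t):.3g}%  "
          f"simulated/Exp={ratio(t, simulated):.2f}")
```

A ratio near 1 means the two columns agree; the further it departs from 1, the rougher the Exp(-t^2/2) approximation is at that t.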

On practical grounds: in the physical sciences N is at most, say, 2^300, so a stopping rule for this quantity with the usual t=5 or more gives less than 1% Type I error. In chess testing with games, N is at most 2^20, and a stopping rule based on t=3.5 can be used safely with less than 5% error. Again, this p-value stopping rule is far from optimal with regard to effort. On the other hand, the often used 2 standard deviations stopping rule (p=0.05) is virtually impossible to apply beyond a sole doubling and is hardly of any use as a serious stopping rule for this quantity.
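The arithmetic behind these three claims can be checked with the rule of thumb from earlier in the post, total error ~ (error per doubling) * log2(N). A rough sketch, with the per-doubling values taken from the table and a hypothetical helper name `error_budget`; the plain product is an upper bound on the compounded error, not an exact figure.

```python
def error_budget(per_doubling, doublings):
    """Upper bound on the total Type I error: per-doubling error times doublings."""
    return per_doubling * doublings

cases = [
    # (label, t, error per doubling, doublings = log2(N), target error)
    ("physical sciences", 5.0, 0.0000040, 300, 0.01),
    ("chess testing", 3.5, 0.0018, 20, 0.05),
    ("2 standard deviations", 2.0, 0.0792, 1, 0.05),
]
for label, t, err, doublings, target in cases:
    total = error_budget(err, doublings)
    verdict = "within" if total < target else "exceeds"
    print(f"{label}: t={t}, {doublings} doublings -> {total:.2%} ({verdict} {target:.0%})")
```

With t=2 a single doubling already uses more than the whole 5% budget, which is the sense in which that rule cannot be applied beyond one doubling.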

Laskos · Post by **Laskos** » Tue Mar 08, 2016 9:23 pm

Laskos wrote: On the other hand, the often used stopping rule p=0.05 is virtually impossible to apply beyond a sole doubling and is hardly of any use as a serious stopping rule for this quantity.

A similar point is made by the American Statistical Association (ASA), as reported in "Nature News":
Nature 531, 151 (10 March 2016) doi:10.1038/nature.2016.19503
http://www.nature.com/news/statistician ... NatureNews

Laskos · Post by **Laskos** » Fri Jul 28, 2017 11:13 am

Laskos wrote:
On practical grounds: in the physical sciences N is at most, say, 2^300, so a stopping rule for this quantity with the usual t=5 or more gives less than 1% Type I error. In chess testing with games, N is at most 2^20, and a stopping rule based on t=3.5 can be used safely with less than 5% error. Again, this p-value stopping rule is far from optimal with regard to effort. On the other hand, the often used 2 standard deviations stopping rule (p=0.05) is virtually impossible to apply beyond a sole doubling and is hardly of any use as a serious stopping rule for this quantity.

Funny, "Big names in statistics" are needed to reach the same conclusions (more than a year later):

Big names in statistics want to shake up much-maligned P value
One of scientists’ favourite statistics — the P value — should face tougher standards, say leading researchers.
http://www.nature.com/news/big-names-in ... ue-1.22375

Laskos · Post by **Laskos** » Tue Nov 28, 2017 6:17 pm

Here is another paper in "Nature" on this, to be read together with the thread on priors and Bayesian analysis http://talkchess.com/forum/viewtopic.php?t=64084

https://www.nature.com/articles/d41586- ... n=20171128

Laskos · Post by **Laskos** » Wed Feb 06, 2019 10:26 am

And here are the earliest stops that are safely allowed, where the Type I error (false positive) is significantly smaller than 5%. It is assumed that we know nothing about the engines (uniform prior), so the p-value depends only on Wins and Losses.

Here is an easy hack in "Mathematica" that derives the table for very small numbers of games with very skewed results, allowing a very quick safe stop. I think SPRT doesn't allow such quick stops, because the Type I error there is only bounded and is not necessarily close to that bound with very small numbers of games. The table should be read as: for N1 Wins there can be at most N2 Losses in order to stop.

"Mathematica" hack:

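err = 0.01;  (* target Type I error *)

(* For each number of wins i, scan losses j < i and print the pair
   for which the sum a is below err while the sum b is above it. *)
Do[
  a = Sum[2*Binomial[i + k, k]/2^(i + k), {k, 0, j}];
  b = Sum[2*Binomial[i + k + 1, k + 1]/2^(i + k + 1), {k, 0, j}];
  If[a < err && b > err, Print["Wins=", i, " , ", "Losses=", j]],
  {i, 0, 50}, {j, 0, i - 1}]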

Table of very early safe stops to detect a positive:

Wins=8 , Losses=0
Wins=9 , Losses=0
Wins=10 , Losses=0
Wins=11 , Losses=1
Wins=12 , Losses=1
Wins=13 , Losses=2
Wins=14 , Losses=2
Wins=15 , Losses=3
Wins=16 , Losses=3
Wins=17 , Losses=4
Wins=18 , Losses=4
Wins=19 , Losses=5
Wins=20 , Losses=5
Wins=21 , Losses=6
Wins=22 , Losses=6
Wins=23 , Losses=7
Wins=24 , Losses=8
Wins=25 , Losses=8
Wins=26 , Losses=9
Wins=27 , Losses=9
Wins=28 , Losses=10
Wins=29 , Losses=11
Wins=30 , Losses=11
Wins=31 , Losses=12
Wins=32 , Losses=13
Wins=33 , Losses=13
Wins=34 , Losses=14
Wins=35 , Losses=15
Wins=36 , Losses=15
Wins=37 , Losses=16
Wins=38 , Losses=17
Wins=39 , Losses=17
Wins=40 , Losses=18
Wins=41 , Losses=19
Wins=42 , Losses=19
Wins=43 , Losses=20
Wins=44 , Losses=21
Wins=45 , Losses=22
Wins=46 , Losses=22
Wins=47 , Losses=23
Wins=48 , Losses=24
Wins=49 , Losses=24
Wins=50 , Losses=25

Again, Losses here are "at most".
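For readers without Mathematica, here is a rough Python port of the hack above. It is a sketch, not the original code: it uses exact fractions instead of floating point, and the function name `early_stops` is made up for this example.

```python
from fractions import Fraction
from math import comb

def early_stops(err=Fraction(1, 100), max_wins=50):
    """Return (wins, losses) pairs passing the same test as the Mathematica loop."""
    stops = []
    for i in range(max_wins + 1):
        for j in range(i):
            # Same two sums as in the Mathematica code, in exact arithmetic.
            a = sum(Fraction(2 * comb(i + k, k), 2 ** (i + k))
                    for k in range(j + 1))
            b = sum(Fraction(2 * comb(i + k + 1, k + 1), 2 ** (i + k + 1))
                    for k in range(j + 1))
            if a < err and b > err:
                stops.append((i, j))
    return stops

for wins, losses in early_stops():
    print(f"Wins={wins} , Losses={losses}")
```

Changing `err` gives the corresponding table for a different error target, and `max_wins` extends it beyond 50 wins.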

Further, for more games, if one wants to stop early, use a 3-3.5 standard deviations rule for up to tens of thousands of games (but not millions).
