## Some properties of the Type I error in p-value stopping rules

Posts: 9313
Joined: Wed Jul 26, 2006 8:21 pm
Full name: Kai Laskos

### Some properties of the Type I error in p-value stopping rules

Let's take the important t-value in chess match results: t = (W-L)/sqrt(W+L) = (2*W-N)/sqrt(N), where N = W+L is the scaling parameter, an integer. If the null hypothesis H0: W=L is assumed true (W=L=N/2), then in the normal approximation this quantity is normally distributed with expectation zero and standard deviation 1, so it is directly convertible to a p-value. We would like a p-value stopping rule rejecting H0 with a certain Type I error (incorrect rejection of a true null hypothesis). While SPRT is closest to the optimal solution here, many people still use p-values (confidence intervals) to determine "superiority" (W<>L, i.e. rejection of H0). People using, say, 2 standard deviations in Bayeselo (t=2) are in fact stopping at a certain p-value. Methodologically, I also assume that people stop as soon as they see the p-value drop below some threshold. This stopping rule has UNBOUNDED Type I error: in the limit N->infinity, the Type I error is 100%.
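For concreteness, the t-value and its two-sided p-value can be computed as follows (a minimal Python sketch of the formulas above; the example counts are made up):

```python
from math import erfc, sqrt

def t_value(wins, losses):
    # (W - L) / sqrt(W + L); under H0 (W = L) this is approximately N(0, 1)
    return (wins - losses) / sqrt(wins + losses)

def p_value(t):
    # two-sided p-value in the normal approximation
    return erfc(abs(t) / sqrt(2))

t = t_value(530, 470)        # hypothetical: 530 wins, 470 losses
print(t, p_value(t))         # t just under 2, p-value just above 0.05
```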

Practically, aside from theoretical considerations, the scale of the problem (N) is bounded, and this stopping rule can still be considered on some finite range. But it's important to control the Type I error for this quantity. My experiment starts here. I don't know the theoretical derivation for this quantity in this case, so I performed simulations. First observation: doubling the scale N adds a roughly constant additional Type I error:

```
   N      Type I error for t=2
          from N to 2*N

 500          7.92%
1000          7.96%
2000          7.87%
4000          7.96%
8000          7.90%
```
Constant within error margins. Therefore the Type I error is logarithmic in N, which confirms that it is unbounded. The total Type I error for N: 500->16000 in this case (t=2) is 33.8%. But since the Type I error is only logarithmic in N, for finite N the stopping rule still has some use: if the per-doubling error times log2(N) is sensibly smaller than 1, then the Type I error is controlled, though there is a balance to strike between a smaller error and the necessary effort. The stopping rule is far from optimal, but at least it can be soundly used.

Second observation: the Type I error per doubling seems to follow closely the quantity Exp(-t^2/2), so the error drops quickly to small values with increasing t-value:

```
t-value  Type I error   Exp(-t^2/2)     p-value
         per doubling

5.0         0.00040%      0.00037%      0.000057%
4.5         0.0044%       0.0040%       0.00068%
4.0         0.033%        0.034%        0.0063%
3.5         0.18%         0.219%        0.046%
3.0         0.81%         1.11%         0.27%
2.5         2.66%         4.39%         1.24%
2.0         7.92%        13.53%         4.55%
1.5        20.23%        32.47%        13.36%
1.0        49.08%        60.65%        31.73%
```
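The last two columns are closed-form and easy to reproduce (a small Python check using only the standard library):

```python
from math import erfc, exp, sqrt

# reproduce the Exp(-t^2/2) and two-sided p-value columns of the table
for t in (5.0, 4.0, 3.0, 2.0, 1.0):
    approx = exp(-t * t / 2)        # per-doubling approximation
    p = erfc(t / sqrt(2))           # one-shot two-sided p-value
    print(f"t={t}: Exp(-t^2/2)={approx:.5%}  p-value={p:.5%}")
```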
On practical grounds: in the physical sciences N is at most, say, 2^300, so a stopping rule for this quantity with the usual t=5 or more gives less than 1% Type I error. In chess testing with games, N is at most 2^20, and a stopping rule based on t=3.5 can be safely used with less than 5% error. Again, this p-value stopping rule is far from optimal with regard to effort. On the other hand, the often used 2 standard deviations stopping rule (p=0.05) is virtually impossible to apply beyond a single doubling and is hardly of any use as a serious stopping rule for this quantity.


### Re: Some properties of the Type I error in p-value stopping rules

Laskos wrote:On the other hand, the often used stopping rule p=0.05 is virtually impossible to apply beyond a sole doubling and is hardly of any use as a serious stopping rule for this quantity.
In "Nature News", from the paper Nature 531, 151 (10 March 2016), doi:10.1038/nature.2016.19503, a similar thing is asserted by the American Statistical Association (ASA):
http://www.nature.com/news/statistician ... NatureNews


### Re: Some properties of the Type I error in p-value stopping rules

Laskos wrote:On the other hand, the often used 2 standard deviations stopping rule (p=0.05) is virtually impossible to apply beyond a sole doubling and is hardly of any use as a serious stopping rule for this quantity.
Funny, "Big names in statistics" are needed to reach the same conclusions (more than a year later):

Big names in statistics want to shake up much-maligned P value
One of scientists’ favourite statistics — the P value — should face tougher standards, say leading researchers.

http://www.nature.com/news/big-names-in ... ue-1.22375


### Re: Some properties of the Type I error in p-value stopping rules

Laskos wrote:Funny, "Big names in statistics" are needed to reach the same conclusions (more than a year later)...
Another paper in "Nature" on this, combined with the thread on priors and Bayesian analysis http://talkchess.com/forum/viewtopic.php?t=64084

https://www.nature.com/articles/d41586- ... n=20171128


### Re: Some properties of the Type I error in p-value stopping rules

And here are the very first early stops safely allowed, where the Type I error (false positive) is significantly smaller than 5%. It is assumed that we don't know anything about the engines (uniform prior), therefore the p-value depends only on Wins and Losses.

An easy hack in "Mathematica" derives the table for very small numbers of games with very skewed results, allowing a very quick safe stop. I think SPRT doesn't allow such quick stops, since its Type I error is bounded but not necessarily close to the boundary for very small numbers of games. The table should be read: for N1 Wins there can be at most N2 Losses in order to stop.

"Mathematica" hack:

```mathematica
err = 0.01;

For[i = 0, i <= 50, i++,
 For[j = 0, j < i, j++,
  (* a: two-sided tail probability with at most j losses at the i-th win *)
  a = Sum[2*Binomial[i + k, k]/2^(i + k), {k, 0, j}];
  (* b: the same one game later; the boundary sits where a < err < b *)
  b = Sum[2*Binomial[i + k + 1, k + 1]/2^(i + k + 1), {k, 0, j}];
  If[a < err && b > err, Print["Wins=", i, " , ", "Losses=", j]]
  ]
 ]
```
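The same boundary can be computed by porting the sums directly to Python (my own port, not part of the original post; `early_stop_losses` is a name I made up):

```python
from math import comb

def early_stop_losses(wins, err=0.01):
    # Max losses allowed at `wins` wins for a safe early stop.
    # Mirrors the Mathematica sums: a is the tail probability with at
    # most j losses, b the same one game later; boundary: a < err < b.
    for j in range(wins):
        a = sum(2 * comb(wins + k, k) / 2**(wins + k)
                for k in range(j + 1))
        b = sum(2 * comb(wins + k + 1, k + 1) / 2**(wins + k + 1)
                for k in range(j + 1))
        if a < err and b > err:
            return j
    return None   # no safe early stop at this many wins

print(early_stop_losses(8), early_stop_losses(11), early_stop_losses(13))
# reproduces the first table entries: 0, 1, 2
```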
Table of very early safe stops to detect a positive:

```
Wins=8 , Losses=0
Wins=9 , Losses=0
Wins=10 , Losses=0
Wins=11 , Losses=1
Wins=12 , Losses=1
Wins=13 , Losses=2
Wins=14 , Losses=2
Wins=15 , Losses=3
Wins=16 , Losses=3
Wins=17 , Losses=4
Wins=18 , Losses=4
Wins=19 , Losses=5
Wins=20 , Losses=5
Wins=21 , Losses=6
Wins=22 , Losses=6
Wins=23 , Losses=7
Wins=24 , Losses=8
Wins=25 , Losses=8
Wins=26 , Losses=9
Wins=27 , Losses=9
Wins=28 , Losses=10
Wins=29 , Losses=11
Wins=30 , Losses=11
Wins=31 , Losses=12
Wins=32 , Losses=13
Wins=33 , Losses=13
Wins=34 , Losses=14
Wins=35 , Losses=15
Wins=36 , Losses=15
Wins=37 , Losses=16
Wins=38 , Losses=17
Wins=39 , Losses=17
Wins=40 , Losses=18
Wins=41 , Losses=19
Wins=42 , Losses=19
Wins=43 , Losses=20
Wins=44 , Losses=21
Wins=45 , Losses=22
Wins=46 , Losses=22
Wins=47 , Losses=23
Wins=48 , Losses=24
Wins=49 , Losses=24
Wins=50 , Losses=25
```
Again, Losses here are "at most".

Further, for more games, if one wants to stop, use a 3-3.5 standard deviations rule up to tens of thousands of games (but not millions).