Almost Sure Hypothesis Testing
In statistics, almost sure hypothesis testing or a.s. hypothesis testing utilizes almost sure convergence in order to determine the validity of a statistical hypothesis with probability one. That is, whenever the null hypothesis is true, an a.s. hypothesis test will fail to reject the null hypothesis w.p. 1 for all sufficiently large samples; whenever the alternative hypothesis is true, an a.s. hypothesis test will reject the null hypothesis w.p. 1 for all sufficiently large samples. Along similar lines, an a.s. confidence interval eventually contains the parameter of interest with probability 1. Dembo and Peres (1994) proved the existence of almost sure hypothesis tests.


Description

For simplicity, assume we have a sequence of independent and identically distributed normal random variables, \textstyle x_i \sim N(\mu,1), with mean \textstyle \mu and unit variance. Suppose that nature or simulation has chosen the true mean to be \textstyle \mu_0; then the probability distribution function of the mean \textstyle \mu is given by

: \Pr(\mu\le t) = [t\in[\mu_0,+\infty)]

where an Iverson bracket has been used. A naïve approach to estimating this distribution function would be to replace the true mean on the right-hand side with an estimate such as the sample mean \textstyle \hat\mu, but

: \operatorname E\bigl[[t\in[\hat\mu,+\infty)]\bigr] = \Pr(\hat\mu\le t) = \Phi\bigl(\sqrt{n}\,(t-\mu_0)\bigr) \rightarrow \Pr(\mu\le t) - 0.5\,[\mu_0=t]

which means the approximation to the true distribution function will be off by 0.5 at the true mean. However, \textstyle [\hat\mu,+\infty) is nothing more than a one-sided 50% confidence interval; more generally, let \textstyle Z_{\alpha_n} be the critical value used in a one-sided \textstyle 1-\alpha_n confidence interval. Then

: \operatorname E\left[\left[t\in\left[\hat\mu - \frac{Z_{\alpha_n}}{\sqrt{n}},+\infty\right)\right]\right] \rightarrow \Pr(\mu\le t) - \lim_{n\to\infty} \alpha_n\,[\mu_0=t]

If we set \textstyle \alpha_n=0.05, then the error of the approximation is reduced from 0.5 to 0.05, which is a factor of 10. Of course, if we let \textstyle \alpha_n \rightarrow 0, then

: \operatorname E\left[\left[t\in\left[\hat\mu - \frac{Z_{\alpha_n}}{\sqrt{n}},+\infty\right)\right]\right] \rightarrow \Pr(\mu\le t)

However, this only shows that the expectation is close to the limiting value. Naaman (2016) showed that setting the significance level at \textstyle \alpha_n=n^{-p} with \textstyle p>1 results in a finite number of type I and type II errors w.p. 1 under fairly mild regularity conditions. This means that for each \textstyle t there exists an \textstyle N(t) such that for all \textstyle n>N(t),

: \left[t\in\left[\hat\mu - \frac{Z_{\alpha_n}}{\sqrt{n}},+\infty\right)\right] = \Pr(\mu\le t)

where the equality holds w.p. 1. So the indicator function of a one-sided a.s. confidence interval is a good approximation to the true distribution function.
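The construction above can be illustrated with a short simulation. The following sketch is not from the article; the schedule \textstyle \alpha_n=n^{-2}, the evaluation point \textstyle t=\mu_0, the random seed, and the use of SciPy's normal quantile are illustrative assumptions:

 import numpy as np
 from scipy.stats import norm
 
 rng = np.random.default_rng(0)
 
 mu0 = 0.0   # true mean chosen by "nature"
 t = mu0     # evaluate at the true mean, where the naive estimator is off by 0.5
 p = 2.0     # significance schedule alpha_n = n**(-p) with p > 1
 
 n_max = 10_000
 x = rng.normal(mu0, 1.0, size=n_max)
 sums = np.cumsum(x)
 
 truth = int(t >= mu0)   # Pr(mu <= t) = [t in [mu0, +inf)] = 1 here
 naive_errors = 0        # indicator [t in [mu_hat, +inf)] vs the truth
 as_errors = 0           # indicator of the a.s. confidence interval vs the truth
 
 for n in range(1, n_max + 1):
     mu_hat = sums[n - 1] / n
     naive_errors += int(t >= mu_hat) != truth
     z = norm.ppf(1 - n ** (-p))          # one-sided critical value Z_{alpha_n}
     lower = mu_hat - z / np.sqrt(n)      # a.s. interval [lower, +inf)
     as_errors += int(t >= lower) != truth
 
 print(f"naive indicator disagrees {naive_errors} times out of {n_max}")  # a large fraction
 print(f"a.s. indicator disagrees {as_errors} times out of {n_max}")      # only a few, all early

With \textstyle p>1, the second count stops growing w.p. 1 as n_max increases, which is the finite-error property described above.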


Applications


Optional stopping

For example, suppose a researcher performed an experiment with a sample size of 10 and found no statistically significant result. Then suppose she decided to add one more observation and retest, continuing this process until a significant result was found. Under this scenario, given that the initial batch of 10 observations resulted in an insignificant result, the probability that the experiment will be stopped at some finite sample size, N_s, can be bounded using Boole's inequality:

: \Pr(N_s<+\infty) < \sum_{n=11}^{\infty} \alpha_n < 0.0952

where \alpha_n=n^{-2}. This compares favorably with fixed significance level testing, which has a finite stopping time with probability one; however, this bound will not be meaningful for all sequences of significance levels, as the above sum can be greater than one (setting \alpha_n=n^{-1.2} would be one example). But even using that bandwidth, if the testing was done in batches of 10, then

: \Pr\left(N_s<+\infty\right) < \sum_{i=2}^{\infty} \left( 10i \right)^{-1.2} < 0.3

which results in a relatively large probability that the process will never end.
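The two numerical bounds above are tail sums that can be checked directly. A quick sketch (not from the article; the truncation point and the integral bound on the neglected remainder are implementation choices):

 def tail_sum(s, start, N=10**6):
     """Upper bound on sum_{n=start}^inf n**(-s) for s > 1: partial sum
     plus an integral bound on the neglected tail."""
     partial = sum(n ** (-s) for n in range(start, N))
     remainder = (N - 1) ** (1 - s) / (s - 1)  # sum_{n>=N} n**-s < integral_{N-1}^inf x**-s dx
     return partial + remainder
 
 print(tail_sum(2.0, 11))                # ~0.0952: one-at-a-time testing with alpha_n = n**-2
 print(tail_sum(1.2, 11))                # > 1: the Boole bound is vacuous for alpha_n = n**-1.2
 print(10 ** (-1.2) * tail_sum(1.2, 2))  # ~0.29 < 0.3: batches of 10, tests at n = 20, 30, ...

The last line uses the factorization \sum_{i=2}^{\infty}(10i)^{-1.2} = 10^{-1.2}\sum_{i=2}^{\infty} i^{-1.2}.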


Publication bias

As another example of the power of this approach, if an academic journal only accepts papers with p-values less than 0.05, then roughly 1 in 20 independent studies of the same effect would find a significant result when there was none. However, if the journal required a minimum sample size of 100 and a maximum significance level of \alpha_n=n^{-1.2}, then one would expect roughly 1 in 250 studies to find an effect when there was none (if the minimum sample size were 30, it would still be 1 in 60). If the maximum significance level were instead \alpha_n=n^{-2} (which has better small sample performance with regard to type I error when multiple comparisons are a concern), one would expect roughly 1 in 10,000 studies to find an effect when there was none (if the minimum sample size were 30, it would be 1 in 900). Additionally, a.s. hypothesis testing is robust to multiple comparisons.
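These rates are just the significance-level schedule evaluated at the journal's minimum sample size, since \alpha_n is the probability of a type I error at sample size n. A one-line check (illustrative):

 for p in (1.2, 2.0):
     for n in (100, 30):
         print(f"alpha_n = n**-{p} at n = {n}: about 1 in {round(n ** p)}")
 # n**-1.2: 1 in 251 (n = 100), 1 in 59 (n = 30)
 # n**-2:   1 in 10000 (n = 100), 1 in 900 (n = 30)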


Jeffreys–Lindley paradox

Lindley's paradox occurs when

# The result is "significant" by a frequentist test, at, for example, the 5% level, indicating sufficient evidence to reject the null hypothesis, and
# The posterior probability of the null hypothesis is high, indicating strong evidence that the null hypothesis is in better agreement with the data than the alternative hypothesis.

However, the paradox does not apply to a.s. hypothesis tests: the Bayesian and the frequentist will eventually reach the same conclusion.
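This agreement can be seen in a simulation under the null. The sketch below is not from the article: the one-sided z-test, the point-null Bayes factor against a standard normal prior on the mean, and all numeric choices are illustrative assumptions.

 import numpy as np
 from scipy.stats import norm
 
 rng = np.random.default_rng(1)
 
 n_max = 100_000
 x = rng.normal(0.0, 1.0, size=n_max)       # data generated under H0: mu = 0
 n = np.arange(1, n_max + 1)
 z = np.cumsum(x) / np.sqrt(n)              # z-statistic sqrt(n) * mu_hat
 
 fixed_rejects = z > norm.ppf(0.95)         # fixed 5% one-sided test
 as_rejects = z > norm.ppf(1 - n ** -2.0)   # a.s. test with alpha_n = n**-2
 
 # Bayes factor for the point null against a N(0, 1) prior on mu under H1;
 # under H0 it diverges, so the posterior probability of H0 tends to one.
 bf01 = np.sqrt(n + 1.0) * np.exp(-z ** 2 * n / (2.0 * (n + 1.0)))
 post_h0 = bf01 / (1.0 + bf01)              # equal prior odds on H0 and H1
 
 def last_rejection(mask):
     return n[mask][-1] if mask.any() else None
 
 print("fixed 5% test rejects", fixed_rejects.sum(), "times, last at n =", last_rejection(fixed_rejects))
 print("a.s. test rejects", as_rejects.sum(), "times, last at n =", last_rejection(as_rejects))
 print("posterior P(H0) at n_max:", post_h0[-1])

The fixed-level test keeps producing "significant" results indefinitely while the posterior probability of the null approaches one, which is the paradox; the a.s. test stops rejecting after finitely many samples, matching the Bayesian conclusion.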


See also

* Bayes factor
* Lindley's paradox
* Confidence interval
* Hypothesis testing


References

* Naaman, Michael (2016). "Almost sure hypothesis testing and a resolution of the Jeffreys–Lindley paradox". Electronic Journal of Statistics. 10 (1): 1526–1550.
* Dembo, Amir; Peres, Yuval (1994). "A topological criterion for hypothesis testing". The Annals of Statistics. 22 (1): 106–117.