Permutation test
A permutation test (also called a re-randomization test) is an exact statistical hypothesis test making use of proof by contradiction. A permutation test involves two or more samples. The null hypothesis is that all samples come from the same distribution: H_0: F = G. Under the null hypothesis, the distribution of the test statistic is obtained by calculating all possible values of the test statistic under possible rearrangements of the observed data. Permutation tests are, therefore, a form of resampling. Permutation tests can be understood as surrogate data testing where the surrogate data under the null hypothesis are obtained through permutations of the original data.

In other words, the method by which treatments are allocated to subjects in an experimental design is mirrored in the analysis of that design. If the labels are exchangeable under the null hypothesis, then the resulting tests yield exact significance levels; see also exchangeability. Confidence intervals can then be derived from the tests. The theory has evolved from the works of Ronald Fisher and E. J. G. Pitman in the 1930s. Permutation tests should not be confused with randomized tests.


Method

To illustrate the basic idea of a permutation test, suppose we collect random variables X_A and X_B for each individual from two groups A and B whose sample means are \bar{X}_A and \bar{X}_B, and that we want to know whether X_A and X_B come from the same distribution. Let n_A and n_B be the sample sizes collected from each group. The permutation test is designed to determine whether the observed difference between the sample means is large enough to reject, at some significance level, the null hypothesis H_0 that the data drawn from A come from the same distribution as the data drawn from B.

The test proceeds as follows. First, the difference in means between the two samples is calculated: this is the observed value of the test statistic, T_{\text{obs}}. Next, the observations of groups A and B are pooled, and the difference in sample means is calculated and recorded for every possible way of dividing the pooled values into two groups of sizes n_A and n_B (i.e., for every permutation of the group labels A and B). The set of these calculated differences is the exact distribution of possible differences (for this sample) under the null hypothesis that the group labels are exchangeable (i.e., randomly assigned).

The one-sided p-value of the test is calculated as the proportion of permutations where the difference in means was greater than T_{\text{obs}}. The two-sided p-value of the test is calculated as the proportion of permutations where the absolute difference was greater than |T_{\text{obs}}|. Many implementations of permutation tests require that the observed data itself be counted as one of the permutations, so that the permutation p-value can never be zero.

Alternatively, if the only purpose of the test is to reject or not reject the null hypothesis, one could sort the recorded differences and then observe whether T_{\text{obs}} is contained within the middle (1 - \alpha) \times 100\% of them, for some significance level \alpha. If it is not, we reject the hypothesis of identical probability curves at the \alpha \times 100\% significance level.
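The procedure above can be sketched in a few lines of Python. The function name, the example data, and the choice of the mean difference as the test statistic are illustrative assumptions, not any particular package's API; the enumeration is exact, so it is only practical for small samples.

```python
from itertools import combinations

def exact_perm_test(a, b):
    """Exact two-sided permutation test for a difference in means.

    Pools the two samples, enumerates every way of assigning len(a)
    of the pooled values to "group A", and reports the proportion of
    relabelings whose absolute mean difference is at least as large
    as the observed one.
    """
    pooled = list(a) + list(b)
    n_a = len(a)
    t_obs = sum(a) / n_a - sum(b) / len(b)   # observed statistic T_obs
    hits = total = 0
    for idx in combinations(range(len(pooled)), n_a):
        chosen = set(idx)
        grp_a = [pooled[i] for i in idx]
        grp_b = [pooled[i] for i in range(len(pooled)) if i not in chosen]
        t = sum(grp_a) / n_a - sum(grp_b) / len(grp_b)
        total += 1
        # the identity relabeling is included, so the p-value is never 0
        if abs(t) >= abs(t_obs):
            hits += 1
    return hits / total   # two-sided p-value
```

For example, with a = [1, 2, 3] and b = [4, 5, 6], only the observed split and its mirror image reach |T_obs| = 3 among the C(6, 3) = 20 relabelings, giving a two-sided p-value of 2/20 = 0.1.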


Relation to parametric tests

Permutation tests are a subset of non-parametric statistics. Assuming that our experimental data come from data measured from two treatment groups, the method simply generates the distribution of mean differences under the assumption that the two groups are not distinct in terms of the measured variable. From this, one then uses the observed statistic (T_{\text{obs}} above) to see to what extent this statistic is special, i.e., the likelihood of observing the magnitude of such a value (or larger) if the treatment labels had simply been randomized after treatment. In contrast to permutation tests, the distributions underlying many popular "classical" statistical tests, such as the ''t''-test, ''F''-test, ''z''-test, and ''χ''² test, are obtained from theoretical probability distributions.

Fisher's exact test is an example of a commonly used permutation test for evaluating the association between two dichotomous variables. When sample sizes are very large, the Pearson's chi-square test will give accurate results. For small samples, the chi-square reference distribution cannot be assumed to give a correct description of the probability distribution of the test statistic, and in this situation the use of Fisher's exact test becomes more appropriate.

Permutation tests exist in many situations where parametric tests do not (e.g., when deriving an optimal test when losses are proportional to the size of an error rather than its square). All simple and many relatively complex parametric tests have a corresponding permutation test version that is defined by using the same test statistic as the parametric test, but obtains the p-value from the sample-specific permutation distribution of that statistic rather than from the theoretical distribution derived from the parametric assumption. For example, it is possible in this manner to construct a permutation ''t''-test, a permutation ''χ''² test of association, a permutation version of Aly's test for comparing variances, and so on.

The major drawbacks of permutation tests are that they
* can be computationally intensive and may require "custom" code for difficult-to-calculate statistics, which must be rewritten for every case;
* are primarily used to provide a p-value, and inverting the test to obtain confidence regions/intervals requires even more computation.
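As a concrete illustration of reusing a parametric test statistic, the sketch below builds a permutation ''t''-test in Python: it computes the usual Welch ''t'' statistic, but draws its null distribution from random relabelings of the pooled data rather than from Student's ''t'' distribution. The function name, the choice of the Welch statistic, and the default number of permutations are illustrative assumptions.

```python
import random
import statistics as st

def perm_t_test(a, b, n_perm=999, seed=0):
    """Permutation version of the two-sample t-test.

    Uses the same t statistic as the parametric test, but obtains the
    p-value from the sample-specific permutation distribution of that
    statistic.  (A sketch, not any particular package's implementation.)
    """
    rng = random.Random(seed)

    def t_stat(x, y):
        # Welch t: mean difference divided by its estimated standard error
        se = (st.variance(x) / len(x) + st.variance(y) / len(y)) ** 0.5
        return (st.mean(x) - st.mean(y)) / se

    t_obs = abs(t_stat(a, b))
    pooled = list(a) + list(b)
    hits = 1                      # count the observed labeling itself
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if abs(t_stat(pooled[:len(a)], pooled[len(a):])) >= t_obs:
            hits += 1
    return hits / (n_perm + 1)    # two-sided p-value
```

Only the reference distribution changes: the statistic is the familiar one, so the permutation version inherits its power properties while dropping the normal-theory assumption about its null distribution.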


Advantages

Permutation tests exist for any test statistic, regardless of whether or not its distribution is known. Thus one is always free to choose the statistic which best discriminates between hypothesis and alternative and which minimizes losses. Permutation tests can be used for analyzing unbalanced designs and for combining dependent tests on mixtures of categorical, ordinal, and metric data (Pesarin, 2001). They can also be used to analyze qualitative data that have been quantitized (i.e., turned into numbers). Permutation tests may be ideal for analyzing quantitized data that do not satisfy statistical assumptions underlying traditional parametric tests (e.g., t-tests, ANOVA); see PERMANOVA.

Before the 1980s, the burden of creating the reference distribution was overwhelming except for data sets with small sample sizes. Since the 1980s, the confluence of relatively inexpensive fast computers and the development of new sophisticated path algorithms applicable in special situations has made the application of permutation test methods practical for a wide range of problems. It also initiated the addition of exact-test options in the main statistical software packages and the appearance of specialized software for performing a wide range of uni- and multi-variable exact tests and computing test-based "exact" confidence intervals.


Limitations

An important assumption behind a permutation test is that the observations are exchangeable under the null hypothesis. An important consequence of this assumption is that tests of difference in location (like a permutation t-test) require equal variance under the normality assumption. In this respect, the permutation t-test shares the same weakness as the classical Student's t-test (the Behrens–Fisher problem). A third alternative in this situation is to use a bootstrap-based test. Good (2005) explains the difference between permutation tests and bootstrap tests the following way: "Permutations test hypotheses concerning distributions; bootstraps test hypotheses concerning parameters. As a result, the bootstrap entails less-stringent assumptions." Bootstrap tests are not exact. In some cases, a permutation test based on a properly studentized statistic can be asymptotically exact even when the exchangeability assumption is violated. Bootstrap-based tests can test with the null hypothesis H_0: F \neq G and, therefore, are suited for performing equivalence testing.


Monte Carlo testing

An asymptotically equivalent permutation test can be created when there are too many possible orderings of the data to allow complete enumeration in a convenient manner. This is done by generating the reference distribution by Monte Carlo sampling, which takes a small (relative to the total number of permutations) random sample of the possible replicates. The realization that this could be applied to any permutation test on any dataset was an important breakthrough in the area of applied statistics. The earliest known references to this approach are Eden and Yates (1933) and Dwass (1957). This type of permutation test is known under various names: ''approximate permutation test'', ''Monte Carlo permutation test'', or ''random permutation test''.

After N random permutations, it is possible to obtain a confidence interval for the p-value based on the binomial distribution; see binomial proportion confidence interval. For example, if after N = 10000 random permutations the p-value is estimated to be \widehat{p} = 0.05, then a 99% confidence interval for the true p (the one that would result from trying all possible permutations) is \left[\widehat{p} - z\sqrt{\widehat{p}(1-\widehat{p})/N},\ \widehat{p} + z\sqrt{\widehat{p}(1-\widehat{p})/N}\right] \approx [0.045, 0.055].

On the other hand, the purpose of estimating the p-value is most often to decide whether p \leq \alpha, where \alpha is the threshold at which the null hypothesis will be rejected (typically \alpha = 0.05). In the example above, the confidence interval only tells us that there is roughly a 50% chance that the p-value is smaller than 0.05, i.e. it is completely unclear whether the null hypothesis should be rejected at the level \alpha = 0.05.

If it is only important to know whether p \leq \alpha for a given \alpha, it is logical to continue simulating until the statement p \leq \alpha can be established to be true or false with a very low probability of error. Given a bound \epsilon on the admissible probability of error (the probability of finding that \widehat{p} > \alpha when in fact p \leq \alpha, or vice versa), the question of how many permutations to generate can be seen as the question of when to stop generating permutations, based on the outcomes of the simulations so far, in order to guarantee that the conclusion (which is either p \leq \alpha or p > \alpha) is correct with probability at least 1 - \epsilon; \epsilon will typically be chosen to be extremely small, e.g. 1/1000. Stopping rules to achieve this have been developed and can be incorporated at minimal additional computational cost. In fact, depending on the true underlying p-value, it will often be found that the number of simulations required is remarkably small (e.g. as low as 5 and often not larger than 100) before a decision can be reached with virtual certainty.
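A Monte Carlo permutation test, together with the binomial confidence interval for the estimated p-value, can be sketched as follows. The function name and defaults are illustrative assumptions; z = 2.576 gives the 99% interval used in the example above.

```python
import math
import random

def monte_carlo_perm_test(a, b, n_perm=10000, z=2.576, seed=0):
    """Monte Carlo (random permutation) test for a difference in means.

    Returns the estimated two-sided p-value together with a
    normal-approximation binomial confidence interval for the true
    p-value (the one complete enumeration would give).
    """
    rng = random.Random(seed)
    pooled = list(a) + list(b)
    n_a = len(a)
    t_obs = abs(sum(a) / n_a - sum(b) / len(b))
    hits = 1                          # include the observed labeling
    for _ in range(n_perm):
        rng.shuffle(pooled)
        t = abs(sum(pooled[:n_a]) / n_a - sum(pooled[n_a:]) / len(b))
        if t >= t_obs:
            hits += 1
    p_hat = hits / (n_perm + 1)
    half = z * math.sqrt(p_hat * (1 - p_hat) / n_perm)
    return p_hat, (max(0.0, p_hat - half), min(1.0, p_hat + half))
```

If the interval straddles the chosen \alpha, more permutations (or a sequential stopping rule, as described above) are needed before accepting or rejecting the null hypothesis.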


Example tests

* Permutational multivariate analysis of variance (PERMANOVA)


Literature

Original references:
* Fisher, R. A. (1935) ''The Design of Experiments'', New York: Hafner.
* Pitman, E. J. G. (1937) "Significance tests which may be applied to samples from any population", ''Royal Statistical Society Supplement'', 4: 119–130 and 225–232 (parts I and II).

Modern references:
* Edgington, E. S. (1995) ''Randomization Tests'', 3rd ed. New York: Marcel Dekker.
* Good, Phillip I. (2005) ''Permutation, Parametric and Bootstrap Tests of Hypotheses'', 3rd ed., Springer.
* Lunneborg, Cliff (1999) ''Data Analysis by Resampling'', Duxbury Press.
* Pesarin, F. (2001) ''Multivariate Permutation Tests: With Applications in Biostatistics'', John Wiley & Sons.


Current research on permutation tests

* Good, P. I. (2012) ''Practitioner's Guide to Resampling Methods''.
* Good, P. I. (2005) ''Permutation, Parametric, and Bootstrap Tests of Hypotheses''.
* Hesterberg, T. C., D. S. Moore, S. Monaghan, A. Clipson, and R. Epstein (2005) "Bootstrap Methods and Permutation Tests".
* Moore, D. S., G. McCabe, W. Duckworth, and S. Sclove (2003) "Bootstrap Methods and Permutation Tests".
* Simon, J. L. (1997) ''Resampling: The New Statistics''.
* Yu, Chong Ho (2003) "Resampling methods: concepts, applications, and justification", ''Practical Assessment, Research & Evaluation'', 8(19).

