Equivalence Test
   HOME

TheInfoList



OR:

Equivalence tests are a variety of
hypothesis test A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis. Hypothesis testing allows us to make probabilistic statements about population parameters. ...
s used to draw
statistical Statistics (from German: ''Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, industria ...
inferences from observed data. In these tests, the
null hypothesis In scientific research, the null hypothesis (often denoted ''H''0) is the claim that no difference or relationship exists between two sets of data or variables being analyzed. The null hypothesis is that any experimentally observed difference is d ...
is defined as an effect large enough to be deemed interesting, specified by an equivalence bound. The alternative hypothesis is any effect that is less extreme than said equivalence bound. The observed data are statistically compared against the equivalence bounds. If the statistical test indicates the observed data is surprising, assuming that true effects are at least as extreme as the equivalence bounds, a Neyman-Pearson approach to statistical inferences can be used to reject effect sizes larger than the equivalence bounds with a pre-specified
Type 1 error In statistical hypothesis testing, a type I error is the mistaken rejection of an actually true null hypothesis (also known as a "false positive" finding or conclusion; example: "an innocent person is convicted"), while a type II error is the fa ...
rate.   Equivalence testing originates from the field of
clinical trials Clinical trials are prospective biomedical or behavioral research studies on human participants designed to answer specific questions about biomedical or behavioral interventions, including new treatments (such as novel vaccines, drugs, dietar ...
. One application, known as a non-inferiority trial, is used to show that a new drug that is cheaper than available alternatives works as well as an existing drug. In essence, equivalence tests consist of calculating a
confidence interval In frequentist statistics, a confidence interval (CI) is a range of estimates for an unknown parameter. A confidence interval is computed at a designated ''confidence level''; the 95% confidence level is most common, but other levels, such as 9 ...
around an observed
effect size In statistics, an effect size is a value measuring the strength of the relationship between two variables in a population, or a sample-based estimate of that quantity. It can refer to the value of a statistic calculated from a sample of data, the ...
and rejecting effects more extreme than the equivalence bound when the confidence interval does not overlap with the equivalence bound. In two-sided tests, both upper and lower equivalence bounds are specified. In non-inferiority trials, where the goal is to test the hypothesis that a new treatment is not worse than existing treatments, only a lower equivalence bound is specified.    Equivalence tests can be performed in addition to null-hypothesis significance tests. This might prevent common misinterpretations of p-values larger than the alpha level as support for the absence of a true effect. Furthermore, equivalence tests can identify effects that are statistically significant but practically insignificant, whenever effects are statistically different from zero, but also statistically smaller than any effect size deemed worthwhile (see the first figure). Equivalence tests were originally used in areas such as pharmaceutics, frequently in bioequivalence trials. However, these tests can be applied to any instance where the research question asks whether the means of two sets of scores are practically or theoretically equivalent. As such, equivalence analyses have seen increased usage in almost all medical research fields. Additionally, the field of psychology has been adopting the use of equivalence testing, particularly in clinical trials. This is not to say, however, that equivalence analyses should be limited to clinical trials, and the application of these tests can occur in a range of research areas. In this regard, equivalence tests have recently been introduced in evaluation of measurement devices, artificial intelligence as well as exercise physiology and sports science. Several tests exist for equivalence analyses; however, more recently the two-one-sided t-tests (TOST) procedure has been garnering considerable attention. As outlined below, this approach is an adaptation of the widely known t-test.  


TOST procedure

A very simple equivalence testing approach is the ‘two one-sided t-tests’ (TOST) procedure. In the TOST procedure an upper (ΔU) and lower (–ΔL) equivalence bound is specified based on the smallest effect size of interest (e.g., a positive or negative difference of d = 0.3). Two composite null hypotheses are tested: H01: Δ ≤ –ΔL and H02: Δ ≥ ΔU. When both these one-sided tests can be statistically rejected, we can conclude that –ΔL < Δ < ΔU, or that the observed effect falls within the equivalence bounds and is statistically smaller than any effect deemed worthwhile and considered practically equivalent". Alternatives to the TOST procedure have been developed as well. A recent modification to TOST makes the approach feasible in cases of repeated measures and assessing multiple variables.


Comparison between t-test and equivalence test

The equivalence test can be ''induced'' from the
t-test A ''t''-test is any statistical hypothesis testing, statistical hypothesis test in which the test statistic follows a Student's t-distribution, Student's ''t''-distribution under the null hypothesis. It is most commonly applied when the test stati ...
. Consider a t-test at the significance level αt-test with a
power Power most often refers to: * Power (physics), meaning "rate of doing work" ** Engine power, the power put out by an engine ** Electric power * Power (social and political), the ability to influence people or events ** Abusive power Power may a ...
of 1-βt-test for a relevant effect size dr. If Δ=dr as well as αequiv.-testt-test and βequiv.-testt-test coincide, i.e. the error types (type I and type II) are interchanged between the t-test and the equivalence test, then the t-test will obtain the same results as the equivalence test. To achieve this for the t-test, either the sample size calculation needs to be carried out correctly, or the t-test significance level αt-test needs to be adjusted, referred to as the so-called ''revised t-test''. Both approaches have difficulties in practice since sample size planning relies on unverifiable assumptions of the standard deviation, and the revised t-test yields numerical problems. Preserving the test behavior, those limitations can be removed by using an equivalence test.   The figure below allows a visual comparison of the equivalence test and the t-test when the sample size calculation is affected by differences between the a priori standard deviation \sigma and the sample's standard deviation \widehat, which is a common problem. Using an equivalence test instead of a t-test additionally ensures that αequiv.-test is bounded, which the t-test does not do in case that \widehat > \sigma with the type II error growing arbitrary large. On the other hand, having \widehat < \sigma results in the t-test being stricter than the dr specified in the planning, which may randomly penalize the sample source (e.g., a device manufacturer). This makes the equivalence test safer to use.


See also

*
Bootstrap (statistics) Bootstrapping is any test or metric that uses random sampling with replacement (e.g. mimicking the sampling process), and falls under the broader class of resampling methods. Bootstrapping assigns measures of accuracy (bias, variance, confidenc ...
-based testing


Literature

*


References

{{Reflist Statistical hypothesis testing Equivalence (mathematics)