A **statistical hypothesis** is a hypothesis that is testable on the basis of observed data modeled as the realised values taken by a collection of random variables.^{[1]} A set of data (or several sets of data, taken together) are modelled as being realised values of a collection of random variables having a joint probability distribution in some set of possible joint distributions. The hypothesis being tested is exactly that set of possible probability distributions. A **statistical hypothesis test** is a method of statistical inference. An alternative hypothesis is proposed for the probability distribution of the data, either explicitly or only informally. The comparison of the two models is deemed *statistically significant* if, according to a threshold probability -- the significance level -- the data is very unlikely to have occurred under the null hypothesis. A hypothesis test specifies which outcomes of a study may lead to a rejection of the null hypothesis at a pre-specified level of significance, while using a pre-chosen measure of deviation from that hypothesis (the test statistic, or goodness-of-fit measure). The pre-chosen level of significance is the maximal allowed "false positive rate". One wants to control the risk of incorrectly rejecting a true null hypothesis.

The process of distinguishing between the null hypothesis and the alternative hypothesis is aided by considering two conceptual types of errors. The first type of error occurs when the null hypothesis is wrongly rejected. The second type of error occurs when the null hypothesis is wrongly not rejected. (The two types are known as type 1 and type 2 errors.)

Hypothesis tests based on statistical significance are another way of expressing confidence intervals (more precisely, confidence sets). In other words, every hypothesis test based on significance can be obtained via a confidence interval, and every confidence interval can be obtained via a hypothesis test based on significance.^{[2]}

Significance-based hypothesis testing is the most common framework for statistical hypothesis testing. An alternative framework for statistical hypothesis testing is to specify a set of statistical models, one for each candidate hypothesis, and then use model selection techniques to choose the most appropriate model.^{[3]} The most common selection techniques are based on either Akaike information criterion or Bayes factor. However, this is not really an "alternative framework", though one can call it a more complex framework. It is a situation in which one likes to distinguish between many possible hypotheses, not just two. Alternatively, one can see it as a hybrid between testing and estimation, where one of the parameters is discrete, and specifies which of a hierarchy of more and more complex models is correct.

- Null hypothesis significance testing* is the name for a version of hypothesis testing with no explicit mention of possible alternatives, and not much consideration of error rates. It was championed by Ronald Fisher in a context in which he downplayed any explicit choice of alternative hypothesis and consequently paid no attention to the power of a test. One simply set up a null hypothesis as a kind of straw man, or more kindly, as a formalisation of a standard, establishment, default idea of how things were. One tried to overthrow this conventional view by showing that it led to the conclusion that something extremely unlikely had happened, thereby discrediting the theory.

A unifying position of critics is that statistics should not lead to an accept-reject conclusion or decision, but to an estimated value with an interval estimate; this data-analysis philosophy is broadly referred to as estimation statistics. Estimation statistics can be accomplished with either frequentist [1] or Bayesian methods.^{[73]}

One strong critic of significance testing suggested a list of reporting alternatives:^{[74]} effect sizes for importance, prediction intervals for confidence, replications and extensions for replicability, meta-analyses for generality. None of these suggested alternatives produces a conclusion/decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals. "The distinction between the ... approaches is largely one of reporting and interpretation."^{[75]}

On one "alternative" there is no disagreement: Fisher himself said,^{[26]} "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." Cohen, an influential critic of significance testing, concurred,^{Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review,[69] medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias[70] and a journal (Journal of Articles in Support of the Null Hypothesis) has been created to publish such results exclusively.[71] Textbooks have added some cautions[72] and increased coverage of the tools necessary to estimate the size of the sample required to produce significant results. Major organizations have not abandoned use of significance tests although some have discussed doing so.[69]
}

A unifying position of critics is that statistics should not lead to an accept-reject conclusion or decision, but to an estimated value with an interval estimate; this data-analysis philosophy is broadly referred to as estimation statistics. Estimation statistics can be accomplished with either frequentist [1] or Bayesian methods.^{[73]}

One strong critic of significance testing suggested a list of reporting alternatives:^{[74]} effect sizes for importance, prediction intervals for confidence, replications and extensions for replicability, meta-analyses for generality. None of these suggested alternatives produces a conclusion/decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals. "The distinction between the ... approaches is largely one of reporting and interpretation."^{[75]One strong critic of significance testing suggested a list of reporting alternatives:[74]} effect sizes for importance, prediction intervals for confidence, replications and extensions for replicability, meta-analyses for generality. None of these suggested alternatives produces a conclusion/decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals. "The distinction between the ... approaches is largely one of reporting and interpretation."^{[75]}

On one "alternative" there is no disagreement: Fisher himself said,^{[26]} "In relation to the test of significance, we may say that a phenomenon is experimentally demonstrable when we know how to conduct an experiment which will rarely fail to give us a statistically significant result." Cohen, an influential critic of significance testing, concurred,^{[66]} "... don't look for a magic alternative to NHST *[null hypothesis significance testing]* ... It doesn't exist." "... given the problems of statistical induction, we must finally rely, as have the older sciences, on replication." The "alternative" to significance testing is repeated testing. The easiest way to decrease statistical uncertainty is by obtaining more data, whether by increased sample size or by repeated tests. Nickerson claimed to have never seen the publication of a literally replicated experiment in psychology.^{[67]} An indirect approach to replication is meta-analysis.

Bayesian inference is one proposed alternative to significance testing. (Nickerson cited 10 sources suggesting it, including Rozeboom (1960)).^{[67]} For example, Bayesian parameter estimation can provide rich information about the data from which researchers can draw inferences, while using uncertain priors that exert only minimal influence on the results when enough data is available. Psychologist John K. Kruschke has suggested Bayesian estimation as an alternative for the *t*-test.^{[76]} Alternatively two competing models/hypothesis can be compared using Bayes factors.^{[77]} Bayesian methods could be criticized for requiring information that is seldom available in the cases where significance testing is most heavily used. Neither the prior probabilities nor the probability distribution of the test statistic under the alternative hypothesis are often available in the social sciences.^{[67]}

Advocates of a Bayesian approach sometimes claim that the goal of a researcher is most often to objectively assess the probability that a hypothesis is true based on the data they have collected.^{[78]}^{[79]} Neither Fisher's significance testing, nor Neyman–Pearson hypothesis testing can provide this information, and do not claim to. The probability a hypothesis is true can only be derived from use of Bayes' Theorem, which was unsatisfactory to both the Fisher and Neyman–Pearson camps due to the explicit use of subjectivity in the form of the prior probability.^{[35]}^{[80]} Fisher's strategy is to sidestep this with the *p*-value (an objective *index* based on the data alone) followed by *inductive inference*, while Neyman–Pearson devised their approach of *inductive behaviour*.

Hypothesis testing and philosophy intersect. Inferential statistics, which includes hypothesis testing, is applied probability. Both probability and its application are intertwined with philosophy. Philosopher David Hume wrote, "All knowledge degenerates into probability." Competing practical definitions of probability reflect philosophical differences. The most common application of hypothesis testing is in the scientific interpretation of experimental data, which is naturally studied by the philosophy of science.

Fisher and Neyman opposed the subjectivity of probability. Their views contributed to the objective definitions. The core of their historical disagreement was philosophical.

Many of the philosophical criticisms of hypothesis testing are discussed by statisticians in other contexts, particularly correlation does not imply causation and the design of experiments.
Hypothesis testing is of continuing interest to philosophers.^{[39]}^{[81]}

Statistics is increasingly being taught in schools with hypothesis testing being one of the elements taught.^{[82]}^{[83]} Many conclusions reported in the popular press (political opinion polls to medical studies) are based on statistics. Some writers have stated that statistical analysis of this kind allows for thinking clearly about problems involving mass data, as well as the effective reporting of trends and inferences from said data, but caution that writers for a broad public should have a solid understanding of the field in order to use the terms and concepts correctly.^{[84]}^{[85]}^{[citation needed]}^{[84]}^{[85]}^{[citation needed]} An introductory college statistics class places much emphasis on hypothesis testing – perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis (see the Bible Analyzer). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like *z*, Student's *t*, *F* and chi-squared). Statistical hypothesis testing is considered a mature area within statistics,^{[75]} but a limited amount of development continues.

An academic study states that the cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as received unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors.^{[86]} While the problem was addressed more

An academic study states that the cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as received unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors.^{[86]} While the problem was addressed more than a decade ago,^{[87]} and calls for educational reform continue,^{[88]} students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing.^{[89]} Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics and emphasizing the controversy in a generally dry subject.^{[90]}