History
While hypothesis testing was popularized early in the 20th century, early forms were used in the 1700s. The first use is credited to John Arbuthnot (1710), followed by Pierre-Simon Laplace (1770s), in analyzing the human sex ratio at birth; see .Choice of null hypothesis
Paul Meehl has argued that the epistemological importance of the choice of null hypothesis has gone largely unacknowledged. When the null hypothesis is predicted by theory, a more precise experiment will be a more severe test of the underlying theory. When the null hypothesis defaults to "no difference" or "no effect", a more precise experiment is a less severe test of the theory that motivated performing the experiment. An examination of the origins of the latter practice may therefore be useful: 1778: Pierre Laplace compares the birthrates of boys and girls in multiple European cities. He states: "it is natural to conclude that these possibilities are very nearly in the same ratio". Thus, the null hypothesis in this case that the birthrates of boys and girls should be equal given "conventional wisdom". 1900: Karl Pearson develops the chi squared test to determine "whether a given form of frequency curve will effectively describe the samples drawn from a given population." Thus the null hypothesis is that a population is described by some distribution predicted by theory. He uses as an example the numbers of five and sixes in the Weldon dice throw data. 1904: Karl Pearson develops the concept of " contingency" in order to determine whether outcomes are independent of a given categorical factor. Here the null hypothesis is by default that two things are unrelated (e.g. scar formation and death rates from smallpox). The null hypothesis in this case is no longer predicted by theory or conventional wisdom, but is instead the principle of indifference that led Fisher and others to dismiss the use of "inverse probabilities".Modern origins and early controversy
Modern significance testing is largely the product of Karl Pearson ( ''p''-value, Pearson's chi-squared test), William Sealy Gosset ( Student's t-distribution), andNull hypothesis significance testing (NHST)
The modern version of hypothesis testing is generally called the null hypothesis significance testing (NHST) and is a hybrid of the Fisher approach with the Neyman-Pearson approach. In 2000, Raymond S. Nickerson wrote an article stating that NHST was (at the time) "arguably the most widely used method of analysis of data collected in psychological experiments and has been so for about 70 years" and that it was at the same time "very controversial". This fusion resulted from confusion by writers of statistical textbooks (as predicted by Fisher) beginning in the 1940s (but signal detection, for example, still uses the Neyman/Pearson formulation). Great conceptual differences and many caveats in addition to those mentioned above were ignored. Neyman and Pearson provided the stronger terminology, the more rigorous mathematics and the more consistent philosophy, but the subject taught today in introductory statistics has more similarities with Fisher's method than theirs. Sometime around 1940, authors of statistical text books began combining the two approaches by using the ''p''-value in place of the test statistic (or data) to test against the Neyman–Pearson "significance level".Philosophy
Hypothesis testing and philosophy intersect. Inferential statistics, which includes hypothesis testing, is applied probability. Both probability and its application are intertwined with philosophy. Philosopher David Hume wrote, "All knowledge degenerates into probability." Competing practical definitions of probability reflect philosophical differences. The most common application of hypothesis testing is in the scientific interpretation of experimental data, which is naturally studied by theEducation
Statistics is increasingly being taught in schools with hypothesis testing being one of the elements taught. Many conclusions reported in the popular press (political opinion polls to medical studies) are based on statistics. Some writers have stated that statistical analysis of this kind allows for thinking clearly about problems involving mass data, as well as the effective reporting of trends and inferences from said data, but caution that writers for a broad public should have a solid understanding of the field in order to use the terms and concepts correctly.'Statistical methods and statistical terms are necessary in reporting the mass data of social and economic trends, business conditions, "opinion" polls, the census. But without writers who use the words with honesty and readers who know what they mean, the result can only be semantic nonsense.' "...the basic ideas in statistics assist us in thinking clearly about the problem, provide some guidance about the conditions that must be satisfied if sound inferences are to be made, and enable us to detect many inferences that have no good logical foundation." An introductory college statistics class places much emphasis on hypothesis testing – perhaps half of the course. Such fields as literature and divinity now include findings based on statistical analysis (see the Bible Analyzer). An introductory statistics class teaches hypothesis testing as a cookbook process. Hypothesis testing is also taught at the postgraduate level. Statisticians learn how to create good statistical test procedures (like ''z'', Student's ''t'', ''F'' and chi-squared). Statistical hypothesis testing is considered a mature area within statistics, but a limited amount of development continues. An academic study states that the cookbook method of teaching introductory statistics leaves no time for history, philosophy or controversy. Hypothesis testing has been taught as received unified method. Surveys showed that graduates of the class were filled with philosophical misconceptions (on all aspects of statistical inference) that persisted among instructors. While the problem was addressed more than a decade ago, and calls for educational reform continue, students still graduate from statistics classes holding fundamental misconceptions about hypothesis testing. Ideas for improving the teaching of hypothesis testing include encouraging students to search for statistical errors in published papers, teaching the history of statistics and emphasizing the controversy in a generally dry subject. Raymond S. Nickerson commented:Performing a frequentist hypothesis test in practice
The typical steps involved in performing a frequentist hypothesis test in practice are: # Define a hypothesis (claim which is testable using data). # Select a relevant statistical test with associated test statistic T. # Derive the distribution of the test statistic under the null hypothesis from the assumptions. In standard cases this will be a well-known result. For example, the test statistic might follow a Student's t distribution with known degrees of freedom, or a normal distribution with known mean and variance. # Select a significance level (''α''), the maximum acceptable false positive rate. Common values are 5% and 1%. # Compute from the observations the observed value tobs of the test statistic T. # Decide to either reject the null hypothesis in favor of the alternative or not reject it. The Neyman-Pearson decision rule is to reject the null hypothesis H0 if the observed value tobs is in the critical region, and not to reject the null hypothesis otherwise.Practical example
The difference in the two processes applied to the radioactive suitcase example (below): * "The Geiger-counter reading is 10. The limit is 9. Check the suitcase." * "The Geiger-counter reading is high; 97% of safe suitcases have lower readings. The limit is 95%. Check the suitcase." The former report is adequate, the latter gives a more detailed explanation of the data and the reason why the suitcase is being checked. Not rejecting the null hypothesis does not mean the null hypothesis is "accepted" per se (though Neyman and Pearson used that word in their original writings; see the Interpretation section). The processes described here are perfectly adequate for computation. They seriously neglect the design of experiments considerations. It is particularly critical that appropriate sample sizes be estimated before conducting the experiment. The phrase "test of significance" was coined by statisticianInterpretation
When the null hypothesis is true and statistical assumptions are met, the probability that the p-value will be less than or equal to the significance level is at most . This ensures that the hypothesis test maintains its specified false positive rate (provided that statistical assumptions are met). The ''p''-value is the probability that a test statistic which is at least as extreme as the one obtained would occur under the null hypothesis. At a significance level of 0.05, a fair coin would be expected to (incorrectly) reject the null hypothesis (that it is fair) in 1 out of 20 tests on average. The ''p''-value does not provide the probability that either the null hypothesis or its opposite is correct (a common source of confusion). If the ''p''-value is less than the chosen significance threshold (equivalently, if the observed test statistic is in the critical region), then we say the null hypothesis is rejected at the chosen level of significance. If the ''p''-value is ''not'' less than the chosen significance threshold (equivalently, if the observed test statistic is outside the critical region), then the null hypothesis is not rejected at the chosen level of significance. In the "lady tasting tea" example (below), Fisher required the lady to properly categorize all of the cups of tea to justify the conclusion that the result was unlikely to result from chance. His test revealed that if the lady was effectively guessing at random (the null hypothesis), there was a 1.4% chance that the observed results (perfectly ordered tea) would occur.Use and importance
Statistics are helpful in analyzing most collections of data. This is equally true of hypothesis testing which can justify conclusions even when no scientific theory exists. In the Lady tasting tea example, it was "obvious" that no difference existed between (milk poured into tea) and (tea poured into milk). The data contradicted the "obvious". Real world applications of hypothesis testing include: * Testing whether more men than women suffer from nightmares * Establishing authorship of documents * Evaluating the effect of the full moon on behavior * Determining the range at which a bat can detect an insect by echo * Deciding whether hospital carpeting results in more infections * Selecting the best means to stop smoking * Checking whether bumper stickers reflect car owner behavior * Testing the claims of handwriting analysts Statistical hypothesis testing plays an important role in the whole of statistics and in statistical inference. For example, Lehmann (1992) in a review of the fundamental paper by Neyman and Pearson (1933) says: "Nevertheless, despite their shortcomings, the new paradigm formulated in the 1933 paper, and the many developments carried out within its framework continue to play a central role in both the theory and practice of statistics and can be expected to do so in the foreseeable future". Significance testing has been the favored statistical tool in some experimental social sciences (over 90% of articles in the ''Journal of Applied Psychology'' during the early 1990s). Other fields have favored the estimation of parameters (e.g. effect size). Significance testing is used as a substitute for the traditional comparison of predicted value and experimental result at the core of theCautions
"If the government required statistical procedures to carry warning labels like those on drugs, most inference methods would have long labels indeed." This caution applies to hypothesis tests and alternatives to them. The successful hypothesis test is associated with a probability and a type-I error rate. The conclusion ''might'' be wrong. The conclusion of the test is only as solid as the sample upon which it is based. The design of the experiment is critical. A number of unexpected effects have been observed including: * The clever Hans effect. A horse appeared to be capable of doing simple arithmetic. * The Hawthorne effect. Industrial workers were more productive in better illumination, and most productive in worse. * The placebo effect. Pills with no medically active ingredients were remarkably effective. A statistical analysis of misleading data produces misleading conclusions. The issue of data quality can be more subtle. In forecasting for example, there is no agreement on a measure of forecast accuracy. In the absence of a consensus measurement, no decision based on measurements will be without controversy. Publication bias: Statistically nonsignificant results may be less likely to be published, which can bias the literature. Multiple testing: When multiple true null hypothesis tests are conducted at once without adjustment, the overall probability of Type I error is higher than the nominal alpha level. Those making critical decisions based on the results of a hypothesis test are prudent to look at the details rather than the conclusion alone. In the physical sciences most results are fully accepted only when independently confirmed. The general advice concerning statistics is, "Figures never lie, but liars figure" (anonymous).Definition of terms
The following definitions are mainly based on the exposition in the book by Lehmann and Romano: *Statistical hypothesis: A statement about the parameters describing aNonparametric bootstrap hypothesis testing
Bootstrap-based resampling methods can be used for null hypothesis testing. A bootstrap creates numerous simulated samples by randomly resampling (with replacement) the original, combined sample data, assuming the null hypothesis is correct. The bootstrap is very versatile as it is distribution-free and it does not rely on restrictive parametric assumptions, but rather on empirical approximate methods with asymptotic guarantees. Traditional parametric hypothesis tests are more computationally efficient but make stronger structural assumptions. In situations where computing the probability of the test statistic under the null hypothesis is hard or impossible (due to perhaps inconvenience or lack of knowledge of the underlying distribution), the bootstrap offers a viable method for statistical inference.Examples
Human sex ratio
The earliest use of statistical hypothesis testing is generally credited to the question of whether male and female births are equally likely (null hypothesis), which was addressed in the 1700s by John Arbuthnot (1710), and later by Pierre-Simon Laplace (1770s). Arbuthnot examined birth records in London for each of the 82 years from 1629 to 1710, and applied the sign test, a simple non-parametric test. In every year, the number of males born in London exceeded the number of females. Considering more male or more female births as equally likely, the probability of the observed outcome is 0.582, or about 1 in 4,836,000,000,000,000,000,000,000; in modern terms, this is the ''p''-value. Arbuthnot concluded that this is too small to be due to chance and must instead be due to divine providence: "From whence it follows, that it is Art, not Chance, that governs." In modern terms, he rejected the null hypothesis of equally likely male and female births at the ''p'' = 1/282 significance level. Laplace considered the statistics of almost half a million births. The statistics showed an excess of boys compared to girls. Reprinted in English translation: He concluded by calculation of a ''p''-value that the excess was a real, but unexplained, effect.Lady tasting tea
In a famous example of hypothesis testing, known as the ''Lady tasting tea'', Originally from Fisher's book ''Design of Experiments''. Dr. Muriel Bristol, a colleague of Fisher, claimed to be able to tell whether the tea or the milk was added first to a cup. Fisher proposed to give her eight cups, four of each variety, in random order. One could then ask what the probability was for her getting the number she got correct, but just by chance. The null hypothesis was that the Lady had no such ability. The test statistic was a simple count of the number of successes in selecting the 4 cups. The critical region was the single case of 4 successes of 4 possible based on a conventional probability criterion (< 5%). A pattern of 4 successes corresponds to 1 out of 70 possible combinations (p≈ 1.4%). Fisher asserted that no alternative hypothesis was (ever) required. The lady correctly identified every cup, which would be considered a statistically significant result.Courtroom trial
A statistical test procedure is comparable to a criminal trial; a defendant is considered not guilty as long as his or her guilt is not proven. The prosecutor tries to prove the guilt of the defendant. Only when there is enough evidence for the prosecution is the defendant convicted. In the start of the procedure, there are two hypotheses : "the defendant is not guilty", and : "the defendant is guilty". The first one, , is called the '' null hypothesis''. The second one, , is called the ''alternative hypothesis''. It is the alternative hypothesis that one hopes to support. The hypothesis of innocence is rejected only when an error is very unlikely, because one does not want to convict an innocent defendant. Such an error is called '' error of the first kind'' (i.e., the conviction of an innocent person), and the occurrence of this error is controlled to be rare. As a consequence of this asymmetric behaviour, an '' error of the second kind'' (acquitting a person who committed the crime), is more common. A criminal trial can be regarded as either or both of two decision processes: guilty vs not guilty or evidence vs a threshold ("beyond a reasonable doubt"). In one view, the defendant is judged; in the other view the performance of the prosecution (which bears the burden of proof) is judged. A hypothesis test can be regarded as either a judgment of a hypothesis or as a judgment of evidence.Clairvoyant card game
A person (the subject) is tested for clairvoyance. They are shown the back face of a randomly chosen playing card 25 times and asked which of the four suits it belongs to. The number of hits, or correct answers, is called ''X''. As we try to find evidence of their clairvoyance, for the time being the null hypothesis is that the person is not clairvoyant. The alternative is: the person is (more or less) clairvoyant. If the null hypothesis is valid, the only thing the test person can do is guess. For every card, the probability (relative frequency) of any single suit appearing is 1/4. If the alternative is valid, the test subject will predict the suit correctly with probability greater than 1/4. We will call the probability of guessing correctly ''p''. The hypotheses, then, are: * null hypothesis (just guessing) and * alternative hypothesis (true clairvoyant). When the test subject correctly predicts all 25 cards, we will consider them clairvoyant, and reject the null hypothesis. Thus also with 24 or 23 hits. With only 5 or 6 hits, on the other hand, there is no cause to consider them so. But what about 12 hits, or 17 hits? What is the critical number, ''c'', of hits, at which point we consider the subject to be clairvoyant? How do we determine the critical value ''c''? With the choice ''c''=25 (i.e. we only accept clairvoyance when all cards are predicted correctly) we're more critical than with ''c''=10. In the first case almost no test subjects will be recognized to be clairvoyant, in the second case, a certain number will pass the test. In practice, one decides how critical one will be. That is, one decides how often one accepts an error of the first kind – a false positive, or Type I error. With ''c'' = 25 the probability of such an error is: : and hence, very small. The probability of a false positive is the probability of randomly guessing correctly all 25 times. Being less critical, with ''c'' = 10, gives: : Thus, ''c'' = 10 yields a much greater probability of false positive. Before the test is actually performed, the maximum acceptable probability of a Type I error (''α'') is determined. Typically, values in the range of 1% to 5% are selected. (If the maximum acceptable error rate is zero, an infinite number of correct guesses is required.) Depending on this Type 1 error rate, the critical value ''c'' is calculated. For example, if we select an error rate of 1%, ''c'' is calculated thus: : From all the numbers c, with this property, we choose the smallest, in order to minimize the probability of a Type II error, a false negative. For the above example, we select: .Variations and sub-classes
Statistical hypothesis testing is a key technique of both frequentist inference and Bayesian inference, although the two types of inference have notable differences. Statistical hypothesis tests define a procedure that controls (fixes) the probability of incorrectly ''deciding'' that a default position ( null hypothesis) is incorrect. The procedure is based on how likely it would be for a set of observations to occur if the null hypothesis were true. This probability of making an incorrect decision is ''not'' the probability that the null hypothesis is true, nor whether any specific alternative hypothesis is true. This contrasts with other possible techniques of decision theory in which the null and alternative hypothesis are treated on a more equal basis. One naïve Bayesian approach to hypothesis testing is to base decisions on the posterior probability, but this fails when comparing point and continuous hypotheses. Other approaches to decision making, such as Bayesian decision theory, attempt to balance the consequences of incorrect decisions across all possibilities, rather than concentrating on a single null hypothesis. A number of other approaches to reaching a decision based on data are available via decision theory and optimal decisions, some of which have desirable properties. Hypothesis testing, though, is a dominant approach to data analysis in many fields of science. Extensions to the theory of hypothesis testing include the study of the power of tests, i.e. the probability of correctly rejecting the null hypothesis given that it is false. Such considerations can be used for the purpose of sample size determination prior to the collection of data.Neyman–Pearson hypothesis testing
An example of Neyman–Pearson hypothesis testing (or null hypothesis statistical significance testing) can be made by a change to the radioactive suitcase example. If the "suitcase" is actually a shielded container for the transportation of radioactive material, then a test might be used to select among three hypotheses: no radioactive source present, one present, two (all) present. The test could be required for safety, with actions required in each case. The Neyman–Pearson lemma of hypothesis testing says that a good criterion for the selection of hypotheses is the ratio of their probabilities (a likelihood ratio). A simple method of solution is to select the hypothesis with the highest probability for the Geiger counts observed. The typical result matches intuition: few counts imply no source, many counts imply two sources and intermediate counts imply one source. Notice also that usually there are problems for proving a negative. Null hypotheses should be at least falsifiable. Neyman–Pearson theory can accommodate both prior probabilities and the costs of actions resulting from decisions.Section 8.2 The former allows each test to consider the results of earlier tests (unlike Fisher's significance tests). The latter allows the consideration of economic issues (for example) as well as probabilities. A likelihood ratio remains a good criterion for selecting among hypotheses. The two forms of hypothesis testing are based on different problem formulations. The original test is analogous to a true/false question; the Neyman–Pearson test is more like multiple choice. In the view of Tukey the former produces a conclusion on the basis of only strong evidence while the latter produces a decision on the basis of available evidence. While the two tests seem quite different both mathematically and philosophically, later developments lead to the opposite claim. Consider many tiny radioactive sources. The hypotheses become 0,1,2,3... grains of radioactive sand. There is little distinction between none or some radiation (Fisher) and 0 grains of radioactive sand versus all of the alternatives (Neyman–Pearson). The major Neyman–Pearson paper of 1933 also considered composite hypotheses (ones whose distribution includes an unknown parameter). An example proved the optimality of the (Student's) ''t''-test, "there can be no better test for the hypothesis under consideration" (p 321). Neyman–Pearson theory was proving the optimality of Fisherian methods from its inception. Fisher's significance testing has proven a popular flexible statistical tool in application with little mathematical growth potential. Neyman–Pearson hypothesis testing is claimed as a pillar of mathematical statistics, creating a new paradigm for the field. It also stimulated new applications in statistical process control, detection theory, decision theory andCriticism
Criticism of statistical hypothesis testing fills volumes. Much of the criticism can be summarized by the following issues: * The interpretation of a ''p''-value is dependent upon stopping rule and definition of multiple comparison. The former often changes during the course of a study and the latter is unavoidably ambiguous. (i.e. "p values depend on both the (data) observed and on the other possible (data) that might have been observed but weren't"). * Confusion resulting (in part) from combining the methods of Fisher and Neyman–Pearson which are conceptually distinct. "Until we go through the accounts of testing hypotheses, separating eyman–Pearsondecision elements from isherconclusion elements, the intimate mixture of disparate elements will be a continual source of confusion." ... "There is a place for both "doing one's best" and "saying only what is certain," but it is important to know, in each instance, both which one is being done, and which one ought to be done." * Emphasis on statistical significance to the exclusion of estimation and confirmation by repeated experiments. * Rigidly requiring statistical significance as a criterion for publication, resulting in publication bias. Most of the criticism is indirect. Rather than being wrong, statistical hypothesis testing is misunderstood, overused and misused. * When used to detect whether a difference exists between groups, a paradox arises. As improvements are made to experimental design (e.g. increased precision of measurement and sample size), the test becomes more lenient. Unless one accepts the absurd assumption that all sources of noise in the data cancel out completely, the chance of finding statistical significance in either direction approaches 100%. However, this absurd assumption that the mean difference between two groups cannot be zero implies that the data cannot be independent and identically distributed (i.i.d.) because the expected difference between any two subgroups of i.i.d. random variates is zero; therefore, the i.i.d. assumption is also absurd. *Layers of philosophical concerns. The probability of statistical significance is a function of decisions made by experimenters/analysts. If the decisions are based on convention they are termed arbitrary or mindless while those not so based may be termed subjective. To minimize type II errors, large samples are recommended. In psychology practically all null hypotheses are claimed to be false for sufficiently large samples so "...it is usually nonsensical to perform an experiment with the ''sole'' aim of rejecting the null hypothesis." "Statistically significant findings are often misleading" in psychology. Statistical significance does not imply practical significance, and correlation does not imply causation. Casting doubt on the null hypothesis is thus far from directly supporting the research hypothesis. *" does not tell us what we want to know". This paper lead to the review of statistical practices by the APA. Cohen was a member of the Task Force that did the review. Lists of dozens of complaints are available. Critics and supporters are largely in factual agreement regarding the characteristics of null hypothesis significance testing (NHST): While it can provide critical information, it is ''inadequate as the sole tool for statistical analysis''. ''Successfully rejecting the null hypothesis may offer no support for the research hypothesis.'' The continuing controversy concerns the selection of the best statistical practices for the near-term future given the existing practices. However, adequate research design can minimize this issue. Critics would prefer to ban NHST completely, forcing a complete departure from those practices, while supporters suggest a less absolute change. Controversy over significance testing, and its effects on publication bias in particular, has produced several results. The American Psychological Association has strengthened its statistical reporting requirements after review, "Hypothesis tests. It is hard to imagine a situation in which a dichotomous accept-reject decision is better than reporting an actual p value or, better still, a confidence interval." (p 599). The committee used the cautionary term "forbearance" in describing its decision against a ban of hypothesis testing in psychology reporting. (p 603) medical journal publishers have recognized the obligation to publish some results that are not statistically significant to combat publication bias, and a journal (''Journal of Articles in Support of the Null Hypothesis'') has been created to publish such results exclusively.''Journal of Articles in Support of the Null Hypothesis'' websiteAlternatives
A unifying position of critics is that statistics should not lead to an accept-reject conclusion or decision, but to an estimated value with an interval estimate; this data-analysis philosophy is broadly referred to as estimation statistics. Estimation statistics can be accomplished with either frequentist or Bayesian methods. Critics of significance testing have advocated basing inference less on p-values and more on confidence intervals for effect sizes for importance, prediction intervals for confidence, replications and extensions for replicability, meta-analyses for generality :. But none of these suggested alternatives inherently produces a decision. Lehmann said that hypothesis testing theory can be presented in terms of conclusions/decisions, probabilities, or confidence intervals: "The distinction between the ... approaches is largely one of reporting and interpretation." Bayesian inference is one proposed alternative to significance testing. (Nickerson cited 10 sources suggesting it, including Rozeboom (1960)). For example, Bayesian parameter estimation can provide rich information about the data from which researchers can draw inferences, while using uncertain priors that exert only minimal influence on the results when enough data is available. Psychologist John K. Kruschke has suggested Bayesian estimation as an alternative for the ''t''-test and has also contrasted Bayesian estimation for assessing null values with Bayesian model comparison for hypothesis testing. Two competing models/hypotheses can be compared usingSee also
*References
Further reading
* Lehmann E.L. (1992) "Introduction to Neyman and Pearson (1933) On the Problem of the Most Efficient Tests of Statistical Hypotheses". In: ''Breakthroughs in Statistics, Volume 1'', (Eds Kotz, S., Johnson, N.L.), Springer-Verlag. (followed by reprinting of the paper) *External links
*Online calculators
* Som