In statistics, hypotheses suggested by a given dataset, when tested with the same dataset that suggested them, are likely to be accepted even when they are not true. This is because circular reasoning (double dipping) would be involved: something seems true in the limited data set; therefore we hypothesize that it is true in general; therefore we wrongly test it on the same, limited data set, which seems to confirm that it is true. Generating hypotheses based on data already observed, in the absence of testing them on new data, is referred to as post hoc theorizing (from Latin ''post hoc'', "after this").
The correct procedure is to test any hypothesis on a data set that was not used to generate the hypothesis.
The general problem
Testing a hypothesis suggested by the data can very easily result in false positives (type I errors). If one looks long enough and in enough different places, eventually data can be found to support any hypothesis. Yet, these positive data do not by themselves constitute evidence that the hypothesis is correct. The negative test data that were thrown out are just as important, because they give one an idea of how common the positive results are compared to chance. Running an experiment, seeing a pattern in the data, proposing a hypothesis from that pattern, then using the ''same'' experimental data as evidence for the new hypothesis is extremely suspect, because data from all other experiments, completed or potential, has essentially been "thrown out" by choosing to look only at the experiments that suggested the new hypothesis in the first place.
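To make this concrete, here is a minimal simulation sketch (in Python; the number of hypotheses, sample sizes, and significance level are illustrative choices, not figures from this article). Every hypothesis tested is a true null, yet searching across many of them and keeping the best-looking result produces an apparently "significant" finding far more often than the nominal 5% error rate suggests:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

def best_p_on_same_data(n_hypotheses=20, n_obs=30):
    """Test many null-true hypotheses and keep only the smallest p-value."""
    # Each "hypothesis" is a one-sample t-test of pure noise against mean 0,
    # so every null hypothesis is actually true.
    samples = rng.normal(size=(n_hypotheses, n_obs))
    pvals = stats.ttest_1samp(samples, 0.0, axis=1).pvalue
    return pvals.min()

n_trials = 2000
false_pos = sum(best_p_on_same_data() < 0.05 for _ in range(n_trials))
print(f"Fraction of runs with a 'significant' finding: {false_pos / n_trials:.2f}")
# With 20 independent true nulls, roughly 1 - 0.95**20 ≈ 0.64 of runs yield
# at least one p < 0.05, far above the nominal 5% false-positive rate.
</syntaxhighlight>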
A large set of tests as described above greatly inflates the probability of a type I error, as all but the data most favorable to the hypothesis are discarded. This is a risk not only in hypothesis testing but in all statistical inference, as it is often problematic to accurately describe the process that has been followed in searching and discarding data. In other words, one wants to keep all data (regardless of whether they tend to support or refute the hypothesis) from "good tests", but it is sometimes difficult to figure out what a "good test" is. It is a particular problem in statistical modelling, where many different models are rejected by trial and error before publishing a result (see also overfitting, publication bias).
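The modelling version of the problem can be illustrated with a small sketch (the polynomial-degree example and all parameter values below are invented for illustration, not taken from this article). Choosing among candidate models by how well they fit the data that suggested them rewards overfitting, which only shows up on data the models have not seen:

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(1)

# A small noisy dataset; the true relationship is linear.
x = np.linspace(0, 1, 15)
y = 2.0 * x + rng.normal(scale=0.3, size=x.size)

# Fresh data from the same process, NOT used to choose the model.
x_new = np.linspace(0, 1, 200)
y_new = 2.0 * x_new + rng.normal(scale=0.3, size=x_new.size)

for degree in (1, 3, 9):
    coeffs = np.polyfit(x, y, degree)  # "trial and error" over model complexity
    fit_err = np.mean((np.polyval(coeffs, x) - y) ** 2)
    new_err = np.mean((np.polyval(coeffs, x_new) - y_new) ** 2)
    print(f"degree {degree}: error on fitting data {fit_err:.3f}, on new data {new_err:.3f}")

# The most complex model looks best on the data that suggested it,
# but typically does worse on data it has never seen (overfitting).
</syntaxhighlight>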
The error is particularly prevalent in data mining and machine learning. It also commonly occurs in academic publishing, where only reports of positive, rather than negative, results tend to be accepted, resulting in the effect known as publication bias.
Correct procedures
All strategies for sound testing of hypotheses suggested by the data involve including a wider range of tests in an attempt to validate or refute the new hypothesis. These include:
*Collecting confirmation samples (see the sketch after this list)
*Cross-validation
*Methods of compensation for multiple comparisons
*Simulation studies including adequate representation of the multiple-testing actually involved
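The confirmation-sample idea can be sketched as follows (a minimal example with invented data; cross-validation generalizes it by rotating which part of the data plays each role). The hypothesis is generated on one portion of the data and then tested on a held-out confirmation sample that played no part in suggesting it:

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# 200 observations of 10 candidate variables, all pure noise versus the outcome.
X = rng.normal(size=(200, 10))
y = rng.normal(size=200)

# Split once: an "exploration" half for generating the hypothesis
# and a held-out confirmation sample for testing it.
explore, confirm = np.arange(100), np.arange(100, 200)

# Hypothesis generation: pick the variable most correlated with y in the exploration half.
corrs = [abs(stats.pearsonr(X[explore, j], y[explore])[0]) for j in range(X.shape[1])]
best = int(np.argmax(corrs))

# Testing the data-suggested hypothesis on the SAME data (fallacious) ...
r_same, p_same = stats.pearsonr(X[explore, best], y[explore])
# ... versus on the confirmation sample (correct procedure).
r_conf, p_conf = stats.pearsonr(X[confirm, best], y[confirm])

print(f"variable {best}: p on exploration data = {p_same:.3f}, "
      f"p on confirmation sample = {p_conf:.3f}")
# The exploration-half p-value is biased small because the variable was chosen
# for looking good there; the confirmation-sample p-value is not.
</syntaxhighlight>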
Henry Scheffé's simultaneous test of all contrasts in multiple comparison problems is the most well-known remedy in the case of analysis of variance (Henry Scheffé, "A Method for Judging All Contrasts in the Analysis of Variance", ''Biometrika'', 40, pp. 87–104, 1953). It is a method designed for testing hypotheses suggested by the data while avoiding the fallacy described above.
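As a sketch of how Scheffé's criterion is commonly computed (a standard textbook formulation; the group means, sizes, and error variance below are invented for illustration): a contrast among k group means is declared significant only if its t-statistic exceeds the square root of (k − 1) times the upper-α quantile of the F distribution with (k − 1, N − k) degrees of freedom, which protects all contrasts simultaneously, including contrasts chosen after looking at the data.

<syntaxhighlight lang="python">
import numpy as np
from scipy import stats

# Hypothetical ANOVA summary: k groups with these means, sizes, and
# pooled within-group variance (mean squared error); all values made up.
means = np.array([5.1, 4.8, 6.0, 5.3])
sizes = np.array([12, 12, 12, 12])
mse = 1.4
k = len(means)
n_total = sizes.sum()
alpha = 0.05

def scheffe_significant(contrast):
    """Scheffé's simultaneous test for one contrast c (coefficients sum to 0)."""
    c = np.asarray(contrast, dtype=float)
    estimate = c @ means
    se = np.sqrt(mse * np.sum(c**2 / sizes))
    # Critical value sqrt((k-1) * F_{alpha; k-1, N-k}) holds for ALL contrasts
    # at once, so it remains valid for contrasts suggested by the data.
    crit = np.sqrt((k - 1) * stats.f.ppf(1 - alpha, k - 1, n_total - k))
    return abs(estimate / se) > crit, estimate, crit * se

# A contrast "suggested by the data": the group with the largest mean vs. the average of the rest.
sig, est, halfwidth = scheffe_significant([-1/3, -1/3, 1, -1/3])
print(f"estimate = {est:.2f}, simultaneous 95% margin = ±{halfwidth:.2f}, significant: {sig}")
</syntaxhighlight>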
See also
*Bonferroni correction
*Data analysis
*Data dredging
*Exploratory data analysis
*HARKing
*''p''-hacking
*Post hoc analysis
*Predictive analytics
*Texas sharpshooter fallacy
*Type I and type II errors
*Uncomfortable science
Notes and references
{{reflist}}
Statistical hypothesis testing
Misuse of statistics
Multiple comparisons