Data dredging (also known as data snooping or ''p''-hacking)
is the misuse of
data analysis
Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, enco ...
to find patterns in data that can be presented as
statistically significant
In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis (simply by chance alone). More precisely, a study's defined significance level, denoted by \alpha, is the p ...
, thus dramatically increasing and understating the risk of false positives. This is done by performing many
statistical tests
A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis.
Hypothesis testing allows us to make probabilistic statements about population parameters.
...
on the data and only reporting those that come back with significant results.
The process of data dredging involves testing multiple hypotheses using a single
data set A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the ...
by
exhaustively searching—perhaps for combinations of variables that might show a
correlation
In statistics, correlation or dependence is any statistical relationship, whether causal or not, between two random variables or bivariate data. Although in the broadest sense, "correlation" may indicate any type of association, in statistics ...
, and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.
Conventional tests of
statistical significance
In statistical hypothesis testing, a result has statistical significance when it is very unlikely to have occurred given the null hypothesis (simply by chance alone). More precisely, a study's defined significance level, denoted by \alpha, is the p ...
are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of mistaken conclusions of a certain type (mistaken rejections of the null hypothesis). This level of risk is called the
''significance''. When large numbers of tests are performed, some produce false results of this type; hence 5% of randomly chosen hypotheses might be (erroneously) reported to be statistically significant at the 5% significance level, 1% might be (erroneously) reported to be statistically significant at the 1% significance level, and so on, by chance alone. When enough hypotheses are tested, it is virtually certain that some will be reported to be statistically significant (even though this is misleading), since almost every data set with any degree of randomness is likely to contain (for example) some
spurious correlations. If they are not cautious, researchers using data mining techniques can be easily misled by these results.
Data dredging is an example of disregarding the
multiple comparisons problem
In statistics, the multiple comparisons, multiplicity or multiple testing problem occurs when one considers a set of statistical inferences simultaneously or infers a subset of parameters selected based on the observed values.
The more inferences ...
. One form is when subgroups are compared without alerting the reader to the total number of subgroup comparisons examined.
[
]
Drawing conclusions from data
The conventional
frequentist
Frequentist inference is a type of statistical inference based in frequentist probability, which treats “probability” in equivalent terms to “frequency” and draws conclusions from sample-data by means of emphasizing the frequency or pro ...
statistical hypothesis testing
A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis.
Hypothesis testing allows us to make probabilistic statements about population parameters.
...
procedure is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data, followed by carrying out a statistical
significance test
A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis.
Hypothesis testing allows us to make probabilistic statements about population parameters.
...
to see how likely such results would be found if chance alone were at work. (The last step is called testing against the
null hypothesis
In scientific research, the null hypothesis (often denoted ''H''0) is the claim that no difference or relationship exists between two sets of data or variables being analyzed. The null hypothesis is that any experimentally observed difference is d ...
.)
A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every
data set A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the ...
contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same
statistical population
In statistics, a population is a set of similar items or events which is of interest for some question or experiment. A statistical population can be a group of existing objects (e.g. the set of all stars within the Milky Way galaxy) or a hypoth ...
, it is impossible to assess the likelihood that chance alone would produce such patterns. See
testing hypotheses suggested by the data
In statistics, hypotheses suggested by a given dataset, when tested with the same dataset that suggested them, are likely to be accepted even when they are not true. This is because circular reasoning (double dipping) would be involved: somethin ...
.
Here is a simple example.
Throwing a coin five times, with a result of 2 heads and 3 tails, might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form in advance a hypothesis of what the tails probability is, and then throw the coin various times to see if the hypothesis is rejected or not. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. It is important to realize that the statistical significance under the incorrect procedure is completely spurious – significance tests do not protect against data dredging.
Hypothesis suggested by non-representative data
Suppose that a study of a random sample of people includes exactly two people with a birthday of August 7: Mary and John. Someone engaged in data snooping might try to find additional similarities between Mary and John. By going through hundreds or thousands of potential similarities between the two, each having a low probability of being true, an unusual similarity can almost certainly be found. Perhaps John and Mary are the only two people in the study who switched minors three times in college. A hypothesis, biased by data snooping, could then be "People born on August 7 have a much higher chance of switching minors more than twice in college."
The data itself taken out of context might be seen as strongly supporting that correlation, since no one with a different birthday had switched minors three times in college. However, if (as is likely) this is a spurious hypothesis, this result will most likely not be
reproducible
Reproducibility, also known as replicability and repeatability, is a major principle underpinning the scientific method. For the findings of a study to be reproducible means that results obtained by an experiment or an observational study or in a ...
; any attempt to check if others with an August 7 birthday have a similar rate of changing minors will most likely get contradictory results almost immediately.
Bias
Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment,
abacavir, and lower-risk patients to other drugs, preventing a simple assessment of abacavir compared to other treatments. An analysis that did not correct for this bias unfairly penalised abacavir, since its patients were more high-risk so more of them had heart attacks.
This problem can be very severe, for example, in the
observational study
In fields such as epidemiology, social sciences, psychology and statistics, an observational study draws inferences from a sample (statistics), sample to a statistical population, population where the dependent and independent variables, independ ...
.
[
]
Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias.
By selecting papers with a significant
''p''-value, negative studies are selected against—which is the
publication bias
In published academic research, publication bias occurs when the outcome of an experiment or research study biases the decision to publish or otherwise distribute it. Publishing only results that show a significant finding disturbs the balance o ...
. This is also known as "
file drawer bias", because less significant ''p''-value results are left in the file drawer and never published.
Multiple modelling
Another aspect of the conditioning of
statistical test
A statistical hypothesis test is a method of statistical inference used to decide whether the data at hand sufficiently support a particular hypothesis.
Hypothesis testing allows us to make probabilistic statements about population parameters.
...
s by knowledge of the data can be seen while using the A crucial step in the process is to decide which
covariate
Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or demand ...
s to include in a relationship explaining one or more other variables. There are both statistical (see
Stepwise regression
In statistics, stepwise regression is a method of fitting regression models in which the choice of predictive variables is carried out by an automatic procedure. In each step, a variable is considered for addition to or subtraction from the set of ...
) and substantive considerations that lead the authors to favor some of their models over others, and there is a liberal use of statistical tests. However, to discard one or more variables from an explanatory relation on the basis of the data means one cannot validly apply standard statistical procedures to the retained variables in the relation as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test (possibly an imprecise intuitive one) that the discarded variables failed. In 1966, Selvin and Stuart compared variables retained in the model to the fish that don't fall through the net—in the sense that their effects are bound to be bigger than those that do fall through the net. Not only does this alter the performance of all subsequent tests on the retained explanatory model, but it may also introduce bias and alter
mean square error
In statistics, the mean squared error (MSE) or mean squared deviation (MSD) of an estimator (of a procedure for estimating an unobserved quantity) measures the average of the squares of the errors—that is, the average squared difference between ...
in estimation.
[
][
]
Examples in meteorology and epidemiology
In
meteorology
Meteorology is a branch of the atmospheric sciences (which include atmospheric chemistry and physics) with a major focus on weather forecasting. The study of meteorology dates back millennia, though significant progress in meteorology did not ...
, hypotheses are often formulated using weather data up to the present and tested against future weather data, which ensures that, even subconsciously, future data could not influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's
predictive power
The concept of predictive power, the power of a scientific theory to generate testable predictions, differs from ''explanatory power'' and ''descriptive power'' (where phenomena that are already known are retrospectively explained or described ...
versus the
null hypothesis
In scientific research, the null hypothesis (often denoted ''H''0) is the claim that no difference or relationship exists between two sets of data or variables being analyzed. The null hypothesis is that any experimentally observed difference is d ...
. This process ensures that no one can accuse the researcher of hand-tailoring the
predictive model
Predictive modelling uses statistics to predict outcomes. Most often the event one wants to predict is in the future, but predictive modelling can be applied to any type of unknown event, regardless of when it occurred. For example, predictive mod ...
to the data on hand, since the upcoming weather is not yet available.
As another example, suppose that observers note that a particular town appears to have a
cancer cluster
A cancer cluster is a disease cluster in which a high number of cancer cases occurs in a group of people in a particular geographic area over a limited period of time.[demographic data
Demography () is the statistical study of populations, especially human beings.
Demographic analysis examines and measures the dimensions and dynamics of populations; it can cover whole societies or groups defined by criteria such as edu ...](_ ...<br></span></div>, but lack a firm hypothesis of why this is so. However, they have access to a large amount of <div class=)
about the town and surrounding area, containing measurements for the area of hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm. Note that a
''p''-value of 0.01 suggests that 1% of the time a result at least that extreme would be obtained by chance; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is likely to obtain a ''p''-value less than 0.01 for many null hypotheses.
Remedies
Looking for patterns in data is legitimate. Applying a
statistical test of significance, or hypothesis test, to the same data that a pattern emerges from is wrong. One way to construct hypotheses while avoiding data dredging is to conduct
randomized
In common usage, randomness is the apparent or actual lack of pattern or predictability in events. A random sequence of events, symbols or steps often has no order and does not follow an intelligible pattern or combination. Individual rand ...
out-of-sample test
Cross-validation, sometimes called rotation estimation or out-of-sample testing, is any of various similar model validation techniques for assessing how the results of a statistics, statistical analysis will Generalization error, generalize to ...
s. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset—say, subset A—is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid. (This is a simple type of
cross-validation and is often termed training-test or split-half validation.)
Another remedy for data dredging is to record the number of all significance tests conducted during the study and simply divide one's criterion for significance ("alpha") by this number; this is the
Bonferroni correction
In statistics, the Bonferroni correction is a method to counteract the multiple comparisons problem.
Background
The method is named for its use of the Bonferroni inequalities.
An extension of the method to confidence intervals was proposed by Oliv ...
. However, this is a very conservative metric. A family-wise alpha of 0.05, divided in this way by 1,000 to account for 1,000 significance tests, yields a very stringent per-hypothesis alpha of 0.00005. Methods particularly useful in analysis of variance, and in constructing simultaneous confidence bands for regressions involving basis functions are the
Scheffé method and, if the researcher has in mind only pairwise comparisons, the
Tukey method. The use of Benjamini and Hochberg's
false discovery rate
In statistics, the false discovery rate (FDR) is a method of conceptualizing the rate of type I errors in null hypothesis testing when conducting multiple comparisons. FDR-controlling procedures are designed to control the FDR, which is the expe ...
is a more sophisticated approach that has become a popular method for control of multiple hypothesis tests.
When neither approach is practical, one can make a clear distinction between data analyses that are
confirmatory and analyses that are
exploratory. Statistical inference is appropriate only for the former.
Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of data and the method used to examine the data. Thus, if someone says that a certain event has probability of 20% ± 2% 19 times out of 20, this means that if the probability of the event is estimated ''by the same method'' used to obtain the 20% estimate, the result is between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.
Academic journals increasingly shift to the
registered report format, which aims to counteract very serious issues such as data dredging and
, which have made theory-testing research very unreliable. For example,
Nature Human Behaviour
''Nature Human Behaviour'' is a monthly multidisciplinary online-only peer-reviewed scientific journal covering all aspects of human behaviour. It was established in January 2017 and is published by Springer Nature Publishing. The editor-in-chi ...
has adopted the registered report format, as it “shift
the emphasis from the results of research to the questions that guide the research and the methods used to answer them”. The
European Journal of Personality
The ''European Journal of Personality'' (EJP) is the official bimonthly academic journal of the European Association of Personality Psychology covering research on personality, published by SAGE Publishing. According to citation reports based on ...
defines this format as follows: “In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes.”
Methods and results can also be made publicly available, as in the
open science approach, making it yet more difficult for data dredging to take place.
See also
*
*
*
*
*
*
*
*
*
*
*
*
*
*
*
Notes
References
Further reading
*
*
*
*
External links
A bibliography on data-snooping biasSpurious Correlations a gallery of examples of implausible correlations
*
Video explaining p-hackingby "
Neuroskeptic", a blogger at Discover Magazine
Step Away From Stepwise an article in the Journal of Big Data criticising stepwise regression.
{{DEFAULTSORT:Data Dredging
Bias
Cognitive biases
Scientific misconduct
Data mining
Design of experiments
Statistical hypothesis testing
Misuse of statistics
Ethically disputed research practices