Data dredging

Data dredging, also known as data snooping or ''p''-hacking, is the misuse of data analysis to find patterns in data that can be presented as statistically significant, thus dramatically increasing and understating the risk of false positives. This is done by performing many statistical tests on the data and only reporting those that come back with significant results. Thus data dredging is also often a misused or misapplied form of data mining.

The process of data dredging involves testing multiple hypotheses using a single data set by exhaustively searching, perhaps for combinations of variables that might show a correlation, and perhaps for groups of cases or observations that show differences in their mean or in their breakdown by some other variable.

Conventional tests of statistical significance are based on the probability that a particular result would arise if chance alone were at work, and necessarily accept some risk of mistaken conclusions of a certain type (mistaken rejections of the null hypothesis). This level of risk is called the ''significance''. When large numbers of tests are performed, some produce false results of this type; hence 5% of randomly chosen hypotheses might be (erroneously) reported to be statistically significant at the 5% significance level, 1% might be (erroneously) reported to be statistically significant at the 1% significance level, and so on, by chance alone. When enough hypotheses are tested, it is virtually certain that some will be reported to be statistically significant (even though this is misleading), since almost every data set with any degree of randomness is likely to contain (for example) some spurious correlations. If they are not cautious, researchers using data mining techniques can be easily misled by these results.

The term ''p''-hacking (in reference to ''p''-values) was coined in a 2014 paper by the three researchers behind the blog Data Colada, which focuses on uncovering such problems in social sciences research. Data dredging is an example of disregarding the multiple comparisons problem. One form is when subgroups are compared without alerting the reader to the total number of subgroup comparisons examined. When misused, it is a questionable research practice that can undermine scientific integrity.
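The arithmetic above can be checked directly. The following sketch (not from the article; the test statistic and sample sizes are chosen for illustration) runs 1,000 comparisons in which the null hypothesis is true by construction, and counts how many come back "significant" at the 5% level:

```python
# Simulating why testing many true-null hypotheses yields false
# positives at roughly the significance level.
import math
import random

random.seed(42)

def two_sample_p(x, y):
    """Two-sided p-value for equal means via a normal approximation
    (adequate for this illustration at n = 100 per group)."""
    nx, ny = len(x), len(y)
    mx, my = sum(x) / nx, sum(y) / ny
    vx = sum((v - mx) ** 2 for v in x) / (nx - 1)
    vy = sum((v - my) ** 2 for v in y) / (ny - 1)
    z = (mx - my) / math.sqrt(vx / nx + vy / ny)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# 1,000 hypotheses, all null: both groups drawn from the same distribution.
n_tests, n = 1000, 100
false_positives = 0
for _ in range(n_tests):
    a = [random.gauss(0, 1) for _ in range(n)]
    b = [random.gauss(0, 1) for _ in range(n)]
    if two_sample_p(a, b) < 0.05:
        false_positives += 1

# Roughly 5% of the true-null tests cross the 0.05 threshold by chance.
print(f"{false_positives} of {n_tests} true-null tests were 'significant'")
```

Reporting only those "hits" while staying silent about the other ~950 tests is exactly the practice the article describes.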


Types


Drawing conclusions from data

The conventional statistical hypothesis testing procedure using frequentist probability is to formulate a research hypothesis, such as "people in higher social classes live longer", then collect relevant data. Lastly, a statistical significance test is carried out to see how likely the results are by chance alone (also called testing against the null hypothesis).

A key point in proper statistical analysis is to test a hypothesis with evidence (data) that was not used in constructing the hypothesis. This is critical because every data set contains some patterns due entirely to chance. If the hypothesis is not tested on a different data set from the same statistical population, it is impossible to assess the likelihood that chance alone would produce such patterns.

For example, flipping a coin five times with a result of 2 heads and 3 tails might lead one to hypothesize that the coin favors tails by 3/5 to 2/5. If this hypothesis is then tested on the existing data set, it is confirmed, but the confirmation is meaningless. The proper procedure would have been to form in advance a hypothesis of what the tails probability is, and then toss the coin various times to see if the hypothesis is rejected or not. If three tails and two heads are observed, another hypothesis, that the tails probability is 3/5, could be formed, but it could only be tested by a new set of coin tosses. The statistical significance under the incorrect procedure is completely spurious; significance tests do not protect against data dredging.
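A quick calculation (illustrative, not from the article) shows just how unremarkable the coin-flip result above is under a fair coin:

```python
# How surprising is seeing 3 or more tails in 5 flips of a fair coin?
from math import comb

n, observed_tails = 5, 3
# One-sided binomial p-value: P(at least 3 tails | fair coin)
p_value = sum(comb(n, k) for k in range(observed_tails, n + 1)) / 2 ** n
print(p_value)  # 0.5: such a result is expected half the time by chance
```

A result that occurs half the time by pure chance cannot support the "biased coin" hypothesis, which is why confirming a hypothesis on the very data that suggested it is meaningless.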


Optional stopping

Optional stopping is a practice where one collects data until some stopping criterion is reached. While it is a valid procedure, it is easily misused. The problem is that the ''p''-value of an optionally stopped statistical test is larger than it seems. Intuitively, this is because the ''p''-value is supposed to be the sum of the probabilities of all events at least as rare as what is observed. With optional stopping, there are even rarer events that are difficult to account for: not triggering the stopping rule and collecting even more data before stopping. Neglecting these events leads to a ''p''-value that is too low. In fact, if the null hypothesis is true, then ''any'' significance level can be reached if one is allowed to keep collecting data and stop when the desired ''p''-value (calculated as if one had always been planning to collect exactly this much data) is obtained.

More succinctly, the proper calculation of the ''p''-value requires accounting for counterfactuals, that is, what the experimenter ''could'' have done in reaction to data that ''might'' have been observed. Accounting for what might have been is hard, even for honest researchers. One benefit of preregistration is to account for all counterfactuals, allowing the ''p''-value to be calculated correctly.

The problem of early stopping is not limited to researcher misconduct. There is often pressure to stop early if the cost of collecting data is high. Some animal ethics boards even mandate early stopping if the study obtains a significant result midway.
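The inflation can be demonstrated by simulation. This sketch (an illustration under assumed parameters: peeking after every flip from 10 to 500, a normal-approximation test) flips a fair coin and stops as soon as the naive ''p''-value dips below 0.05:

```python
# Optional stopping inflates the false-positive rate: we flip a fair
# coin, peek at the p-value after every flip, and stop at "significance".
import math
import random

random.seed(0)

def fair_coin_p(heads, n):
    """Two-sided normal-approximation p-value for H0: P(heads) = 0.5."""
    z = (heads - n / 2) / math.sqrt(n / 4)
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

def peeking_experiment(max_flips=500, min_flips=10):
    heads = 0
    for n in range(1, max_flips + 1):
        heads += random.random() < 0.5
        if n >= min_flips and fair_coin_p(heads, n) < 0.05:
            return True   # "significant" result: stop and report
    return False

rejections = sum(peeking_experiment() for _ in range(1000))
# Far more than the nominal 5% of fair coins get declared biased.
print(f"{rejections / 10:.1f}% of fair coins declared biased")
```

A fixed-sample test at the same threshold would reject about 5% of the time; peeking after every flip pushes the rate severalfold higher, and with an unbounded number of flips it approaches 100%.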


Post-hoc data replacement

If data points are removed ''after'' some data analysis has already been done on them, for example on the pretext of "removing outliers", then the false positive rate increases. Replacing the removed "outliers" with new data increases the false positive rate further.


Post-hoc grouping

If a dataset contains multiple features, then one or more of the features can be used as grouping variables, potentially creating a statistically significant result. For example, if a dataset of patients records their age and sex, then a researcher can group them by age and check whether the illness recovery rate is correlated with age. If that does not work, the researcher might check whether it correlates with sex. If not, then perhaps it correlates with age after controlling for sex, and so on. The number of possible groupings grows exponentially with the number of features.
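The search space can be made concrete. In this sketch (hypothetical data; the feature names and sample size are invented for illustration), recovery is pure chance and every feature is noise, yet the researcher tests every single feature and every pair of features as a candidate subgroup:

```python
# Searching across subgroups defined by unrelated features multiplies
# the number of chances to "find" a significant difference.
import math
import random
from itertools import combinations

random.seed(7)

# 500 patients: recovery is pure chance, features are pure noise.
n = 500
features = {name: [random.randrange(2) for _ in range(n)]
            for name in ["old", "male", "smoker", "urban", "treated_early"]}
recovered = [random.random() < 0.5 for _ in range(n)]

def subgroup_p(mask):
    """Two-sided p-value comparing recovery rate inside vs. outside the
    subgroup (normal approximation to the two-proportion test)."""
    inside = [r for r, m in zip(recovered, mask) if m]
    outside = [r for r, m in zip(recovered, mask) if not m]
    if len(inside) < 20 or len(outside) < 20:
        return 1.0
    p1, p2 = sum(inside) / len(inside), sum(outside) / len(outside)
    p = (sum(inside) + sum(outside)) / n
    se = math.sqrt(p * (1 - p) * (1 / len(inside) + 1 / len(outside)))
    z = (p1 - p2) / se
    return 2 * (1 - 0.5 * (1 + math.erf(abs(z) / math.sqrt(2))))

# Candidate subgroups: each feature alone, plus each pair of features.
names = list(features)
masks = [(f, features[f]) for f in names]
masks += [(f"{a}&{b}", [x and y for x, y in zip(features[a], features[b])])
          for a, b in combinations(names, 2)]

hits = [label for label, mask in masks if subgroup_p(mask) < 0.05]
print(f"{len(masks)} subgroup tests, 'significant' findings: {hits}")
```

Even with only five binary features there are already 15 single-and-pairwise subgroup tests here; adding triples, continuous cut-points, and interactions makes spurious subgroup "findings" nearly inevitable.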


Hypothesis suggested by non-representative data

Suppose that a study of a random sample of people includes exactly two people with a birthday of August 7: Mary and John. Someone engaged in data dredging might try to find additional similarities between Mary and John. By going through hundreds or thousands of potential similarities between the two, each having a low probability of being true, an unusual similarity can almost certainly be found. Perhaps John and Mary are the only two people in the study who switched minors three times in college. A hypothesis, biased by data dredging, could then be "people born on August 7 have a much higher chance of switching minors more than twice in college." The data itself, taken out of context, might be seen as strongly supporting that correlation, since no one with a different birthday had switched minors three times in college. However, if (as is likely) this is a spurious hypothesis, this result will most likely not be reproducible; any attempt to check whether others with an August 7 birthday have a similar rate of changing minors will most likely get contradictory results almost immediately.


Systematic bias

Bias is a systematic error in the analysis. For example, doctors directed HIV patients at high cardiovascular risk to a particular HIV treatment, abacavir, and lower-risk patients to other drugs, preventing a simple assessment of abacavir compared to other treatments. An analysis that did not correct for this bias unfairly penalized abacavir, since its patients were more high-risk, so more of them had heart attacks. This problem can be very severe in, for example, observational studies. Missing factors, unmeasured confounders, and loss to follow-up can also lead to bias. By selecting papers with significant ''p''-values, negative studies are selected against, which is publication bias. This is also known as ''file drawer bias'', because less significant ''p''-value results are left in the file drawer and never published.


Multiple modelling

Another aspect of the conditioning of statistical tests by knowledge of the data can be seen in model selection. A crucial step in the process is to decide which covariates to include in a relationship explaining one or more other variables. There are both statistical (see stepwise regression) and substantive considerations that lead the authors to favor some of their models over others, and there is a liberal use of statistical tests. However, to discard one or more variables from an explanatory relation on the basis of the data means one cannot validly apply standard statistical procedures to the retained variables in the relation as though nothing had happened. In the nature of the case, the retained variables have had to pass some kind of preliminary test (possibly an imprecise intuitive one) that the discarded variables failed. In 1966, Selvin and Stuart compared variables retained in the model to the fish that don't fall through the net, in the sense that their effects are bound to be bigger than those of the fish that do fall through the net. Not only does this alter the performance of all subsequent tests on the retained explanatory model, but it may also introduce bias and alter the mean squared error in estimation.


Examples


In meteorology and epidemiology

In meteorology, hypotheses are often formulated using weather data up to the present and tested against future weather data, which ensures that, even subconsciously, future data could not influence the formulation of the hypothesis. Of course, such a discipline necessitates waiting for new data to come in, to show the formulated theory's predictive power versus the null hypothesis. This process ensures that no one can accuse the researcher of hand-tailoring the predictive model to the data on hand, since the upcoming weather is not yet available.

As another example, suppose that observers note that a particular town appears to have a cancer cluster, but lack a firm hypothesis of why this is so. However, they have access to a large amount of demographic data about the town and surrounding area, containing measurements for the area of hundreds or thousands of different variables, mostly uncorrelated. Even if all these variables are independent of the cancer incidence rate, it is highly likely that at least one variable correlates significantly with the cancer rate across the area. While this may suggest a hypothesis, further testing using the same variables but with data from a different location is needed to confirm it. Note that a ''p''-value of 0.01 suggests that 1% of the time a result at least that extreme would be obtained by chance; if hundreds or thousands of hypotheses (with mutually relatively uncorrelated independent variables) are tested, then one is likely to obtain a ''p''-value less than 0.01 for many null hypotheses.


Appearance in media

One example is the chocolate weight-loss hoax study conducted by journalist John Bohannon, who explained publicly in a ''Gizmodo'' article that the study was deliberately conducted fraudulently as a social experiment. This study was widespread in many media outlets around 2015, with many people believing the claim that eating a chocolate bar every day would cause them to lose weight, against their better judgement. This study was published by the Institute of Diet and Health. According to Bohannon, to reduce the ''p''-value to below 0.05, taking 18 different variables into consideration when testing was crucial.


Remedies

While looking for patterns in data is legitimate, applying a statistical test of significance or hypothesis test to the same data until a pattern emerges is prone to abuse. One way to construct hypotheses while avoiding data dredging is to conduct randomized out-of-sample tests. The researcher collects a data set, then randomly partitions it into two subsets, A and B. Only one subset, say subset A, is examined for creating hypotheses. Once a hypothesis is formulated, it must be tested on subset B, which was not used to construct the hypothesis. Only where B also supports such a hypothesis is it reasonable to believe the hypothesis might be valid. (This is a simple type of cross-validation and is often termed training-test or split-half validation.)

Another remedy for data dredging is to record the number of all significance tests conducted during the study and simply divide one's criterion for significance (alpha) by this number; this is the Bonferroni correction. However, this is a very conservative metric. A family-wise alpha of 0.05, divided in this way by 1,000 to account for 1,000 significance tests, yields a very stringent per-hypothesis alpha of 0.00005. Methods particularly useful in analysis of variance, and in constructing simultaneous confidence bands for regressions involving basis functions, are Scheffé's method and, if the researcher has in mind only pairwise comparisons, the Tukey method. To avoid the extreme conservativeness of the Bonferroni correction, more sophisticated selective inference methods are available. The most common selective inference method is the use of Benjamini and Hochberg's false discovery rate controlling procedure: it is a less conservative approach that has become a popular method for control of multiple hypothesis tests.

When neither approach is practical, one can make a clear distinction between data analyses that are confirmatory and analyses that are exploratory. Statistical inference is appropriate only for the former.

Ultimately, the statistical significance of a test and the statistical confidence of a finding are joint properties of data and the method used to examine the data. Thus, if someone says that a certain event has probability of 20% ± 2% 19 times out of 20, this means that if the probability of the event is estimated ''by the same method'' used to obtain the 20% estimate, the result is between 18% and 22% with probability 0.95. No claim of statistical significance can be made by only looking, without due regard to the method used to assess the data.

Academic journals increasingly shift to the registered report format, which aims to counteract very serious issues such as data dredging, which have made theory-testing research very unreliable. For example, ''Nature Human Behaviour'' has adopted the registered report format, as it "shift[s] the emphasis from the results of research to the questions that guide the research and the methods used to answer them". The ''European Journal of Personality'' defines this format as follows: "In a registered report, authors create a study proposal that includes theoretical and empirical background, research questions/hypotheses, and pilot data (if available). Upon submission, this proposal will then be reviewed prior to data collection, and if accepted, the paper resulting from this peer-reviewed procedure will be published, regardless of the study outcomes."

Methods and results can also be made publicly available, as in the open science approach, making it yet more difficult for data dredging to take place.
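The two correction procedures named above can be stated in a few lines of code. This sketch applies both to a hypothetical list of ''p''-values from 10 tests (the values are made up for illustration):

```python
# Bonferroni vs. Benjamini-Hochberg on the same set of p-values.
p_values = [0.001, 0.008, 0.012, 0.020, 0.041, 0.049, 0.100, 0.300, 0.600, 0.900]
alpha = 0.05
m = len(p_values)

# Bonferroni: compare each p-value against alpha / m.
bonferroni_hits = [p for p in p_values if p < alpha / m]

# Benjamini-Hochberg: sort ascending, find the largest k with
# p_(k) <= (k / m) * alpha, and reject hypotheses 1..k.
ranked = sorted(p_values)
k_max = 0
for k, p in enumerate(ranked, start=1):
    if p <= k / m * alpha:
        k_max = k
bh_hits = ranked[:k_max]

print("Bonferroni rejects:", bonferroni_hits)   # only p < 0.005
print("Benjamini-Hochberg rejects:", bh_hits)   # less conservative
```

On this example, Bonferroni rejects only the single hypothesis with p = 0.001, while Benjamini-Hochberg rejects the four smallest, illustrating why the FDR procedure is preferred when many tests are run and some loss of strictness is acceptable.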


See also

* Garden of forking paths fallacy – side effect of too many researcher degrees of freedom



External links


* A bibliography on data-snooping bias
* Spurious Correlations, a gallery of examples of implausible correlations
* Video explaining p-hacking by "Neuroskeptic", a blogger at Discover Magazine
* Step Away From Stepwise, an article in the ''Journal of Big Data'' criticizing stepwise regression