In statistics, a spurious relationship or spurious correlation is a mathematical relationship in which two or more events or variables are associated but ''not'' causally related, due to either coincidence or the presence of a certain third, unseen factor (referred to as a "common response variable", "confounding factor", or "lurking variable").
Examples
An example of a spurious relationship can be found in the time-series literature, where a spurious regression is one that provides misleading statistical evidence of a linear relationship between independent non-stationary variables. In fact, the non-stationarity may be due to the presence of a unit root in both variables. In particular, any two nominal economic variables are likely to be correlated with each other, even when neither has a causal effect on the other, because each equals a real variable times the price level, and the common presence of the price level in the two data series imparts correlation to them. (See also spurious correlation of ratios.)
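The spurious-regression effect can be reproduced with simulated data. The sketch below is illustrative only: the sample size, number of trials, seed, and the 1.96 cutoff are assumptions chosen for the demonstration, and the helper names are hypothetical. It regresses one series on another, independently generated series many times and records how often the slope looks "significant" at the nominal 5% level.

```python
import math
import random


def random_walk(n, rng):
    """Return a driftless random walk of length n (a unit-root process)."""
    level, path = 0.0, []
    for _ in range(n):
        level += rng.gauss(0.0, 1.0)
        path.append(level)
    return path


def white_noise(n, rng):
    """Return n independent standard normal draws (a stationary series)."""
    return [rng.gauss(0.0, 1.0) for _ in range(n)]


def slope_t_stat(x, y):
    """t-statistic of the OLS slope in the regression y = a + b*x + e."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    a = my - b * mx
    rss = sum((yi - a - b * xi) ** 2 for xi, yi in zip(x, y))
    se_b = math.sqrt(rss / (n - 2) / sxx)
    return b / se_b


def rejection_rate(make_series, trials=500, n=200, seed=0):
    """Share of trials with |t| > 1.96 when x and y are generated independently."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        x, y = make_series(n, rng), make_series(n, rng)
        if abs(slope_t_stat(x, y)) > 1.96:
            hits += 1
    return hits / trials


walk_rate = rejection_rate(random_walk)    # independent non-stationary series
noise_rate = rejection_rate(white_noise)   # independent stationary series
print(f"random walks: {walk_rate:.2f}, white noise: {noise_rate:.2f}")
```

With stationary white noise the rejection rate should stay near the nominal 5%, whereas for independent random walks it is typically far higher, even though no relationship exists in either case.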
Another example of a spurious relationship can be seen by examining a city's ice cream sales. The sales might be highest when the rate of drownings in city swimming pools is highest. To allege that ice cream sales cause drowning, or vice versa, would be to imply a spurious relationship between the two. In reality, a heat wave may have caused both. The heat wave is an example of a hidden or unseen variable, also known as a confounding variable.
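A small simulation makes the role of the hidden variable concrete. Everything numeric below is made up for illustration (the temperature distribution, the coefficients, and the noise levels are assumptions, not data from any city), and the resulting series are stylized index values. Sales and drownings are each generated from temperature alone, then their correlation is computed before and after the linear effect of temperature is removed.

```python
import math
import random


def pearson(x, y):
    """Sample Pearson correlation coefficient of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def residuals(y, w):
    """Residuals of y after a simple linear regression of y on w."""
    n = len(y)
    mw, my = sum(w) / n, sum(y) / n
    slope = (sum((a - mw) * (b - my) for a, b in zip(w, y))
             / sum((a - mw) ** 2 for a in w))
    return [b - my - slope * (a - mw) for a, b in zip(w, y)]


rng = random.Random(1)
days = 2000

# Hypothetical data-generating process: temperature drives both series,
# and neither series has any effect on the other.
temperature = [rng.gauss(20.0, 8.0) for _ in range(days)]
ice_cream = [50.0 + 3.0 * t + rng.gauss(0.0, 10.0) for t in temperature]
drownings = [1.0 + 0.1 * t + rng.gauss(0.0, 1.0) for t in temperature]

raw = pearson(ice_cream, drownings)
partial = pearson(residuals(ice_cream, temperature),
                  residuals(drownings, temperature))
print(f"raw correlation: {raw:.2f}, after removing temperature: {partial:.2f}")
```

The raw correlation is clearly positive, while the partial correlation (computed on the residuals) should be close to zero, which is what one expects when the association is produced entirely by the common cause.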
Another commonly noted example is a series of Dutch statistics showing a positive correlation between the number of storks nesting in a series of springs and the number of human babies born at that time. Of course there was no causal connection; the two were correlated only because of two independent coincidences. During the Pagan era, which can be traced back at least to medieval times more than 600 years ago, it was common for couples to wed during the annual summer solstice, because summer was associated with fertility. At the same time, storks would commence their annual migration, flying all the way from Europe to Africa. The birds would then return the following spring, exactly nine months later.
In rare cases, a spurious relationship can occur between two completely unrelated variables without any confounding variable, as was the case between the success of the Washington Commanders professional football team in a specific game before each presidential election and the success of the incumbent President's political party in said election. For 16 consecutive elections between 1940 and 2000, the Redskins Rule correctly matched whether the incumbent President's political party would retain or lose the Presidency. The rule eventually failed shortly after Elias Sports Bureau discovered the correlation in 2000; in 2004, 2012 and 2016, the results of the Commanders' game and the election did not match.
In a similar spurious relationship involving the National Football League, in the 1970s Leonard Koppett noted a correlation between the direction of the stock market and the winning conference of that year's Super Bowl, known as the Super Bowl indicator; the relationship maintained itself for most of the 20th century before reverting to more random behavior in the 21st.
Hypothesis testing
Often one tests a null hypothesis of no correlation between two variables, and chooses in advance to reject the hypothesis if the correlation computed from a data sample would have occurred in less than (say) 5% of data samples if the null hypothesis were true. A true null hypothesis will then be accepted 95% of the time, but in the other 5% of cases a true null of no correlation will be wrongly rejected, causing acceptance of a correlation which is spurious (an event known as a Type I error). Here the spurious correlation in the sample resulted from random selection of a sample that did not reflect the true properties of the underlying population.
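The 5% figure can be checked by simulation. The following is a minimal sketch under stated assumptions: both variables are drawn as independent standard normals, the sample size is fixed at 30, and the test used is the usual t-test for a Pearson correlation, with 2.048 as the two-sided 5% critical value for 28 degrees of freedom. The function names, trial count, and seed are hypothetical choices for the illustration.

```python
import math
import random

N = 30          # observations per sample (assumed for this sketch)
T_CRIT = 2.048  # two-sided 5% critical value of Student's t with N - 2 = 28 d.o.f.


def pearson(x, y):
    """Sample Pearson correlation coefficient of x and y."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def false_positive_rate(trials=2000, seed=42):
    """Share of samples in which a true null of zero correlation is rejected."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(trials):
        # x and y are independent, so the null hypothesis is true by construction.
        x = [rng.gauss(0.0, 1.0) for _ in range(N)]
        y = [rng.gauss(0.0, 1.0) for _ in range(N)]
        r = pearson(x, y)
        t = r * math.sqrt((N - 2) / (1.0 - r * r))
        if abs(t) > T_CRIT:
            rejections += 1
    return rejections / trials


rate = false_positive_rate()
print(f"share of spurious 'significant' correlations: {rate:.3f}")
```

The printed share should land near 0.05: each rejection is a Type I error, since the two variables are unrelated in the population being sampled.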
Detecting spurious relationships
The term "spurious relationship" is commonly used in statistics and in particular in experimental research techniques, both of which attempt to understand and predict direct causal relationships (X → Y). A non-causal correlation can be spuriously created by an antecedent which causes both (W → X and W → Y). Mediating variables (X → M → Y), if undetected, lead to an estimate of the total effect rather than the direct effect, because no adjustment is made for the mediating variable M. Because of this, experimentally identified correlations do not represent causal relationships unless spurious relationships can be ruled out.
Experiments
In experiments, spurious relationships can often be identified by controlling for other factors, including those that have been theoretically identified as possible confounding factors. For example, consider a researcher trying to determine whether a new drug kills bacteria; when the researcher applies the drug to a bacterial culture, the bacteria die. To help rule out the presence of a confounding variable, a second culture is subjected to conditions that are as nearly identical as possible to those facing the first culture, except that it is not subjected to the drug. If there is an unseen confounding factor in those conditions, this control culture will die as well, so that no conclusion of efficacy of the drug can be drawn from the results of the first culture. On the other hand, if the control culture does not die, then the researcher cannot reject the hypothesis that the drug is efficacious.
Non-experimental statistical analyses
Disciplines whose data are mostly non-experimental, such as economics, usually employ observational data to establish causal relationships. The body of statistical techniques used in economics is called econometrics. The main statistical method in econometrics is multivariable regression analysis. Typically a linear relationship such as
: $y = a_0 + a_1 x_1 + a_2 x_2 + \cdots + a_k x_k + e$
is hypothesized, in which ''y'' is the dependent variable (hypothesized to be the caused variable), $x_j$ for ''j'' = 1, ..., ''k'' is the ''j''th independent variable (hypothesized to be a causative variable), and $e$ is the error term (containing the combined effects of all other causative variables, which must be uncorrelated with the included independent variables). If there is reason to believe that none of the $x_j$ is caused by ''y'', then estimates of the coefficients $a_j$ are obtained. If the null hypothesis that $a_j = 0$ is rejected, then the alternative hypothesis that $a_j \neq 0$ and equivalently that $x_j$ causes ''y'' cannot be rejected. On the other hand, if the null hypothesis that $a_j = 0$ cannot be rejected, then equivalently the hypothesis of no causal effect of $x_j$ on ''y'' cannot be rejected. Here the notion of causality is one of contributory causality: If the true value $a_j \neq 0$, then a change in $x_j$ will result in a change in ''y'' ''unless'' some other causative variable(s), either included in the regression or implicit in the error term, change in such a way as to exactly offset its effect; thus a change in $x_j$ is ''not sufficient'' to change ''y''. Likewise, a change in $x_j$ is ''not necessary'' to change ''y'', because a change in ''y'' could be caused by something implicit in the error term (or by some other causative explanatory variable included in the model).
Regression analysis controls for other relevant variables by including them as regressors (explanatory variables). This helps to avoid mistaken inference of causality due to the presence of a third, underlying, variable that influences both the potentially causative variable and the potentially caused variable: its effect on the potentially caused variable is captured by directly including it in the regression, so that effect will not be picked up as a spurious effect of the potentially causative variable of interest. In addition, the use of multiple regression helps to avoid wrongly inferring that an indirect effect of, say, $x_1$ (e.g., $x_1$ → $x_2$ → ''y'') is a direct effect ($x_1$ → ''y'').
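The difference between an indirect and a direct effect can be illustrated with simulated data. The sketch below assumes a hypothetical chain in which the first regressor affects the second, and only the second affects ''y''; the coefficients 2 and 3, the sample size, and the helper names are all invented for the example. It compares the slope from a regression on the first regressor alone with the slopes from a regression on both.

```python
import random


def demean(v):
    """Subtract the sample mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def simple_slope(x, y):
    """OLS slope of y on x, with an intercept."""
    x, y = demean(x), demean(y)
    return dot(x, y) / dot(x, x)


def two_regressor_slopes(x1, x2, y):
    """OLS slopes of y on x1 and x2, with an intercept (2x2 normal equations)."""
    x1, x2, y = demean(x1), demean(x2), demean(y)
    s11, s22, s12 = dot(x1, x1), dot(x2, x2), dot(x1, x2)
    s1y, s2y = dot(x1, y), dot(x2, y)
    det = s11 * s22 - s12 ** 2
    return (s22 * s1y - s12 * s2y) / det, (s11 * s2y - s12 * s1y) / det


rng = random.Random(3)
n = 5000

# Hypothetical causal chain: x1 -> x2 -> y, with no direct x1 -> y link.
x1 = [rng.gauss(0.0, 1.0) for _ in range(n)]
x2 = [2.0 * a + rng.gauss(0.0, 1.0) for a in x1]
y = [3.0 * b + rng.gauss(0.0, 1.0) for b in x2]

total_effect = simple_slope(x1, y)                       # expected near 2 * 3 = 6
direct_x1, direct_x2 = two_regressor_slopes(x1, x2, y)   # expected near 0 and 3
print(f"y on x1 alone: {total_effect:.2f}")
print(f"y on x1 and x2: x1 -> {direct_x1:.2f}, x2 -> {direct_x2:.2f}")
```

Regressing on the first variable alone attributes the whole chain to it, while including the intermediate variable as a regressor shows that the first variable has essentially no direct effect in this constructed example.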
Just as an experimenter must be careful to employ an experimental design that controls for every confounding factor, so also must the user of multiple regression be careful to control for all confounding factors by including them among the regressors. If a confounding factor is omitted from the regression, its effect is captured in the error term by default, and if the resulting error term is correlated with one (or more) of the included regressors, then the estimated regression may be biased or inconsistent (see omitted variable bias).
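The size of this bias can be written out in the simplest case. The derivation below is a sketch under assumptions not stated in the text: a true model with just two regressors, coefficients labelled $a_0, a_1, a_2$, an error term $e$ uncorrelated with both regressors, and a "short" regression (with hypothetical coefficients $b_0, b_1$ and error $u$) that omits $x_2$.

```latex
% True model (assumed): both x_1 and x_2 affect y.
y = a_0 + a_1 x_1 + a_2 x_2 + e,
\qquad \operatorname{Cov}(x_1, e) = \operatorname{Cov}(x_2, e) = 0

% Short regression: x_2 is omitted, so its effect moves into the error term.
y = b_0 + b_1 x_1 + u,
\qquad u = a_2 x_2 + e \quad (\text{up to a constant})

% Probability limit of the OLS slope in the short regression.
\operatorname{plim} \hat{b}_1
  = \frac{\operatorname{Cov}(x_1, y)}{\operatorname{Var}(x_1)}
  = a_1 + a_2 \, \frac{\operatorname{Cov}(x_1, x_2)}{\operatorname{Var}(x_1)}
```

The second term is the bias: it vanishes only if the omitted variable has no effect on ''y'' ($a_2 = 0$) or is uncorrelated with the included regressor. When $a_1 = 0$ but the second term is not, the short regression reports an effect that is entirely spurious.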
In addition to regression analysis, the data can be examined to determine if Granger causality exists. The presence of Granger causality indicates both that ''x'' precedes ''y'', and that ''x'' contains unique information about ''y''.
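One common way to operationalize this idea is an F-test that asks whether lagged values of ''x'' improve a forecast of ''y'' beyond what lagged ''y'' already provides. The sketch below is a deliberately reduced version with a single lag and hand-written least squares; the simulated process, its coefficients, and all function names are assumptions made for illustration, and a real analysis would normally use more lags and an established statistics library.

```python
import random


def demean(v):
    """Subtract the sample mean from every element."""
    m = sum(v) / len(v)
    return [a - m for a in v]


def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def rss_one(y, z):
    """Residual sum of squares from regressing y on an intercept and z."""
    y, z = demean(y), demean(z)
    b = dot(z, y) / dot(z, z)
    return sum((yi - b * zi) ** 2 for yi, zi in zip(y, z))


def rss_two(y, z1, z2):
    """Residual sum of squares from regressing y on an intercept, z1 and z2."""
    y, z1, z2 = demean(y), demean(z1), demean(z2)
    s11, s22, s12 = dot(z1, z1), dot(z2, z2), dot(z1, z2)
    s1y, s2y = dot(z1, y), dot(z2, y)
    det = s11 * s22 - s12 ** 2
    b1 = (s22 * s1y - s12 * s2y) / det
    b2 = (s11 * s2y - s12 * s1y) / det
    return sum((yi - b1 * a - b2 * b) ** 2 for yi, a, b in zip(y, z1, z2))


def granger_f(cause, effect):
    """F-statistic for adding one lag of `cause` to an AR(1) model of `effect`."""
    target, own_lag, other_lag = effect[1:], effect[:-1], cause[:-1]
    restricted = rss_one(target, own_lag)
    unrestricted = rss_two(target, own_lag, other_lag)
    dof = len(target) - 3  # intercept plus two slopes in the unrestricted model
    return (restricted - unrestricted) / (unrestricted / dof)


rng = random.Random(7)
n = 500

# Hypothetical process: past x feeds into y, but y never feeds back into x.
x = [rng.gauss(0.0, 1.0) for _ in range(n)]
y = [0.0]
for t in range(1, n):
    y.append(0.5 * y[t - 1] + 0.8 * x[t - 1] + rng.gauss(0.0, 1.0))

f_x_to_y = granger_f(x, y)
f_y_to_x = granger_f(y, x)
print(f"F (x -> y): {f_x_to_y:.1f}, F (y -> x): {f_y_to_x:.1f}")
```

With one restriction and a sample this large, the 5% critical value is roughly 3.9, so a very large statistic in the first direction and a small one in the second is the pattern expected when ''x'' helps predict ''y'' but not the reverse. Granger causality is a statement about predictive content, so a common driver that reaches ''x'' earlier than ''y'' could produce the same pattern.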
Other relationships
Several other relationships are defined in statistical analysis:
* Direct relationship
* Mediating relationship
* Moderating relationship
See also
* Causality
* Correlation does not imply causation
* Illusory correlation
* Model specification
* Omitted-variable bias
* Post hoc fallacy
* Statistical model validation
* One in ten rule
Literature
* David A. Freedman (1983) A Note on Screening Regression Equations, The American Statistician, 37:2, 152-155, DOI: 10.1080/00031305.1983.10482729
External links
* https://www.tylervigen.com/spurious-correlations, a website listing examples of spurious correlations