Anscombe's quartet comprises four datasets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (''x'', ''y'') points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough".
Data
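For all four datasets, the simple summary statistics (such as the means and variances of ''x'' and ''y'' and the correlation between them) are nearly identical, but the graphs differ: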
* The first scatter plot (top left) appears to show a simple linear relationship, corresponding to two correlated variables, where ''y'' could be modelled as Gaussian with mean linearly dependent on ''x''.
* In the second graph (top right), a relationship between the two variables is obvious, but it is not linear, and the Pearson correlation coefficient is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate.
* In the third graph (bottom left), the modelled relationship is linear, but it should have a different regression line (a robust regression would have been called for). The calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
* Finally, the fourth graph (bottom right) shows an example in which one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.
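The effect of the single unusual point in the third and fourth datasets can be checked numerically. The following is a minimal Python sketch, not part of Anscombe's article: the data values are the ones commonly published for the quartet (an assumption that should be verified against the 1973 paper), and the helper names `pearson_r` and `drop_point` are hypothetical.

```python
import math

# Datasets III and IV as commonly published (assumption: verify against the 1973 paper).
X3 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
Y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]


def pearson_r(xs, ys):
    """Return the Pearson correlation, or None when it is undefined."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    if sxx == 0 or syy == 0:
        return None  # one variable is constant, so no correlation exists
    return sxy / math.sqrt(sxx * syy)


def drop_point(xs, ys, index):
    """Return copies of xs and ys with the point at `index` removed."""
    return xs[:index] + xs[index + 1:], ys[:index] + ys[index + 1:]


# Dataset III: the outlier is the point with the largest y value.
outlier_index = Y3.index(max(Y3))
r3_all = pearson_r(X3, Y3)
r3_without = pearson_r(*drop_point(X3, Y3, outlier_index))

# Dataset IV: the high-leverage point is the only one whose x differs.
leverage_index = X4.index(max(X4))
r4_all = pearson_r(X4, Y4)
r4_without = pearson_r(*drop_point(X4, Y4, leverage_index))

if __name__ == "__main__":
    print("III, all points:     ", r3_all)
    print("III, outlier removed:", r3_without)
    print("IV, all points:      ", r4_all)
    print("IV, leverage removed:", r4_without)
```

With these values, removing the outlier from the third dataset leaves points that lie almost exactly on a line. Removing the high-leverage point from the fourth leaves every ''x'' equal, so the correlation is no longer defined at all.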
The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets.
The datasets are as follows. The ''x'' values are the same for the first three datasets.
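As a minimal sketch, the values can be written out in Python and the summary statistics recomputed from them. The numbers below are the values commonly published for the quartet and should be checked against Anscombe's 1973 article before being relied on. The names `X123`, `QUARTET`, `summarize` and `SUMMARIES` are illustrative only.

```python
import math

# x values shared by datasets I-III (assumption: as commonly published).
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]

QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}


def summarize(xs, ys):
    """Compute means, sample variances, correlation and the least-squares line."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return {
        "mean_x": mean_x,
        "var_x": sxx / (n - 1),   # sample variance (n - 1 denominator)
        "mean_y": mean_y,
        "var_y": syy / (n - 1),
        "r": sxy / math.sqrt(sxx * syy),
        "slope": slope,
        "intercept": mean_y - slope * mean_x,
    }


SUMMARIES = {name: summarize(xs, ys) for name, (xs, ys) in QUARTET.items()}

if __name__ == "__main__":
    for name, stats in SUMMARIES.items():
        print(name, {key: round(value, 3) for key, value in stats.items()})
```

Run on these values, each dataset gives a mean of ''x'' of 9, a sample variance of ''x'' of 11, a mean of ''y'' of about 7.50, a correlation of about 0.816, and a fitted line close to ''y'' = 3 + 0.5''x''. The statistics alone therefore cannot tell the four datasets apart.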
It is not known how Anscombe created his datasets.
Since its publication, several methods to generate similar datasets with identical statistics and dissimilar graphics have been developed.
One of these, the ''Datasaurus dozen'', consists of points tracing out the outline of a dinosaur, plus twelve other datasets that have the same summary statistics.
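One simple way to build a dataset with prescribed summary statistics can be sketched in Python. This is not Anscombe's method (which is unknown) and not the procedure used for the Datasaurus dozen. It is an illustrative construction that rescales ''x'', removes the linear trend from an arbitrary shape, and adds the scaled remainder to a fixed line. The function `match_statistics` and its parameters are hypothetical names, and the default targets are the quartet's approximate statistics.

```python
import math


def mean(values):
    return sum(values) / len(values)


def sample_cov(us, vs):
    """Sample covariance with an n - 1 denominator."""
    mean_u, mean_v = mean(us), mean(vs)
    return sum((u - mean_u) * (v - mean_v) for u, v in zip(us, vs)) / (len(us) - 1)


def match_statistics(x_raw, shape, mean_x=9.0, var_x=11.0,
                     intercept=3.0, slope=0.5, var_y=4.125):
    """Return (x, y) whose means, variances and fitted line hit the targets.

    `shape` is any sequence that is not an exact linear function of x_raw.
    Requires var_y > slope**2 * var_x so that some variance is left over.
    """
    # 1. Shift and scale x to the target mean and sample variance.
    raw_mean = mean(x_raw)
    raw_sd = math.sqrt(sample_cov(x_raw, x_raw))
    x = [mean_x + math.sqrt(var_x) * (v - raw_mean) / raw_sd for v in x_raw]

    # 2. Remove the linear trend from the shape, leaving a residual that has
    #    zero mean and zero covariance with x.
    trend = sample_cov(x, shape) / sample_cov(x, x)
    shape_mean = mean(shape)
    resid = [s - shape_mean - trend * (v - mean_x) for v, s in zip(x, shape)]

    # 3. Scale the residual so the line plus residual has the target variance.
    scale = math.sqrt((var_y - slope ** 2 * var_x) / sample_cov(resid, resid))
    y = [intercept + slope * v + scale * r for v, r in zip(x, resid)]
    return x, y


# Two very different shapes over the same eleven x positions.
x_raw = list(range(11))
curve_x, curve_y = match_statistics(x_raw, [(v - 5) ** 2 for v in x_raw])
zigzag_x, zigzag_y = match_statistics(x_raw, [(-1) ** v for v in x_raw])
```

Because the residual is uncorrelated with ''x'', the covariance of ''x'' and ''y'' is fixed by the chosen slope. Both generated datasets therefore share the same means, variances, correlation and regression line, while one plots as a smooth curve and the other as a zigzag.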
See also
* Datasaurus dozen
* Exploratory data analysis
* Goodness of fit
* Regression validation
* Simpson's paradox
* Statistical model validation
External links
* Department of Physics, University of Toronto
* Dynamic Applet made in GeoGebra showing the data & statistics and also allowing the points to be dragged (Set 5).
* Animated examples from Autodesk called the "Datasaurus Dozen".
* for the datasets in R.