Anscombe's quartet

Anscombe's quartet comprises four data sets that have nearly identical simple descriptive statistics, yet have very different distributions and appear very different when graphed. Each dataset consists of eleven (''x'', ''y'') points. They were constructed in 1973 by the statistician Francis Anscombe to demonstrate both the importance of graphing data when analyzing it, and the effect of outliers and other influential observations on statistical properties. He described the article as being intended to counter the impression among statisticians that "numerical calculations are exact, but graphs are rough."
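The following minimal Python sketch recomputes the summary statistics for each set, assuming the data values as they are commonly published (they should be checked against Anscombe's 1973 paper); every set yields essentially the same means and variances, a correlation of about 0.816, and a fitted line of roughly ''y'' = 3.00 + 0.500''x''.

```python
import numpy as np
from scipy import stats

# Commonly published values of Anscombe's quartet (verify against the 1973 paper).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]          # shared by sets I-III
quartet = {
    "I":   (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II":  (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV":  ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
            [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

for name, (x, y) in quartet.items():
    x, y = np.asarray(x, float), np.asarray(y, float)
    fit = stats.linregress(x, y)                        # least-squares line and Pearson r
    print(f"Set {name:>3}: mean(x)={x.mean():.2f}  var(x)={x.var(ddof=1):.2f}  "
          f"mean(y)={y.mean():.2f}  var(y)={y.var(ddof=1):.3f}  "
          f"r={fit.rvalue:.3f}  y = {fit.intercept:.2f} + {fit.slope:.3f}x")
```

Plotting the four sets side by side (one scatter plot per set) is what exposes the differences that these nearly identical numbers hide.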


Data

For all four datasets the basic summary statistics (the means and variances of ''x'' and ''y'', the correlation between them, and the fitted least-squares regression line) are nearly identical, yet the graphs tell four different stories:

* The first scatter plot (top left) appears to be a simple linear relationship, corresponding to two correlated variables where ''y'' could be modelled as Gaussian with mean linearly dependent on ''x''.
* In the second graph (top right), a relationship between the two variables is obvious, but it is not linear, and the Pearson correlation coefficient is not relevant. A more general regression and the corresponding coefficient of determination would be more appropriate.
* In the third graph (bottom left), the modelled relationship is linear, but should have a different regression line (a robust regression would have been called for). The calculated regression is offset by the one outlier, which exerts enough influence to lower the correlation coefficient from 1 to 0.816.
* Finally, the fourth graph (bottom right) shows an example in which one high-leverage point is enough to produce a high correlation coefficient, even though the other data points do not indicate any relationship between the variables.

The quartet is still often used to illustrate the importance of looking at a set of data graphically before starting to analyze it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic datasets. The ''x'' values are the same for the first three datasets (the data values are reproduced in the sketch above). It is not known how Anscombe created his datasets.

Since its publication, several methods to generate similar data sets with identical statistics and dissimilar graphics have been developed. One of these, the ''Datasaurus Dozen'', consists of points tracing out the outline of a dinosaur, plus twelve other data sets that have the same summary statistics. The ''Datasaurus Dozen'' was created by Justin Matejka and George Fitzmaurice, who describe the process in their paper "Same stats, different graphs: generating datasets with varied appearance and identical statistics through simulated annealing"; a rough sketch of that idea follows this section. Like Anscombe's quartet, the Datasaurus Dozen shows why visualizing data is important: the summary statistics can be the same while the data distributions are very different.
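The procedure described in that paper can be paraphrased as: repeatedly nudge one point at a time, reject any nudge that changes the rounded summary statistics, and otherwise prefer (but do not strictly require) moves that bring the point cloud closer to a target shape such as the dinosaur outline. The sketch below is a simplified illustration of that idea, not the authors' code; `target` is assumed to be an N×2 array of coordinates tracing the desired shape.

```python
import numpy as np

rng = np.random.default_rng(0)

def summary(x, y):
    # Statistics that must stay fixed, rounded to two decimal places.
    r = np.corrcoef(x, y)[0, 1]
    return tuple(np.round([x.mean(), x.std(ddof=1), y.mean(), y.std(ddof=1), r], 2))

def distance_to_target(x, y, target):
    # Sum of distances from each point to its nearest point on the target shape.
    pts = np.column_stack([x, y])
    d = np.linalg.norm(pts[:, None, :] - target[None, :, :], axis=2)
    return d.min(axis=1).sum()

def perturb_towards(x, y, target, iters=200_000, step=0.1, temp=0.4):
    """Nudge points toward `target` while keeping the rounded statistics fixed."""
    x, y = np.asarray(x, float).copy(), np.asarray(y, float).copy()
    fixed = summary(x, y)
    for it in range(iters):
        i = rng.integers(len(x))                     # pick one point at random
        x2, y2 = x.copy(), y.copy()
        x2[i] += rng.normal(scale=step)
        y2[i] += rng.normal(scale=step)
        if summary(x2, y2) != fixed:                 # statistics changed: reject
            continue
        closer = distance_to_target(x2, y2, target) < distance_to_target(x, y, target)
        t = temp * (1 - it / iters)                  # cooling schedule
        if closer or rng.random() < t:               # annealing-style acceptance
            x, y = x2, y2
    return x, y
```

Holding the statistics fixed only to a rounded precision is what leaves the points room to drift into the target shape while the reported summary numbers never change.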


See also

* Exploratory data analysis
* Goodness of fit
* Regression validation
* Simpson's paradox
* Statistical model validation




External links


* Dynamic applet from the Department of Physics, University of Toronto, made in GeoGebra, showing the data and statistics and allowing the points to be dragged (Set 5).
* Animated examples from Autodesk, called the "Datasaurus Dozen".

* Documentation for the datasets in R.