Datasaurus Dozen

The Datasaurus dozen comprises thirteen data sets that have nearly identical simple descriptive statistics to two decimal places, yet have very different distributions and appear very different when graphed. It was inspired by the smaller Anscombe's quartet, which was created in 1973.


Data

The following table contains summary statistics for all thirteen data sets.

The thirteen data sets were labeled as follows:

* away
* bullseye
* circle
* dino
* dots
* h_lines
* high_lines
* slant_down
* slant_up
* star
* v_line
* wide_lines
* x_shape

Like Anscombe's quartet, the Datasaurus dozen was designed to illustrate the importance of looking at a set of data graphically before analyzing it according to a particular type of relationship, and the inadequacy of basic statistical properties for describing realistic data sets.
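The idea that summary statistics can agree to two decimal places while the plots differ can be sketched in a few lines of Python. The example below is only an illustration: the function name `summary_stats`, the choice of sample standard deviation, and the toy points are assumptions, and the data are not taken from the Datasaurus dozen. The second data set is the first one reflected through its centroid, so the scatter plot is rotated by 180 degrees while the means, standard deviations and correlation are unchanged.

```python
import math

def summary_stats(xs, ys, places=2):
    """Return (mean_x, mean_y, sd_x, sd_y, pearson_r), each rounded to `places`."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products around the means.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sd_x = math.sqrt(sxx / (n - 1))  # sample standard deviation (assumed convention)
    sd_y = math.sqrt(syy / (n - 1))
    corr = sxy / math.sqrt(sxx * syy)  # Pearson correlation coefficient
    return tuple(round(v, places) for v in (mean_x, mean_y, sd_x, sd_y, corr))

# Toy data, hypothetical and NOT from the Datasaurus dozen.
xs_a = [1, 2, 3, 4, 10]
ys_a = [2, 1, 4, 3, 20]

# Reflect every point through the centroid (4, 6): a different picture,
# but the same means, standard deviations and correlation.
xs_b = [8 - x for x in xs_a]
ys_b = [12 - y for y in ys_a]

print(summary_stats(xs_a, ys_a))
print(summary_stats(xs_b, ys_b))
```

Both calls print the same tuple even though no point of the second data set coincides with a point of the first, which is why a plot is needed to tell them apart.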


Creation

The first data set, in the shape of a Tyrannosaurus, inspired the rest of the "datasaurus" data sets and was constructed in 2016 by Alberto Cairo. Maarten Lambrechts proposed that this data set also be called "Anscombosaurus".

This data set was then accompanied by twelve other data sets created by Justin Matejka and George Fitzmaurice at Autodesk. Unlike Anscombe's quartet, for which it is not known how the data sets were generated, the authors used simulated annealing to make these data sets. They made small, random changes to each point, biased towards the desired shape. Each shape took 200,000 iterations of perturbations to complete. The pseudocode for this algorithm is as follows:

    current_ds ← initial_ds
    for x iterations, do:
        test_ds ← perturb(current_ds, temp)
        if similar_enough(test_ds, initial_ds):
            current_ds ← test_ds

    function perturb(ds, temp):
        loop:
            test ← move_random_points(ds)
            if fit(test) > fit(ds) or temp > random():
                return test

where

* initial_ds is the seed data set
* current_ds is the latest version of the data set
* fit() is a function used to check whether moving the points gets closer to the desired shape
* temp is the temperature of the simulated annealing algorithm
* similar_enough() is a function that checks whether the statistics for the two given data sets are similar enough
* move_random_points() is a function that randomly moves data points
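The pseudocode above can be turned into a small runnable Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the circular target shape in `fit`, the single-point Gaussian move in `move_random_points`, the linear cooling schedule from 0.4 to 0.01, and the helper names `stats` and `anneal` are all choices made here for the example. The minimum temperature is kept above zero so that the loop in `perturb` always has a chance to terminate.

```python
import math
import random

def stats(ds, places=2):
    """Means, sample SDs and Pearson correlation of (x, y) points, rounded."""
    n = len(ds)
    mx = sum(x for x, _ in ds) / n
    my = sum(y for _, y in ds) / n
    sxx = sum((x - mx) ** 2 for x, _ in ds)
    syy = sum((y - my) ** 2 for _, y in ds)
    sxy = sum((x - mx) * (y - my) for x, y in ds)
    values = (mx, my, math.sqrt(sxx / (n - 1)), math.sqrt(syy / (n - 1)),
              sxy / math.sqrt(sxx * syy))
    return tuple(round(v, places) for v in values)

def similar_enough(test_ds, initial_ds):
    # Statistics must match the seed data set to two decimal places.
    return stats(test_ds) == stats(initial_ds)

def fit(ds, center=(50.0, 50.0), radius=30.0):
    # Assumed target shape: a circle. Higher is better, so return the
    # negated total distance of the points from the circle.
    cx, cy = center
    return -sum(abs(math.hypot(x - cx, y - cy) - radius) for x, y in ds)

def move_random_points(ds, rng, step=0.1):
    # Assumed move: nudge one randomly chosen point by Gaussian noise.
    test = list(ds)
    i = rng.randrange(len(test))
    x, y = test[i]
    test[i] = (x + rng.gauss(0, step), y + rng.gauss(0, step))
    return test

def perturb(ds, temp, rng):
    while True:
        test = move_random_points(ds, rng)
        # Accept improvements always, and worse moves with probability temp.
        if fit(test) > fit(ds) or temp > rng.random():
            return test

def anneal(initial_ds, iterations, rng, max_temp=0.4, min_temp=0.01):
    current_ds = list(initial_ds)
    for i in range(iterations):
        # Assumed linear cooling schedule from max_temp down to min_temp.
        temp = max_temp - (max_temp - min_temp) * i / iterations
        test_ds = perturb(current_ds, temp, rng)
        if similar_enough(test_ds, initial_ds):
            current_ds = test_ds
    return current_ds

if __name__ == "__main__":
    rng = random.Random(0)
    seed_ds = [(rng.uniform(0, 100), rng.uniform(0, 100)) for _ in range(30)]
    result = anneal(seed_ds, 2000, rng)
    print(stats(seed_ds), stats(result))
    print(fit(seed_ds), fit(result))
```

Because `current_ds` is only replaced when `similar_enough` holds, the returned data set has the same rounded statistics as the seed by construction. How far the points move towards the circle depends on the iteration count, the step size and the random seed.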


See also

* Exploratory data analysis
* Goodness of fit
* Regression validation
* Simpson's paradox
* Statistical model validation
* Anscombe's quartet


External links


* Animated examples from Autodesk for the Datasaurus Dozen datasets
* datasauRus: datasets from the Datasaurus Dozen in R
* The Datasaurus Dozen in CSV and tab-delimited files: https://www.openintro.org/data/index.php?data=datasaurus