Explorative Data Analysis

In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but primarily EDA is for seeing what the data can tell us beyond formal modeling, and it thereby contrasts with traditional hypothesis testing. Exploratory data analysis has been promoted by John Tukey since 1970 to encourage statisticians to explore the data, and possibly formulate hypotheses that could lead to new data collection and experiments. EDA is different from initial data analysis (IDA), which focuses more narrowly on checking assumptions required for model fitting and hypothesis testing, handling missing values, and making transformations of variables as needed. EDA encompasses IDA.


Overview

Tukey defined data analysis in 1961 as: "Procedures for analyzing data, techniques for interpreting the results of such procedures, ways of planning the gathering of data to make its analysis easier, more precise or more accurate, and all the machinery and results of (mathematical) statistics which apply to analyzing data."

Tukey's championing of EDA encouraged the development of statistical computing packages, especially S at Bell Labs. The S programming language inspired the systems S-PLUS and R. This family of statistical-computing environments featured vastly improved dynamic visualization capabilities, which allowed statisticians to identify outliers, trends and patterns in data that merited further study.

Tukey's EDA was related to two other developments in statistical theory: robust statistics and nonparametric statistics, both of which tried to reduce the sensitivity of statistical inferences to errors in formulating statistical models. Tukey promoted the use of the five-number summary of numerical data (the two extremes, maximum and minimum; the median; and the quartiles) because the median and quartiles, being functions of the empirical distribution, are defined for all distributions, unlike the mean and standard deviation; moreover, the quartiles and median are more robust to skewed or heavy-tailed distributions than traditional summaries (the mean and standard deviation). The packages S, S-PLUS, and R included routines using resampling statistics, such as Quenouille and Tukey's jackknife and Efron's bootstrap, which are nonparametric and robust (for many problems).

Exploratory data analysis, robust statistics, nonparametric statistics, and the development of statistical programming languages facilitated statisticians' work on scientific and engineering problems. Such problems included the fabrication of semiconductors and the understanding of communications networks, which concerned Bell Labs. These statistical developments, all championed by Tukey, were designed to complement the analytic theory of testing statistical hypotheses, particularly the Laplacian tradition's emphasis on exponential families.
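
The robustness argument for the five-number summary can be illustrated with a short Python sketch. The data values and the function name `five_number_summary` are made up for illustration, and the quartiles use the "inclusive" convention of Python's `statistics.quantiles`; other quartile conventions give slightly different values.

```python
import statistics


def five_number_summary(values):
    """Return (minimum, lower quartile, median, upper quartile, maximum)."""
    data = sorted(values)
    # "inclusive" is one of several common quartile conventions.
    q1, q2, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return data[0], q1, q2, q3, data[-1]


# Hypothetical sample, and the same sample with one extreme value swapped in.
clean = [2, 3, 4, 5, 6, 7, 8, 9, 10]
contaminated = clean[:-1] + [1000]

for name, sample in (("clean", clean), ("contaminated", contaminated)):
    print(name)
    print("  five-number summary:", five_number_summary(sample))
    print("  mean:", statistics.mean(sample))
    print("  standard deviation:", statistics.stdev(sample))
```

In this sketch a single extreme value changes only the maximum of the five-number summary, while the mean and standard deviation both shift substantially.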


Development

John W. Tukey wrote the book ''Exploratory Data Analysis'' in 1977. Tukey held that too much emphasis in statistics was placed on statistical hypothesis testing (confirmatory data analysis); more emphasis needed to be placed on using data to suggest hypotheses to test. In particular, he held that confusing the two types of analyses and employing them on the same set of data can lead to systematic bias owing to the issues inherent in testing hypotheses suggested by the data.

The objectives of EDA are to:
* Enable unexpected discoveries in the data
* Suggest hypotheses about the causes of observed phenomena
* Assess assumptions on which statistical inference will be based
* Support the selection of appropriate statistical tools and techniques
* Provide a basis for further data collection through surveys or experiments

Many EDA techniques have been adopted into data mining. They are also being taught to young students as a way to introduce them to statistical thinking.


Techniques and tools

There are a number of tools that are useful for EDA, but EDA is characterized more by the attitude taken than by particular techniques.

Typical graphical techniques used in EDA are:
* Box plot
* Histogram
* Multi-vari chart
* Run chart
* Pareto chart
* Scatter plot (2D/3D)
* Stem-and-leaf plot
* Parallel coordinates
* Odds ratio
* Targeted projection pursuit
* Heat map
* Bar chart
* Horizon graph
* Glyph-based visualization methods such as PhenoPlot and Chernoff faces
* Projection methods such as grand tour, guided tour and manual tour
* Interactive versions of these plots

Dimensionality reduction:
* Multidimensional scaling
* Principal component analysis (PCA)
* Multilinear PCA
* Nonlinear dimensionality reduction (NLDR)
* Iconography of correlations

Typical quantitative techniques are:
* Median polish
* Trimean
* Ordination
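
Two of the quantitative techniques listed above, the trimean and median polish, are simple enough to sketch in a few lines of Python. This is a minimal illustration rather than a reference implementation: the function names, the fixed number of sweeps, the "inclusive" quartile convention, and the small two-way table are all assumptions made for the example.

```python
import statistics


def trimean(values):
    """Tukey's trimean: (Q1 + 2 * median + Q3) / 4."""
    q1, q2, q3 = statistics.quantiles(sorted(values), n=4, method="inclusive")
    return (q1 + 2 * q2 + q3) / 4


def median_polish(table, n_sweeps=10):
    """Fit table[i][j] ~ overall + row_eff[i] + col_eff[j] by sweeping out medians.

    Returns (overall, row_eff, col_eff, residuals).
    """
    residuals = [[float(x) for x in row] for row in table]
    n_rows, n_cols = len(residuals), len(residuals[0])
    overall = 0.0
    row_eff = [0.0] * n_rows
    col_eff = [0.0] * n_cols
    for _ in range(n_sweeps):
        # Row sweep: move each row's median into its row effect.
        for r in range(n_rows):
            m = statistics.median(residuals[r])
            row_eff[r] += m
            residuals[r] = [x - m for x in residuals[r]]
        # Move the median of the column effects into the overall term.
        m = statistics.median(col_eff)
        overall += m
        col_eff = [c - m for c in col_eff]
        # Column sweep: move each column's median into its column effect.
        for c in range(n_cols):
            m = statistics.median(residuals[r][c] for r in range(n_rows))
            col_eff[c] += m
            for r in range(n_rows):
                residuals[r][c] -= m
        # Move the median of the row effects into the overall term.
        m = statistics.median(row_eff)
        overall += m
        row_eff = [e - m for e in row_eff]
    return overall, row_eff, col_eff, residuals


# Hypothetical two-way table that is exactly additive (10 + row + column).
table = [
    [7, 9, 11],
    [8, 10, 12],
    [9, 11, 13],
]
overall, row_eff, col_eff, residuals = median_polish(table)
print("overall:", overall)
print("row effects:", row_eff)
print("column effects:", col_eff)
print("trimean of 2..10:", trimean(range(2, 11)))
```

Because the example table is exactly additive, the residuals come out as zero; with real data the residuals are what an analyst would inspect for structure the additive fit misses.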


History

Many EDA ideas can be traced back to earlier authors, for example:
* Francis Galton emphasized order statistics and quantiles.
* Arthur Lyon Bowley used precursors of the stemplot and five-number summary (Bowley actually used a "seven-figure summary", including the extremes, deciles and quartiles, along with the median; see his ''Elementary Manual of Statistics'' (3rd edn., 1920), p. 62, where he defines "the maximum and minimum, median, quartiles and two deciles" as the "seven positions").
* Andrew Ehrenberg articulated a philosophy of data reduction (see his book of the same name).

The Open University course ''Statistics in Society'' (MDST 242) took the above ideas and merged them with Gottfried Noether's work, which introduced statistical inference via coin-tossing and the median test.


Example

Findings from EDA are orthogonal to the primary analysis task. To illustrate, consider an example from Cook et al. where the analysis task is to find the variables which best predict the tip that a dining party will give to the waiter (Cook, D. and Swayne, D.F., with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence (2007), ''Interactive and Dynamic Graphics for Data Analysis: With R and GGobi'', Springer, ISBN 978-0387717616). The variables available in the data collected for this task are: the tip amount, total bill, payer gender, smoking/non-smoking section, time of day, day of the week, and size of the party. The primary analysis task is approached by fitting a regression model where the tip rate is the response variable. The fitted model is:

(tip rate) = 0.18 - 0.01 × (party size)

which says that as the size of the dining party increases by one person (leading to a higher bill), the tip rate will decrease by 0.01 (one percentage point), on average.

However, exploring the data reveals other interesting features not described by this model:
* Tips-hist1.png: Histogram of tip amounts where the bins cover $1 increments. The distribution of values is skewed right and unimodal, as is common in distributions of small, non-negative quantities.
* Tips-hist2.png: Histogram of tip amounts where the bins cover $0.10 increments. An interesting phenomenon is visible: peaks occur at the whole-dollar and half-dollar amounts, which is caused by customers picking round numbers as tips. This behavior is common to other types of purchases too, like gasoline.
* Tips-scat1.png: Scatterplot of tips vs. bill. Points below the line correspond to tips that are lower than expected (for that bill amount), and points above the line are higher than expected. We might expect to see a tight, positive linear association, but instead see variation that increases with tip amount. In particular, there are more points far away from the line in the lower right than in the upper left, indicating that more customers are very cheap than very generous.
* Tips-scat2.png: Scatterplot of tips vs. bill separated by payer gender and smoking section status. Smoking parties have a lot more variability in the tips that they give. Males tend to pay the (few) higher bills, and the female non-smokers tend to be very consistent tippers (with three conspicuous exceptions shown in the sample).

What is learned from the plots is different from what is illustrated by the regression model, even though the experiment was not designed to investigate any of these other trends. The patterns found by exploring the data suggest hypotheses about tipping that may not have been anticipated in advance, and which could lead to interesting follow-up experiments where the hypotheses are formally stated and tested by collecting new data.
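
As a rough sketch of two ideas from this example, the Python snippet below evaluates the fitted model quoted in the text and bins a handful of tip amounts at two bin widths. The tip values are invented for illustration and are not the Cook et al. data; the function names `predicted_tip_rate` and `bin_counts` are likewise hypothetical. Amounts are kept in integer cents so that bin edges are exact.

```python
from collections import Counter


def predicted_tip_rate(party_size):
    """Fitted model quoted in the text: tip rate = 0.18 - 0.01 * party size."""
    return 0.18 - 0.01 * party_size


def bin_counts(amounts_cents, width_cents):
    """Count values per bin; each bin is keyed by its left edge in cents."""
    return Counter((a // width_cents) * width_cents for a in amounts_cents)


# Hypothetical tip amounts in cents (NOT the data set used in the source).
tips_cents = [100, 150, 200, 200, 200, 215, 250, 300, 300, 348, 350, 400, 500]

coarse = bin_counts(tips_cents, 100)  # $1 bins
fine = bin_counts(tips_cents, 10)     # $0.10 bins

for size in (1, 2, 4, 6):
    print(f"party of {size}: predicted tip rate {predicted_tip_rate(size):.2f}")

print("$1 bins:   ", sorted(coarse.items()))
print("$0.10 bins:", sorted(fine.items()))
```

With the coarse bins the counts form a single hump, while the fine bins separate round amounts such as $2.00 and $3.00 from their neighbors, which is the kind of detail the second histogram in the example brings out.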


Software

* JMP, an EDA package from SAS Institute.
* KNIME, Konstanz Information Miner: open-source data exploration platform based on Eclipse.
* Minitab, an EDA and general statistics package widely used in industrial and corporate settings.
* Orange, an open-source data mining and machine learning software suite.
* Python, an open-source programming language widely used in data mining and machine learning.
* R, an open-source programming language for statistical computing and graphics. Together with Python, one of the most popular languages for data science.
* TinkerPlots, EDA software for upper elementary and middle school students.
* Weka, an open-source data mining package that includes visualization and EDA tools such as targeted projection pursuit.


See also

* Anscombe's quartet, on importance of exploration
* Data dredging
* Predictive analytics
* Structured data analysis (statistics)
* Configural frequency analysis
* Descriptive statistics


Bibliography

* Andrienko, N. & Andrienko, G. (2005). ''Exploratory Analysis of Spatial and Temporal Data. A Systematic Approach''. Springer. ISBN 3-540-25994-5.
* Cook, D. and Swayne, D.F. (with A. Buja, D. Temple Lang, H. Hofmann, H. Wickham, M. Lawrence) (2007). ''Interactive and Dynamic Graphics for Data Analysis: With R and GGobi''. Springer. ISBN 9780387717616.
* Hoaglin, D. C.; Mosteller, F. & Tukey, John Wilder (Eds) (1985). ''Exploring Data Tables, Trends and Shapes''. ISBN 978-0-471-09776-1.
* Hoaglin, D. C.; Mosteller, F. & Tukey, John Wilder (Eds) (1983). ''Understanding Robust and Exploratory Data Analysis''. ISBN 978-0-471-09777-8.
* Young, F. W., Valero-Mora, P. and Friendly, M. (2006). ''Visual Statistics: Seeing your data with Dynamic Interactive Graphics''. Wiley. ISBN 978-0-471-68160-1.
* Jambu, M. (1991). ''Exploratory and Multivariate Data Analysis''. Academic Press. ISBN 0123800900.
* DuToit, S. H. C., Steyn, A. G. W., Stumpf, R. H. (1986). ''Graphical Exploratory Data Analysis''. Springer. ISBN 978-1-4612-9371-2.
* Leinhardt, G., Leinhardt, S. (1980). ''Exploratory Data Analysis: New Tools for the Analysis of Empirical Data''. Review of Research in Education, Vol. 8, pp. 85–157.
* Theus, M., Urbanek, S. (2008). ''Interactive Graphics for Data Analysis: Principles and Examples''. CRC Press, Boca Raton, FL.


External links


Carnegie Mellon University – free online course on Probability and Statistics, with a module on EDA

