Data Exploration
   HOME

TheInfoList



OR:

Data exploration is an approach similar to initial
data analysis Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, enco ...
, whereby a
data analyst Data analysis is a process of inspecting, cleansing, transforming, and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making. Data analysis has multiple facets and approaches, enc ...
uses visual exploration to understand what is in a dataset and the characteristics of the data, rather than through traditional data management systems.FOSTER Open Science
Overview of Data Exploration Techniques: Stratos Idreos, Olga Papaemmonouil, Surajit Chaudhuri.
These characteristics can include size or amount of data, completeness of the data, correctness of the data, possible relationships amongst data elements or files/tables in the data. Data exploration is typically conducted using a combination of automated and manual activities.Stanford.edu
2011 Wrangler: Interactive Visual Specification of Data Transformation Scripts, Kandel, Paepcke, Hellerstein Heer.
Automated activities can include
data profiling Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data. The purpose of these statistics may be to: # Find ou ...
or
data visualization Data and information visualization (data viz or info viz) is an interdisciplinary field that deals with the graphic representation of data and information. It is a particularly efficient way of communicating when the data or information is num ...
or tabular reports to give the analyst an initial view into the data and an understanding of key characteristics. This is often followed by manual drill-down or filtering of the data to identify anomalies or patterns identified through the automated actions. Data exploration can also require manual scripting and queries into the data (e.g. using languages such as SQL or R) or using
spreadsheet A spreadsheet is a computer application for computation, organization, analysis and storage of data in tabular form. Spreadsheets were developed as computerized analogs of paper accounting worksheets. The program operates on data entered in cel ...
s or similar tools to view the
raw data Raw data, also known as primary data, are ''data'' (e.g., numbers, instrument readings, figures, etc.) collected from a source. In the context of examinations, the raw data might be described as a raw score (after test scores). If a scientist ...
.Stanford.edu
IEEE Visual Analytics Science & Technology (VAST), Oct 2012 Enterprise Data Analysis and Visualization: An Interview Study., Sean Kandel, Andreas Paepcke, Joseph Hellerstein, Jeffrey Heer Proc.
All of these activities are aimed at creating a mental model and understanding of the data in the mind of the analyst, and defining basic
metadata Metadata is "data that provides information about other data", but not the content of the data, such as the text of a message or the image itself. There are many distinct types of metadata, including: * Descriptive metadata – the descriptive ...
(statistics, structure, relationships) for the data set that can be used in further analysis. Once this initial understanding of the data is had, the data can be pruned or refined by removing unusable parts of the data (
data cleansing Data cleansing or data cleaning is the process of detecting and correcting (or removing) corrupt or inaccurate records from a record set, table, or database and refers to identifying incomplete, incorrect, inaccurate or irrelevant parts of the dat ...
), correcting poorly formatted elements and defining relevant relationships across datasets. This process is also known as determining
data quality Data quality refers to the state of qualitative or quantitative pieces of information. There are many definitions of data quality, but data is generally considered high quality if it is "fit for tsintended uses in operations, decision making and ...
. Data exploration can also refer to the ad hoc querying or visualization of data to identify potential relationships or insights that may be hidden in the data and does not require to formulate assumptions beforehand. Traditionally, this had been a key area of focus for statisticians, with
John Tukey John Wilder Tukey (; June 16, 1915 – July 26, 2000) was an American mathematician and statistician, best known for the development of the fast Fourier Transform (FFT) algorithm and box plot. The Tukey range test, the Tukey lambda distributi ...
being a key evangelist in the field.Exploratory Data Analysis
Pearson.
Today, data exploration is more widespread and is the focus of data analysts and
data scientists Data science is an interdisciplinary field that uses scientific methods, processes, algorithms and systems to extract or extrapolate knowledge and insights from noisy, structured and unstructured data, and apply knowledge from data across a ...
; the latter being a relatively new role within enterprises and larger organizations.


Interactive Data Exploration

This area of data exploration has become an area of interest in the field of
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
. This is a relatively new field and is still evolving. As its most basic level, a machine-learning algorithm can be fed a data set and can be used to identify whether a hypothesis is true based on the dataset. Common machine learning algorithms can focus on identifying specific patterns in the data. Many common patterns include regression and
classification Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood. Classification is the grouping of related facts into classes. It may also refer to: Business, organizat ...
or clustering, but there are many possible patterns and algorithms that can be applied to data via machine learning. By employing machine learning, it is possible to find patterns or relationships in the data that would be difficult or impossible to find via manual inspection, trial and error or traditional exploration techniques.Machine Learning for Data Exploration
/ref>


Software

*
Trifacta Trifacta is a privately owned software company headquartered in San Francisco with offices in Bengaluru, Boston, Berlin and London. The company was founded in October 2012 and primarily develops data wrangling software for data exploration and s ...
– a data preparation and analysis platform * Paxata – self-service data preparation software *
Alteryx Alteryx is an American computer software company based in Irvine, California, with a development center in Broomfield, Colorado. The company's products are used for data science and analytics. The software is designed to make advanced analytic ...
– data blending and advanced data analytics software * Microsoft
Power BI Power BI is an interactive data visualization software product developed by Microsoft with a primary focus on business intelligence. It is part of the Microsoft Power Platform. Power BI is a collection of software services, apps, and connectors t ...
- interactive visualization and data analysis tool *
OpenRefine OpenRefine is an open-source desktop application for data cleanup and transformation to other formats, an activity commonly known as data wrangling. It is similar to spreadsheet applications, and can handle spreadsheet file formats such as CSV, ...
- a standalone open source desktop application for data clean-up and data transformation *
Tableau software Tableau Software ( ) is an American interactive data visualization software company focused on business intelligence. It was founded in 2003 in Mountain View, California, and is currently headquartered in Seattle, Washington. In 2019 the comp ...
– interactive data visualization software


See also

*
Exploratory data analysis In statistics, exploratory data analysis (EDA) is an approach of analyzing data sets to summarize their main characteristics, often using statistical graphics and other data visualization methods. A statistical model can be used or not, but pr ...
*
Machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
*
Data profiling Data profiling is the process of examining the data available from an existing information source (e.g. a database or a file) and collecting statistics or informative summaries about that data. The purpose of these statistics may be to: # Find ou ...
*
Data visualization Data and information visualization (data viz or info viz) is an interdisciplinary field that deals with the graphic representation of data and information. It is a particularly efficient way of communicating when the data or information is num ...


References

{{reflist Machine learning Data analysis Data management Data quality