A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, such as the height and weight of an object, for each member of the data set. Data sets can also consist of a collection of documents or files.
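As a minimal sketch of this layout, the Python fragment below stores a small table as a list of records. The variable names (`height_cm`, `weight_kg`) and all values are invented for illustration and do not come from any real data set.

```python
# Hypothetical tabular data set: each dict is one row (a record),
# and each key is one column (a variable).
rows = [
    {"object": "A", "height_cm": 12.0, "weight_kg": 1.5},
    {"object": "B", "height_cm": 30.5, "weight_kg": 4.2},
    {"object": "C", "height_cm": 18.0, "weight_kg": 2.0},
]


def column(rows, name):
    """Return the values of one variable, in row order."""
    return [row[name] for row in rows]


heights = column(rows, "height_cm")  # one value per member of the data set
```

Reading along a row gives every variable for one member; reading along a column gives one variable for every member.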
In the open data discipline, the data set is the unit used to measure the information released in a public open data repository. The European data.europa.eu portal aggregates more than a million data sets. Other issues, such as real-time data sources and non-relational data sets, make it harder to reach a consensus on how a data set should be defined.
Properties
Several characteristics define a data set's structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.
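The two measures named above can be computed for a single numeric variable as sketched below. This is only one convention: the sketch uses population moments and reports kurtosis as the fourth standardized moment, whereas some software subtracts 3 (excess kurtosis) or applies a sample correction. The `heights_cm` values are made up.

```python
import statistics


def kurtosis(values):
    """Fourth standardized moment, using population (divide-by-n) moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2


heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]  # hypothetical variable

sd = statistics.pstdev(heights_cm)  # population standard deviation
k = kurtosis(heights_cm)            # 1.7 for these five evenly spaced values
```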
The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values are normally all of the same kind. However, there may also be ''missing values'', which must be indicated in some way.
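One possible way to indicate missing values, assumed here purely for illustration, is to store `None` in place of the value. The sketch below then checks that each variable holds values of a single kind and counts the gaps. The variables `height_cm` and `blood_type` are hypothetical.

```python
# Hypothetical records: one numeric variable, one nominal variable.
# None is the (assumed) marker for a missing value.
rows = [
    {"height_cm": 172.5, "blood_type": "A"},
    {"height_cm": None, "blood_type": "O"},
    {"height_cm": 181.0, "blood_type": None},
]


def column_kinds(rows):
    """Map each variable to the set of value types seen, ignoring missing values."""
    kinds = {}
    for row in rows:
        for name, value in row.items():
            kinds.setdefault(name, set())
            if value is not None:
                kinds[name].add(type(value).__name__)
    return kinds


def missing_counts(rows):
    """Count the missing (None) entries for each variable."""
    counts = {}
    for row in rows:
        for name, value in row.items():
            counts[name] = counts.get(name, 0) + (value is None)
    return counts


kinds = column_kinds(rows)
missing = missing_counts(rows)
```

A variable whose set in `kinds` contains more than one type would break the "all of the same kind" expectation and is worth inspecting.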
In statistics, data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may also be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software, such as SPSS, still presents its data in the classical data set fashion. If data is missing or suspicious, an imputation method may be used to complete a data set.
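Mean imputation is one of the simplest imputation methods and is shown here only as an example of the idea; the text does not say which method a given package uses. The sketch assumes missing values are marked with `None`, and the `weights_kg` values are invented.

```python
import statistics


def impute_mean(values):
    """Replace each missing (None) entry with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute a variable with no observed values")
    fill = statistics.mean(observed)
    return [fill if v is None else v for v in values]


weights_kg = [60.0, None, 80.0, 70.0]  # hypothetical variable with one gap
completed = impute_mean(weights_kg)    # the gap is filled with 70.0
```

Filling with the mean keeps the variable's average unchanged but shrinks its variance, which is one reason more elaborate methods exist.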
Classic data sets
Several classic data sets have been used extensively in the statistical literature:
* Iris flower data set – Multivariate data set introduced by Ronald Fisher (1936).
* MNIST database – Images of handwritten digits commonly used to test classification, clustering, and image processing algorithms.
* ''Categorical data analysis'' – Data sets used in the book ''An Introduction to Categorical Data Analysis''.
* ''Robust statistics'' – Data sets used in ''Robust Regression and Outlier Detection'' (Rousseeuw and Leroy, 1987), provided on-line at the University of Cologne.
* ''Time series'' – Data used in Chatfield's book ''The Analysis of Time Series'' are provided on-line by StatLib.
* ''Extreme values'' – Data used in the book ''An Introduction to the Statistical Modeling of Extreme Values'' are a snapshot of the data as it was provided on-line by Stuart Coles, the book's author.
* ''Bayesian Data Analysis'' – Data used in the book are provided on-line by Andrew Gelman, one of the book's authors.
* The Bupa liver data – Used in several papers in the machine learning (data mining) literature.
* Anscombe's quartet – Small data set illustrating the importance of graphing the data to avoid statistical fallacies.
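The point made by Anscombe's quartet can be checked numerically. The sketch below uses the values as they are commonly reproduced for the quartet; they are transcribed here as an assumption and should be verified against Anscombe's original 1973 paper before any serious use.

```python
import statistics

# Values as commonly published for Anscombe's quartet (transcribed; verify before reuse).
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]  # shared by sets I, II and III
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]
y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
y4 = [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]

quartet = {"I": (x123, y1), "II": (x123, y2), "III": (x123, y3), "IV": (x4, y4)}

# Simple descriptive statistics for each of the four data sets.
summary = {
    name: {
        "mean_x": statistics.mean(x),
        "var_x": statistics.variance(x),  # sample variance
        "mean_y": statistics.mean(y),
        "var_y": statistics.variance(y),
    }
    for name, (x, y) in quartet.items()
}

for name, stats in summary.items():
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The summaries come out nearly the same for all four sets even though set IV, for example, has ten identical x values and a single outlier, a difference that only a plot makes obvious.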
See also
* Data
* Data blending
* Data (computing)
* Data samples
* Data store
* Interoperability
* Data collection system
External links
* Data.gov – the U.S. Government's open data
* GCMD – the Global Change Master Directory containing over 34,000 descriptions of Earth science and environmental science data sets and services
* Humanitarian Data Exchange (HDX) – an open humanitarian data sharing platform managed by the United Nations Office for the Coordination of Humanitarian Affairs
* NYC Open Data – free public data published by New York City agencies and other partners
* Relational data set repository
* Research Pipeline – a wiki/website with links to data sets on many different topics
* StatLib–JASA Data Archive
* UCI – a machine learning repository
* UK Government Public Data
* World Bank Open Data – free and open access to global development data by the World Bank