A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, for example the height and weight of an object, for each member of the data set. Data sets can also consist of a collection of documents or files.
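As an illustration only (a minimal sketch in Python, with invented variable names and values), a small tabular data set can be represented as a list of records, where each key names a variable and each entry is one row:

# A tiny tabular data set: each dict is one record (row),
# and each key is a variable (column).
data_set = [
    {"name": "A", "height_cm": 172.0, "weight_kg": 68.5},
    {"name": "B", "height_cm": 181.5, "weight_kg": 77.0},
    {"name": "C", "height_cm": 165.2, "weight_kg": 59.3},
]

# Every record lists a value for each of the variables.
for record in data_set:
    print(record["name"], record["height_cm"], record["weight_kg"])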
In the open data discipline, the data set is the unit used to measure the information released in a public open data repository. The European data.europa.eu portal aggregates more than a million data sets.
Some other issues, such as real-time data sources and non-relational data sets, increase the difficulty of reaching a consensus about the definition.
Properties
Several characteristics define a data set's structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.
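For example, such measures can be computed directly from a variable's values. The following sketch uses made-up numbers and assumes the SciPy library is available for the kurtosis calculation:

import statistics
from scipy.stats import kurtosis  # assumes SciPy is installed

heights_cm = [172.0, 181.5, 165.2, 176.3, 169.8, 174.1]

# Sample standard deviation: how spread out the values are around their mean.
print(statistics.stdev(heights_cm))

# Kurtosis: the "tailedness" of the distribution
# (Fisher's definition, so a normal distribution scores 0).
print(kurtosis(heights_cm))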
The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values are normally all of the same kind. However, there may also be ''missing values'', which must be indicated in some way.
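To illustrate (a sketch with hypothetical records), one variable below holds numeric values, another holds nominal values, and None stands in for a missing value:

# "height_cm" is numeric; "ethnicity" is nominal (non-numerical).
# None marks a missing value, which must be indicated in some agreed way
# (CSV files often use an empty field or "NA" instead).
records = [
    {"height_cm": 172.0, "ethnicity": "group A"},
    {"height_cm": None,  "ethnicity": "group B"},  # missing height
    {"height_cm": 165.2, "ethnicity": None},       # missing ethnicity
]

# Keep only the records with no missing values.
complete = [r for r in records if None not in r.values()]
print(len(complete))  # 1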
In statistics, data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software, such as SPSS, still presents data in the classical data set fashion. If data is missing or suspicious, an imputation method may be used to complete the data set.
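As a simple illustration of imputation (mean imputation only, one of many possible methods, with invented values), missing entries can be replaced by the mean of the observed ones:

import statistics

heights_cm = [172.0, None, 165.2, 181.5, None]

# Mean imputation: fill each missing value with the mean of the observed values.
observed = [h for h in heights_cm if h is not None]
mean_height = statistics.fmean(observed)
completed = [h if h is not None else mean_height for h in heights_cm]
print(completed)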
Classic data sets
Several classic data sets have been used extensively in the
statistical literature:
* Iris flower data set – Multivariate data set introduced by Ronald Fisher (1936); a sketch of loading it programmatically appears after this list.
* MNIST database – Images of handwritten digits commonly used to test classification, clustering, and image processing algorithms.
* ''Categorical data analysis'' – Data sets used in the book, ''An Introduction to Categorical Data Analysis''.
* ''Robust statistics'' – Data sets used in ''Robust Regression and Outlier Detection'' (Rousseeuw and Leroy, 1987), provided on-line at the University of Cologne.
* ''Time series'' – Data used in Chatfield's book, ''The Analysis of Time Series'', are provided on-line by StatLib.
* ''Extreme values'' – Data used in the book, ''An Introduction to the Statistical Modeling of Extreme Values'', are a snapshot of the data as it was provided on-line by Stuart Coles, the book's author.
* ''Bayesian Data Analysis'' – Data used in the book are provided on-line by Andrew Gelman, one of the book's authors.
* The Bupa liver data – Used in several papers in the machine learning (data mining) literature.
* Anscombe's quartet – Small data set illustrating the importance of graphing the data to avoid statistical fallacies.
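As a sketch of how such a classic data set is typically accessed in practice (assuming the scikit-learn library, which bundles a copy of Fisher's Iris data), the rows-and-columns structure described above is visible directly:

from sklearn.datasets import load_iris  # assumes scikit-learn is installed

iris = load_iris()
print(iris.data.shape)       # (150, 4): 150 records, 4 variables
print(iris.feature_names)    # the four measured variables (sepal/petal sizes)
print(iris.target_names)     # the three iris species (class labels)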
See also
* Data
* Data blending
* Data (computing)
* Data samples
* Data store
* Interoperability
* Data collection system
References
External links
* Data.gov – the U.S. Government's open data
* GCMD – the Global Change Master Directory, containing over 34,000 descriptions of Earth science and environmental science data sets and services
* Humanitarian Data Exchange (HDX) – an open humanitarian data sharing platform managed by the United Nations Office for the Coordination of Humanitarian Affairs (OCHA)
* NYC Open Data – free public data published by New York City agencies and other partners
* Relational data set repository
* Research Pipeline – a wiki/website with links to data sets on many different topics
* StatLib – JASA Data Archive
* UCI – a machine learning repository
* UK Government Public Data
* World Bank Open Data – free and open access to global development data by the World Bank