A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, for example the height and weight of an object, for each member of the data set. Data sets can also consist of a collection of documents or files.
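As an illustration only (a minimal sketch in Python, with invented variable names and values), a small tabular data set can be represented as a list of records, where each key names a variable and each entry is one row:

# A tiny tabular data set: each dict is one record (row),
# and each key is a variable (column).
data_set = [
    {"name": "A", "height_cm": 172.0, "weight_kg": 68.5},
    {"name": "B", "height_cm": 181.5, "weight_kg": 77.0},
    {"name": "C", "height_cm": 165.2, "weight_kg": 59.3},
]

# Every record lists a value for each of the variables.
for record in data_set:
    print(record["name"], record["height_cm"], record["weight_kg"])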
In the open data discipline, the data set is the unit used to measure the information released in a public open data repository. The European data.europa.eu portal aggregates more than a million data sets.
Some other issues, such as real-time data sources and non-relational data sets, increase the difficulty of reaching a consensus about the definition.
Properties
Several characteristics define a data set's structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.
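For example, such measures can be computed directly from a variable's values. The following sketch uses made-up numbers and assumes the SciPy library is available for the kurtosis calculation:

import statistics
from scipy.stats import kurtosis  # assumes SciPy is installed

heights_cm = [172.0, 181.5, 165.2, 176.3, 169.8, 174.1]

# Sample standard deviation: how spread out the values are around their mean.
print(statistics.stdev(heights_cm))

# Kurtosis: the "tailedness" of the distribution
# (Fisher's definition, so a normal distribution scores 0).
print(kurtosis(heights_cm))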
The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values are normally all of the same kind. However, there may also be ''missing values'', which must be indicated in some way.
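To illustrate (a sketch with hypothetical records), one variable below holds numeric values, another holds nominal values, and None stands in for a missing value:

# "height_cm" is numeric; "ethnicity" is nominal (non-numerical).
# None marks a missing value, which must be indicated in some agreed way
# (CSV files often use an empty field or "NA" instead).
records = [
    {"height_cm": 172.0, "ethnicity": "group A"},
    {"height_cm": None,  "ethnicity": "group B"},  # missing height
    {"height_cm": 165.2, "ethnicity": None},       # missing ethnicity
]

# Keep only the records with no missing values.
complete = [r for r in records if None not in r.values()]
print(len(complete))  # 1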
In statistics, data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may further be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software, such as SPSS, still presents data in the classical data set fashion. If data is missing or suspicious, an imputation method may be used to complete the data set.
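As a simple illustration of imputation (mean imputation only, one of many possible methods, with invented values), missing entries can be replaced by the mean of the observed ones:

import statistics

heights_cm = [172.0, None, 165.2, 181.5, None]

# Mean imputation: fill each missing value with the mean of the observed values.
observed = [h for h in heights_cm if h is not None]
mean_height = statistics.fmean(observed)
completed = [h if h is not None else mean_height for h in heights_cm]
print(completed)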
Classic data sets
Several classic data sets have been used extensively in the
statistical literature:
* Iris flower data set – Multivariate data set introduced by Ronald Fisher (1936); a sketch of loading it programmatically appears after this list.
* MNIST database – Images of handwritten digits commonly used to test classification, clustering, and image processing algorithms.
* ''Categorical data analysis'' – Data sets used in the book, ''An Introduction to Categorical Data Analysis''.
* ''Robust statistics'' – Data sets used in ''Robust Regression and Outlier Detection'' (Rousseeuw and Leroy, 1987), provided on-line at the University of Cologne.
* ''Time series'' – Data used in Chatfield's book, ''The Analysis of Time Series'', are provided on-line by StatLib.
* ''Extreme values'' – Data used in the book, ''An Introduction to the Statistical Modeling of Extreme Values'', are a snapshot of the data as it was provided on-line by Stuart Coles, the book's author.
* ''Bayesian Data Analysis'' – Data used in the book are provided on-line by Andrew Gelman, one of the book's authors.
* The Bupa liver data – Used in several papers in the machine learning (data mining) literature.
* Anscombe's quartet – Small data set illustrating the importance of graphing the data to avoid statistical fallacies.
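As a sketch of how such a classic data set is typically accessed in practice (assuming the scikit-learn library, which bundles a copy of Fisher's Iris data), the rows-and-columns structure described above is visible directly:

from sklearn.datasets import load_iris  # assumes scikit-learn is installed

iris = load_iris()
print(iris.data.shape)       # (150, 4): 150 records, 4 variables
print(iris.feature_names)    # the four measured variables (sepal/petal sizes)
print(iris.target_names)     # the three iris species (class labels)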
See also
* Data
* Data blending
* Data (computing)
* Data samples
* Data store
* Interoperability
* Data collection system
References
External links
* Data.gov – the U.S. Government's open data
* GCMD – the Global Change Master Directory, containing over 34,000 descriptions of Earth science and environmental science data sets and services
* Humanitarian Data Exchange (HDX) – an open humanitarian data sharing platform managed by the United Nations Office for the Coordination of Humanitarian Affairs (OCHA)
* NYC Open Data – free public data published by New York City agencies and other partners
* Relational data set repository
* Research Pipeline – a wiki/website with links to data sets on many different topics
* StatLib – JASA Data Archive
* UCI – a machine learning repository
* UK Government Public Data
* World Bank Open Data – free and open access to global development data by the World Bank