Data set

A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, such as the height and weight of an object, for each member of the data set. Data sets can also consist of a collection of documents or files. In the open data discipline, the data set is the unit used to measure the information released in a public open data repository. The European data.europa.eu portal aggregates more than a million data sets. Other issues, such as real-time data sources and non-relational data sets, make it harder to reach a consensus on what counts as a data set.
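The column-as-variable, row-as-record layout described above can be sketched in a few lines of Python. This is only an illustration: the variable names (`name`, `height_cm`, `weight_kg`), the values, and the helper `column` are all hypothetical.

```python
# Hypothetical example data: each dict is one record (row),
# and each key is one variable (column).
records = [
    {"name": "A", "height_cm": 170.0, "weight_kg": 65.0},
    {"name": "B", "height_cm": 182.5, "weight_kg": 80.2},
    {"name": "C", "height_cm": 158.0, "weight_kg": 54.3},
]


def column(rows, variable):
    """Return the values of one variable across all records."""
    return [row[variable] for row in rows]


variables = list(records[0].keys())
heights = column(records, "height_cm")

print(len(records), "records,", len(variables), "variables")
print("height_cm:", heights)
```

Storing rows as dictionaries keeps each record self-describing; a column-oriented layout (one list per variable) would hold the same data set in a different arrangement.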


Properties

Several characteristics define a data set's structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.
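As a rough sketch of how such measures might be computed for one numeric variable, the following uses the standard library's `statistics.pstdev` for the population standard deviation and a hand-written helper for kurtosis. The helper `excess_kurtosis` and the sample values are assumptions made for illustration; the text does not say which kurtosis convention is meant, so the population excess kurtosis (fourth central moment divided by the squared variance, minus 3) is used here.

```python
import statistics


def excess_kurtosis(values):
    """Population excess kurtosis: m4 / m2**2 - 3, using central moments."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # population variance
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m4 / m2 ** 2 - 3.0


# Hypothetical values for a single variable.
heights_cm = [150.0, 160.0, 170.0, 180.0, 190.0]

stdev = statistics.pstdev(heights_cm)  # population standard deviation
kurt = excess_kurtosis(heights_cm)

print(f"standard deviation: {stdev:.4f}")
print(f"excess kurtosis:    {kurt:.4f}")
```

A negative excess kurtosis, as in this evenly spread sample, indicates lighter tails than a normal distribution.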
The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values are normally all of the same kind. However, there may also be ''missing values'', which must be indicated in some way.
In statistics, data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may also be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software, such as SPSS, still presents its data in the classical data set fashion. If data is missing or suspicious, an imputation method may be used to complete a data set.
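The text does not name a particular imputation method. As one simple possibility, the sketch below marks missing values with `None` and fills them with the mean of the observed values (mean imputation). The function name `impute_mean` and the sample data are hypothetical.

```python
def impute_mean(values):
    """Replace None entries with the mean of the observed (non-None) values."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("cannot impute: no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


# Hypothetical variable with two missing entries, indicated by None.
heights_cm = [170.0, None, 158.0, 182.0, None]
completed = impute_mean(heights_cm)

print(completed)
```

Mean imputation keeps the variable's mean unchanged but shrinks its spread, so other methods may be preferable when the variance matters.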


Classic data sets

Several classic data sets have been used extensively in the statistical literature:
* Iris flower data set – Multivariate data set introduced by Ronald Fisher (1936).
* MNIST database – Images of handwritten digits commonly used to test classification, clustering, and image processing algorithms.
* ''Categorical data analysis'' – Data sets used in the book ''An Introduction to Categorical Data Analysis''.
* ''Robust statistics'' – Data sets used in ''Robust Regression and Outlier Detection'' (Rousseeuw and Leroy, 1987). Provided on-line at the University of Cologne.
* ''Time series'' – Data used in Chatfield's book, ''The Analysis of Time Series'', are provided on-line by StatLib.
* ''Extreme values'' – Data used in the book ''An Introduction to the Statistical Modeling of Extreme Values'' are a snapshot of the data as it was provided on-line by Stuart Coles, the book's author.
* ''Bayesian Data Analysis'' – Data used in the book are provided on-line by Andrew Gelman, one of the book's authors.
* The Bupa liver data – Used in several papers in the machine learning (data mining) literature.
* Anscombe's quartet – Small data set illustrating the importance of graphing the data to avoid statistical fallacies.
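To make the point about Anscombe's quartet concrete, the sketch below computes the mean of y and the Pearson correlation for each of the four sets, using the values published by Anscombe (1973). The helper functions `mean` and `pearson_r` are written here for illustration rather than taken from any library.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Sets I to III share the same x values.
x123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
x4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

quartet = {
    "I": (x123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (x123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (x123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": (x4, [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

# Summary statistics per set: (mean of y, correlation of x and y).
summary = {
    name: (mean(ys), pearson_r(xs, ys)) for name, (xs, ys) in quartet.items()
}

for name, (y_mean, r) in summary.items():
    print(f"set {name}: mean(y) = {y_mean:.2f}, r = {r:.3f}")
```

The four sets produce nearly identical summary statistics even though their scatter plots look very different, which is why the quartet is used to argue for graphing data before analysing it.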


See also

* Data
* Data blending
* Data (computing)
* Data samples
* Data store
* Interoperability
* Data collection system


External links


* Data.gov – the U.S. Government's open data
* GCMD – the Global Change Master Directory containing over 34,000 descriptions of Earth science and environmental science data sets and services
* Humanitarian Data Exchange (HDX) – an open humanitarian data sharing platform managed by the United Nations Office for the Coordination of Humanitarian Affairs
* NYC Open Data – free public data published by New York City agencies and other partners
* Relational data set repository
* Research Pipeline – a wiki/website with links to data sets on many different topics
* StatLib–JASA Data Archive
* UCI – a machine learning repository
* UK Government Public Data
* World Bank Open Data – free and open access to global development data by the World Bank