Data Set (IBM Mainframe), Sequential Data Set

A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the data set in question. The data set lists values for each of the variables, such as the height and weight of an object, for each member of the data set. Data sets can also consist of a collection of documents or files. In the open data discipline, a dataset is a unit used to measure the amount of information released in a public open data repository. The European data.europa.eu portal aggregates more than a million data sets.
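
The column-as-variable, row-as-record layout described above can be sketched in Python using only the standard library. The inline CSV text, the variable names (`name`, `height_cm`, `weight_kg`) and the helper functions are hypothetical and chosen purely for illustration.

```python
import csv
import io

# Hypothetical tabular data: each column is a variable, each row is a record.
RAW = """name,height_cm,weight_kg
A,170,65
B,182,80
C,158,52
"""


def load_records(text):
    """Parse CSV text into a list of dictionaries, one per record (row)."""
    return list(csv.DictReader(io.StringIO(text)))


def column(records, variable):
    """Return every value of one variable (column), in row order."""
    return [row[variable] for row in records]


records = load_records(RAW)
heights = [float(v) for v in column(records, "height_cm")]
print(len(records), "records; heights:", heights)
```

Each dictionary plays the role of one row, and `column` extracts a single variable across all members of the data set.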


Properties

Several characteristics define a data set's structure and properties. These include the number and types of the attributes or variables, and various statistical measures applicable to them, such as standard deviation and kurtosis.

The values may be numbers, such as real numbers or integers, for example representing a person's height in centimeters, but may also be nominal data (i.e., not consisting of numerical values), for example representing a person's ethnicity. More generally, values may be of any of the kinds described as a level of measurement. For each variable, the values are normally all of the same kind. Missing values may exist, and they must be indicated in some way.

In statistics, data sets usually come from actual observations obtained by sampling a statistical population, and each row corresponds to the observations on one element of that population. Data sets may also be generated by algorithms for the purpose of testing certain kinds of software. Some modern statistical analysis software, such as SPSS, still presents its data in the classical data set fashion. If data is missing or suspicious, an imputation method may be used to complete a data set.
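
As a rough illustration of the measures and the imputation step mentioned above, the following sketch computes a standard deviation and a kurtosis, and fills a missing value. It makes several assumptions: missing values are marked with `None`, mean imputation stands in for "an imputation method" (it is only one simple choice among many), and kurtosis is taken in its population form m4 / m2², without the minus-3 "excess" adjustment that some software applies. The sample values are made up.

```python
import statistics


def impute_mean(values):
    """Replace None entries with the mean of the observed (non-missing) values."""
    observed = [v for v in values if v is not None]
    fill = statistics.fmean(observed)
    return [fill if v is None else v for v in values]


def kurtosis(values):
    """Population kurtosis m4 / m2**2 (not the 'excess' form, which subtracts 3)."""
    mean = statistics.fmean(values)
    m2 = statistics.fmean((v - mean) ** 2 for v in values)
    m4 = statistics.fmean((v - mean) ** 4 for v in values)
    return m4 / m2 ** 2


# Made-up sample for one variable; None marks a missing value.
sample = [2.0, 4.0, None, 4.0, 5.0, 5.0, 7.0, 9.0, 4.0]
observed = [v for v in sample if v is not None]

print("population standard deviation:", statistics.pstdev(observed))
print("kurtosis:", kurtosis(observed))
print("after mean imputation:", impute_mean(sample))
```

Mean imputation preserves the variable's mean but shrinks its variance, which is one reason more elaborate imputation methods exist.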


Classics

Several classic data sets have been used extensively in the statistical literature:
* Iris flower data set – Multivariate data set introduced by Ronald Fisher (1936). Provided online by the University of California-Irvine Machine Learning Repository.
* MNIST database – Images of handwritten digits commonly used to test classification, clustering, and image processing algorithms.
* ''Categorical data analysis'' – Data sets used in the book ''An Introduction to Categorical Data Analysis'', provided online by UCLA Advanced Research Computing.
* ''Robust statistics'' – Data sets used in ''Robust Regression and Outlier Detection'' (Rousseeuw and Leroy, 1987), provided online at the University of Cologne.
* ''Time series'' – Data used in Chatfield's book ''The Analysis of Time Series'' are provided online by StatLib.
* ''Extreme values'' – Data used in the book ''An Introduction to the Statistical Modeling of Extreme Values'' are available as a snapshot of the data as it was provided online by Stuart Coles, the book's author.
* ''Bayesian Data Analysis'' – Data used in the book are provided online by Andrew Gelman, one of the book's authors.
* The Bupa liver data – Used in several papers in the machine learning (data mining) literature.
* Anscombe's quartet – Small data set illustrating the importance of graphing the data to avoid statistical fallacies.
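
The point made by Anscombe's quartet can be sketched numerically. The snippet below uses the first two of the quartet's four x/y sets, with values as commonly reproduced from Anscombe's 1973 paper; they were transcribed by hand for this sketch and should be checked against an authoritative copy before reuse. The helper names are illustrative.

```python
import math

# Sets I and II of Anscombe's quartet share the same x values.
X = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y1 = [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]


def mean(values):
    """Arithmetic mean of a non-empty sequence."""
    return sum(values) / len(values)


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


for label, ys in (("I", Y1), ("II", Y2)):
    print(label, "mean of y:", round(mean(ys), 2), "r:", round(pearson_r(X, ys), 3))
```

The two sets report nearly identical means and correlations even though set I is roughly linear and set II follows a curve, a difference that only becomes apparent when the points are plotted.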


Example

Loading datasets using Python:

    $ pip install datasets

    from datasets import load_dataset

    # Replace NAME_OF_DATASET with the identifier of the data set to load.
    dataset = load_dataset("NAME_OF_DATASET")


See also

* List of datasets for machine-learning research
* List of datasets in computer vision and image processing
* Data blending
* Data (computer science)
* Sampling
* Data store
* Interoperability
* Data collection system


External links


* Data.gov – the U.S. Government's open data
* GCMD – the Global Change Master Directory, containing over 34,000 descriptions of Earth science and environmental science data sets and services
* Humanitarian Data Exchange (HDX) – an open humanitarian data sharing platform managed by the United Nations Office for the Coordination of Humanitarian Affairs
* NYC Open Data – free public data published by New York City agencies and other partners
* Relational data set repository
* Research Pipeline – a wiki/website with links to data sets on many different topics
* StatLib–JASA Data Archive
* UCI – a machine learning repository
* UK Government Public Data
* World Bank Open Data – free and open access to global development data by the World Bank