
The ''Iris'' flower data set or Fisher's ''Iris'' data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper ''The use of multiple measurements in taxonomic problems'' as an example of linear discriminant analysis.
It is sometimes called Anderson's ''Iris'' data set because Edgar Anderson collected the data to quantify the morphologic variation of ''Iris'' flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".
The data set consists of 50 samples from each of three species of ''Iris'' (''Iris setosa'', ''Iris virginica'' and ''Iris versicolor''). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish each species. Fisher's paper was published in the ''Annals of Eugenics'' (today the ''Annals of Human Genetics'').
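The idea of a linear discriminant model built from the four measurements can be sketched in Python. This is only an illustration using scikit-learn's `LinearDiscriminantAnalysis` on the copy of the data bundled with that library; it is a modern implementation, not a reproduction of Fisher's original 1936 computation, and the variable names are chosen here for illustration.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the four measurements (X) and the species labels (y).
X, y = load_iris(return_X_y=True)

# Fit a linear discriminant model on all 150 samples.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Project the samples onto the discriminant axes
# (at most n_classes - 1 = 2 axes for three species).
X_projected = lda.transform(X)

# Fraction of samples that the fitted model assigns to their own species.
train_accuracy = lda.score(X, y)
print(X_projected.shape, round(train_accuracy, 3))
```

Because the model is scored on the same samples it was fitted on, the reported accuracy is optimistic and should not be read as a measure of generalization.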
Use of the data set

Originally used as an example data set on which Fisher's linear discriminant analysis was applied, it became a typical test case for many statistical classification techniques in machine learning such as support vector machines.
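As a rough illustration of how the data set serves as a classification test case, the following sketch trains a support vector machine with scikit-learn. The 70/30 split, the RBF kernel, the default hyperparameters and the random seed are all assumptions made for this example rather than a recommended protocol.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

X, y = load_iris(return_X_y=True)

# Hold out 30% of the samples; stratify keeps the three species balanced.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Support vector classifier with an RBF kernel and default hyperparameters.
clf = SVC(kernel="rbf")
clf.fit(X_train, y_train)

# Accuracy on the held-out samples only.
test_accuracy = clf.score(X_test, y_test)
print(f"held-out accuracy: {test_accuracy:.3f}")
```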
The use of this data set in cluster analysis, however, is not common, since the data set contains only two clusters with rather obvious separation. One of the clusters contains ''Iris setosa'', while the other contains both ''Iris virginica'' and ''Iris versicolor'' and is not separable without the species information Fisher used. This makes the data set a good example to explain the difference between supervised and unsupervised techniques in data mining: Fisher's linear discriminant model can only be obtained when the object species are known, because class labels and clusters are not necessarily the same.
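The gap between clusters and class labels can be checked with a small experiment. The sketch below uses k-means from scikit-learn, which is an assumption of this example (the text names no particular clustering algorithm), asks for three clusters without showing the algorithm any labels, and then compares the result with the true species.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris
from sklearn.metrics import adjusted_rand_score

X, y = load_iris(return_X_y=True)

# Unsupervised step: the species labels y are not used here.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
cluster_ids = kmeans.fit_predict(X)

# Agreement between clusters and species (1.0 would be a perfect match).
ari = adjusted_rand_score(y, cluster_ids)

# Which cluster ids the setosa samples (label 0) ended up in.
setosa_clusters = set(cluster_ids[y == 0].tolist())
print(f"adjusted Rand index: {ari:.3f}, setosa clusters: {setosa_clusters}")
```

An adjusted Rand index below 1 indicates that the clusters found without labels do not coincide with the species, while the setosa samples landing in a single cluster reflects the obvious separation described above.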
Nevertheless, all three species of ''Iris'' are separable in the projection on the nonlinear and branching principal component. The data set is approximated by the closest tree with some penalty for the excessive number of nodes, bending and stretching. Then the so-called "metro map" is constructed. The data points are projected onto the closest node. For each node, a pie diagram of the projected points is prepared, with the area of the pie proportional to the number of projected points. It is clear from the diagram (left) that the absolute majority of the samples of the different ''Iris'' species belong to different nodes. Only a small fraction of ''Iris virginica'' is mixed with ''Iris versicolor'' (the mixed blue-green nodes in the diagram). Therefore, the three species of ''Iris'' (''Iris setosa'', ''Iris virginica'' and ''Iris versicolor'') are separable by the unsupervised procedures of nonlinear principal component analysis. To discriminate them, it is sufficient just to select the corresponding nodes on the principal tree.
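For comparison, an unsupervised projection can be sketched with ordinary linear principal component analysis. This is explicitly not the nonlinear principal tree described above, which is not part of scikit-learn; it is a simpler stand-in that shows how a projection is computed without labels and then inspected against the species afterwards.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA

X, y = load_iris(return_X_y=True)

# Linear PCA to two components; the labels y are not used in the fit.
pca = PCA(n_components=2)
scores = pca.fit_transform(X)

# Compare setosa (label 0) with the other two species along the first component.
pc1 = scores[:, 0]
setosa, others = pc1[y == 0], pc1[y != 0]

# The sign of a principal component is arbitrary, so check both directions.
setosa_separated = bool(
    setosa.max() < others.min() or setosa.min() > others.max()
)
print(pca.explained_variance_ratio_, setosa_separated)
```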
Data set

The dataset contains a set of 150 records under five attributes: sepal length, sepal width, petal length, petal width and species.

The ''Iris'' data set is widely used as a beginner's dataset for machine learning purposes. The dataset is included in base R and, for Python, in the machine learning library scikit-learn, so that users can access it without having to find a source for it.
Several versions of the dataset have been published.
R code illustrating usage
The example R code shown below reproduces the scatterplot displayed at the top of this article:
# Show the dataset
iris
# Show the help page, with information about the dataset
?iris
# Create scatterplots of all pairwise combinations of the 4 variables in the dataset
pairs(iris[1:4], main="Iris Data (red=setosa,green=versicolor,blue=virginica)",
      pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])
Python code illustrating usage
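from sklearn.datasets import load_iris
# as_frame=True (needs pandas) returns a DataFrame in .frame: four features plus a "target" column
iris = load_iris(as_frame=True).frame
iris.head()
iris.info()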
See also
* Classic data sets
* List of datasets for machine-learning research