The ''Iris'' flower data set or Fisher's ''Iris'' data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper ''The use of multiple measurements in taxonomic problems'' as an example of linear discriminant analysis. It is sometimes called Anderson's ''Iris'' data set because Edgar Anderson collected the data to quantify the morphologic variation of ''Iris'' flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

The data set consists of 50 samples from each of three species of ''Iris'' (''Iris setosa'', ''Iris virginica'' and ''Iris versicolor''). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. Fisher's paper was published in the ''Annals of Eugenics'' (today the ''Annals of Human Genetics'').
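As an illustration of fitting a linear discriminant model to the four measurements, the following is a minimal sketch. It assumes scikit-learn is installed and uses its `LinearDiscriminantAnalysis` class, which is a modern implementation and not a reproduction of Fisher's original 1936 computation; the variable names are illustrative.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Load the four measurements (X) and the species labels (y).
X, y = load_iris(return_X_y=True)

# Fit a linear discriminant model on all 150 labelled samples.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Project onto the discriminant axes: three classes give at most two axes.
X_proj = lda.transform(X)

# Accuracy on the training data only; this is not an estimate of
# performance on flowers the model has not seen.
train_accuracy = lda.score(X, y)
print(X_proj.shape, round(train_accuracy, 3))
```

The model needs the species labels `y` to be fitted, which is what makes it a supervised technique.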

Use of the data set

Originally used as an example data set on which Fisher's linear discriminant analysis was applied, it became a typical test case for many statistical classification techniques in machine learning such as support vector machines.

The use of this data set in cluster analysis, however, is not common, since the data set contains only two clusters with rather obvious separation. One of the clusters contains ''Iris setosa'', while the other cluster contains both ''Iris virginica'' and ''Iris versicolor'' and is not separable without the species information Fisher used. This makes the data set a good example to explain the difference between supervised and unsupervised techniques in data mining: Fisher's linear discriminant model can only be obtained when the object species are known, and class labels and clusters are not necessarily the same.

Nevertheless, all three species of ''Iris'' are separable in the projection on the nonlinear and branching principal component. The data set is approximated by the closest tree, with some penalty for an excessive number of nodes, bending and stretching. Then the so-called "metro map" is constructed. Each data point is projected onto the closest node, and for each node a pie diagram of the projected points is prepared, with the area of the pie proportional to the number of projected points. The diagram shows that the vast majority of the samples of the different ''Iris'' species belong to different nodes. Only a small fraction of ''Iris virginica'' is mixed with ''Iris versicolor'' (the mixed blue-green nodes in the diagram). Therefore, the three species of ''Iris'' (''Iris setosa'', ''Iris virginica'' and ''Iris versicolor'') are separable by the unsupervised procedures of nonlinear principal component analysis. To discriminate them, it is sufficient to select the corresponding nodes on the principal tree.
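The two-cluster structure described earlier in this section can be explored with a short sketch. It assumes scikit-learn and NumPy are installed and uses k-means, chosen here purely for illustration; the article does not name a particular clustering algorithm, and the principal-tree method is not reproduced. The species labels are used only afterwards, to tabulate what each cluster contains.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# Unsupervised step: k-means never sees the species labels.
km = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = km.fit_predict(X)

# Cross-tabulate clusters (rows) against species (columns) after the fact.
table = np.zeros((2, 3), dtype=int)
for c, s in zip(clusters, y):
    table[c, s] += 1

print(iris.target_names)
print(table)
```

In the printed table, rows are clusters and columns follow the order of `iris.target_names`, so a column concentrated in a single row indicates a species that the clustering isolates.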

Data set

The dataset contains a set of 150 records under five attributes: sepal length, sepal width, petal length, petal width and species.
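The record and attribute counts can be checked against a copy of the data. The sketch below assumes the copy bundled with scikit-learn, which stores the four measurements in `data` and the species as integer codes in `target`; other published versions of the dataset may be laid out differently.

```python
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# The four measurements form a 2-D array: one row per flower.
n_records, n_measurements = iris.data.shape

# Species are stored as integer codes; count the flowers per code
# and pair each count with the corresponding species name.
per_species = dict(
    zip(iris.target_names.tolist(), np.bincount(iris.target).tolist())
)

print(n_records, n_measurements, per_species)
```

The fifth attribute, species, is held separately from the measurement array, so the array itself has four columns.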

The ''Iris'' data set is widely used as a beginner's dataset for machine learning purposes. The dataset is included in base R and, for Python, in the machine learning library scikit-learn, so that users can access it without having to find a source for it. Several versions of the dataset have been published.

R code illustrating usage

The example R code shown below reproduces the scatterplot displayed at the top of this article:

    # Show the dataset
    iris

    # Show the help page, with information about the dataset
    ?iris

    # Create scatterplots of all pairwise combinations of the 4 variables in the dataset
    pairs(iris[1:4],
          main="Iris Data (red=setosa,green=versicolor,blue=virginica)",
          pch=21,
          bg=c("red","green3","blue")[unclass(iris$Species)])

Python code illustrating usage

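    from sklearn.datasets import load_iris

    # as_frame=True returns the data as a pandas DataFrame (requires pandas);
    # the plain load_iris() result has no head() or info() methods.
    iris = load_iris(as_frame=True).frame
    iris.head()  # first five rows
    iris.info()  # column names, dtypes and non-null counts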
