The ''Iris'' flower data set or Fisher's ''Iris'' data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper ''The use of multiple measurements in taxonomic problems'' as an example of linear discriminant analysis. It is sometimes called Anderson's ''Iris'' data set because Edgar Anderson collected the data to quantify the morphologic variation of ''Iris'' flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

The data set consists of 50 samples from each of three species of ''Iris'' (''Iris setosa'', ''Iris virginica'' and ''Iris versicolor''). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.

Fisher's paper was published in the ''Annals of Eugenics'' and includes discussion of the contained techniques' applications to the field of phrenology.
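The idea of a linear discriminant model built from the four measurements can be illustrated with a minimal sketch. It assumes scikit-learn is installed and uses its `LinearDiscriminantAnalysis` class, which is a modern multiclass generalization and not a reproduction of Fisher's original 1936 computation. The variable names are chosen here for illustration only.

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target  # 150 x 4 measurements, species codes 0..2

# Fit on all 150 samples. This is a resubstitution fit,
# not a held-out evaluation of the model.
lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Project the samples onto the discriminant axes
# (at most n_classes - 1 = 2 axes for three species).
X_projected = lda.transform(X)

train_accuracy = lda.score(X, y)
print(X_projected.shape)
print(f"training accuracy: {train_accuracy:.3f}")
```

Because the score is computed on the same samples used for fitting, it is likely to overstate how well the model would classify new flowers.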

Use of the data set

Originally used as an example data set on which Fisher's linear discriminant analysis was applied, it became a typical test case for many statistical classification techniques in machine learning such as support vector machines.

The use of this data set in cluster analysis, however, is not common, since the data set contains only two clusters with rather obvious separation. One of the clusters contains ''Iris setosa'', while the other contains both ''Iris virginica'' and ''Iris versicolor'' and is not separable without the species information Fisher used. This makes the data set a good example for explaining the difference between supervised and unsupervised techniques in data mining: Fisher's linear discriminant model can only be obtained when the object species are known, because class labels and clusters are not necessarily the same.

Nevertheless, all three species of ''Iris'' are separable in the projection on the nonlinear and branching principal component. The data set is approximated by the closest tree with some penalty for the excessive number of nodes, bending and stretching. Then the so-called "metro map" is constructed. The data points are projected onto the closest node. For each node, a pie diagram of the projected points is prepared, with the area of the pie proportional to the number of projected points. It is clear from the diagram (left) that the absolute majority of the samples of the different ''Iris'' species belong to different nodes. Only a small fraction of ''Iris virginica'' is mixed with ''Iris versicolor'' (the mixed blue-green nodes in the diagram). Therefore, the three species of ''Iris'' (''Iris setosa'', ''Iris virginica'' and ''Iris versicolor'') are separable by the unsupervised procedures of nonlinear principal component analysis. To discriminate them, it is sufficient just to select the corresponding nodes on the principal tree.
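The claim that ''Iris setosa'' forms an obviously separate group while the other two species overlap can be checked on a single feature. The sketch below is only an illustration, assuming the copy of the data shipped with scikit-learn. It looks at petal length alone and is not a cluster analysis. The variable names are chosen here for illustration.

```python
from sklearn.datasets import load_iris

iris = load_iris()
petal_length = iris.data[:, 2]  # column 2 holds petal length in cm
species = iris.target           # 0 = setosa, 1 = versicolor, 2 = virginica

setosa = petal_length[species == 0]
versicolor = petal_length[species == 1]
virginica = petal_length[species == 2]

# Setosa is separated if its longest petal is shorter than
# the shortest petal found in either of the other two species.
setosa_is_separated = bool(setosa.max() < min(versicolor.min(), virginica.min()))

# Versicolor and virginica overlap if their petal length ranges intersect.
others_overlap = bool(versicolor.max() > virginica.min())

print("setosa separated on petal length:", setosa_is_separated)
print("versicolor and virginica overlap:", others_overlap)
```

A gap on one feature is enough to show the two-cluster structure. Telling ''Iris versicolor'' from ''Iris virginica'' needs either the species labels or a more elaborate unsupervised method such as the one described above.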

Data set

The dataset contains 150 records under five attributes: sepal length, sepal width, petal length, petal width and species.
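As an illustration of the five-attribute record layout, a single record could be modeled as follows. The class name `IrisRecord` and its field names are hypothetical, chosen for this sketch. The three rows shown are the first sample of each species as they appear in commonly distributed copies of the data.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class IrisRecord:
    """One of the 150 records; all measurements are in centimeters."""
    sepal_length: float
    sepal_width: float
    petal_length: float
    petal_width: float
    species: str


# First sample of each species (assumed to match the standard copies).
records = [
    IrisRecord(5.1, 3.5, 1.4, 0.2, "setosa"),
    IrisRecord(7.0, 3.2, 4.7, 1.4, "versicolor"),
    IrisRecord(6.3, 3.3, 6.0, 2.5, "virginica"),
]

attribute_names = [f.name for f in fields(IrisRecord)]
print(attribute_names)
```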

The ''Iris'' data set is widely used as a beginner's dataset for machine learning. It is included in base R and in the Python machine learning library scikit-learn, so users can access it without having to find a source for it. Several versions of the dataset have been published.

R code illustrating usage

The example R code shown below reproduces the scatterplot displayed at the top of this article:

 # Show the dataset
 iris
 # Show the help page, with information about the dataset
 ?iris
 # Create scatterplots of all pairwise combinations of the 4 variables in the dataset
 pairs(iris[1:4], main="Iris Data (red=setosa,green=versicolor,blue=virginica)",
       pch=21, bg=c("red","green3","blue")[unclass(iris$Species)])

Python code illustrating usage

 from sklearn.datasets import load_iris
 iris = load_iris()
 iris

This code returns a dictionary-like object that holds the measurements, the species codes and descriptive metadata.
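The object returned by `load_iris()` can be inspected field by field. The following is a minimal sketch, assuming scikit-learn is installed. `species_counts` is a name introduced here for illustration.

```python
from collections import Counter

from sklearn.datasets import load_iris

iris = load_iris()

# The measurements are stored as a 150 x 4 array.
print(iris.data.shape)

# Column names for the four measurements and names for the three species.
print(iris.feature_names)
print(list(iris.target_names))

# iris.target holds integer species codes; map them back to names and count.
species_counts = Counter(str(iris.target_names[code]) for code in iris.target)
print(dict(species_counts))
```

The counts should confirm the 50 samples per species described earlier in the article.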
