
The ''Iris'' flower data set or Fisher's ''Iris'' data set is a multivariate data set used and made famous by the British statistician and biologist Ronald Fisher in his 1936 paper ''The use of multiple measurements in taxonomic problems'' as an example of linear discriminant analysis. It is sometimes called Anderson's ''Iris'' data set because Edgar Anderson collected the data to quantify the morphologic variation of ''Iris'' flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

The data set consists of 50 samples from each of three species of ''Iris'' (''Iris setosa'', ''Iris virginica'' and ''Iris versicolor''). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other. Fisher's paper was published in the Annals of Eugenics and includes discussion of the contained techniques' applications to the field of phrenology.
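As an illustration of the linear discriminant model described above, the following minimal sketch fits a linear discriminant analysis to the four measurements. It assumes scikit-learn is installed and uses that library's LinearDiscriminantAnalysis class; it is not a reproduction of the calculations in Fisher's paper, and the variable names are chosen here for illustration only.

```python
# Sketch: fit a linear discriminant model to the four Iris measurements.
# Assumption: scikit-learn's LDA implementation stands in for Fisher's method.
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

iris = load_iris()
X, y = iris.data, iris.target      # X: 150 x 4 measurements, y: species codes 0..2

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# With three classes there are at most two discriminant axes.
projected = lda.transform(X)

# Fraction of the training samples that the fitted model classifies correctly.
train_accuracy = lda.score(X, y)

print(projected.shape)
print(round(train_accuracy, 3))
```

Because the accuracy is measured on the same samples used for fitting, it is an optimistic estimate; a held-out split or cross-validation would give a fairer figure.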


Use of the data set

Originally used as an example data set on which Fisher's linear discriminant analysis was applied, it became a typical test case for many statistical classification techniques in machine learning such as support vector machines. The use of this data set in cluster analysis, however, is not common, since the data set only contains two clusters with rather obvious separation. One of the clusters contains ''Iris setosa'', while the other cluster contains both ''Iris virginica'' and ''Iris versicolor'' and is not separable without the species information Fisher used. This makes the data set a good example to explain the difference between supervised and unsupervised techniques in data mining: Fisher's linear discriminant model can only be obtained when the object species are known, because class labels and clusters are not necessarily the same.

Nevertheless, all three species of ''Iris'' are separable in the projection on the nonlinear and branching principal component. The data set is approximated by the closest tree, with some penalty for an excessive number of nodes, bending and stretching. Then the so-called "metro map" is constructed. The data points are projected onto the closest node. For each node, a pie diagram of the projected points is prepared; the area of the pie is proportional to the number of projected points. It is clear from the diagram that the absolute majority of the samples of the different ''Iris'' species belong to different nodes. Only a small fraction of ''Iris virginica'' is mixed with ''Iris versicolor'' (the mixed blue-green nodes in the diagram). Therefore, the three species of ''Iris'' (''Iris setosa'', ''Iris virginica'' and ''Iris versicolor'') are separable by the unsupervised procedures of nonlinear principal component analysis. To discriminate them, it is sufficient just to select the corresponding nodes on the principal tree.
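The point that class labels and clusters are not necessarily the same can be illustrated with a small sketch. It uses k-means clustering, which is not the method discussed in the text and was chosen here only because it is a common unsupervised baseline; scikit-learn is assumed to be installed, and the choice of three clusters and of the random seed is an assumption for illustration.

```python
# Sketch: compare unsupervised clusters with the known species labels.
# Assumption: k-means with k=3 is used purely as an example clustering method.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import load_iris

iris = load_iris()
X, y = iris.data, iris.target

# The species labels in y are NOT given to the clustering algorithm.
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0)
clusters = kmeans.fit_predict(X)

# table[i, j] = number of flowers of species i assigned to cluster j
table = np.zeros((3, 3), dtype=int)
for species, cluster in zip(y, clusters):
    table[species, cluster] += 1

for name, row in zip(iris.target_names, table):
    print(f"{name:<12}", row)
```

Reading the table row by row shows how each species is distributed over the clusters: a species that is well separated occupies a single column, while species that overlap share columns.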


Data set

The dataset contains 150 records with five attributes: sepal length, sepal width, petal length, petal width and species. The iris data set is widely used as a beginner's dataset for machine learning purposes. The dataset is included in base R (in the ''datasets'' package) and, for Python, in the machine learning library scikit-learn, so that users can access it without having to find a source for it. Several versions of the dataset have been published.
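The structure described above can be checked directly. The following minimal sketch assumes scikit-learn is installed and uses its bundled copy of the data, which stores the four measurements and the species separately; the variable names are chosen here for illustration.

```python
# Sketch: inspect the shape and class balance of scikit-learn's copy of the data.
import numpy as np
from sklearn.datasets import load_iris

iris = load_iris()

# iris.data holds the four numeric measurements; iris.target holds species codes.
n_records, n_measurements = iris.data.shape
species_counts = np.bincount(iris.target)

print(n_records, n_measurements)
print(iris.feature_names)
print({str(name): int(count) for name, count in zip(iris.target_names, species_counts)})
```

The species column is stored as integer codes, with iris.target_names giving the species name for each code.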


R code illustrating usage

The example R code shown below reproduces the scatterplot displayed at the top of this article:

    # Show the dataset
    iris

    # Show the help page, with information about the dataset
    ?iris

    # Create scatterplots of all pairwise combinations of the 4 variables in the dataset
    pairs(iris[1:4],
          main = "Iris Data (red=setosa,green=versicolor,blue=virginica)",
          pch = 21,
          # unclass() turns the Species factor into integer codes 1..3,
          # which index the colour vector
          bg = c("red", "green3", "blue")[unclass(iris$Species)])


Python code illustrating usage

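    from sklearn.datasets import load_iris

    iris = load_iris()
    iris

This code gives a dictionary-like object that holds the measurements (data), the species codes (target), and the feature and species names (feature_names, target_names).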


See also

* Classic data sets

