A diversity index is a quantitative measure that reflects how many different types (such as species) there are in a dataset (a community), and that can simultaneously take into account the phylogenetic relations among the individuals distributed among those types, such as ''richness'', ''divergence'' or ''evenness''. These indices are statistical representations of biodiversity in different aspects (richness, evenness, and dominance).

Effective number of species or Hill numbers

When diversity indices are used in ecology, the types of interest are usually species, but they can also be other categories, such as genera, families, functional types, or haplotypes. The entities of interest are usually individual plants or animals, and the measure of abundance can be, for example, number of individuals, biomass or coverage. In demography, the entities of interest can be people, and the types of interest various demographic groups. In information science, the entities can be characters and the types the different letters of the alphabet.

The most commonly used diversity indices are simple transformations of the effective number of types (also known as 'true diversity'), but each diversity index can also be interpreted in its own right as a measure corresponding to some real phenomenon (but a different one for each diversity index).

Many indices only account for categorical diversity between subjects or entities. Such indices, however, do not account for the total variation (diversity) that can be held between subjects or entities, which is captured only when both categorical and qualitative diversity are calculated.

True diversity, or the effective number of types, refers to the number of equally abundant types needed for the average proportional abundance of the types to equal that observed in the dataset of interest (where all types may not be equally abundant). The true diversity in a dataset is calculated by first taking the weighted generalized mean M_{q−1} of the proportional abundances of the types in the dataset, and then taking the reciprocal of this. The equation is:

{}^q\!D = \frac{1}{M_{q-1}} = \frac{1}{\sqrt[q-1]{\sum_{i=1}^R p_i p_i^{q-1}}} = \left( \sum_{i=1}^R p_i^q \right)^{1/(1-q)}

The denominator M_{q−1} equals the average proportional abundance of the types in the dataset as calculated with the weighted generalized mean with exponent q − 1. In the equation, R is richness (the total number of types in the dataset), and the proportional abundance of the ith type is p_i. The proportional abundances themselves are used as the nominal weights. The numbers ^qD are called Hill numbers of order q or effective numbers of species.

When q = 1, the above equation is undefined. However, the mathematical limit as q approaches 1 is well defined, and the corresponding diversity is calculated with the following equation:

{}^1\!D = \frac{1}{\prod_{i=1}^R p_i^{p_i}} = \exp\left( -\sum_{i=1}^R p_i \ln(p_i) \right)

which is the exponential of the Shannon entropy calculated with natural logarithms (see the Shannon index section below). In other domains, this statistic is also known as the ''perplexity''.

The general equation of diversity is often written in the form

{}^q\!D = \left( \sum_{i=1}^R p_i^q \right)^{1/(1-q)}

and the term inside the parentheses is called the basic sum. Some popular diversity indices correspond to the basic sum as calculated with different values of q.
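The general equation and its q = 1 limit can be sketched in a few lines of Python. This is a minimal illustration, not a reference implementation: the function name `hill_number` and the abundance counts are assumptions made for the example.

```python
import math

def hill_number(proportions, q):
    """Effective number of types (true diversity) of order q."""
    # Types with zero abundance are dropped; they contribute nothing.
    p = [x for x in proportions if x > 0]
    if math.isclose(q, 1.0):
        # Limit q -> 1: exponential of the Shannon entropy (natural log).
        return math.exp(-sum(x * math.log(x) for x in p))
    basic_sum = sum(x ** q for x in p)
    return basic_sum ** (1.0 / (1.0 - q))

# Hypothetical abundance counts, used only for illustration.
abundances = [50, 30, 15, 5]
total = sum(abundances)
p = [n / total for n in abundances]

for q in (0, 1, 2):
    print(f"q={q}: D={hill_number(p, q):.3f}")
```

For a perfectly even dataset every order gives the same answer, the number of types, which is a quick sanity check for any implementation.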

Sensitivity of the diversity value to rare vs. abundant species

The value of q is often referred to as the order of the diversity. It defines the sensitivity of the true diversity to rare vs. abundant species by modifying how the weighted mean of the species' proportional abundances is calculated. With some values of the parameter q, the value of the generalized mean M_{q−1} assumes familiar kinds of weighted means as special cases. In particular,

* q = 0 corresponds to the weighted harmonic mean,
* q = 1 to the weighted geometric mean, and
* q = 2 to the weighted arithmetic mean.
* As q approaches infinity, the weighted generalized mean with exponent q − 1 approaches the maximum p_i value, which is the proportional abundance of the most abundant species in the dataset.

Generally, increasing the value of q increases the effective weight given to the most abundant species. This leads to obtaining a larger M_{q−1} value and a smaller true diversity (^qD) value with increasing q.

When q = 1, the weighted geometric mean of the p_i values is used, and each species is exactly weighted by its proportional abundance (in the weighted geometric mean, the weights are the exponents). When q > 1, the weight given to abundant species is exaggerated, and when q < 1, the weight given to rare species is exaggerated instead. At q = 0, the species weights exactly cancel out the species proportional abundances, such that the weighted mean of the p_i values equals 1/R even when all species are not equally abundant. At q = 0, the effective number of species, ^0D, hence equals the actual number of species R.

In the context of diversity, q is generally limited to non-negative values. This is because negative values of q would give rare species so much more weight than abundant ones that ^qD would exceed R.
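The effect of the order q on the weighted mean can be checked numerically. The sketch below assumes a hypothetical helper, `weighted_power_mean`, which uses the proportions themselves as weights, and an illustrative set of proportions that sum to 1.

```python
import math

def weighted_power_mean(p, exponent):
    """Weighted generalized mean of p, using p itself as the weights."""
    if math.isclose(exponent, 0.0):
        # Exponent 0 is the limiting case: the weighted geometric mean.
        return math.exp(sum(x * math.log(x) for x in p))
    return sum(x * x ** exponent for x in p) ** (1.0 / exponent)

# Illustrative proportions (assumed to sum to 1).
p = [0.6, 0.3, 0.1]

for q in (0, 1, 2, 50):
    m = weighted_power_mean(p, q - 1)
    # True diversity of order q is the reciprocal of the mean.
    print(f"q={q}: M={m:.4f}  D={1 / m:.4f}")
```

With these proportions the mean starts at 1/R for q = 0, grows with q, and drifts toward the largest proportion (0.6) as q becomes large, so the true diversity shrinks accordingly.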

Richness

Richness R simply quantifies how many different types the dataset of interest contains. For example, species richness (usually denoted S) of a dataset is the number of species in the corresponding species list. Richness is a simple measure, so it has been a popular diversity index in ecology, where abundance data are often not available for the datasets of interest. Because richness does not take the abundance of the types into account, it is not the same thing as diversity, which does take abundance into account. However, if true diversity is calculated with q = 0, the effective number of types (^0D) equals the actual number of types, which is identical to richness (R).
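As a small illustration of why richness alone says nothing about abundance, the sketch below counts distinct types in two hypothetical species lists. The function name `richness` and the sample records are assumptions for the example.

```python
from collections import Counter

def richness(observations):
    """Number of distinct types with at least one observation."""
    return len(Counter(observations))

# Hypothetical records: same three species, very different abundances.
uneven = ["oak", "oak", "oak", "oak", "birch", "pine"]
even = ["oak", "oak", "birch", "birch", "pine", "pine"]

print(richness(uneven), richness(even))
```

Both lists have the same richness even though one is dominated by a single species, which is the distinction the abundance-aware indices are designed to capture.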

Shannon index

The Shannon index has been a popular diversity index in the ecological literature, where it is also known as Shannon's diversity index, Shannon–Wiener index, and (erroneously) Shannon–Weaver index [Spellerberg, Ian F., and Peter J. Fedor (2003). A tribute to Claude Shannon (1916–2001) and a plea for more rigorous use of species richness, species diversity and the 'Shannon–Wiener' Index. Global Ecology and Biogeography 12(3), 177–179]. The measure was originally proposed by Claude Shannon in 1948 to quantify the entropy (hence ''Shannon entropy'', related to Shannon information content) in strings of text [Shannon, C. E. (1948). A mathematical theory of communication. The Bell System Technical Journal, 27, 379–423 and 623–656].

The idea is that the more letters there are, and the closer their proportional abundances in the string of interest, the more difficult it is to correctly predict which letter will be the next one in the string. The Shannon entropy quantifies the uncertainty (entropy or degree of surprise) associated with this prediction. It is most often calculated as follows:

H' = -\sum_{i=1}^R p_i \ln p_i

where p_i is the proportion of characters belonging to the ith type of letter in the string of interest. In ecology, p_i is often the proportion of individuals belonging to the ith species in the dataset of interest. Then the Shannon entropy quantifies the uncertainty in predicting the species identity of an individual that is taken at random from the dataset.

Although the equation is here written with natural logarithms, the base of the logarithm used when calculating the Shannon entropy can be chosen freely. Shannon himself discussed logarithm bases 2, 10 and e, and these have since become the most popular bases in applications that use the Shannon entropy. Each log base corresponds to a different measurement unit, which has been called binary digits (bits), decimal digits (decits), and natural digits (nats) for the bases 2, 10 and e, respectively. Comparing Shannon entropy values that were originally calculated with different log bases requires converting them to the same log base: change from the base a to base b is obtained with multiplication by log_b(a).

The Shannon index (H') is related to the weighted geometric mean of the proportional abundances of the types. Specifically, it equals the logarithm of true diversity as calculated with q = 1:

H' = -\sum_{i=1}^R p_i \ln p_i = -\sum_{i=1}^R \ln p_i^{p_i}

This can also be written

H' = -\left( \ln p_1^{p_1} + \ln p_2^{p_2} + \ln p_3^{p_3} + \cdots + \ln p_R^{p_R} \right)

which equals

H' = -\ln \left( p_1^{p_1} p_2^{p_2} p_3^{p_3} \cdots p_R^{p_R} \right) = \ln \left( \frac{1}{p_1^{p_1} p_2^{p_2} p_3^{p_3} \cdots p_R^{p_R}} \right) = \ln \left( \frac{1}{\prod_{i=1}^R p_i^{p_i}} \right)

Since the sum of the p_i values equals unity by definition, the denominator equals the weighted geometric mean of the p_i values, with the p_i values themselves being used as the weights (exponents in the equation). The term within the parentheses hence equals true diversity ^1D, and H' equals ln(^1D).

When all types in the dataset of interest are equally common, all p_i values equal 1/R, and the Shannon index hence takes the value ln(R). The more unequal the abundances of the types, the larger the weighted geometric mean of the p_i values, and the smaller the corresponding Shannon entropy. If practically all abundance is concentrated in one type, and the other types are very rare (even if there are many of them), Shannon entropy approaches zero. When there is only one type in the dataset, Shannon entropy exactly equals zero (there is no uncertainty in predicting the type of the next randomly chosen entity).

In machine learning, the Shannon index is the entropy from which information gain is computed.
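The calculation, the change of logarithm base, and the link to true diversity of order 1 can be sketched as follows. The function name `shannon_entropy` and the proportions are assumptions chosen for illustration.

```python
import math

def shannon_entropy(p, base=math.e):
    """Shannon entropy of proportions p, in the unit set by `base`."""
    # Types with zero abundance are skipped (their term is taken as 0).
    return -sum(x * math.log(x, base) for x in p if x > 0)

# Illustrative proportions (assumed to sum to 1).
p = [0.5, 0.25, 0.25]

h_nats = shannon_entropy(p)          # natural logarithm
h_bits = shannon_entropy(p, base=2)  # base-2 logarithm

# Changing from base e to base 2: multiply by log2(e).
print(h_bits, h_nats * math.log2(math.e))

# The exponential of H' (in nats) is the true diversity of order 1.
print(math.exp(h_nats))
```

The two printed entropy values agree up to floating-point error, and for an even dataset of R types `shannon_entropy` returns ln(R), matching the text above.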

Rényi entropy

The Rényi entropy is a generalization of the Shannon entropy to other values of q than unity. It can be expressed:

{}^qH = \frac{1}{1-q} \; \ln\left( \sum_{i=1}^R p_i^q \right)

which equals

{}^qH = \ln\left( \frac{1}{\sqrt[q-1]{\sum_{i=1}^R p_i p_i^{q-1}}} \right) = \ln({}^q\!D)

This means that taking the logarithm of true diversity based on any value of q gives the Rényi entropy corresponding to the same value of q.
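The identity between the Rényi entropy and the logarithm of true diversity can be verified numerically. The sketch below uses two illustrative helpers, `renyi_entropy` and `true_diversity`, which are names assumed for this example, and restricts itself to q ≠ 1.

```python
import math

def renyi_entropy(p, q):
    """Rényi entropy of order q (natural log), for q != 1."""
    return math.log(sum(x ** q for x in p)) / (1.0 - q)

def true_diversity(p, q):
    """Hill number of order q, for q != 1."""
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

# Illustrative proportions (assumed positive and summing to 1).
p = [0.5, 0.3, 0.2]

for q in (0, 0.5, 2, 3):
    print(q, renyi_entropy(p, q), math.log(true_diversity(p, q)))
```

The two columns printed for each q should coincide up to floating-point error.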

Simpson index

The Simpson index was introduced in 1949 by Edward H. Simpson to measure the degree of concentration when individuals are classified into types. The same index was rediscovered by Orris C. Herfindahl in 1950. The square root of the index had already been introduced in 1945 by the economist Albert O. Hirschman. As a result, the same measure is usually known as the Simpson index in ecology, and as the Herfindahl index or the Herfindahl–Hirschman index (HHI) in economics.

The measure equals the probability that two entities taken at random from the dataset of interest represent the same type. It equals:

\lambda = \sum_{i=1}^R p_i^2

where R is richness (the total number of types in the dataset). This equation is also equal to the weighted arithmetic mean of the proportional abundances p_i of the types of interest, with the proportional abundances themselves being used as the weights. Proportional abundances are by definition constrained to values between zero and unity, but because λ is a weighted arithmetic mean of them, λ ≥ 1/R, with the minimum reached when all types are equally abundant.

By comparing the equation used to calculate λ with the equations used to calculate true diversity, it can be seen that 1/λ equals ^2D, i.e., true diversity as calculated with q = 2. The original Simpson's index hence equals the corresponding basic sum.

The interpretation of λ as the probability that two entities taken at random from the dataset of interest represent the same type assumes that the first entity is returned to the dataset before taking the second entity. If the dataset is very large, sampling without replacement gives approximately the same result, but in small datasets, the difference can be substantial. If the dataset is small, and sampling without replacement is assumed, the probability of obtaining the same type with both random draws is:

\ell = \frac{\sum_{i=1}^R n_i (n_i - 1)}{N (N - 1)}

where n_i is the number of entities belonging to the ith type and N is the total number of entities in the dataset. This form of the Simpson index is also known as the Hunter–Gaston index in microbiology.

Since the mean proportional abundance of the types increases with decreasing number of types and increasing abundance of the most abundant type, λ obtains small values in datasets of high diversity and large values in datasets of low diversity. This is counterintuitive behavior for a diversity index, so transformations of λ that increase with increasing diversity have often been used instead. The most popular of such indices have been the inverse Simpson index (1/λ) and the Gini–Simpson index (1 − λ). Both of these have also been called the Simpson index in the ecological literature, so care is needed to avoid accidentally comparing the different indices as if they were the same.
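The difference between the two sampling assumptions is easy to see on small counts. The following sketch uses hypothetical function names and made-up counts; it compares λ (with replacement) and ℓ (without replacement) for a small and a large dataset with the same proportions.

```python
def simpson_lambda(counts):
    """Simpson index: same-type probability, sampling with replacement."""
    n_total = sum(counts)
    return sum((n / n_total) ** 2 for n in counts)

def simpson_without_replacement(counts):
    """Same-type probability when the first entity is not returned."""
    n_total = sum(counts)
    return sum(n * (n - 1) for n in counts) / (n_total * (n_total - 1))

# Hypothetical counts with identical proportions at two scales.
small = [3, 2, 1]
large = [3000, 2000, 1000]

for counts in (small, large):
    print(simpson_lambda(counts), simpson_without_replacement(counts))
```

For the small dataset the two values differ noticeably, while for the large one they nearly coincide, as described above.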

Inverse Simpson index

The inverse Simpson index equals:

\frac{1}{\lambda} = \frac{1}{\sum_{i=1}^R p_i^2} = {}^2D

This simply equals true diversity of order 2, i.e. the effective number of types that is obtained when the weighted arithmetic mean is used to quantify average proportional abundance of types in the dataset of interest. The index is also used as a measure of the effective number of parties.
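As a brief illustration, the inverse Simpson index can be computed directly from proportions. The function name `inverse_simpson` and the vote shares below are hypothetical, chosen only to show the calculation.

```python
def inverse_simpson(p):
    """Inverse Simpson index: true diversity of order 2."""
    return 1.0 / sum(x ** 2 for x in p)

# Hypothetical vote shares for four parties (assumed to sum to 1).
shares = [0.4, 0.4, 0.1, 0.1]

print(inverse_simpson(shares))
print(inverse_simpson([0.25] * 4))
```

Four equally sized types give an effective number of 4, while the uneven shares give a value below 3, reflecting that two of the four types carry most of the weight.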

Gini–Simpson index

The Gini–Simpson index is also called Gini impurity, or Gini's diversity index, in the field of machine learning. The original Simpson index λ equals the probability that two entities taken at random from the dataset of interest (with replacement) represent the same type. Its transformation 1 − λ, therefore, equals the probability that the two entities represent different types. This measure is also known in ecology as the probability of interspecific encounter (''PIE'') and the Gini–Simpson index. It can be expressed as a transformation of the true diversity of order 2:

1 - \lambda = 1 - \sum_{i=1}^R p_i^2 = 1 - \frac{1}{{}^2D}

The Gibbs–Martin index of sociology, psychology, and management studies, which is also known as the Blau index, is the same measure as the Gini–Simpson index. The quantity is also known as the expected heterozygosity in population genetics.
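The probabilistic reading of 1 − λ can be checked by enumerating all ordered pairs of draws. This is a minimal sketch with assumed function names and illustrative proportions.

```python
from itertools import product

def gini_simpson(p):
    """Gini-Simpson index: 1 minus the Simpson index."""
    return 1.0 - sum(x ** 2 for x in p)

def prob_different_types(p):
    """Probability that two independent draws (with replacement) differ."""
    pairs = product(enumerate(p), repeat=2)
    return sum(a * b for (i, a), (j, b) in pairs if i != j)

# Illustrative proportions (assumed to sum to 1).
p = [0.5, 0.3, 0.2]

print(gini_simpson(p), prob_different_types(p))
```

Both functions return the same value up to floating-point error, since the pairs with matching types are exactly the terms that make up λ.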

Berger–Parker index

The Berger–Parker index equals the maximum p_i value in the dataset, i.e., the proportional abundance of the most abundant type. This corresponds to the weighted generalized mean of the p_i values when q approaches infinity, and hence equals the inverse of the true diversity of order infinity (1/^∞D).
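The limiting behavior can be illustrated by comparing the Berger–Parker index with the reciprocal of a Hill number of large order. The function names and counts below are assumptions for the example, and q = 200 merely stands in for "large q".

```python
def berger_parker(counts):
    """Proportional abundance of the most abundant type."""
    return max(counts) / sum(counts)

def hill_number(p, q):
    """True diversity of order q, for q != 1."""
    return sum(x ** q for x in p) ** (1.0 / (1.0 - q))

# Hypothetical counts, used only for illustration.
counts = [60, 30, 10]
total = sum(counts)
p = [n / total for n in counts]

print(berger_parker(counts))
print(1.0 / hill_number(p, 200))
```

The second value is already close to the first and moves closer as the order increases.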
