statistics Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...

, overdispersion is the presence of greater variability (

statistical dispersion In statistics, dispersion (also called variability, scatter, or spread) is the extent to which a Probability distribution, distribution is stretched or squeezed. Common examples of measures of statistical dispersion are the variance, standard de ...

) in a data set than would be expected based on a given

statistical model A statistical model is a mathematical model that embodies a set of statistical assumptions concerning the generation of Sample (statistics), sample data (and similar data from a larger Statistical population, population). A statistical model repres ...

. A common task in applied

is choosing a

parametric model In statistics, a parametric model or parametric family or finite-dimensional model is a particular class of statistical models. Specifically, a parametric model is a family of probability distributions that has a finite number of parameters. Def ...

to fit a given set of empirical observations. This necessitates an assessment of the fit of the chosen model. It is usually possible to choose the model parameters in such a way that the theoretical

population mean In statistics, a population is a set of similar items or events which is of interest for some question or experiment. A statistical population can be a group of existing objects (e.g. the set of all stars within the Milky Way galaxy) or a hypothe ...

of the model is approximately equal to the

sample mean The sample mean (or "empirical mean") and the sample covariance are statistics computed from a Sample (statistics), sample of data on one or more random variables. The sample mean is the average value (or mean, mean value) of a sample (statistic ...

. However, especially for simple models with few parameters, theoretical predictions may not match empirical observations for higher moments. When the observed

variance In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers ...

is higher than the variance of a theoretical model, overdispersion has occurred. Conversely, underdispersion means that there was less variation in the data than predicted. Overdispersion is a very common feature in applied data analysis because in practice, populations are frequently

heterogeneous Homogeneity and heterogeneity are concepts often used in the sciences and statistics relating to the uniformity of a substance or organism. A material or image that is homogeneous is uniform in composition or character (i.e. color, shape, siz ...

(non-uniform) contrary to the assumptions implicit within widely used simple parametric models.

Examples

Poisson

Overdispersion is often encountered when fitting very simple parametric models, such as those based on the

Poisson distribution In probability theory and statistics, the Poisson distribution is a discrete probability distribution that expresses the probability of a given number of events occurring in a fixed interval of time or space if these events occur with a known co ...

. The Poisson distribution has one free parameter and does not allow for the variance to be adjusted independently of the mean. The choice of a distribution from the Poisson family is often dictated by the nature of the empirical data. For example,

Poisson regression In statistics, Poisson regression is a generalized linear model form of regression analysis used to model count data and contingency tables. Poisson regression assumes the response variable ''Y'' has a Poisson distribution, and assumes the logari ...

analysis is commonly used to model

count data Count (feminine: countess) is a historical title of nobility in certain European countries, varying in relative status, generally of middling rank in the hierarchy of nobility. Pine, L. G. ''Titles: How the King Became His Majesty''. New York: ...

. If overdispersion is a feature, an alternative model with additional free parameters may provide a better fit. In the case of count data, a Poisson

mixture model In statistics, a mixture model is a probabilistic model for representing the presence of subpopulations within an overall population, without requiring that an observed data set should identify the sub-population to which an individual observatio ...

like the negative binomial distribution can be proposed instead, in which the mean of the Poisson distribution can itself be thought of as a random variable drawn – in this case – from the

gamma distribution In probability theory and statistics, the gamma distribution is a two-parameter family of continuous probability distributions. The exponential distribution, Erlang distribution, and chi-square distribution are special cases of the gamma d ...

thereby introducing an additional free parameter (note the resulting negative binomial distribution is completely characterized by two parameters).

Binomial

As a more concrete example, it has been observed that the number of boys born to families does not conform faithfully to a binomial distribution as might be expected. Instead, the sex ratios of families seem to skew toward either boys or girls (see, for example the

Trivers–Willard hypothesis In evolutionary biology and evolutionary psychology, the Trivers–Willard hypothesis, formally proposed by Robert Trivers and Dan Willard in 1973, suggests that female mammals adjust the sex ratio of offspring in response to maternal condition, ...

for one possible explanation) i.e. there are more all-boy families, more all-girl families and not enough families close to the population 51:49 boy-to-girl mean ratio than expected from a binomial distribution, and the resulting empirical variance is larger than specified by a binomial model. In this case, the

beta-binomial model In probability theory and statistics, the beta-binomial distribution is a family of discrete probability distributions on a finite support of non-negative integers arising when the probability of success in each of a fixed or known number of Bern ...

distribution is a popular and analytically tractable alternative model to the binomial distribution since it provides a better fit to the observed data. To capture the heterogeneity of the families, one can think of the probability parameter of the binomial model (say, probability of being a boy) is itself a random variable (i.e.

random effects model In statistics, a random effects model, also called a variance components model, is a statistical model where the model parameters are random variables. It is a kind of hierarchical linear model, which assumes that the data being analysed are dra ...

) drawn for each family from a

beta distribution In probability theory and statistics, the beta distribution is a family of continuous probability distributions defined on the interval , 1in terms of two positive parameters, denoted by ''alpha'' (''α'') and ''beta'' (''β''), that appear as ...

as the mixing distribution. The resulting

compound distribution In probability and statistics, a compound probability distribution (also known as a mixture distribution or contagious distribution) is the probability distribution that results from assuming that a random variable is distributed according to som ...

(beta-binomial) has an additional free parameter. Another common model for overdispersion—when some of the observations are not

Bernoulli Bernoulli can refer to: People *Bernoulli family of 17th and 18th century Swiss mathematicians: ** Daniel Bernoulli (1700–1782), developer of Bernoulli's principle **Jacob Bernoulli (1654–1705), also known as Jacques, after whom Bernoulli numbe ...

—arises from introducing a normal random variable into a logistic model. Software is widely available for fitting this type of

multilevel model Multilevel models (also known as hierarchical linear models, linear mixed-effect model, mixed models, nested data models, random coefficient, random-effects models, random parameter models, or split-plot designs) are statistical models of parame ...

. In this case, if the variance of the normal variable is zero, the model reduces to the standard (undispersed)

logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear function (calculus), linear combination of one or more independent var ...

. This model has an additional free parameter, namely the variance of the normal variable. With respect to binomial random variables, the concept of overdispersion makes sense only if n>1 (i.e. overdispersion is nonsensical for Bernoulli random variables).

Normal distribution

As the

normal distribution In statistics, a normal distribution or Gaussian distribution is a type of continuous probability distribution for a real-valued random variable. The general form of its probability density function is : f(x) = \frac e^ The parameter \mu ...

(Gaussian) has variance as a parameter, any data with finite variance (including any finite data) can be modeled with a normal distribution with the exact variance – the normal distribution is a two-parameter model, with mean and variance. Thus, in the absence of an underlying model, there is no notion of data being overdispersed relative to the normal model, though the fit may be poor in other respects (such as the higher moments of

skew Skew may refer to: In mathematics * Skew lines, neither parallel nor intersecting. * Skew normal distribution, a probability distribution * Skew field or division ring * Skew-Hermitian matrix * Skew lattice * Skew polygon, whose vertices do not ...

kurtosis In probability theory and statistics, kurtosis (from el, κυρτός, ''kyrtos'' or ''kurtos'', meaning "curved, arching") is a measure of the "tailedness" of the probability distribution of a real-valued random variable. Like skewness, kurt ...

, etc.). However, in the case that the data is modeled by a normal distribution with an expected variation, it can be over- or under-dispersed relative to that prediction. For example, in a

statistical survey Survey methodology is "the study of survey methods". As a field of applied statistics concentrating on human-research surveys, survey methodology studies the sampling of individual units from a population and associated techniques of survey da ...

, the

margin of error The margin of error is a statistic expressing the amount of random sampling error in the results of a survey. The larger the margin of error, the less confidence one should have that a poll result would reflect the result of a census of the e ...

(determined by sample size) predicts the

sampling error In statistics, sampling errors are incurred when the statistical characteristics of a population are estimated from a subset, or sample, of that population. Since the sample does not include all members of the population, statistics of the sample ( ...

and hence dispersion of results on repeated surveys. If one performs a

meta-analysis A meta-analysis is a statistical analysis that combines the results of multiple scientific studies. Meta-analyses can be performed when there are multiple scientific studies addressing the same question, with each individual study reporting me ...

of repeated surveys of a fixed population (say with a given sample size, so margin of error is the same), one expects the results to fall on normal distribution with standard deviation equal to the margin of error. However, in the presence of

study heterogeneity In statistics, (between-) study heterogeneity is a phenomenon that commonly occurs when attempting to undertake a meta-analysis. In a simplistic scenario, studies whose results are to be combined in the meta-analysis would all be undertaken in the ...

where studies have different

sampling bias In statistics, sampling bias is a bias in which a sample is collected in such a way that some members of the intended population have a lower or higher sampling probability than others. It results in a biased sample of a population (or non-human fa ...

, the distribution is instead a

and will be overdistributed relative to the predicted distribution. For example, given repeated

opinion poll An opinion poll, often simply referred to as a survey or a poll (although strictly a poll is an actual election) is a human research survey of public opinion from a particular sample. Opinion polls are usually designed to represent the opinions ...

s all with a margin of error of 3%, if they are conducted by different polling organizations, one expects the results to have standard deviation greater than 3%, due to pollster bias from different methodologies.

Differences in terminology among disciplines

Over- and underdispersion are terms which have been adopted in branches of the

biological science Biology is the scientific study of life. It is a natural science with a broad scope but has several unifying themes that tie it together as a single, coherent field. For instance, all organisms are made up of cells that process hereditary ...

s. In parasitology, the term 'overdispersion' is generally used as defined here – meaning a distribution with a higher than expected variance. In some areas of

ecology Ecology () is the study of the relationships between living organisms, including humans, and their physical environment. Ecology considers organisms at the individual, population, community, ecosystem, and biosphere level. Ecology overlaps wi ...

, however, meanings have been transposed, so that overdispersion is actually taken to mean more even (lower variance) than expected. This confusion has caused some ecologists to suggest that the terms 'aggregated', or 'contagious', would be better used in ecology for 'overdispersed'. Such preferences are creeping into parasitology too. Generally this suggestion has not been heeded, and confusion persists in the literature. Furthermore in

demography Demography () is the statistics, statistical study of populations, especially human beings. Demographic analysis examines and measures the dimensions and Population dynamics, dynamics of populations; it can cover whole societies or groups ...

, overdispersion is often evident in the analysis of death count data, but demographers prefer the term '

unobserved heterogeneity In economic theory and econometrics, the term heterogeneity refers to differences across the units being studied. For example, a macroeconomic model in which consumers are assumed to differ from one another is said to have heterogeneous agents. U ...

References