Likelihood function

A likelihood function (often simply called the likelihood) measures how well a statistical model explains observed data by calculating the probability of seeing that data under different parameter values of the model. It is constructed from the joint probability distribution of the random variable that (presumably) generated the observations. When evaluated on the actual data points, it becomes a function solely of the model parameters. In maximum likelihood estimation, the argument that maximizes the likelihood function serves as a point estimate for the unknown parameter, while the Fisher information (often approximated by the likelihood's Hessian matrix at the maximum) gives an indication of the estimate's precision. In contrast, in Bayesian statistics, the estimate of interest is the ''converse'' of the likelihood, the so-called posterior probability of the parameter given the observed data, which is calculated via Bayes' rule.


Definition

The likelihood function, parameterized by a (possibly multivariate) parameter \theta, is usually defined differently for discrete and continuous probability distributions (a more general definition is discussed below). Given a probability density or mass function x\mapsto f(x \mid \theta), where x is a realization of the random variable X, the likelihood function is \theta\mapsto f(x \mid \theta), often written \mathcal{L}(\theta \mid x). In other words, when f(x\mid\theta) is viewed as a function of x with \theta fixed, it is a probability density function, and when viewed as a function of \theta with x fixed, it is a likelihood function. In the frequentist paradigm, the notation f(x\mid\theta) is often avoided and instead f(x;\theta) or f(x,\theta) are used to indicate that \theta is regarded as a fixed unknown quantity rather than as a random variable being conditioned on. The likelihood function does ''not'' specify the probability that \theta is the truth, given the observed sample X = x. Such an interpretation is a common error, with potentially disastrous consequences (see prosecutor's fallacy).
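
The dual reading of f(x \mid \theta) can be made concrete with a short numerical sketch. The following Python snippet (not part of the original article; it assumes SciPy is available and uses a normal model with unknown mean and known unit variance purely for illustration) evaluates the same function once as a density in x and once as a likelihood in \theta:

```python
# Dual reading of f(x | theta): a density in x for fixed theta, a likelihood in theta
# for a fixed observation x.  Model: normal with mean theta and known unit variance
# (an illustrative assumption, not prescribed by the text).
import numpy as np
from scipy.stats import norm

def f(x, theta):
    return norm.pdf(x, loc=theta, scale=1.0)

theta_fixed = 0.0
xs = np.linspace(-3, 3, 7)
density_values = f(xs, theta_fixed)        # function of x: integrates to 1 over x

x_observed = 1.2
thetas = np.linspace(-3, 3, 7)
likelihood_values = f(x_observed, thetas)  # function of theta: need not integrate to 1
```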


Discrete probability distribution

Let X be a discrete random variable with probability mass function p depending on a parameter \theta. Then the function \mathcal{L}(\theta \mid x) = p_\theta (x) = P_\theta (X=x), considered as a function of \theta, is the ''likelihood function'', given the outcome x of the random variable X. Sometimes the probability of "the value x of X for the parameter value \theta" is written as P(X = x \mid \theta) or P(X = x; \theta). The likelihood is the probability that a particular outcome x is observed when the true value of the parameter is \theta, equivalent to the probability mass on x; it is ''not'' a probability density over the parameter \theta. The likelihood, \mathcal{L}(\theta \mid x), should not be confused with P(\theta \mid x), which is the posterior probability of \theta given the data x.


Example

Consider a simple statistical model of a coin flip: a single parameter p_\text{H} that expresses the "fairness" of the coin. The parameter is the probability that the coin lands heads up ("H") when tossed. p_\text{H} can take on any value within the range 0.0 to 1.0. For a perfectly fair coin, p_\text{H} = 0.5. Imagine flipping a fair coin twice, and observing two heads in two tosses ("HH"). Assuming that each successive coin flip is i.i.d., the probability of observing HH is P(\text{HH} \mid p_\text{H}=0.5) = 0.5^2 = 0.25. Equivalently, the likelihood of p_\text{H} = 0.5, given the observation "HH", is \mathcal{L}(p_\text{H}=0.5 \mid \text{HH}) = 0.25. This is not the same as saying that P(p_\text{H} = 0.5 \mid \text{HH}) = 0.25, a conclusion which could only be reached via Bayes' theorem given knowledge about the marginal probabilities P(p_\text{H} = 0.5) and P(\text{HH}). Now suppose that the coin is not a fair coin, but instead that p_\text{H} = 0.3. Then the probability of two heads on two flips is P(\text{HH} \mid p_\text{H}=0.3) = 0.3^2 = 0.09. Hence \mathcal{L}(p_\text{H}=0.3 \mid \text{HH}) = 0.09. More generally, for each value of p_\text{H}, we can calculate the corresponding likelihood. The result of such calculations is displayed in Figure 1. The integral of \mathcal{L} over [0, 1] is 1/3; likelihoods need not integrate or sum to one over the parameter space.
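
A minimal numerical sketch of this example (not part of the original article) evaluates \mathcal{L}(p_\text{H} \mid \text{HH}) = p_\text{H}^2 on a grid, recovers the maximizer, and checks the 1/3 area quoted above:

```python
# Coin-flip example: the likelihood of p_H given the data "HH" is L(p_H | HH) = p_H**2.
import numpy as np

p_grid = np.linspace(0.0, 1.0, 1001)
likelihood = p_grid ** 2                 # two independent heads

print(p_grid[np.argmax(likelihood)])     # 1.0: the value of p_H with maximum likelihood
print(np.trapz(likelihood, p_grid))      # ~0.333: likelihoods need not integrate to 1
```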


Continuous probability distribution

Let X be a random variable following an absolutely continuous probability distribution with density function f (a function of x) which depends on a parameter \theta. Then the function \mathcal{L}(\theta \mid x) = f_\theta (x), considered as a function of \theta, is the ''likelihood function'' (of \theta, given the outcome X=x). Again, \mathcal{L} is not a probability density or mass function over \theta, despite being a function of \theta given the observation X = x.


Relationship between the likelihood and probability density functions

The use of the probability density in specifying the likelihood function above is justified as follows. Given an observation x_j, the likelihood for the interval [x_j, x_j + h], where h > 0 is a constant, is given by \mathcal{L}(\theta \mid x \in [x_j, x_j + h]). Observe that \operatorname{argmax}_\theta \mathcal{L}(\theta \mid x \in [x_j, x_j + h]) = \operatorname{argmax}_\theta \frac{1}{h} \mathcal{L}(\theta \mid x \in [x_j, x_j + h]), since h is positive and constant. Because \operatorname{argmax}_\theta \frac{1}{h} \mathcal{L}(\theta \mid x \in [x_j, x_j + h]) = \operatorname{argmax}_\theta \frac{1}{h} \Pr(x_j \leq x \leq x_j + h \mid \theta) = \operatorname{argmax}_\theta \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid \theta) \,dx, where f(x\mid \theta) is the probability density function, it follows that \operatorname{argmax}_\theta \mathcal{L}(\theta \mid x \in [x_j, x_j + h]) = \operatorname{argmax}_\theta \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx. The first fundamental theorem of calculus provides that \lim_{h \to 0^+} \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx = f(x_j \mid \theta). Then \begin{align} \operatorname{argmax}_\theta \mathcal{L}(\theta \mid x_j) &= \operatorname{argmax}_\theta \left[ \lim_{h \to 0^+} \mathcal{L}(\theta \mid x \in [x_j, x_j + h]) \right] \\ &= \operatorname{argmax}_\theta \left[ \lim_{h \to 0^+} \frac{1}{h} \int_{x_j}^{x_j+h} f(x\mid\theta) \,dx \right] \\ &= \operatorname{argmax}_\theta f(x_j \mid \theta). \end{align} Therefore, \operatorname{argmax}_\theta \mathcal{L}(\theta \mid x_j) = \operatorname{argmax}_\theta f(x_j \mid \theta), and so maximizing the probability density at x_j amounts to maximizing the likelihood of the specific observation x_j.


In general

In measure-theoretic probability theory, the density function is defined as the Radon–Nikodym derivative of the probability distribution relative to a common dominating measure. The likelihood function is this density interpreted as a function of the parameter, rather than the random variable. Thus, we can construct a likelihood function for any distribution, whether discrete, continuous, a mixture, or otherwise. (Likelihoods are comparable, e.g. for parameter estimation, only if they are Radon–Nikodym derivatives with respect to the same dominating measure.) The above discussion of the likelihood for discrete random variables uses the counting measure, under which the probability density at any outcome equals the probability of that outcome.


Likelihoods for mixed continuous–discrete distributions

The above can be extended in a simple way to allow consideration of distributions which contain both discrete and continuous components. Suppose that the distribution consists of a number of discrete probability masses p_k(\theta) and a density f(x\mid\theta), where the sum of all the p's added to the integral of f is always one. Assuming that it is possible to distinguish an observation corresponding to one of the discrete probability masses from one which corresponds to the density component, the likelihood function for an observation from the continuous component can be dealt with in the manner shown above. For an observation from the discrete component, the likelihood function is simply \mathcal{L}(\theta \mid x) = p_k(\theta), where k is the index of the discrete probability mass corresponding to observation x, because maximizing the probability mass (or probability) at x amounts to maximizing the likelihood of the specific observation. The fact that the likelihood function can be defined in a way that includes contributions that are not commensurate (the density and the probability mass) arises from the way in which the likelihood function is defined up to a constant of proportionality, where this "constant" can change with the observation x, but not with the parameter \theta.


Regularity conditions

In the context of parameter estimation, the likelihood function is usually assumed to obey certain conditions, known as regularity conditions. These conditions are assumed in various proofs involving likelihood functions, and need to be verified in each particular application. For maximum likelihood estimation, the existence of a global maximum of the likelihood function is of the utmost importance. By the extreme value theorem, it suffices that the likelihood function is continuous on a compact parameter space for the maximum likelihood estimator to exist. While the continuity assumption is usually met, the compactness assumption about the parameter space is often not, as the bounds of the true parameter values might be unknown. In that case, concavity of the likelihood function plays a key role.

More specifically, if the likelihood function is twice continuously differentiable on the k-dimensional parameter space \Theta, assumed to be an open connected subset of \mathbb{R}^k, there exists a unique maximum \hat\theta \in \Theta if the matrix of second partials \mathbf{H}(\theta) \equiv \left[\, \frac{\partial^2 L}{\partial \theta_i \, \partial \theta_j} \,\right] is negative definite for every \theta \in \Theta at which the gradient \nabla L \equiv \left[\, \frac{\partial L}{\partial \theta_i} \,\right] vanishes, and if the likelihood function approaches a constant on the boundary of the parameter space \partial \Theta, i.e., \lim_{\theta \to \partial\Theta} L(\theta) = 0, which may include the points at infinity if \Theta is unbounded. Mäkeläinen and co-authors prove this result using Morse theory while informally appealing to a mountain pass property. Mascarenhas restates their proof using the mountain pass theorem.

In the proofs of consistency and asymptotic normality of the maximum likelihood estimator, additional assumptions are made about the probability densities that form the basis of a particular likelihood function. These conditions were first established by Chanda. In particular, for almost all x, and for all \theta \in \Theta, the derivatives \frac{\partial \log f}{\partial \theta_r}, \quad \frac{\partial^2 \log f}{\partial \theta_r \, \partial \theta_s}, \quad \frac{\partial^3 \log f}{\partial \theta_r \, \partial \theta_s \, \partial \theta_t} must exist for all r, s, t = 1, 2, \ldots, k in order to ensure the existence of a Taylor expansion. Second, for almost all x and for every \theta \in \Theta it must be that \left| \frac{\partial f}{\partial \theta_r} \right| < F_r(x), \quad \left| \frac{\partial^2 f}{\partial \theta_r \, \partial \theta_s} \right| < F_{rs}(x), \quad \left| \frac{\partial^3 \log f}{\partial \theta_r \, \partial \theta_s \, \partial \theta_t} \right| < H_{rst}(x), where H_{rst} is such that \int_{-\infty}^{\infty} H_{rst}(z) \, \mathrm{d}z \leq M < \infty. This boundedness of the derivatives is needed to allow for differentiation under the integral sign. And lastly, it is assumed that the information matrix \mathbf{I}(\theta) = \int_{-\infty}^{\infty} \frac{\partial \log f}{\partial \theta_r} \, \frac{\partial \log f}{\partial \theta_s} \, f \, \mathrm{d}z is positive definite and \left| \mathbf{I}(\theta) \right| is finite. This ensures that the score has a finite variance.

The above conditions are sufficient, but not necessary. That is, a model that does not meet these regularity conditions may or may not have a maximum likelihood estimator with the properties mentioned above. Further, in the case of non-independently or non-identically distributed observations, additional properties may need to be assumed. In Bayesian statistics, almost identical regularity conditions are imposed on the likelihood function in order to prove asymptotic normality of the posterior probability, and therefore to justify a Laplace approximation of the posterior in large samples.


Likelihood ratio and relative likelihood


Likelihood ratio

A ''likelihood ratio'' is the ratio of any two specified likelihoods, frequently written as \Lambda(\theta_1:\theta_2 \mid x) = \frac{\mathcal{L}(\theta_1 \mid x)}{\mathcal{L}(\theta_2 \mid x)}. The likelihood ratio is central to likelihoodist statistics: the ''law of likelihood'' states that the degree to which data (considered as evidence) supports one parameter value versus another is measured by the likelihood ratio.

In frequentist inference, the likelihood ratio is the basis for a test statistic, the so-called likelihood-ratio test. By the Neyman–Pearson lemma, this is the most powerful test for comparing two simple hypotheses at a given significance level. Numerous other tests can be viewed as likelihood-ratio tests or approximations thereof. The asymptotic distribution of the log-likelihood ratio, considered as a test statistic, is given by Wilks' theorem.

The likelihood ratio is also of central importance in Bayesian inference, where it is known as the Bayes factor, and is used in Bayes' rule. Stated in terms of odds, Bayes' rule states that the ''posterior'' odds of two alternatives, A_1 and A_2, given an event B, is the ''prior'' odds times the likelihood ratio. As an equation: O(A_1:A_2 \mid B) = O(A_1:A_2) \cdot \Lambda(A_1:A_2 \mid B). The likelihood ratio is not directly used in AIC-based statistics. Instead, what is used is the relative likelihood of models (see below).

In evidence-based medicine, likelihood ratios are used in diagnostic testing to assess the value of performing a diagnostic test.
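
As a small illustration (not from the original article), the likelihood ratio for the coin-flip example above compares p_\text{H} = 0.5 against p_\text{H} = 0.3 on the data "HH":

```python
# Likelihood ratio Lambda(0.5 : 0.3 | HH) for the coin-flip example.
def coin_likelihood(p_heads, n_heads, n_tails):
    return p_heads ** n_heads * (1 - p_heads) ** n_tails

L_fair = coin_likelihood(0.5, n_heads=2, n_tails=0)   # 0.25
L_bent = coin_likelihood(0.3, n_heads=2, n_tails=0)   # 0.09

print(L_fair / L_bent)   # ~2.78: the data favor p_H = 0.5 over p_H = 0.3 by this factor
```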


Relative likelihood function

Since the actual value of the likelihood function depends on the sample, it is often convenient to work with a standardized measure. Suppose that the maximum likelihood estimate for the parameter \theta is \hat\theta. Relative plausibilities of other \theta values may be found by comparing the likelihoods of those other values with the likelihood of \hat\theta. The ''relative likelihood'' of \theta is defined to be (§9.3; Sprott, D. A. (2000), ''Statistical Inference in Science'', Springer, chap. 2) R(\theta) = \frac{\mathcal{L}(\theta \mid x)}{\mathcal{L}(\hat\theta \mid x)}. Thus, the relative likelihood is the likelihood ratio (discussed above) with the fixed denominator \mathcal{L}(\hat\theta). This corresponds to standardizing the likelihood to have a maximum of 1.
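
A short sketch (not part of the original article; the data set of 7 heads in 10 tosses is an illustrative assumption) computes the relative likelihood of a binomial proportion by rescaling the likelihood to peak at 1:

```python
# Relative likelihood R(p) = L(p) / L(p_hat) for hypothetical binomial data:
# 7 heads in 10 tosses, so the maximum likelihood estimate is p_hat = 0.7.
import numpy as np

heads, tosses = 7, 10
p_grid = np.linspace(0.001, 0.999, 999)
log_lik = heads * np.log(p_grid) + (tosses - heads) * np.log(1 - p_grid)
relative_likelihood = np.exp(log_lik - log_lik.max())   # in (0, 1], equals 1 at p = 0.7
```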


Likelihood region

A ''likelihood region'' is the set of all values of \theta whose relative likelihood is greater than or equal to a given threshold. In terms of percentages, a ''p% likelihood region'' for \theta is defined to be \left\{ \theta : R(\theta) \geq \frac{p}{100} \right\}. If \theta is a single real parameter, a p% likelihood region will usually comprise an interval of real values. If the region does comprise an interval, then it is called a ''likelihood interval''.

Likelihood intervals, and more generally likelihood regions, are used for interval estimation within likelihoodist statistics: they are similar to confidence intervals in frequentist statistics and credible intervals in Bayesian statistics. Likelihood intervals are interpreted directly in terms of relative likelihood, not in terms of coverage probability (frequentism) or posterior probability (Bayesianism).

Given a model, likelihood intervals can be compared to confidence intervals. If \theta is a single real parameter, then under certain conditions, a 14.65% likelihood interval (about 1:7 likelihood) for \theta will be the same as a 95% confidence interval (19/20 coverage probability). In a slightly different formulation suited to the use of log-likelihoods (see Wilks' theorem), the test statistic is twice the difference in log-likelihoods and the probability distribution of the test statistic is approximately a chi-squared distribution with degrees of freedom (df) equal to the difference in df's between the two models (therefore, the e^{-2} likelihood interval is the same as the 0.954 confidence interval, assuming the difference in df's to be 1).
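
Continuing the hypothetical binomial sketch above (7 heads in 10 tosses, an illustrative assumption rather than data from the article), the 14.65% likelihood interval can be read off the relative likelihood curve:

```python
# 14.65% likelihood interval: the set of p with R(p) >= exp(-3.84/2) ~ 0.1465,
# which asymptotically corresponds to an approximate 95% confidence interval (1 df).
import numpy as np

heads, tosses = 7, 10
p_grid = np.linspace(0.001, 0.999, 9999)
log_lik = heads * np.log(p_grid) + (tosses - heads) * np.log(1 - p_grid)
relative = np.exp(log_lik - log_lik.max())

inside = p_grid[relative >= 0.1465]
print(inside.min(), inside.max())   # endpoints of the likelihood interval for p
```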


Likelihoods that eliminate nuisance parameters

In many cases, the likelihood is a function of more than one parameter but interest focuses on the estimation of only one, or at most a few of them, with the others being considered as nuisance parameters. Several alternative approaches have been developed to eliminate such nuisance parameters, so that a likelihood can be written as a function of only the parameter (or parameters) of interest: the main approaches are profile, conditional, and marginal likelihoods. These approaches are also useful when a high-dimensional likelihood surface needs to be reduced to one or two parameters of interest in order to allow a graph.


Profile likelihood

It is possible to reduce the dimensions by concentrating the likelihood function for a subset of parameters by expressing the nuisance parameters as functions of the parameters of interest and replacing them in the likelihood function. In general, for a likelihood function depending on the parameter vector \boldsymbol\theta that can be partitioned into \boldsymbol\theta = \left( \boldsymbol\theta_1 : \boldsymbol\theta_2 \right), and where a correspondence \hat{\boldsymbol\theta}_2 = \hat{\boldsymbol\theta}_2 \left( \boldsymbol\theta_1 \right) can be determined explicitly, concentration reduces the computational burden of the original maximization problem.

For instance, in a linear regression with normally distributed errors, \mathbf{y} = \mathbf{X} \beta + u, the coefficient vector could be partitioned into \beta = \left[ \beta_1 : \beta_2 \right] (and consequently the design matrix \mathbf{X} = \left[ \mathbf{X}_1 : \mathbf{X}_2 \right]). Maximizing with respect to \beta_2 yields an optimal value function \beta_2 (\beta_1) = \left( \mathbf{X}_2^{\mathsf T} \mathbf{X}_2 \right)^{-1} \mathbf{X}_2^{\mathsf T} \left( \mathbf{y} - \mathbf{X}_1 \beta_1 \right). Using this result, the maximum likelihood estimator for \beta_1 can then be derived as \hat\beta_1 = \left( \mathbf{X}_1^{\mathsf T} \left( \mathbf{I} - \mathbf{P}_2 \right) \mathbf{X}_1 \right)^{-1} \mathbf{X}_1^{\mathsf T} \left( \mathbf{I} - \mathbf{P}_2 \right) \mathbf{y}, where \mathbf{P}_2 = \mathbf{X}_2 \left( \mathbf{X}_2^{\mathsf T} \mathbf{X}_2 \right)^{-1} \mathbf{X}_2^{\mathsf T} is the projection matrix of \mathbf{X}_2. This result is known as the Frisch–Waugh–Lovell theorem.

Since graphically the procedure of concentration is equivalent to slicing the likelihood surface along the ridge of values of the nuisance parameter \beta_2 that maximizes the likelihood function, creating an isometric profile of the likelihood function for a given \beta_1, the result of this procedure is also known as the ''profile likelihood''. In addition to being graphed, the profile likelihood can also be used to compute confidence intervals that often have better small-sample properties than those based on asymptotic standard errors calculated from the full likelihood.
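
The concentration step can be checked numerically. The following sketch (not part of the original article; the simulated data and dimensions are illustrative assumptions) verifies on random data that partialling \mathbf{X}_2 out of the regression reproduces the joint least-squares estimate of \beta_1, as the Frisch–Waugh–Lovell theorem asserts:

```python
# Frisch-Waugh-Lovell check on simulated data: the profiled estimator
# (X1' M2 X1)^{-1} X1' M2 y equals the beta1 block of the full least-squares fit.
import numpy as np

rng = np.random.default_rng(0)
n = 200
X1 = rng.normal(size=(n, 2))
X2 = rng.normal(size=(n, 3))
X = np.hstack([X1, X2])
beta_true = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ beta_true + rng.normal(size=n)

beta_full = np.linalg.lstsq(X, y, rcond=None)[0]     # joint fit of (beta1, beta2)

P2 = X2 @ np.linalg.solve(X2.T @ X2, X2.T)           # projection onto the columns of X2
M2 = np.eye(n) - P2                                  # annihilator of X2
beta1_profiled = np.linalg.solve(X1.T @ M2 @ X1, X1.T @ M2 @ y)

print(np.allclose(beta_full[:2], beta1_profiled))    # True: the estimates coincide
```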


Conditional likelihood

Sometimes it is possible to find a sufficient statistic for the nuisance parameters, and conditioning on this statistic results in a likelihood which does not depend on the nuisance parameters. One example occurs in 2×2 tables, where conditioning on all four marginal totals leads to a conditional likelihood based on the non-central hypergeometric distribution. This form of conditioning is also the basis for Fisher's exact test.


Marginal likelihood

Sometimes we can remove the nuisance parameters by considering a likelihood based on only part of the information in the data, for example by using the set of ranks rather than the numerical values. Another example occurs in linear mixed models, where considering a likelihood for the residuals only after fitting the fixed effects leads to residual maximum likelihood estimation of the variance components.


Partial likelihood

A partial likelihood is an adaptation of the full likelihood such that only a part of the parameters (the parameters of interest) occur in it. It is a key component of the proportional hazards model: using a restriction on the hazard function, the likelihood does not contain the shape of the hazard over time.


Products of likelihoods

The likelihood, given two or more independent events, is the product of the likelihoods of each of the individual events: \Lambda(A \mid X_1 \land X_2) = \Lambda(A \mid X_1) \cdot \Lambda(A \mid X_2). This follows from the definition of independence in probability: the probabilities of two independent events happening, given a model, is the product of the probabilities.

This is particularly important when the events are from independent and identically distributed random variables, such as independent observations or sampling with replacement. In such a situation, the likelihood function factors into a product of individual likelihood functions. The empty product has value 1, which corresponds to the likelihood, given no event, being 1: before any data, the likelihood is always 1. This is similar to a uniform prior in Bayesian statistics, but in likelihoodist statistics this is not an improper prior because likelihoods are not integrated.
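
A tiny sketch (not from the original article) of this product rule, reusing the coin model: with two independent tosses observed as heads then tails, the joint likelihood of p_\text{H} is the product of the per-toss likelihoods.

```python
# Product of likelihoods over independent tosses observed as "H" then "T".
def toss_likelihood(p_heads, outcome):
    return p_heads if outcome == "H" else 1 - p_heads

p = 0.3
joint = toss_likelihood(p, "H") * toss_likelihood(p, "T")   # 0.3 * 0.7 = 0.21
print(joint)
```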


Log-likelihood

The ''log-likelihood function'' is the logarithm of the likelihood function, often denoted by a lowercase l or \ell, to contrast with the uppercase L or \mathcal{L} for the likelihood. Because logarithms are strictly increasing functions, maximizing the likelihood is equivalent to maximizing the log-likelihood. But for practical purposes it is more convenient to work with the log-likelihood function in maximum likelihood estimation, in particular since most common probability distributions, notably the exponential family, are only logarithmically concave, and concavity of the objective function plays a key role in the maximization.

Given the independence of each event, the overall log-likelihood of an intersection equals the sum of the log-likelihoods of the individual events. This is analogous to the fact that the overall log-probability is the sum of the log-probabilities of the individual events. In addition to the mathematical convenience, the adding of log-likelihoods has an intuitive interpretation, often expressed as "support" from the data. When the parameters are estimated using the log-likelihood for maximum likelihood estimation, each data point is used by being added to the total log-likelihood. As the data can be viewed as evidence that supports the estimated parameters, this process can be interpreted as "support from independent evidence ''adds''", and the log-likelihood is the "weight of evidence". Interpreting negative log-probability as information content or surprisal, the support (log-likelihood) of a model, given an event, is the negative of the surprisal of the event, given the model: a model is supported by an event to the extent that the event is unsurprising, given the model.

A logarithm of a likelihood ratio is equal to the difference of the log-likelihoods: \log \frac{\mathcal{L}(A)}{\mathcal{L}(B)} = \log \mathcal{L}(A) - \log \mathcal{L}(B) = \ell(A) - \ell(B). Just as the likelihood, given no event, is 1, the log-likelihood, given no event, is 0, which corresponds to the value of the empty sum: without any data, there is no support for any model.
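
A brief sketch (not part of the original article; the i.i.d. normal model with known unit variance and the four data values are illustrative assumptions) shows the additive "support" reading: each observation contributes one term to the total log-likelihood, and a difference of log-likelihoods is the log of a likelihood ratio.

```python
# Log-likelihood as additive support for i.i.d. normal data with unknown mean mu.
import numpy as np

def log_likelihood(mu, data):
    return np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (data - mu) ** 2)

data = np.array([1.1, 0.4, 1.9, 0.7])
print(log_likelihood(1.0, data) - log_likelihood(0.0, data))   # log of the likelihood ratio
```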


Graph

The graph of the log-likelihood is called the support curve (in the univariate case). In the multivariate case, the concept generalizes into a support surface over the parameter space. It has a relation to, but is distinct from, the support of a distribution. The term was coined by A. W. F. Edwards in the context of statistical hypothesis testing, i.e. whether or not the data "support" one hypothesis (or parameter value) being tested more than any other.

The log-likelihood function being plotted is used in the computation of the score (the gradient of the log-likelihood) and Fisher information (the curvature of the log-likelihood). Thus, the graph has a direct interpretation in the context of maximum likelihood estimation and likelihood-ratio tests.


Likelihood equations

If the log-likelihood function is smooth, its gradient with respect to the parameter, known as the score and written s_n(\theta) \equiv \nabla_\theta \ell_n(\theta), exists and allows for the application of differential calculus. The basic way to maximize a differentiable function is to find the stationary points (the points where the derivative is zero); since the derivative of a sum is just the sum of the derivatives, but the derivative of a product requires the product rule, it is easier to compute the stationary points of the log-likelihood of independent events than of the likelihood of independent events.

The equations defined by the stationary points of the score function serve as estimating equations for the maximum likelihood estimator: s_n(\theta) = \mathbf{0}. In that sense, the maximum likelihood estimator is implicitly defined by the value at \mathbf{0} of the inverse function s_n^{-1}: \mathbb{E}^d \to \Theta, where \mathbb{E}^d is the d-dimensional Euclidean space, and \Theta is the parameter space. Using the inverse function theorem, it can be shown that s_n^{-1} is well-defined in an open neighborhood about \mathbf{0} with probability going to one, and \hat\theta_n = s_n^{-1}(\mathbf{0}) is a consistent estimate of \theta. As a consequence there exists a sequence \left\{ \hat\theta_n \right\} such that s_n(\hat\theta_n) = \mathbf{0} asymptotically almost surely, and \hat\theta_n \xrightarrow{\text{p}} \theta_0. A similar result can be established using Rolle's theorem.

The second derivative evaluated at \hat\theta, known as Fisher information, determines the curvature of the likelihood surface, and thus indicates the precision of the estimate.
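
As an illustration (not part of the original article; the Poisson model and counts are assumptions chosen so that the answer is known in closed form), the likelihood equation s_n(\theta) = 0 can be solved numerically with Newton's method, and the iteration recovers the sample mean, the known maximum likelihood estimate of a Poisson rate:

```python
# Newton's method on the score of an i.i.d. Poisson(theta) sample.
import numpy as np

data = np.array([2, 3, 1, 4, 2, 5, 3])

def score(theta):
    return data.sum() / theta - data.size       # first derivative of the log-likelihood

def score_derivative(theta):
    return -data.sum() / theta ** 2             # second derivative of the log-likelihood

theta = 1.0                                     # starting value
for _ in range(20):
    theta -= score(theta) / score_derivative(theta)

print(theta, data.mean())                       # both ~2.857: the root is the sample mean
```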


Exponential families

The log-likelihood is also particularly useful for exponential families of distributions, which include many of the common parametric probability distributions. The probability distribution function (and thus likelihood function) for exponential families contains products of factors involving exponentiation. The logarithm of such a function is a sum of products, again easier to differentiate than the original function.

An exponential family is one whose probability density function is of the form (for some functions, writing \langle -, - \rangle for the inner product): p(x \mid \boldsymbol\theta) = h(x) \exp\Big(\langle \boldsymbol\eta(\boldsymbol\theta), \mathbf{T}(x)\rangle - A(\boldsymbol\theta) \Big). Each of these terms has an interpretation, but simply switching from probability to likelihood and taking logarithms yields the sum: \ell(\boldsymbol\theta \mid x) = \langle \boldsymbol\eta(\boldsymbol\theta), \mathbf{T}(x)\rangle - A(\boldsymbol\theta) + \log h(x). The \boldsymbol\eta(\boldsymbol\theta) and h(x) each correspond to a change of coordinates, so in these coordinates, the log-likelihood of an exponential family is given by the simple formula: \ell(\boldsymbol\eta \mid x) = \langle \boldsymbol\eta, \mathbf{T}(x)\rangle - A(\boldsymbol\eta). In words, the log-likelihood of an exponential family is the inner product of the natural parameter \boldsymbol\eta and the sufficient statistic \mathbf{T}(x), minus the normalization factor (log-partition function) A(\boldsymbol\eta). Thus for example the maximum likelihood estimate can be computed by taking derivatives of the sufficient statistic \mathbf{T} and the log-partition function A.
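
As a short worked illustration (not part of the original article), the Bernoulli distribution can be written in this exponential-family form, making the natural parameter, sufficient statistic, and log-partition function explicit:

```latex
% Bernoulli(p) on x in {0,1}:  p(x \mid p) = p^x (1-p)^{1-x}
\begin{align}
p(x \mid p) &= \exp\!\Big( x \log\tfrac{p}{1-p} + \log(1-p) \Big)
             = h(x)\, \exp\!\big( \eta\, T(x) - A(\eta) \big), \\
h(x) &= 1, \qquad T(x) = x, \qquad \eta = \log\tfrac{p}{1-p}, \qquad A(\eta) = \log\!\big(1 + e^{\eta}\big).
\end{align}
% In natural coordinates \ell(\eta \mid x) = \eta\, T(x) - A(\eta); setting the derivative
% T(x) - A'(\eta) = 0 gives \hat{p} = x for a single draw, or the sample mean for i.i.d. data.
```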


Example: the gamma distribution

The gamma distribution is an exponential family with two parameters, \alpha and \beta. The likelihood function is \mathcal{L}(\alpha, \beta \mid x) = \frac{\beta^\alpha}{\Gamma(\alpha)} x^{\alpha-1} e^{-\beta x}. Finding the maximum likelihood estimate of \beta for a single observed value x looks rather daunting. Its logarithm is much simpler to work with: \log \mathcal{L}(\alpha,\beta \mid x) = \alpha \log \beta - \log \Gamma(\alpha) + (\alpha-1) \log x - \beta x. To maximize the log-likelihood, we first take the partial derivative with respect to \beta: \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x)}{\partial \beta} = \frac{\alpha}{\beta} - x. If there are a number of independent observations x_1, \ldots, x_n, then the joint log-likelihood will be the sum of individual log-likelihoods, and the derivative of this sum will be a sum of derivatives of each individual log-likelihood: \begin{align} & \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_1, \ldots, x_n)}{\partial \beta} \\ &= \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_1)}{\partial \beta} + \cdots + \frac{\partial \log \mathcal{L}(\alpha,\beta \mid x_n)}{\partial \beta} \\ &= \frac{n \alpha}{\beta} - \sum_{i=1}^n x_i. \end{align} To complete the maximization procedure for the joint log-likelihood, the equation is set to zero and solved for \beta: \widehat\beta = \frac{\alpha}{\bar{x}}. Here \widehat\beta denotes the maximum-likelihood estimate, and \textstyle \bar{x} = \frac{1}{n} \sum_{i=1}^n x_i is the sample mean of the observations.
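
A quick numerical check of this closed-form result (not part of the original article; the shape \alpha, rate \beta, and sample size are illustrative assumptions) on simulated gamma data:

```python
# Verify beta_hat = alpha / x_bar on simulated Gamma(alpha, rate=beta) data, alpha known.
import numpy as np

rng = np.random.default_rng(1)
alpha, beta_true = 2.5, 4.0
x = rng.gamma(shape=alpha, scale=1.0 / beta_true, size=10_000)

beta_hat = alpha / x.mean()     # maximizer of the joint log-likelihood in beta
print(beta_hat)                 # close to 4.0 for a sample this large
```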


Background and interpretation


Historical remarks

The term "likelihood" has been in use in English since at least late
Middle English Middle English (abbreviated to ME) is a form of the English language that was spoken after the Norman Conquest of 1066, until the late 15th century. The English language underwent distinct variations and developments following the Old English pe ...
. Its formal use to refer to a specific function in mathematical statistics was proposed by
Ronald Fisher Sir Ronald Aylmer Fisher (17 February 1890 – 29 July 1962) was a British polymath who was active as a mathematician, statistician, biologist, geneticist, and academic. For his work in statistics, he has been described as "a genius who a ...
, in two research papers published in 1921 and 1922. The 1921 paper introduced what is today called a "likelihood interval"; the 1922 paper introduced the term " method of maximum likelihood". Quoting Fisher: The concept of likelihood should not be confused with probability as mentioned by Sir Ronald Fisher Fisher's invention of statistical likelihood was in reaction against an earlier form of reasoning called
inverse probability In probability theory, inverse probability is an old term for the probability distribution of an unobserved variable. Today, the problem of determining an unobserved variable (by whatever method) is called inferential statistics. The method of i ...
. His use of the term "likelihood" fixed the meaning of the term within mathematical statistics.
A. W. F. Edwards Anthony William Fairbank Edwards, Fellow of the Royal Society, FRS One or more of the preceding sentences incorporates text from the royalsociety.org website where: (born 1935) is a British statistician, geneticist and evolutionary biologist. Ed ...
(1972) established the axiomatic basis for use of the log-likelihood ratio as a measure of relative support for one hypothesis against another. The ''support function'' is then the natural logarithm of the likelihood function. Both terms are used in
phylogenetics In biology, phylogenetics () is the study of the evolutionary history of life using observable characteristics of organisms (or genes), which is known as phylogenetic inference. It infers the relationship among organisms based on empirical dat ...
, but were not adopted in a general treatment of the topic of statistical evidence.


Interpretations under different foundations

Among statisticians, there is no consensus about what the foundation of statistics should be. There are four main paradigms that have been proposed for the foundation: frequentism, Bayesianism, likelihoodism, and AIC-based. For each of the proposed foundations, the interpretation of likelihood is different. The four interpretations are described in the subsections below.


Frequentist interpretation


Bayesian interpretation

In Bayesian inference, one can speak about the likelihood of any proposition or random variable given another random variable: for example, the likelihood of a parameter value or of a statistical model (see marginal likelihood), given specified data or other evidence (I. J. Good, ''Probability and the Weighing of Evidence'', Griffin 1950, §6.1; H. Jeffreys, ''Theory of Probability'', 3rd ed., Oxford University Press 1983, §1.22; E. T. Jaynes, ''Probability Theory: The Logic of Science'', Cambridge University Press 2003, §4.1; D. V. Lindley, ''Introduction to Probability and Statistics from a Bayesian Viewpoint. Part 1: Probability'', Cambridge University Press 1980, §1.6). The likelihood function remains the same entity, with the additional interpretations of (i) a conditional density of the data given the parameter (since the parameter is then a random variable) and (ii) a measure or amount of information brought by the data about the parameter value or even the model (A. Gelman, J. B. Carlin, H. S. Stern, D. B. Dunson, A. Vehtari, D. B. Rubin, ''Bayesian Data Analysis'', 3rd ed., Chapman & Hall/CRC 2014, §1.3). Due to the introduction of a probability structure on the parameter space or on the collection of models, it is possible that a parameter value or a statistical model have a large likelihood value for given data, and yet have a low ''probability'', or vice versa. This is often the case in medical contexts. Following Bayes' rule, the likelihood when seen as a conditional density can be multiplied by the prior probability density of the parameter and then normalized, to give a posterior probability density. More generally, the likelihood of an unknown quantity X given another unknown quantity Y is proportional to the ''probability of Y given X''.


Likelihoodist interpretation

In frequentist statistics, the likelihood function is itself a statistic that summarizes a single sample from a population, whose calculated value depends on a choice of several parameters ''θ''1 ... ''θ''p, where ''p'' is the count of parameters in some already-selected statistical model. The value of the likelihood serves as a figure of merit for the choice used for the parameters, and the parameter set with maximum likelihood is the best choice, given the data available. The specific calculation of the likelihood is the probability that the observed sample would be assigned, assuming that the model chosen and the values of the several parameters ''θ'' give an accurate approximation of the frequency distribution of the population that the observed sample was drawn from. Heuristically, it makes sense that a good choice of parameters is one which renders the sample actually observed the maximum possible ''post-hoc'' probability of having happened. Wilks' theorem quantifies the heuristic rule by showing that the difference between the logarithm of the likelihood generated by the estimate's parameter values and the logarithm of the likelihood generated by the population's "true" (but unknown) parameter values is asymptotically χ2 distributed.

Each independent sample's maximum likelihood estimate is a separate estimate of the "true" parameter set describing the population sampled. Successive estimates from many independent samples will cluster together, with the population's "true" set of parameter values hidden somewhere in their midst. The difference in the logarithms of the maximum likelihood and adjacent parameter sets' likelihoods may be used to draw a confidence region on a plot whose co-ordinates are the parameters ''θ''1 ... ''θ''p. The region surrounds the maximum-likelihood estimate, and all points (parameter sets) within that region differ at most in log-likelihood by some fixed value. The χ2 distribution given by Wilks' theorem converts the region's log-likelihood differences into the "confidence" that the population's "true" parameter set lies inside. The art of choosing the fixed log-likelihood difference is to make the confidence acceptably high while keeping the region acceptably small (a narrow range of estimates).

As more data are observed, instead of being used to make independent estimates, they can be combined with the previous samples to make a single combined sample, and that large sample may be used for a new maximum likelihood estimate. As the size of the combined sample increases, the size of the likelihood region with the same confidence shrinks. Eventually, either the size of the confidence region is very nearly a single point, or the entire population has been sampled; in both cases, the estimated parameter set is essentially the same as the population parameter set.


AIC-based interpretation

Under the AIC paradigm, likelihood is interpreted within the context of information theory.



External links


Likelihood function at Planetmath