In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC for sample sizes greater than 7.
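The "greater than 7" threshold can be checked directly: BIC's penalty is k·ln(n) while AIC's is 2k, so per parameter the comparison is ln(n) versus 2. A quick sketch (not from the source):

```python
import math

# BIC's penalty term is k*ln(n); AIC's is 2*k. Per parameter the penalties
# are ln(n) vs. 2, so BIC penalizes more heavily once ln(n) > 2,
# which first happens at n = 8, i.e. for sample sizes greater than 7.
bic_penalty_per_param = {n: math.log(n) for n in (7, 8, 100)}
# ln(7) is about 1.946 (below 2), while ln(8) is about 2.079 (above 2).
```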
The BIC was developed by Gideon E. Schwarz and published in a 1978 paper, where he gave a Bayesian argument for adopting it.
Definition
The BIC is formally defined as
: \mathrm{BIC} = k\ln(n) - 2\ln(\widehat{L})

where
* \widehat{L} = the maximized value of the likelihood function of the model M, i.e. \widehat{L} = p(x\mid\widehat{\theta}, M), where \widehat{\theta} are the parameter values that maximize the likelihood function;
* x = the observed data;
* n = the number of data points in x, the number of observations, or equivalently, the sample size;
* k = the number of parameters estimated by the model. For example, in multiple linear regression, the estimated parameters are the intercept, the q slope parameters, and the constant variance of the errors; thus, k = q + 2.
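The definition translates directly into code. The following sketch computes the BIC from a maximized log-likelihood; the two models and their log-likelihood values are hypothetical numbers chosen for illustration:

```python
import math

def bic(log_likelihood: float, n: int, k: int) -> float:
    """Bayesian information criterion: k*ln(n) - 2*ln(L-hat)."""
    return k * math.log(n) - 2.0 * log_likelihood

# Hypothetical example: two models fit to the same n = 100 observations.
# Model A: maximized log-likelihood -230.0 with k = 3 parameters.
# Model B: maximized log-likelihood -228.5 with k = 5 parameters.
bic_a = bic(-230.0, n=100, k=3)
bic_b = bic(-228.5, n=100, k=5)
# Model A has the lower BIC: the small gain in likelihood does not
# justify two extra parameters at this sample size.
```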
Derivation
Konishi and Kitagawa derive the BIC to approximate the distribution of the data, integrating out the parameters using Laplace's method, starting with the following model evidence:

: p(x\mid M) = \int p(x\mid\theta, M)\,\pi(\theta\mid M)\,d\theta

where \pi(\theta\mid M) is the prior for \theta under model M.

The log-likelihood, \ln(p(x\mid\theta, M)), is then expanded to a second order Taylor series about the MLE, \widehat{\theta}, assuming it is twice differentiable as follows:

: \ln(p(x\mid\theta, M)) = \ln(\widehat{L}) - \frac{n}{2}(\theta - \widehat{\theta})^{\mathsf{T}} \mathcal{I}(\widehat{\theta}) (\theta - \widehat{\theta}) + R(x, \theta),

where \mathcal{I}(\widehat{\theta}) is the average observed information per observation, and R(x, \theta) denotes the residual term. To the extent that R(x, \theta) is negligible and \pi(\theta\mid M) is relatively linear near \widehat{\theta}, we can integrate out \theta to get the following:

: p(x\mid M) \approx \widehat{L} \left(\frac{2\pi}{n}\right)^{k/2} |\mathcal{I}(\widehat{\theta})|^{-1/2} \pi(\widehat{\theta}\mid M).

As n increases, we can ignore |\mathcal{I}(\widehat{\theta})| and \pi(\widehat{\theta}\mid M) as they are O(1). Thus,

: p(x\mid M) \approx \exp\!\left(-\frac{\mathrm{BIC}}{2}\right),

where BIC is defined as above, and \widehat{\theta} either (a) is the Bayesian posterior mode or (b) uses the MLE and the prior \pi(\theta\mid M) has nonzero slope at the MLE. Then the posterior

: p(M\mid x) \propto p(x\mid M)\,p(M) \approx \exp\!\left(-\frac{\mathrm{BIC}}{2}\right) p(M).
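The quality of the Laplace approximation that underlies the derivation can be checked numerically in a case where the exact evidence is known. The sketch below uses a hypothetical Bernoulli model with a flat Beta(1, 1) prior, for which the marginal likelihood is a Beta function in closed form; the data (100 flips, 60 heads) are made up for illustration:

```python
import math

# Hypothetical data: n coin flips with s heads, Bernoulli model,
# flat Beta(1, 1) prior on the success probability theta.
n, s = 100, 60
theta_hat = s / n  # MLE

# Maximized log-likelihood ln(L-hat)
log_L_hat = s * math.log(theta_hat) + (n - s) * math.log(1 - theta_hat)

# Average observed information per observation for the Bernoulli model:
# I(theta_hat) = 1 / (theta_hat * (1 - theta_hat))
info = 1.0 / (theta_hat * (1 - theta_hat))

# Laplace approximation with k = 1 parameter and flat prior pi(theta_hat) = 1:
# L-hat * (2*pi/n)^(1/2) * I^(-1/2)
log_evidence_laplace = (log_L_hat
                        + 0.5 * math.log(2 * math.pi / n)
                        - 0.5 * math.log(info))

# Exact log evidence: the integral of theta^s * (1 - theta)^(n - s) over
# [0, 1] is the Beta function B(s + 1, n - s + 1).
log_evidence_exact = (math.lgamma(s + 1) + math.lgamma(n - s + 1)
                      - math.lgamma(n + 2))

# The two values agree closely already at n = 100.
```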
Usage
When picking from several models, ones with lower BIC values are generally preferred. The BIC is an increasing function of the error variance \sigma_e^2 and an increasing function of k. That is, unexplained variation in the dependent variable and the number of explanatory variables increase the value of BIC. However, a lower BIC does not necessarily indicate one model is better than another. Because it involves approximations, the BIC is merely a heuristic. In particular, differences in BIC should never be treated like transformed Bayes factors.
It is important to keep in mind that the BIC can be used to compare estimated models only when the numerical values of the dependent variable are identical for all models being compared. The models being compared need not be nested, unlike the case when models are being compared using an F-test or a likelihood ratio test.
Properties
* The BIC generally penalizes free parameters more strongly than the Akaike information criterion, though it depends on the size of ''n'' and relative magnitude of ''n'' and ''k''.
* It is independent of the prior.
* It can measure the efficiency of the parameterized model in terms of predicting the data.
* It penalizes the complexity of the model, where complexity refers to the number of parameters in the model.
* It is approximately equal to the minimum description length criterion but with negative sign.
* It can be used to choose the number of clusters according to the intrinsic complexity present in a particular dataset.
* It is closely related to other penalized likelihood criteria such as the deviance information criterion and the Akaike information criterion.
Limitations
The BIC suffers from two main limitations:
# The above approximation is only valid for sample size ''n'' much larger than the number ''k'' of parameters in the model.
# The BIC cannot handle complex collections of models, as in the variable selection (or feature selection) problem in high dimension.
Gaussian special case
Under the assumption that the model errors or disturbances are independent and identically distributed according to a normal distribution and the boundary condition that the derivative of the log likelihood with respect to the true variance is zero, this becomes (''up to an additive constant'', which depends only on ''n'' and not on the model):

: \mathrm{BIC} = n\ln(\widehat{\sigma_e^2}) + k\ln(n),

where \widehat{\sigma_e^2} is the error variance. The error variance in this case is defined as

: \widehat{\sigma_e^2} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \widehat{x_i})^2,

which is a biased estimator for the true variance.

In terms of the residual sum of squares (RSS) the BIC is

: \mathrm{BIC} = n\ln(\mathrm{RSS}/n) + k\ln(n).

When testing multiple linear models against a saturated model, the BIC can be rewritten in terms of the deviance \chi^2 as:

: \mathrm{BIC} = \chi^2 + k\ln(n),

where k is the number of model parameters in the test.
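The RSS form is convenient for comparing regression models fit by least squares. The sketch below, using made-up data, compares an intercept-only model (k = 2: intercept and error variance) with a simple linear regression (k = 3: intercept, slope, and error variance):

```python
import math

def bic_gaussian(rss: float, n: int, k: int) -> float:
    """BIC under i.i.d. Gaussian errors, up to an additive constant:
    n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)

# Toy data: y roughly linear in x with small noise (made-up values).
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.2, 13.8, 16.1, 17.9]
n = len(x)

# Intercept-only model: predict the mean of y everywhere.
ybar = sum(y) / n
rss0 = sum((yi - ybar) ** 2 for yi in y)

# Simple linear regression via least squares.
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar
rss1 = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# The linear model fits far better, so its BIC is lower
# despite the extra parameter.
bic0 = bic_gaussian(rss0, n, k=2)
bic1 = bic_gaussian(rss1, n, k=3)
```

Note that both models are fit to the same response values, as the Usage section requires; the additive constant dropped from the formula cancels in the comparison.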
See also
*Akaike information criterion
*Bayes factor
*Bayesian model comparison
*Deviance information criterion
*Hannan–Quinn information criterion
*Jensen–Shannon divergence
*Kullback–Leibler divergence
*Minimum message length
External links
Information Criteria and Model Selection
Sparse Vector Autoregressive Modeling