In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).
When fitting models, it is possible to increase the maximum likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC for sample sizes greater than 7 (the BIC penalty ''k'' ln(''n'') exceeds the AIC penalty 2''k'' whenever ln(''n'') > 2, i.e. ''n'' > e² ≈ 7.4).
The BIC was developed by Gideon E. Schwarz and published in a 1978 paper, as a large-sample approximation to the Bayes factor.
Definition
The BIC is formally defined as
: \mathrm{BIC} = k\ln(n) - 2\ln(\widehat{L}),
where
* \widehat{L} = the maximized value of the likelihood function of the model ''M'', i.e. \widehat{L} = p(x \mid \widehat{\theta}, M), where \widehat{\theta} are the parameter values that maximize the likelihood function and ''x'' is the observed data;
* ''n'' = the number of data points in ''x'', the number of observations, or equivalently, the sample size;
* ''k'' = the number of parameters estimated by the model. For example, in multiple linear regression, the estimated parameters are the intercept, the ''q'' slope parameters, and the constant variance of the errors; thus, ''k'' = ''q'' + 2.
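A minimal Python sketch of the definition (not part of the original article): the BIC is computed from the maximized log-likelihood of a one-parameter Poisson model; the data, the bic helper, and the choice of model are illustrative assumptions.

```python
import numpy as np
from scipy.stats import poisson

def bic(log_likelihood_max, k, n):
    """BIC = k*ln(n) - 2*ln(L-hat), using the maximized log-likelihood."""
    return k * np.log(n) - 2.0 * log_likelihood_max

# Toy count data assumed to follow a Poisson model with one free parameter.
x = np.array([2, 3, 1, 4, 2, 5, 3, 2, 4, 3])
n = len(x)

# The MLE of the Poisson rate is the sample mean; k = 1 estimated parameter.
lam_hat = x.mean()
loglik_hat = poisson.logpmf(x, lam_hat).sum()

print(bic(loglik_hat, k=1, n=n))
```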
Derivation
The BIC can be derived by integrating out the parameters of the model using Laplace's method, starting with the following model evidence:
: p(x \mid M) = \int p(x \mid \theta, M)\, \pi(\theta \mid M)\, d\theta,
where \pi(\theta \mid M) is the prior for \theta under model ''M''.
The log-likelihood, \ln(p(x \mid \theta, M)), is then expanded to a second order Taylor series about the MLE, \widehat{\theta}, assuming it is twice differentiable as follows:
: \ln(p(x \mid \theta, M)) = \ln(p(x \mid \widehat{\theta}, M)) - \frac{n}{2}(\theta - \widehat{\theta})^{\mathsf T}\, \mathcal{I}(\widehat{\theta})\, (\theta - \widehat{\theta}) + R(x, \theta),
where \mathcal{I}(\widehat{\theta}) is the average observed information per observation, and R(x, \theta) denotes the residual term. To the extent that R(x, \theta) is negligible and \pi(\theta \mid M) is relatively linear near \widehat{\theta}, we can integrate out \theta to get the following:
: p(x \mid M) \approx p(x \mid \widehat{\theta}, M) \left(\frac{2\pi}{n}\right)^{k/2} |\mathcal{I}(\widehat{\theta})|^{-1/2}\, \pi(\widehat{\theta} \mid M).
As ''n'' increases, we can ignore |\mathcal{I}(\widehat{\theta})| and \pi(\widehat{\theta} \mid M) as they are O(1). Thus,
: p(x \mid M) = \exp\!\left(\ln p(x \mid \widehat{\theta}, M) - \frac{k}{2}\ln(n) + O(1)\right) = \exp\!\left(-\frac{\mathrm{BIC}}{2} + O(1)\right),
where BIC is defined as above, and \widehat{\theta} either (a) is the Bayesian posterior mode or (b) uses the MLE and the prior \pi(\theta \mid M) has nonzero slope at the MLE. Then the posterior
: p(M \mid x) \propto p(x \mid M)\, p(M) \approx \exp\!\left(-\frac{\mathrm{BIC}}{2}\right) p(M).
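A small numerical check of this approximation, not part of the original article: for a Gaussian mean model with a conjugate Gaussian prior, the evidence p(x | M) is available in closed form, so −2 ln p(x | M) can be compared directly with the BIC; the two should differ only by an O(1) term. The prior scale, noise level, and data below are illustrative assumptions.

```python
import numpy as np
from scipy.stats import multivariate_normal, norm

rng = np.random.default_rng(0)

# Model: x_i ~ N(mu, sigma^2) with sigma known, prior mu ~ N(0, tau^2).
# Marginally x ~ N(0, sigma^2 I + tau^2 11^T), so the evidence is exact.
sigma, tau, n = 1.0, 2.0, 200
x = rng.normal(loc=0.7, scale=sigma, size=n)

cov = sigma**2 * np.eye(n) + tau**2 * np.ones((n, n))
log_evidence = multivariate_normal.logpdf(x, mean=np.zeros(n), cov=cov)

# BIC with k = 1 free parameter (mu), evaluated at its MLE (the sample mean).
loglik_hat = norm.logpdf(x, loc=x.mean(), scale=sigma).sum()
bic = 1 * np.log(n) - 2 * loglik_hat

# -2 * log evidence and BIC should agree up to an O(1) remainder.
print(-2 * log_evidence, bic)
```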
Use
When picking from several models, ones with lower BIC values are generally preferred. The BIC is an increasing function of the error variance \sigma_e^2 and an increasing function of ''k''. That is, unexplained variation in the dependent variable and the number of explanatory variables increase the value of BIC. However, a lower BIC does not necessarily indicate one model is better than another. Because it involves approximations, the BIC is merely a heuristic. In particular, differences in BIC should never be treated like transformed Bayes factors.
It is important to keep in mind that the BIC can be used to compare estimated models only when the numerical values of the dependent variable are identical for all models being compared. The models being compared need not be nested, unlike the case when models are being compared using an F-test or a likelihood ratio test.
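As an illustration of the selection rule, the following Python sketch (not part of the original article) ranks candidate models by BIC; the model names, log-likelihoods, and parameter counts are made-up placeholders.

```python
import numpy as np

# Hypothetical maximized log-likelihoods and parameter counts for three
# candidate models fitted to the same response values (same n for all).
candidates = {
    "model_A": {"loglik": -520.3, "k": 3},
    "model_B": {"loglik": -515.8, "k": 5},
    "model_C": {"loglik": -514.9, "k": 9},
}
n = 300

bics = {name: m["k"] * np.log(n) - 2 * m["loglik"] for name, m in candidates.items()}
best = min(bics, key=bics.get)
print(bics, "-> preferred:", best)
# Note: BIC differences are a heuristic ranking; they should not be
# reinterpreted as exact (transformed) Bayes factors.
```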
Properties
* The BIC generally penalizes free parameters more strongly than the Akaike information criterion (AIC), though it depends on the size of ''n'' and relative magnitude of ''n'' and ''k''.
* It is independent of the prior.
* It can measure the efficiency of the parameterized model in terms of predicting the data.
* It penalizes the complexity of the model where complexity refers to the number of parameters in the model.
* It is approximately equal to the minimum description length (MDL) criterion but with negative sign.
* It can be used to choose the number of clusters according to the intrinsic complexity present in a particular dataset (see the sketch after this list).
* It is closely related to other penalized likelihood criteria such as the deviance information criterion (DIC) and the Akaike information criterion.
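As a sketch of the cluster-selection use mentioned above (not part of the original article), the following assumes scikit-learn's GaussianMixture, whose bic method implements this criterion; the synthetic data and the range of component counts are illustrative choices.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Synthetic 1-D data drawn from two well-separated Gaussian clusters.
X = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)]).reshape(-1, 1)

# Fit mixtures with 1..5 components and record the BIC of each fit.
bics = {}
for n_components in range(1, 6):
    gm = GaussianMixture(n_components=n_components, random_state=0).fit(X)
    bics[n_components] = gm.bic(X)

# The number of components with the lowest BIC is selected (typically 2 here).
print(min(bics, key=bics.get), bics)
```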
Limitations
The BIC suffers from two main limitations:
# the above approximation is only valid for sample size ''n'' much larger than the number ''k'' of parameters in the model.
# the BIC cannot handle complex collections of models as in the variable selection (or feature selection) problem in high dimensions.
Gaussian special case
Under the assumption that the model errors or disturbances are independent and identically distributed according to a normal distribution and the boundary condition that the derivative of the log likelihood with respect to the true variance is zero, this becomes (''up to an additive constant'', which depends only on ''n'' and not on the model):
: \mathrm{BIC} = n \ln(\widehat{\sigma}^2_e) + k \ln(n),
where \widehat{\sigma}^2_e is the error variance. The error variance in this case is defined as
: \widehat{\sigma}^2_e = \frac{1}{n} \sum_{i=1}^{n} (x_i - \widehat{x}_i)^2,
which is a biased estimator for the true variance.
In terms of the residual sum of squares (RSS) the BIC is
: \mathrm{BIC} = n \ln(\mathrm{RSS}/n) + k \ln(n).
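A minimal NumPy sketch of the RSS form (not part of the original article): two least-squares fits to the same response are ranked by BIC, with ''k'' counting the regression coefficients plus the error variance as in the definition above; the synthetic data are an assumption.

```python
import numpy as np

def bic_gaussian(rss, n, k):
    """BIC = n*ln(RSS/n) + k*ln(n), up to a model-independent additive constant."""
    return n * np.log(rss / n) + k * np.log(n)

rng = np.random.default_rng(2)
n = 100
x = rng.uniform(-2, 2, n)
y = 1.0 + 2.0 * x + rng.normal(0, 0.5, n)   # data generated from a straight line

# Candidate 1: intercept + slope; candidate 2 adds an (unneeded) quadratic term.
X1 = np.column_stack([np.ones(n), x])
X2 = np.column_stack([np.ones(n), x, x**2])

for name, X in [("linear", X1), ("quadratic", X2)]:
    beta = np.linalg.lstsq(X, y, rcond=None)[0]
    rss = float(np.sum((y - X @ beta) ** 2))
    k = X.shape[1] + 1          # regression coefficients plus the error variance
    print(name, round(bic_gaussian(rss, n, k), 2))

# The simpler (true) linear model typically attains the lower BIC here.
```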
When testing multiple linear models against a saturated model, the BIC can be rewritten in terms of the deviance \chi^2 as:
: \mathrm{BIC} = \chi^2 + k \ln(n),
where ''k'' is the number of model parameters in the test.
See also
* Akaike information criterion
* Bayes factor
* Bayesian model comparison
* Deviance information criterion
* Hannan–Quinn information criterion
* Jensen–Shannon divergence
* Kullback–Leibler divergence
* Minimum message length
External links
Sparse Vector Autoregressive Modeling