In statistics, the Bayesian information criterion (BIC) or Schwarz information criterion (also SIC, SBC, SBIC) is a criterion for model selection among a finite set of models; models with lower BIC are generally preferred. It is based, in part, on the likelihood function and it is closely related to the Akaike information criterion (AIC).
When fitting models, it is possible to increase the likelihood by adding parameters, but doing so may result in overfitting. Both BIC and AIC attempt to resolve this problem by introducing a penalty term for the number of parameters in the model; the penalty term is larger in BIC than in AIC for sample sizes greater than 7.
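The "greater than 7" threshold can be checked directly: BIC's penalty is k·ln(n) while AIC's is 2k, so per parameter the comparison is ln(n) versus 2. A quick sketch (not from the source):

```python
import math

# BIC's penalty term is k*ln(n); AIC's is 2*k. Per parameter the penalties
# are ln(n) vs. 2, so BIC penalizes more heavily once ln(n) > 2,
# which first happens at n = 8, i.e. for sample sizes greater than 7.
bic_penalty_per_param = {n: math.log(n) for n in (7, 8, 100)}
# ln(7) is about 1.946 (below 2), while ln(8) is about 2.079 (above 2).
```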
The BIC was developed by Gideon E. Schwarz and published in a 1978 paper, where he gave a Bayesian argument for adopting it.
Definition
The BIC is formally defined as
: \mathrm{BIC} = k\ln(n) - 2\ln(\widehat{L})

where
* \widehat{L} = the maximized value of the likelihood function of the model M, i.e. \widehat{L} = p(x\mid\widehat{\theta}, M), where \widehat{\theta} are the parameter values that maximize the likelihood function;
* x = the observed data;
* n = the number of data points in x, the number of observations, or equivalently, the sample size;
* k = the number of parameters estimated by the model. For example, in multiple linear regression, the estimated parameters are the intercept, the q slope parameters, and the constant variance of the errors; thus, k = q + 2.
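The definition translates directly into code. The following sketch computes the BIC from a maximized log-likelihood; the two models and their log-likelihood values are hypothetical numbers chosen for illustration:

```python
import math

def bic(log_likelihood: float, n: int, k: int) -> float:
    """Bayesian information criterion: k*ln(n) - 2*ln(L-hat)."""
    return k * math.log(n) - 2.0 * log_likelihood

# Hypothetical example: two models fit to the same n = 100 observations.
# Model A: maximized log-likelihood -230.0 with k = 3 parameters.
# Model B: maximized log-likelihood -228.5 with k = 5 parameters.
bic_a = bic(-230.0, n=100, k=3)
bic_b = bic(-228.5, n=100, k=5)
# Model A has the lower BIC: the small gain in likelihood does not
# justify two extra parameters at this sample size.
```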
Derivation
Konishi and Kitagawa derive the BIC to approximate the distribution of the data, integrating out the parameters using Laplace's method, starting with the following model evidence:

: p(x\mid M) = \int p(x\mid\theta, M)\,\pi(\theta\mid M)\,d\theta

where \pi(\theta\mid M) is the prior for \theta under model M.

The log-likelihood, \ln(p(x\mid\theta, M)), is then expanded to a second order Taylor series about the MLE, \widehat{\theta}, assuming it is twice differentiable as follows:

: \ln(p(x\mid\theta, M)) = \ln(\widehat{L}) - \frac{n}{2}(\theta - \widehat{\theta})^{\mathsf{T}} \mathcal{I}(\widehat{\theta}) (\theta - \widehat{\theta}) + R(x, \theta),

where \mathcal{I}(\widehat{\theta}) is the average observed information per observation, and R(x, \theta) denotes the residual term. To the extent that R(x, \theta) is negligible and \pi(\theta\mid M) is relatively linear near \widehat{\theta}, we can integrate out \theta to get the following:

: p(x\mid M) \approx \widehat{L} \left(\frac{2\pi}{n}\right)^{k/2} |\mathcal{I}(\widehat{\theta})|^{-1/2} \pi(\widehat{\theta}\mid M).

As n increases, we can ignore |\mathcal{I}(\widehat{\theta})| and \pi(\widehat{\theta}\mid M) as they are O(1). Thus,

: p(x\mid M) \approx \exp\!\left(-\frac{\mathrm{BIC}}{2}\right),

where BIC is defined as above, and \widehat{\theta} either (a) is the Bayesian posterior mode or (b) uses the MLE and the prior \pi(\theta\mid M) has nonzero slope at the MLE. Then the posterior

: p(M\mid x) \propto p(x\mid M)\,p(M) \approx \exp\!\left(-\frac{\mathrm{BIC}}{2}\right) p(M).
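The quality of the Laplace approximation that underlies the derivation can be checked numerically in a case where the exact evidence is known. The sketch below uses a hypothetical Bernoulli model with a flat Beta(1, 1) prior, for which the marginal likelihood is a Beta function in closed form; the data (100 flips, 60 heads) are made up for illustration:

```python
import math

# Hypothetical data: n coin flips with s heads, Bernoulli model,
# flat Beta(1, 1) prior on the success probability theta.
n, s = 100, 60
theta_hat = s / n  # MLE

# Maximized log-likelihood ln(L-hat)
log_L_hat = s * math.log(theta_hat) + (n - s) * math.log(1 - theta_hat)

# Average observed information per observation for the Bernoulli model:
# I(theta_hat) = 1 / (theta_hat * (1 - theta_hat))
info = 1.0 / (theta_hat * (1 - theta_hat))

# Laplace approximation with k = 1 parameter and flat prior pi(theta_hat) = 1:
# L-hat * (2*pi/n)^(1/2) * I^(-1/2)
log_evidence_laplace = (log_L_hat
                        + 0.5 * math.log(2 * math.pi / n)
                        - 0.5 * math.log(info))

# Exact log evidence: the integral of theta^s * (1 - theta)^(n - s) over
# [0, 1] is the Beta function B(s + 1, n - s + 1).
log_evidence_exact = (math.lgamma(s + 1) + math.lgamma(n - s + 1)
                      - math.lgamma(n + 2))

# The two values agree closely already at n = 100.
```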
Usage
When picking from several models, ones with lower BIC values are generally preferred. The BIC is an increasing function of the error variance \sigma_e^2 and an increasing function of k. That is, unexplained variation in the dependent variable and the number of explanatory variables increase the value of BIC. However, a lower BIC does not necessarily indicate one model is better than another. Because it involves approximations, the BIC is merely a heuristic. In particular, differences in BIC should never be treated like transformed Bayes factors.
It is important to keep in mind that the BIC can be used to compare estimated models only when the numerical values of the dependent variable are identical for all models being compared. The models being compared need not be nested, unlike the case when models are being compared using an F-test or a likelihood ratio test.
Properties
* The BIC generally penalizes free parameters more strongly than the Akaike information criterion, though it depends on the size of ''n'' and relative magnitude of ''n'' and ''k''.
* It is independent of the prior.
* It can measure the efficiency of the parameterized model in terms of predicting the data.
* It penalizes the complexity of the model, where complexity refers to the number of parameters in the model.
* It is approximately equal to the minimum description length criterion but with negative sign.
* It can be used to choose the number of clusters according to the intrinsic complexity present in a particular dataset.
* It is closely related to other penalized likelihood criteria such as the deviance information criterion and the Akaike information criterion.
Limitations
The BIC suffers from two main limitations:
# The above approximation is only valid for sample size ''n'' much larger than the number ''k'' of parameters in the model.
# The BIC cannot handle complex collections of models, as in the variable selection (or feature selection) problem in high dimension.
Gaussian special case
Under the assumption that the model errors or disturbances are independent and identically distributed according to a normal distribution and the boundary condition that the derivative of the log likelihood with respect to the true variance is zero, this becomes (''up to an additive constant'', which depends only on ''n'' and not on the model):

: \mathrm{BIC} = n\ln(\widehat{\sigma_e^2}) + k\ln(n),

where \widehat{\sigma_e^2} is the error variance. The error variance in this case is defined as

: \widehat{\sigma_e^2} = \frac{1}{n}\sum_{i=1}^{n}(x_i - \widehat{x_i})^2,

which is a biased estimator for the true variance.

In terms of the residual sum of squares (RSS) the BIC is

: \mathrm{BIC} = n\ln(\mathrm{RSS}/n) + k\ln(n).

When testing multiple linear models against a saturated model, the BIC can be rewritten in terms of the deviance \chi^2 as:

: \mathrm{BIC} = \chi^2 + k\ln(n),

where k is the number of model parameters in the test.
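The RSS form is convenient for comparing regression models fit by least squares. The sketch below, using made-up data, compares an intercept-only model (k = 2: intercept and error variance) with a simple linear regression (k = 3: intercept, slope, and error variance):

```python
import math

def bic_gaussian(rss: float, n: int, k: int) -> float:
    """BIC under i.i.d. Gaussian errors, up to an additive constant:
    n*ln(RSS/n) + k*ln(n)."""
    return n * math.log(rss / n) + k * math.log(n)

# Toy data: y roughly linear in x with small noise (made-up values).
x = [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
y = [0.1, 1.9, 4.2, 5.8, 8.1, 9.9, 12.2, 13.8, 16.1, 17.9]
n = len(x)

# Intercept-only model: predict the mean of y everywhere.
ybar = sum(y) / n
rss0 = sum((yi - ybar) ** 2 for yi in y)

# Simple linear regression via least squares.
xbar = sum(x) / n
sxx = sum((xi - xbar) ** 2 for xi in x)
slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
intercept = ybar - slope * xbar
rss1 = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))

# The linear model fits far better, so its BIC is lower
# despite the extra parameter.
bic0 = bic_gaussian(rss0, n, k=2)
bic1 = bic_gaussian(rss1, n, k=3)
```

Note that both models are fit to the same response values, as the Usage section requires; the additive constant dropped from the formula cancels in the comparison.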
See also
*Akaike information criterion
*Bayes factor
*Bayesian model comparison
*Deviance information criterion
*Hannan–Quinn information criterion
*Jensen–Shannon divergence
*Kullback–Leibler divergence
*Minimum message length
External links
Information Criteria and Model Selection
Sparse Vector Autoregressive Modeling