In statistics, binomial regression is a regression analysis technique in which the response (often referred to as ''Y'') has a binomial distribution: it is the number of successes in a series of ''n'' independent Bernoulli trials, where each trial has probability of success ''p''. In binomial regression, the probability of a success is related to explanatory variables: the corresponding concept in ordinary regression is to relate the mean value of the unobserved response to explanatory variables.
Binomial regression is closely related to binary regression: a binary regression can be considered a binomial regression with ''n'' = 1, or a regression on ungrouped binary data, while a binomial regression can be considered a regression on grouped binary data (see comparison). Binomial regression models are essentially the same as binary choice models, one type of discrete choice model: the primary difference is in the theoretical motivation (see comparison). In machine learning, binomial regression is considered a special case of probabilistic classification, and thus a generalization of binary classification.
Example application
In one published example of an application of binomial regression (Cox & Snell 1981, Example H, p. 91), the details were as follows. The observed outcome variable was whether or not a fault occurred in an industrial process. There were two explanatory variables: the first was a simple two-case factor representing whether or not a modified version of the process was used, and the second was an ordinary quantitative variable measuring the purity of the material being supplied for the process.
Specification of model
The response variable ''Y'' is assumed to be binomially distributed conditional on the explanatory variables ''X''. The number of trials ''n'' is known, and the probability of success for each trial ''p'' is specified as a function ''θ(X)''. This implies that the conditional expectation and conditional variance of the observed fraction of successes, ''Y/n'', are
:<math>\operatorname{E}(Y/n \mid X) = \theta(X)</math>
:<math>\operatorname{Var}(Y/n \mid X) = \frac{\theta(X)\,(1-\theta(X))}{n}</math>
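The two moments of ''Y/n'' can be checked numerically. The following is a minimal sketch, assuming illustrative values ''n'' = 20 and ''θ'' = 0.3 that are not taken from any source; the helper names <code>binomial_pmf</code> and <code>fraction_moments</code> are hypothetical. It sums over the exact binomial probabilities rather than simulating.

```python
from math import comb


def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def fraction_moments(n, theta):
    """Exact mean and variance of Y/n when Y ~ Binomial(n, theta)."""
    mean = sum((k / n) * binomial_pmf(k, n, theta) for k in range(n + 1))
    second = sum((k / n) ** 2 * binomial_pmf(k, n, theta) for k in range(n + 1))
    return mean, second - mean**2


# Illustrative values only (assumed for this sketch).
n, theta = 20, 0.3
mean, var = fraction_moments(n, theta)

print(mean, theta)                   # exact mean of Y/n next to theta
print(var, theta * (1 - theta) / n)  # exact variance next to theta(1 - theta)/n
```

Each printed pair should agree up to floating-point rounding, for any choice of ''n'' and ''θ'' strictly between 0 and 1.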
The goal of binomial regression is to estimate the function ''θ(X)''. Typically the statistician assumes <math>\theta(X) = m(\beta^{\mathsf T} X)</math>, for a known function ''m'', and estimates ''β''. Common choices for ''m'' include the logistic function.
The data are often fitted as a generalised linear model where the predicted values ''μ'' are the probabilities that any individual event will result in a success. For ''N'' individual observations, the likelihood of the predictions is then given by
:<math>L(\boldsymbol{\mu} \mid Y) = \prod_{i=1}^{N} \left( 1_{y_i=1}\,\mu_i + 1_{y_i=0}\,(1-\mu_i) \right)</math>
where <math>1_A</math> is the indicator function, which takes on the value one when the event ''A'' occurs and zero otherwise: in this formulation, for any given observation <math>y_i</math>, only one of the two terms inside the product contributes, according to whether <math>y_i</math> is 0 or 1. The likelihood function is more fully specified by defining the formal parameters <math>\mu_i</math> as parameterised functions of the explanatory variables: this defines the likelihood in terms of a much reduced number of parameters. The model is usually fitted by the method of maximum likelihood to determine these parameters. In practice, formulating the model as a generalised linear model allows the use of algorithmic ideas that apply across the whole class of such models but do not apply to all maximum likelihood problems.
Models used in binomial regression can often be extended to multinomial data.
There are many methods of generating the values of ''μ'' in systematic ways that allow for interpretation of the model; they are discussed below.
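Maximum-likelihood fitting can be sketched in a few lines for the simplest case. The example below is an illustration under stated assumptions, not a reference implementation: it assumes a logistic choice for the function relating the linear predictor to ''μ'', one explanatory variable plus an intercept, and a small made-up grouped data set (<code>x</code>, <code>y</code>, <code>n</code> are hypothetical). It uses plain Newton–Raphson steps on the log-likelihood; the function names are invented for this sketch.

```python
from math import exp, log


def logistic(eta):
    """Map a linear predictor to a probability in (0, 1)."""
    return 1.0 / (1.0 + exp(-eta))


def log_likelihood(beta, x, y, n):
    """Binomial log-likelihood, omitting the constant binomial coefficients."""
    total = 0.0
    for xi, yi, ni in zip(x, y, n):
        mu = logistic(beta[0] + beta[1] * xi)
        total += yi * log(mu) + (ni - yi) * log(1.0 - mu)
    return total


def score(beta, x, y, n):
    """Gradient of the log-likelihood with respect to (intercept, slope)."""
    g0 = g1 = 0.0
    for xi, yi, ni in zip(x, y, n):
        r = yi - ni * logistic(beta[0] + beta[1] * xi)
        g0 += r
        g1 += r * xi
    return g0, g1


def fit_newton(x, y, n, steps=25):
    """Newton-Raphson for a two-parameter logistic binomial regression."""
    b0, b1 = 0.0, 0.0
    for _ in range(steps):
        g0, g1 = score((b0, b1), x, y, n)
        h00 = h01 = h11 = 0.0
        for xi, ni in zip(x, n):
            mu = logistic(b0 + b1 * xi)
            w = ni * mu * (1.0 - mu)  # information weight for this group
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Solve the 2x2 system (information matrix) * step = score by hand.
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Hypothetical grouped data: y[i] successes out of n[i] trials at covariate x[i].
x = [0.0, 1.0, 2.0, 3.0, 4.0]
n = [20, 20, 20, 20, 20]
y = [2, 5, 9, 14, 18]

beta_hat = fit_newton(x, y, n)
print("estimated (intercept, slope):", beta_hat)
print("score at the estimate:", score(beta_hat, x, y, n))
```

At the maximum the score (the gradient of the log-likelihood) should be numerically zero, which is a convenient convergence check. Statistical software typically fits such models by iteratively reweighted least squares, which for the logistic case performs the same updates.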
Link functions
The model linking the probabilities ''μ'' to the explanatory variables must be of a form that only produces values in the range 0 to 1. Many models can be fitted into the form
:<math>\boldsymbol{\mu} = g(\boldsymbol{\eta})</math>
Here ''η'' is an intermediate variable representing a linear combination, containing the regression parameters, of the explanatory variables. The function ''g'' is the cumulative distribution function (cdf) of some probability distribution. Usually this probability distribution has a support from minus infinity to plus infinity so that any finite value of ''η'' is transformed by the function ''g'' to a value inside the range 0 to 1.
In the case of logistic regression, ''g'' is the logistic function (the cdf of the logistic distribution); its inverse, the log of the odds or logit, is the link function in the terminology of generalised linear models. In the case of probit, ''g'' is the cdf of the normal distribution. The linear probability model is not a proper binomial regression specification because its predictions need not lie in the range of zero to one; it is sometimes used for this type of data when results are to be interpreted directly on the probability scale, or when the analyst is not in a position to fit a nonlinear model or to calculate approximate linearizations of probabilities for interpretation.
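The behaviour of these choices of ''g'' can be compared directly. The following is a small sketch using only the Python standard library; the function names are chosen for this illustration, and the values of ''η'' are arbitrary. The linear probability model is represented by the identity function, which is what allows its predictions to leave the range 0 to 1.

```python
from math import erf, exp, sqrt


def logistic_cdf(eta):
    """cdf of the standard logistic distribution (used in logistic regression)."""
    return 1.0 / (1.0 + exp(-eta))


def normal_cdf(eta):
    """cdf of the standard normal distribution (used in probit regression)."""
    return 0.5 * (1.0 + erf(eta / sqrt(2.0)))


def linear_probability(eta):
    """Identity mapping of the linear probability model: no range constraint."""
    return eta


# Arbitrary moderate values of the linear predictor, chosen for illustration.
for eta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    print(eta, logistic_cdf(eta), normal_cdf(eta), linear_probability(eta))
```

For every finite ''η'' the first two columns stay strictly between 0 and 1, while the last column does not. This simple form of <code>logistic_cdf</code> is intended for moderate values of ''η''; very large negative inputs would overflow in <code>exp</code>.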
Comparison with binary regression
Binomial regression is closely connected with binary regression. If the response is a binary variable (two possible outcomes), then these alternatives can be coded as 0 or 1 by considering one of the outcomes as "success" and the other as "failure" and considering these as count data: "success" is 1 success out of 1 trial, while "failure" is 0 successes out of 1 trial. This can now be considered a binomial distribution with ''n'' = 1 trial, so a binary regression is a special case of a binomial regression. If these data are grouped (by adding counts), they are no longer binary data, but are count data for each group, and can still be modeled by a binomial regression; the individual binary outcomes are then referred to as "ungrouped data". An advantage of working with grouped data is that one can test the goodness of fit of the model; for example, grouped data may exhibit overdispersion relative to the variance estimated from the ungrouped data.
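The relation between grouped and ungrouped data can be made concrete for a single group with a common success probability ''p''. In the sketch below, which uses a made-up list of binary outcomes and hypothetical function names, the grouped (binomial) and ungrouped (Bernoulli) log-likelihoods differ only by the logarithm of a binomial coefficient, which does not depend on ''p''. Both forms therefore lead to the same maximum-likelihood estimate.

```python
from math import comb, log


def ungrouped_loglik(p, outcomes):
    """Sum of Bernoulli log-likelihoods, one term per binary outcome."""
    return sum(log(p) if y == 1 else log(1.0 - p) for y in outcomes)


def grouped_loglik(p, successes, trials):
    """Binomial log-likelihood for the count of successes in the group."""
    return (
        log(comb(trials, successes))
        + successes * log(p)
        + (trials - successes) * log(1.0 - p)
    )


# Hypothetical binary outcomes for one group sharing the same covariates.
outcomes = [1, 0, 0, 1, 1, 0, 1, 1]
successes, trials = sum(outcomes), len(outcomes)

for p in (0.2, 0.5, 0.8):
    diff = grouped_loglik(p, successes, trials) - ungrouped_loglik(p, outcomes)
    print(p, diff)  # the difference is the same for every p

print(log(comb(trials, successes)))  # the constant the differences equal
```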
Comparison with binary choice models
A binary choice model assumes a latent variable <math>U_n</math>, the utility (or net benefit) that person ''n'' obtains from taking an action (as opposed to not taking the action). The utility the person obtains from taking the action depends on the characteristics of the person, some of which are observed by the researcher and some of which are not:
:<math>U_n = \boldsymbol\beta \cdot \mathbf{s}_n + \varepsilon_n</math>
where <math>\boldsymbol\beta</math> is a set of regression coefficients and <math>\mathbf{s}_n</math> is a set of independent variables (also known as "features") describing person ''n'', which may be either discrete "dummy variables" or regular continuous variables. <math>\varepsilon_n</math> is a random variable specifying "noise" or "error" in the prediction, assumed to be distributed according to some distribution. Normally, if there is a mean or variance parameter in the distribution, it cannot be identified, so the parameters are set to convenient values, by convention usually mean 0 and variance 1.
The person takes the action, <math>Y_n = 1</math>, if <math>U_n > 0</math>. The unobserved term, <math>\varepsilon_n</math>, is assumed to follow a specified distribution, commonly the logistic distribution.
The specification is written succinctly as:
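** <math>U_n = \boldsymbol\beta \cdot \mathbf{s}_n + \varepsilon_n</math>
** <math>Y_n = \begin{cases} 1, & \text{if } U_n > 0 \\ 0, & \text{if } U_n \le 0 \end{cases}</math>
** <math>\varepsilon \sim</math> logistic, standard normal, etc.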
Let us write it slightly differently:
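** <math>U_n = \boldsymbol\beta \cdot \mathbf{s}_n - e_n</math>
** <math>Y_n = \begin{cases} 1, & \text{if } U_n > 0 \\ 0, & \text{if } U_n \le 0 \end{cases}</math>
** <math>e \sim</math> logistic, standard normal, etc.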
Here we have made the substitution <math>e_n = -\varepsilon_n</math>. This changes a random variable into a slightly different one, defined over a negated domain. As it happens, the error distributions we usually consider (e.g. logistic distribution, standard normal distribution, standard Student's t-distribution, etc.) are symmetric about 0, and hence the distribution over <math>e_n</math> is identical to the distribution over <math>\varepsilon_n</math>.
Denote the cumulative distribution function (CDF) of <math>e</math> as <math>F_e</math> and the quantile function (inverse CDF) of <math>e</math> as <math>F_e^{-1}</math>.
Note that
::<math>\Pr(Y_n = 1) = \Pr(U_n > 0) = \Pr(\boldsymbol\beta \cdot \mathbf{s}_n - e_n > 0) = \Pr(e_n < \boldsymbol\beta \cdot \mathbf{s}_n) = F_e(\boldsymbol\beta \cdot \mathbf{s}_n)</math>
where the last step uses the fact that <math>e_n</math> has a continuous distribution. Since <math>Y_n</math> is a Bernoulli trial, where <math>\operatorname{E}[Y_n] = \Pr(Y_n = 1)</math>, we have
:<math>\operatorname{E}[Y_n] = F_e(\boldsymbol\beta \cdot \mathbf{s}_n)</math>
or equivalently
:<math>F_e^{-1}(\operatorname{E}[Y_n]) = \boldsymbol\beta \cdot \mathbf{s}_n</math>
Note that this is exactly equivalent to the binomial regression model expressed in the formalism of the generalized linear model.
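If <math>e_n \sim \mathcal{N}(0,1)</math>, i.e. distributed as a standard normal distribution, then
:<math>\Phi^{-1}(\operatorname{E}[Y_n]) = \boldsymbol\beta \cdot \mathbf{s}_n</math>
which is exactly a probit model.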
If <math>e_n \sim \operatorname{Logistic}(0,1)</math>, i.e. distributed as a standard logistic distribution with mean 0 and scale parameter 1, then the corresponding quantile function is the logit function, and
:<math>\operatorname{logit}(\operatorname{E}[Y_n]) = \boldsymbol\beta \cdot \mathbf{s}_n</math>
which is exactly a logit model.
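The equivalence derived above can be checked by simulation. The sketch below is illustrative only: the coefficient vector <code>beta</code>, the feature vector <code>s</code>, the number of draws and the function names are all assumptions made for this example. It draws standard logistic errors by inverting the CDF, applies the rule that the action is taken when the utility is positive, and compares the observed frequency with the logistic CDF evaluated at the linear predictor.

```python
import random
from math import exp, log


def logistic_cdf(x):
    """CDF of the standard logistic distribution."""
    return 1.0 / (1.0 + exp(-x))


def simulate_choice_rate(beta, s, draws, rng):
    """Fraction of simulated persons with U = beta . s - e > 0, e ~ Logistic(0, 1)."""
    linear = sum(b * v for b, v in zip(beta, s))
    taken = 0
    for _ in range(draws):
        u = rng.random()
        while u == 0.0:          # avoid log(0); random() may return exactly 0
            u = rng.random()
        e = log(u / (1.0 - u))   # inverse-CDF (quantile) draw from the logistic
        if linear - e > 0:
            taken += 1
    return taken / draws


# Hypothetical coefficients and features, chosen only for illustration.
beta = [0.5, -1.0]
s = [1.0, 0.3]
rng = random.Random(0)  # fixed seed so the run is repeatable

rate = simulate_choice_rate(beta, s, 200_000, rng)
predicted = logistic_cdf(sum(b * v for b, v in zip(beta, s)))
print("simulated frequency of Y = 1:", rate)
print("logistic CDF at beta . s:    ", predicted)
```

The two printed values should agree to within simulation error, which shrinks as the number of draws grows. Replacing the logistic draw with a standard normal draw and the logistic CDF with the normal CDF gives the corresponding check for the probit model.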
Note that the two different formalisms, generalized linear models (GLMs) and discrete choice models, are equivalent in the case of simple binary choice models, but can be extended in differing ways:
*GLMs can easily handle arbitrarily distributed response variables (dependent variables), not just categorical variables or ordinal variables, which discrete choice models are limited to by their nature. GLMs are also not limited to link functions that are quantile functions of some distribution, unlike the use of an error variable, which must by assumption have a probability distribution.
*On the other hand, because discrete choice models are described as types of generative models, it is conceptually easier to extend them to complicated situations with multiple, possibly correlated, choices for each person, or other variations.
Latent variable interpretation / derivation
A latent variable model involving a binomial observed variable ''Y'' can be constructed such that ''Y'' is related to the latent variable ''Y*'' via
:<math>Y = \begin{cases} 1, & \text{if } Y^* > 0 \\ 0, & \text{otherwise} \end{cases}</math>
The latent variable ''Y*'' is then related to a set of regression variables ''X'' by the model
:<math>Y^* = X\beta + \epsilon</math>
This results in a binomial regression model.
The variance of ''ϵ'' cannot be identified and, when it is not of interest, is often assumed to be equal to one. If ''ϵ'' is normally distributed, then a probit is the appropriate model, and if ''ϵ'' follows a logistic distribution (which arises, for example, as the difference of two independent log-Weibull distributed errors), then a logit is appropriate. If ''ϵ'' is uniformly distributed, then a linear probability model is appropriate.
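The uniform case can be worked through explicitly. As a sketch under stated assumptions, take the threshold rule ''Y'' = 1 when ''Y*'' > 0 with <math>Y^* = X\beta + \epsilon</math>, and suppose ''ϵ'' is uniform on an interval <math>[-a, a]</math>, where the half-width ''a'' is a symbol introduced here for illustration. Then, for <math>-a \le X\beta \le a</math>,
:<math>\Pr(Y = 1 \mid X) = \Pr(\epsilon > -X\beta) = 1 - \frac{-X\beta + a}{2a} = \frac{1}{2} + \frac{X\beta}{2a}</math>
which is linear in ''X''. Outside that interval the probability is exactly 0 or 1, so the linear form holds only while <math>X\beta</math> stays within the support of ''ϵ''.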
See also
* Linear probability model
* Poisson regression
*Predictive modelling