In statistics, a linear probability model (LPM) is a special case of a binary regression model. Here the dependent variable for each observation takes values which are either 0 or 1. The probability of observing a 0 or 1 in any one case is treated as depending on one or more explanatory variables. For the linear probability model, this relationship is a particularly simple one, and allows the model to be fitted by linear regression.

The model assumes that, for a binary outcome (Bernoulli trial), Y, and its associated vector of explanatory variables, X,

: \Pr(Y=1 \mid X=x) = x'\beta .

For this model,

: \operatorname{E}[Y \mid X] = \Pr(Y=1 \mid X) = x'\beta,

and hence the vector of parameters β can be estimated using least squares. This method of fitting would be inefficient, and can be improved by adopting an iterative scheme based on weighted least squares, in which the model from the previous iteration is used to supply estimates of the conditional variances, \operatorname{Var}(Y \mid X=x), which would vary between observations. This approach can be related to fitting the model by maximum likelihood.

A drawback of this model is that, unless restrictions are placed on \beta, the estimated coefficients can imply probabilities outside the unit interval [0,1]. For this reason, models such as the logit model or the probit model are more commonly used.
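The least-squares fit and the weighted-least-squares refinement described above can be sketched in a few lines of numpy. This is a minimal illustration on simulated data; the data-generating process and all variable names are hypothetical, and only one reweighting step is shown rather than a full iterative scheme:

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data (hypothetical): one explanatory variable and a binary
# outcome whose true success probability is linear in x.
n = 1000
x = rng.uniform(0.0, 1.0, size=n)
p_true = 0.2 + 0.5 * x                  # true P(Y=1 | x), stays inside [0, 1]
y = rng.binomial(1, p_true)

# Design matrix with an intercept column.
X = np.column_stack([np.ones(n), x])

# Step 1: ordinary least squares gives consistent but inefficient estimates.
beta_ols, *_ = np.linalg.lstsq(X, y, rcond=None)

# Step 2: one weighted-least-squares iteration. The conditional variance of
# a Bernoulli outcome is p(1 - p), so each observation is weighted by its
# inverse, with p estimated from the previous (OLS) fit. Clipping keeps the
# fitted probabilities, and hence the weights, well defined.
p_hat = np.clip(X @ beta_ols, 1e-6, 1 - 1e-6)
w = 1.0 / (p_hat * (1.0 - p_hat))
sw = np.sqrt(w)
beta_wls, *_ = np.linalg.lstsq(sw[:, None] * X, sw * y, rcond=None)

print("OLS estimate:", beta_ols)
print("WLS estimate:", beta_wls)
```

Note that nothing in either fit constrains the fitted values `X @ beta` to lie in [0, 1], which is exactly the drawback mentioned above.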


Latent-variable formulation

More formally, the LPM can arise from a latent-variable formulation (usually found in the econometrics literature), as follows: assume the following regression model with a latent (unobservable) dependent variable:

: y^* = b_0 + \mathbf x'\mathbf b + \varepsilon, \;\; \varepsilon \mid \mathbf x \sim U(-a,a).

The critical assumption here is that the error term of this regression is a Uniform random variable symmetric around zero, and hence of mean zero. The cumulative distribution function of \varepsilon here is

: F_{\varepsilon}(\varepsilon \mid \mathbf x) = \frac{\varepsilon + a}{2a}.

Define the indicator variable y = 1 if y^* > 0, and zero otherwise, and consider the conditional probability

: \Pr(y = 1 \mid \mathbf x) = \Pr(y^* > 0 \mid \mathbf x) = \Pr(b_0 + \mathbf x'\mathbf b + \varepsilon > 0 \mid \mathbf x)

: = \Pr(\varepsilon > -b_0 - \mathbf x'\mathbf b \mid \mathbf x) = 1 - \Pr(\varepsilon \leq -b_0 - \mathbf x'\mathbf b \mid \mathbf x)

: = 1 - F_{\varepsilon}(-b_0 - \mathbf x'\mathbf b \mid \mathbf x) = 1 - \frac{-b_0 - \mathbf x'\mathbf b + a}{2a} = \frac{b_0 + a}{2a} + \mathbf x'\frac{\mathbf b}{2a}.

But this is the linear probability model,

: \Pr(y = 1 \mid \mathbf x) = \beta_0 + \mathbf x'\beta,

with the mapping

: \beta_0 = \frac{b_0 + a}{2a}, \;\; \beta = \frac{\mathbf b}{2a}.

This method is a general device to obtain a conditional probability model of a binary variable: if we assume that the distribution of the error term is Logistic, we obtain the logit model, while if we assume that it is the Normal, we obtain the probit model and, if we assume that it is the logarithm of a Weibull distribution, the complementary log-log model.
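The mapping from the latent-variable parameters to the LPM coefficients can be checked numerically: simulate the latent regression with a uniform error term and compare the empirical frequency of y = 1 at a fixed x with the implied linear probability. A short Monte Carlo sketch, with all parameter values chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical latent-variable parameters:
# y* = b0 + b*x + eps, with eps | x ~ Uniform(-a, a).
b0, b, a = 0.1, 0.6, 2.0

# Implied LPM coefficients from the mapping derived above.
beta0 = (b0 + a) / (2 * a)
beta = b / (2 * a)

# Monte Carlo check: at a fixed x, the fraction of draws with y* > 0
# should match the linear probability beta0 + beta * x.
x = 0.5
n = 200_000
eps = rng.uniform(-a, a, size=n)
y = (b0 + b * x + eps > 0).astype(int)

print("empirical P(y=1 | x):", y.mean())
print("LPM prediction      :", beta0 + beta * x)
```

The two printed numbers should agree up to Monte Carlo error, provided b_0 + \mathbf x'\mathbf b lies within (-a, a) so that the probability is strictly between 0 and 1.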


See also

* Linear approximation



Further reading

* Horrace, William C.; Oaxaca, Ronald L. (2006). "Results on the Bias and Inconsistency of Ordinary Least Squares for the Linear Probability Model". ''Economics Letters''. Vol. 90, pp. 321–327.