Probit model

In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from ''prob''ability + un''it''. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model.

A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function. It is most often estimated using the maximum likelihood procedure, such an estimation being called a probit regression.


Conceptual framework

Suppose a response variable ''Y'' is ''binary'', that is, it can have only two possible outcomes, which we will denote as 1 and 0. For example, ''Y'' may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors ''X'', which are assumed to influence the outcome ''Y''. Specifically, we assume that the model takes the form

: P(Y=1 \mid X) = \Phi(X^T\beta),

where ''P'' is the probability and \Phi is the cumulative distribution function (CDF) of the standard normal distribution. The parameters ''β'' are typically estimated by maximum likelihood.

It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable

: Y^\ast = X^T\beta + \varepsilon,

where ''ε'' ~ ''N''(0, 1). Then ''Y'' can be viewed as an indicator for whether this latent variable is positive:

: Y = \begin{cases} 1 & Y^\ast > 0 \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1 & X^T\beta + \varepsilon > 0 \\ 0 & \text{otherwise} \end{cases}

The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount.

To see that the two models are equivalent, note that

: \begin{align} P(Y = 1 \mid X) &= P(Y^\ast > 0) \\ &= P(X^T\beta + \varepsilon > 0) \\ &= P(\varepsilon > -X^T\beta) \\ &= P(\varepsilon < X^T\beta) && \text{by symmetry of the normal distribution} \\ &= \Phi(X^T\beta) \end{align}
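As a minimal numerical illustration of the latent-variable formulation, assuming NumPy and SciPy and purely illustrative coefficient values, the following sketch simulates Y^\ast = X^T\beta + \varepsilon and checks that the frequency of Y = 1 tracks \Phi(X^T\beta):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
beta = np.array([0.5, -1.0])                      # illustrative coefficients
X = np.column_stack([np.ones(100_000), rng.normal(size=100_000)])

# Latent variable Y* = X'beta + eps with eps ~ N(0, 1); observe Y = 1{Y* > 0}.
eps = rng.normal(size=len(X))
y = (X @ beta + eps > 0).astype(int)

# The empirical frequency of Y = 1 should match the average of Phi(X'beta).
print(y.mean(), norm.cdf(X @ beta).mean())        # the two values nearly agree
```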


Model estimation


Maximum likelihood estimation

Suppose the data set \{y_i, x_i\}_{i=1}^n contains ''n'' independent statistical units corresponding to the model above. For a single observation, conditional on the vector of inputs of that observation, we have:

: P(y_i=1 \mid x_i) = \Phi(x_i'\beta)
: P(y_i=0 \mid x_i) = 1 - \Phi(x_i'\beta)

where x_i is a K \times 1 vector of inputs, and \beta is a K \times 1 vector of coefficients. The likelihood of a single observation (y_i, x_i) is then

: \mathcal{L}(\beta; y_i, x_i) = \Phi(x_i'\beta)^{y_i} \big[1 - \Phi(x_i'\beta)\big]^{1-y_i}

In fact, if y_i = 1, then \mathcal{L}(\beta; y_i, x_i) = \Phi(x_i'\beta), and if y_i = 0, then \mathcal{L}(\beta; y_i, x_i) = 1 - \Phi(x_i'\beta). Since the observations are independent and identically distributed, the likelihood of the entire sample, or the joint likelihood, is equal to the product of the likelihoods of the single observations:

: \mathcal{L}(\beta; Y, X) = \prod_{i=1}^n \Phi(x_i'\beta)^{y_i} \big[1 - \Phi(x_i'\beta)\big]^{1-y_i}

The joint log-likelihood function is thus

: \ln\mathcal{L}(\beta; Y, X) = \sum_{i=1}^n \Big( y_i \ln\Phi(x_i'\beta) + (1-y_i) \ln\big(1 - \Phi(x_i'\beta)\big) \Big)

The estimator \hat\beta which maximizes this function will be consistent, asymptotically normal and efficient provided that \operatorname{E}[XX'] exists and is not singular. It can be shown that this log-likelihood function is globally concave in ''β'', and therefore standard numerical algorithms for optimization will converge rapidly to the unique maximum.

The asymptotic distribution of \hat\beta is given by

: \sqrt{n}(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}(0,\,\Omega^{-1}),

where

: \Omega = \operatorname{E}\bigg[\frac{\varphi^2(X'\beta)}{\Phi(X'\beta)\big(1 - \Phi(X'\beta)\big)} XX'\bigg], \qquad \hat\Omega = \frac{1}{n} \sum_{i=1}^n \frac{\varphi^2(x_i'\hat\beta)}{\Phi(x_i'\hat\beta)\big(1 - \Phi(x_i'\hat\beta)\big)} x_i x_i',

and \varphi = \Phi' is the probability density function (PDF) of the standard normal distribution. Semi-parametric and non-parametric maximum likelihood methods for probit-type and other related models are also available.
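A minimal sketch of this estimation, assuming NumPy/SciPy and reusing the simulated X and y from the sketch above, maximizes the joint log-likelihood numerically; it evaluates \ln\Phi through norm.logcdf for numerical stability in the tails, using the identity 1 - \Phi(z) = \Phi(-z):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(beta, X, y):
    """Negative of the joint log-likelihood given in the text."""
    xb = X @ beta
    # norm.logcdf is numerically stable; note 1 - Phi(z) = Phi(-z).
    return -np.sum(y * norm.logcdf(xb) + (1 - y) * norm.logcdf(-xb))

# Global concavity in beta means a standard optimizer reaches the unique
# maximum from any starting point.
result = minimize(neg_log_likelihood, x0=np.zeros(X.shape[1]),
                  args=(X, y), method="BFGS")
beta_hat = result.x                               # maximum likelihood estimate
```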


Berkson's minimum chi-square method

This method can be applied only when there are many observations of the response variable y_i having the same value of the vector of regressors x_i (such a situation may be referred to as "many observations per cell"). More specifically, the model can be formulated as follows. Suppose among ''n'' observations \{y_i, x_i\}_{i=1}^n there are only ''T'' distinct values of the regressors, which can be denoted as \{x_{(1)}, \ldots, x_{(T)}\}. Let n_t be the number of observations with x_i = x_{(t)}, and r_t the number of such observations with y_i = 1. We assume that there are indeed "many" observations per each "cell": for each t, \lim_{n\to\infty} n_t/n = c_t > 0.

Denote

: \hat{p}_t = r_t/n_t
: \hat\sigma_t^2 = \frac{1}{n_t} \, \frac{\hat{p}_t(1 - \hat{p}_t)}{\varphi^2\big(\Phi^{-1}(\hat{p}_t)\big)}

Then Berkson's minimum chi-square estimator is a generalized least squares estimator in a regression of \Phi^{-1}(\hat{p}_t) on x_{(t)} with weights \hat\sigma_t^{-2}:

: \hat\beta = \Bigg( \sum_{t=1}^T \hat\sigma_t^{-2} x_{(t)} x'_{(t)} \Bigg)^{-1} \sum_{t=1}^T \hat\sigma_t^{-2} x_{(t)} \Phi^{-1}(\hat{p}_t)

It can be shown that this estimator is consistent (as ''n''→∞ and ''T'' fixed), asymptotically normal and efficient. Its advantage is the presence of a closed-form formula for the estimator. However, it is only meaningful to carry out this analysis when individual observations are not available, only their aggregated counts r_t, n_t, and x_{(t)} (for example in the analysis of voting behavior).
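As a sketch of the closed-form computation, assuming NumPy/SciPy and hypothetical aggregated inputs x_cells (the T × K matrix of distinct regressor values), n_cells (observations per cell), and r_cells (counts with y_i = 1):

```python
import numpy as np
from scipy.stats import norm

# Hypothetical aggregated inputs: x_cells (T x K), n_cells, r_cells.
p_hat = r_cells / n_cells                          # per-cell frequency of y = 1
z = norm.ppf(p_hat)                                # Phi^{-1}(p_hat)
var = p_hat * (1 - p_hat) / (n_cells * norm.pdf(z) ** 2)   # sigma_t^2
w = 1.0 / var                                      # GLS weights sigma_t^{-2}

# Weighted least squares of Phi^{-1}(p_hat) on x: the closed-form estimator.
A = (x_cells * w[:, None]).T @ x_cells
b = (x_cells * w[:, None]).T @ z
beta_hat = np.linalg.solve(A, b)
```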


Gibbs sampling

Gibbs sampling of a probit model is possible because regression models typically use normal prior distributions over the weights, and this distribution is conjugate with the normal distribution of the errors (and hence of the latent variables ''Y''*). The model can be described as

: \begin{align} \boldsymbol\beta &\sim \mathcal{N}(\mathbf{b}_0, \mathbf{B}_0) \\ y_i^\ast \mid \mathbf{x}_i, \boldsymbol\beta &\sim \mathcal{N}(\mathbf{x}'_i \boldsymbol\beta, 1) \\ y_i &= \begin{cases} 1 & \text{if } y_i^\ast > 0 \\ 0 & \text{otherwise} \end{cases} \end{align}

From this, we can determine the full conditional densities needed:

: \begin{align} \mathbf{B} &= (\mathbf{B}_0^{-1} + \mathbf{X}'\mathbf{X})^{-1} \\ \boldsymbol\beta \mid \mathbf{y}^\ast &\sim \mathcal{N}\big(\mathbf{B}(\mathbf{B}_0^{-1}\mathbf{b}_0 + \mathbf{X}'\mathbf{y}^\ast), \mathbf{B}\big) \\ y_i^\ast \mid y_i = 0, \mathbf{x}_i, \boldsymbol\beta &\sim \mathcal{N}(\mathbf{x}'_i \boldsymbol\beta, 1)[y_i^\ast < 0] \\ y_i^\ast \mid y_i = 1, \mathbf{x}_i, \boldsymbol\beta &\sim \mathcal{N}(\mathbf{x}'_i \boldsymbol\beta, 1)[y_i^\ast \ge 0] \end{align}

The result for \boldsymbol\beta is given in the article on Bayesian linear regression, although specified with different notation. The only trickiness is in the last two equations. The notation [y_i^\ast < 0] is the Iverson bracket, sometimes written \mathcal{I}(y_i^\ast < 0) or similar. It indicates that the distribution must be truncated within the given range, and rescaled appropriately. In this particular case, a truncated normal distribution arises. Sampling from this distribution depends on how much is truncated. If a large fraction of the original mass remains, sampling can be easily done with rejection sampling: simply sample a number from the non-truncated distribution, and reject it if it falls outside the restriction imposed by the truncation. If sampling from only a small fraction of the original mass, however (e.g. if sampling from one of the tails of the normal distribution, for instance if \mathbf{x}'_i \boldsymbol\beta is around 3 or more and a negative sample is desired), then this will be inefficient and it becomes necessary to fall back on other sampling algorithms. General sampling from the truncated normal can be achieved using approximations to the normal CDF and the probit function, and R has a function rtnorm() for generating truncated-normal samples.
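A minimal sketch of this Gibbs sampler, assuming NumPy/SciPy, with scipy.stats.truncnorm playing the role of rtnorm(); the prior hyperparameters are illustrative, and X and y are reused from the earlier sketches:

```python
import numpy as np
from scipy.stats import truncnorm

rng = np.random.default_rng(1)
n, K = X.shape
b0, B0 = np.zeros(K), 10.0 * np.eye(K)             # illustrative prior N(b0, B0)
B0_inv = np.linalg.inv(B0)
B = np.linalg.inv(B0_inv + X.T @ X)                # posterior covariance (fixed)
beta, draws = np.zeros(K), []

for _ in range(2000):
    # Step 1: draw each latent y* from N(x'beta, 1) truncated to the side
    # implied by the observed y (y = 1: y* >= 0; y = 0: y* < 0).
    mu = X @ beta
    lo = np.where(y == 1, -mu, -np.inf)            # standardized lower bounds
    hi = np.where(y == 1, np.inf, -mu)             # standardized upper bounds
    y_star = mu + truncnorm.rvs(lo, hi, random_state=rng)
    # Step 2: draw beta from its conjugate normal full conditional.
    mean = B @ (B0_inv @ b0 + X.T @ y_star)
    beta = rng.multivariate_normal(mean, B)
    draws.append(beta)
```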


Model evaluation

The suitability of an estimated binary model can be evaluated by counting the number of true observations equal to 1, and the number equal to 0, for which the model assigns a correct predicted classification by treating any estimated probability above 1/2 (or below 1/2) as an assignment of a prediction of 1 (or of 0).
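As a minimal sketch of this count, assuming NumPy/SciPy and a fitted coefficient vector beta_hat such as the one produced in the estimation sketch above:

```python
import numpy as np
from scipy.stats import norm

# Predicted class: 1 when the fitted probability Phi(x'beta_hat) exceeds 1/2,
# which for the probit model is equivalent to x'beta_hat > 0.
y_pred = (norm.cdf(X @ beta_hat) > 0.5).astype(int)
correct_ones = int(np.sum((y == 1) & (y_pred == 1)))   # true 1s predicted as 1
correct_zeros = int(np.sum((y == 0) & (y_pred == 0)))  # true 0s predicted as 0
print(correct_ones, correct_zeros)
```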


Performance under misspecification

Consider the latent variable model formulation of the probit model. When the variance of \varepsilon conditional on x is not constant but dependent on x, the heteroscedasticity issue arises. For example, suppose y^\ast = \beta_0 + \beta_1 x_1 + \varepsilon and \varepsilon \mid x \sim N(0, x_1^2), where x_1 is a continuous positive explanatory variable. Under heteroskedasticity, the probit estimator for \beta is usually inconsistent, and most of the tests about the coefficients are invalid. More importantly, the estimator for P(y=1 \mid x) becomes inconsistent, too. To deal with this problem, the original model needs to be transformed to be homoskedastic. For instance, in the same example, 1[\beta_0 + \beta_1 x_1 + \varepsilon > 0] can be rewritten as 1[\beta_0/x_1 + \beta_1 + \varepsilon/x_1 > 0], where \varepsilon/x_1 \mid x \sim N(0, 1). Therefore, P(y=1 \mid x) = \Phi(\beta_1 + \beta_0/x_1), and running probit on (1, 1/x_1) generates a consistent estimator for the conditional probability P(y=1 \mid x).

When the assumption that \varepsilon is normally distributed fails to hold, a functional form misspecification issue arises: if the model is still estimated as a probit model, the estimators of the coefficients \beta are inconsistent. For instance, if \varepsilon follows a logistic distribution in the true model, but the model is estimated by probit, the estimates will be generally smaller than the true value. However, the inconsistency of the coefficient estimates is practically irrelevant because the estimates for the partial effects, \partial P(y=1 \mid x)/\partial x, will be close to the estimates given by the true logit model.

To avoid the issue of distribution misspecification, one may adopt a general distribution assumption for the error term, such that many different types of distribution can be included in the model. The cost is heavier computation and lower accuracy due to the increased number of parameters. In most cases in practice where the distribution form is misspecified, the estimators for the coefficients are inconsistent, but estimators for the conditional probability and the partial effects are still very good. One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions on a parametric form for the index function and are robust to the choice of the link function (e.g., probit or logit).
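A small simulation can make the rescaling argument concrete. The following sketch, assuming NumPy/SciPy with illustrative parameter values, generates data with \varepsilon \mid x \sim N(0, x_1^2) and runs probit on the transformed regressors (1, 1/x_1); the intercept then estimates \beta_1 and the slope estimates \beta_0:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def neg_log_likelihood(b, X, y):
    """Negative probit log-likelihood, as in the estimation sketch above."""
    xb = X @ b
    return -np.sum(y * norm.logcdf(xb) + (1 - y) * norm.logcdf(-xb))

rng = np.random.default_rng(2)
x1 = rng.uniform(0.5, 3.0, size=50_000)            # continuous positive regressor
beta0, beta1 = -0.5, 1.0                           # illustrative true values
y = (beta0 + beta1 * x1 + x1 * rng.normal(size=x1.size) > 0).astype(int)

# Dividing the index by x1 makes the error standard normal, so probit on
# (1, 1/x1) is consistent: the intercept estimates beta1, the slope beta0.
X_t = np.column_stack([np.ones_like(x1), 1.0 / x1])
fit = minimize(neg_log_likelihood, np.zeros(2), args=(X_t, y), method="BFGS")
print(fit.x)                                       # approximately (1.0, -0.5)
```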


History

The probit model is usually credited to Chester Bliss, who coined the term "probit" in 1934, and to John Gaddum (1933), who systematized earlier work. However, the basic model dates to the Weber–Fechner law by Gustav Fechner, published in 1860, and was repeatedly rediscovered until the 1930s. A fast method for computing maximum likelihood estimates for the probit model was proposed by Ronald Fisher as an appendix to Bliss' work in 1935.


See also

* Generalized linear model
* Limited dependent variable
* Logit model
* Multinomial probit
* Multivariate probit models
* Ordered probit and ordered logit model
* Separation (statistics)
* Tobit model





