Probit model

In statistics, a probit model is a type of regression where the dependent variable can take only two values, for example married or not married. The word is a portmanteau, coming from ''probability'' + ''unit''. The purpose of the model is to estimate the probability that an observation with particular characteristics will fall into a specific one of the categories; moreover, classifying observations based on their predicted probabilities is a type of binary classification model.

A probit model is a popular specification for a binary response model. As such it treats the same set of problems as does logistic regression, using similar techniques. When viewed in the generalized linear model framework, the probit model employs a probit link function. It is most often estimated using the maximum likelihood procedure; such an estimation is called a probit regression.


Conceptual framework

Suppose a response variable ''Y'' is ''binary'', that is, it can have only two possible outcomes, which we will denote as 1 and 0. For example, ''Y'' may represent presence/absence of a certain condition, success/failure of some device, answer yes/no on a survey, etc. We also have a vector of regressors ''X'', which are assumed to influence the outcome ''Y''. Specifically, we assume that the model takes the form

: P(Y=1 \mid X) = \Phi(X^\operatorname{T}\beta),

where ''P'' is the probability and \Phi is the cumulative distribution function (CDF) of the standard normal distribution. The parameters ''β'' are typically estimated by maximum likelihood.

It is possible to motivate the probit model as a latent variable model. Suppose there exists an auxiliary random variable

: Y^\ast = X^\operatorname{T}\beta + \varepsilon,

where ''ε'' ~ ''N''(0, 1). Then ''Y'' can be viewed as an indicator for whether this latent variable is positive:

: Y = \begin{cases} 1 & \text{if } Y^\ast > 0, \\ 0 & \text{otherwise} \end{cases} = \begin{cases} 1 & \text{if } X^\operatorname{T}\beta + \varepsilon > 0, \\ 0 & \text{otherwise.} \end{cases}

The use of the standard normal distribution causes no loss of generality compared with the use of a normal distribution with an arbitrary mean and standard deviation, because adding a fixed amount to the mean can be compensated by subtracting the same amount from the intercept, and multiplying the standard deviation by a fixed amount can be compensated by multiplying the weights by the same amount.

To see that the two models are equivalent, note that

: \begin{align} P(Y = 1 \mid X) &= P(Y^\ast > 0) \\ &= P(X^\operatorname{T}\beta + \varepsilon > 0) \\ &= P(\varepsilon > -X^\operatorname{T}\beta) \\ &= P(\varepsilon < X^\operatorname{T}\beta) && \text{by symmetry of the normal distribution} \\ &= \Phi(X^\operatorname{T}\beta). \end{align}
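For illustration, the following minimal R sketch (not part of the original article; names such as beta_true are hypothetical) simulates data from the latent-variable formulation and checks that the average model-implied probability \Phi(X^\operatorname{T}\beta) matches the empirical frequency of Y = 1:

# Illustrative sketch (not from the original article): simulate the latent-variable
# probit model and compare the empirical outcome frequency with Phi(X'beta).
set.seed(1)
n         <- 10000
x1        <- rnorm(n)
X         <- cbind(1, x1)                 # design matrix with an intercept column
beta_true <- c(0.5, -1.0)                 # hypothetical coefficient values
y_star    <- X %*% beta_true + rnorm(n)   # latent variable Y* = X'beta + eps, eps ~ N(0, 1)
y         <- as.numeric(y_star > 0)       # observed binary outcome: 1 if Y* > 0, else 0

mean(y)                                   # empirical frequency of Y = 1
mean(pnorm(X %*% beta_true))              # model-implied average of Phi(X'beta)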


Model estimation


Maximum likelihood estimation

Suppose the data set \{y_i, x_i\}_{i=1}^n contains ''n'' independent statistical units corresponding to the model above. For a single observation, conditional on the vector of inputs of that observation, we have:

: P(y_i=1 \mid x_i) = \Phi(x_i^\operatorname{T}\beta)
: P(y_i=0 \mid x_i) = 1-\Phi(x_i^\operatorname{T}\beta)

where x_i is a K \times 1 vector of inputs, and \beta is a K \times 1 vector of coefficients.

The likelihood of a single observation (y_i, x_i) is then

: \mathcal{L}(\beta; y_i, x_i) = \Phi(x_i^\operatorname{T}\beta)^{y_i} \left[1-\Phi(x_i^\operatorname{T}\beta)\right]^{1-y_i}.

In fact, if y_i=1, then \mathcal{L}(\beta; y_i, x_i) = \Phi(x_i^\operatorname{T}\beta), and if y_i=0, then \mathcal{L}(\beta; y_i, x_i) = 1-\Phi(x_i^\operatorname{T}\beta).

Since the observations are independent and identically distributed, the likelihood of the entire sample, or the joint likelihood, is equal to the product of the likelihoods of the single observations:

: \mathcal{L}(\beta; Y, X) = \prod_{i=1}^n \Phi(x_i^\operatorname{T}\beta)^{y_i} \left[1-\Phi(x_i^\operatorname{T}\beta)\right]^{1-y_i}.

The joint log-likelihood function is thus

: \ln\mathcal{L}(\beta; Y, X) = \sum_{i=1}^n \Big( y_i\ln\Phi(x_i^\operatorname{T}\beta) + (1-y_i)\ln\!\big(1-\Phi(x_i^\operatorname{T}\beta)\big) \Big).

The estimator \hat\beta which maximizes this function will be consistent, asymptotically normal and efficient provided that \operatorname{E}[XX^\operatorname{T}] exists and is not singular. It can be shown that this log-likelihood function is globally concave in \beta, and therefore standard numerical algorithms for optimization will converge rapidly to the unique maximum.

The asymptotic distribution of \hat\beta is given by

: \sqrt{n}(\hat\beta - \beta)\ \xrightarrow{d}\ \mathcal{N}(0,\,\Omega^{-1}),

where

: \Omega = \operatorname{E}\bigg[\frac{\varphi^2(X^\operatorname{T}\beta)}{\Phi(X^\operatorname{T}\beta)\big(1-\Phi(X^\operatorname{T}\beta)\big)}XX^\operatorname{T}\bigg], \qquad \hat\Omega = \frac{1}{n}\sum_{i=1}^n \frac{\varphi^2(x_i^\operatorname{T}\hat\beta)}{\Phi(x_i^\operatorname{T}\hat\beta)\big(1-\Phi(x_i^\operatorname{T}\hat\beta)\big)}x_i x_i^\operatorname{T},

and \varphi = \Phi' is the probability density function (PDF) of the standard normal distribution.

Semi-parametric and non-parametric maximum likelihood methods for probit-type and other related models are also available.
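Returning to the parametric case, the log-likelihood above can be maximized with any general-purpose optimizer; in R, the same estimates are produced by glm() with a probit link. The following is a minimal sketch on simulated data (illustrative only; variable names are not from the original article):

# Illustrative sketch: probit maximum likelihood on simulated data.
set.seed(2)
n  <- 5000
x1 <- rnorm(n)
X  <- cbind(1, x1)
beta_true <- c(0.3, 0.8)                          # hypothetical coefficients
y  <- as.numeric(X %*% beta_true + rnorm(n) > 0)  # binary outcomes from the latent model

# Negative joint log-likelihood, as derived above.
negloglik <- function(beta) {
  p <- pnorm(X %*% beta)
  -sum(y * log(p) + (1 - y) * log(1 - p))
}

fit_optim <- optim(c(0, 0), negloglik, method = "BFGS")
fit_glm   <- glm(y ~ x1, family = binomial(link = "probit"))

fit_optim$par   # estimates from direct maximization of the log-likelihood
coef(fit_glm)   # the same estimates via glm() with a probit link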


Berkson's minimum chi-square method

This method can be applied only when there are many observations of the response variable y_i having the same value of the vector of regressors x_i (such a situation may be referred to as "many observations per cell"). More specifically, the model can be formulated as follows.

Suppose among ''n'' observations \{y_i, x_i\}_{i=1}^n there are only ''T'' distinct values of the regressors, which can be denoted as \{x_{(1)}, \ldots, x_{(T)}\}. Let n_t be the number of observations with x_i = x_{(t)}, and r_t the number of such observations with y_i = 1. We assume that there are indeed "many" observations per each "cell": for each t, \lim_{n\to\infty} n_t/n = c_t > 0. Denote

: \hat{p}_t = r_t/n_t,
: \hat\sigma_t^2 = \frac{1}{n_t}\,\frac{\hat{p}_t(1-\hat{p}_t)}{\varphi^2\big(\Phi^{-1}(\hat{p}_t)\big)}.

Then Berkson's minimum chi-square estimator is a generalized least squares estimator in a regression of \Phi^{-1}(\hat{p}_t) on x_{(t)} with weights \hat\sigma_t^{-2}:

: \hat\beta = \Bigg( \sum_{t=1}^T \hat\sigma_t^{-2}x_{(t)}x^\operatorname{T}_{(t)} \Bigg)^{-1} \sum_{t=1}^T \hat\sigma_t^{-2}x_{(t)}\Phi^{-1}(\hat{p}_t).

It can be shown that this estimator is consistent (as ''n'' → ∞ and ''T'' fixed), asymptotically normal and efficient. Its advantage is the presence of a closed-form formula for the estimator. However, it is only meaningful to carry out this analysis when individual observations are not available, only their aggregated counts r_t, n_t, and x_{(t)} (for example in the analysis of voting behavior).
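Because the estimator has a closed form, it can be computed directly from the aggregated counts. The following R sketch uses hypothetical cell data (x_t, n_t, r_t are illustrative names, not from the original article):

# Illustrative sketch: Berkson's minimum chi-square estimator from aggregated counts.
# Hypothetical cells: x_t holds the T distinct regressor values (with an intercept
# column), n_t the cell sizes, r_t the counts of y = 1 within each cell.
x_t <- cbind(1, c(-1, 0, 1, 2))
n_t <- c(200, 250, 300, 250)
r_t <- c(30, 90, 210, 230)

p_hat  <- r_t / n_t
z_hat  <- qnorm(p_hat)                                   # Phi^{-1}(p_hat)
sigma2 <- p_hat * (1 - p_hat) / (n_t * dnorm(z_hat)^2)   # estimated variances sigma_t^2
w      <- 1 / sigma2                                     # GLS weights sigma_t^{-2}

# Weighted least squares of Phi^{-1}(p_hat) on x_t with weights sigma_t^{-2}.
beta_hat <- solve(t(x_t) %*% (w * x_t), t(x_t) %*% (w * z_hat))
beta_hat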


Albert and Chib Gibbs sampling method

Gibbs sampling of a probit model is possible with the introduction of normally distributed latent variables ''z'', which are observed as 1 if positive and 0 otherwise. This approach was introduced in Albert and Chib (1993) (Albert, J., & Chib, S. (1993). "Bayesian Analysis of Binary and Polychotomous Response Data." ''Journal of the American Statistical Association'', 88(422), 669–679), which demonstrated how Gibbs sampling could be applied to binary and polychotomous response models within a Bayesian framework. Under a multivariate normal prior distribution over the weights, the model can be described as

: \begin{align} \boldsymbol\beta & \sim \mathcal{N}(\mathbf{b}_0, \mathbf{B}_0) \\ z_i \mid \mathbf{x}_i, \boldsymbol\beta & \sim \mathcal{N}(\mathbf{x}^\operatorname{T}_i\boldsymbol\beta, 1) \\ y_i & = \begin{cases} 1 & \text{if } z_i > 0 \\ 0 & \text{otherwise} \end{cases} \end{align}

From this, Albert and Chib (1993) derive the following full conditional distributions in the Gibbs sampling algorithm:

: \begin{align} \mathbf{B} &= (\mathbf{B}_0^{-1} + \mathbf{X}^\operatorname{T}\mathbf{X})^{-1} \\ \boldsymbol\beta \mid \mathbf{z} &\sim \mathcal{N}\big(\mathbf{B}(\mathbf{B}_0^{-1}\mathbf{b}_0 + \mathbf{X}^\operatorname{T}\mathbf{z}), \mathbf{B}\big) \\ z_i \mid y_i=0, \mathbf{x}_i, \boldsymbol\beta &\sim \mathcal{N}(\mathbf{x}^\operatorname{T}_i\boldsymbol\beta, 1)[z_i \le 0] \\ z_i \mid y_i=1, \mathbf{x}_i, \boldsymbol\beta &\sim \mathcal{N}(\mathbf{x}^\operatorname{T}_i\boldsymbol\beta, 1)[z_i > 0] \end{align}

The result for \boldsymbol\beta is given in the article on Bayesian linear regression, although specified with different notation, while the conditional posterior distributions of the latent variables follow a truncated normal distribution within the given ranges. The notation [z_i \le 0] is the Iverson bracket, sometimes written \mathcal{I}(z_i \le 0) or similar. Thus, knowledge of the observed outcomes serves to restrict the support of the latent variables.

Sampling the weights \boldsymbol\beta given the latent vector \mathbf{z} from the multivariate normal distribution is standard. The latent variables can be sampled from their truncated normal posterior distributions with the inverse-cdf method, which is straightforward to vectorize in R; a sketch of such a function, zbinprobit(y, X, beta, n), is given below.
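The body of the zbinprobit function is not preserved in the source text. The following is a minimal reconstruction sketch, assuming a standard inverse-cdf sampler for a unit-variance normal truncated to the positive or non-positive half-line; the original implementation may differ in detail.

# Reconstruction sketch (the original function body is not preserved in the source):
# draw each z_i from N(x_i'beta, 1) truncated to z_i > 0 when y_i = 1 and to
# z_i <= 0 when y_i = 0, using the inverse-cdf method.
zbinprobit <- function(y, X, beta, n) {
  mu <- as.vector(X %*% beta)       # means of the latent variables
  u  <- runif(n)                    # uniform draws for the inverse-cdf step
  p0 <- pnorm(-mu)                  # P(z_i <= 0) under the untruncated normal
  lo <- ifelse(y == 1, p0, 0)       # lower CDF bound of the allowed region
  hi <- ifelse(y == 1, 1, p0)       # upper CDF bound of the allowed region
  mu + qnorm(lo + u * (hi - lo))    # truncated normal draws
}

Within each Gibbs iteration one would then draw the latent vector \mathbf{z} with this function and draw \boldsymbol\beta from the multivariate normal full conditional given above.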


Model evaluation

The suitability of an estimated binary model can be evaluated by counting the number of true observations equal to 1, and the number equal to 0, for which the model assigns a correct predicted classification, treating any estimated probability above 1/2 as a prediction of 1 (and any estimated probability below 1/2 as a prediction of 0).
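For example, continuing the simulated data and the fit_glm object from the maximum likelihood sketch above (an illustrative sketch, not from the original article), the classification counts can be tabulated as follows:

# Illustrative sketch, continuing the simulated data and fit_glm from the
# maximum likelihood example above.
p_hat  <- predict(fit_glm, type = "response")   # estimated P(y = 1 | x)
y_pred <- as.numeric(p_hat > 0.5)               # classify with the 1/2 threshold
table(observed = y, predicted = y_pred)         # correctly and incorrectly classified counts
mean(y_pred == y)                               # share of observations classified correctly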


Performance under misspecification

Consider the latent variable model formulation of the probit model. When the variance of \varepsilon conditional on x is not constant but depends on x, the heteroscedasticity issue arises. For example, suppose y^* = \beta_0 + \beta_1 x_1 + \varepsilon and \varepsilon \mid x \sim N(0, x_1^2), where x_1 is a continuous positive explanatory variable. Under heteroskedasticity, the probit estimator for \beta is usually inconsistent, and most of the tests about the coefficients are invalid. More importantly, the estimator for P(y=1 \mid x) becomes inconsistent, too. To deal with this problem, the original model needs to be transformed to be homoskedastic. For instance, in the same example, 1[\beta_0 + \beta_1 x_1 + \varepsilon > 0] can be rewritten as 1[\beta_0/x_1 + \beta_1 + \varepsilon/x_1 > 0], where \varepsilon/x_1 \mid x \sim N(0, 1). Therefore, P(y=1 \mid x) = \Phi(\beta_1 + \beta_0/x_1), and running probit of y on (1, 1/x_1) generates a consistent estimator of the conditional probability P(y=1 \mid x).

When the assumption that \varepsilon is normally distributed fails to hold, a functional form misspecification issue arises: if the model is still estimated as a probit model, the estimators of the coefficients \beta are inconsistent. For instance, if \varepsilon follows a logistic distribution in the true model, but the model is estimated by probit, the estimates will generally be smaller than the true values. However, the inconsistency of the coefficient estimates is practically irrelevant because the estimates of the partial effects, \partial P(y=1 \mid x)/\partial x_i, will be close to the estimates given by the true logit model.

To avoid the issue of distribution misspecification, one may adopt a general distribution assumption for the error term, such that many different types of distribution can be included in the model. The cost is heavier computation and lower accuracy as the number of parameters increases. In most practical cases where the distributional form is misspecified, the estimators of the coefficients are inconsistent, but estimators of the conditional probability and the partial effects remain quite good. One can also take semi-parametric or non-parametric approaches, e.g., via local-likelihood or nonparametric quasi-likelihood methods, which avoid assumptions on a parametric form for the index function and are robust to the choice of the link function (e.g., probit or logit).
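As a numerical illustration of the heteroskedasticity transformation described above (a sketch under assumed parameter values, not from the original article), data can be simulated with \operatorname{Var}(\varepsilon \mid x) = x_1^2; a probit of y on (1, x_1) is then inconsistent, whereas a probit of y on (1, 1/x_1) recovers \beta_1 as the intercept and \beta_0 as the coefficient on 1/x_1:

# Illustrative sketch: heteroskedastic latent errors and the rescaled probit.
set.seed(3)
n   <- 20000
x1  <- runif(n, 0.5, 3)                       # continuous positive regressor
b0  <- 0.5; b1 <- 1.0                         # hypothetical true coefficients
eps <- rnorm(n, sd = x1)                      # Var(eps | x) = x1^2
y   <- as.numeric(b0 + b1 * x1 + eps > 0)

# Naive probit of y on (1, x1): inconsistent for (b0, b1) under this heteroskedasticity.
coef(glm(y ~ x1, family = binomial(link = "probit")))

# Rescaled model: P(y = 1 | x) = Phi(b1 + b0/x1), so regress y on 1/x1;
# the intercept estimates b1 and the coefficient on 1/x1 estimates b0.
coef(glm(y ~ I(1 / x1), family = binomial(link = "probit")))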


History

The probit model is usually credited to Chester Bliss, who coined the term "probit" in 1934, and to John Gaddum (1933), who systematized earlier work. However, the basic model dates to the Weber–Fechner law by Gustav Fechner, published in 1860, and was repeatedly rediscovered until the 1930s.

A fast method for computing maximum likelihood estimates for the probit model was proposed by Ronald Fisher as an appendix to Bliss' work in 1935.


See also

* Generalized linear model
* Limited dependent variable
* Logit model
* Multinomial probit
* Multivariate probit models
* Ordered probit and ordered logit model
* Separation (statistics)
* Tobit model

