Applications
Logistic regression is used in various fields, including machine learning, most medical fields, and social sciences. For example, the Trauma and Injury Severity Score (TRISS), which is widely used to predict mortality in injured patients, was originally developed by Boyd et al. using logistic regression. Many other medical scales used to assess the severity of a patient have been developed using logistic regression. Logistic regression may also be used to predict the risk of developing a given disease (e.g. diabetes or coronary heart disease), based on observed characteristics of the patient (age, sex, body mass index, results of blood tests, and so on).
Example
Problem
As a simple example, we can use a logistic regression with one explanatory variable and two categories to answer the following question: a group of 20 students spends between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability of the student passing the exam?

The reason for using logistic regression for this problem is that the values of the dependent variable, pass and fail, while represented by "1" and "0", are not cardinal numbers.
Model
The model takes the form

: $p(x) = \frac{1}{1+e^{-(\beta_0+\beta_1 x)}}$

where $p(x)$ is the probability of passing the exam for a student who studies $x$ hours, $\beta_0$ is the intercept, and $\beta_1$ is the rate parameter. Equivalently, writing $\mu = -\beta_0/\beta_1$ and $s = 1/\beta_1$, the curve may be described by its midpoint $\mu$ (the number of hours of study at which the probability of passing equals 1/2) and the scale parameter $s$.
Fit
The usual measure of goodness of fit for a logistic regression uses logistic loss (or log loss), the negative log-likelihood; the fit is obtained by choosing the parameters $\beta_0$ and $\beta_1$ that maximize the log-likelihood ''ℓ''.
Parameter estimation
Since ''ℓ'' is nonlinear in $\beta_0$ and $\beta_1$, determining their optimum values will require numerical methods. Note that one method of maximizing ''ℓ'' is to require the derivatives of ''ℓ'' with respect to $\beta_0$ and $\beta_1$ to be zero:

: $\frac{\partial \ell}{\partial \beta_0} = 0$
: $\frac{\partial \ell}{\partial \beta_1} = 0$

and the maximization procedure can be accomplished by solving the above two equations for $\beta_0$ and $\beta_1$, which, again, will generally require the use of numerical methods. The values of $\beta_0$ and $\beta_1$ which maximize ''ℓ'' and ''L'' using the above data are found to be:

: $\beta_0 \approx -4.08$
: $\beta_1 \approx 1.50$

which yields a value for ''μ'' and ''s'' of:

: $\mu = -\beta_0/\beta_1 \approx 2.7$
: $s = 1/\beta_1 \approx 0.67$
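A minimal numerical sketch of this maximization in Python, using SciPy's general-purpose optimizer rather than the Newton iterations discussed later; the data arrays are hypothetical stand-ins, since the 20-student table itself is not reproduced in this section:

```python
import numpy as np
from scipy.optimize import minimize

# Hypothetical stand-in data (hours studied, and pass = 1 / fail = 0).
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0, 0, 0, 0, 1, 0, 1, 1, 1, 1])

def neg_log_likelihood(beta):
    """Negative log-likelihood of the simple binary logistic model."""
    t = beta[0] + beta[1] * hours            # log-odds for each student
    p = 1.0 / (1.0 + np.exp(-t))             # predicted pass probability
    return -np.sum(passed * np.log(p) + (1 - passed) * np.log(1 - p))

# No closed form exists, so maximize the log-likelihood numerically (BFGS).
result = minimize(neg_log_likelihood, x0=np.zeros(2), method="BFGS")
b0, b1 = result.x
print(f"beta0 = {b0:.2f}, beta1 = {b1:.2f}, mu = {-b0/b1:.2f}, s = {1/b1:.2f}")
```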
Predictions
The $\beta_0$ and $\beta_1$ coefficients may be entered into the logistic regression equation to estimate the probability of passing the exam. For example, for a student who studies 2 hours, entering the value $x = 2$ into the equation gives an estimated probability of passing the exam of about 0.25:

: $t = \beta_0 + \beta_1\cdot 2 \approx -4.08 + 1.50\cdot 2 = -1.08$
: $p = \frac{1}{1+e^{-t}} \approx 0.25$

Similarly, for a student who studies 4 hours, the estimated probability of passing the exam is about 0.87:

: $t = \beta_0 + \beta_1\cdot 4 \approx -4.08 + 1.50\cdot 4 = 1.92$
: $p = \frac{1}{1+e^{-t}} \approx 0.87$

This table shows the estimated probability of passing the exam for several values of hours studying:

Hours of study (x) | 1 | 2 | 3 | 4 | 5
Probability of passing | 0.07 | 0.26 | 0.61 | 0.87 | 0.97
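The table values follow directly from the fitted curve; a short check in Python, using the rounded coefficients reported above:

```python
import math

b0, b1 = -4.08, 1.50  # fitted coefficients as reported above (rounded)

for hours in range(1, 6):
    t = b0 + b1 * hours                # log-odds
    p = 1.0 / (1.0 + math.exp(-t))     # estimated probability of passing
    print(f"{hours} h -> p = {p:.2f}")
```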
Model evaluation
The logistic regression analysis gives as output, for each coefficient, its estimate together with its standard error, z-value, and associated p-value. By the Wald test (discussed below), the hours studied are significantly associated with the probability of passing the exam.
Generalizations
This simple model is an example of binary logistic regression: it has one explanatory variable and a binary categorical dependent variable which can assume one of two values. Multinomial logistic regression is the generalization of binary logistic regression to include any number of explanatory variables and any number of categories.
Background
Definition of the logistic function
An explanation of logistic regression can begin with an explanation of the standard logistic function. The logistic function is a sigmoid function, which takes any real input $t$ and outputs a value between zero and one:

: $\sigma(t) = \frac{1}{1+e^{-t}}$

Definition of the inverse of the logistic function
We can now define the logit (log-odds) function as the inverse $g = \sigma^{-1}$ of the standard logistic function:

: $g(p) = \sigma^{-1}(p) = \ln\frac{p}{1-p}$

In the logistic model, the logit is assumed to equal the linear predictor:

: $g(p(x)) = \ln\frac{p(x)}{1-p(x)} = \beta_0 + \beta_1 x$
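A small numerical sketch confirming that the logit and the standard logistic function invert one another (plain Python, no external dependencies):

```python
import math

def sigmoid(t):
    """Standard logistic function sigma(t) = 1 / (1 + e^-t)."""
    return 1.0 / (1.0 + math.exp(-t))

def logit(p):
    """Inverse of sigmoid: the log-odds ln(p / (1 - p))."""
    return math.log(p / (1.0 - p))

for t in (-2.0, 0.0, 3.5):
    assert abs(logit(sigmoid(t)) - t) < 1e-12   # round trip recovers t
print("logit(sigmoid(t)) == t for all tested t")
```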
Interpretation of these terms
In the above equations, the terms are as follows:
* $g$ is the logit function. The equation for $g(p(x))$ illustrates that the logit (i.e., the log-odds, or natural logarithm of the odds) is equal to the linear regression expression.
* $p(x)$ is the probability that the dependent variable equals a case, given some linear combination of the predictors.
* $\beta_0$ is the intercept from the linear regression equation (the value of the logit when the predictor is zero).
* $\beta_1 x$ is the slope term, giving the change in the logit per unit change in the predictor $x$.
* $e$ denotes the base of the natural logarithm.
Definition of the odds
The odds of the dependent variable equaling a case (given some linear combination of the predictors) is equivalent to the exponential function of the linear regression expression:

: $\text{odds}(x) = \frac{p(x)}{1-p(x)} = e^{\beta_0+\beta_1 x}$

This illustrates how the logit serves as a link function between the probability and the linear regression expression.
The odds ratio
For a continuous independent variable the odds ratio can be defined as:

: $\mathrm{OR} = \frac{\text{odds}(x+1)}{\text{odds}(x)} = \frac{e^{\beta_0+\beta_1(x+1)}}{e^{\beta_0+\beta_1 x}} = e^{\beta_1}$

This exponential relationship provides an interpretation for $\beta_1$: the odds multiply by $e^{\beta_1}$ for every 1-unit increase in $x$. For a binary independent variable the odds ratio is defined as $\frac{ad}{bc}$, where ''a'', ''b'', ''c'' and ''d'' are cells in a 2×2 contingency table.
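Both forms of the odds ratio are one-line computations; a sketch in Python, where the slope value and the 2×2 cell counts are hypothetical illustrations:

```python
import math

# Continuous predictor: the odds ratio per 1-unit increase is e^beta1.
beta1 = 1.50                  # slope from the worked example above
print(math.exp(beta1))        # ~4.48: the odds multiply by about 4.5 per extra hour

# Binary predictor: odds ratio from a 2x2 contingency table
#                exposed  unexposed
#   outcome yes     a         b
#   outcome no      c         d
a, b, c, d = 20, 10, 5, 15    # hypothetical cell counts
print((a * d) / (b * c))      # OR = ad / bc = 6.0
```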
Multiple explanatory variables
If there are multiple explanatory variables, the above expression $\beta_0 + \beta_1 x$ can be revised to

: $\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_M x_M$

Then when this is used in the equation relating the log-odds of a success to the values of the predictors, the linear regression will be a multiple regression with ''M'' explanatory variables; the parameters $\beta_j$ for all $j = 0, 1, \dots, M$ are all estimated.
Definition
The basic setup of logistic regression is as follows. We are given a dataset containing ''N'' points. Each point ''i'' consists of a set of ''m'' input variables ''x''1,''i'' ... ''x''''m,i'' (also called independent variables, explanatory variables, predictor variables, features, or attributes), and a binary outcome variable ''yi'' (also known as a dependent variable, response variable, output variable, or class), which can assume only the two possible values 0 (often meaning "no" or "failure") or 1 (often meaning "yes" or "success").
Many explanatory variables, two categories
The above example of binary logistic regression on one explanatory variable can be generalized to binary logistic regression on any number of explanatory variables ''x1, x2, ...'' and any number of categorical values. To begin with, we may consider a logistic model with ''M'' explanatory variables, ''x1'', ''x2'' ... ''xM'', and, as in the example above, two categorical values (''y'' = 0 and 1). For the simple binary logistic regression model, we assumed a linear relationship between the predictor variable and the log-odds of the event that $y=1$; this linear relationship extends to the case of ''M'' explanatory variables:

: $t = \ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \cdots + \beta_M x_M$

Multinomial logistic regression: Many explanatory variables and many categories
In the above cases of two categories (binomial logistic regression), the categories were indexed by "0" and "1", and we had two probability distributions: the probability that the outcome was in category 1 was given by $p(\mathbf{x})$ and the probability that the outcome was in category 0 was given by $1-p(\mathbf{x})$. The sum of both probabilities is equal to unity, as they must be.

In general, if we have ''M''+1 explanatory variables (including ''x0'') and ''N''+1 categories, we will need ''N''+1 separate probability distributions, one for each category, indexed by ''n'', which describe the probability that the categorical outcome ''y'' for explanatory vector x will be in category ''y=n''. It will also be required that the sum of these probabilities over all categories be equal to unity. Using the mathematically convenient base ''e'', these probabilities are:

: $\Pr(y=n\mid\mathbf{x}) = \frac{e^{\boldsymbol\beta_n\cdot\mathbf{x}}}{1+\sum_{u=1}^{N} e^{\boldsymbol\beta_u\cdot\mathbf{x}}}$ for $n = 1, 2, \dots, N$

: $\Pr(y=0\mid\mathbf{x}) = 1 - \sum_{n=1}^{N}\Pr(y=n\mid\mathbf{x}) = \frac{1}{1+\sum_{u=1}^{N} e^{\boldsymbol\beta_u\cdot\mathbf{x}}}$

Each of the probabilities except $\Pr(y=0\mid\mathbf{x})$ will have their own set of regression coefficients $\boldsymbol\beta_n$. It can be seen that, as required, the sum of the $\Pr(y=n\mid\mathbf{x})$ over all categories is unity. Note that the selection of $\Pr(y=0\mid\mathbf{x})$ to be defined in terms of the other probabilities is artificial. Any of the probabilities could have been selected to be so defined. This special value of ''n'' is termed the "pivot index", and the log-odds (''tn'') are expressed in terms of the pivot probability and are again expressed as a linear combination of the explanatory variables:

: $t_n = \ln\frac{\Pr(y=n\mid\mathbf{x})}{\Pr(y=0\mid\mathbf{x})} = \boldsymbol\beta_n\cdot\mathbf{x}$

Note also that for the simple case of $N=1$, the two-category case is recovered, with $p(\mathbf{x}) = \Pr(y=1\mid\mathbf{x})$ and $1-p(\mathbf{x}) = \Pr(y=0\mid\mathbf{x})$.

The log-likelihood that a particular set of ''K'' measurements or data points will be generated by the above probabilities can now be calculated. Indexing each measurement by ''k'', let the ''k''-th set of measured explanatory variables be denoted by $\mathbf{x}_k$ and their categorical outcomes be denoted by $y_k$, which can be equal to any integer in [0, ''N'']. The log-likelihood is then:

: $\ell = \sum_{k=1}^{K}\sum_{n=0}^{N}\Delta(n, y_k)\,\ln\Pr(y_k=n\mid\mathbf{x}_k)$

where $\Delta(n, y_k)$ is an indicator function which equals 1 if $y_k = n$ and 0 otherwise.
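A compact sketch of these pivot-indexed probabilities in Python/NumPy; the coefficient vectors below are hypothetical:

```python
import numpy as np

def multinomial_probs(betas, x):
    """Pivot-indexed multinomial probabilities.

    betas: (N, M+1) array of coefficient vectors for categories 1..N
           (category 0 is the pivot and carries no coefficients).
    x:     (M+1,) explanatory vector, with x[0] = 1 for the intercept.
    Returns an (N+1,) array of Pr(y = n | x) for n = 0..N.
    """
    t = betas @ x                    # log-odds t_n = beta_n . x
    denom = 1.0 + np.exp(t).sum()
    p = np.empty(len(t) + 1)
    p[0] = 1.0 / denom               # pivot category 0
    p[1:] = np.exp(t) / denom        # categories 1..N
    return p

# Hypothetical coefficients: N = 2 non-pivot categories, M = 1 predictor.
betas = np.array([[0.5, -1.0],
                  [-0.3, 0.8]])
p = multinomial_probs(betas, np.array([1.0, 2.0]))
print(p, p.sum())                    # the probabilities sum to 1, as required
```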
Interpretations
There are various equivalent specifications and interpretations of logistic regression, which fit into different types of more general models, and allow different generalizations.
As a generalized linear model
The particular model used by logistic regression, which distinguishes it from standard linear regression and from other types of regression analysis used for binary-valued outcomes, is the way the probability of a particular outcome is linked to the linear predictor function.
As a latent-variable model
The logistic model has an equivalent formulation as a latent-variable model. This formulation is common in the theory of discrete choice models and makes it easier to extend to certain more complicated models with multiple, correlated choices, as well as to compare logistic regression to the closely related probit model.
Two-way latent-variable model
Yet another formulation uses two separate latent variables:

: $Y_i^{0\ast} = \boldsymbol\beta_0\cdot\mathbf{X}_i + \varepsilon_0$
: $Y_i^{1\ast} = \boldsymbol\beta_1\cdot\mathbf{X}_i + \varepsilon_1$

where

: $\varepsilon_0 \sim \operatorname{EV}_1(0,1), \qquad \varepsilon_1 \sim \operatorname{EV}_1(0,1)$

where ''EV''1(0,1) is a standard type-1 extreme value distribution, and the observed outcome is the choice with the larger latent value: $Y_i = 1$ if $Y_i^{1\ast} > Y_i^{0\ast}$, and 0 otherwise.
Example
As an example, consider a province-level election where the choice is between a right-of-center party, a left-of-center party, and a secessionist party (e.g. the Parti Québécois, which wants Quebec to secede from Canada). We would then use three latent variables, one for each choice.
As a "log-linear" model
Yet another formulation combines the two-way latent variable formulation above with the original formulation higher up without latent variables, and in the process provides a link to one of the standard formulations of the multinomial logit.
As a single-layer perceptron
The model has an equivalent formulation

: $p_i = \frac{1}{1+e^{-(\beta_0 + \beta_1 x_{1,i} + \cdots + \beta_k x_{k,i})}}$

This functional form is commonly called a single-layer perceptron or single-layer artificial neural network. A single-layer neural network computes a continuous output instead of a step function. The derivative of ''pi'' with respect to ''X'' = (''x''1, ..., ''x''''k'') is computed from the general form:

: $y = \frac{1}{1+e^{-f(X)}}$

where ''f''(''X'') is an analytic function in ''X''. With this choice, the single-layer neural network is identical to the logistic regression model. This function has a continuous derivative, which allows it to be used in backpropagation. This function is also preferred because its derivative is easily calculated:

: $\frac{\mathrm{d}y}{\mathrm{d}X} = y(1-y)\frac{\mathrm{d}f}{\mathrm{d}X}$
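A minimal sketch of this identity in code, checking the closed-form derivative $y(1-y)f'$ against a finite difference; the function ''f'' here is an arbitrary analytic function chosen purely for illustration:

```python
import math

def f(x):
    return 0.5 + 2.0 * x                 # illustrative analytic f(X), f'(x) = 2

def y(x):
    return 1.0 / (1.0 + math.exp(-f(x)))

x0, h = 1.2, 1e-6
numeric = (y(x0 + h) - y(x0 - h)) / (2 * h)   # finite-difference dy/dx
analytic = y(x0) * (1 - y(x0)) * 2.0          # y(1-y) * df/dx
print(numeric, analytic)                       # the two agree closely
```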
In terms of binomial data
A closely related model assumes that each ''i'' is associated not with a single Bernoulli trial but with ''n''''i'' independent identically distributed trials, where the observation ''Y''''i'' is the number of successes observed (the sum of the individual Bernoulli-distributed random variables), and hence follows a binomial distribution:

: $Y_i \sim \operatorname{Bin}(n_i, p_i)$

An example of this distribution is the fraction of seeds (''p''''i'') that germinate after ''n''''i'' are planted. In terms of expected values, this model is expressed as $p_i = \operatorname{E}[Y_i/n_i \mid \mathbf{X}_i]$, so that the log-odds are again linear in the predictors, $\operatorname{logit}(p_i) = \boldsymbol\beta\cdot\mathbf{X}_i$, and the model can be fitted by the same methods as the basic model above.
Model fitting
Maximum likelihood estimation (MLE)
The regression coefficients are usually estimated using maximum likelihood estimation. Unlike linear regression with normally distributed residuals, it is not possible to find a closed-form expression for the coefficient values that maximize the likelihood function, so an iterative process must be used instead, for example Newton's method.
Iteratively reweighted least squares (IRLS)
Binary logistic regression ($y = 0$ or $y = 1$) can, for example, be calculated using ''iteratively reweighted least squares'' (IRLS), which is equivalent to maximizing the log-likelihood of a Bernoulli-distributed process using Newton's method.
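A bare-bones sketch of the IRLS/Newton iteration in Python/NumPy, a didactic implementation of the textbook update rule with no safeguards against perfect separation or singular weight matrices; the data at the bottom are hypothetical:

```python
import numpy as np

def irls_logistic(X, y, iterations=25):
    """Fit binary logistic regression by iteratively reweighted least squares.

    X: (K, m) design matrix whose first column should be all ones.
    y: (K,) vector of 0/1 outcomes.
    """
    beta = np.zeros(X.shape[1])
    for _ in range(iterations):
        p = 1.0 / (1.0 + np.exp(-X @ beta))   # current predicted probabilities
        W = p * (1.0 - p)                     # Bernoulli variances = Newton weights
        # Newton step: beta += (X^T W X)^-1 X^T (y - p)
        H = X.T @ (W[:, None] * X)            # Hessian of the negative log-likelihood
        beta += np.linalg.solve(H, X.T @ (y - p))
    return beta

# Tiny hypothetical example with an intercept column.
X = np.column_stack([np.ones(6), np.array([0.5, 1.0, 2.0, 3.0, 4.0, 5.0])])
y = np.array([0, 0, 1, 0, 1, 1])
print(irls_logistic(X, y))
```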
Bayesian
In a Bayesian statistics context, prior distributions are normally placed on the regression coefficients, for example in the form of Gaussian distributions.
"Rule of ten"
A widely used rule of thumb, the "one in ten rule", states that logistic regression models give stable values for the explanatory variables if based on a minimum of about 10 events per explanatory variable (EPV), where ''event'' denotes the cases belonging to the less frequent category in the dependent variable. Thus a study designed to use $k$ explanatory variables for an event (e.g. myocardial infarction) expected to occur in a proportion $p$ of participants in the study will require a total of $10k/p$ participants. However, there is considerable debate about the reliability of this rule, which is based on simulation studies and lacks a secure theoretical underpinning. According to some authors the rule is overly conservative in some circumstances, with the authors stating, "If we (somewhat subjectively) regard confidence interval coverage less than 93 percent, type I error greater than 7 percent, or relative bias greater than 15 percent as problematic, our results indicate that problems are fairly frequent with 2–4 EPV, uncommon with 5–9 EPV, and still observed with 10–16 EPV. The worst instances of each problem were not severe with 5–9 EPV and usually comparable to those with 10–16 EPV". Others have found results that are not consistent with the above, using different criteria. A useful criterion is whether the fitted model will be expected to achieve the same predictive discrimination in a new sample as it appeared to achieve in the model development sample. For that criterion, 20 events per candidate variable may be required. Also, one can argue that 96 observations are needed only to estimate the model's intercept precisely enough that the margin of error in predicted probabilities is ±0.1 with a 0.95 confidence level.
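A worked instance of the $10k/p$ arithmetic; the study parameters below are hypothetical:

```python
def epv_sample_size(k, p, epv=10):
    """Participants required under the events-per-variable rule: epv * k / p."""
    return epv * k / p

# Hypothetical study: 5 explanatory variables, event expected in 10% of participants.
print(epv_sample_size(k=5, p=0.10))           # 500 participants under the one-in-ten rule
print(epv_sample_size(k=5, p=0.10, epv=20))   # 1000 under the stricter 20-EPV criterion
```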
Error and significance of fit
Deviance and likelihood ratio test – a simple case
In any fitting procedure, the addition of another fitting parameter to a model (e.g. the beta parameters in a logistic regression model) will almost always improve the ability of the model to predict the measured outcomes. This will be true even if the additional term has no predictive value, since the model will simply be "overfitting" to the noise in the data. The question arises as to whether the improvement gained by the addition of another fitting parameter is significant enough to recommend the inclusion of the additional term, or whether the improvement is simply that which may be expected from overfitting.

In short, for logistic regression, a statistic known as the deviance is defined which is a measure of the error between the logistic model fit and the outcome data. In the limit of a large number of data points, the deviance is chi-squared distributed, which allows a chi-squared test to be implemented in order to determine the significance of the explanatory variables.

Linear regression and logistic regression have many similarities. For example, in simple linear regression, a set of ''K'' data points (''xk'', ''yk'') are fitted to a proposed model function of the form $y = b_0 + b_1 x$. The fit is obtained by choosing the ''b'' parameters which minimize the sum of the squares of the residuals (the squared error term) for each data point:

: $\varepsilon^2 = \sum_{k=1}^{K}(b_0 + b_1 x_k - y_k)^2$

The minimum value which constitutes the fit will be denoted by $\hat\varepsilon^2$.

The idea of a null model may be introduced, in which it is assumed that the ''x'' variable is of no use in predicting the ''yk'' outcomes: the data points are fitted to a null model function of the form $y = b_0$ with a squared error term:

: $\varepsilon^2 = \sum_{k=1}^{K}(b_0 - y_k)^2$

The fitting process consists of choosing a value of ''b0'' which minimizes the squared error of the fit to the null model, denoted by $\hat\varepsilon_\varphi^2$, where the subscript $\varphi$ denotes the null model. It is seen that the null model is optimized by $b_0 = \overline{y}$, where $\overline{y}$ is the mean of the ''yk'' values, and the optimized $\hat\varepsilon_\varphi^2$ is:

: $\hat\varepsilon_\varphi^2 = \sum_{k=1}^{K}(\overline{y} - y_k)^2$

which is proportional to the square of the (uncorrected) sample standard deviation of the ''yk'' data points.

We can imagine a case where the ''yk'' data points are randomly assigned to the various ''xk'', and then fitted using the proposed model. Specifically, we can consider the fits of the proposed model to every permutation of the ''yk'' outcomes. It can be shown that the optimized error of any of these fits will never be less than the optimum error of the null model, and that the difference between these minimum errors will follow a chi-squared distribution, with degrees of freedom equal to those of the proposed model minus those of the null model which, in this case, will be 2 − 1 = 1. Using the chi-squared test, we may then estimate how many of these permuted sets of ''yk'' will yield a minimum error less than or equal to the minimum error using the original ''yk'', and so we can estimate how significant an improvement is given by the inclusion of the ''x'' variable in the proposed model.

For logistic regression, the measure of goodness-of-fit is the likelihood function ''L'', or its logarithm, the log-likelihood ''ℓ''. The likelihood function ''L'' is analogous to the $\varepsilon^2$ in the linear regression case, except that the likelihood is maximized rather than minimized. Denote the maximized log-likelihood of the proposed model by $\hat\ell$.

In the case of simple binary logistic regression, the set of ''K'' data points are fitted in a probabilistic sense to a function of the form:

: $p(x) = \frac{1}{1+e^{-t}}$

where $p(x)$ is the probability that $y = 1$.
The log-odds are given by:

: $t = \beta_0 + \beta_1 x$

and the log-likelihood is:

: $\ell = \sum_{k=1}^{K}\left[\,y_k\ln p(x_k) + (1-y_k)\ln(1-p(x_k))\,\right]$

For the null model, the probability that $y = 1$ is given by:

: $p_\varphi(x) = \frac{1}{1+e^{-t_\varphi}}$

The log-odds for the null model are given by:

: $t_\varphi = \beta_0$

and the log-likelihood is:

: $\ell_\varphi = \sum_{k=1}^{K}\left[\,y_k\ln p_\varphi + (1-y_k)\ln(1-p_\varphi)\,\right]$

Since we have $\partial\ell_\varphi/\partial\beta_0 = 0$ at the maximum of ''L'', the maximum log-likelihood for the null model is:

: $\hat\ell_\varphi = K\left[\,\overline{y}\ln\overline{y} + (1-\overline{y})\ln(1-\overline{y})\,\right]$

The optimum $\beta_0$ is attained where:

: $p_\varphi = \overline{y}$

where $\overline{y}$ is again the mean of the ''yk'' values. Again, we can conceptually consider the fit of the proposed model to every permutation of the ''yk'' and it can be shown that the maximum log-likelihood of these permutation fits will never be smaller than that of the null model:

: $\hat\ell \ge \hat\ell_\varphi$

Also, as an analog to the error of the linear regression case, we may define the deviance of a logistic regression fit as:

: $D = 2\left(\hat\ell - \hat\ell_\varphi\right)$

which will always be positive or zero. The reason for this choice is that not only is the deviance a good measure of the goodness of fit, it is also approximately chi-squared distributed, with the approximation improving as the number of data points (''K'') increases, becoming exactly chi-squared distributed in the limit of an infinite number of data points. As in the case of linear regression, we may use this fact to estimate the probability that a random set of data points will give a better fit than the fit obtained by the proposed model, and so have an estimate of how significantly the model is improved by including the ''xk'' data points in the proposed model.

For the simple model of student test scores described above, the maximum value of the log-likelihood of the null model is $\hat\ell_\varphi = -13.8629\ldots$ The maximum value of the log-likelihood for the simple model is $\hat\ell = -8.0299\ldots$, so that the deviance is

: $D = 2\left(\hat\ell - \hat\ell_\varphi\right) = 11.6661\ldots$

Using the chi-squared test of significance, the integral of the chi-squared distribution with one degree of freedom from 11.6661... to infinity is equal to 0.00063649... This effectively means that about 6 out of 10,000 fits to random ''yk'' can be expected to have a better fit (smaller deviance) than the given ''yk'', and so we can conclude that the inclusion of the ''x'' variable and data in the proposed model is a very significant improvement over the null model. In other words, we reject the null hypothesis with high confidence.
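A short check of this arithmetic with SciPy, using the two maximized log-likelihood values reported above:

```python
from scipy.stats import chi2

ll_null = -13.8629   # maximized log-likelihood of the null model (20 * ln 0.5)
ll_fit = -8.0299     # maximized log-likelihood of the fitted model

deviance = 2 * (ll_fit - ll_null)
p_value = chi2.sf(deviance, df=1)   # upper tail of chi-squared with 1 dof
print(deviance, p_value)            # ~11.666, ~0.000636
```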
Goodness of fit summary
Goodness of fit in linear regression models is generally measured using ''R''2. Since this has no direct analog in logistic regression, various methods including the following can be used instead.
Deviance and likelihood ratio tests
In linear regression analysis, one is concerned with partitioning variance via the sum of squares calculations: variance in the criterion is essentially divided into variance accounted for by the predictors and residual variance. In logistic regression analysis, deviance is used in lieu of sum of squares calculations. Deviance is analogous to the sum of squares calculations in linear regression and is a measure of the lack of fit to the data in a logistic regression model. When a "saturated" model is available (a model with a theoretically perfect fit), deviance is calculated by comparing a given model with the saturated model. This computation gives the likelihood-ratio test:

: $D = -2\ln\frac{\text{likelihood of the fitted model}}{\text{likelihood of the saturated model}}$

Pseudo-R-squared
In linear regression the squared multiple correlation, ''R''2, is used to assess goodness of fit as it represents the proportion of variance in the criterion that is explained by the predictors. In logistic regression analysis, there is no agreed-upon analogous measure, but there are several competing measures, each with limitations. Four of the most commonly used indices and one less commonly used one are examined on this page:
* Likelihood ratio ''R''2
* Cox and Snell ''R''2
* Nagelkerke ''R''2
* McFadden ''R''2
* Tjur ''R''2
Hosmer–Lemeshow test
The Hosmer–Lemeshow test uses a test statistic that asymptotically follows a chi-squared distribution to assess whether or not the observed event rates match expected event rates in subgroups of the model population. This test is considered to be obsolete by some statisticians because of its dependence on arbitrary binning of predicted probabilities and relatively low power.
Coefficient significance
After fitting the model, it is likely that researchers will want to examine the contribution of individual predictors. To do so, they will want to examine the regression coefficients. In linear regression, the regression coefficients represent the change in the criterion for each unit change in the predictor. In logistic regression, however, the regression coefficients represent the change in the logit for each unit change in the predictor. Given that the logit is not intuitive, researchers are likely to focus on a predictor's effect on the exponential function of the regression coefficient, the odds ratio (see the definition of the odds ratio above). In linear regression, the significance of a regression coefficient is assessed by computing a ''t'' test. In logistic regression, there are several different tests designed to assess the significance of an individual predictor, most notably the likelihood ratio test and the Wald statistic.
Likelihood ratio test
The likelihood-ratio test discussed above to assess model fit is also the recommended procedure to assess the contribution of individual "predictors" to a given model. In the case of a single predictor model, one simply compares the deviance of the predictor model with that of the null model on a chi-squared distribution with a single degree of freedom.
Wald statistic
Alternatively, when assessing the contribution of individual predictors in a given model, one may examine the significance of the Wald statistic. The Wald statistic, analogous to the ''t''-test in linear regression, is used to assess the significance of coefficients. The Wald statistic is the ratio of the square of the regression coefficient to the square of the standard error of the coefficient and is asymptotically distributed as a chi-squared distribution:

: $W_j = \frac{\beta_j^2}{SE_{\beta_j}^2}$

Although several statistical packages (e.g., SPSS, SAS) report the Wald statistic to assess the contribution of individual predictors, the Wald statistic has limitations. When the regression coefficient is large, the standard error of the regression coefficient also tends to be larger, increasing the probability of a Type II error (failing to detect a truly significant predictor). The Wald statistic also tends to be biased when data are sparse.
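A sketch computing the Wald statistic and its p-value; the coefficient and standard-error values below are hypothetical:

```python
from scipy.stats import chi2

beta = 1.50        # hypothetical estimated coefficient
se_beta = 0.63     # hypothetical standard error of that coefficient

wald = beta**2 / se_beta**2     # squared coefficient over squared standard error
p_value = chi2.sf(wald, df=1)   # asymptotically chi-squared with 1 dof
print(wald, p_value)
```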
Case-control sampling
Suppose cases are rare. Then we might wish to sample them more frequently than their prevalence in the population. For example, suppose there is a disease that affects 1 person in 10,000, and to collect our data we need to do a complete physical. It may be too expensive to do thousands of physicals of healthy people in order to obtain data for only a few diseased individuals. Thus, we may evaluate more diseased individuals, perhaps all of the rare outcomes. This is also retrospective sampling, or equivalently it is called unbalanced data. As a rule of thumb, sampling controls at a rate of five times the number of cases will produce sufficient control data (see https://class.stanford.edu/c4x/HumanitiesScience/StatLearning/asset/classification.pdf, slide 16).

Logistic regression is unique in that it may be estimated on unbalanced data, rather than randomly sampled data, and still yield correct coefficient estimates of the effects of each independent variable on the outcome. That is to say, if we form a logistic model from such data, and if the model is correct in the general population, the $\beta_j$ parameters are all correct except for $\beta_0$. We can correct $\beta_0$ if we know the true prevalence as follows:

: $\hat\beta_0^{\ast} = \hat\beta_0 + \ln\frac{\pi}{1-\pi} - \ln\frac{\tilde\pi}{1-\tilde\pi}$

where $\pi$ is the true prevalence and $\tilde\pi$ is the prevalence in the sample.
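A sketch of this intercept correction; the prevalence values below are hypothetical:

```python
import math

def corrected_intercept(b0_hat, true_prev, sample_prev):
    """Adjust a case-control intercept using the true population prevalence."""
    logit = lambda p: math.log(p / (1.0 - p))
    return b0_hat + logit(true_prev) - logit(sample_prev)

# Hypothetical: disease prevalence 1/10,000 in the population,
# but cases make up 20% of the deliberately unbalanced sample.
print(corrected_intercept(b0_hat=-1.2, true_prev=0.0001, sample_prev=0.20))
```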
Discussion
Like other forms of regression analysis, logistic regression makes use of one or more predictor variables that may be either continuous or categorical.