In statistics, multinomial logistic regression is a classification method that generalizes logistic regression to multiclass problems, i.e. with more than two possible discrete outcomes. That is, it is a model that is used to predict the probabilities of the different possible outcomes of a categorically distributed dependent variable, given a set of independent variables (which may be real-valued, binary-valued, categorical-valued, etc.). Multinomial logistic regression is known by a variety of other names, including polytomous LR, multiclass LR, softmax regression, multinomial logit (mlogit), the maximum entropy (MaxEnt) classifier, and the conditional maximum entropy model.
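As a brief illustration, the following sketch fits such a model with scikit-learn's LogisticRegression, which handles the multinomial case; the synthetic dataset and all parameter choices here are arbitrary assumptions for demonstration, not part of the method itself.

<syntaxhighlight lang="python">
# Minimal sketch, assuming scikit-learn is installed; data are synthetic.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Toy problem: 3 classes, 4 real-valued explanatory variables.
X, y = make_classification(n_samples=200, n_features=4, n_informative=3,
                           n_redundant=0, n_classes=3, random_state=0)

# LogisticRegression generalizes to K > 2 classes directly.
clf = LogisticRegression().fit(X, y)

# Predicted probabilities for the first observation: one value per class.
print(clf.predict_proba(X[:1]))
</syntaxhighlight>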


Background

Multinomial logistic regression is used when the dependent variable in question is nominal (equivalently ''categorical'', meaning that it falls into any one of a set of categories that cannot be ordered in any meaningful way) and for which there are more than two categories. Some examples would be:
*Which major will a college student choose, given their grades, stated likes and dislikes, etc.?
*Which blood type does a person have, given the results of various diagnostic tests?
*In a hands-free mobile phone dialing application, which person's name was spoken, given various properties of the speech signal?
*Which candidate will a person vote for, given particular demographic characteristics?
*Which country will a firm locate an office in, given the characteristics of the firm and of the various candidate countries?
These are all statistical classification problems. They all have in common a dependent variable to be predicted that comes from one of a limited set of items that cannot be meaningfully ordered, as well as a set of independent variables (also known as features, explanators, etc.), which are used to predict the dependent variable. Multinomial logistic regression is a particular solution to classification problems that use a linear combination of the observed features and some problem-specific parameters to estimate the probability of each particular value of the dependent variable. The best values of the parameters for a given problem are usually determined from some training data (e.g. some people for whom both the diagnostic test results and blood types are known, or some examples of known words being spoken).


Assumptions

The multinomial logistic model assumes that data are case-specific; that is, each independent variable has a single value for each case. The multinomial logistic model also assumes that the dependent variable cannot be perfectly predicted from the independent variables for any case. As with other types of regression, there is no need for the independent variables to be statistically independent from each other (unlike, for example, in a naive Bayes classifier); however, collinearity is assumed to be relatively low, as it becomes difficult to differentiate between the impact of several variables if this is not the case.

If the multinomial logit is used to model choices, it relies on the assumption of independence of irrelevant alternatives (IIA), which is not always desirable. This assumption states that the odds of preferring one class over another do not depend on the presence or absence of other "irrelevant" alternatives. For example, the relative probabilities of taking a car or bus to work do not change if a bicycle is added as an additional possibility. This allows the choice of ''K'' alternatives to be modeled as a set of ''K''-1 independent binary choices, in which one alternative is chosen as a "pivot" and the other ''K''-1 compared against it, one at a time. The IIA hypothesis is a core hypothesis in rational choice theory; however, numerous studies in psychology show that individuals often violate this assumption when making choices. An example of a problem case arises if choices include a car and a blue bus. Suppose the odds ratio between the two is 1 : 1. Now if the option of a red bus is introduced, a person may be indifferent between a red and a blue bus, and hence may exhibit a car : blue bus : red bus odds ratio of 1 : 0.5 : 0.5, thus maintaining a 1 : 1 ratio of car : any bus while adopting a changed car : blue bus ratio of 1 : 0.5. Here the red bus option was not in fact irrelevant, because a red bus was a perfect substitute for a blue bus.

If the multinomial logit is used to model choices, it may in some situations impose too much constraint on the relative preferences between the different alternatives. This point is especially important to take into account if the analysis aims to predict how choices would change if one alternative were to disappear (for instance if one political candidate withdraws from a three-candidate race). Other models like the nested logit or the multinomial probit may be used in such cases as they allow for violation of the IIA.
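The IIA constraint can be seen numerically: under the multinomial logit, adding a third alternative leaves the odds between the first two unchanged. The following sketch uses made-up utility values chosen to give 1 : 1 odds between car and blue bus.

<syntaxhighlight lang="python">
import numpy as np

def mnl_probs(scores):
    """Multinomial logit choice probabilities (softmax of the utilities)."""
    e = np.exp(scores - np.max(scores))  # subtract max for numerical stability
    return e / e.sum()

# Utilities for [car, blue bus], chosen so the odds are 1 : 1.
p2 = mnl_probs(np.array([1.0, 1.0]))
print(p2[0] / p2[1])   # car : blue bus odds = 1.0

# Add a red bus with the same utility as the blue bus.
p3 = mnl_probs(np.array([1.0, 1.0, 1.0]))
print(p3[0] / p3[1])   # still 1.0: the model keeps the car : blue bus odds fixed,
                       # even though intuitively they should shift to 1 : 0.5
</syntaxhighlight>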


Model


Introduction

There are multiple equivalent ways to describe the mathematical model underlying multinomial logistic regression. This can make it difficult to compare different treatments of the subject in different texts. The article on logistic regression presents a number of equivalent formulations of simple logistic regression, and many of these have analogues in the multinomial logit model.

The idea behind all of them, as in many other statistical classification techniques, is to construct a linear predictor function that constructs a score from a set of weights that are linearly combined with the explanatory variables (features) of a given observation using a dot product:

:\operatorname{score}(\mathbf{X}_i, k) = \boldsymbol\beta_k \cdot \mathbf{X}_i,

where \mathbf{X}_i is the vector of explanatory variables describing observation ''i'', \boldsymbol\beta_k is a vector of weights (or regression coefficients) corresponding to outcome ''k'', and score(\mathbf{X}_i, ''k'') is the score associated with assigning observation ''i'' to category ''k''. In discrete choice theory, where observations represent people and outcomes represent choices, the score is considered the utility associated with person ''i'' choosing outcome ''k''. The predicted outcome is the one with the highest score.

The difference between the multinomial logit model and numerous other methods, models, algorithms, etc. with the same basic setup (the perceptron algorithm, support vector machines, linear discriminant analysis, etc.) is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. In particular, in the multinomial logit model, the score can directly be converted to a probability value, indicating the probability of observation ''i'' choosing outcome ''k'' given the measured characteristics of the observation. This provides a principled way of incorporating the prediction of a particular multinomial logit model into a larger procedure that may involve multiple such predictions, each with a possibility of error. Without such means of combining predictions, errors tend to multiply. For example, imagine a large predictive model that is broken down into a series of submodels where the prediction of a given submodel is used as the input of another submodel, and that prediction is in turn used as the input into a third submodel, etc. If each submodel has 90% accuracy in its predictions, and there are five submodels in series, then the overall model has only 0.9^5 ≈ 59% accuracy. If each submodel has 80% accuracy, then overall accuracy drops to 0.8^5 ≈ 33% accuracy. This issue is known as error propagation and is a serious problem in real-world predictive models, which are usually composed of numerous parts. Predicting probabilities of each possible outcome, rather than simply making a single optimal prediction, is one means of alleviating this issue.
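The compounding arithmetic can be checked directly (assuming, as above, that each stage's errors are independent):

<syntaxhighlight lang="python">
# Accuracy of a chain of submodels, assuming each stage errs independently.
for accuracy, stages in [(0.9, 5), (0.8, 5)]:
    print(f"{accuracy}^{stages} = {accuracy ** stages:.2f}")
# 0.9^5 ≈ 0.59 and 0.8^5 ≈ 0.33
</syntaxhighlight>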


Setup

The basic setup is the same as in logistic regression, the only difference being that the dependent variables are categorical rather than binary, i.e. there are ''K'' possible outcomes rather than just two. The following description is somewhat shortened; for more details, consult the logistic regression article.


Data points

Specifically, it is assumed that we have a series of ''N'' observed data points. Each data point ''i'' (ranging from 1 to ''N'') consists of a set of ''M'' explanatory variables ''x''''1,i'' ... ''x''''M,i'' (also known as independent variables, predictor variables, features, etc.), and an associated categorical outcome ''Y''''i'' (also known as the dependent variable or response variable), which can take on one of ''K'' possible values. These possible values represent logically separate categories (e.g. different political parties, blood types, etc.), and are often described mathematically by arbitrarily assigning each a number from 1 to ''K''. The explanatory variables and outcome represent observed properties of the data points, and are often thought of as originating in the observations of ''N'' "experiments" — although an "experiment" may consist in nothing more than gathering data. The goal of multinomial logistic regression is to construct a model that explains the relationship between the explanatory variables and the outcome, so that the outcome of a new "experiment" can be correctly predicted for a new data point for which the explanatory variables, but not the outcome, are available. In the process, the model attempts to explain the relative effect of differing explanatory variables on the outcome. Some examples:
*The observed outcomes are different variants of a disease such as hepatitis (possibly including "no disease" and/or other related diseases) in a set of patients, and the explanatory variables might be characteristics of the patients thought to be pertinent (sex, race, age, blood pressure, outcomes of various liver-function tests, etc.). The goal is then to predict which disease is causing the observed liver-related symptoms in a new patient.
*The observed outcomes are the party chosen by a set of people in an election, and the explanatory variables are the demographic characteristics of each person (e.g. sex, race, age, income, etc.). The goal is then to predict the likely vote of a new voter with given characteristics.


Linear predictor

As in other forms of linear regression, multinomial logistic regression uses a linear predictor function f(k,i) to predict the probability that observation ''i'' has outcome ''k'', of the following form:

:f(k,i) = \beta_{0,k} + \beta_{1,k} x_{1,i} + \beta_{2,k} x_{2,i} + \cdots + \beta_{M,k} x_{M,i},

where \beta_{m,k} is a regression coefficient associated with the ''m''th explanatory variable and the ''k''th outcome. As explained in the logistic regression article, the regression coefficients and explanatory variables are normally grouped into vectors of size ''M''+1, so that the predictor function can be written more compactly:

:f(k,i) = \boldsymbol\beta_k \cdot \mathbf{x}_i,

where \boldsymbol\beta_k is the set of regression coefficients associated with outcome ''k'', and \mathbf{x}_i (a row vector) is the set of explanatory variables associated with observation ''i''.
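A minimal NumPy sketch of this score computation; the dimensions and random values are illustrative assumptions only.

<syntaxhighlight lang="python">
import numpy as np

# Illustrative shapes: M explanatory variables plus an intercept, K outcomes.
M, K = 4, 3
rng = np.random.default_rng(0)

beta = rng.normal(size=(K, M + 1))                  # one coefficient vector per outcome k
x_i = np.concatenate(([1.0], rng.normal(size=M)))   # leading 1 plays the role of the intercept

# f(k, i) = beta_k . x_i, computed for every outcome k at once.
scores = beta @ x_i
print(scores)                                       # one raw score per outcome
</syntaxhighlight>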


As a set of independent binary regressions

To arrive at the multinomial logit model, one can imagine, for ''K'' possible outcomes, running ''K''-1 independent binary logistic regression models, in which one outcome is chosen as a "pivot" and then the other ''K''-1 outcomes are separately regressed against the pivot outcome. This would proceed as follows, if outcome ''K'' (the last outcome) is chosen as the pivot:

: \begin{align}
\ln \frac{\Pr(Y_i=1)}{\Pr(Y_i=K)} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i \\
\ln \frac{\Pr(Y_i=2)}{\Pr(Y_i=K)} &= \boldsymbol\beta_2 \cdot \mathbf{X}_i \\
&\;\;\vdots \\
\ln \frac{\Pr(Y_i=K-1)}{\Pr(Y_i=K)} &= \boldsymbol\beta_{K-1} \cdot \mathbf{X}_i
\end{align}

This formulation is also known as the additive log ratio (alr) transform commonly used in compositional data analysis. Note that we have introduced separate sets of regression coefficients, one for each possible outcome. If we exponentiate both sides and solve for the probabilities, we get:

: \begin{align}
\Pr(Y_i=1) &= \Pr(Y_i=K) \, e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} \\
\Pr(Y_i=2) &= \Pr(Y_i=K) \, e^{\boldsymbol\beta_2 \cdot \mathbf{X}_i} \\
&\;\;\vdots \\
\Pr(Y_i=K-1) &= \Pr(Y_i=K) \, e^{\boldsymbol\beta_{K-1} \cdot \mathbf{X}_i}
\end{align}

Using the fact that all ''K'' of the probabilities must sum to one, we find:

:\Pr(Y_i=K) = 1 - \sum_{k=1}^{K-1} \Pr(Y_i=k) = 1 - \sum_{k=1}^{K-1} \Pr(Y_i=K) \, e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} \;\Rightarrow\; \Pr(Y_i=K) = \frac{1}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}

We can use this to find the other probabilities:

: \begin{align}
\Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}} \\
\Pr(Y_i=2) &= \frac{e^{\boldsymbol\beta_2 \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}} \\
&\;\;\vdots \\
\Pr(Y_i=K-1) &= \frac{e^{\boldsymbol\beta_{K-1} \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}
\end{align}

or generally:

:\Pr(Y_i=k) = \frac{e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}{\sum_{j=1}^{K} e^{\boldsymbol\beta_j \cdot \mathbf{X}_i}}

where \boldsymbol\beta_K is defined to be zero. The fact that we run multiple regressions reveals why the model relies on the assumption of independence of irrelevant alternatives described above.
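A short sketch of recovering the ''K'' class probabilities from the ''K''-1 pivoted log-odds, with outcome ''K'' as the pivot; the numeric log-odds are made up for illustration.

<syntaxhighlight lang="python">
import numpy as np

def probs_from_pivot(log_odds):
    """log_odds: array of the K-1 values beta_k . X_i, k = 1..K-1.
    (No overflow guard here; a robust version would shift the exponents.)"""
    e = np.exp(log_odds)
    denom = 1.0 + e.sum()
    return np.append(e, 1.0) / denom     # last entry is Pr(Y_i = K)

p = probs_from_pivot(np.array([0.5, -1.0, 2.0]))   # K = 4 outcomes
print(p, p.sum())                                   # probabilities sum to 1
</syntaxhighlight>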


Estimating the coefficients

The unknown parameters in each vector \boldsymbol\beta_k are typically jointly estimated by maximum a posteriori (MAP) estimation, which is an extension of maximum likelihood using regularization of the weights to prevent pathological solutions (usually a squared regularizing function, which is equivalent to placing a zero-mean Gaussian prior distribution on the weights, but other distributions are also possible). The solution is typically found using an iterative procedure such as generalized iterative scaling or iteratively reweighted least squares (IRLS), by means of gradient-based optimization algorithms such as L-BFGS, or by specialized coordinate descent algorithms.
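A minimal sketch of MAP estimation along these lines, using SciPy's general-purpose L-BFGS routine and a squared (Gaussian-prior) penalty. The toy data, penalty strength, and the choice to penalize the intercepts as well are all assumptions for illustration, not a production fitting procedure.

<syntaxhighlight lang="python">
import numpy as np
from scipy.optimize import minimize

def neg_log_posterior(beta_flat, X, y, K, lam):
    """Negative penalized log-likelihood (MAP with a zero-mean Gaussian prior
    on the weights, i.e. a squared regularizer)."""
    N, M1 = X.shape
    beta = beta_flat.reshape(K, M1)
    scores = X @ beta.T                               # (N, K) raw scores
    scores -= scores.max(axis=1, keepdims=True)       # shift for numerical stability
    log_probs = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
    nll = -log_probs[np.arange(N), y].sum()
    return nll + lam * (beta_flat ** 2).sum() / 2.0   # note: also shrinks intercepts

# Toy data: N observations, intercept column plus M features, K classes.
rng = np.random.default_rng(0)
N, M, K = 300, 4, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, M))])
y = rng.integers(0, K, size=N)

res = minimize(neg_log_posterior, np.zeros(K * (M + 1)),
               args=(X, y, K, 1.0), method="L-BFGS-B")
beta_hat = res.x.reshape(K, M + 1)
print(res.success, beta_hat.shape)
</syntaxhighlight>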


As a log-linear model

The formulation of binary logistic regression as a log-linear model can be directly extended to multi-way regression. That is, we model the logarithm of the probability of seeing a given output using the linear predictor as well as an additional normalization factor, the logarithm of the partition function:

: \begin{align}
\ln \Pr(Y_i=1) &= \boldsymbol\beta_1 \cdot \mathbf{X}_i - \ln Z \\
\ln \Pr(Y_i=2) &= \boldsymbol\beta_2 \cdot \mathbf{X}_i - \ln Z \\
&\;\;\vdots \\
\ln \Pr(Y_i=K) &= \boldsymbol\beta_K \cdot \mathbf{X}_i - \ln Z
\end{align}

As in the binary case, we need an extra term - \ln Z to ensure that the whole set of probabilities forms a probability distribution, i.e. so that they all sum to one:

:\sum_{k=1}^{K} \Pr(Y_i=k) = 1

The reason why we need to add a term to ensure normalization, rather than multiply as is usual, is because we have taken the logarithm of the probabilities. Exponentiating both sides turns the additive term into a multiplicative factor, so that the probability is just the Gibbs measure:

: \begin{align}
\Pr(Y_i=1) &= \frac{1}{Z} e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i} \\
\Pr(Y_i=2) &= \frac{1}{Z} e^{\boldsymbol\beta_2 \cdot \mathbf{X}_i} \\
&\;\;\vdots \\
\Pr(Y_i=K) &= \frac{1}{Z} e^{\boldsymbol\beta_K \cdot \mathbf{X}_i}
\end{align}

The quantity ''Z'' is called the partition function for the distribution. We can compute the value of the partition function by applying the above constraint that requires all probabilities to sum to 1:

:1 = \sum_{k=1}^{K} \Pr(Y_i=k) = \sum_{k=1}^{K} \frac{1}{Z} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} = \frac{1}{Z} \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}

Therefore:

:Z = \sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}

Note that this factor is "constant" in the sense that it is not a function of ''Y''''i'', which is the variable over which the probability distribution is defined. However, it is definitely not constant with respect to the explanatory variables, or crucially, with respect to the unknown regression coefficients \boldsymbol\beta_k, which we will need to determine through some sort of optimization procedure. The resulting equations for the probabilities are:

: \begin{align}
\Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta_1 \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}} \\
&\;\;\vdots \\
\Pr(Y_i=K) &= \frac{e^{\boldsymbol\beta_K \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}
\end{align}

Or generally:

:\Pr(Y_i=c) = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}

The following function:

:\operatorname{softmax}(k, x_1, \ldots, x_n) = \frac{e^{x_k}}{\sum_{i=1}^{n} e^{x_i}}

is referred to as the softmax function. The reason is that the effect of exponentiating the values x_1, \ldots, x_n is to exaggerate the differences between them. As a result, \operatorname{softmax}(k, x_1, \ldots, x_n) will return a value close to 0 whenever x_k is significantly less than the maximum of all the values, and will return a value close to 1 when applied to the maximum value, unless it is extremely close to the next-largest value. Thus, the softmax function can be used to construct a weighted average that behaves as a smooth function (which can be conveniently differentiated, etc.) and which approximates the indicator function

:f(k) = \begin{cases} 1 & \text{if } k = \operatorname{arg\,max}(x_1, \ldots, x_n), \\ 0 & \text{otherwise}. \end{cases}

Thus, we can write the probability equations as

:\Pr(Y_i=c) = \operatorname{softmax}(c, \boldsymbol\beta_1 \cdot \mathbf{X}_i, \ldots, \boldsymbol\beta_K \cdot \mathbf{X}_i)

The softmax function thus serves as the equivalent of the logistic function in binary logistic regression.

Note that not all of the \boldsymbol\beta_k vectors of coefficients are uniquely identifiable. This is due to the fact that all probabilities must sum to 1, making one of them completely determined once all the rest are known. As a result, there are only ''K''-1 separately specifiable probabilities, and hence ''K''-1 separately identifiable vectors of coefficients. One way to see this is to note that if we add a constant vector C to all of the coefficient vectors, the equations are identical:

:\frac{e^{(\boldsymbol\beta_c + C) \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{(\boldsymbol\beta_k + C) \cdot \mathbf{X}_i}} = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i} e^{C \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i} e^{C \cdot \mathbf{X}_i}} = \frac{e^{\boldsymbol\beta_c \cdot \mathbf{X}_i}}{\sum_{k=1}^{K} e^{\boldsymbol\beta_k \cdot \mathbf{X}_i}}

As a result, it is conventional to set C = -\boldsymbol\beta_K (or alternatively, one of the other coefficient vectors). Essentially, we set the constant so that one of the vectors becomes 0, and all of the other vectors get transformed into the difference between those vectors and the vector we chose. This is equivalent to "pivoting" around one of the ''K'' choices, and examining how much better or worse all of the other ''K''-1 choices are, relative to the choice we are pivoting around. Mathematically, we transform the coefficients as follows:

: \begin{align}
\boldsymbol\beta'_1 &= \boldsymbol\beta_1 - \boldsymbol\beta_K \\
&\;\;\vdots \\
\boldsymbol\beta'_{K-1} &= \boldsymbol\beta_{K-1} - \boldsymbol\beta_K \\
\boldsymbol\beta'_K &= 0
\end{align}

This leads to the following equations:

: \begin{align}
\Pr(Y_i=1) &= \frac{e^{\boldsymbol\beta'_1 \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}} \\
&\;\;\vdots \\
\Pr(Y_i=K-1) &= \frac{e^{\boldsymbol\beta'_{K-1} \cdot \mathbf{X}_i}}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}} \\
\Pr(Y_i=K) &= \frac{1}{1 + \sum_{k=1}^{K-1} e^{\boldsymbol\beta'_k \cdot \mathbf{X}_i}}
\end{align}

Other than the prime symbols on the regression coefficients, this is exactly the same as the form of the model described above, in terms of ''K''-1 independent two-way regressions.
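This pivoting invariance is easy to verify numerically. The following sketch, with arbitrary random coefficients, checks that subtracting \boldsymbol\beta_K from every coefficient vector leaves the softmax probabilities unchanged.

<syntaxhighlight lang="python">
import numpy as np

def softmax(scores):
    z = np.exp(scores - scores.max())   # the shift changes nothing, as shown above
    return z / z.sum()

rng = np.random.default_rng(0)
beta = rng.normal(size=(3, 5))          # K = 3 coefficient vectors
x = rng.normal(size=5)

p = softmax(beta @ x)
p_pivoted = softmax((beta - beta[-1]) @ x)   # subtract beta_K from every vector
print(np.allclose(p, p_pivoted))             # True: only K-1 vectors are identifiable
</syntaxhighlight>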


As a latent-variable model

It is also possible to formulate multinomial logistic regression as a latent variable model, following the two-way latent variable model described for binary logistic regression. This formulation is common in the theory of discrete choice models, and makes it easier to compare multinomial logistic regression to the related multinomial probit model, as well as to extend it to more complex models.

Imagine that, for each data point ''i'' and possible outcome ''k'' = 1, 2, ..., ''K'', there is a continuous latent variable Y_{i,k}^{\ast} (i.e. an unobserved random variable) that is distributed as follows:

: \begin{align}
Y_{i,1}^{\ast} &= \boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 \\
Y_{i,2}^{\ast} &= \boldsymbol\beta_2 \cdot \mathbf{X}_i + \varepsilon_2 \\
&\;\;\vdots \\
Y_{i,K}^{\ast} &= \boldsymbol\beta_K \cdot \mathbf{X}_i + \varepsilon_K
\end{align}

where \varepsilon_k \sim \operatorname{EV}_1(0,1), i.e. a standard type-1 extreme value distribution.

This latent variable can be thought of as the utility associated with data point ''i'' choosing outcome ''k'', where there is some randomness in the actual amount of utility obtained, which accounts for other unmodeled factors that go into the choice. The value of the actual variable Y_i is then determined in a non-random fashion from these latent variables (i.e. the randomness has been moved from the observed outcomes into the latent variables), where outcome ''k'' is chosen if and only if the associated utility (the value of Y_{i,k}^{\ast}) is greater than the utilities of all the other choices, i.e. if the utility associated with outcome ''k'' is the maximum of all the utilities. Since the latent variables are continuous, the probability of two having exactly the same value is 0, so we ignore the scenario. That is:

: \begin{align}
\Pr(Y_i = 1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,2}^{\ast} \text{ and } Y_{i,1}^{\ast} > Y_{i,3}^{\ast} \text{ and } \cdots \text{ and } Y_{i,1}^{\ast} > Y_{i,K}^{\ast}) \\
\Pr(Y_i = 2) &= \Pr(Y_{i,2}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,2}^{\ast} > Y_{i,3}^{\ast} \text{ and } \cdots \text{ and } Y_{i,2}^{\ast} > Y_{i,K}^{\ast}) \\
&\;\;\vdots \\
\Pr(Y_i = K) &= \Pr(Y_{i,K}^{\ast} > Y_{i,1}^{\ast} \text{ and } Y_{i,K}^{\ast} > Y_{i,2}^{\ast} \text{ and } \cdots \text{ and } Y_{i,K}^{\ast} > Y_{i,K-1}^{\ast})
\end{align}

Or equivalently:

: \begin{align}
\Pr(Y_i = 1) &= \Pr(\max(Y_{i,1}^{\ast}, Y_{i,2}^{\ast}, \ldots, Y_{i,K}^{\ast}) = Y_{i,1}^{\ast}) \\
\Pr(Y_i = 2) &= \Pr(\max(Y_{i,1}^{\ast}, Y_{i,2}^{\ast}, \ldots, Y_{i,K}^{\ast}) = Y_{i,2}^{\ast}) \\
&\;\;\vdots \\
\Pr(Y_i = K) &= \Pr(\max(Y_{i,1}^{\ast}, Y_{i,2}^{\ast}, \ldots, Y_{i,K}^{\ast}) = Y_{i,K}^{\ast})
\end{align}

Let's look more closely at the first equation, which we can write as follows:

: \begin{align}
\Pr(Y_i = 1) &= \Pr(Y_{i,1}^{\ast} > Y_{i,k}^{\ast}\ \forall\ k = 2, \ldots, K) \\
&= \Pr(Y_{i,1}^{\ast} - Y_{i,k}^{\ast} > 0\ \forall\ k = 2, \ldots, K) \\
&= \Pr(\boldsymbol\beta_1 \cdot \mathbf{X}_i + \varepsilon_1 - (\boldsymbol\beta_k \cdot \mathbf{X}_i + \varepsilon_k) > 0\ \forall\ k = 2, \ldots, K) \\
&= \Pr((\boldsymbol\beta_1 - \boldsymbol\beta_k) \cdot \mathbf{X}_i > \varepsilon_k - \varepsilon_1\ \forall\ k = 2, \ldots, K)
\end{align}

There are a few things to realize here:
#In general, if X \sim \operatorname{EV}_1(a,b) and Y \sim \operatorname{EV}_1(a,b) then X - Y \sim \operatorname{Logistic}(0,b). That is, the difference of two independent identically distributed extreme-value-distributed variables follows the logistic distribution, where the first parameter is unimportant. This is understandable since the first parameter is a location parameter, i.e. it shifts the mean by a fixed amount, and if two values are both shifted by the same amount, their difference remains the same. This means that all of the relational statements underlying the probability of a given choice involve the logistic distribution, which makes the initial choice of the extreme-value distribution, which seemed rather arbitrary, somewhat more understandable.
#The second parameter in an extreme-value or logistic distribution is a scale parameter, such that if X \sim \operatorname{Logistic}(0,1) then bX \sim \operatorname{Logistic}(0,b). This means that the effect of using an error variable with an arbitrary scale parameter in place of scale 1 can be compensated simply by multiplying all regression vectors by the same scale. Together with the previous point, this shows that the use of a standard extreme-value distribution (location 0, scale 1) for the error variables entails no loss of generality over using an arbitrary extreme-value distribution. In fact, the model is nonidentifiable (no single set of optimal coefficients) if the more general distribution is used.
#Because only differences of vectors of regression coefficients are used, adding an arbitrary constant to all coefficient vectors has no effect on the model. This means that, just as in the log-linear model, only ''K''-1 of the coefficient vectors are identifiable, and the last one can be set to an arbitrary value (e.g. 0).
Actually finding the values of the above probabilities is somewhat difficult, and is a problem of computing a particular order statistic (the first, i.e. maximum) of a set of values. However, it can be shown that the resulting expressions are the same as in the above formulations, i.e. the two are equivalent.
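This equivalence can be checked by simulation: drawing standard Gumbel (type-1 extreme value) noise, adding it to the scores, and taking the argmax of the resulting utilities reproduces the softmax probabilities. The scores below are arbitrary illustrative values.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
scores = np.array([1.0, 0.5, -0.3])        # beta_k . X_i for K = 3 outcomes

# Draw standard Gumbel (type-1 extreme value) noise and pick the max utility.
n = 200_000
noise = rng.gumbel(size=(n, 3))
choices = np.argmax(scores + noise, axis=1)
empirical = np.bincount(choices, minlength=3) / n

softmax = np.exp(scores) / np.exp(scores).sum()
print(empirical)   # matches the multinomial logit probabilities below
print(softmax)
</syntaxhighlight>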


Estimation of intercept

When using multinomial logistic regression, one category of the dependent variable is chosen as the reference category. Separate odds ratios are determined for all independent variables for each category of the dependent variable with the exception of the reference category, which is omitted from the analysis. The exponentiated beta coefficient represents the change in the odds of the dependent variable being in a particular category vis-à-vis the reference category, associated with a one-unit change of the corresponding independent variable.
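For example, with a hypothetical fitted coefficient (the value 0.4 is made up purely for illustration):

<syntaxhighlight lang="python">
import numpy as np

# Hypothetical coefficient for one independent variable and one
# non-reference category, relative to the reference category.
beta = 0.4

# The exponentiated coefficient is the factor by which the odds of being in
# that category (versus the reference) change per one-unit increase in x.
odds_ratio = np.exp(beta)
print(odds_ratio)   # ≈ 1.49: the odds increase by about 49% per unit of x
</syntaxhighlight>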


Application in natural language processing

In natural language processing, multinomial LR classifiers are commonly used as an alternative to naive Bayes classifiers because they do not assume statistical independence of the random variables (commonly known as ''features'') that serve as predictors. However, learning in such a model is slower than for a naive Bayes classifier, and thus may not be appropriate given a very large number of classes to learn. In particular, learning in a naive Bayes classifier is a simple matter of counting up the number of co-occurrences of features and classes, while in a maximum entropy classifier the weights, which are typically maximized using maximum a posteriori (MAP) estimation, must be learned using an iterative procedure; see the section ''Estimating the coefficients'' above.


See also

*Logistic regression
*Multinomial probit

