In statistics, ordinal regression, also called ordinal classification, is a type of regression analysis used for predicting an ordinal variable, i.e. a variable whose value exists on an arbitrary scale where only the relative ordering between different values is significant. It can be considered an intermediate problem between regression and classification. Examples of ordinal regression are ordered logit and ordered probit. Ordinal regression turns up often in the social sciences, for example in the modeling of human levels of preference (on a scale from, say, 1–5 for "very poor" through "excellent"), as well as in information retrieval. In machine learning, ordinal regression may also be called ranking learning.
Linear models for ordinal regression
Ordinal regression can be performed using a generalized linear model (GLM) that fits both a coefficient vector and a set of ''thresholds'' to a dataset. Suppose one has a set of observations, represented by length-$p$ vectors $\mathbf{x}_1$ through $\mathbf{x}_n$, with associated responses $y_1$ through $y_n$, where each $y_i$ is an ordinal variable on a scale $1, \dots, K$. For simplicity, and without loss of generality, we assume $\mathbf{y}$ is a non-decreasing vector, that is, $y_i \le y_{i+1}$. To this data, one fits a length-$p$ coefficient vector $\mathbf{w}$ and a set of thresholds $\theta_1, \dots, \theta_{K-1}$ with the property that $\theta_1 < \theta_2 < \dots < \theta_{K-1}$. This set of thresholds divides the real number line into $K$ disjoint segments, corresponding to the $K$ response levels.
The model can now be formulated as
: $\Pr(y \le i \mid \mathbf{x}) = \sigma(\theta_i - \mathbf{w} \cdot \mathbf{x})$
or, the cumulative probability of the response $y$ being at most $i$ is given by a function $\sigma$ (the inverse link function) applied to a linear function of $\mathbf{x}$. Several choices exist for $\sigma$; the logistic function
: $\sigma(\theta_i - \mathbf{w} \cdot \mathbf{x}) = \frac{1}{1 + e^{-(\theta_i - \mathbf{w} \cdot \mathbf{x})}}$
gives the ordered logit model, while using the CDF of the standard normal distribution gives the ordered probit model. A third option is to use an exponential function
: $\sigma(\theta_i - \mathbf{w} \cdot \mathbf{x}) = 1 - \exp(-\exp(\theta_i - \mathbf{w} \cdot \mathbf{x}))$
which gives the proportional hazards model.
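As a minimal sketch of the three inverse link functions named above, the following Python snippet evaluates the cumulative probability of the response being at most each level and differences those values to obtain a probability for every level. The function names (`cumulative_probs`, `level_probs`) and the example coefficients, observation, and thresholds are illustrative assumptions, not taken from any library or dataset.

```python
import math

def logistic(z):
    """Inverse link for the ordered logit model."""
    return 1.0 / (1.0 + math.exp(-z))

def normal_cdf(z):
    """Standard normal CDF, the inverse link for the ordered probit model."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def cloglog_inverse(z):
    """Inverse link 1 - exp(-exp(z)) for the proportional hazards model."""
    return 1.0 - math.exp(-math.exp(z))

def cumulative_probs(w, x, thresholds, sigma):
    """Return [Pr(y <= i | x) for i = 1..K-1], computed as sigma(theta_i - w.x)."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    return [sigma(theta - score) for theta in thresholds]

def level_probs(w, x, thresholds, sigma):
    """Return [Pr(y = k | x) for k = 1..K] by differencing cumulative values."""
    # Pr(y <= 0) = 0 and Pr(y <= K) = 1 close off the two end segments.
    cum = [0.0] + cumulative_probs(w, x, thresholds, sigma) + [1.0]
    return [upper - lower for lower, upper in zip(cum, cum[1:])]

# Hypothetical example: p = 2 features, K = 4 levels (three increasing thresholds).
w = [0.8, -0.5]
x = [1.0, 2.0]
thresholds = [-1.0, 0.0, 1.5]

for name, sigma in [("logit", logistic),
                    ("probit", normal_cdf),
                    ("hazards", cloglog_inverse)]:
    probs = level_probs(w, x, thresholds, sigma)
    print(name, [round(p, 3) for p in probs])
```

Because the thresholds are strictly increasing and each inverse link is increasing, the differences are positive and sum to one, so each choice of link yields a valid distribution over the levels.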
Latent variable model
The probit version of the above model can be justified by assuming the existence of a real-valued latent variable (unobserved quantity) $y^*$, determined by
: $y^* = \mathbf{w} \cdot \mathbf{x} + \varepsilon$
where $\varepsilon$ is normally distributed with zero mean and unit variance, conditioned on $\mathbf{x}$. The response variable $y$ results from an "incomplete measurement" of $y^*$, where one only determines the interval into which $y^*$ falls:
: $y = \begin{cases} 1 & \text{if } y^* \le \theta_1, \\ 2 & \text{if } \theta_1 < y^* \le \theta_2, \\ 3 & \text{if } \theta_2 < y^* \le \theta_3, \\ \vdots \\ K & \text{if } \theta_{K-1} < y^*. \end{cases}$
Defining $\theta_0 = -\infty$ and $\theta_K = \infty$, the above can be summarized as $y = k$ if and only if $\theta_{k-1} < y^* \le \theta_k$.
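The "incomplete measurement" can be illustrated with a short simulation. The sketch below draws the latent variable as a linear score plus standard normal noise and then reports only the interval between thresholds into which it falls. The helper names and all parameter values are assumptions chosen for illustration.

```python
import bisect
import random

def observe_level(y_star, thresholds):
    """Map a latent value to its ordinal level k in 1..K.

    bisect_left returns the number of thresholds strictly below y_star,
    so the result is k exactly when theta_{k-1} < y_star <= theta_k
    (with theta_0 = -inf and theta_K = +inf implied).
    """
    return bisect.bisect_left(thresholds, y_star) + 1

def sample_response(w, x, thresholds, rng):
    """Draw y* = w.x + eps with eps ~ N(0, 1) and return the observed level."""
    score = sum(wj * xj for wj, xj in zip(w, x))
    y_star = score + rng.gauss(0.0, 1.0)
    return observe_level(y_star, thresholds)

thresholds = [-1.0, 0.0, 1.5]   # hypothetical thresholds, K = 4 levels
w, x = [0.8, -0.5], [1.0, 2.0]  # hypothetical coefficients and observation

rng = random.Random(0)          # fixed seed so the run is repeatable
draws = [sample_response(w, x, thresholds, rng) for _ in range(10_000)]
freq = [draws.count(k) / len(draws) for k in range(1, len(thresholds) + 2)]
print(freq)  # empirical frequencies of levels 1..K
```

With enough draws, the empirical frequencies should approach the level probabilities implied by the probit model for the same coefficients and thresholds.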
From these assumptions, one can derive the conditional distribution of $y$ as
: $\begin{aligned} P(y = k \mid \mathbf{x}) &= P(\theta_{k-1} < y^* \le \theta_k \mid \mathbf{x}) \\ &= P(\theta_{k-1} < \mathbf{w} \cdot \mathbf{x} + \varepsilon \le \theta_k) \\ &= \Phi(\theta_k - \mathbf{w} \cdot \mathbf{x}) - \Phi(\theta_{k-1} - \mathbf{w} \cdot \mathbf{x}) \end{aligned}$
where $\Phi$ is the cumulative distribution function of the standard normal distribution, and takes on the role of the inverse link function $\sigma$. The
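log-likelihood of the model for a single training example $\mathbf{x}_i$, $y_i$ can now be stated as
: $\log \mathcal{L}(\mathbf{w}, \boldsymbol{\theta} \mid \mathbf{x}_i, y_i) = \sum_{k=1}^{K} [y_i = k] \log\left[\Phi(\theta_k - \mathbf{w} \cdot \mathbf{x}_i) - \Phi(\theta_{k-1} - \mathbf{w} \cdot \mathbf{x}_i)\right]$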
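A hedged sketch of this per-example log-likelihood in Python: the log-likelihood of one example is the logarithm of the probability the probit model assigns to its observed level, $\log[\Phi(\theta_{y_i} - \mathbf{w} \cdot \mathbf{x}_i) - \Phi(\theta_{y_i - 1} - \mathbf{w} \cdot \mathbf{x}_i)]$, so the function below evaluates only that term. `probit_log_likelihood` is an illustrative name, the numeric values are made up, and $\Phi$ is computed from the standard library's `math.erf`.

```python
import math

def normal_cdf(z):
    """Standard normal CDF Phi; math.erf also accepts +/-inf."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def probit_log_likelihood(w, thresholds, x_i, y_i):
    """Log-likelihood of one example (x_i, y_i), with y_i in 1..K."""
    # Pad with theta_0 = -inf and theta_K = +inf so every level has two bounds.
    theta = [-math.inf] + list(thresholds) + [math.inf]
    score = sum(wj * xj for wj, xj in zip(w, x_i))
    upper = normal_cdf(theta[y_i] - score)      # Phi(theta_k - w.x_i)
    lower = normal_cdf(theta[y_i - 1] - score)  # Phi(theta_{k-1} - w.x_i)
    return math.log(upper - lower)

# Hypothetical parameters: p = 2 features, K = 4 levels.
w = [0.8, -0.5]
thresholds = [-1.0, 0.0, 1.5]
x_i = [1.0, 2.0]

for y_i in range(1, len(thresholds) + 2):
    print(y_i, probit_log_likelihood(w, thresholds, x_i, y_i))
```

This direct form is only a sketch: for scores far from the thresholds the difference of the two CDF values can underflow to zero, so a practical implementation would likely need a more numerically careful formulation.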