In
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
, a probabilistic classifier is a
classifier that is able to predict, given an observation of an input, a
probability distribution
In probability theory and statistics, a probability distribution is the mathematical function that gives the probabilities of occurrence of different possible outcomes for an experiment. It is a mathematical description of a random phenomenon i ...
over a
set
Set, The Set, SET or SETS may refer to:
Science, technology, and mathematics Mathematics
*Set (mathematics), a collection of elements
*Category of sets, the category whose objects and morphisms are sets and total functions, respectively
Electro ...
of classes, rather than only outputting the most likely class that the observation should belong to. Probabilistic classifiers provide classification that can be useful in its own right or when combining classifiers into
ensembles.
Types of classification
Formally, an "ordinary" classifier is some rule, or
function
Function or functionality may refer to:
Computing
* Function key, a type of key on computer keyboards
* Function model, a structured representation of processes in a system
* Function object or functor or functionoid, a concept of object-oriente ...
, that assigns to a sample a class label :
:
The samples come from some set (e.g., the set of all
documents
A document is a writing, written, drawing, drawn, presented, or memorialized representation of thought, often the manifestation of nonfiction, non-fictional, as well as fictional, content. The word originates from the Latin ''Documentum'', w ...
, or the set of all
images
An image is a visual representation of something. It can be two-dimensional, three-dimensional, or somehow otherwise feed into the visual system to convey information. An image can be an artifact, such as a photograph or other two-dimensiona ...
), while the class labels form a finite set defined prior to training.
Probabilistic classifiers generalize this notion of classifiers: instead of functions, they are
conditional distributions
, meaning that for a given
, they assign probabilities to all
(and these probabilities sum to one). "Hard" classification can then be done using the
optimal decision rule
:
or, in English, the predicted class is that which has the highest probability.
Binary probabilistic classifiers are also called
binary regression
In statistics, specifically regression analysis, a binary regression estimates a relationship between one or more explanatory variables and a single output binary variable. Generally the probability of the two alternatives is modeled, instead of si ...
models in
statistics
Statistics (from German language, German: ''wikt:Statistik#German, Statistik'', "description of a State (polity), state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of ...
. In
econometrics
Econometrics is the application of Statistics, statistical methods to economic data in order to give Empirical evidence, empirical content to economic relationships.M. Hashem Pesaran (1987). "Econometrics," ''The New Palgrave: A Dictionary of ...
, probabilistic classification in general is called
discrete choice
In economics, discrete choice models, or qualitative choice models, describe, explain, and predict choices between two or more discrete alternatives, such as entering or not entering the labor market, or choosing between modes of transport. Such ...
.
Some classification models, such as
naive Bayes
In statistics, naive Bayes classifiers are a family of simple "Probabilistic classification, probabilistic classifiers" based on applying Bayes' theorem with strong (naive) statistical independence, independence assumptions between the features (s ...
,
logistic regression
In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear function (calculus), linear combination of one or more independent var ...
and
multilayer perceptron
A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean ''any'' feedforward ANN, sometimes strictly to refer to networks composed of mu ...
s (when trained under an appropriate
loss function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost ...
) are naturally probabilistic. Other models such as
support vector machine
In machine learning, support vector machines (SVMs, also support vector networks) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis. Developed at AT&T Bell Laboratorie ...
s are not, but
methods exist to turn them into probabilistic classifiers.
Generative and conditional training
Some models, such as
logistic regression
In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear function (calculus), linear combination of one or more independent var ...
, are conditionally trained: they optimize the conditional probability
directly on a training set (see
empirical risk minimization
Empirical risk minimization (ERM) is a principle in statistical learning theory which defines a family of learning algorithms and is used to give theoretical bounds on their performance. The core idea is that we cannot know exactly how well an alg ...
). Other classifiers, such as
naive Bayes
In statistics, naive Bayes classifiers are a family of simple "Probabilistic classification, probabilistic classifiers" based on applying Bayes' theorem with strong (naive) statistical independence, independence assumptions between the features (s ...
, are trained
generatively: at training time, the class-conditional distribution
and the class
prior
Prior (or prioress) is an ecclesiastical title for a superior in some religious orders. The word is derived from the Latin for "earlier" or "first". Its earlier generic usage referred to any monastic superior. In abbeys, a prior would be l ...
are found, and the conditional distribution
is derived using
Bayes' rule
In probability theory and statistics, Bayes' theorem (alternatively Bayes' law or Bayes' rule), named after Thomas Bayes, describes the probability of an event, based on prior knowledge of conditions that might be related to the event. For examp ...
.
Probability calibration
Not all classification models are naturally probabilistic, and some that are, notably naive Bayes classifiers,
decision trees
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains condit ...
and
boosting methods, produce distorted class probability distributions.
In the case of decision trees, where is the proportion of training samples with label in the leaf where ends up, these distortions come about because learning algorithms such as
C4.5 or
CART
A cart or dray (Australia and New Zealand) is a vehicle designed for transport, using two wheels and normally pulled by one or a pair of draught animals. A handcart is pulled or pushed by one or more people.
It is different from the flatbed tr ...
explicitly aim to produce homogeneous leaves (giving probabilities close to zero or one, and thus high
bias
Bias is a disproportionate weight ''in favor of'' or ''against'' an idea or thing, usually in a way that is closed-minded, prejudicial, or unfair. Biases can be innate or learned. People may develop biases for or against an individual, a group, ...
) while using few samples to estimate the relevant proportion (high
variance
In probability theory and statistics, variance is the expectation of the squared deviation of a random variable from its population mean or sample mean. Variance is a measure of dispersion, meaning it is a measure of how far a set of numbers ...
).
Calibration can be assessed using a calibration plot (also called a reliability diagram).
A calibration plot shows the proportion of items in each class for bands of predicted probability or score (such as a distorted probability distribution or the "signed distance to the hyperplane" in a support vector machine). Deviations from the identity function indicate a poorly-calibrated classifier for which the predicted probabilities or scores can not be used as probabilities. In this case one can use a method to turn these scores into properly
calibrated
In measurement technology and metrology, calibration is the comparison of measurement values delivered by a device under test with those of a calibration standard of known accuracy. Such a standard could be another measurement device of known a ...
class membership probabilities.
For the
binary
Binary may refer to:
Science and technology Mathematics
* Binary number, a representation of numbers using only two digits (0 and 1)
* Binary function, a function that takes two arguments
* Binary operation, a mathematical operation that t ...
case, a common approach is to apply
Platt scaling
In machine learning, Platt scaling or Platt calibration is a way of transforming the outputs of a classification model into a probability distribution over classes. The method was invented by John Platt in the context of support vector machine ...
, which learns a
logistic regression
In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear function (calculus), linear combination of one or more independent var ...
model on the scores.
An alternative method using
isotonic regression
In statistics and numerical analysis, isotonic regression or monotonic regression is the technique of fitting a free-form line to a sequence of observations such that the fitted line is non-decreasing (or non-increasing) everywhere, and lies as ...
is generally superior to Platt's method when sufficient training data is available.
In the
multiclass case, one can use a reduction to binary tasks, followed by univariate calibration with an algorithm as described above and further application of the pairwise coupling algorithm by Hastie and Tibshirani.
Evaluating probabilistic classification
Commonly used loss functions for probabilistic classification include
log loss
In information theory, the cross-entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is ...
and the
Brier score
The Brier Score is a Scoring rule#StrictlyProperScoringRules, ''strictly proper score function'' or ''strictly proper scoring rule'' that measures the accuracy of probabilistic classification, probabilistic predictions. For unidimensional predicti ...
between the predicted and the true probability distributions. The former of these is commonly used to train logistic models.
A method used to assign scores to pairs of predicted probabilities and actual discrete outcomes, so that different predictive methods can be compared, is called a
scoring rule
In decision theory, a scoring rule
provides a summary measure for the evaluation of probabilistic predictions or forecasts. It is applicable to tasks in which predictions assign probabilities to events, i.e. one issues a probability distribution ...
.
References
{{reflist, 30em
Probabilistic models
Statistical classification