In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary classification). While many classification algorithms (notably multinomial logistic regression) naturally permit the use of more than two classes, some are by nature binary algorithms; these can, however, be turned into multinomial classifiers by a variety of strategies.

Multiclass classification should not be confused with multi-label classification, where multiple labels are to be predicted for each instance.

General strategies

The existing multi-class classification techniques can be categorized into (i) transformation to binary, (ii) extension from binary, and (iii) hierarchical classification.

Transformation to binary

This section discusses strategies for reducing the problem of multiclass classification to multiple binary classification problems. These strategies can be categorized into one-vs.-rest and one-vs.-one. Techniques that reduce the multi-class problem to multiple binary problems are also called problem transformation techniques.

One-vs.-rest

The one-vs.-rest (OvR, also called one-vs.-all, OvA, or one-against-all, OAA) strategy involves training a single classifier per class, with the samples of that class as positive samples and all other samples as negatives. This strategy requires the base classifiers to produce a real-valued confidence score for their decisions, rather than just a class label; discrete class labels alone can lead to ambiguities, where multiple classes are predicted for a single sample. In multi-label classification, OvR is known as binary relevance, and the prediction of multiple classes is considered a feature, not a problem.

In pseudocode, the training algorithm for an OvR learner constructed from a binary classification learner L is as follows:

Inputs:
* L, a learner (training algorithm for binary classifiers)
* samples X
* labels y, where y_i ∈ {1, ..., K} is the label for the sample X_i

Output:
* a list of classifiers f_k for k ∈ {1, ..., K}

Procedure:
* For each k in {1, ..., K}:
** Construct a new label vector z, where z_i = 1 if y_i = k and z_i = 0 otherwise
** Apply L to X, z to obtain f_k

Making decisions means applying all classifiers to an unseen sample x and predicting the label k for which the corresponding classifier reports the highest confidence score:

\hat{y} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; f_k(x)
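The training procedure and decision rule above can be sketched in Python as follows. This is a minimal illustration rather than a reference implementation: the base learner `train_centroid_scorer` is a hypothetical stand-in (it scores a sample by how much closer it lies to the mean of the positives than to the mean of the negatives), and the names `train_ovr` and `predict_ovr`, as well as the toy data, are assumptions made for the example.

```python
from typing import Callable, Dict, Hashable, List, Sequence

Vector = Sequence[float]
Scorer = Callable[[Vector], float]


def _mean(rows: List[Vector]) -> List[float]:
    # Component-wise mean of a non-empty list of vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]


def _sq_dist(a: Vector, b: Vector) -> float:
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_centroid_scorer(X: List[Vector], z: List[int]) -> Scorer:
    """Hypothetical binary learner: returns a real-valued confidence score.

    The score is positive when x is closer to the mean of the positives
    (z_i = 1) than to the mean of the negatives (z_i = 0).
    """
    pos = _mean([x for x, zi in zip(X, z) if zi == 1])
    neg = _mean([x for x, zi in zip(X, z) if zi == 0])
    return lambda x: _sq_dist(x, neg) - _sq_dist(x, pos)


def train_ovr(learner, X: List[Vector], y: List[Hashable]) -> Dict[Hashable, Scorer]:
    """Train one binary classifier f_k per class k."""
    classifiers = {}
    for k in sorted(set(y)):
        z = [1 if yi == k else 0 for yi in y]  # relabel: class k vs. rest
        classifiers[k] = learner(X, z)
    return classifiers


def predict_ovr(classifiers: Dict[Hashable, Scorer], x: Vector) -> Hashable:
    """Return the label whose classifier reports the highest score."""
    return max(classifiers, key=lambda k: classifiers[k](x))


# Toy data (made up for illustration): three well-separated classes.
X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9], [0.0, 5.0], [0.1, 5.2]]
y = ["a", "a", "b", "b", "c", "c"]
ovr = train_ovr(train_centroid_scorer, X, y)
print(predict_ovr(ovr, [0.1, 0.0]))
```

Any binary learner that returns a real-valued score could be passed in place of the stand-in; the wrapper itself only relabels the data and takes the arg max.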

Although this strategy is popular, it is a heuristic that suffers from several problems. First, the scale of the confidence values may differ between the binary classifiers. Second, even if the class distribution is balanced in the training set, the binary classification learners see unbalanced distributions, because the set of negatives they see is typically much larger than the set of positives.

One-vs.-one

In the one-vs.-one (OvO) reduction, one trains K(K - 1)/2 binary classifiers for a K-way multiclass problem; each receives the samples of a pair of classes from the original training set and must learn to distinguish these two classes. At prediction time, a voting scheme is applied: all K(K - 1)/2 classifiers are applied to an unseen sample, and the class that receives the highest number of "+1" predictions is predicted by the combined classifier. Like OvR, OvO suffers from ambiguities, in that some regions of its input space may receive the same number of votes.
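A rough sketch of the pairwise training and voting scheme is given below. The binary learner `train_pairwise` is a hypothetical nearest-class-mean rule chosen only to keep the example short, and the function names and toy data are assumptions of this sketch. Ties are broken by first-encountered order, which is an arbitrary choice and not something the voting scheme itself prescribes.

```python
from collections import Counter
from itertools import combinations


def _mean(rows):
    # Component-wise mean of a non-empty list of vectors.
    return [sum(col) / len(rows) for col in zip(*rows)]


def _sq_dist(a, b):
    return sum((p - q) ** 2 for p, q in zip(a, b))


def train_pairwise(X, y, a, b):
    """Hypothetical binary learner trained only on the samples of classes a and b."""
    mean_a = _mean([x for x, yi in zip(X, y) if yi == a])
    mean_b = _mean([x for x, yi in zip(X, y) if yi == b])
    return lambda x: a if _sq_dist(x, mean_a) <= _sq_dist(x, mean_b) else b


def train_ovo(X, y):
    """Train one classifier per unordered pair of classes: K(K - 1)/2 in total."""
    classes = sorted(set(y))
    return [train_pairwise(X, y, a, b) for a, b in combinations(classes, 2)]


def predict_ovo(classifiers, x):
    """Each pairwise classifier casts one vote; the most-voted class wins."""
    votes = Counter(clf(x) for clf in classifiers)
    return votes.most_common(1)[0][0]


# Toy data (made up for illustration): four classes near the corners of a square.
X = [[0.0, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9],
     [0.0, 5.0], [0.1, 5.2], [5.0, 0.0], [5.2, 0.1]]
y = ["a", "a", "b", "b", "c", "c", "d", "d"]
ovo = train_ovo(X, y)
print(len(ovo), predict_ovo(ovo, [0.1, 0.1]))
```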

Extension from binary

This section discusses strategies for extending existing binary classifiers to solve multi-class classification problems. Several algorithms have been developed based on neural networks, decision trees, k-nearest neighbors, naive Bayes, support vector machines and extreme learning machines to address multi-class classification problems. These types of techniques can also be called algorithm adaptation techniques.

Neural networks

Multiclass perceptrons provide a natural extension to the multi-class problem. Instead of having just one neuron in the output layer, with binary output, one could have N binary neurons, leading to multi-class classification. In practice, the last layer of a neural network is usually a softmax function layer, which is the algebraic simplification of N logistic classifiers, normalized per class by the sum of the N-1 other logistic classifiers.
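As a small illustration of a softmax output layer, the sketch below converts N raw output scores into class probabilities and picks the most probable class. The function name `softmax` and the example scores are assumptions made for this example; subtracting the maximum score is a common numerical-stability step that cancels in the ratio and does not change the result.

```python
import math


def softmax(scores):
    """Map N real-valued scores to N probabilities that sum to 1."""
    m = max(scores)  # shift for numerical stability; cancels in the ratio
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


scores = [2.0, 1.0, 0.1]  # hypothetical outputs of the last layer, N = 3
probs = softmax(scores)
predicted_class = max(range(len(probs)), key=probs.__getitem__)
print(probs, predicted_class)
```

For N = 2, `softmax([a, 0.0])[0]` coincides with the logistic function evaluated at `a`, which is one way to see softmax as a multiclass generalization of the logistic classifier.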

Extreme learning machines

Extreme learning machines (ELM) are a special case of single-hidden-layer feed-forward neural networks (SLFNs) in which the input weights and the hidden node biases can be chosen at random. Many variants and developments of the ELM have been proposed for multiclass classification.

k-nearest neighbours

kNN is considered among the oldest non-parametric classification algorithms. To classify an unknown example, the distance from that example to every training example is measured. The k smallest distances are identified, and the class most represented among these k nearest neighbours is taken as the output class label.
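The classification step just described can be sketched in a few lines. Euclidean distance, the default k = 3, the function name `knn_predict`, and the toy data are all assumptions of this sketch; other distance measures and tie-breaking rules are equally possible.

```python
import math
from collections import Counter


def knn_predict(X_train, y_train, x, k=3):
    """Majority vote among the k training examples closest to x."""
    by_distance = sorted(zip(X_train, y_train),
                         key=lambda pair: math.dist(pair[0], x))
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Toy data (made up for illustration): three classes, three examples each.
X_train = [[0.0, 0.0], [0.2, 0.1], [0.1, 0.3],
           [5.0, 5.0], [5.1, 4.9], [4.8, 5.2],
           [0.0, 5.0], [0.1, 5.2], [0.3, 4.9]]
y_train = ["a", "a", "a", "b", "b", "b", "c", "c", "c"]
print(knn_predict(X_train, y_train, [0.1, 0.1]))
```

Nothing in the vote depends on the number of classes, which is why the method needs no modification for the multiclass case.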

Naive Bayes

Naive Bayes is a successful classifier based upon the principle of maximum a posteriori (MAP). This approach is naturally extensible to the case of having more than two classes, and was shown to perform well in spite of the underlying simplifying assumption of conditional independence.
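For concreteness, the MAP decision rule can be written out as below. The notation is an assumption of this sketch rather than something taken from the text: C_k denotes class k out of K, and x = (x_1, ..., x_n) is the feature vector of the instance to classify.

```latex
\hat{y} = \underset{k \in \{1, \ldots, K\}}{\arg\max}\; P(C_k) \prod_{i=1}^{n} P(x_i \mid C_k)
```

The product form follows from the conditional independence assumption, and the rule is stated for any number of classes K, so no special treatment is needed beyond the binary case.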

Decision trees

Decision tree learning is a powerful classification technique. The tree tries to infer a split of the training data based on the values of the available features to produce a good generalization. The algorithm can naturally handle binary or multiclass classification problems. The leaf nodes can refer to any of the K classes concerned.

Support vector machines

Support vector machines are based upon the idea of maximizing the margin, i.e., maximizing the minimum distance from the separating hyperplane to the nearest example. The basic SVM supports only binary classification, but extensions have been proposed to handle the multiclass classification case as well. In these extensions, additional parameters and constraints are added to the optimization problem to handle the separation of the different classes.

Multi expression programming

Multi expression programming (MEP) is an evolutionary algorithm for generating computer programs (which can also be used for classification tasks). MEP has a unique feature: it encodes multiple programs into a single chromosome. Each of these programs can be used to generate the output for a class, thus making MEP naturally suitable for solving multi-class classification problems.

Hierarchical classification

Hierarchical classification tackles the multi-class classification problem by dividing the output space into a tree. Each parent node is divided into multiple child nodes, and the process is continued until each child node represents only one class. Several methods have been proposed based on hierarchical classification.

Learning paradigms

Based on learning paradigms, the existing multi-class classification techniques can be classified into batch learning and online learning. Batch learning algorithms require all the data samples to be available beforehand. They train the model using the entire training data and then predict the test sample using the learned relationship. Online learning algorithms, on the other hand, incrementally build their models in sequential iterations. In iteration t, an online algorithm receives a sample x_t and predicts its label ŷ_t using the current model; the algorithm then receives y_t, the true label of x_t, and updates its model based on the sample-label pair (x_t, y_t). Recently, a new learning paradigm called the progressive learning technique has been developed. The progressive learning technique is capable not only of learning from new samples but also of learning new classes of data while retaining the knowledge learnt thus far.
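The online protocol described above (predict with the current model, observe the true label, update) can be sketched as follows. The model used here is a simple multiclass perceptron with one weight vector per class; it is chosen purely for illustration and is an assumption of this sketch, as are the class and function names and the toy data stream.

```python
class OnlineMulticlassPerceptron:
    """Keeps one weight vector per class; predicts the class with the highest score."""

    def __init__(self, n_features, classes):
        self.weights = {c: [0.0] * n_features for c in classes}

    def predict(self, x):
        def score(c):
            return sum(w * v for w, v in zip(self.weights[c], x))
        return max(self.weights, key=score)

    def update(self, x, y_true, y_pred):
        """On a mistake, move the true class toward x and the predicted class away."""
        if y_pred == y_true:
            return
        for i, v in enumerate(x):
            self.weights[y_true][i] += v
            self.weights[y_pred][i] -= v


def run_online(model, stream):
    """Predict-then-update loop; returns the number of mistakes made."""
    mistakes = 0
    for x_t, y_t in stream:
        y_hat_t = model.predict(x_t)      # predict with the current model
        mistakes += int(y_hat_t != y_t)   # the true label y_t is then revealed
        model.update(x_t, y_t, y_hat_t)   # update on the pair (x_t, y_t)
    return mistakes


# Toy stream (made up for illustration); the last feature is a constant bias term.
stream = [([2.0, 0.0, 1.0], "a"),
          ([0.0, 2.0, 1.0], "b"),
          ([-2.0, -2.0, 1.0], "c")] * 5
model = OnlineMulticlassPerceptron(n_features=3, classes=["a", "b", "c"])
mistakes = run_online(model, stream)
print(mistakes, model.predict([2.0, 0.0, 1.0]))
```

A batch learner would instead receive the whole of `stream` at once before making any prediction; here the model is usable, and is evaluated, after every single sample.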
