In
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
, multi-label classification or multi-output classification is a variant of the
classification Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood.
Classification is the grouping of related facts into classes.
It may also refer to:
Business, organizat ...
problem where multiple nonexclusive labels may be assigned to each instance. Multi-label classification is a generalization of
multiclass classification
In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary c ...
, which is the single-label problem of categorizing instances into precisely one of several (more than two) classes. In the multi-label problem the labels are nonexclusive and there is no constraint on how many of the classes the instance can be assigned to.
Formally, multi-label classification is the problem of finding a model that maps inputs x to binary vectors y; that is, it assigns a value of 0 or 1 for each element (label) in y.
Problem transformation methods
Several problem transformation methods exist for multi-label classification, and can be roughly broken down into:
* Transformation into
binary classification
Binary classification is the task of classifying the elements of a set into two groups (each called ''class'') on the basis of a classification rule. Typical binary classification problems include:
* Medical testing to determine if a patient has c ...
problems: the baseline approach, called the ''binary relevance'' method,
[Jesse Read, Bernhard Pfahringer, Geoff Holmes, Eibe Frank]
Classifier Chains for Multi-label Classification
Machine Learning Journal. Springer. Vol. 85(3), (2011). amounts to independently training one binary classifier for each label. Given an unseen sample, the combined model then predicts all labels for this sample for which the respective classifiers predict a positive result. Although this method of dividing the task into multiple binary tasks may resemble superficially the one-vs.-all (OvA) and one-vs.-rest (OvR) methods for
multiclass classification
In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary c ...
, it is essentially different from both, because a single classifier under binary relevance deals with a single label, without any regard to other labels whatsoever. A
classifier chain is an alternative method for transforming a multi-label classification problem into several binary classification problems. It differs from binary relevance in that labels are predicted sequentially, and the output of all previous classifiers (i.e. positive or negative for a particular label) are input as features to subsequent classifiers.
Classifier chains have been applied, for instance, in
HIV
The human immunodeficiency viruses (HIV) are two species of ''Lentivirus'' (a subgroup of retrovirus) that infect humans. Over time, they cause acquired immunodeficiency syndrome (AIDS), a condition in which progressive failure of the immune ...
drug resistance prediction.
Bayesian network
A Bayesian network (also known as a Bayes network, Bayes net, belief network, or decision network) is a probabilistic graphical model that represents a set of variables and their conditional dependencies via a directed acyclic graph (DAG). Bay ...
has also been applied to optimally order classifiers in
Classifier chains.
* Transformation into
multi-class classification
In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary c ...
problem: The label powerset (LP) transformation creates one binary classifier for every label combination present in the training set. For example, if possible labels for an example were A, B, and C, the label powerset representation of this problem is a multi-class classification problem with the classes
0 0 0 0 1 0 0 1 1 0 0 1 1 1 1 1where for example
0 1denotes an example where labels A and C are present and label B is absent.
*
Ensemble methods: A set of multi-class classifiers can be used to create a multi-label ensemble classifier. For a given example, each classifier outputs a single class (corresponding to a single label in the multi-label problem). These predictions are then combined by an ensemble method, usually a voting scheme where every class that receives a requisite percentage of votes from individual classifiers (often referred to as the discrimination threshold) is predicted as a present label in the multi-label output. However, more complex ensemble methods exist, such as
committee machine A committee machine is a type of artificial neural network using a divide and conquer strategy in which the responses of multiple neural networks (experts) are combined into a single response.HAYKIN, S. Neural Networks - A Comprehensive Foundation. ...
s. Another variation is the random -labelsets (RAKEL) algorithm, which uses multiple LP classifiers, each trained on a random subset of the actual labels; label prediction is then carried out by a voting scheme. A set of multi-label classifiers can be used in a similar way to create a multi-label ensemble classifier. In this case, each classifier votes once for each label it predicts rather than for a single label.
Adapted algorithms
Some classification algorithms/models have been adapted to the multi-label task, without requiring problem transformations. Examples of these including for multi-label data.
*
k-nearest neighbors
In statistics, the ''k''-nearest neighbors algorithm (''k''-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and reg ...
: the ML-kNN algorithm extends the k-NN classifier to multi-label data.
*
decision trees
A decision tree is a decision support tool that uses a tree-like model of decisions and their possible consequences, including chance event outcomes, resource costs, and utility. It is one way to display an algorithm that only contains condit ...
: "Clare" is an adapted C4.5 algorithm for multi-label classification; the modification involves the entropy calculations.
MMC, MMDT, and SSC refined MMDT, can classify multi-labeled data based on multi-valued attributes without transforming the attributes into single-values. They are also named multi-valued and multi-labeled decision tree classification methods.
*
kernel methods for vector output Kernel methods are a well-established tool to analyze the relationship between input data and the corresponding output of a function. Kernels encapsulate the properties of functions in a Kernel trick, computationally efficient way and allow algorith ...
*
neural networks
A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological ...
: BP-MLL is an adaptation of the popular back-propagation algorithm for multi-label learning.
Learning paradigms
Based on learning paradigms, the existing multi-label classification techniques can be classified into batch learning and
online machine learning
In computer science, online machine learning is a method of machine learning in which data becomes available in a sequential order and is used to update the best predictor for future data at each step, as opposed to batch learning techniques whic ...
. Batch learning algorithms require all the data samples to be available beforehand. It trains the model using the entire training data and then predicts the test sample using the found relationship. The online learning algorithms, on the other hand, incrementally build their models in sequential iterations. In iteration t, an online algorithm receives a sample, x
t and predicts its label(s) ŷ
t using the current model; the algorithm then receives y
t, the true label(s) of x
t and updates its model based on the sample-label pair: (x
t, y
t).
Multi-label stream classification
Data stream
In connection-oriented communication, a data stream is the transmission of a sequence of digitally encoded coherent signals to convey information. Typically, the transmitted symbols are grouped into a series of packets.
Data streaming has bec ...
s are possibly infinite sequences of data that continuously and rapidly grow over time. Multi-label stream classification (MLSC) is the version of multi-label classification task that takes place in data streams. It is sometimes also called online multi-label classification. The difficulties of multi-label classification (exponential number of possible label sets, capturing dependencies between labels) are combined with difficulties of data streams (time and memory constraints, addressing infinite stream with finite means,
concept drift
In predictive analytics and machine learning, concept drift means that the statistical properties of the target variable, which the model is trying to predict, change over time in unforeseen ways. This causes problems because the predictions become ...
s).
Many MLSC methods resort to
ensemble methods in order to increase their predictive performance and deal with concept drifts. Below are the most widely used ensemble methods in the literature:
* Online Bagging (OzaBagging)-based methods: Observing the probability of having K many of a certain data point in a bootstrap sample is approximately ''Poisson(1)'' for big datasets, each incoming data instance in a data stream can be weighted proportional to Poisson(1) distribution to mimic bootstrapping in an online setting. This is called Online Bagging (OzaBagging). Many multi-label methods that use Online Bagging are proposed in the literature, each of which utilizes different problem transformation methods. EBR,
ECC,
EPS, E
BRT,
E
BMT,
ML-Random Rules are examples of such methods.
* ADWIN Bagging
-based methods: Online Bagging methods for MLSC are sometimes combined with explicit concept drift detection mechanisms such as ADWIN (Adaptive Window). ADWIN keeps a variable-sized window to detect changes in the distribution of the data, and improves the ensemble by resetting the components that perform poorly when there is a drift in the incoming data. Generally, the letter 'a' is used as a subscript in the name of such ensembles to indicate the usage of ADWIN change detector. E
aBR,
E
aCC,
E
aHT
PS are examples of such multi-label ensembles.
* GOOWE-ML
-based methods: Interpreting the relevance scores of each component of the ensemble as vectors in the label space and solving a least squares problem at the end of each batch, Geometrically-Optimum Online-Weighted Ensemble for Multi-label Classification (GOOWE-ML) is proposed. The ensemble tries to minimize the distance between the weighted prediction of its components and the ground truth vector for each instance over a batch. Unlike Online Bagging and ADWIN Bagging, GOOWE-ML utilizes a ''weighted'' voting scheme where better performing components of the ensemble are given more weight. The GOOWE-ML ensemble grows over time, and the lowest weight component is replaced by a new component when it is full at the end of a batch. GOBR,
GOCC,
GOPS,
GORT
are the proposed GOOWE-ML-based multi-label ensembles.
* Multiple Windows : Here, BR models that use a sliding window are replaced with two windows for each label, one for relevant and one for non-relevant examples. Instances are oversampled or undersampled according to a load factor that is kept between these two windows. This allows concept drifts that are independent for each label to be detected, and class-imbalance (skewness in the relevant and non-relevant examples) to be handled.
Statistics and evaluation metrics
Considering
to be a set of labels for
data sample (do not confuse it with a one-hot vector; it is simply a collection of all of the labels that belong to this sample), the extent to which a dataset is multi-label can be captured in two statistics:
* Label cardinality is the average number of labels per example in the set:
where
is the total number of data samples;
* Label density is the number of labels per sample divided by the total number of labels, averaged over the samples:
where
, the total number of available classes (which is the maximum number of elements that can make up
).
Evaluation metrics for multi-label classification performance are inherently different from those used in multi-class (or binary) classification, due to the inherent differences of the classification problem. If denotes the true set of labels for a given sample, and the predicted set of labels, then the following metrics can be defined on that sample:
*
Hamming Hamming may refer to:
* Richard Hamming (1915–1998), American mathematician
* Hamming(7,4), in coding theory, a linear error-correcting code
* Overacting, or acting in an exaggerated way
See also
* Hamming code, error correction in telecom ...
loss: the fraction of the wrong labels to the total number of labels, i.e.
, where
is the target,
is the prediction, and
is the
"Exclusive, or" operator that returns zero when the target and prediction are identical and one otherwise. This is a
loss function
In mathematical optimization and decision theory, a loss function or cost function (sometimes also called an error function) is a function that maps an event or values of one or more variables onto a real number intuitively representing some "cost ...
, so the optimal value is zero and its upper bound is one.
* The closely related
Jaccard index
The Jaccard index, also known as the Jaccard similarity coefficient, is a statistic used for gauging the similarity and diversity of sample sets. It was developed by Grove Karl Gilbert in 1884 as his ratio of verification (v) and now is freque ...
, also called Intersection over Union in the multi-label setting, is defined as the number of correctly predicted labels divided by the union of predicted and true labels,
, where
and
are sets of predicted labels and true labels respectively.
*
Precision, recall and
score: precision is
, recall is
, and
is their
harmonic mean
In mathematics, the harmonic mean is one of several kinds of average, and in particular, one of the Pythagorean means. It is sometimes appropriate for situations when the average rate is desired.
The harmonic mean can be expressed as the recipro ...
.
* Exact match (also called Subset accuracy): is the most strict metric, indicating the percentage of samples that have all their labels classified correctly.
Cross-validation in multi-label settings is complicated by the fact that the ordinary (binary/multiclass) way of
stratified sampling
In statistics, stratified sampling is a method of sampling from a population which can be partitioned into subpopulations.
In statistical surveys, when subpopulations within an overall population vary, it could be advantageous to sample each s ...
will not work; alternative ways of approximate stratified sampling have been suggested.
Implementations and datasets
Java implementations of multi-label algorithms are available in th
Mulanan
Mekasoftware packages, both based on
Weka
The weka, also known as the Māori hen or woodhen (''Gallirallus australis'') is a flightless bird species of the rail family. It is endemic to New Zealand. It is the only extant member of the genus ''Gallirallus''. Four subspecies are recognize ...
.
The
scikit-learn
scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms including support-vector ...
Python package implements som
multi-labels algorithms and metrics
Th
scikit-multilearnPython package specifically caters to the multi-label classification. It provides multi-label implementation of several well-known techniques including SVM, kNN an
The package is built on top of
scikit-learn
scikit-learn (formerly scikits.learn and also known as sklearn) is a free software machine learning library for the Python programming language.
It features various classification, regression and clustering algorithms including support-vector ...
ecosystem.
The binary relevance method, classifier chains and other multilabel algorithms with a lot of different base learners are implemented in the R-packag
mlrref name="multiR">Philipp Probst, Quay Au, Giuseppe Casalicchio, Clemens Stachl, Bernd Bischl
The R Journal (2017) 9:1, pages 352-369.
A list of commonly used multi-label data-sets is available at th
See also
*
Multiclass classification
In machine learning and statistical classification, multiclass classification or multinomial classification is the problem of classifying instances into one of three or more classes (classifying instances into one of two classes is called binary c ...
*
Multiple-instance learning
*
Structured prediction
Structured prediction or structured (output) learning is an umbrella term for supervised machine learning techniques that involves predicting structured objects, rather than scalar discrete or real values.
Similar to commonly used supervised l ...
*
Life-time of correlation
References
Further reading
* {{cite journal , first1=Gjorgji , last1=Madjarov , first2=Dragi , last2=Kocev , first3=Dejan , last3=Gjorgjevikj , first4=Sašo , last4=Džeroski , title=An extensive experimental comparison of methods for multi-label learning , journal=Pattern Recognition , volume=45 , issue=9 , year=2012 , doi=10.1016/j.patcog.2012.03.004 , pages=3084–3104
Classification algorithms