HOME

TheInfoList



OR:

In the field of
machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...
, the goal of
statistical classification In statistics, classification is the problem of identifying which of a set of categories (sub-populations) an observation (or observations) belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagn ...
is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a
feature vector In machine learning and pattern recognition, a feature is an individual measurable property or characteristic of a phenomenon. Choosing informative, discriminating and independent features is a crucial element of effective algorithms in pattern r ...
. Such classifiers work well for practical problems such as
document classification Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") ...
, and more generally for problems with many variables ( features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.


Definition

If the input feature vector to the classifier is a
real Real may refer to: Currencies * Brazilian real (R$) * Central American Republic real * Mexican real * Portuguese real * Spanish real * Spanish colonial real Music Albums * ''Real'' (L'Arc-en-Ciel album) (2000) * ''Real'' (Bright album) (2010) ...
vector \vec x, then the output score is :y = f(\vec\cdot\vec) = f\left(\sum_j w_j x_j\right), where \vec w is a real vector of weights and ''f'' is a function that converts the
dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an alge ...
of the two vectors into the desired output. (In other words, \vec is a
one-form In differential geometry, a one-form on a differentiable manifold is a smooth section of the cotangent bundle. Equivalently, a one-form on a manifold M is a smooth mapping of the total space of the tangent bundle of M to \R whose restriction to e ...
or
linear functional In mathematics, a linear form (also known as a linear functional, a one-form, or a covector) is a linear map from a vector space to its field of scalars (often, the real numbers or the complex numbers). If is a vector space over a field , the ...
mapping \vec x onto R.) The weight vector \vec w is learned from a set of labeled training samples. Often ''f'' is a threshold function, which maps all values of \vec\cdot\vec above a certain threshold to the first class and all other values to the second class; e.g., : f(\mathbf) = \begin1 & \text\ \mathbf^T \cdot \mathbf > \theta,\\0 & \text\end The superscript T indicates the transpose and \theta is a scalar threshold. A more complex ''f'' might give the probability that an item belongs to a certain class. For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no". A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when \vec x is sparse. Also, linear classifiers often work very well when the number of dimensions in \vec x is large, as in
document classification Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") ...
, where each element in \vec x is typically the number of occurrences of a word in a document (see
document-term matrix A document-term matrix is a mathematical matrix that describes the frequency of terms that occur in a collection of documents. In a document-term matrix, rows correspond to documents in the collection and columns correspond to terms. This matrix i ...
). In such cases, the classifier should be well- regularized.


Generative models vs. discriminative models

There are two broad classes of methods for determining the parameters of a linear classifier \vec w. They can be generative and discriminative models. Methods of the former model joint probability distribution, whereas methods of the latter model conditional density functions P(, \vec x). Examples of such algorithms include: * Linear Discriminant Analysis (LDA)—assumes Gaussian conditional density models * Naive Bayes classifier with multinomial or multivariate Bernoulli event models. The second set of methods includes discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include: *
Logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression a ...
—maximum likelihood estimation of \vec w assuming that the observed training set was generated by a binomial model that depends on the output of the classifier. *
Perceptron In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belon ...
—an algorithm that attempts to fix all errors encountered in the training set * Fisher's Linear Discriminant Analysis—an algorithm (different than "LDA") that maximizes the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of dimensionality reduction for binary classification. * Support vector machine—an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set. Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear
dimensionality reduction Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally ...
algorithm:
principal components analysis Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and ...
(PCA). LDA is a
supervised learning Supervised learning (SL) is a machine learning paradigm for problems where the available data consists of labelled examples, meaning that each data point contains features (covariates) and an associated label. The goal of supervised learning alg ...
algorithm that utilizes the labels of the data, while PCA is an
unsupervised learning Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a concise representation of its world and t ...
algorithm that ignores the labels. To summarize, the name is a historical artifact.R.O. Duda, P.E. Hart, D.G. Stork, "Pattern Classification", Wiley, (2001). Discriminative training often yields higher accuracy than modeling the conditional density functions. However, handling missing data is often easier with conditional density models. All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space \varphi(\vec x), using the
kernel trick In machine learning, kernel machines are a class of algorithms for pattern analysis, whose best known member is the support-vector machine (SVM). The general task of pattern analysis is to find and study general types of relations (for example ...
.


Discriminative training

Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an
optimization algorithm Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form :\underset \;R(\mathbf) + C \sum_^N L(y_i, \mathbf^\mathsf \mathbf_i) where * is a vector of classifier parameters, * is a loss function that measures the discrepancy between the classifier's prediction and the true output for the 'th training example, * is a regularization function that prevents the parameters from getting too large (causing overfitting), and * is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularization and the loss function. Popular loss functions include the
hinge loss In machine learning, the hinge loss is a loss function used for training classifiers. The hinge loss is used for "maximum-margin" classification, most notably for support vector machines (SVMs). For an intended output and a classifier score , th ...
(for linear SVMs) and the
log loss In information theory, the cross-entropy between two probability distributions p and q over the same underlying set of events measures the average number of bits needed to identify an event drawn from the set if a coding scheme used for the set is ...
(for linear logistic regression). If the regularization function is
convex Convex or convexity may refer to: Science and technology * Convex lens, in optics Mathematics * Convex set, containing the whole line segment that joins points ** Convex polygon, a polygon which encloses a convex set of points ** Convex polytop ...
, then the above is a convex problem. Many algorithms exist for solving such problems; popular ones for linear classification include ( stochastic)
gradient descent In mathematics, gradient descent (also often called steepest descent) is a first-order iterative optimization algorithm for finding a local minimum of a differentiable function. The idea is to take repeated steps in the opposite direction of the ...
,
L-BFGS Limited-memory BFGS (L-BFGS or LM-BFGS) is an optimization algorithm in the family of quasi-Newton methods that approximates the Broyden–Fletcher–Goldfarb–Shanno algorithm (BFGS) using a limited amount of computer memory. It is a popular algo ...
, coordinate descent and
Newton method In numerical analysis, Newton's method, also known as the Newton–Raphson method, named after Isaac Newton and Joseph Raphson, is a root-finding algorithm which produces successively better approximations to the roots (or zeroes) of a real-va ...
s.


See also

*
Backpropagation In machine learning, backpropagation (backprop, BP) is a widely used algorithm for training feedforward artificial neural networks. Generalizations of backpropagation exist for other artificial neural networks (ANNs), and for functions gener ...
* Linear regression *
Perceptron In machine learning, the perceptron (or McCulloch-Pitts neuron) is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belon ...
* Quadratic classifier * Support vector machines *
Winnow (algorithm) The winnow algorithm Nick Littlestone (1988). "Learning Quickly When Irrelevant Attributes Abound: A New Linear-threshold Algorithm" ''Machine Learning'' 285–318(2) is a technique from machine learning for learning a linear classifier from labeled ...


Notes


Further reading

# Y. Yang, X. Liu, "A re-examination of text categorization", Proc. ACM SIGIR Conference, pp. 42–49, (1999)
paper @ citeseer
# R. Herbrich, "Learning Kernel Classifiers: Theory and Algorithms," MIT Press, (2001). {{DEFAULTSORT:Linear Classifier Classification algorithms Statistical classification