In machine learning, a linear classifier makes a classification decision for each object based on a linear combination of its features. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.

Definition

If the input feature vector to the classifier is a real vector $\vec x$, then the output score is

$$y = f(\vec w \cdot \vec x) = f\left(\sum_j w_j x_j\right),$$

where $\vec w$ is a real vector of weights and $f$ is a function that converts the dot product of the two vectors into the desired output. (In other words, $\vec w$ is a one-form or linear functional mapping $\vec x$ onto $\mathbb{R}$.) The weight vector $\vec w$ is learned from a set of labeled training samples. Often $f$ is a threshold function, which maps all values of $\vec w \cdot \vec x$ above a certain threshold to the first class and all other values to the second class; e.g.,

$$f(\mathbf{x}) = \begin{cases}1 & \text{if }\ \mathbf{w}^T \cdot \mathbf{x} > \theta,\\0 & \text{otherwise}\end{cases}$$

The superscript $T$ indicates the transpose and $\theta$ is a scalar threshold. A more complex $f$ might give the probability that an item belongs to a certain class.

For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no".

A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when $\vec x$ is sparse. Also, linear classifiers often work very well when the number of dimensions in $\vec x$ is large, as in document classification, where each element in $\vec x$ is typically the number of occurrences of a word in a document (see document-term matrix). In such cases, the classifier should be well-regularized.
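As a minimal sketch of the definition above, the following Python computes the score $\vec w \cdot \vec x$ for a sparse bag-of-words vector and applies the threshold function. The weights, the example document, and the helper names `score` and `classify` are made up for illustration; they are not taken from any particular library.

```python
from collections import Counter

def score(weights, features):
    """Dot product w . x, visiting only the non-zero entries of x."""
    return sum(weights.get(term, 0.0) * count for term, count in features.items())

def classify(weights, features, theta=0.0):
    """Threshold function f: 1 if w . x > theta, otherwise 0."""
    return 1 if score(weights, features) > theta else 0

# Hypothetical weights, as if learned from labeled training documents.
weights = {"goal": 1.5, "match": 1.0, "election": -2.0, "vote": -1.5}

# Sparse feature vector: word counts of a single document.
doc = Counter("the match ended with a late goal and another goal".split())

print(score(weights, doc))     # 1.0 * 1 + 1.5 * 2 = 4.0
print(classify(weights, doc))  # 1
```

Because the sum runs only over the words that actually occur in the document, the cost of classifying depends on the number of non-zero features rather than on the vocabulary size, which is one way to see why sparse inputs are fast to classify.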

Generative models vs. discriminative models

There are two broad classes of methods for determining the parameters $\vec w$ of a linear classifier: generative and discriminative models. Methods of the former class model the joint probability distribution, whereas methods of the latter class model the conditional density functions $P(\text{class} \mid \vec x)$. Examples of generative algorithms include:

* Linear Discriminant Analysis (LDA): assumes Gaussian conditional density models
* Naive Bayes classifier with multinomial or multivariate Bernoulli event models

The second set of methods includes discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include:

* Logistic regression: maximum likelihood estimation of $\vec w$ assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
* Perceptron: an algorithm that attempts to fix all errors encountered in the training set.
* Fisher's Linear Discriminant Analysis: an algorithm (different from "LDA") that maximizes the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of dimensionality reduction for binary classification.
* Support vector machine: an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set.

Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm, principal components analysis (PCA). LDA is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact.

Discriminative training often yields higher accuracy than modeling the class-conditional density functions, as generative methods do. However, handling missing data is often easier with such generative models.

All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space $\varphi(\vec x)$, using the kernel trick.
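To make one of the discriminative methods concrete, here is a hedged sketch of the perceptron update rule in plain Python. It assumes labels in {-1, +1} and adds a bias term `b`, which plays the role of the threshold ($b = -\theta$); the function names and the toy data are hypothetical and chosen only for illustration.

```python
def predict(w, b, x):
    """Return +1 or -1 depending on the sign of w . x + b."""
    s = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if s > 0 else -1

def train_perceptron(samples, epochs=20):
    """samples: list of (x, y) with x a list of floats and y in {-1, +1}."""
    n = len(samples[0][0])
    w, b = [0.0] * n, 0.0
    for _ in range(epochs):
        mistakes = 0
        for x, y in samples:
            if predict(w, b, x) != y:
                # Move the hyperplane toward the misclassified example.
                w = [wj + y * xj for wj, xj in zip(w, x)]
                b += y
                mistakes += 1
        if mistakes == 0:  # every training example is classified correctly
            break
    return w, b

# Toy, linearly separable data (made up for illustration).
data = [([2.0, 1.0], 1), ([3.0, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w, b = train_perceptron(data)
print(w, b)
print([predict(w, b, x) for x, _ in data])
```

The loop only changes the weights when an example is misclassified, which is the sense in which the perceptron "fixes errors"; on data that is not linearly separable it would simply stop after `epochs` passes without reaching zero mistakes.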

Discriminative training

Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form

$$\underset{\mathbf{w}}{\arg\min}\; R(\mathbf{w}) + C \sum_{i=1}^N L(y_i, \mathbf{w}^\mathsf{T} \mathbf{x}_i)$$

where
* $\mathbf{w}$ is a vector of classifier parameters,
* $L(y_i, \mathbf{w}^\mathsf{T} \mathbf{x}_i)$ is a loss function that measures the discrepancy between the classifier's prediction and the true output $y_i$ for the $i$-th training example,
* $R(\mathbf{w})$ is a regularization function that prevents the parameters from getting too large (causing overfitting), and
* $C$ is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularization and the loss function.

Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function $R$ is convex, then the above is a convex problem, since these loss functions are themselves convex. Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent and Newton methods.
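The optimization problem above can be sketched with plain batch gradient descent. The example below is only an illustration under stated assumptions: it picks $R(\mathbf{w}) = \tfrac{1}{2}\lVert\mathbf{w}\rVert^2$, the log loss $L(y, s) = \log(1 + e^{-ys})$ with labels in {-1, +1}, a fixed step size, and made-up toy data. The function names are hypothetical.

```python
import math

def objective(w, samples, C):
    """R(w) + C * sum_i L(y_i, w . x_i) with R = 0.5*||w||^2 and log loss."""
    reg = 0.5 * sum(wj * wj for wj in w)
    loss = 0.0
    for x, y in samples:
        s = sum(wj * xj for wj, xj in zip(w, x))
        loss += math.log1p(math.exp(-y * s))
    return reg + C * loss

def gradient(w, samples, C):
    """Gradient of the objective with respect to w."""
    grad = list(w)  # the gradient of 0.5*||w||^2 is w itself
    for x, y in samples:
        s = sum(wj * xj for wj, xj in zip(w, x))
        coeff = -y / (1.0 + math.exp(y * s))  # derivative of the log loss in s
        for j, xj in enumerate(x):
            grad[j] += C * coeff * xj
    return grad

def gradient_descent(samples, C=1.0, step=0.1, iterations=200):
    """Fixed-step batch gradient descent starting from w = 0."""
    w = [0.0] * len(samples[0][0])
    for _ in range(iterations):
        g = gradient(w, samples, C)
        w = [wj - step * gj for wj, gj in zip(w, g)]
    return w

# Toy data, made up for illustration.
data = [([2.0, 1.0], 1), ([3.0, 2.0], 1), ([-1.0, -1.5], -1), ([-2.0, -0.5], -1)]
w = gradient_descent(data)
print(objective([0.0, 0.0], data, 1.0), objective(w, data, 1.0))
```

A larger `C` puts more weight on fitting the training examples, while a smaller `C` lets the regularizer keep the weights small. Swapping in the hinge loss would turn the same objective into that of a linear SVM, although the hinge loss is not differentiable everywhere and would call for a subgradient method instead.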
