Linear Classification
In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the value of a linear combination of the characteristics. An object's characteristics are also known as feature values and are typically presented to the machine in a vector called a feature vector. Such classifiers work well for practical problems such as document classification, and more generally for problems with many variables (features), reaching accuracy levels comparable to non-linear classifiers while taking less time to train and use.


Definition

If the input feature vector to the classifier is a real vector \vec{x}, then the output score is

:y = f(\vec{w}\cdot\vec{x}) = f\left(\sum_j w_j x_j\right),

where \vec{w} is a real vector of weights and ''f'' is a function that converts the dot product of the two vectors into the desired output. (In other words, \vec{w} is a one-form or linear functional mapping \vec{x} onto '''R'''.) The weight vector \vec{w} is learned from a set of labeled training samples. Often ''f'' is a threshold function, which maps all values of \vec{w}\cdot\vec{x} above a certain threshold to the first class and all other values to the second class; e.g.,

: f(\mathbf{x}) = \begin{cases} 1 & \text{if } \mathbf{w}^\mathsf{T} \mathbf{x} > \theta, \\ 0 & \text{otherwise.} \end{cases}

The superscript T indicates the transpose and \theta is a scalar threshold. A more complex ''f'' might give the probability that an item belongs to a certain class.

For a two-class classification problem, one can visualize the operation of a linear classifier as splitting a high-dimensional input space with a hyperplane: all points on one side of the hyperplane are classified as "yes", while the others are classified as "no".

A linear classifier is often used in situations where the speed of classification is an issue, since it is often the fastest classifier, especially when \vec{x} is sparse. Also, linear classifiers often work very well when the number of dimensions in \vec{x} is large, as in document classification, where each element in \vec{x} is typically the number of occurrences of a word in a document (see document-term matrix). In such cases, the classifier should be well-regularized.
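As a concrete illustration, here is a minimal sketch of the thresholded decision rule above, using NumPy. The weight vector, threshold, and input vectors are made-up values chosen only for demonstration; in practice \vec{w} would be learned from labeled training samples.

```python
import numpy as np

def linear_classifier(x, w, theta):
    """Return 1 if the score w . x exceeds the threshold theta, else 0."""
    return 1 if np.dot(w, x) > theta else 0

# Hypothetical weights and threshold, for illustration only.
w = np.array([0.8, -0.4, 0.3])
theta = 0.5

print(linear_classifier(np.array([1.0, 0.2, 1.5]), w, theta))  # score 1.17 -> 1
print(linear_classifier(np.array([0.1, 1.0, 0.2]), w, theta))  # score -0.26 -> 0
```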


Generative models vs. discriminative models

There are two broad classes of methods for determining the parameters of a linear classifier \vec{w}: generative and discriminative models. Methods of the former model the joint probability distribution P(\text{class}, \vec{x}), whereas methods of the latter model the conditional density functions P(\text{class} \mid \vec{x}). Examples of generative algorithms include:

* Linear discriminant analysis (LDA): assumes Gaussian conditional density models.
* Naive Bayes classifier with multinomial or multivariate Bernoulli event models.

The second set of methods includes discriminative models, which attempt to maximize the quality of the output on a training set. Additional terms in the training cost function can easily perform regularization of the final model. Examples of discriminative training of linear classifiers include:

* Logistic regression: maximum likelihood estimation of \vec{w}, assuming that the observed training set was generated by a binomial model that depends on the output of the classifier.
* Perceptron: an algorithm that attempts to fix all errors encountered in the training set.
* Fisher's linear discriminant analysis: an algorithm (different from "LDA") that maximizes the ratio of between-class scatter to within-class scatter, without any other assumptions. It is in essence a method of dimensionality reduction for binary classification.
* Support vector machine: an algorithm that maximizes the margin between the decision hyperplane and the examples in the training set.

Note: Despite its name, LDA does not belong to the class of discriminative models in this taxonomy. However, its name makes sense when we compare LDA to the other main linear dimensionality reduction algorithm: principal components analysis (PCA). LDA is a supervised learning algorithm that utilizes the labels of the data, while PCA is an unsupervised learning algorithm that ignores the labels. To summarize, the name is a historical artifact.

Discriminative training often yields higher accuracy than modeling the conditional density functions. However, handling missing data is often easier with conditional density models. All of the linear classifier algorithms listed above can be converted into non-linear algorithms operating on a different input space \varphi(\vec{x}), using the kernel trick.
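To make the contrast concrete, the following is a minimal sketch, assuming scikit-learn is available, that fits one generative model (LDA) and one discriminative model (logistic regression) to the same synthetic data. The dataset parameters are arbitrary illustrative choices; both models produce linear decision boundaries.

```python
from sklearn.datasets import make_classification
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Arbitrary synthetic two-class problem, for illustration only.
X, y = make_classification(n_samples=1000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Generative: models class-conditional Gaussian densities, then classifies.
lda = LinearDiscriminantAnalysis().fit(X_train, y_train)

# Discriminative: directly fits P(class | x) by maximum likelihood.
logreg = LogisticRegression(max_iter=1000).fit(X_train, y_train)

print("LDA accuracy:   ", lda.score(X_test, y_test))
print("LogReg accuracy:", logreg.score(X_test, y_test))
```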


Discriminative training

Discriminative training of linear classifiers usually proceeds in a supervised way, by means of an optimization algorithm that is given a training set with desired outputs and a loss function that measures the discrepancy between the classifier's outputs and the desired outputs. Thus, the learning algorithm solves an optimization problem of the form

:\underset{\mathbf{w}}{\arg\min} \; R(\mathbf{w}) + C \sum_{i=1}^N L(y_i, \mathbf{w}^\mathsf{T} \mathbf{x}_i)

where

* \mathbf{w} is a vector of classifier parameters,
* L(y_i, \mathbf{w}^\mathsf{T} \mathbf{x}_i) is a loss function that measures the discrepancy between the classifier's prediction and the true output y_i for the i'th training example,
* R(\mathbf{w}) is a regularization function that prevents the parameters from getting too large (causing overfitting), and
* C is a scalar constant (set by the user of the learning algorithm) that controls the balance between the regularization and the loss function.

Popular loss functions include the hinge loss (for linear SVMs) and the log loss (for linear logistic regression). If the regularization function R(\mathbf{w}) is convex, then the above is a convex problem. Many algorithms exist for solving such problems; popular ones for linear classification include (stochastic) gradient descent, L-BFGS, coordinate descent and Newton methods.
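For illustration, the following sketch minimizes the regularized hinge-loss objective above with plain (sub)gradient descent in NumPy. The learning rate, regularization constant, and synthetic data are assumed values chosen for demonstration; a production implementation would use one of the tuned solvers named above.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic, roughly linearly separable data; labels in {-1, +1}.
n, d = 200, 5
X = rng.normal(size=(n, d))
w_true = rng.normal(size=d)
y = np.sign(X @ w_true)

# Minimize R(w) + C * sum_i L(y_i, w^T x_i), with
# R(w) = 0.5 * ||w||^2 and L the hinge loss max(0, 1 - y_i * w^T x_i).
C, lr, epochs = 1.0, 0.01, 100
w = np.zeros(d)
for _ in range(epochs):
    margins = y * (X @ w)
    active = margins < 1  # examples currently violating the margin
    # Subgradient of the regularized objective at the current w.
    grad = w - C * (y[active, None] * X[active]).sum(axis=0)
    w -= lr * grad

print(f"training accuracy: {np.mean(np.sign(X @ w) == y):.3f}")
```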


See also

* Backpropagation
* Linear regression
* Perceptron
* Quadratic classifier
* Support vector machines
* Winnow (algorithm)


Notes

# R.O. Duda, P.E. Hart, D.G. Stork, "Pattern Classification", Wiley, (2001).


Further reading

# Y. Yang, X. Liu, "A re-examination of text categorization", Proc. ACM SIGIR Conference, pp. 42–49, (1999).
# R. Herbrich, "Learning Kernel Classifiers: Theory and Algorithms", MIT Press, (2001).