statistics Statistics (from German: '' Statistik'', "description of a state, a country") is the discipline that concerns the collection, organization, analysis, interpretation, and presentation of data. In applying statistics to a scientific, indust ...

, classification is the problem of identifying which of a set of categories (sub-populations) an

observation Observation is the active acquisition of information from a primary source. In living beings, observation employs the senses. In science, observation can also involve the perception and recording of data via the use of scientific instruments. The ...

(or observations) belongs to. Examples are assigning a given email to the "spam" or "non-spam" class, and assigning a diagnosis to a given patient based on observed characteristics of the patient (sex, blood pressure, presence or absence of certain symptoms, etc.). Often, the individual observations are analyzed into a set of quantifiable properties, known variously as explanatory variables or ''features''. These properties may variously be categorical (e.g. "A", "B", "AB" or "O", for

blood type A blood type (also known as a blood group) is a classification of blood, based on the presence and absence of antibodies and inherited antigenic substances on the surface of red blood cells (RBCs). These antigens may be proteins, carbohydrates ...

), ordinal (e.g. "large", "medium" or "small"), integer-valued (e.g. the number of occurrences of a particular word in an

email Electronic mail (email or e-mail) is a method of exchanging messages ("mail") between people using electronic devices. Email was thus conceived as the electronic ( digital) version of, or counterpart to, mail, at a time when "mail" mean ...

) or real-valued (e.g. a measurement of

blood pressure Blood pressure (BP) is the pressure of circulating blood against the walls of blood vessels. Most of this pressure results from the heart pumping blood through the circulatory system. When used without qualification, the term "blood pressure ...

). Other classifiers work by comparing observations to previous observations by means of a similarity or

distance Distance is a numerical or occasionally qualitative measurement of how far apart objects or points are. In physics or everyday usage, distance may refer to a physical length or an estimation based on other criteria (e.g. "two counties over"). ...

function. An

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...

that implements classification, especially in a concrete implementation, is known as a classifier. The term "classifier" sometimes also refers to the mathematical function, implemented by a classification algorithm, that maps input data to a category. Terminology across fields is quite varied. In

, where classification is often done with

logistic regression In statistics, the logistic model (or logit model) is a statistical model that models the probability of an event taking place by having the log-odds for the event be a linear combination of one or more independent variables. In regression an ...

or a similar procedure, the properties of observations are termed

explanatory variable Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or deman ...

s (or

independent variable Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or dema ...

s, regressors, etc.), and the categories to be predicted are known as outcomes, which are considered to be possible values of the

dependent variable Dependent and independent variables are variables in mathematical modeling, statistical modeling and experimental sciences. Dependent variables receive this name because, in an experiment, their values are studied under the supposition or dema ...

. In

machine learning Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence. Machine ...

, the observations are often known as ''instances'', the explanatory variables are termed ''features'' (grouped into a feature vector), and the possible categories to be predicted are ''classes''. Other fields may use different terminology: e.g. in community ecology, the term "classification" normally refers to

cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...

Relation to other problems

Classification Classification is a process related to categorization, the process in which ideas and objects are recognized, differentiated and understood. Classification is the grouping of related facts into classes. It may also refer to: Business, organizat ...

and clustering are examples of the more general problem of

pattern recognition Pattern recognition is the automated recognition of patterns and regularities in data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics ...

, which is the assignment of some sort of output value to a given input value. Other examples are regression, which assigns a real-valued output to each input;

sequence labeling In machine learning, sequence labeling is a type of pattern recognition task that involves the algorithmic assignment of a categorical label to each member of a sequence of observed values. A common example of a sequence labeling task is part of ...

, which assigns a class to each member of a sequence of values (for example,

part of speech tagging In corpus linguistics, part-of-speech tagging (POS tagging or PoS tagging or POST), also called grammatical tagging is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definiti ...

, which assigns a

part of speech In grammar, a part of speech or part-of-speech (abbreviated as POS or PoS, also known as word class or grammatical category) is a category of words (or, more generally, of lexical items) that have similar grammatical properties. Words that are as ...

to each word in an input sentence);

parsing Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term ''parsing'' comes from ...

, which assigns a

parse tree A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term ''parse tree'' itself is used primarily in co ...

to an input sentence, describing the

syntactic structure In linguistics, syntax () is the study of how words and morphemes combine to form larger units such as phrases and sentences. Central concerns of syntax include word order, grammatical relations, hierarchical sentence structure ( constituenc ...

of the sentence; etc. A common subclass of classification is probabilistic classification. Algorithms of this nature use

statistical inference Statistical inference is the process of using data analysis to infer properties of an underlying distribution of probability.Upton, G., Cook, I. (2008) ''Oxford Dictionary of Statistics'', OUP. . Inferential statistical analysis infers properti ...

to find the best class for a given instance. Unlike other algorithms, which simply output a "best" class, probabilistic algorithms output a

probability Probability is the branch of mathematics concerning numerical descriptions of how likely an event is to occur, or how likely it is that a proposition is true. The probability of an event is a number between 0 and 1, where, roughly speaking, ...

of the instance being a member of each of the possible classes. The best class is normally then selected as the one with the highest probability. However, such an algorithm has numerous advantages over non-probabilistic classifiers: *It can output a confidence value associated with its choice (in general, a classifier that can do this is known as a ''confidence-weighted classifier''). *Correspondingly, it can ''abstain'' when its confidence of choosing any particular output is too low. *Because of the probabilities which are generated, probabilistic classifiers can be more effectively incorporated into larger machine-learning tasks, in a way that partially or completely avoids the problem of ''error propagation''.

Frequentist procedures

Early work on statistical classification was undertaken by

Fisher Fisher is an archaic term for a fisherman, revived as gender-neutral. Fisher, Fishers or The Fisher may also refer to: Places Australia *Division of Fisher, an electoral district in the Australian House of Representatives, in Queensland *Elect ...

, in the context of two-group problems, leading to Fisher's linear discriminant function as the rule for assigning a group to a new observation.Gnanadesikan, R. (1977) ''Methods for Statistical Data Analysis of Multivariate Observations'', Wiley. (p. 83–86) This early work assumed that data-values within each of the two groups had a

multivariate normal distribution In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional ( univariate) normal distribution to higher dimensions. One ...

. The extension of this same context to more than two-groups has also been considered with a restriction imposed that the classification rule should be

linear Linearity is the property of a mathematical relationship ('' function'') that can be graphically represented as a straight line. Linearity is closely related to '' proportionality''. Examples in physics include rectilinear motion, the linear ...

. Later work for the multivariate normal distribution allowed the classifier to be

nonlinear In mathematics and science, a nonlinear system is a system in which the change of the output is not proportional to the change of the input. Nonlinear problems are of interest to engineers, biologists, physicists, mathematicians, and many oth ...

: several classification rules can be derived based on different adjustments of the

Mahalanobis distance The Mahalanobis distance is a measure of the distance between a point ''P'' and a distribution ''D'', introduced by P. C. Mahalanobis in 1936. Mahalanobis's definition was prompted by the problem of identifying the similarities of skulls based ...

, with a new observation being assigned to the group whose centre has the lowest adjusted distance from the observation.

Bayesian procedures

Unlike frequentist procedures, Bayesian classification procedures provide a natural way of taking into account any available information about the relative sizes of the different groups within the overall population. Bayesian procedures tend to be computationally expensive and, in the days before

Markov chain Monte Carlo In statistics, Markov chain Monte Carlo (MCMC) methods comprise a class of algorithms for sampling from a probability distribution. By constructing a Markov chain that has the desired distribution as its equilibrium distribution, one can obtain ...

computations were developed, approximations for Bayesian clustering rules were devised. Some Bayesian procedures involve the calculation of group membership probabilities: these provide a more informative outcome than a simple attribution of a single group-label to each new observation.

Binary and multiclass classification

Classification can be thought of as two separate problems –

binary classification Binary classification is the task of classifying the elements of a set into two groups (each called ''class'') on the basis of a classification rule. Typical binary classification problems include: * Medical testing to determine if a patient has c ...

and multiclass classification. In binary classification, a better understood task, only two classes are involved, whereas multiclass classification involves assigning an object to one of several classes. Since many classification methods have been developed specifically for binary classification, multiclass classification often requires the combined use of multiple binary classifiers.

Feature vectors

Most algorithms describe an individual instance whose category is to be predicted using a feature vector of individual, measurable properties of the instance. Each property is termed a feature, also known in statistics as an

(or

, although features may or may not be statistically independent). Features may variously be binary (e.g. "on" or "off"); categorical (e.g. "A", "B", "AB" or "O", for

); ordinal (e.g. "large", "medium" or "small"); integer-valued (e.g. the number of occurrences of a particular word in an email); or real-valued (e.g. a measurement of blood pressure). If the instance is an image, the feature values might correspond to the pixels of an image; if the instance is a piece of text, the feature values might be occurrence frequencies of different words. Some algorithms work only in terms of discrete data and require that real-valued or integer-valued data be ''discretized'' into groups (e.g. less than 5, between 5 and 10, or greater than 10).

Linear classifiers

A large number of

s for classification can be phrased in terms of a linear function that assigns a score to each possible category ''k'' by combining the feature vector of an instance with a vector of weights, using a

dot product In mathematics, the dot product or scalar productThe term ''scalar product'' means literally "product with a scalar as a result". It is also used sometimes for other symmetric bilinear forms, for example in a pseudo-Euclidean space. is an alg ...

. The predicted category is the one with the highest score. This type of score function is known as a

linear predictor function In statistics and in machine learning, a linear predictor function is a linear function ( linear combination) of a set of coefficients and explanatory variables (independent variables), whose value is used to predict the outcome of a dependent vari ...

and has the following general form:

\operatorname(\mathbf_i, k) = \boldsymbol\beta_k \cdot \mathbf_i,

where X_''i'' is the feature vector for instance ''i'', β_''k'' is the vector of weights corresponding to category ''k'', and score(X_''i'', ''k'') is the score associated with assigning instance ''i'' to category ''k''. In discrete choice theory, where instances represent people and categories represent choices, the score is considered the

utility As a topic of economics, utility is used to model worth or value. Its usage has evolved significantly over time. The term was introduced initially as a measure of pleasure or happiness as part of the theory of utilitarianism by moral philosophe ...

associated with person ''i'' choosing category ''k''. Algorithms with this basic setup are known as

linear classifier In the field of machine learning, the goal of statistical classification is to use an object's characteristics to identify which class (or group) it belongs to. A linear classifier achieves this by making a classification decision based on the v ...

s. What distinguishes them is the procedure for determining (training) the optimal weights/coefficients and the way that the score is interpreted. Examples of such algorithms include * ** * * The perceptron algorithm * *

Algorithms

Since no single form of classification is appropriate for all data sets, a large toolkit of classification algorithms have been developed. The most commonly used include: * * * ** * ** ** ** * ** * * ** ** ** ** * * **

Evaluation

Classifier performance depends greatly on the characteristics of the data to be classified. There is no single classifier that works best on all given problems (a phenomenon that may be explained by the no-free-lunch theorem). Various empirical tests have been performed to compare classifier performance and to find the characteristics of data that determine classifier performance. Determining a suitable classifier for a given problem is however still more an art than a science. The measures

precision and recall In pattern recognition, information retrieval, object detection and classification (machine learning), precision and recall are performance metrics that apply to data retrieved from a collection, corpus or sample space. Precision (also call ...

are popular metrics used to evaluate the quality of a classification system. More recently,

receiver operating characteristic A receiver operating characteristic curve, or ROC curve, is a graph of a function, graphical plot that illustrates the diagnostic ability of a binary classifier system as its discrimination threshold is varied. The method was originally develope ...

(ROC) curves have been used to evaluate the tradeoff between true- and false-positive rates of classification algorithms. As a performance metric, the

uncertainty coefficient In statistics, the uncertainty coefficient, also called proficiency, entropy coefficient or Theil's U, is a measure of nominal association. It was first introduced by Henri Theil and is based on the concept of information entropy. Definition S ...

has the advantage over simple

accuracy Accuracy and precision are two measures of '' observational error''. ''Accuracy'' is how close a given set of measurements ( observations or readings) are to their '' true value'', while ''precision'' is how close the measurements are to each o ...

in that it is not affected by the relative sizes of the different classes. Further, it will not penalize an algorithm for simply ''rearranging'' the classes.

Application domains

Classification has many applications. In some of these it is employed as a data mining procedure, while in others more detailed statistical modeling is undertaken. * * identification * ** Medical image analysis and ** ** * * *

Drug discovery In the fields of medicine, biotechnology and pharmacology, drug discovery is the process by which new candidate medications are discovered. Historically, drugs were discovered by identifying the active ingredient from traditional remedies or b ...

and ** ** * * * Internet * Micro-array classification * * * *

References

{{DEFAULTSORT:Statistical Classification *