Nearest centroid classifier

In machine learning, a nearest centroid classifier or nearest prototype classifier is a classification model that assigns to observations the label of the class of training samples whose mean (centroid) is closest to the observation. When applied to text classification using word vectors containing tf*idf weights to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the Rocchio algorithm for relevance feedback. An extended version of the nearest centroid classifier has found applications in the medical domain, specifically classification of tumors.


Algorithm


Training

Given labeled training samples \{(\vec{x}_1, y_1), \dots, (\vec{x}_n, y_n)\} with class labels y_i \in \mathbf{Y}, compute the per-class centroids \vec{\mu}_\ell = \frac{1}{|C_\ell|} \sum_{i \in C_\ell} \vec{x}_i, where C_\ell is the set of indices of samples belonging to class \ell \in \mathbf{Y}.
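The training step can be sketched in a few lines of plain Python. This is a minimal illustration, not a reference implementation: the function name `fit_centroids`, the list-of-lists representation of the vectors, and the toy data are assumptions made for the example.

```python
from collections import defaultdict

def fit_centroids(samples, labels):
    """Return {label: centroid}; each centroid is the mean of that class's vectors."""
    sums = {}                    # per-class running sum of feature vectors
    counts = defaultdict(int)    # per-class sample count, i.e. |C_l|
    for x, y in zip(samples, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        for j, value in enumerate(x):
            sums[y][j] += value
        counts[y] += 1
    # Divide each class sum by the class size to obtain the mean vector.
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

# Hypothetical toy data: two samples of class "a", one of class "b".
X = [[1.0, 2.0], [3.0, 4.0], [10.0, 10.0]]
y = ["a", "a", "b"]
centroids = fit_centroids(X, y)
print(centroids)
```

Training only needs one pass over the data and stores a single vector per class, which is why the model is cheap to fit and to keep in memory.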


Prediction

The class assigned to an observation \vec{x} is \hat{y} = \arg\min_{\ell \in \mathbf{Y}} \|\vec{\mu}_\ell - \vec{x}\|.
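The prediction rule amounts to an arg-min over the stored centroids. The sketch below assumes the norm is the Euclidean distance, which the formula does not state explicitly; the function name `predict` and the example centroids are likewise hypothetical.

```python
import math

def predict(centroids, x):
    """Return the label whose centroid is closest to x (Euclidean distance assumed)."""
    # min() over the dict iterates its labels; ties go to the first label encountered.
    return min(centroids, key=lambda label: math.dist(centroids[label], x))

# Hypothetical centroids, as a training step might produce them.
centroids = {"a": [2.0, 3.0], "b": [10.0, 10.0]}
label_near_a = predict(centroids, [1.0, 1.0])
label_near_b = predict(centroids, [9.0, 8.0])
print(label_near_a, label_near_b)
```

Swapping `math.dist` for another distance function changes the decision boundaries but leaves the rest of the procedure unchanged.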


See also

* Cluster hypothesis
* ''k''-means clustering
* ''k''-nearest neighbor algorithm
* Linear discriminant analysis

