In
machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial intelligence.
Machine ...
, a nearest centroid classifier or nearest prototype classifier is a
classification model that assigns to observations the label of the class of training samples whose
mean
There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set.
For a data set, the ''arithme ...
(
centroid
In mathematics and physics, the centroid, also known as geometric center or center of figure, of a plane figure or solid figure is the arithmetic mean position of all the points in the surface of the figure. The same definition extends to any ob ...
) is closest to the observation. When applied to
text classification
Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories. This may be done "manually" (or "intellectually") ...
using
word vectors containing
tf*idf weights to represent documents, the nearest centroid classifier is known as the Rocchio classifier because of its similarity to the
Rocchio algorithm The Rocchio algorithm is based on a method of relevance feedback found in information retrieval systems which stemmed from the SMART Information Retrieval System developed between 1960 and 1964. Like many other retrieval systems, the Rocchio algori ...
for
relevance feedback Relevance feedback is a feature of some information retrieval systems. The idea behind relevance feedback is to take the results that are initially returned from a given query, to gather user feedback, and to use information about whether or not th ...
.
An extended version of the nearest centroid classifier has found applications in the medical domain, specifically classification of
tumor
A neoplasm () is a type of abnormal and excessive growth of tissue. The process that occurs to form or produce a neoplasm is called neoplasia. The growth of a neoplasm is uncoordinated with that of the normal surrounding tissue, and persists ...
s.
Algorithm
Training
Given labeled training samples
with class labels
, compute the per-class centroids
where
is the set of indices of samples belonging to class
.
Prediction
The class assigned to an observation
is
.
See also
*
Cluster hypothesis
In machine learning
Machine learning (ML) is a field of inquiry devoted to understanding and building methods that 'learn', that is, methods that leverage data to improve performance on some set of tasks. It is seen as a part of artificial i ...
*
''k''-means clustering
*
''k''-nearest neighbor algorithm
*
Linear discriminant analysis
Linear discriminant analysis (LDA), normal discriminant analysis (NDA), or discriminant function analysis is a generalization of Fisher's linear discriminant, a method used in statistics and other fields, to find a linear combination of features ...
References
Classification algorithms