Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal.
Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data for which a mean or centroid cannot be defined, such as graphs. They are also used where the centroid is not representative of the data set, as with images, 3-D trajectories, and gene expression (where the data is sparse but the centroid need not be). Medoids are also of interest when a representative is sought under some distance other than squared Euclidean distance (for instance in movie ratings).
For some data sets there may be more than one medoid, as with medians.
A common application of the medoid is the k-medoids clustering algorithm, which is similar to the k-means algorithm but works when a mean or centroid is not definable. The algorithm works as follows. First, a set of medoids is chosen at random. Second, the distances from each medoid to the other points are computed. Third, each data point is assigned to the cluster of the medoid it is most similar to. Fourth, the medoid set is optimized via an iterative process.
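The four steps can be illustrated with a minimal sketch. It assumes the simple alternating variant (assign points, then re-pick each cluster's medoid), which is not the swap-based PAM procedure; the function name `k_medoids` and its parameters are hypothetical and chosen only for this example.

```python
import random


def k_medoids(points, k, dist, max_iter=100, seed=0):
    """Alternating k-medoids sketch; returns medoid indices and per-point labels."""
    rng = random.Random(seed)
    n = len(points)

    def nearest(i, medoids):
        # Index of the medoid closest to point i (ties go to the first medoid).
        return min(medoids, key=lambda m: dist(points[i], points[m]))

    # Step 1: choose k distinct points at random as the initial medoids.
    medoids = rng.sample(range(n), k)
    for _ in range(max_iter):
        # Steps 2 and 3: compute distances and assign each point to its nearest medoid.
        clusters = {m: [] for m in medoids}
        for i in range(n):
            clusters[nearest(i, medoids)].append(i)
        # Step 4: inside each cluster, pick the member with the smallest summed distance.
        new_medoids = []
        for m, members in clusters.items():
            if not members:  # possible only when duplicate points exist
                new_medoids.append(m)
                continue
            best = min(
                members,
                key=lambda c: sum(dist(points[c], points[j]) for j in members),
            )
            new_medoids.append(best)
        if set(new_medoids) == set(medoids):
            break  # medoid set unchanged: converged
        medoids = new_medoids
    labels = [nearest(i, medoids) for i in range(n)]
    return medoids, labels


data = [0, 1, 2, 10, 11, 12]
medoids, labels = k_medoids(data, k=2, dist=lambda a, b: abs(a - b))
print(sorted(data[m] for m in medoids))  # [1, 11]
```

Because only a dissimilarity function is passed in, the same sketch applies to objects for which no mean exists, such as strings or graph nodes.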
Note that a medoid is not equivalent to a median, a geometric median, or a centroid. A median is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a norm (such as the Manhattan distance or Euclidean distance). A geometric median is defined in any dimension, but is not necessarily a point from within the original data set.
Definition
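Let $\mathcal{X} := \{x_1, x_2, \dots, x_n\}$ be a set of $n$ points in a space with a distance function $d$. The medoid is defined as

$$x_{\text{medoid}} = \arg\min_{y \in \mathcal{X}} \sum_{i=1}^{n} d(y, x_i).$$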
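The definition translates directly into a brute-force search. The following is a minimal sketch, not an optimized implementation; the names `medoid` and `manhattan` are hypothetical, and any dissimilarity function could be passed in place of the Manhattan distance assumed here.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def medoid(points: Sequence[T], dist: Callable[[T, T], float]) -> T:
    """Return a member of `points` minimizing the summed dissimilarity to all points."""
    if not points:
        raise ValueError("the medoid of an empty set is undefined")
    # min() returns the first minimizer, so ties are broken by input order.
    return min(points, key=lambda y: sum(dist(y, x) for x in points))


def manhattan(a, b):
    """Manhattan (L1) distance between two equal-length tuples."""
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


pts = [(0, 0), (1, 1), (2, 2), (3, 3), (10, 10)]
print(medoid(pts, manhattan))  # (2, 2)
```

In this example the coordinate-wise mean is (3.2, 3.2), which is not a member of the data set, whereas the medoid (2, 2) is.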
Clustering with Medoids
Medoids are a popular replacement for the cluster mean when the distance function is not (squared) Euclidean distance, or not even a metric (as the medoid does not require the triangle inequality). When partitioning the data set into clusters, the medoid of each cluster can be used as its representative.
Clustering algorithms based on the idea of medoids include:
* Partitioning Around Medoids (PAM), the standard k-medoids algorithm
* Hierarchical Clustering Around Medoids (HACAM), which uses medoids in hierarchical clustering
Algorithms to compute the medoid of a set
From the definition above, it is clear that the medoid of a set $\mathcal{X}$ can be computed after computing all pairwise distances between points in the ensemble. This would take $O(n^2)$ distance evaluations (with $n = |\mathcal{X}|$). In the worst case, one cannot compute the medoid with fewer distance evaluations.[Newling, James; & Fleuret, François (2016); "A sub-quadratic exact medoid algorithm", in ''Proceedings of the 20th International Conference on Artificial Intelligence and Statistics'', PMLR 54:185-193, 201][Bagaria, Vivek; Kamath, Govinda M.; Ntranos, Vasilis; Zhang, Martin J.; & Tse, David N. (2017); "Medoids in almost linear time via multi-armed bandits", ''arXiv preprin] However, there are many approaches that allow us to compute medoids either exactly or approximately in sub-quadratic time under different statistical models.
If the points lie on the real line, computing the medoid reduces to computing the median, which can be done in $O(n)$ by the Quick-select algorithm of Hoare. However, in higher-dimensional real spaces, no linear-time algorithm is known.
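The one-dimensional case can be sketched as follows. This is an illustrative quickselect with a random pivot (expected linear time), assuming the absolute difference as the distance; the names `quickselect` and `medoid_1d` are hypothetical. For an even number of points both middle elements minimize the summed distance, and this sketch returns the lower one.

```python
import random


def quickselect(values, k, rng=random):
    """Return the k-th smallest element (0-based) in expected linear time."""
    values = list(values)
    while True:
        pivot = rng.choice(values)
        lows = [v for v in values if v < pivot]
        num_equal = sum(1 for v in values if v == pivot)
        if k < len(lows):
            values = lows  # the answer lies strictly below the pivot
        elif k < len(lows) + num_equal:
            return pivot  # the answer is the pivot itself
        else:
            k -= len(lows) + num_equal
            values = [v for v in values if v > pivot]


def medoid_1d(values):
    """A medoid of real numbers under |a - b| is a median element of the data."""
    return quickselect(values, (len(values) - 1) // 2)


print(medoid_1d([7, 1, 3, 10, 4]))  # 4
```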
RAND is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of $O\left(\frac{n \log n}{\epsilon^2}\right)$ distance computations to approximate the medoid within a factor of $(1 + \epsilon\Delta)$ with high probability, where $\Delta$ is the maximum distance between two points in the ensemble. Note that RAND is an approximation algorithm, and moreover $\Delta$ may ''not'' be known a priori.
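The sampling idea behind RAND can be sketched in a few lines. This is a simplified illustration rather than the published algorithm: the function name `rand_medoid`, the `num_refs` parameter, and the choice to sample reference points without replacement are all assumptions made for the example.

```python
import random


def rand_medoid(points, dist, num_refs, seed=0):
    """Return the index whose estimated average distance is smallest."""
    rng = random.Random(seed)
    n = len(points)
    # Draw a random subset of reference points (without replacement: an assumption).
    refs = rng.sample(range(n), num_refs)

    def estimate(i):
        # Average distance from point i to the sampled references only.
        return sum(dist(points[i], points[j]) for j in refs) / num_refs

    return min(range(n), key=estimate)


data = [0, 1, 2, 3, 10]
# With every point used as a reference, the estimate equals the exact average.
print(rand_medoid(data, lambda a, b: abs(a - b), num_refs=len(data)))  # 2
```

The sketch performs `len(points) * num_refs` distance evaluations, so a small `num_refs` trades accuracy for speed.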
RAND was leveraged by TOPRANK,[Okamoto, Kazuya; Chen, Wei; & Li, Xiang-Yang (2008); "Ranking of closeness centrality for large-scale social networks", in Preparata, Franco P.; Wu, Xiaodong; Yin, Jianping (eds.); ''Frontiers in Algorithmics Workshop 2008'', ''Lecture Notes in Computer Science'', ''5059'', 186-195] which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points ''exactly'', and picks the minimum of those. TOPRANK needs $O(n^{5/3} \log^{4/3} n)$ distance computations to find the ''exact'' medoid with high probability under a distributional assumption on the average distances.
trimed is an algorithm that finds the medoid with $O(n^{3/2} 2^{\Theta(d)})$ distance evaluations under a distributional assumption on the points (here $d$ denotes the dimension of the space, not the distance function). The algorithm uses the triangle inequality to cut down the search space.
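The pruning idea can be illustrated with a simplified sketch in the spirit of trimed; it is not the authors' implementation, and the function name `trimed_medoid` is hypothetical. It assumes `dist` is a true metric. Writing $E(i)$ for the mean distance from point $i$ to all points, the triangle inequality gives $E(j) \ge |E(i) - d(i, j)|$, so each exact evaluation tightens a lower bound for every other point, and points whose bound already reaches the best value found so far are skipped.

```python
import random


def trimed_medoid(points, dist, seed=0):
    """Exact medoid index for a metric `dist`, plus the number of points evaluated."""
    n = len(points)
    order = list(range(n))
    random.Random(seed).shuffle(order)  # visit points in random order
    lower = [0.0] * n  # lower bounds on each point's mean distance
    best_idx, best_energy = None, float("inf")
    evaluated = 0
    for i in order:
        if lower[i] >= best_energy:
            continue  # the bound shows point i cannot beat the incumbent
        d_i = [dist(points[i], points[j]) for j in range(n)]
        energy = sum(d_i) / n
        evaluated += 1
        if energy < best_energy:
            best_idx, best_energy = i, energy
        # Triangle inequality: E(j) >= |E(i) - d(i, j)| for every j.
        for j in range(n):
            lower[j] = max(lower[j], abs(energy - d_i[j]))
    return best_idx, evaluated


def manhattan(a, b):
    return sum(abs(ai - bi) for ai, bi in zip(a, b))


pts = [(0, 0), (1, 1), (2, 2), (3, 3), (10, 10)]
idx, evaluated = trimed_medoid(pts, manhattan)
print(idx)  # 2
```

The second return value counts how many points needed a full row of distance evaluations, which is the quantity the pruning is meant to reduce.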
Meddit leverages a connection of the medoid computation with multi-armed bandits and uses an upper-confidence-bound type of algorithm to obtain a method that takes $O(n \log n)$ distance evaluations under statistical assumptions on the points.
Correlated Sequential Halving[Baharav, Tavor Z.; & Tse, David N. (2019); "Ultra Fast Medoid Identification via Correlated Sequential Halving", in ''Advances in Neural Information Processing Systems''] also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm provably yields a drastic improvement (usually around 1-2 orders of magnitude) in both the number of distance computations needed and wall-clock time.
Implementations
Implementations of RAND, TOPRANK, trimed, Meddit, and Correlated Sequential Halving are available online.