Medoids are representative objects of a data set or a

cluster may refer to: Science and technology Astronomy * Cluster (spacecraft), constellation of four European Space Agency spacecraft * Asteroid cluster, a small asteroid family * Cluster II (spacecraft), a European Space Agency mission to study th ...

within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images and 3-D trajectories and

gene expression Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product that enables it to produce end products, protein or non-coding RNA, and ultimately affect a phenotype, as the final effect. The ...

(where while the data is sparse the medoid need not be). These are also of interest while wanting to find a representative using some distance other than squared euclidean distance (for instance in movie-ratings). For some data sets there may be more than one medoid, as with medians. A common application of the medoid is the k-medoids clustering algorithm, which is similar to the k-means algorithm but works when a mean or centroid is not definable. This algorithm basically works as follows. First, a set of medoids is chosen at random. Second, the distances to the other points are computed. Third, data are clustered according to the medoid they are most similar to. Fourth, the medoid set is optimized via an iterative process. Note that a medoid is not equivalent to a

median In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic fe ...

, a geometric median, or centroid. A

is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a norm (such as the Manhattan distance or Euclidean distance). A geometric median is defined in any dimension, but is not necessarily a point from within the original dataset.

Definition

Let

\mathcal X := \

be a set of

n

points in a space with a distance function d. Medoid is defined as :

x_ = \arg\min_ \sum_^n d(y, x_i).

Clustering with Medoids

Medoids are a popular replacement for the cluster mean when the distance function is not (squared) Euclidean distance, or not even a metric (as the medoid does not require the triangle inequality). When partitioning the data set into clusters, the medoid of each cluster can be used as a representative of each cluster. Clustering algorithms based on the idea of medoids include: * Partitioning Around Medoids (PAM), the standard k-medoids algorithm * Hierarchical Clustering Around Medoids (HACAM), which uses medoids in

hierarchical clustering In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into ...

Algorithms to compute the medoid of a set

From the definition above, it is clear that the medoid of a set

\mathcal X

can be computed after computing all pairwise distances between points in the ensemble. This would take

O(n^2)

distance evaluations (with

n=, \mathcal X,

). In the worst case, one can not compute the medoid with fewer distance evaluations.Newling, James; & Fleuret, François (2016); "A sub-quadratic exact medoid algorithm", in ''Proceedings of the 20th International Conference on Artificial Intelligence and Statistics'', PMLR 54:185-193, 201
Available online
'.Bagaria, Vivek; Kamath, Govinda M.; Ntranos, Vasilis; Zhang, Martin J.; & Tse, David N. (2017); "Medoids in almost linear time via multi-armed bandits", ''arXiv preprin
Available online
'. However, there are many approaches that allow us to compute medoids either exactly or approximately in sub-quadratic time under different statistical models. If the points lie on the real line, computing the medoid reduces to computing the median which can be done in

O(n)

by Quick-select algorithm of Hoare. However, in higher dimensional real spaces, no linear-time algorithm is known. RAND is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of

O\left(\frac\right)

distance computations to approximate the medoid within a factor of

(1+\epsilon\Delta)

with high probability, where

\Delta

is the maximum distance between two points in the ensemble. Note that RAND is an approximation algorithm, and moreover

\Delta

may ''not'' be known apriori. RAND was leveraged by TOPRANK Okamoto, Kazuya; Chen, Wei; & Li, Xiang-Yang (2008)
"Ranking of closeness centrality for large-scale social networks"
in Preparata, Franco P.; Wu, Xiaodong; Yin, Jianping (eds.); ''Frontiers in Algorithmics Workshop 2008'', ''Lecture Notes in Computer Science'', ''5059'', 186-195 which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points ''exactly'', and picks the minimum of those. TOPRANK needs

O(n^ \log^ n)

distance computations to find the ''exact'' medoid with high probability under a distributional assumption on the average distances. trimed presents an algorithm to find the medoid with

O(n^ 2^{\Theta(d)})

distance evaluations under a distributional assumption on the points. The algorithm uses the triangle inequality to cut down the search space. Meddit leverages a connection of the medoid computation with multi-armed bandits and uses an upper-Confidence-bound type of algorithm to get an algorithm which takes

O(n \log n)

distance evaluations under statistical assumptions on the points. Correlated Sequential HalvingBaharav, Tavor Z.; & Tse, David N. (2019); "Ultra Fast Medoid Identification via Correlated Sequential Halving", in ''Advances in Neural Information Processing Systems''
available online
/ref> also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm is able to provably yield drastic improvement (usually around 1-2 orders of magnitude) in both number of distance computations needed and wall clock time.

Implementations

An implementation of RAND, TOPRANK, and trimed can be foun
here
An implementation of Meddit can be foun
here
an
here
An implementation of Correlated Sequential Halving can be foun
here

References

Cluster analysis Means