Medoid
   HOME

TheInfoList



OR:

Medoids are representative objects of a
data set A data set (or dataset) is a collection of data. In the case of tabular data, a data set corresponds to one or more database tables, where every column of a table represents a particular variable, and each row corresponds to a given record of the ...
or a
cluster may refer to: Science and technology Astronomy * Cluster (spacecraft), constellation of four European Space Agency spacecraft * Asteroid cluster, a small asteroid family * Cluster II (spacecraft), a European Space Agency mission to study t ...
within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to
mean There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value (magnitude and sign) of a given data set. For a data set, the ''arithme ...
s or
centroids In mathematics and physics, the centroid, also known as geometric center or center of figure, of a plane figure or solid figure is the arithmetic mean position of all the points in the surface of the figure. The same definition extends to any o ...
, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images and 3-D trajectories and
gene expression Gene expression is the process by which information from a gene is used in the synthesis of a functional gene product that enables it to produce end products, protein or non-coding RNA, and ultimately affect a phenotype, as the final effect. The ...
(where while the data is sparse the medoid need not be). These are also of interest while wanting to find a representative using some distance other than
squared euclidean distance In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two Point (geometry), points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theo ...
(for instance in movie-ratings). For some data sets there may be more than one medoid, as with medians. A common application of the medoid is the
k-medoids The -medoids problem is a clustering problem similar to -means. The name was coined by Leonard Kaufman and Peter J. Rousseeuw with their PAM algorithm. Both the -means and -medoids algorithms are partitional (breaking the dataset up into group ...
clustering algorithm, which is similar to the
k-means ''k''-means clustering is a method of vector quantization, originally from signal processing, that aims to partition ''n'' observations into ''k'' clusters in which each observation belongs to the cluster with the nearest mean (cluster centers o ...
algorithm but works when a mean or centroid is not definable. This algorithm basically works as follows. First, a set of medoids is chosen at random. Second, the distances to the other points are computed. Third, data are clustered according to the medoid they are most similar to. Fourth, the medoid set is optimized via an iterative process. Note that a medoid is not equivalent to a
median In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic fe ...
, a
geometric median In geometry, the geometric median of a discrete set of sample points in a Euclidean space is the point minimizing the sum of distances to the sample points. This generalizes the median, which has the property of minimizing the sum of distances ...
, or
centroid In mathematics and physics, the centroid, also known as geometric center or center of figure, of a plane figure or solid figure is the arithmetic mean position of all the points in the surface of the figure. The same definition extends to any ob ...
. A
median In statistics and probability theory, the median is the value separating the higher half from the lower half of a data sample, a population, or a probability distribution. For a data set, it may be thought of as "the middle" value. The basic fe ...
is only defined on 1-dimensional data, and it only minimizes dissimilarity to other points for metrics induced by a
norm Naturally occurring radioactive materials (NORM) and technologically enhanced naturally occurring radioactive materials (TENORM) consist of materials, usually industrial wastes or by-products enriched with radioactive elements found in the envir ...
(such as the
Manhattan distance A taxicab geometry or a Manhattan geometry is a geometry whose usual distance function or Metric (mathematics), metric of Euclidean geometry is replaced by a new metric in which the distance between two points is the sum of the absolute differences ...
or
Euclidean distance In mathematics, the Euclidean distance between two points in Euclidean space is the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, therefor ...
). A
geometric median In geometry, the geometric median of a discrete set of sample points in a Euclidean space is the point minimizing the sum of distances to the sample points. This generalizes the median, which has the property of minimizing the sum of distances ...
is defined in any dimension, but is not necessarily a point from within the original dataset.


Definition

Let \mathcal X := \ be a set of n points in a space with a
distance function In mathematics, a metric space is a set together with a notion of ''distance'' between its elements, usually called points. The distance is measured by a function called a metric or distance function. Metric spaces are the most general setting ...
d. Medoid is defined as : x_ = \arg\min_ \sum_^n d(y, x_i).


Clustering with Medoids

Medoids are a popular replacement for the cluster mean when the distance function is not (squared) Euclidean distance, or not even a
metric Metric or metrical may refer to: * Metric system, an internationally adopted decimal system of measurement * An adjective indicating relation to measurement in general, or a noun describing a specific type of measurement Mathematics In mathem ...
(as the medoid does not require the triangle inequality). When partitioning the data set into clusters, the medoid of each cluster can be used as a representative of each cluster. Clustering algorithms based on the idea of medoids include: * Partitioning Around Medoids (PAM), the standard
k-medoids The -medoids problem is a clustering problem similar to -means. The name was coined by Leonard Kaufman and Peter J. Rousseeuw with their PAM algorithm. Both the -means and -medoids algorithms are partitional (breaking the dataset up into group ...
algorithm * Hierarchical Clustering Around Medoids (HACAM), which uses medoids in
hierarchical clustering In data mining and statistics, hierarchical clustering (also called hierarchical cluster analysis or HCA) is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into ...


Algorithms to compute the medoid of a set

From the definition above, it is clear that the medoid of a set \mathcal X can be computed after computing all pairwise distances between points in the ensemble. This would take O(n^2) distance evaluations (with n=, \mathcal X, ). In the worst case, one can not compute the medoid with fewer distance evaluations.Newling, James; & Fleuret, François (2016); "A sub-quadratic exact medoid algorithm", in ''Proceedings of the 20th International Conference on Artificial Intelligence and Statistics'', PMLR 54:185-193, 201
Available online
'.
Bagaria, Vivek; Kamath, Govinda M.; Ntranos, Vasilis; Zhang, Martin J.; & Tse, David N. (2017); "Medoids in almost linear time via multi-armed bandits", ''arXiv preprin
Available online
'.
However, there are many approaches that allow us to compute medoids either exactly or approximately in sub-quadratic time under different statistical models. If the points lie on the real line, computing the medoid reduces to computing the median which can be done in O(n) by Quick-select algorithm of Hoare. However, in higher dimensional real spaces, no linear-time algorithm is known. RAND is an algorithm that estimates the average distance of each point to all the other points by sampling a random subset of other points. It takes a total of O\left(\frac\right) distance computations to approximate the medoid within a factor of (1+\epsilon\Delta) with high probability, where \Delta is the maximum distance between two points in the ensemble. Note that RAND is an approximation algorithm, and moreover \Delta may ''not'' be known apriori. RAND was leveraged by TOPRANK Okamoto, Kazuya; Chen, Wei; & Li, Xiang-Yang (2008)
"Ranking of closeness centrality for large-scale social networks"
in Preparata, Franco P.; Wu, Xiaodong; Yin, Jianping (eds.); ''Frontiers in Algorithmics Workshop 2008'', ''Lecture Notes in Computer Science'', ''5059'', 186-195
which uses the estimates obtained by RAND to focus on a small subset of candidate points, evaluates the average distance of these points ''exactly'', and picks the minimum of those. TOPRANK needs O(n^ \log^ n) distance computations to find the ''exact'' medoid with high probability under a distributional assumption on the average distances. trimed presents an algorithm to find the medoid with O(n^ 2^{\Theta(d)}) distance evaluations under a distributional assumption on the points. The algorithm uses the triangle inequality to cut down the search space. Meddit leverages a connection of the medoid computation with
multi-armed bandit In probability theory and machine learning, the multi-armed bandit problem (sometimes called the ''K''- or ''N''-armed bandit problem) is a problem in which a fixed limited set of resources must be allocated between competing (alternative) choices ...
s and uses an upper-Confidence-bound type of algorithm to get an algorithm which takes O(n \log n) distance evaluations under statistical assumptions on the points. Correlated Sequential HalvingBaharav, Tavor Z.; & Tse, David N. (2019); "Ultra Fast Medoid Identification via Correlated Sequential Halving", in ''Advances in Neural Information Processing Systems''
available online
/ref> also leverages multi-armed bandit techniques, improving upon Meddit. By exploiting the correlation structure in the problem, the algorithm is able to provably yield drastic improvement (usually around 1-2 orders of magnitude) in both number of distance computations needed and wall clock time.


Implementations

An implementation of RAND, TOPRANK, and trimed can be foun
here
An implementation of Meddit can be foun
here
an
here
An implementation of Correlated Sequential Halving can be foun
here


References

Cluster analysis Means