In statistics, ''k''-medians clusteringP. S. Bradley, O. L. Mangasarian, and W. N. Street, "Clustering via Concave Minimization," in Advances in Neural Information Processing Systems, vol. 9, M. C. Mozer, M. I. Jordan, and T. Petsche, Eds. Cambridge, Massachusetts: MIT Press, 1997, pp. 368–374. is a

cluster analysis Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters). It is a main task of ...

algorithm. It is a variation of ''k''-means clustering where instead of calculating the

mean There are several kinds of mean in mathematics, especially in statistics. Each mean serves to summarize a given group of data, often to better understand the overall value ( magnitude and sign) of a given data set. For a data set, the '' ari ...

for each cluster to determine its centroid, one instead calculates the median. This has the effect of minimizing error over all clusters with respect to the 1-

norm Naturally occurring radioactive materials (NORM) and technologically enhanced naturally occurring radioactive materials (TENORM) consist of materials, usually industrial wastes or by-products enriched with radioactive elements found in the envir ...

distance metric, as opposed to the squared 2-norm distance metric (which ''k''-means does.) This relates directly to the ''k''-median problem with respect to the 1-norm, which is the problem of finding ''k'' centers such that the clusters formed by them are the most compact. Formally, given a set of data points ''x'', the ''k'' centers ''c''_''i'' are to be chosen so as to minimize the sum of the distances from each ''x'' to the nearest ''c''_''i''. The criterion function formulated in this way is sometimes a better criterion than that used in the ''k''-means clustering algorithm, in which the sum of the squared distances is used. The sum of distances is widely used in applications such as the

facility location problem The study of facility location problems (FLP), also known as location analysis, is a branch of operations research and computational geometry concerned with the optimal placement of facilities to minimize transportation costs while considering fac ...

. The proposed algorithm uses Lloyd-style iteration which alternates between an expectation (E) and maximization (M) step, making this an

expectation–maximization algorithm In statistics, an expectation–maximization (EM) algorithm is an iterative method to find (local) maximum likelihood or maximum a posteriori (MAP) estimates of parameters in statistical models, where the model depends on unobserved latent variab ...

. In the E step, all objects are assigned to their nearest median. In the M step, the medians are recomputed by using the median in each single dimension.

Medians and medoids

The median is computed in each single dimension in the Manhattan-distance formulation of the ''k''-medians problem, so the individual attributes will come from the dataset (or be an average of two values from the dataset). This makes the algorithm more reliable for discrete or even

binary data Binary data is data whose unit can take on only two possible states. These are often labelled as 0 and 1 in accordance with the binary numeral system and Boolean algebra. Binary data occurs in many different technical and scientific fields, wher ...

sets. In contrast, the use of means or Euclidean-distance medians will not necessarily yield individual attributes from the dataset. Even with the Manhattan-distance formulation, the individual attributes may come from different instances in the dataset; thus, the resulting median may not be a member of the input dataset. This algorithm is often confused with the ''k''-medoids algorithm. However, a medoid has to be an actual instance from the dataset, while for the multivariate Manhattan-distance median this only holds for single attribute values. The actual median can thus be a combination of multiple instances. For example, given the vectors (0,1), (1,0) and (2,2), the Manhattan-distance median is (1,1), which does not exist in the original data, and thus cannot be a medoid.

Software

ELKI ELKI (for ''Environment for DeveLoping KDD-Applications Supported by Index-Structures'') is a data mining (KDD, knowledge discovery in databases) software framework developed for use in research and teaching. It was originally at the database s ...

includes various k-means variants, including k-medians. * FORTRANbr>kmedians
*

GNU R R is a programming language for statistical computing and graphics supported by the R Core Team and the R Foundation for Statistical Computing. Created by statisticians Ross Ihaka and Robert Gentleman, R is used among data miners, bioinforma ...

includes k-medians in the "flexclust" package. *

Stata Stata (, , alternatively , occasionally stylized as STATA) is a general-purpose statistical software package developed by StataCorp for data manipulation, visualization, statistics, and automated reporting. It is used by researchers in many fiel ...

br>kmedians

References

{{reflist Cluster analysis algorithms

Medians and medoids

Software

See also

References