Determining The Number Of Clusters In A Data Set

Determining the number of clusters in a data set, a quantity often labelled ''k'' as in the ''k''-means algorithm, is a frequent problem in data clustering, and is a distinct issue from the process of actually solving the clustering problem.

For a certain class of clustering algorithms (in particular ''k''-means, ''k''-medoids and the expectation–maximization algorithm), there is a parameter commonly referred to as ''k'' that specifies the number of clusters to detect. Other algorithms such as DBSCAN and OPTICS do not require the specification of this parameter; hierarchical clustering avoids the problem altogether.

The correct choice of ''k'' is often ambiguous, with interpretations depending on the shape and scale of the distribution of points in a data set and the desired clustering resolution of the user. In addition, increasing ''k'' without penalty will always reduce the amount of error in the resulting clustering, to the extreme case of zero error if each data point is considered its own cluster (i.e., when ''k'' equals the number of data points, ''n''). Intuitively then, ''the optimal choice of ''k'' will strike a balance between maximum compression of the data using a single cluster, and maximum accuracy by assigning each data point to its own cluster''. If an appropriate value of ''k'' is not apparent from prior knowledge of the properties of the data set, it must be chosen somehow. There are several categories of methods for making this decision.


Elbow method

The elbow method looks at the percentage of explained variance as a function of the number of clusters: one should choose a number of clusters such that adding another cluster does not give a much better model of the data. More precisely, if one plots the percentage of variance explained by the clusters against the number of clusters, the first clusters will add much information (explain a lot of variance), but at some point the marginal gain will drop, giving an angle in the graph. The number of clusters is chosen at this point, hence the "elbow criterion". This "elbow" cannot always be unambiguously identified, making this method very subjective and unreliable. Percentage of variance explained is the ratio of the between-group variance to the total variance, also known as an F-test. A slight variation of this method plots the curvature of the within-group variance. The method can be traced to speculation by Robert L. Thorndike in 1953.
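
As an illustration, a minimal Python sketch of the elbow criterion, assuming scikit-learn and NumPy are available (the data matrix X here is a placeholder; substitute the actual observations):

import numpy as np
from sklearn.cluster import KMeans

# X: an (n_samples, n_features) array of observations (placeholder data here)
X = np.random.rand(200, 2)

total_variance = np.sum((X - X.mean(axis=0)) ** 2)
explained = []
for k in range(1, 11):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    # inertia_ is the within-cluster sum of squares; the remainder is "explained"
    explained.append(100.0 * (1.0 - km.inertia_ / total_variance))

# Plotting `explained` (percent of variance explained) against k and looking
# for the bend in the curve gives the elbow estimate of the number of clusters.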


X-means clustering

In statistics and data mining, X-means clustering is a variation of ''k''-means clustering that refines cluster assignments by repeatedly attempting subdivision, and keeping the best resulting splits, until a criterion such as the Akaike information criterion (AIC) or Bayesian information criterion (BIC) is reached.
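
A simplified Python sketch of the splitting idea, assuming scikit-learn; the helper bic_spherical is a rough spherical-Gaussian approximation introduced here for illustration, not the exact scoring used in the original X-means formulation:

import numpy as np
from sklearn.cluster import KMeans

def bic_spherical(C):
    """Rough BIC of a single spherical Gaussian fitted to the points in C (assumption)."""
    n, p = C.shape
    var = max(np.sum((C - C.mean(axis=0)) ** 2) / (n * p), 1e-12)
    log_lik = -0.5 * n * p * (np.log(2 * np.pi * var) + 1.0)
    n_params = p + 1                          # mean vector plus a shared variance
    return n_params * np.log(n) - 2.0 * log_lik

def x_means(X, max_clusters=20):
    """Repeatedly try to split each cluster in two, keeping splits that lower BIC."""
    clusters, changed = [X], True
    while changed and len(clusters) < max_clusters:
        changed, new_clusters = False, []
        for C in clusters:
            if len(C) < 4:
                new_clusters.append(C)
                continue
            labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(C)
            parts = [C[labels == 0], C[labels == 1]]
            if (min(len(P) for P in parts) >= 2 and
                    sum(bic_spherical(P) for P in parts) < bic_spherical(C)):
                new_clusters.extend(parts)    # the split improves (lowers) BIC
                changed = True
            else:
                new_clusters.append(C)
        clusters = new_clusters
    return len(clusters)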


Information criterion approach

Another set of methods for determining the number of clusters are information criteria, such as the Akaike information criterion (AIC), Bayesian information criterion (BIC), or the deviance information criterion (DIC), provided it is possible to make a likelihood function for the clustering model. For example: the ''k''-means model is "almost" a Gaussian mixture model, and one can construct a likelihood for the Gaussian mixture model and thus also determine information criterion values.
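
For instance, a minimal sketch using scikit-learn's GaussianMixture that selects ''k'' by the lowest BIC (the data matrix X is again a placeholder):

import numpy as np
from sklearn.mixture import GaussianMixture

X = np.random.rand(300, 2)                 # placeholder data

bics = []
for k in range(1, 11):
    gmm = GaussianMixture(n_components=k, n_init=5, random_state=0).fit(X)
    bics.append(gmm.bic(X))                # lower BIC indicates a better trade-off

best_k = int(np.argmin(bics)) + 1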


Information–theoretic approach

Rate distortion theory has been applied to choosing ''k'', called the "jump" method, which determines the number of clusters that maximizes efficiency while minimizing error by information-theoretic standards. The strategy of the algorithm is to generate a distortion curve for the input data by running a standard clustering algorithm such as ''k''-means for all values of ''k'' between 1 and ''n'', and computing the distortion (described below) of the resulting clustering. The distortion curve is then transformed by a negative power chosen based on the dimensionality of the data. Jumps in the resulting values then signify reasonable choices for ''k'', with the largest jump representing the best choice.

The distortion of a clustering of some input data is formally defined as follows: Let the data set be modeled as a ''p''-dimensional random variable, ''X'', consisting of a mixture distribution of ''G'' components with common covariance, \Gamma. If we let c_1 \ldots c_K be a set of ''K'' cluster centers, with c_X the closest center to a given sample of ''X'', then the minimum average distortion per dimension when fitting the ''K'' centers to the data is:

: d_K = \frac{1}{p} \min_{c_1 \ldots c_K} E\left[(X - c_X)^T \Gamma^{-1} (X - c_X)\right]

This is also the average Mahalanobis distance per dimension between ''X'' and the closest cluster center c_X. Because the minimization over all possible sets of cluster centers is prohibitively complex, the distortion is computed in practice by generating a set of cluster centers using a standard clustering algorithm and computing the distortion using the result. The pseudo-code for the jump method with an input set of ''p''-dimensional data points ''X'' is:

 JumpMethod(X):
    Let Y = (p/2)
    Init a list D, of size n+1
    Let D[0] = 0
    For k = 1 ... n:
       Cluster X with k clusters (e.g., with k-means)
       Let d = Distortion of the resulting clustering
       D[k] = d^(-Y)
    Define J(i) = D[i] - D[i-1]
    Return the k between 1 and n that maximizes J(k)

The choice of the transform power Y = (p/2) is motivated by asymptotic reasoning using results from rate distortion theory. Let the data ''X'' have a single, arbitrarily ''p''-dimensional Gaussian distribution, and let K = \lfloor\alpha^p\rfloor be fixed, for some \alpha greater than zero. Then the distortion of a clustering of ''K'' clusters in the limit as ''p'' goes to infinity is \alpha^{-2}. It can be seen that asymptotically, the distortion of a clustering to the power (-p/2) is proportional to \alpha^p, which by definition is approximately the number of clusters ''K''. In other words, for a single Gaussian distribution, increasing ''K'' beyond the true number of clusters, which should be one, causes a linear growth in the transformed distortion.

This behavior is important in the general case of a mixture of multiple distribution components. Let ''X'' be a mixture of ''G'' ''p''-dimensional Gaussian distributions with common covariance. Then for any fixed ''K'' less than ''G'', the distortion of a clustering as ''p'' goes to infinity is infinite. Intuitively, this means that a clustering of fewer than the correct number of clusters is unable to describe asymptotically high-dimensional data, causing the distortion to increase without limit. If, as described above, ''K'' is made an increasing function of ''p'', namely K = \lfloor\alpha^p\rfloor, the same result as above is achieved, with the value of the distortion in the limit as ''p'' goes to infinity being equal to \alpha^{-2}. Correspondingly, there is the same proportional relationship between the transformed distortion and the number of clusters, ''K''.

Putting the results above together, it can be seen that for sufficiently high values of ''p'', the transformed distortion d_K^{-p/2} is approximately zero for ''K'' < ''G'', then jumps suddenly and begins increasing linearly for ''K'' ≥ ''G''. The jump algorithm for choosing ''K'' makes use of these behaviors to identify the most likely value for the true number of clusters. Although the mathematical support for the method is given in terms of asymptotic results, the algorithm has been empirically verified to work well in a variety of data sets with reasonable dimensionality.

In addition to the localized jump method described above, there exists a second algorithm for choosing ''K'' using the same transformed distortion values, known as the broken line method. The broken line method identifies the jump point in the graph of the transformed distortion by doing a simple least squares error line fit of two line segments, which in theory will fall along the ''x''-axis for ''K'' < ''G'', and along the linearly increasing phase of the transformed distortion plot for ''K'' ≥ ''G''. The broken line method is more robust than the jump method in that its decision is global rather than local, but it also relies on the assumption of Gaussian mixture components, whereas the jump method is fully non-parametric and has been shown to be viable for general mixture distributions.
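
A compact Python sketch of the jump method following the pseudo-code above, assuming scikit-learn; the identity matrix is used in place of the unknown common covariance \Gamma, which is a simplifying assumption:

import numpy as np
from sklearn.cluster import KMeans

def jump_method(X, max_k=10):
    """Return the k in 1..max_k with the largest jump in transformed distortion."""
    n, p = X.shape
    Y = p / 2.0                            # transform power from the theory above
    D = np.zeros(max_k + 1)                # D[0] = 0 by convention
    for k in range(1, max_k + 1):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
        # Average (identity-covariance) distortion per dimension.
        d = max(km.inertia_ / (n * p), 1e-12)
        D[k] = d ** (-Y)
    jumps = D[1:] - D[:-1]                 # J(k) = D[k] - D[k-1]
    return int(np.argmax(jumps)) + 1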


Silhouette method

The average silhouette of the data is another useful criterion for assessing the natural number of clusters. The silhouette of a data instance is a measure of how closely it is matched to data within its cluster and how loosely it is matched to data of the neighboring cluster, i.e., the cluster whose average distance from the datum is lowest. A silhouette close to 1 implies the datum is in an appropriate cluster, while a silhouette close to −1 implies the datum is in the wrong cluster. Optimization techniques such as genetic algorithms are useful in determining the number of clusters that gives rise to the largest silhouette. It is also possible to re-scale the data in such a way that the silhouette is more likely to be maximized at the correct number of clusters.
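
A minimal sketch, assuming scikit-learn (X is a placeholder data matrix), that picks the ''k'' maximizing the average silhouette:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.random.rand(300, 2)                  # placeholder data

scores = {}
for k in range(2, 11):                       # the silhouette needs at least 2 clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)  # mean silhouette over all points

best_k = max(scores, key=scores.get)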


Cross-validation

One can also use the process of cross-validation to analyze the number of clusters. In this process, the data is partitioned into ''v'' parts. Each of the parts is then set aside in turn as a test set, a clustering model is computed on the other ''v'' − 1 training parts, and the value of the objective function (for example, the sum of the squared distances to the centroids for ''k''-means) is calculated for the test set. These ''v'' values are calculated and averaged for each alternative number of clusters, and the cluster number is selected such that a further increase in the number of clusters leads to only a small reduction in the objective function.
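
A minimal sketch of this procedure, assuming scikit-learn, with a placeholder data matrix X and v = 5 folds:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.model_selection import KFold

X = np.random.rand(300, 2)                  # placeholder data
folds = KFold(n_splits=5, shuffle=True, random_state=0)

avg_test_error = {}
for k in range(1, 11):
    errors = []
    for train_idx, test_idx in folds.split(X):
        km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X[train_idx])
        # Objective on the held-out part: squared distance to the nearest centroid.
        d = km.transform(X[test_idx]).min(axis=1)
        errors.append(np.sum(d ** 2))
    avg_test_error[k] = np.mean(errors)

# Choose the smallest k beyond which the averaged test error stops dropping noticeably.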


Finding the number of clusters in text databases

In text databases, for a document collection defined by a document-by-term matrix D (of size m×n, where m is the number of documents and n is the number of terms), the number of clusters can roughly be estimated by the formula \tfrac{mn}{t}, where t is the number of non-zero entries in D. Note that in D each row and each column must contain at least one non-zero element.
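
For illustration, a small sketch assuming SciPy, where the document-by-term matrix D is a placeholder sparse matrix:

import scipy.sparse as sp

# D: a placeholder m x n document-by-term matrix in sparse form
D = sp.random(1000, 5000, density=0.01, format="csr")

m, n = D.shape
t = D.nnz                                   # number of non-zero entries
estimated_clusters = (m * n) / t            # rough estimate m*n / t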


Analyzing the kernel matrix

The kernel matrix defines the proximity of the input information. For example, in the Gaussian radial basis function, it determines the dot product of the inputs in a higher-dimensional space, called the feature space. It is believed that the data become more linearly separable in the feature space, and hence, linear algorithms can be applied to the data with greater success. The kernel matrix can thus be analyzed in order to find the optimal number of clusters. The method proceeds by the eigenvalue decomposition of the kernel matrix. It then analyzes the eigenvalues and eigenvectors to obtain a measure of the compactness of the input distribution. Finally, a plot is drawn, where the elbow of that plot indicates the optimal number of clusters in the data set. Unlike previous methods, this technique does not need to perform any clustering a priori. It directly finds the number of clusters from the data.
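
A loose sketch of the idea, assuming scikit-learn and NumPy; the RBF bandwidth gamma and the eigenvalue-based "elbow" reading are assumptions, since the section does not specify a particular compactness measure:

import numpy as np
from sklearn.metrics.pairwise import rbf_kernel

X = np.random.rand(300, 2)                 # placeholder data

K = rbf_kernel(X, gamma=1.0)               # Gaussian RBF kernel (Gram) matrix
eigvals = np.linalg.eigvalsh(K)[::-1]      # eigenvalues, largest first
dominant = eigvals / eigvals.sum()         # relative weight of each eigenvalue

# Plotting `dominant` (or its cumulative sum) against the eigenvalue index and
# locating the elbow gives an indication of the number of clusters.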


The gap statistic

Robert Tibshirani, Guenther Walther, and Trevor Hastie proposed estimating the number of clusters in a data set via the gap statistic. The gap statistic, based on theoretical grounds, measures how far the pooled within-cluster sum of squares around the cluster centers is from the sum of squares expected under a null reference distribution of the data. The expected value is estimated by simulating null reference data with the characteristics of the original data, but lacking any clusters. The optimal number of clusters is then estimated as the value of ''k'' for which the observed sum of squares falls farthest below the null reference. Unlike many previous methods, the gap statistic can indicate that there is no value of ''k'' for which a good clustering exists. The gap statistic is implemented as the ''clusGap'' function in the ''cluster'' package in R.
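
A simplified Python sketch of the gap statistic, assuming scikit-learn and a uniform bounding-box reference distribution; it omits the standard-error correction used in the full method of Tibshirani, Walther, and Hastie:

import numpy as np
from sklearn.cluster import KMeans

def within_dispersion(X, k):
    """Log of the pooled within-cluster sum of squares for a k-means fit."""
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    return np.log(km.inertia_)

def gap_statistic(X, k, n_refs=10, seed=0):
    """Gap(k) = E*[log(W_k)] under a uniform reference minus the observed log(W_k)."""
    rng = np.random.default_rng(seed)
    mins, maxs = X.min(axis=0), X.max(axis=0)
    ref_disps = []
    for _ in range(n_refs):
        # Reference data: uniform over the bounding box of the observed data.
        ref = rng.uniform(mins, maxs, size=X.shape)
        ref_disps.append(within_dispersion(ref, k))
    return np.mean(ref_disps) - within_dispersion(X, k)

# Evaluate gap_statistic(X, k) over a range of k and pick the k with the largest gap
# (the full method instead picks the smallest k satisfying a standard-error rule).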


Further reading

* Ralf Wagner, Sören W. Scholz, Reinhold Decker (2005): The Number of Clusters in Market Segmentation, in: Daniel Baier, Reinhold Decker; Lars Schmidt-Thieme (Eds.): Data Analysis and Decision Support, Berlin, Springer, 157–176.


External links


Clustergram – cluster diagnostic plot – for visual diagnostics of choosing the number of (''k'') clusters (R code)
Eight methods for determining an optimal k value for ''k''-means analysis – Answer on Stack Overflow containing R code for several methods of computing an optimal value of ''k'' for ''k''-means cluster analysis
Partitioning and Clustering: How Many Classes? – Free seminar paper available on the HAL server, id=hal-02124947: two non-parametric methods are presented (bibliographic references are supplied), one for numerical variables (works with an array of distances, not necessarily Euclidean), one for categorical variables (optimal partitioning; works also with signed dissimilarities).