In cluster analysis, the elbow method is a heuristic used in determining the number of clusters in a data set. The method consists of plotting the explained variation as a function of the number of clusters and picking the elbow of the curve as the number of clusters to use. The same method can be used to choose the number of parameters in other data-driven models, such as the number of principal components to describe a data set. The method can be traced to speculation by Robert L. Thorndike in 1953.
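
As a minimal sketch of how such a curve is produced in practice (not part of the original description of the method; the scikit-learn and matplotlib libraries, the synthetic data, and all parameter values below are assumptions made for illustration), one can run ''k''-means for a range of cluster counts and plot the total within-cluster sum of squares – whose decrease corresponds to the explained variation – against the number of clusters, then look for the elbow by eye.

import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic example data: 500 points drawn from 4 noisy groups.
X, _ = make_blobs(n_samples=500, centers=4, cluster_std=1.0, random_state=0)

# Total within-cluster sum of squares (k-means "inertia") for k = 1..10.
ks = list(range(1, 11))
wcss = [KMeans(n_clusters=k, n_init=10, random_state=0).fit(X).inertia_ for k in ks]

# Plot the curve; the elbow is chosen by visual inspection (here it should
# appear near k = 4, the number of groups used to generate the data).
plt.plot(ks, wcss, marker="o")
plt.xlabel("number of clusters k")
plt.ylabel("within-cluster sum of squares")
plt.show()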


Intuition

Using the "elbow" or " knee of a curve" as a cutoff point is a common heuristic in
mathematical optimization Mathematical optimization (alternatively spelled ''optimisation'') or mathematical programming is the selection of a best element, with regard to some criterion, from some set of available alternatives. It is generally divided into two subfi ...
to choose a point where
diminishing returns In economics, diminishing returns are the decrease in marginal (incremental) output of a production process as the amount of a single factor of production is incrementally increased, holding all other factors of production equal ( ceteris paribu ...
are no longer worth the additional cost. In clustering, this means one should choose a number of clusters so that adding another cluster doesn't give much better modeling of the data. The intuition is that increasing the number of clusters will naturally improve the fit (explain more of the variation), since there are more parameters (more clusters) to use, but that at some point this is
over-fitting mathematical modeling, overfitting is "the production of an analysis that corresponds too closely or exactly to a particular set of data, and may therefore fail to fit to additional data or predict future observations reliably". An overfitt ...
, and the elbow reflects this. For example, given data that actually consist of ''k'' labeled groups – for example, ''k'' points sampled with noise – clustering with more than ''k'' clusters will "explain" more of the variation (since it can use smaller, tighter clusters), but this is over-fitting, since it is subdividing the labeled groups into multiple clusters. The idea is that the first clusters will add much information (explain a lot of variation), since the data actually consist of that many groups (so these clusters are necessary), but once the number of clusters exceeds the actual number of groups in the data, the added information will drop sharply, because it is just subdividing the actual groups. Assuming this happens, there will be a sharp elbow in the graph of explained variation versus clusters: increasing rapidly up to ''k'' ( under-fitting region), and then increasing slowly after ''k'' (over-fitting region). In practice there may not be a sharp elbow, and as a heuristic method, such an "elbow" cannot always be unambiguously identified.
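
This intuition can be made concrete with a small numerical sketch (again an illustration only; scikit-learn, the generated data, and the parameter values are assumptions): generate data with a known number of groups and print how much each additional cluster reduces the unexplained variation. The reduction is large while the number of clusters is below the true number of groups and drops sharply afterwards.

from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Data with a known number of groups (here 3), sampled with noise.
X, _ = make_blobs(n_samples=600, centers=3, cluster_std=1.5, random_state=1)

# Unexplained variation (within-cluster sum of squares) for k = 1..8.
wcss = [KMeans(n_clusters=k, n_init=10, random_state=1).fit(X).inertia_
        for k in range(1, 9)]

# Marginal improvement from each additional cluster: large up to the true
# number of groups, then small (the elbow).
for k, (prev, curr) in enumerate(zip(wcss, wcss[1:]), start=2):
    print(f"{k - 1} -> {k} clusters: within-cluster sum of squares falls by {prev - curr:.0f}")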


Measures of variation

There are various measures of "explained variation" used in the elbow method. Most commonly, variation is quantified by ''variance'', and the ratio used is the ratio of between-group variance to the total variance. Alternatively, one uses the ratio of between-group variance to within-group variance, which is the one-way ANOVA ''F''-test statistic.
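
The sketch below shows one way to compute both quantities, assuming NumPy, a data matrix, and hard cluster labels already produced by some clustering algorithm; the helper function name is invented for illustration, and sums of squared Euclidean distances are used so that the computation also applies to multivariate data.

import numpy as np

def variation_ratios(X, labels):
    # Returns the proportion of total variance explained (between-group sum of
    # squares over total sum of squares) and the between/within ratio in the
    # form of the one-way ANOVA F statistic.
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    grand_mean = X.mean(axis=0)
    n, k = len(X), len(np.unique(labels))

    ss_total = ((X - grand_mean) ** 2).sum()
    ss_within = sum(((X[labels == c] - X[labels == c].mean(axis=0)) ** 2).sum()
                    for c in np.unique(labels))
    ss_between = ss_total - ss_within

    explained = ss_between / ss_total
    f_stat = (ss_between / (k - 1)) / (ss_within / (n - k))
    return explained, f_stat

# Example usage with labels from a clustering algorithm, e.g. k-means:
#   explained, f = variation_ratios(X, KMeans(n_clusters=4).fit(X).labels_)

Plotting either quantity against the number of clusters gives a curve on which the elbow can be sought.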


See also

* Determining the number of clusters in a data set
* Scree plot

