Davies–Bouldin Index

The Davies–Bouldin index (DBI), introduced by David L. Davies and Donald W. Bouldin in 1979, is a metric for evaluating clustering algorithms. It is an internal evaluation scheme, in which the quality of the clustering is assessed using quantities and features inherent to the dataset. A drawback of this approach is that a good value reported by the index does not imply the best information retrieval.


Preliminaries

Given ''n''-dimensional points, let ''C''''i'' be a cluster of data points. Let ''X''''j'' be an ''n''-dimensional feature vector assigned to cluster ''C''''i''.

: S_i = \left(\frac{1}{T_i} \sum_{j=1}^{T_i} \left\| X_j - A_i \right\|_p^q \right)^{1/q}

Here A_i is the centroid of ''C''''i'' and ''T''''i'' is the size of cluster ''i''. S_i is the ''q''th root of the ''q''th moment of the points in cluster ''i'' about the mean. If q=1 then S_i is the average distance between the feature vectors in cluster ''i'' and the centroid of the cluster. Usually the value of ''p'' is 2, which makes this a Euclidean distance function. Many other distance metrics can be used, for instance in the case of manifolds and higher-dimensional data, where the Euclidean distance may not be the best measure for determining the clusters. Note that this distance metric has to match the metric used in the clustering scheme itself for meaningful results.

: M_{i,j} = \left\| A_i - A_j \right\|_p = \left(\sum_{k=1}^{n} \left| a_{k,i} - a_{k,j} \right|^p \right)^{1/p}

M_{i,j} is a measure of separation between cluster C_i and cluster C_j. a_{k,i} is the ''k''th element of A_i, and there are ''n'' such elements because A_i is an ''n''-dimensional centroid. Here ''k'' indexes the features of the data, and this is essentially the Euclidean distance between the centers of clusters ''i'' and ''j'' when ''p'' equals 2.
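
As a concrete illustration, the following is a minimal NumPy sketch of these two quantities, assuming ''p'' = 2 and ''q'' = 2; the function names ''scatter'' and ''separation'' are chosen for this example and are not from the original paper.

<syntaxhighlight lang="python">
import numpy as np

def scatter(points, p=2, q=2):
    # S_i: q-th root of the q-th moment of the cluster's points about its centroid A_i
    centroid = points.mean(axis=0)
    dists = np.linalg.norm(points - centroid, ord=p, axis=1)
    return np.mean(dists ** q) ** (1.0 / q)

def separation(centroid_i, centroid_j, p=2):
    # M_ij: Minkowski distance of order p between the two cluster centroids
    return np.linalg.norm(centroid_i - centroid_j, ord=p)

# Two small 2-D clusters
c0 = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
c1 = np.array([[5.0, 5.0], [6.0, 5.0], [5.0, 6.0]])
S0, S1 = scatter(c0), scatter(c1)
M01 = separation(c0.mean(axis=0), c1.mean(axis=0))
</syntaxhighlight>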


Definition

Let ''R''''i,j'' be a measure of how good the clustering scheme is. This measure, by definition, has to account for ''M''''i,j'', the separation between the ''i''th and the ''j''th cluster, which ideally has to be as large as possible, and ''S''''i'', the within-cluster scatter for cluster ''i'', which has to be as low as possible. Hence the Davies–Bouldin index is defined as the ratio of ''S''''i'' and ''M''''i,j'' such that these properties are conserved:
# R_{i,j} \geqslant 0 .
# R_{i,j} = R_{j,i} .
# When S_j \geqslant S_k and M_{i,j} = M_{i,k} then R_{i,j} > R_{i,k} .
# When S_j = S_k and M_{i,j} \leqslant M_{i,k} then R_{i,j} > R_{i,k} .
With this formulation, the lower the value, the better the separation of the clusters and the 'tightness' inside the clusters. A solution that satisfies these properties is:

: R_{i,j} = \frac{S_i + S_j}{M_{i,j}}

This is used to define ''D''''i'':

: D_i \equiv \max_{j \neq i} R_{i,j}

If ''N'' is the number of clusters:

: \mathit{DB} \equiv \frac{1}{N} \sum_{i=1}^{N} D_i

''DB'' is called the Davies–Bouldin index. It depends both on the data and on the algorithm. ''D''''i'' chooses the worst-case scenario, and this value is equal to ''R''''i,j'' for the cluster most similar to cluster ''i''. There are many possible variations of this formulation, such as choosing the average cluster similarity, a weighted average, and so on.
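
The following is a from-scratch sketch of the full index under the common choice ''p'' = 2 and ''q'' = 1 (Euclidean distance, mean distance to the centroid); the function name davies_bouldin is illustrative and not a standard API.

<syntaxhighlight lang="python">
import numpy as np

def davies_bouldin(X, labels):
    # Davies–Bouldin index with p = 2 (Euclidean) and q = 1 (mean distance to centroid).
    clusters = np.unique(labels)
    centroids = np.array([X[labels == c].mean(axis=0) for c in clusters])
    # S_i: average distance of the points of cluster i to its centroid
    S = np.array([
        np.linalg.norm(X[labels == c] - centroids[k], axis=1).mean()
        for k, c in enumerate(clusters)
    ])
    N = len(clusters)
    D = np.empty(N)
    for i in range(N):
        # R_ij = (S_i + S_j) / M_ij over all j != i; D_i keeps the worst (largest) ratio
        R = [(S[i] + S[j]) / np.linalg.norm(centroids[i] - centroids[j])
             for j in range(N) if j != i]
        D[i] = max(R)
    return D.mean()  # DB: mean of the worst-case similarities

# Example: three well-separated blobs give a low index
X = np.array([[0, 0], [0, 1], [1, 0],
              [10, 10], [10, 11], [11, 10],
              [0, 10], [0, 11], [1, 10]], dtype=float)
labels = np.array([0, 0, 0, 1, 1, 1, 2, 2, 2])
print(davies_bouldin(X, labels))
</syntaxhighlight>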


Explanation

Lower index values indicate a better clustering result. The index is improved (lowered) by increased separation between clusters and decreased variation within clusters. These conditions constrain the index to be symmetric and non-negative. Because it is defined as a function of the ratio of the within-cluster scatter to the between-cluster separation, a lower value means that the clustering is better. The index is the similarity between each cluster and the cluster most similar to it, averaged over all the clusters, where the similarity is the measure ''R''''i,j'' defined above. This reflects the idea that no cluster should be similar to another, and hence the best clustering scheme essentially minimizes the Davies–Bouldin index. Since the index so defined is an average over all ''N'' clusters, a good way of deciding how many clusters actually exist in the data is to plot it against the number of clusters it is calculated over. The number of clusters for which this value is lowest is a good estimate of the number of clusters into which the data could ideally be classified. This has applications in deciding the value of ''k'' in the ''k''-means algorithm, where the value of ''k'' is not known a priori.
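
As a sketch of this model-selection use, one can scan candidate values of ''k'' with ''k''-means and keep the value with the lowest index; the example below uses scikit-learn and a synthetic dataset whose parameters (500 samples, 4 blobs, the range 2–9) are arbitrary choices for illustration.

<syntaxhighlight lang="python">
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score

# Synthetic data whose "true" number of clusters the algorithm does not know
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

scores = {}
for k in range(2, 10):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = davies_bouldin_score(X, labels)

best_k = min(scores, key=scores.get)  # the k with the lowest index
print(best_k, scores[best_k])
</syntaxhighlight>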


Soft version of the Davies–Bouldin index

The Davies–Bouldin index has recently been extended to the domain of soft clustering. This extension is based on clustering within the framework of fuzzy logic. The starting point for this version of the validation index is the result of a given soft clustering algorithm (e.g. fuzzy c-means), consisting of the computed clustering partitions and the membership values associating the elements with the clusters. In the soft domain, each element of the system belongs to every class, to a degree given by its membership values. Therefore, this soft version of the Davies–Bouldin index is able to take into account, in addition to standard validation measures such as compactness and separation of clusters, the degree to which elements belong to classes.


Implementations

The scikit-learn Python open-source library provides an implementation of this metric in the sklearn.metrics module. R provides a similar implementation in its ''clusterSim'' package. A Java implementation is found in ELKI, and can be compared to many other clustering quality indexes.
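
A minimal usage sketch of the scikit-learn function (the toy data and labels below stand in for the output of any clustering algorithm):

<syntaxhighlight lang="python">
from sklearn.metrics import davies_bouldin_score

X = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
labels = [0, 0, 0, 1, 1, 1]  # cluster assignment for each point
print(davies_bouldin_score(X, labels))  # low value: tight, well-separated clusters
</syntaxhighlight>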


See also

* Silhouette score
* Dunn index
* Cluster analysis
* Calinski–Harabasz index
* Determining the number of clusters in a data set
* DBCV index

