Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. It was proposed by Belgian statistician Peter Rousseeuw in 1987. The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.
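
For instance, a minimal sketch using the scikit-learn library (an assumed dependency, not part of the original article; the toy data and parameters are illustrative) scores the same k-means clustering under both of these metrics:

<syntaxhighlight lang="python">
# Sketch: silhouette scores under two distance metrics, assuming
# scikit-learn and NumPy are installed.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
# Two well-separated Gaussian blobs as toy data.
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(8, 1, (50, 2))])
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

print(silhouette_score(X, labels, metric="euclidean"))  # near +1
print(silhouette_score(X, labels, metric="manhattan"))  # also near +1
</syntaxhighlight>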


Definition

Assume the data have been clustered via any technique, such as k-medoids or k-means, into k clusters.

For data point i \in C_I (data point i in the cluster C_I), let

:a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I, i \neq j} d(i, j)

be the mean distance between i and all other data points in the same cluster, where |C_I| is the number of points belonging to cluster C_I, and d(i, j) is the distance between data points i and j in the cluster C_I (we divide by |C_I| - 1 because we do not include the distance d(i, i) in the sum). We can interpret a(i) as a measure of how well i is assigned to its cluster (the smaller the value, the better the assignment).

We then define the mean dissimilarity of point i to some cluster C_J as the mean of the distance from i to all points in C_J (where C_J \neq C_I). For each data point i \in C_I, we now define

:b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)

to be the ''smallest'' (hence the \min operator in the formula) mean distance of i to all points in any other cluster (i.e., in any cluster of which i is not a member). The cluster with this smallest mean dissimilarity is said to be the "neighboring cluster" of i because it is the next best fit cluster for point i.

We now define a ''silhouette'' (value) of one data point i:

:s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, if |C_I| > 1,

and

:s(i) = 0, if |C_I| = 1.

This can also be written as:

:s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i) \end{cases}

From the above definition it is clear that

:-1 \le s(i) \le 1.

Note that s(i) is not clearly defined for clusters of size 1, in which case we set s(i) = 0. This choice is arbitrary, but neutral in the sense that it is at the midpoint of the bounds −1 and 1.

For s(i) to be close to 1 we require a(i) \ll b(i). As a(i) is a measure of how dissimilar i is to its own cluster, a small value means it is well matched. Furthermore, a large b(i) implies that i is badly matched to its neighbouring cluster. Thus an s(i) close to 1 means that the data point is appropriately clustered. If s(i) is close to −1, then by the same logic we see that i would be more appropriately placed in its neighbouring cluster. An s(i) near zero means that the datum is on the border of two natural clusters.

The mean s(i) over all points of a cluster is a measure of how tightly grouped all the points in the cluster are. Thus the mean s(i) over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of k is used in the clustering algorithm (e.g., k-means), some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and means may be used to determine the natural number of clusters within a dataset. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific.

Kaufman et al. introduced the term ''silhouette coefficient'' for the maximum value of the mean s(i) over all data of the entire dataset, i.e.,

:SC = \max_k \tilde{s}(k),

where \tilde{s}(k) represents the mean s(i) over all data of the entire dataset for a specific number of clusters k.
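
As an illustration, the following is a minimal sketch (assuming NumPy and Euclidean distances; the helper name silhouette_values is ours, not from the literature) that computes s(i) directly from the definitions above:

<syntaxhighlight lang="python">
import numpy as np

def silhouette_values(X, labels):
    """Compute s(i) for every point from the a(i)/b(i) definitions above."""
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    # Pairwise Euclidean distance matrix d(i, j).
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    clusters = np.unique(labels)
    s = np.zeros(len(X))
    for i in range(len(X)):
        own = labels == labels[i]
        n_own = own.sum()
        if n_own == 1:
            continue  # singleton cluster: s(i) is defined as 0
        # a(i): mean distance to the other members of i's own cluster;
        # d(i, i) = 0 is in the sum, so dividing by |C_I| - 1 excludes it.
        a = d[i, own].sum() / (n_own - 1)
        # b(i): smallest mean distance to the points of any other cluster.
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        s[i] = (b - a) / max(a, b)
    return s
</syntaxhighlight>

The mean of the returned values is \tilde{s}(k) for the given clustering; evaluating it for several values of k and taking the maximum yields the silhouette coefficient SC.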


Simplified Silhouette and Medoid Silhouette

Computing the silhouette coefficient needs all \mathcal{O}(N^2) pairwise distances, making this evaluation much more costly than clustering with k-means. For a clustering with centers \mu_{C_I} for each cluster C_I, we can use the following simplified silhouette for each point i \in C_I instead, which can be computed using only \mathcal{O}(Nk) distances:

:a'(i) = d(i, \mu_{C_I}) and b'(i) = \min_{C_J \neq C_I} d(i, \mu_{C_J}),

which has the additional benefit that a'(i) is always defined. We then define accordingly the simplified silhouette and simplified silhouette coefficient

:s'(i) = \frac{b'(i) - a'(i)}{\max\{a'(i), b'(i)\}}

:SC' = \max_k \frac{1}{N} \sum_i s'(i).

If the cluster centers are medoids (as in k-medoids clustering) instead of arithmetic means (as in k-means clustering), this is also called the medoid-based silhouette or medoid silhouette. If every object is assigned to the nearest medoid (as in k-medoids clustering), we know that a'(i) \leq b'(i), and hence

:s'(i) = \frac{b'(i) - a'(i)}{b'(i)} = 1 - \frac{a'(i)}{b'(i)}.
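
A corresponding sketch of the simplified silhouette (same assumptions as above: NumPy, Euclidean distances, and integer labels that index into the center array) needs only the point-to-center distances:

<syntaxhighlight lang="python">
import numpy as np

def simplified_silhouette(X, labels, centers):
    """Compute s'(i) using only the N*k distances to the cluster centers."""
    X = np.asarray(X, dtype=float)
    centers = np.asarray(centers, dtype=float)
    # dist[i, J] = d(i, mu_{C_J}) for every point i and every center J.
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=-1)
    idx = np.arange(len(X))
    a = dist[idx, labels]              # a'(i) = d(i, mu_{C_I})
    masked = dist.copy()
    masked[idx, labels] = np.inf       # exclude the point's own center
    b = masked.min(axis=1)             # b'(i) = min over the other centers
    return (b - a) / np.maximum(a, b)  # s'(i)
</syntaxhighlight>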


Silhouette Clustering

Instead of using the average silhouette to evaluate a clustering obtained from, e.g., k-medoids or k-means, we can try to directly find a solution that maximizes the silhouette. We do not have a closed-form solution to maximize this, but it will usually be best to assign points to the nearest cluster as done by these methods. Van der Laan et al. proposed to adapt the standard algorithm for k-medoids, PAM, for this purpose and call this algorithm PAMSIL (see the sketch after the complexity discussion below):
# Choose initial medoids by using PAM.
# Compute the average silhouette of this initial solution.
# For each pair of a medoid m and a non-medoid x:
## swap m and x
## compute the average silhouette of the resulting solution
## remember the best swap
## un-swap m and x for the next iteration
# If the best swap improves the average silhouette, perform it and return to step 3; otherwise stop.
The loop in step 3 is executed for \mathcal{O}(Nk) pairs, and involves computing the silhouette in \mathcal{O}(N^2), hence this algorithm needs \mathcal{O}(N^3 k i) time, where i is the number of iterations. Because this is a fairly expensive operation, the authors propose to also use the medoid-based silhouette, and call the resulting algorithm PAMMEDSIL. It needs \mathcal{O}(N^2 k^2 i) time. Batool et al. propose a similar algorithm under the name OSil, and propose a CLARA-like sampling strategy for larger data sets that solves the problem only for a subsample. By adopting recent improvements to the PAM algorithm, FastMSC reduces the runtime using the medoid silhouette to just \mathcal{O}(N^2 i).
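
The following is a simplified sketch of the PAMSIL swap loop (our illustration, not the authors' code): for brevity it starts from random medoids instead of a PAM initialization, takes a precomputed pairwise distance matrix, and re-evaluates the average silhouette from scratch for each candidate swap:

<syntaxhighlight lang="python">
import numpy as np

def avg_silhouette(d, labels):
    """Mean s(i) over all points, given a pairwise distance matrix d."""
    clusters = np.unique(labels)
    total = 0.0
    for i in range(len(d)):
        own = labels == labels[i]
        if own.sum() == 1:
            continue  # s(i) = 0 for singleton clusters
        a = d[i, own].sum() / (own.sum() - 1)
        b = min(d[i, labels == c].mean() for c in clusters if c != labels[i])
        total += (b - a) / max(a, b)
    return total / len(d)

def pamsil(d, k, max_iter=100, seed=0):
    """Greedy medoid swaps that maximize the average silhouette."""
    rng = np.random.default_rng(seed)
    n = len(d)
    medoids = rng.choice(n, size=k, replace=False)  # step 1 (random, not PAM)
    best = avg_silhouette(d, np.argmin(d[:, medoids], axis=1))  # step 2
    for _ in range(max_iter):
        best_swap = None
        for mi in range(k):                          # step 3: try all swaps
            for x in range(n):
                if x in medoids:
                    continue
                trial = medoids.copy()
                trial[mi] = x                        # swap medoid and non-medoid
                sil = avg_silhouette(d, np.argmin(d[:, trial], axis=1))
                if sil > best:
                    best, best_swap = sil, (mi, x)   # remember the best swap
        if best_swap is None:
            break                                    # step 4: no improvement
        medoids[best_swap[0]] = best_swap[1]         # perform the best swap
    return medoids, np.argmin(d[:, medoids], axis=1), best
</syntaxhighlight>

Given a Euclidean distance matrix d = np.linalg.norm(X[:, None] - X[None], axis=-1), calling pamsil(d, k) returns the medoids, the labels, and the achieved average silhouette.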


See also

* Davies–Bouldin index
* Determining the number of clusters in a data set

