Silhouette (clustering)

Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. [1] It was proposed by Belgian statistician Peter Rousseeuw in 1987.

The silhouette value is a measure of how similar an object is to its own cluster (cohesion) compared to other clusters (separation). The silhouette ranges from −1 to +1, where a high value indicates that the object is well matched to its own cluster and poorly matched to neighboring clusters. If most objects have a high value, then the clustering configuration is appropriate. If many points have a low or negative value, then the clustering configuration may have too many or too few clusters. A clustering with an average silhouette width of over 0.7 is considered to be "strong", a value over 0.5 "reasonable", and over 0.25 "weak"; but with increasing dimensionality of the data, it becomes difficult to achieve such high values because of the curse of dimensionality, as the distances become more similar. The silhouette score is specialized for measuring cluster quality when the clusters are convex-shaped, and may not perform well if the data clusters have irregular shapes or are of varying sizes. [2] The silhouette can be calculated with any distance metric, such as the Euclidean distance or the Manhattan distance.
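
As a quick illustration, the mean and per-point silhouette values of a clustering can be computed with an off-the-shelf implementation such as scikit-learn. The following is a minimal sketch; the synthetic data, the choice of k-means, and k = 3 are illustrative assumptions, not part of the definition:

    import numpy as np
    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score, silhouette_samples

    # Three well-separated synthetic blobs (illustrative data only)
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in (0, 5, 10)])

    labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

    print(silhouette_score(X, labels, metric="euclidean"))  # mean s(i) over all points
    print(silhouette_samples(X, labels)[:5])                # per-point silhouette values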

Definition

A plot showing silhouette scores of three types of animals from the Zoo dataset, as rendered by the Orange data mining suite. At the bottom of the plot, the silhouette identifies dolphin and porpoise as outliers in the group of mammals.

Assume the data have been clustered via any technique, such as k-medoids or k-means, into k clusters.

For data point $i \in C_I$ (data point $i$ in the cluster $C_I$), let

$$a(i) = \frac{1}{|C_I| - 1} \sum_{j \in C_I,\, i \neq j} d(i, j)$$

be the mean distance between $i$ and all other data points in the same cluster, where $|C_I|$ is the number of points belonging to cluster $C_I$, and $d(i, j)$ is the distance between data points $i$ and $j$ in the cluster $C_I$ (we divide by $|C_I| - 1$ because we do not include the distance $d(i, i)$ in the sum). We can interpret $a(i)$ as a measure of how well $i$ is assigned to its cluster (the smaller the value, the better the assignment).

We then define the mean dissimilarity of point $i$ to some cluster $C_J$ as the mean of the distance from $i$ to all points in $C_J$ (where $C_J \neq C_I$).

For each data point $i \in C_I$, we now define

$$b(i) = \min_{J \neq I} \frac{1}{|C_J|} \sum_{j \in C_J} d(i, j)$$

to be the smallest (hence the $\min$ operator in the formula) mean distance of $i$ to all points in any other cluster (i.e., in any cluster of which $i$ is not a member). The cluster with this smallest mean dissimilarity is said to be the "neighboring cluster" of $i$ because it is the next best fit cluster for point $i$.

We now define a silhouette (value) of one data point $i$:

$$s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}, \quad \text{if } |C_I| > 1$$

and

$$s(i) = 0, \quad \text{if } |C_I| = 1,$$

which can also be written as

$$s(i) = \begin{cases} 1 - a(i)/b(i), & \text{if } a(i) < b(i) \\ 0, & \text{if } a(i) = b(i) \\ b(i)/a(i) - 1, & \text{if } a(i) > b(i). \end{cases}$$

From the above definition it is clear that

$$-1 \le s(i) \le 1.$$

Note that $a(i)$ is not clearly defined for clusters of size 1, in which case we set $s(i) = 0$. This choice is arbitrary, but neutral in the sense that it is at the midpoint of the bounds, $-1$ and $+1$. [1]

For $s(i)$ to be close to 1 we require $a(i) \ll b(i)$. As $a(i)$ is a measure of how dissimilar $i$ is to its own cluster, a small value means it is well matched. Furthermore, a large $b(i)$ implies that $i$ is badly matched to its neighboring cluster. Thus an $s(i)$ close to 1 means that the data point is appropriately clustered. If $s(i)$ is close to $-1$, then by the same logic we see that $i$ would be more appropriately placed in its neighboring cluster. An $s(i)$ near zero means that the point is on the border of two natural clusters.
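
The definitions above translate directly into code. The following is an unoptimized sketch, assuming a precomputed $N \times N$ distance matrix D and integer cluster labels; the function name silhouette_values is ours:

    import numpy as np

    def silhouette_values(D, labels):
        """Per-point silhouette s(i), following the definitions of a(i) and b(i) above."""
        labels = np.asarray(labels)
        s = np.zeros(len(labels))
        for i in range(len(labels)):
            same = labels == labels[i]
            n_same = same.sum()
            if n_same == 1:                       # singleton cluster: s(i) is set to 0
                continue
            a = D[i, same].sum() / (n_same - 1)   # exclude d(i, i) = 0 from the mean
            b = min(D[i, labels == c].mean()      # mean distance to each other cluster
                    for c in np.unique(labels) if c != labels[i])
            s[i] = (b - a) / max(a, b)
        return s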

The mean s(i) over all points of a cluster is a measure of how tightly grouped all the points in the cluster are. Thus the mean s(i) over all data of the entire dataset is a measure of how appropriately the data have been clustered. If there are too many or too few clusters, as may occur when a poor choice of k is used in the clustering algorithm (e.g., k-means), some of the clusters will typically display much narrower silhouettes than the rest. Thus silhouette plots and means may be used to determine the natural number of clusters within a dataset. One can also increase the likelihood of the silhouette being maximized at the correct number of clusters by re-scaling the data using feature weights that are cluster specific. [3]
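
Per-cluster mean silhouette widths (the quantities summarized in a silhouette plot) can be obtained directly from the per-point values. A brief sketch, assuming scikit-learn's silhouette_samples; the helper name per_cluster_silhouette is ours:

    import numpy as np
    from sklearn.metrics import silhouette_samples

    def per_cluster_silhouette(X, labels):
        """Mean s(i) for each cluster; narrow values flag poorly separated clusters."""
        labels = np.asarray(labels)
        s = silhouette_samples(X, labels)
        return {c: s[labels == c].mean() for c in np.unique(labels)}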

Kaufman et al. introduced the term silhouette coefficient for the maximum value of the mean $s(i)$ over all data of the entire dataset, [4] i.e.,

$$SC = \max_k \, \tilde{s}(k),$$

where $\tilde{s}(k)$ represents the mean $s(i)$ over all data of the entire dataset for a specific number of clusters $k$.
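
In practice, this amounts to clustering the data for several candidate values of $k$ and keeping the value with the highest mean silhouette. A brief sketch; k-means, the candidate range 2–10, and the helper name silhouette_coefficient are illustrative assumptions:

    from sklearn.cluster import KMeans
    from sklearn.metrics import silhouette_score

    def silhouette_coefficient(X, k_range=range(2, 11)):
        """Return the number of clusters maximizing the mean silhouette, and that maximum."""
        scores = {}
        for k in k_range:
            labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
            scores[k] = silhouette_score(X, labels)
        best_k = max(scores, key=scores.get)
        return best_k, scores[best_k]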

Simplified Silhouette and Medoid Silhouette

Computing the silhouette coefficient needs all $\mathcal{O}(N^2)$ pairwise distances, making this evaluation much more costly than clustering with k-means. For a clustering with centers $\mu_{C_I}$ for each cluster $C_I$, we can use the following simplified silhouette for each point $i \in C_I$ instead, which can be computed using only $\mathcal{O}(N k)$ distances:

$$a'(i) = d(i, \mu_{C_I}) \quad \text{and} \quad b'(i) = \min_{C_J \neq C_I} d(i, \mu_{C_J}),$$

which has the additional benefit that $a'(i)$ is always defined. We then define accordingly the simplified silhouette and the simplified silhouette coefficient [5]

$$s'(i) = \frac{b'(i) - a'(i)}{\max\{a'(i), b'(i)\}} \quad \text{and} \quad SC' = \max_k \frac{1}{N} \sum_i s'(i).$$

If the cluster centers are medoids (as in k-medoids clustering) instead of arithmetic means (as in k-means clustering), this is also called the medoid-based silhouette [6] or medoid silhouette. [7]

If every object is assigned to the nearest medoid (as in k-medoids clustering), we know that $a'(i) \le b'(i)$, and hence $s'(i) = 1 - a'(i)/b'(i)$. [7]
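
A small sketch of the simplified silhouette, using only the $N \cdot k$ distances between points and cluster centers; the centers may be arithmetic means or medoids, and the helper name simplified_silhouette is ours:

    import numpy as np

    def simplified_silhouette(X, centers, labels):
        """Mean simplified silhouette s'(i), given points X, cluster centers, and labels."""
        labels = np.asarray(labels)
        # distances of every point to every center, shape (N, k)
        dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        a = dist[np.arange(len(X)), labels]        # a'(i): distance to own center
        other = dist.copy()
        other[np.arange(len(X)), labels] = np.inf  # mask the own cluster
        b = other.min(axis=1)                      # b'(i): distance to nearest other center
        return np.mean((b - a) / np.maximum(a, b))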

Silhouette Clustering

Instead of using the average silhouette to evaluate a clustering obtained from, e.g., k-medoids or k-means, we can try to directly find a solution that maximizes the Silhouette. We do not have a closed form solution to maximize this, but it will usually be best to assign points to the nearest cluster as done by these methods. Van der Laan et al. [6] proposed to adapt the standard algorithm for k-medoids, PAM, for this purpose and call this algorithm PAMSIL:

  1. Choose initial medoids by using PAM
  2. Compute the average silhouette of this initial solution
  3. For each pair of a medoid m and a non-medoid x
    1. swap m and x
    2. compute the average silhouette of the resulting solution
    3. remember the best swap
    4. un-swap m and x for the next iteration
  4. If the best swap improves the average silhouette, perform it and return to step 3; otherwise stop.
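
A compact Python sketch of this swap loop follows. It assumes a precomputed $N \times N$ distance matrix D and initial medoid indices (the PAM initialization of step 1 is omitted), and recomputes the full average silhouette for every candidate swap, exactly as in the procedure above:

    import numpy as np

    def _avg_silhouette(D, medoids):
        """Mean s(i) after assigning every point to its nearest medoid."""
        labels = np.argmin(D[:, medoids], axis=1)
        s = np.zeros(len(D))
        for i in range(len(D)):
            same = labels == labels[i]
            if same.sum() > 1:
                a = D[i, same].sum() / (same.sum() - 1)
                b = min(D[i, labels == c].mean()
                        for c in range(len(medoids)) if c != labels[i])
                s[i] = (b - a) / max(a, b)
        return s.mean()

    def pamsil(D, medoids):
        medoids = list(medoids)
        best = _avg_silhouette(D, medoids)
        while True:
            best_swap = None
            for mi in range(len(medoids)):                 # each medoid m ...
                for x in range(len(D)):                    # ... paired with each non-medoid x
                    if x in medoids:
                        continue
                    candidate = medoids.copy()
                    candidate[mi] = x                      # swap m and x
                    score = _avg_silhouette(D, candidate)  # silhouette of the resulting solution
                    if score > best:                       # remember the best swap
                        best, best_swap = score, (mi, x)
            if best_swap is None:                          # no improvement found: stop
                return medoids, best
            medoids[best_swap[0]] = best_swap[1]           # perform the best swap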

The loop in step 3 is executed for $\mathcal{O}(k(N - k))$ pairs, and involves computing the silhouette in $\mathcal{O}(N^2)$, hence this algorithm needs $\mathcal{O}(i \cdot k(N - k) \cdot N^2)$ time, where $i$ is the number of iterations.

Because this is a fairly expensive operation, the authors propose to also use the medoid-based silhouette, and call the resulting algorithm PAMMEDSIL. [6] It needs $\mathcal{O}(i \cdot k^2 \cdot N^2)$ time.

Batool et al. propose a similar algorithm under the name OSil, and propose a CLARA-like sampling strategy for larger data sets, that solves the problem only for a sub-sample. [8]

By adopting recent improvements to the PAM algorithm, FastMSC reduces the runtime using the medoid silhouette to just $\mathcal{O}(N^2)$. [7]

By starting with a maximum number of clusters $k_\mathrm{max}$ and iteratively removing the worst center (in terms of a change in silhouette) and re-optimizing, the best (highest medoid silhouette) clustering can be automatically determined. As data structures can be reused, this reduces the computation cost substantially over repeatedly running the algorithm for different numbers of clusters. [9] This algorithm needs the $\mathcal{O}(N^2)$ pairwise distances and is typically implemented with a pairwise distance matrix; the $\mathcal{O}(N^2)$ memory requirement is the main limiting factor for applying this to very large data sets.

See also

Median
Support vector machine
Hierarchical clustering
Information bottleneck method
Cluster analysis
k-nearest neighbors algorithm
k-means clustering
Medoid
Fuzzy clustering
k-medoids
Locality-sensitive hashing
k-medians clustering
Shape context
Neighbourhood components analysis
BIRCH
Clustering high-dimensional data
Davies–Bouldin index
Ward's method
Dunn index
Calinski–Harabasz index

References

  1. Peter J. Rousseeuw (1987). "Silhouettes: a Graphical Aid to the Interpretation and Validation of Cluster Analysis". Journal of Computational and Applied Mathematics. 20: 53–65. doi:10.1016/0377-0427(87)90125-7.
  2. Monshizadeh, Mehrnoosh; Khatri, Vikramajeet; Kantola, Raimo; Yan, Zheng (2022-11-01). "A deep density based and self-determining clustering approach to label unknown traffic". Journal of Network and Computer Applications. 207: 103513. doi:10.1016/j.jnca.2022.103513. ISSN 1084-8045. "However, both measures [silhouette coefficient and edge correlation] prefer convex-shaped clusters and cannot adapt to all cluster shapes produced by DBSCAN."
  3. R.C. de Amorim, C. Hennig (2015). "Recovering the number of clusters in data sets with noise features using feature rescaling factors". Information Sciences. 324: 126–145. arXiv:1602.06989. doi:10.1016/j.ins.2015.06.039. S2CID 315803.
  4. Leonard Kaufman; Peter J. Rousseeuw (1990). Finding Groups in Data: An Introduction to Cluster Analysis. Hoboken, NJ: Wiley-Interscience. p. 87. doi:10.1002/9780470316801. ISBN 9780471878766.
  5. Hruschka, E.R.; de Castro, L.N.; Campello, R.J.G.B. (2004). "Evolutionary Algorithms for Clustering Gene-Expression Data". Fourth IEEE International Conference on Data Mining (ICDM'04). IEEE. pp. 403–406. doi:10.1109/ICDM.2004.10073.
  6. Van der Laan, Mark; Pollard, Katherine; Bryan, Jennifer (2003). "A new partitioning around medoids algorithm". Journal of Statistical Computation and Simulation. 73 (8): 575–584. doi:10.1080/0094965031000136012. ISSN 0094-9655. S2CID 17437463.
  7. Lenssen, Lars; Schubert, Erich (2022). "Clustering by Direct Optimization of the Medoid Silhouette". International Conference on Similarity Search and Applications. pp. 190–204. arXiv:2209.12553. doi:10.1007/978-3-031-17849-8_15.
  8. Batool, Fatima; Hennig, Christian (2021). "Clustering with the Average Silhouette Width". Computational Statistics & Data Analysis. 158: 107190. arXiv:1910.11339. doi:10.1016/j.csda.2021.107190. S2CID 219260336.
  9. Lenssen, Lars; Schubert, Erich (2024-02-01). "Medoid Silhouette clustering with automatic cluster number selection". Information Systems. 120: 102290. arXiv:2309.03751. doi:10.1016/j.is.2023.102290. ISSN 0306-4379.