K-medoids

Last updated December 02, 2023

The $k$ -medoids problem is a clustering problem similar to $k$ -means. The name was coined by Leonard Kaufman and Peter J. Rousseeuw with their PAM (Partitioning Around Medoids) algorithm.^[1] Both the $k$ -means and $k$ -medoids algorithms are partitional (breaking the dataset up into groups) and attempt to minimize the distance between points labeled to be in a cluster and a point designated as the center of that cluster. In contrast to the $k$ -means algorithm, $k$ -medoids chooses actual data points as centers (medoids or exemplars), and thereby allows for greater interpretability of the cluster centers than in $k$ -means, where the center of a cluster is not necessarily one of the input data points (it is the average between the points in the cluster). Furthermore, $k$ -medoids can be used with arbitrary dissimilarity measures, whereas $k$ -means generally requires Euclidean distance for efficient solutions. Because $k$ -medoids minimizes a sum of pairwise dissimilarities instead of a sum of squared Euclidean distances, it is more robust to noise and outliers than $k$ -means.

$k$ -medoids is a classical partitioning technique of clustering that splits the data set of $n$ objects into $k$ clusters, where the number $k$ of clusters assumed known a priori (which implies that the programmer must specify k before the execution of a $k$ -medoids algorithm). The "goodness" of the given value of $k$ can be assessed with methods such as the silhouette method.

The medoid of a cluster is defined as the object in the cluster whose sum (and, equivalently, the average) of dissimilarities to all the objects in the cluster is minimal, that is, it is a most centrally located point in the cluster.

Algorithms

In general, the $k$ -medoids problem is NP-hard to solve exactly. As such, many heuristic solutions exist.

Partitioning Around Medoids (PAM)

PAM^[1] uses a greedy search which may not find the optimum solution, but it is faster than exhaustive search. It works as follows:

(BUILD) Initialize: greedily select $k$ of the $n$ data points as the medoids to minimize the cost
Associate each data point to the closest medoid.
(SWAP) While the cost of the configuration decreases:
1. For each medoid m, and for each non-medoid data point o:
  1. Consider the swap of $m$ and $o$ , and compute the cost change
  2. If the cost change is the current best, remember this m and o combination
2. Perform the best swap of $m_{\text{best}}$ and $o_{\text{best}}$ , if it decreases the cost function. Otherwise, the algorithm terminates.

The runtime complexity of the original PAM algorithm per iteration of (3) is $O(k(n-k)^{2})$ , by only computing the change in cost. A naive implementation recomputing the entire cost function every time will be in $O(n^{2}k^{2})$ . This runtime can be further reduced to $O(n^{2})$ , by splitting the cost change into three parts such that computations can be shared or avoided (FastPAM). The runtime can further be reduced by eagerly performing swaps (FasterPAM),^[2] at which point a random initialization becomes a viable alternative to BUILD.

Alternating Optimization

Algorithms other than PAM have also been suggested in the literature, including the following Voronoi iteration method known as the "Alternating" heuristic in literature, as it alternates between two optimization steps:^[3]^[4]^[5]

Select initial medoids randomly
Iterate while the cost decreases:
1. In each cluster, make the point that minimizes the sum of distances within the cluster the medoid
2. Reassign each point to the cluster defined by the closest medoid determined in the previous step

k-means-style Voronoi iteration tends to produce worse results, and exhibit "erratic behavior".^[6]^: 957 Because it does not allow re-assigning points to other clusters while updating means it only explores a smaller search space. It can be shown that even in simple cases this heuristic finds inferior solutions the swap based methods can solve.^[2]

Hierarchical Clustering

Multiple variants of hierarchical clustering with a "medoid linkage" have been proposed. The Minimum Sum linkage criterion^[7] directly uses the objective of medoids, but the Minimum Sum Increase linkage was shown to produce better results (similar to how Ward linkage uses the increase in squared error). Earlier approaches simply used the distance of the cluster medoids of the previous medoids as linkage measure,^[8]^[9] but which tends to result in worse solutions, as the distance of two medoids does not ensure there exists a good medoid for the combination. These approaches have a run time complexity of $O(n^{3})$ , and when the dendrogram is cut at a particular number of clusters k, the results will typically be worse than the results found by PAM.^[7] Hence these methods are primarily of interest when a hierarchical tree structure is desired.

Other Algorithms

Other approximate algorithms such as CLARA and CLARANS trade quality for runtime. CLARA applies PAM on multiple subsamples, keeping the best result. CLARANS works on the entire data set, but only explores a subset of the possible swaps of medoids and non-medoids using sampling. BanditPAM uses the concept of multi-armed bandits to choose candidate swaps instead of uniform sampling as in CLARANS.^[10]

Software

ELKI includes several k-medoid variants, including a Voronoi-iteration k-medoids, the original PAM algorithm, Reynolds' improvements, and the O(n²) FastPAM and FasterPAM algorithms, CLARA, CLARANS, FastCLARA and FastCLARANS.
Julia contains a k-medoid implementation of the k-means style algorithm (fast, but much worse result quality) in the JuliaStats/Clustering.jl package.
KNIME includes a k-medoid implementation supporting a variety of efficient matrix distance measures, as well as a number of native (and integrated third-party) k-means implementations
Python contains FasterPAM and other variants in the "kmedoids" package, additional implementations can be found in many other packages
R contains PAM in the "cluster" package, including the FasterPAM improvements via the options variant = "faster" and medoids = "random". There also exists a "fastkmedoids" package.
RapidMiner has an operator named KMedoids, but it does not implement any of above KMedoids algorithms. Instead, it is a k-means variant, that substitutes the mean with the closest data point (which is not the medoid), which combines the drawbacks of k-means (limited to coordinate data) with the additional cost of finding the nearest point to the mean.
Rust has a "kmedoids" crate that also includes the FasterPAM variant.
MATLAB implements PAM, CLARA, and two other algorithms to solve the k-medoid clustering problem.

Related Research Articles

A* is a graph traversal and path search algorithm, which is used in many fields of computer science due to its completeness, optimality, and optimal efficiency. One major practical drawback is its $space complexity, as it stores all generated nodes in memory. Thus, in practical travel-routing systems, it is generally outperformed by algorithms that can pre-process the graph to attain better performance, as well as memory-bounded approaches; however, A* is still the best solution in many cases.$

In data mining and statistics, hierarchical clustering is a method of cluster analysis that seeks to build a hierarchy of clusters. Strategies for hierarchical clustering generally fall into two categories:

<span class="mw-page-title-main">Cluster analysis</span> Grouping a set of objects by similarity

Cluster analysis or clustering is the task of grouping a set of objects in such a way that objects in the same group are more similar to each other than to those in other groups (clusters). It is a main task of exploratory data analysis, and a common technique for statistical data analysis, used in many fields, including pattern recognition, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning.

In mathematics, computer science and especially graph theory, a distance matrix is a square matrix containing the distances, taken pairwise, between the elements of a set. Depending upon the application involved, the distance being used to define this matrix may or may not be a metric. If there are $N$ elements, this matrix will have size $N \times N$ . In graph-theoretic applications, the elements are more often referred to as points, nodes or vertices.

k-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean, serving as a prototype of the cluster. This results in a partitioning of the data space into Voronoi cells. k-means clustering minimizes within-cluster variances, but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances. For instance, better Euclidean solutions can be found using k-medians and k-medoids.

Medoids are representative objects of a data set or a cluster within a data set whose sum of dissimilarities to all the objects in the cluster is minimal. Medoids are similar in concept to means or centroids, but medoids are always restricted to be members of the data set. Medoids are most commonly used on data when a mean or centroid cannot be defined, such as graphs. They are also used in contexts where the centroid is not representative of the dataset like in images, 3-D trajectories and gene expression. These are also of interest while wanting to find a representative using some distance other than squared euclidean distance.

In statistics, k-medians clustering is a cluster analysis algorithm. It is a variation of k-means clustering where instead of calculating the mean for each cluster to determine its centroid, one instead calculates the median. This has the effect of minimizing error over all clusters with respect to the 1-norm distance metric, as opposed to the squared 2-norm distance metric

Density-based spatial clustering of applications with noise (DBSCAN) is a data clustering algorithm proposed by Martin Ester, Hans-Peter Kriegel, Jörg Sander and Xiaowei Xu in 1996. It is a density-based clustering non-parametric algorithm: given a set of points in some space, it groups together points that are closely packed together, marking as outliers points that lie alone in low-density regions . DBSCAN is one of the most common, and most commonly cited, clustering algorithms.

BIRCH is an unsupervised data mining algorithm used to perform hierarchical clustering over particularly large data-sets. With modifications it can also be used to accelerate k-means clustering and Gaussian mixture modeling with the expectation–maximization algorithm. An advantage of BIRCH is its ability to incrementally and dynamically cluster incoming, multi-dimensional metric data points in an attempt to produce the best quality clustering for a given set of resources. In most cases, BIRCH only requires a single scan of the database.

Silhouette refers to a method of interpretation and validation of consistency within clusters of data. The technique provides a succinct graphical representation of how well each object has been classified. It was proposed by Belgian statistician Peter Rousseeuw in 1987.

Ordering points to identify the clustering structure (OPTICS) is an algorithm for finding density-based clusters in spatial data. It was presented by Mihael Ankerst, Markus M. Breunig, Hans-Peter Kriegel and Jörg Sander. Its basic idea is similar to DBSCAN, but it addresses one of DBSCAN's major weaknesses: the problem of detecting meaningful clusters in data of varying density. To do so, the points of the database are (linearly) ordered such that spatially closest points become neighbors in the ordering. Additionally, a special distance is stored for each point that represents the density that must be accepted for a cluster so that both points belong to the same cluster. This is represented as a dendrogram.

CURE is an efficient data clustering algorithm for large databases. Compared with K-means clustering it is more robust to outliers and able to identify clusters having non-spherical shapes and size variances.

In computer science, data stream clustering is defined as the clustering of data that arrive continuously such as telephone records, multimedia data, financial transactions etc. Data stream clustering is usually studied as a streaming algorithm and the objective is, given a sequence of points, to construct a good clustering of the stream, using a small amount of memory and time.

ELKI is a data mining software framework developed for use in research and teaching. It was originally at the database systems research unit of Professor Hans-Peter Kriegel at the Ludwig Maximilian University of Munich, Germany, and now continued at the Technical University of Dortmund, Germany. It aims at allowing the development and evaluation of advanced data mining algorithms and their interaction with database index structures.

Peter J. Rousseeuw is a statistician known for his work on robust statistics and cluster analysis. He obtained his PhD in 1981 at the Vrije Universiteit Brussel, following research carried out at the ETH in Zurich, which led to a book on influence functions. Later he was professor at the Delft University of Technology, The Netherlands, at the University of Fribourg, Switzerland, and at the University of Antwerp, Belgium. Next he was a senior researcher at Renaissance Technologies. He then returned to Belgium as professor at KU Leuven, until becoming emeritus in 2022. His former PhD students include Annick Leroy, Hendrik Lopuhaä, Geert Molenberghs, Christophe Croux, Mia Hubert, Stefan Van Aelst, Tim Verdonck and Jakob Raymaekers.

In statistics, Ward's method is a criterion applied in hierarchical cluster analysis. Ward's minimum variance method is a special case of the objective function approach originally presented by Joe H. Ward, Jr. Ward suggested a general agglomerative hierarchical clustering procedure, where the criterion for choosing the pair of clusters to merge at each step is based on the optimal value of an objective function. This objective function could be "any function that reflects the investigator's purpose." Many of the standard clustering procedures are contained in this very general class. To illustrate the procedure, Ward used the example where the objective function is the error sum of squares, and this example is known as Ward's method or more precisely Ward's minimum variance method.

In the theory of cluster analysis, the nearest-neighbor chain algorithm is an algorithm that can speed up several methods for agglomerative hierarchical clustering. These are methods that take a collection of points as input, and create a hierarchy of clusters of points by repeatedly merging pairs of smaller clusters to form larger clusters. The clustering methods that the nearest-neighbor chain algorithm can be used for include Ward's method, complete-linkage clustering, and single-linkage clustering; these all work by repeatedly merging the closest two clusters but use different definitions of the distance between clusters. The cluster distances for which the nearest-neighbor chain algorithm works are called reducible and are characterized by a simple inequality among certain cluster distances.

In statistics and data mining, affinity propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points. Unlike clustering algorithms such as $k$ -means or $k$ -medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm. Similar to $k$ -medoids, affinity propagation finds "exemplars," members of the input set that are representative of clusters.

In computational geometry, the farthest-first traversal of a compact metric space is a sequence of points in the space, where the first point is selected arbitrarily and each successive point is as far as possible from the set of previously-selected points. The same concept can also be applied to a finite set of geometric points, by restricting the selected points to belong to the set or equivalently by considering the finite metric space generated by these points. For a finite metric space or finite set of geometric points, the resulting sequence forms a permutation of the points, also known as the greedy permutation.

References

1 2 Kaufman, Leonard; Rousseeuw, Peter J. (1990-03-08), "Partitioning Around Medoids (Program PAM)", Wiley Series in Probability and Statistics, Hoboken, NJ, USA: John Wiley & Sons, Inc., pp. 68–125, doi:10.1002/9780470316801.ch2, ISBN 978-0-470-31680-1 , retrieved 2021-06-13
1 2 Schubert, Erich; Rousseeuw, Peter J. (2021). "Fast and eager k -medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms". Information Systems. 101: 101804. arXiv: 2008.05171 . doi:10.1016/j.is.2021.101804. S2CID 221103804.
↑ Maranzana, F. E. (1963). "On the location of supply points to minimize transportation costs". IBM Systems Journal. 2 (2): 129–135. doi:10.1147/sj.22.0129.
↑ T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Springer (2001), 468–469.
↑ Park, Hae-Sang; Jun, Chi-Hyuck (2009). "A simple and fast algorithm for K-medoids clustering". Expert Systems with Applications. 36 (2): 3336–3341. doi:10.1016/j.eswa.2008.01.039.
↑ Teitz, Michael B.; Bart, Polly (1968-10-01). "Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph". Operations Research. 16 (5): 955–961. doi:10.1287/opre.16.5.955. ISSN 0030-364X.
1 2 Schubert, Erich (2021). HACAM: Hierarchical Agglomerative Clustering Around Medoids – and its Limitations (PDF). LWDA’21: Lernen, Wissen, Daten, Analysen September 01–03, 2021, Munich, Germany. pp. 191–204 – via CEUR-WS.
↑ Miyamoto, Sadaaki; Kaizu, Yousuke; Endo, Yasunori (2016). Hierarchical and Non-Hierarchical Medoid Clustering Using Asymmetric Similarity Measures. 2016 Joint 8th International Conference on Soft Computing and Intelligent Systems (SCIS) and 17th International Symposium on Advanced Intelligent Systems (ISIS). pp. 400–403. doi:10.1109/SCIS-ISIS.2016.0091.
↑ Herr, Dominik; Han, Qi; Lohmann, Steffen; Ertl, Thomas (2016). Visual Clutter Reduction through Hierarchy-based Projection of High-dimensional Labeled Data (PDF). Graphics Interface. Graphics Interface. doi:10.20380/gi2016.14 . Retrieved 2022-11-04.
↑ Tiwari, Mo; Zhang, Martin J.; Mayclin, James; Thrun, Sebastian; Piech, Chris; Shomorony, Ilan (2020). "BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits". Advances in Neural Information Processing Systems. 33.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[:2-1] 1 2 Kaufman, Leonard; Rousseeuw, Peter J. (1990-03-08), "Partitioning Around Medoids (Program PAM)", Wiley Series in Probability and Statistics, Hoboken, NJ, USA: John Wiley & Sons, Inc., pp. 68–125, doi:10.1002/9780470316801.ch2, ISBN 978-0-470-31680-1 , retrieved 2021-06-13

[:1-2] 1 2 Schubert, Erich; Rousseeuw, Peter J. (2021). "Fast and eager k -medoids clustering: O(k) runtime improvement of the PAM, CLARA, and CLARANS algorithms". Information Systems. 101: 101804. arXiv: 2008.05171 . doi:10.1016/j.is.2021.101804. S2CID 221103804.

[3] Maranzana, F. E. (1963). "On the location of supply points to minimize transportation costs". IBM Systems Journal. 2 (2): 129–135. doi:10.1147/sj.22.0129.

[EoSL-4] T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning, Springer (2001), 468–469.

[5] Park, Hae-Sang; Jun, Chi-Hyuck (2009). "A simple and fast algorithm for K-medoids clustering". Expert Systems with Applications. 36 (2): 3336–3341. doi:10.1016/j.eswa.2008.01.039.

[6] Teitz, Michael B.; Bart, Polly (1968-10-01). "Heuristic Methods for Estimating the Generalized Vertex Median of a Weighted Graph". Operations Research. 16 (5): 955–961. doi:10.1287/opre.16.5.955. ISSN 0030-364X.

[:0-7] 1 2 Schubert, Erich (2021). HACAM: Hierarchical Agglomerative Clustering Around Medoids – and its Limitations (PDF). LWDA’21: Lernen, Wissen, Daten, Analysen September 01–03, 2021, Munich, Germany. pp. 191–204 – via CEUR-WS.

[8] Miyamoto, Sadaaki; Kaizu, Yousuke; Endo, Yasunori (2016). Hierarchical and Non-Hierarchical Medoid Clustering Using Asymmetric Similarity Measures. 2016 Joint 8th International Conference on Soft Computing and Intelligent Systems (SCIS) and 17th International Symposium on Advanced Intelligent Systems (ISIS). pp. 400–403. doi:10.1109/SCIS-ISIS.2016.0091.

[9] Herr, Dominik; Han, Qi; Lohmann, Steffen; Ertl, Thomas (2016). Visual Clutter Reduction through Hierarchy-based Projection of High-dimensional Labeled Data (PDF). Graphics Interface. Graphics Interface. doi:10.20380/gi2016.14 . Retrieved 2022-11-04.

[10] Tiwari, Mo; Zhang, Martin J.; Mayclin, James; Thrun, Sebastian; Piech, Chris; Shomorony, Ilan (2020). "BanditPAM: Almost Linear Time k-Medoids Clustering via Multi-Armed Bandits". Advances in Neural Information Processing Systems. 33.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]