Affinity propagation

In statistics and data mining, affinity propagation (AP) is a clustering algorithm based on the concept of "message passing" between data points.[1] Unlike clustering algorithms such as k-means or k-medoids, affinity propagation does not require the number of clusters to be determined or estimated before running the algorithm. Like k-medoids, affinity propagation finds "exemplars", members of the input set that are representative of clusters.[1]

Algorithm

Let $x_1$ through $x_n$ be a set of data points, with no assumptions made about their internal structure, and let $s$ be a function that quantifies the similarity between any two points, such that $s(i, j) > s(i, k)$ iff $x_i$ is more similar to $x_j$ than to $x_k$. For this example, the negative squared Euclidean distance is used, i.e. for points $x_i$ and $x_k$,[1]

$$s(i, k) = -\left\| x_i - x_k \right\|^2 .$$

The diagonal of $s$ (i.e. $s(i, i)$) is particularly important, as it represents the instance preference, meaning how likely a particular instance is to become an exemplar. When it is set to the same value for all inputs, it controls how many classes the algorithm produces. A value close to the minimum possible similarity produces fewer classes, while a value close to or larger than the maximum possible similarity produces many classes. It is typically initialized to the median similarity of all pairs of inputs.
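As a concrete illustration of these two choices, here is a minimal NumPy sketch that builds the similarity matrix with negative squared Euclidean distances and sets the diagonal to the median pairwise similarity (the function and variable names are illustrative, not from the paper):

```python
import numpy as np

def similarity_matrix(X):
    """s(i, k) = -||x_i - x_k||^2, with the preference on the diagonal."""
    diff = X[:, None, :] - X[None, :, :]   # pairwise differences
    S = -(diff ** 2).sum(axis=-1)          # negative squared distances
    # Set every s(i, i) to the median of the off-diagonal similarities,
    # the common default that yields a moderate number of clusters.
    off_diag = S[~np.eye(len(S), dtype=bool)]
    np.fill_diagonal(S, np.median(off_diag))
    return S
```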

The algorithm proceeds by alternating between two message-passing steps, which update two matrices:[1]

- The "responsibility" matrix $R$ has values $r(i, k)$ that quantify how well-suited $x_k$ is to serve as the exemplar for $x_i$, relative to other candidate exemplars for $x_i$.
- The "availability" matrix $A$ contains values $a(i, k)$ that represent how "appropriate" it would be for $x_i$ to pick $x_k$ as its exemplar, taking into account other points' preference for $x_k$ as an exemplar.

Both matrices are initialized to all zeroes, and can be viewed as log-probability tables. The algorithm then performs the following updates iteratively:

First, responsibility updates are sent around:

$$r(i, k) \leftarrow s(i, k) - \max_{k' \neq k} \left\{ a(i, k') + s(i, k') \right\}.$$

Then, availabilities are updated:

$$a(i, k) \leftarrow \min\left(0,\; r(k, k) + \sum_{i' \notin \{i, k\}} \max\left(0, r(i', k)\right)\right) \quad \text{for } i \neq k, \text{ and}$$

$$a(k, k) \leftarrow \sum_{i' \neq k} \max\left(0, r(i', k)\right).$$

Iterations are performed until either the cluster boundaries remain unchanged over a number of iterations, or a predetermined number of iterations is reached. The exemplars are extracted from the final matrices as those points whose self-responsibility plus self-availability is positive (i.e. $r(i, i) + a(i, i) > 0$).
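The sketch below puts these update rules together in NumPy, under the assumptions above; the message updates are damped, as is standard practice to avoid numerical oscillations. All names are illustrative, not from the paper:

```python
import numpy as np

def affinity_propagation(S, max_iter=200, conv_iter=15, damping=0.5):
    """Run affinity propagation on a similarity matrix S.

    S[i, k] holds s(i, k); the diagonal holds the preferences.
    Returns exemplar indices and, for each point, the index of its exemplar.
    """
    n = S.shape[0]
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    exemplars, stable = np.array([], dtype=int), 0

    for _ in range(max_iter):
        # r(i,k) <- s(i,k) - max_{k' != k} [a(i,k') + s(i,k')]
        AS = A + S
        top = AS.argmax(axis=1)                      # best k' per row
        best = AS[np.arange(n), top]
        AS[np.arange(n), top] = -np.inf
        second = AS.max(axis=1)                      # runner-up per row
        max_excl = np.broadcast_to(best[:, None], (n, n)).copy()
        max_excl[np.arange(n), top] = second         # exclude k' = k itself
        R = damping * R + (1 - damping) * (S - max_excl)

        # a(i,k) <- min(0, r(k,k) + sum_{i' not in {i,k}} max(0, r(i',k))), i != k
        # a(k,k) <- sum_{i' != k} max(0, r(i',k))
        Rp = np.maximum(R, 0)
        np.fill_diagonal(Rp, R.diagonal())           # keep r(k,k) unclipped
        A_new = Rp.sum(axis=0, keepdims=True) - Rp   # drop i' = i from each sum
        diag = A_new.diagonal().copy()               # this is already a(k,k)
        A_new = np.minimum(A_new, 0)
        np.fill_diagonal(A_new, diag)
        A = damping * A + (1 - damping) * A_new

        # Exemplars: points with positive self-responsibility + self-availability.
        e = np.flatnonzero(R.diagonal() + A.diagonal() > 0)
        stable = stable + 1 if np.array_equal(e, exemplars) else 0
        exemplars = e
        if stable >= conv_iter:
            break

    if len(exemplars) == 0:                          # degenerate case: no exemplars
        return exemplars, np.full(n, -1)
    labels = exemplars[S[:, exemplars].argmax(axis=1)]  # most similar exemplar
    labels[exemplars] = exemplars                       # exemplars label themselves
    return exemplars, labels
```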

Applications

The inventors of affinity propagation showed that it outperforms k-means on certain computer vision and computational biology tasks, e.g. clustering pictures of human faces and identifying regulated transcripts,[1] even when k-means was allowed many random restarts and was initialized using PCA.[2] A study comparing affinity propagation and Markov clustering on protein interaction graph partitioning found Markov clustering to work better for that problem.[3] A semi-supervised variant has been proposed for text mining applications.[4] Another application is in economics, where affinity propagation was used to find temporal patterns in the output multipliers of the US economy between 1997 and 2017.[5]

Software
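Open-source implementations are available in several libraries; for example, scikit-learn provides sklearn.cluster.AffinityPropagation for Python, and the apcluster package implements the algorithm for R.

A minimal usage sketch with scikit-learn (the data here are made up for illustration; leaving preference at its default of None uses the median pairwise similarity, matching the initialization described above):

```python
import numpy as np
from sklearn.cluster import AffinityPropagation

# Two loose blobs of 2-D points, purely for illustration.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.5, size=(20, 2)),
               rng.normal(5.0, 0.5, size=(20, 2))])

ap = AffinityPropagation(damping=0.5, random_state=0).fit(X)
print("exemplar indices:", ap.cluster_centers_indices_)
print("cluster labels:", ap.labels_)
```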

References

  1. Brendan J. Frey; Delbert Dueck (2007). "Clustering by passing messages between data points". Science. 315 (5814): 972–976. Bibcode:2007Sci...315..972F. CiteSeerX 10.1.1.121.3145. doi:10.1126/science.1136800. PMID 17218491. S2CID 6502291.
  2. Delbert Dueck; Brendan J. Frey (2007). "Non-metric affinity propagation for unsupervised image categorization". Int'l Conf. on Computer Vision. doi:10.1109/ICCV.2007.4408853.
  3. James Vlasblom; Shoshana Wodak (2009). "Markov clustering versus affinity propagation for the partitioning of protein interaction graphs". BMC Bioinformatics. 10 (1): 99. doi:10.1186/1471-2105-10-99. PMC 2682798. PMID 19331680.
  4. Renchu Guan; Xiaohu Shi; Maurizio Marchese; Chen Yang; Yanchun Liang (2011). "Text Clustering with Seeds Affinity Propagation". IEEE Transactions on Knowledge & Data Engineering. 23 (4): 627–637. doi:10.1109/tkde.2010.144. hdl:11572/89884. S2CID 14053903.
  5. Almeida, Lucas Milanez de Lima; Balanco, Paulo Antonio de Freitas (2020). "Application of multivariate analysis as complementary instrument in studies about structural changes: An example of the multipliers in the US economy". Structural Change and Economic Dynamics. 53: 189–207. doi:10.1016/j.strueco.2020.02.006. ISSN 0954-349X. S2CID 216406772.