Spectral clustering

Figure: An example connected graph, with 6 vertices.
Figure: Partitioning into two connected graphs.

In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.


In application to image segmentation, spectral clustering is known as segmentation-based object categorization.

Definitions

Given an enumerated set of data points, the similarity matrix may be defined as a symmetric matrix $A$, where $A_{ij} \geq 0$ represents a measure of the similarity between data points with indices $i$ and $j$. The general approach to spectral clustering is to use a standard clustering method (there are many such methods; k-means is discussed below) on relevant eigenvectors of a Laplacian matrix of $A$. There are many different ways to define a Laplacian, which have different mathematical interpretations, and so the clustering will also have different interpretations. The relevant eigenvectors are the ones that correspond to the several smallest eigenvalues of the Laplacian, except for the smallest eigenvalue, which will have a value of 0. For computational efficiency, these eigenvectors are often computed as the eigenvectors corresponding to the largest several eigenvalues of a function of the Laplacian.
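
For instance, the similarity matrix is often built with the Gaussian (RBF) kernel mentioned later in this article. A minimal sketch, where the data points X and the bandwidth sigma are illustrative placeholders:

```python
# A sketch of an RBF similarity matrix: A_ij = exp(-||x_i - x_j||^2 / (2 sigma^2)).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))           # hypothetical data: 100 points in 2-D
sigma = 1.0                             # hypothetical kernel bandwidth

sq_dists = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
A = np.exp(-sq_dists / (2 * sigma**2))  # symmetric, entries in (0, 1]
np.fill_diagonal(A, 0)                  # common convention: no self-similarity
```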

Laplacian matrix

A 2-dimensional spring system. Elastic network model.png
A 2-dimensional spring system.

Spectral clustering is well known to relate to partitioning of a mass-spring system, where each mass is associated with a data point and each spring stiffness corresponds to a weight of an edge describing a similarity of the two related data points, as in the spring system. Specifically, the classical reference [1] explains that the eigenvalue problem describing transversal vibration modes of a mass-spring system is exactly the same as the eigenvalue problem for the graph Laplacian matrix defined as

$L := D - A$,

where $D$ is the diagonal matrix of vertex degrees, $D_{ii} = \sum_{j} A_{ij}$, and $A$ is the adjacency matrix.

The masses that are tightly connected by the springs in the mass-spring system evidently move together from the equilibrium position in low-frequency vibration modes, so that the components of the eigenvectors corresponding to the smallest eigenvalues of the graph Laplacian can be used for meaningful clustering of the masses. For example, assuming that all the springs and the masses are identical in the 2-dimensional spring system pictured, one would intuitively expect that the loosest connected masses on the right-hand side of the system would move with the largest amplitude and in the opposite direction to the rest of the masses when the system is shaken — and this expectation will be confirmed by analyzing components of the eigenvectors of the graph Laplacian corresponding to the smallest eigenvalues, i.e., the smallest vibration frequencies.
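
A minimal sketch of this intuition, on a hypothetical toy graph of two triangles joined by a single edge (two tightly connected groups of masses, loosely coupled):

```python
# Build the graph Laplacian L = D - A for a toy graph and inspect the
# eigenvectors of its smallest eigenvalues.
import numpy as np

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)

D = np.diag(A.sum(axis=1))       # diagonal degree matrix
L = D - A                        # graph Laplacian

vals, vecs = np.linalg.eigh(L)   # eigh returns ascending eigenvalues
print(vals[0])                   # ~0, with a constant eigenvector
print(vecs[:, 1])                # component signs separate the two triangles
```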

Laplacian matrix normalization

The goal of normalization is to make the diagonal entries of the Laplacian matrix all equal to one, scaling the off-diagonal entries correspondingly. In a weighted graph, a vertex may have a large degree because of a small number of connected edges with large weights just as well as because of a large number of connected edges with unit weights.

A popular normalized spectral clustering technique is the normalized cuts algorithm or Shi–Malik algorithm introduced by Jianbo Shi and Jitendra Malik, [2] commonly used for image segmentation. It partitions points into two sets $(B_+, B_-)$ based on the eigenvector $v$ corresponding to the second-smallest eigenvalue of the symmetric normalized Laplacian defined as

$L^{\text{norm}} := I - D^{-1/2} A D^{-1/2}.$

The vector $v$ is also the eigenvector corresponding to the second-largest eigenvalue of the symmetrically normalized adjacency matrix $D^{-1/2} A D^{-1/2}$.

The random walk (or left) normalized Laplacian is defined as

$L^{\text{rw}} := D^{-1} L = I - D^{-1} A$

and can also be used for spectral clustering. A mathematically equivalent algorithm [3] takes the eigenvector $u$ corresponding to the largest eigenvalue of the random walk normalized adjacency matrix $P = D^{-1} A$.

The eigenvector $v$ of the symmetrically normalized Laplacian and the eigenvector $u$ of the left normalized Laplacian are related by the identity $v = D^{1/2} u$.
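
These definitions and the identity can be checked numerically. A sketch on a small illustrative weighted graph:

```python
# Normalized Laplacians and the eigenvector identity u = D^{-1/2} v, checked
# numerically on a small illustrative weighted graph.
import numpy as np

A = np.array([[0., 2., 1., 0.],
              [2., 0., 0., 1.],
              [1., 0., 0., 3.],
              [0., 1., 3., 0.]])
d = A.sum(axis=1)

I = np.eye(len(d))
L_sym = I - np.diag(d**-0.5) @ A @ np.diag(d**-0.5)  # I - D^{-1/2} A D^{-1/2}
L_rw = I - np.diag(1.0 / d) @ A                      # I - D^{-1} A

w, V = np.linalg.eigh(L_sym)
v = V[:, 1]                      # second-smallest eigenpair of L_sym
u = np.diag(d**-0.5) @ v         # u = D^{-1/2} v

print(np.allclose(L_rw @ u, w[1] * u))   # True: u is an eigenvector of L_rw
```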

Cluster analysis via spectral embedding

Knowing the $n$-by-$k$ matrix $V$ of selected eigenvectors, the mapping of the original $n$ data points to a $k$-dimensional vector space, called spectral embedding, is performed using the rows of $V$. Now the analysis is reduced to clustering $n$ vectors with $k$ components each, which may be done in various ways.

In the simplest case $k = 1$, the selected single eigenvector $v$, called the Fiedler vector, corresponds to the second smallest eigenvalue. Using the components of $v$, one can place all points whose component in $v$ is positive in the set $B_+$ and the rest in $B_-$, thus bi-partitioning the graph and labeling the data points with two labels. This sign-based approach follows the intuitive explanation of spectral clustering via the mass-spring model: in the low-frequency vibration mode that the Fiedler vector represents, the data points in one cluster, identified with mutually strongly connected masses, would move together in one direction, while the data points in the complementary cluster, identified with the remaining masses, would move together in the opposite direction. The algorithm can be used for hierarchical clustering by repeatedly partitioning the subsets in the same fashion.
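
A minimal sketch of the sign-based bipartition, on the same toy two-triangle graph used above:

```python
# Bi-partitioning by the sign of the Fiedler vector.
import numpy as np

A = np.array([[0, 1, 1, 0, 0, 0],
              [1, 0, 1, 0, 0, 0],
              [1, 1, 0, 1, 0, 0],
              [0, 0, 1, 0, 1, 1],
              [0, 0, 0, 1, 0, 1],
              [0, 0, 0, 1, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A

_, vecs = np.linalg.eigh(L)
fiedler = vecs[:, 1]                 # eigenvector of the second-smallest eigenvalue
labels = (fiedler > 0).astype(int)   # positive components -> B_+, the rest -> B_-
print(labels)                        # e.g. [0 0 0 1 1 1]: the two triangles
```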

In the general case $k > 1$, any vector clustering technique can be used, e.g., DBSCAN.

Algorithms

Basic Algorithm
  1. Calculate the Laplacian $L$ (or the normalized Laplacian)
  2. Calculate the first $k$ eigenvectors (the eigenvectors corresponding to the $k$ smallest eigenvalues of $L$)
  3. Consider the matrix formed by the first $k$ eigenvectors; the $i$-th row defines the features of graph node $i$
  4. Cluster the graph nodes based on these features (e.g., using k-means clustering), as in the sketch below
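
A compact sketch of these four steps, using NumPy for the eigenproblem and scikit-learn's k-means for the final step:

```python
# The basic algorithm end to end: similarity matrix -> Laplacian -> k smallest
# eigenvectors -> k-means on the rows of the eigenvector matrix.
import numpy as np
from sklearn.cluster import KMeans

def spectral_clustering(A, k):
    """Cluster graph nodes given a symmetric similarity matrix A."""
    L = np.diag(A.sum(axis=1)) - A      # step 1: Laplacian
    _, vecs = np.linalg.eigh(L)         # step 2: eigenvectors, ascending order
    features = vecs[:, :k]              # step 3: i-th row = features of node i
    return KMeans(n_clusters=k, n_init=10).fit_predict(features)  # step 4
```

For example, `spectral_clustering(A, 2)` returns one of two labels for each node of the graph defined by A.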

If the similarity matrix has not already been explicitly constructed, the efficiency of spectral clustering may be improved if the solution to the corresponding eigenvalue problem is performed in a matrix-free fashion (without explicitly manipulating or even computing the similarity matrix), as in the Lanczos algorithm.

For large-sized graphs, the second eigenvalue of the (normalized) graph Laplacian matrix is often ill-conditioned, leading to slow convergence of iterative eigenvalue solvers. Preconditioning is a key technology accelerating the convergence, e.g., in the matrix-free LOBPCG method. Spectral clustering has been successfully applied on large graphs by first identifying their community structure, and then clustering communities. [4]
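
As an illustration of the matrix-free approach, SciPy's lobpcg accepts a LinearOperator that only applies the Laplacian to vectors. A sketch, where the random sparse graph is a stand-in for a real similarity graph and no preconditioner is supplied (one would accelerate convergence):

```python
# Matrix-free LOBPCG: apply L = D - A through a LinearOperator, so the
# Laplacian matrix itself is never formed explicitly.
import numpy as np
import scipy.sparse as sp
from scipy.sparse.linalg import LinearOperator, lobpcg

A = sp.random(1000, 1000, density=0.01, random_state=0)
A = A + A.T                                   # symmetrize a random sparse graph
d = np.asarray(A.sum(axis=1)).ravel()         # vertex degrees

def laplacian_matvec(x):
    x = x.ravel()
    return d * x - A @ x                      # (D - A) x without forming D - A

L_op = LinearOperator((1000, 1000), matvec=laplacian_matvec, dtype=float)

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                # initial block of 3 vectors
vals, vecs = lobpcg(L_op, X, largest=False, tol=1e-4, maxiter=200)
```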

Spectral clustering is closely related to nonlinear dimensionality reduction, and dimension reduction techniques such as locally-linear embedding can be used to reduce errors from noise or outliers. [5]

Costs

Denoting the number of the data points by $n$, it is important to estimate the memory footprint and compute time, or the number of arithmetic operations (AO) performed, as a function of $n$. No matter the algorithm of the spectral clustering, the two main costly items are the construction of the graph Laplacian and determining its eigenvectors for the spectral embedding. The last step, determining the labels from the $n$-by-$k$ matrix of eigenvectors, is typically the least expensive, requiring only $kn$ AO and creating just an $n$-by-$1$ vector of the labels in memory.

The need to construct the graph Laplacian is common for all distance- or correlation-based clustering methods. Computing the eigenvectors is specific to spectral clustering only.

Constructing graph Laplacian

The graph Laplacian can be, and commonly is, constructed from the adjacency matrix. The construction can be performed matrix-free, i.e., without explicitly forming the matrix of the graph Laplacian, and with no additional AO. It can also be performed in place of the adjacency matrix without increasing the memory footprint. Either way, the cost of constructing the graph Laplacian is essentially determined by the cost of constructing the $n$-by-$n$ graph adjacency matrix.

Moreover, a normalized Laplacian has exactly the same eigenvectors as the normalized adjacency matrix, but with the order of the eigenvalues reversed. Thus, instead of computing the eigenvectors corresponding to the smallest eigenvalues of the normalized Laplacian, one can equivalently compute the eigenvectors corresponding to the largest eigenvalues of the normalized adjacency matrix, without ever forming the Laplacian matrix at all.

Naive constructions of the graph adjacency matrix, e.g., using the RBF kernel, make it dense, thus requiring $n^2$ memory and $n^2$ AO to determine each of the $n^2$ entries of the matrix. The Nyström method [6] can be used to approximate the similarity matrix, but the approximate matrix is not elementwise positive, [7] i.e., it cannot be interpreted as a distance-based similarity.

Algorithms to construct the graph adjacency matrix as a sparse matrix are typically based on a nearest neighbor search, which estimates or samples a neighborhood of a given data point for nearest neighbors and computes non-zero entries of the adjacency matrix by comparing only pairs of the neighbors. The number of the selected nearest neighbors thus determines the number of non-zero entries and is often fixed, so that the memory footprint of the $n$-by-$n$ graph adjacency matrix is only $O(n)$, only $O(n)$ sequential arithmetic operations are needed to compute the $O(n)$ non-zero entries, and the calculations can be trivially run in parallel.
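
For instance, scikit-learn provides a nearest-neighbor graph constructor. A brief sketch, where the data and the number of neighbors are illustrative choices:

```python
# Sparse adjacency via k-nearest neighbors: O(n) nonzeros when n_neighbors is
# fixed, instead of a dense n-by-n similarity matrix.
import numpy as np
from sklearn.neighbors import kneighbors_graph

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 2))       # hypothetical data points

W = kneighbors_graph(X, n_neighbors=10, mode='distance', include_self=False)
W = 0.5 * (W + W.T)                    # symmetrize the kNN graph
print(W.nnz)                           # ~ 10 * n nonzero entries
```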

Computing eigenvectors

The cost of computing the $n$-by-$k$ (with $k \ll n$) matrix of selected eigenvectors of the graph Laplacian is normally proportional to the cost of multiplication of the $n$-by-$n$ graph Laplacian matrix by a vector, which varies greatly depending on whether the graph Laplacian matrix is dense or sparse. For the dense case, the cost is thus $O(n^2)$. The $O(n^3)$ cost very commonly cited in the literature comes from choosing $k = n$ and is clearly misleading, since, e.g., in a hierarchical spectral clustering $k = 1$, as determined by the Fiedler vector.

In the sparse case of the $n$-by-$n$ graph Laplacian matrix with $O(n)$ non-zero entries, the cost of the matrix-vector product, and thus of computing the $n$-by-$k$ (with $k \ll n$) matrix of selected eigenvectors, is $O(n)$, with the memory footprint also only $O(n)$; both are the optimal lower bounds of complexity of clustering $n$ data points. Moreover, matrix-free eigenvalue solvers such as LOBPCG can efficiently run in parallel, e.g., on multiple GPUs with distributed memory, resulting not only in high quality clusters, which spectral clustering is famous for, but also in top performance. [8]

Software

Free software implementing spectral clustering is available in large open source projects like scikit-learn [9] using LOBPCG [10] with multigrid preconditioning [11] [12] or ARPACK, MLlib for pseudo-eigenvector clustering using the power iteration method, [13] and R. [14]
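
For example, scikit-learn's SpectralClustering wraps this whole pipeline. A brief usage sketch, with illustrative parameter values, on two half-moon clusters where plain k-means in the original space would fail:

```python
# Usage sketch of scikit-learn's spectral clustering on two half-moons.
from sklearn.cluster import SpectralClustering
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=500, noise=0.05, random_state=0)
model = SpectralClustering(n_clusters=2, affinity='nearest_neighbors',
                           eigen_solver='arpack', random_state=0)
labels = model.fit_predict(X)          # one label per data point
```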

Relationship with other clustering methods

The ideas behind spectral clustering may not be immediately obvious. It may be useful to highlight relationships with other methods. In particular, it can be described in the context of kernel clustering methods, which reveals several similarities with other approaches. [15]

Relationship with k-means

The weighted kernel k-means problem [16] shares the objective function with the spectral clustering problem, which can be optimized directly by multi-level methods. [17]

Relationship to DBSCAN

In the trivial case of determining connected graph components — the optimal clusters with no edges cut — spectral clustering is also related to a spectral version of DBSCAN clustering that finds density-connected components. [18]

Measures to compare clusterings

Ravi Kannan, Santosh Vempala and Adrian Vetta [19] proposed a bicriteria measure to define the quality of a given clustering. They said that a clustering was an (α, ε)-clustering if the conductance of each cluster (in the clustering) was at least α and the weight of the inter-cluster edges was at most an ε fraction of the total weight of all the edges in the graph. They also analyzed two approximation algorithms in the same paper.

Spectral clustering has a long history. [20] [21] [22] [23] [24] [2] [25] Spectral clustering as a machine learning method was popularized by Shi & Malik [2] and Ng, Jordan, & Weiss. [25]

Ideas and network measures related to spectral clustering also play an important role in a number of applications apparently different from clustering problems. For instance, networks with stronger spectral partitions take longer to converge in opinion-updating models used in sociology and economics. [26] [27]

See also

Principal component analysis

Principal component analysis (PCA) is a popular technique for analyzing large datasets containing a high number of dimensions/features per observation, increasing the interpretability of data while preserving the maximum amount of information, and enabling the visualization of multidimensional data. Formally, PCA is a statistical technique for reducing the dimensionality of a dataset. This is accomplished by linearly transforming the data into a new coordinate system where the variation in the data can be described with fewer dimensions than the initial data. Many studies use the first two principal components in order to plot the data in two dimensions and to visually identify clusters of closely related data points. Principal component analysis has applications in many fields such as population genetics, microbiome studies, and atmospheric science.

In graph theory and computer science, an adjacency matrix is a square matrix used to represent a finite graph. The elements of the matrix indicate whether pairs of vertices are adjacent or not in the graph.

Nonlinear dimensionality reduction

Nonlinear dimensionality reduction, also known as manifold learning, refers to various related techniques that aim to project high-dimensional data onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.

In numerical analysis, one of the most important problems is designing efficient and stable algorithms for finding the eigenvalues of a matrix. These eigenvalue algorithms may also find eigenvectors.

In numerical analysis, inverse iteration is an iterative eigenvalue algorithm. It allows one to find an approximate eigenvector when an approximation to a corresponding eigenvalue is already known. The method is conceptually similar to the power method. It appears to have originally been developed to compute resonance frequencies in the field of structural mechanics.

In mathematics, spectral graph theory is the study of the properties of a graph in relationship to the characteristic polynomial, eigenvalues, and eigenvectors of matrices associated with the graph, such as its adjacency matrix or Laplacian matrix.

In numerical linear algebra, the QR algorithm or QR iteration is an eigenvalue algorithm: that is, a procedure to calculate the eigenvalues and eigenvectors of a matrix. The QR algorithm was developed in the late 1950s by John G. F. Francis and by Vera N. Kublanovskaya, working independently. The basic idea is to perform a QR decomposition, writing the matrix as a product of an orthogonal matrix and an upper triangular matrix, multiply the factors in the reverse order, and iterate.

In mathematics, the discrete Laplace operator is an analog of the continuous Laplace operator, defined so that it has meaning on a graph or a discrete grid. For the case of a finite-dimensional graph, the discrete Laplace operator is more commonly called the Laplacian matrix.

In the mathematical field of graph theory, the Laplacian matrix, also called the graph Laplacian, admittance matrix, Kirchhoff matrix or discrete Laplacian, is a matrix representation of a graph. Named after Pierre-Simon Laplace, the graph Laplacian matrix can be viewed as a matrix form of the negative discrete Laplace operator on a graph approximating the negative continuous Laplacian obtained by the finite difference method.

In matrix theory, the Perron–Frobenius theorem, proved by Oskar Perron (1907) and Georg Frobenius (1912), asserts that a real square matrix with positive entries has a unique eigenvalue of largest magnitude, and that eigenvalue is real. The corresponding eigenvector can be chosen to have strictly positive components; a similar statement holds for certain classes of nonnegative matrices. This theorem has important applications to probability theory; to the theory of dynamical systems; to economics; to demography; to social networks; to Internet search engines (PageRank); and even to the ranking of American football teams. The first to discuss the ordering of players within tournaments using Perron–Frobenius eigenvectors was Edmund Landau.

In linear algebra, it is often important to know which vectors have their directions unchanged by a linear transformation. An eigenvector or characteristic vector is such a vector. Thus an eigenvector $v$ of a linear transformation $T$ is scaled by a constant factor when the linear transformation is applied to it: $Tv = \lambda v$. The corresponding eigenvalue, characteristic value, or characteristic root is the multiplying factor $\lambda$.

Algebraic connectivity

The algebraic connectivity of a graph G is the second-smallest eigenvalue of the Laplacian matrix of G. This eigenvalue is greater than 0 if and only if G is a connected graph. This is a corollary to the fact that the number of times 0 appears as an eigenvalue in the Laplacian is the number of connected components in the graph. The magnitude of this value reflects how well connected the overall graph is. It has been used in analyzing the robustness and synchronizability of networks.

In the field of multivariate statistics, kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space.

In mathematics, a graph partition is the reduction of a graph to a smaller graph by partitioning its set of nodes into mutually exclusive groups. Edges of the original graph that cross between the groups will produce edges in the partitioned graph. If the number of resulting edges is small compared to the original graph, then the partitioned graph may be better suited for analysis and problem-solving than the original. Finding a partition that simplifies graph analysis is a hard problem, but one that has applications to scientific computing, VLSI circuit design, and task scheduling in multiprocessor computers, among others. Recently, the graph partition problem has gained importance due to its application for clustering and detection of cliques in social, pathological and biological networks. For a survey on recent trends in computational methods and applications see Buluc et al. (2013). Two common examples of graph partitioning are minimum cut and maximum cut problems.

In linear algebra, eigendecomposition is the factorization of a matrix into a canonical form, whereby the matrix is represented in terms of its eigenvalues and eigenvectors. Only diagonalizable matrices can be factorized in this way. When the matrix being factorized is a normal or real symmetric matrix, the decomposition is called "spectral decomposition", derived from the spectral theorem.

The image segmentation problem is concerned with partitioning an image into multiple regions according to some homogeneity criterion. This article is primarily concerned with graph theoretic approaches to image segmentation applying graph partitioning via minimum cut or maximum cut. Segmentation-based object categorization can be viewed as a specific case of spectral clustering applied to image segmentation.

Spectral layout is a class of algorithm for drawing graphs. The layout uses the eigenvectors of a matrix, such as the Laplace matrix of the graph, as Cartesian coordinates of the graph's vertices.

Locally Optimal Block Preconditioned Conjugate Gradient (LOBPCG) is a matrix-free method for finding the largest eigenvalues and the corresponding eigenvectors of a symmetric generalized eigenvalue problem $Ax = \lambda Bx$.

In applied mathematics, proto-value functions (PVFs) are automatically learned basis functions that are useful in approximating task-specific value functions, providing a compact representation of the powers of transition matrices. They provide a novel framework for solving the credit assignment problem. The framework introduces a novel approach to solving Markov decision processes (MDP) and reinforcement learning problems, using multiscale spectral and manifold learning methods. Proto-value functions are generated by spectral analysis of a graph, using the graph Laplacian.

Diffusion map

Diffusion maps is a dimensionality reduction or feature extraction algorithm introduced by Coifman and Lafon which computes a family of embeddings of a data set into Euclidean space whose coordinates can be computed from the eigenvectors and eigenvalues of a diffusion operator on the data. The Euclidean distance between points in the embedded space is equal to the "diffusion distance" between probability distributions centered at those points. Different from linear dimensionality reduction methods such as principal component analysis (PCA), diffusion maps are part of the family of nonlinear dimensionality reduction methods which focus on discovering the underlying manifold that the data has been sampled from. By integrating local similarities at different scales, diffusion maps give a global description of the data-set. Compared with other methods, the diffusion map algorithm is robust to noise perturbation and computationally inexpensive.

References

  1. Demmel, J. "CS267: Notes for Lecture 23, April 9, 1999, Graph Partitioning, Part 2".
  2. Jianbo Shi and Jitendra Malik, "Normalized Cuts and Image Segmentation", IEEE Transactions on PAMI, Vol. 22, No. 8, Aug 2000.
  3. Marina Meilă & Jianbo Shi, "Learning Segmentation by Random Walks", Neural Information Processing Systems 13 (NIPS 2000), 2001, pp. 873–879.
  4. Zare, Habil; Shooshtari, P.; Gupta, A.; Brinkman, R. (2010). "Data reduction for spectral clustering to analyze high throughput flow cytometry data". BMC Bioinformatics. 11: 403. doi:10.1186/1471-2105-11-403. PMC 2923634. PMID 20667133.
  5. Arias-Castro, E.; Chen, G.; Lerman, G. (2011). "Spectral clustering based on local linear approximations". Electronic Journal of Statistics. 5: 1537–87. arXiv:1001.1323. doi:10.1214/11-ejs651. S2CID 88518155.
  6. Fowlkes, C. (2004). "Spectral grouping using the Nystrom method". IEEE Transactions on Pattern Analysis and Machine Intelligence. 26 (2): 214–25. doi:10.1109/TPAMI.2004.1262185. PMID 15376896. S2CID 2384316.
  7. Wang, S.; Gittens, A.; Mahoney, M.W. (2019). "Scalable Kernel K-Means Clustering with Nystrom Approximation: Relative-Error Bounds". Journal of Machine Learning Research. 20: 1–49. arXiv:1706.02803.
  8. Acer, Seher; Boman, Erik G.; Glusa, Christian A.; Rajamanickam, Sivasankaran (2021). "Sphynx: A parallel multi-GPU graph partitioner for distributed-memory systems". Parallel Computing. 106: 102769. doi:10.1016/j.parco.2021.102769. S2CID 233481603.
  9. "2.3. Clustering".
  10. Knyazev, Andrew V. (2003). Boley; Dhillon; Ghosh; Kogan (eds.). Modern preconditioned eigensolvers for spectral image segmentation and graph bisection. Clustering Large Data Sets; Third IEEE International Conference on Data Mining (ICDM 2003), Melbourne, Florida: IEEE Computer Society. pp. 59–62.
  11. Knyazev, Andrew V. (2006). Multiscale Spectral Image Segmentation: Multiscale preconditioning for computing eigenvalues of graph Laplacians in image segmentation. Fast Manifold Learning Workshop, WM, Williamsburg, VA. doi:10.13140/RG.2.2.35280.02565.
  12. Knyazev, Andrew V. (2006). Multiscale Spectral Graph Partitioning and Image Segmentation. Workshop on Algorithms for Modern Massive Datasets, Stanford University and Yahoo! Research.
  13. "Clustering - RDD-based API - Spark 3.2.0 Documentation".
  14. "Kernlab: Kernel-Based Machine Learning Lab". 12 November 2019.
  15. Filippone, M.; Camastra, F.; Masulli, F.; Rovetta, S. (January 2008). "A survey of kernel and spectral methods for clustering" (PDF). Pattern Recognition. 41 (1): 176–190. Bibcode:2008PatRe..41..176F. doi:10.1016/j.patcog.2007.05.018.
  16. Dhillon, I.S.; Guan, Y.; Kulis, B. (2004). "Kernel k-means: spectral clustering and normalized cuts" (PDF). Proceedings of the tenth ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 551–6.
  17. Dhillon, Inderjit; Guan, Yuqiang; Kulis, Brian (November 2007). "Weighted Graph Cuts without Eigenvectors: A Multilevel Approach". IEEE Transactions on Pattern Analysis and Machine Intelligence. 29 (11): 1944–1957. CiteSeerX 10.1.1.131.2635. doi:10.1109/tpami.2007.1115. PMID 17848776. S2CID 9402790.
  18. Schubert, Erich; Hess, Sibylle; Morik, Katharina (2018). The Relationship of DBSCAN to Matrix Factorization and Spectral Clustering (PDF). LWDA. pp. 330–334.
  19. Kannan, Ravi; Vempala, Santosh; Vetta, Adrian (2004). "On Clusterings: Good, Bad and Spectral". Journal of the ACM. 51 (3): 497–515. doi:10.1145/990308.990313. S2CID 207558562.
  20. Cheeger, Jeff (1969). "A lower bound for the smallest eigenvalue of the Laplacian". Proceedings of the Princeton Conference in Honor of Professor S. Bochner.
  21. Donath, William; Hoffman, Alan (1972). "Algorithms for partitioning of graphs and computer logic based on eigenvectors of connection matrices". IBM Technical Disclosure Bulletin.
  22. Fiedler, Miroslav (1973). "Algebraic connectivity of graphs". Czechoslovak Mathematical Journal. 23 (2): 298–305. doi:10.21136/CMJ.1973.101168.
  23. Guattery, Stephen; Miller, Gary L. (1995). "On the performance of spectral graph partitioning methods". Annual ACM-SIAM Symposium on Discrete Algorithms.
  24. Daniel A. Spielman and Shang-Hua Teng (1996). "Spectral Partitioning Works: Planar graphs and finite element meshes". Annual IEEE Symposium on Foundations of Computer Science.
  25. Ng, Andrew Y.; Jordan, Michael I.; Weiss, Yair (2002). "On spectral clustering: analysis and an algorithm" (PDF). Advances in Neural Information Processing Systems.
  26. DeMarzo, P.M.; Vayanos, D.; Zwiebel, J. (2003-08-01). "Persuasion Bias, Social Influence, and Unidimensional Opinions". The Quarterly Journal of Economics. Oxford University Press. 118 (3): 909–968. doi:10.1162/00335530360698469. ISSN 0033-5533.
  27. Golub, Benjamin; Jackson, Matthew O. (2012-07-26). "How Homophily Affects the Speed of Learning and Best-Response Dynamics". The Quarterly Journal of Economics. Oxford University Press (OUP). 127 (3): 1287–1338. doi:10.1093/qje/qjs021. ISSN 0033-5533.