Quantum clustering

Quantum Clustering (QC) is a class of data-clustering algorithms that use conceptual and mathematical tools from quantum mechanics. QC belongs to the family of density-based clustering algorithms, where clusters are defined by regions of higher density of data points.

QC was first developed by David Horn and Assaf Gottlieb in 2001. [1]

Original Quantum Clustering Algorithm

Given a set of points in an n-dimensional data space, QC represents each point with a multidimensional Gaussian distribution, with width (standard deviation) sigma, centered at each point’s location in the space. These Gaussians are then added together to create a single distribution for the entire data set. (This step is a particular example of kernel density estimation, often referred to as a Parzen-Rosenblatt window estimator.) This distribution is considered to be the quantum-mechanical wave function for the data set. Loosely speaking, the wave function is a generalized description of where there are likely to be data points in the space.
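
Written as a formula, the wave function assigned to data points x1, …, xN is the Gaussian-kernel sum

    \psi(\mathbf{x}) \,=\, \sum_{i=1}^{N} e^{-\lVert \mathbf{x} - \mathbf{x}_i \rVert^{2} / 2\sigma^{2}}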

QC next introduces the idea of a quantum potential; using the time-independent Schrödinger equation, a potential surface is constructed which has the data set’s wave function as a stable solution. Details in the potential surface are more robust to changes in sigma (the width of the Gaussians) than the corresponding details in the wave function; this advantage is one of the initial motivations for the development of QC.
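
Requiring \psi to be the lowest-energy eigenstate of the Hamiltonian H = -(\sigma^{2}/2)\nabla^{2} + V determines the potential up to the energy eigenvalue E:

    V(\mathbf{x}) \,=\, E \,+\, \frac{\sigma^{2}}{2}\,\frac{\nabla^{2}\psi(\mathbf{x})}{\psi(\mathbf{x})}

where E is conventionally chosen so that the minimum of V is zero.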

The potential surface is considered to be the ‘landscape’ of the data set, where 'low' points in the landscape correspond to regions of high data density. QC then uses gradient descent to move each data point ‘downhill’ in the landscape, causing points to gather together in nearby minima, thus revealing clusters within the data set.
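
A minimal illustrative sketch of this procedure in Python (using NumPy) is given below; the function names, the numerical gradient, and the fixed iteration count are choices made for this sketch rather than part of the algorithm:

    import numpy as np

    def quantum_potential(x, data, sigma):
        """V(x) = (sigma^2 / 2) * laplacian(psi) / psi, up to the additive constant E."""
        sq = np.sum((data - x) ** 2, axis=1)          # squared distance to every data point
        w = np.exp(-sq / (2 * sigma ** 2))            # Gaussian weights; psi = sum of weights
        psi = np.sum(w)
        d = data.shape[1]
        # For a sum of Gaussians, the Laplacian has this closed form.
        lap = np.sum(w * (sq / sigma ** 4 - d / sigma ** 2))
        return 0.5 * sigma ** 2 * lap / psi

    def qc_cluster(data, sigma, step=0.05, n_iters=200, eps=1e-4):
        """Move every data point downhill in the quantum potential by gradient descent."""
        data = np.asarray(data, dtype=float)   # original points; they define the fixed potential
        points = data.copy()                   # replicas that will slide downhill
        n, d = points.shape
        for _ in range(n_iters):
            for k in range(n):
                grad = np.zeros(d)
                for j in range(d):             # central-difference estimate of grad V
                    e = np.zeros(d)
                    e[j] = eps
                    grad[j] = (quantum_potential(points[k] + e, data, sigma)
                               - quantum_potential(points[k] - e, data, sigma)) / (2 * eps)
                points[k] -= step * grad
        return points   # replicas that settle near the same minimum form one cluster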

QC has a single main hyperparameter, which is the width sigma of the Gaussian distribution around each data point. For sufficiently small sigma, every data point will define its own depression in the landscape, and no points will move, thus creating no clusters. For sufficiently large sigma, the landscape becomes a single smooth bowl, and every data point will cluster together at the single global minimum in the landscape. Exploring the range of sigma values between these extremes yields information about the inherent structure of the data set, including hierarchy of structure; smaller sigma values reveal more fine-grained local structure, and larger sigma values reveal overall global structure. The QC algorithm does not specify a preferred or ‘correct’ value of sigma.
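
For example, the sketch above could be run over a sweep of sigma values, counting the number of distinct minima reached at each value (here data is assumed to be an (N, d) NumPy array):

    for sigma in (0.1, 0.2, 0.5, 1.0, 2.0):
        final = qc_cluster(data, sigma)
        # Replicas whose final positions agree to one decimal place share a minimum.
        n_clusters = len(np.unique(np.round(final, 1), axis=0))
        print(f"sigma={sigma}: {n_clusters} clusters")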

Dynamic Quantum Clustering

Developed by Marvin Weinstein and David Horn in 2009, [2] Dynamic Quantum Clustering (DQC) extends the basic QC algorithm in several ways.

Quantum Evolution, Non-Local Gradient Descent, and Tunneling

DQC uses the same potential landscape as QC, but it replaces classical gradient descent with quantum evolution. Each data point is again represented by its individual wave function (a multidimensional Gaussian distribution with width sigma). The time-dependent Schrödinger equation is then used to compute each wave function's evolution over time in the given quantum potential. More precisely, a small time step is introduced, and each wave function is advanced one step at a time, with a new expected location for the data point calculated after each step. This process builds a trajectory through the data space for each point; the evolution continues until all points have stopped moving.
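
A schematic one-dimensional illustration of this evolution is sketched below. It uses a grid discretization and a dense matrix exponential purely for clarity, which is not how DQC itself represents the states (DQC works in a basis built from the Gaussians, as described in a later section), and all names are illustrative:

    import numpy as np
    from scipy.linalg import expm

    def dqc_trajectory_1d(x0, data, sigma, mass=1.0, dt=0.1, n_steps=100):
        """Evolve one point's Gaussian wave packet in the quantum potential.

        1-D sketch in units with hbar = 1; data is a 1-D NumPy array of sample positions.
        """
        grid = np.linspace(data.min() - 3 * sigma, data.max() + 3 * sigma, 400)
        dx = grid[1] - grid[0]
        # Quantum potential on the grid, same closed form as in the QC sketch (d = 1).
        sq = (grid[:, None] - data[None, :]) ** 2
        w = np.exp(-sq / (2 * sigma ** 2))
        psi_data = w.sum(axis=1)
        lap = (w * (sq / sigma ** 4 - 1.0 / sigma ** 2)).sum(axis=1)
        V = 0.5 * sigma ** 2 * lap / psi_data
        # Hamiltonian: finite-difference kinetic term plus the diagonal potential.
        n = len(grid)
        K = (np.diag(np.full(n, -2.0)) + np.diag(np.ones(n - 1), 1)
             + np.diag(np.ones(n - 1), -1)) / dx ** 2
        H = -K / (2.0 * mass) + np.diag(V)
        U = expm(-1j * H * dt)                 # one-step time-evolution operator
        # Initial state: the Gaussian wave packet centered on the data point x0.
        psi = np.exp(-(grid - x0) ** 2 / (2 * sigma ** 2)).astype(complex)
        trajectory = []
        for _ in range(n_steps):
            psi = U @ psi
            prob = np.abs(psi) ** 2
            trajectory.append(np.sum(grid * prob) / np.sum(prob))   # expected position <x>
        return np.array(trajectory)

Running this for every data point and stacking the returned arrays reproduces, in one dimension, the per-point trajectories described above.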

Importantly, the Ehrenfest Theorem from quantum mechanics states that this quantum evolution does, in fact, equate to the point moving downhill in the potential landscape, in expectation. The “in expectation” part is important because, unlike in classical physics, the point’s motion is not influenced only by the gradient of the potential at the point’s location; instead, the point’s wave function extends over the entire landscape (with the Gaussian centered at the point’s location), and a complex interaction between the wave function and the potential determines the point’s motion. As a loose analogy: regions of the landscape that are below the point's current location 'attract' the point—the more so the lower the region is, but the less so the farther away from the point it is. In the same way, higher regions of the landscape 'repel' the point.
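
In its standard form the theorem reads

    \frac{d}{dt}\langle \mathbf{x} \rangle \,=\, \frac{\langle \mathbf{p} \rangle}{m},
    \qquad
    \frac{d}{dt}\langle \mathbf{p} \rangle \,=\, -\,\langle \nabla V(\mathbf{x}) \rangle

where the right-hand side of the second equation is the expectation of the gradient over the whole wave function, not the gradient evaluated at the wave function's center; this is exactly the non-locality described above.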

Thus, quantum evolution for each point acts as a form of non-local gradient descent in the potential. This non-locality creates the possibility of tunneling, where a point will seem to ignore or pass through a potential barrier on its way toward some lower minimum. The biggest problem in non-convex gradient descent is often the existence of many small and uninteresting local minima where points can get stuck as they descend. (This problem tends to get worse as the number of dimensions increases, which is part of the curse of dimensionality.) DQC’s use of non-local gradient descent and tunneling presents a solution to this problem.

DQC introduces 2 new hyperparameters: the time step, and the mass of each data point (which controls the degree of tunneling behavior). Whereas tuning of sigma is integral to understanding any new data set, both time step and mass can usually be left at reasonable default values and still produce useful results.

An important downside of the quantum-evolution approach is that the time complexity of the evolution now grows at least quadratically with the number of data points, since interacting with the entire potential landscape, which is built from every data point's Gaussian, carries a cost proportional to the number of points for each point being evolved. For large data sets, the computation time quickly becomes intractable. When needed, DQC addresses this problem by selecting a limited number of points from the data set to act as a basis (see next section).

Use of a Limited Basis

For a data set of n points, DQC creates a set of n quantum eigenstates for use in its underlying calculations; the eigenstates are orthonormal, and each one is a linear combination of the Gaussian distributions representing each data point.

For large n, the use of n eigenstates becomes computationally intractable, since both the construction of the potential and the evolution of the individual points involve operators represented as n × n matrices in that basis. To solve this problem, DQC allows for the selection of a more limited basis, as follows. To act as the basis, b data points (b < n) are chosen such that the basis points span the space occupied by the data set. (This can be done in multiple ways; the essential goal is to choose the basis points to be as far away from each other as possible.) DQC then constructs b eigenstates using these b basis points. These eigenstates represent the basis points perfectly; they are also used to represent all non-basis points, but those representations will be imperfect. The loss of information depends on sigma; for a given basis, sigma must be chosen large enough that the basis can represent the non-basis points to a reasonable degree of accuracy. In a sense, the size of the chosen basis can be thought of as the ‘resolution’ that is used to ‘view’ the structure of the data.
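
One simple way to obtain well-spread basis points is greedy farthest-point sampling; the sketch below is one possible heuristic, not a selection rule prescribed by DQC, and the names are illustrative:

    import numpy as np

    def farthest_point_basis(data, b, seed=0):
        """Greedily pick b data points that are spread as far apart as possible."""
        data = np.asarray(data, dtype=float)
        rng = np.random.default_rng(seed)
        chosen = [int(rng.integers(len(data)))]            # start from a random point
        dist = np.linalg.norm(data - data[chosen[0]], axis=1)
        for _ in range(b - 1):
            nxt = int(np.argmax(dist))                     # farthest from the current basis
            chosen.append(nxt)
            dist = np.minimum(dist, np.linalg.norm(data - data[nxt], axis=1))
        return data[chosen]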

The largest reasonable basis size depends on available computing resources, and on how long one is willing to wait for results. As of 2020, without access to enterprise-level computing resources, the largest tractable basis size is typically in the range of 1,500-2,000 points.

Use of Dynamic Visualization

DQC calculates a trajectory for each data point, which allows for the creation of an animated ('dynamic') visualization in which all data points move along their trajectories simultaneously. These animations present information not just in the final destination of each point, but in the entire trajectory of each point along the way. Notably, animations can reveal the existence of channels leading to a given cluster. (In the landscape metaphor, these structures can be thought of as riverbeds and lakes.) Although a given visualization is limited to at most 3 spatial dimensions, the appearance of channels and other structures makes it possible to ‘see’ what is happening in more than 3 dimensions.

Use of a PCA coordinate system is helpful for these visualizations; viewing the trajectories in the first 3 PCA dimensions packs as much information as possible into a single visualization.
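
As an illustration, the principal axes can be computed once from the original data and every trajectory projected onto the first three of them (a sketch using a plain SVD; names are illustrative):

    import numpy as np

    def project_trajectories_pca3(data, trajectories):
        """Project d-dimensional trajectories onto the first 3 principal axes of the data."""
        mean = data.mean(axis=0)
        _, _, vt = np.linalg.svd(data - mean, full_matrices=False)
        axes = vt[:3]                          # top three principal directions, shape (3, d)
        # trajectories: array of shape (n_points, n_steps, d)
        return (trajectories - mean) @ axes.T  # -> (n_points, n_steps, 3), ready for plotting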

A given 3D visualization of these trajectories is not an embedding of the trajectories in 3 dimensions. The trajectories have the same dimensionality as the data space, which is often much larger than 3; the visualization is simply a 3D view into a higher-dimensional motion.

The channels ('riverbeds') can have meaning in 2 different ways. First, they can be treated as subclusters, where different subclusters join the main cluster from different directions. Second, they can be treated as regressions: position along the channel at a given time (or, equivalently, order of arrival at the cluster center) may be correlated with some metadata of interest.

Applications

Variants of QC have been applied to real-world data in many fields, including biology, [1] [2] [3] [4] [5] [6] geology, [3] [7] physics, [3] [4] [8] finance, [3] engineering, [4] and economics. [9] In connection with these applications, a comprehensive mathematical analysis for finding all the minima of the quantum potential has also been worked out. [10]

References

  1. Horn, D.; Gottlieb, A. (2001). "Algorithm for Data Clustering in Pattern Recognition Problems Based on Quantum Mechanics". Physical Review Letters. 88 (1): 018702. Bibcode:2001PhRvL..88a8702H. doi:10.1103/PhysRevLett.88.018702. PMID 11800996.
  2. Weinstein, M.; Horn, D. (2009). "Dynamic quantum clustering: a method for visual exploration of structures in data". Physical Review E. 80 (6): 066117. arXiv:0908.2644. Bibcode:2009PhRvE..80f6117W. doi:10.1103/PhysRevE.80.066117. PMID 20365241. S2CID 10550999.
  3. Weinstein, M.; Meirer, F.; Hume, A.; Sciau, Ph.; Shaked, G.; Hofstetter, R.; Persi, E.; Mehta, A.; Horn, D. (2013). "Analyzing Big Data with Dynamic Quantum Clustering". arXiv:1310.2700 [physics.data-an].
  4. Scott, T. C.; Therani, M.; Wang, X. M. (2017). "Data Clustering with Quantum Mechanics". Mathematics. 5 (1): 5. doi:10.3390/math5010005.
  5. Roche, K.; Weinstein, M.; Dunwoodie, L. J.; Poehlman, W. L.; Feltus, F. A. (2018). "Sorting Five Human Tumor Types Reveals Specific Biomarkers and Background Classification Genes". Scientific Reports. 8 (1): 8180. Bibcode:2018NatSR...8.8180R. doi:10.1038/s41598-018-26310-x. PMC 5970138. PMID 29802335.
  6. Casaña-Eslava, R. V.; Lisboa, P. J. G.; Ortega-Martorell, S.; Jarman, I. H.; Martín-Guerrero, J. D. (2020). "A Probabilistic framework for Quantum Clustering". Knowledge-Based Systems. 194. arXiv:1902.05578. doi:10.1016/j.knosys.2020.105567. S2CID 213468799.
  7. Shaked, G. (2013). Quantum Clustering of Large Data Sets (PDF) (M.Sc. thesis).
  8. Weinstein, M.; Heifetz, A.; Klann, R. (2014). "Detection of nuclear sources in search survey using dynamic quantum clustering of gamma-ray spectral data". The European Physical Journal Plus. 129 (11): 239. arXiv:1406.0746. Bibcode:2014EPJP..129..239W. doi:10.1140/epjp/i2014-14239-3. S2CID 119217077.
  9. Decheng, F.; Jon, S.; Pang, C.; Dong, W.; Won, C. (2018). "Improved quantum clustering analysis based on the weighted distance and its application". Heliyon. 4 (11): e00984. doi:10.1016/j.heliyon.2018.e00984. PMC 6275214. PMID 30761372.
  10. Maignan, A.; Scott, T. C. (2021). "A Comprehensive Analysis of Quantum Clustering: Finding All the Potential Minima" (PDF). International Journal of Data Mining & Knowledge Management Process. 11 (1): 33–54. doi:10.5121/ijdkp.2021.11103.