Sparse PCA

Sparse principal component analysis (SPCA or sparse PCA) is a technique used in statistical analysis and, in particular, in the analysis of multivariate data sets. It extends the classic method of principal component analysis (PCA) for the reduction of dimensionality of data by introducing sparsity structures to the input variables.

A particular disadvantage of ordinary PCA is that the principal components are usually linear combinations of all input variables. SPCA overcomes this disadvantage by finding components, called sparse principal components (SPCs), that are linear combinations of just a few input variables. This means that some of the coefficients of the linear combinations defining the SPCs, called loadings, [note 1] are equal to zero. The number of nonzero loadings is called the cardinality of the SPC.

Mathematical formulation

Consider a data matrix, X, where each of the p columns represents an input variable, and each of the n rows represents an independent sample from the data population. One assumes each column of X has mean zero; otherwise one can subtract the column-wise mean from each element of X. Let Σ be the empirical covariance matrix of X, which has dimension p×p.
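
A minimal sketch of this preprocessing step in Python with NumPy (the data below are random placeholders used only to fix shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 10
X = rng.normal(size=(n, p))      # n samples (rows), p input variables (columns)

X = X - X.mean(axis=0)           # subtract the column-wise mean from each column
Sigma = X.T @ X / n              # empirical covariance matrix, shape (p, p)
```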

Given an integer k with 1 ≤ k ≤ p, the sparse PCA problem can be formulated as maximizing the variance along a direction represented by the vector v while constraining its cardinality:

$$\max_{v}\; v^{T}\Sigma v \quad \text{subject to} \quad \|v\|_{2}=1,\;\; \|v\|_{0}\le k. \qquad \text{(Eq. 1)}$$

The first constraint specifies that v is a unit vector. In the second constraint, ‖v‖₀ represents the ℓ0 pseudo-norm of v, which is defined as the number of its non-zero components. So the second constraint specifies that the number of non-zero components in v is less than or equal to k, which is typically an integer much smaller than the dimension p. The optimal value of Eq. 1 is known as the k-sparse largest eigenvalue.

If one takes k = p, the problem reduces to ordinary PCA, and the optimal value becomes the largest eigenvalue of the covariance matrix Σ.
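
For small p, the definition in Eq. 1 can be checked directly by brute force: enumerate every support of size k and take the largest eigenvalue of the corresponding k×k principal submatrix of Σ. The sketch below (Python/NumPy; the helper name is ours and the approach is purely illustrative, since the number of supports grows combinatorially) also shows that k = p recovers ordinary PCA.

```python
import itertools
import numpy as np

def k_sparse_largest_eigenvalue(Sigma, k):
    """Solve Eq. 1 exactly by enumerating all supports of size k (small p only)."""
    p = Sigma.shape[0]
    best_val, best_v = -np.inf, None
    for support in itertools.combinations(range(p), k):
        sub = Sigma[np.ix_(support, support)]        # k x k principal submatrix
        vals, vecs = np.linalg.eigh(sub)
        if vals[-1] > best_val:
            best_val = vals[-1]
            best_v = np.zeros(p)
            best_v[list(support)] = vecs[:, -1]      # zero loadings outside the support
    return best_val, best_v

# Sanity check: with k = p the optimal value is the largest eigenvalue of Sigma.
# val, v = k_sparse_largest_eigenvalue(Sigma, Sigma.shape[0])
```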

After finding the optimal solution v, one deflates Σ to obtain a new matrix

$$\Sigma_{1} = \Sigma - (v^{T}\Sigma v)\, v v^{T}$$

and iterates this process to obtain further principal components. However, unlike PCA, sparse PCA cannot guarantee that different principal components are orthogonal. In order to achieve orthogonality, additional constraints must be enforced.
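
A one-line sketch of the deflation step above (assuming `Sigma` is the current covariance matrix and `v` is the unit-norm sparse direction just found):

```python
import numpy as np

def deflate(Sigma, v):
    """Hotelling-style deflation: Sigma' = Sigma - (v^T Sigma v) v v^T."""
    return Sigma - (v @ Sigma @ v) * np.outer(v, v)
```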

The following equivalent definition is in matrix form. Let V be a p×p symmetric matrix; then the sparse PCA problem can be rewritten as

$$\max_{V}\; \operatorname{Tr}(\Sigma V) \quad \text{subject to} \quad \operatorname{Tr}(V)=1,\;\; \|V\|_{0}\le k^{2},\;\; \operatorname{rank}(V)=1,\;\; V\succeq 0. \qquad \text{(Eq. 2)}$$

Here Tr is the matrix trace, and ‖V‖₀ counts the non-zero elements of the matrix V. The last two constraints specify that V has matrix rank one and is positive semidefinite, which together mean that V = vv^T for some unit vector v, so Eq. 2 is equivalent to Eq. 1.
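
To see the equivalence, note that for a rank-one positive semidefinite matrix V = vv^T the objective and constraints of Eq. 2 translate directly into those of Eq. 1:

$$\operatorname{Tr}(\Sigma V) = \operatorname{Tr}(\Sigma\, v v^{T}) = v^{T}\Sigma v, \qquad \operatorname{Tr}(V) = \|v\|_{2}^{2} = 1, \qquad \|V\|_{0} = \|v\|_{0}^{2} \le k^{2}.$$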

Moreover, the rank constraint in this formulation is actually redundant, and therefore sparse PCA can be cast as the following mixed-integer semidefinite program: [1]

$$\max_{V,\,z}\; \operatorname{Tr}(\Sigma V) \quad \text{subject to} \quad \operatorname{Tr}(V)=1,\;\; |V_{i,j}|\le z_{i}\;\; \forall\, i,j,\;\; \textstyle\sum_{i=1}^{p} z_{i}\le k,\;\; z\in\{0,1\}^{p},\;\; V\succeq 0. \qquad \text{(Eq. 3)}$$

Because of the cardinality constraint, the maximization problem is hard to solve exactly, especially when dimension p is high. In fact, the sparse PCA problem in Eq. 1 is NP-hard in the strong sense. [2]

Computational considerations

As with most sparse problems, variable selection in SPCA is a computationally intractable non-convex NP-hard problem, [3] so greedy sub-optimal algorithms are required to find solutions.
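
One simple example of such a heuristic is a truncated power iteration: repeatedly multiply by Σ, keep only the k largest-magnitude coordinates, and renormalize. The sketch below is a generic illustration of this idea and is not tied to any specific reference cited in this article.

```python
import numpy as np

def truncated_power_iteration(Sigma, k, n_iter=200, seed=0):
    """Greedy heuristic for Eq. 1: power iteration with hard thresholding to k entries."""
    p = Sigma.shape[0]
    rng = np.random.default_rng(seed)
    v = rng.normal(size=p)
    v /= np.linalg.norm(v)
    for _ in range(n_iter):
        w = Sigma @ v
        w[np.argsort(np.abs(w))[:-k]] = 0.0      # zero out all but the k largest entries
        norm = np.linalg.norm(w)
        if norm == 0.0:
            break
        v = w / norm
    return v                                      # unit vector with at most k non-zeros
```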

Algorithms for SPCA

Several alternative approaches for solving Eq. 1 have been proposed, including

  - a regression framework, [4]
  - an alternative sparse-basis formulation, [5]
  - a convex relaxation/semidefinite programming framework, [6]
  - a generalized power method framework, [7]
  - an alternating maximization framework, [8]
  - forward-backward greedy search and exact methods using branch-and-bound techniques, [9]
  - a certifiably optimal branch-and-bound approach, [10]
  - a Bayesian formulation framework, [11]
  - and a certifiably optimal mixed-integer semidefinite branch-and-cut approach. [1]

The methodological and theoretical developments of sparse PCA as well as its applications in scientific studies have recently been reviewed in a survey paper. [12]

Notes on Semidefinite Programming Relaxation

It has been proposed that sparse PCA can be approximated by semidefinite programming (SDP). [6] If one drops the rank constraint and relaxes the cardinality constraint by a 1-norm convex constraint, one gets a semidefinite programming relaxation, which can be solved efficiently in polynomial time:

$$\max_{V}\; \operatorname{Tr}(\Sigma V) \quad \text{subject to} \quad \operatorname{Tr}(V)=1,\;\; \mathbf{1}^{T}|V|\,\mathbf{1}\le k,\;\; V\succeq 0. \qquad \text{(Eq. 4)}$$

In the second constraint, 𝟏 is a p×1 vector of ones, and |V| is the matrix whose elements are the absolute values of the elements of V.

The optimal solution V to the relaxed problem Eq. 4 is not guaranteed to have rank one. In that case, V can be truncated to retain only its dominant eigenvector.
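
A minimal sketch of this relaxation using the CVXPY modeling library (assuming CVXPY and a compatible SDP solver such as SCS are installed; `Sigma` is the p×p covariance matrix and `k` the target cardinality):

```python
import cvxpy as cp
import numpy as np

def sparse_pca_sdp_relaxation(Sigma, k):
    """Solve the SDP relaxation (Eq. 4) and truncate to the dominant eigenvector."""
    p = Sigma.shape[0]
    V = cp.Variable((p, p), symmetric=True)
    constraints = [
        V >> 0,                          # V is positive semidefinite
        cp.trace(V) == 1,                # unit trace
        cp.sum(cp.abs(V)) <= k,          # 1^T |V| 1 <= k, the relaxed cardinality constraint
    ]
    problem = cp.Problem(cp.Maximize(cp.trace(Sigma @ V)), constraints)
    problem.solve()
    # The optimal V need not be rank one; keep only its dominant eigenvector.
    _, eigvecs = np.linalg.eigh(V.value)
    return eigvecs[:, -1]
```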

While the semidefinite program does not scale beyond n = 300 covariates, it has been shown that a second-order cone relaxation of the semidefinite relaxation is almost as tight and successfully solves problems with n in the thousands of covariates. [13]

Applications

Financial Data Analysis

Suppose ordinary PCA is applied to a dataset where each input variable represents a different asset; it may generate principal components that are weighted combinations of all the assets. In contrast, sparse PCA would produce principal components that are weighted combinations of only a few input assets, so one can easily interpret their meaning. Furthermore, if one uses a trading strategy based on these principal components, fewer assets imply lower transaction costs.

Biology

Consider a dataset where each input variable corresponds to a specific gene. Sparse PCA can produce a principal component that involves only a few genes, so researchers can focus on these specific genes for further analysis.

High-dimensional Hypothesis Testing

Contemporary datasets often have the number of input variables (p) comparable with or even much larger than the number of samples (n). It has been shown that if p/n does not converge to zero, the classical PCA is not consistent. In other words, if we let k = p in Eq. 1, then the optimal value does not converge to the largest eigenvalue of the data population when the sample size n → ∞, and the optimal solution does not converge to the direction of maximum variance. But sparse PCA can retain consistency even if p ≫ n.

The k-sparse largest eigenvalue (the optimal value of Eq. 1) can be used to discriminate an isometric model, where every direction has the same variance, from a spiked covariance model in the high-dimensional setting. [14] Consider a hypothesis test where the null hypothesis specifies that data are generated from a multivariate normal distribution with mean 0 and covariance equal to the identity matrix, and the alternative hypothesis specifies that data are generated from a spiked model with signal strength θ:

$$H_{0}:\; x \sim \mathcal{N}(0, I_{p}), \qquad H_{1}:\; x \sim \mathcal{N}\!\left(0,\; I_{p} + \theta\, v v^{T}\right),$$

where v has only k non-zero coordinates. The largest k-sparse eigenvalue can discriminate the two hypotheses if and only if the signal strength θ is at least of order √(k log p / n).

Since computing the k-sparse eigenvalue is NP-hard, one can approximate it by the optimal value of the semidefinite programming relaxation (Eq. 4). In that case, one can discriminate the two hypotheses if θ is at least of order √(k² log p / n). The additional √k factor cannot be improved by any other polynomial-time algorithm if the planted clique conjecture holds.
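
A small simulation sketch of this test (Python/NumPy, brute force over supports, so only feasible for small p; all dimensions and the signal strength θ below are arbitrary illustrative choices):

```python
import itertools
import numpy as np

def k_sparse_eig(Sigma, k):
    # Largest eigenvalue over all k x k principal submatrices (Eq. 1 by brute force).
    p = Sigma.shape[0]
    return max(np.linalg.eigvalsh(Sigma[np.ix_(s, s)])[-1]
               for s in itertools.combinations(range(p), k))

rng = np.random.default_rng(0)
n, p, k, theta = 200, 15, 3, 2.0                  # illustrative values only

v = np.zeros(p)
v[:k] = 1.0 / np.sqrt(k)                          # k-sparse unit spike direction

cov_null = np.eye(p)                              # H0: isotropic covariance
cov_spiked = np.eye(p) + theta * np.outer(v, v)   # H1: spiked covariance

for name, cov in [("H0", cov_null), ("H1", cov_spiked)]:
    X = rng.multivariate_normal(np.zeros(p), cov, size=n)
    S = X.T @ X / n
    print(name, k_sparse_eig(S, k))               # statistic tends to be larger under H1
```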

Software/source code

  - amanpg – R package for sparse PCA via an alternating manifold proximal gradient method [15]
  - elasticnet – R package for sparse estimation and sparse PCA using elastic nets [16]
  - epca – R package for exploratory principal component analysis [17]
  - nsprcomp – R package for non-negative and/or sparse PCA [18]
  - scikit-learn – Python library for machine learning which contains a SparsePCA implementation in its decomposition module [19]
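
As a usage illustration, scikit-learn's SparsePCA estimator (which enforces sparsity through an ℓ1 penalty rather than the exact cardinality constraint of Eq. 1) can be used as follows; the data and parameter values are arbitrary placeholders:

```python
import numpy as np
from sklearn.decomposition import SparsePCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))            # 100 samples, 10 input variables

spca = SparsePCA(n_components=2, alpha=1.0, random_state=0)
scores = spca.fit_transform(X)            # sample scores on the sparse components
print(spca.components_)                   # loadings; many entries are exactly zero
```
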
Notes

  1. The term loadings is inappropriately used for these vectors, which should be called coefficients instead. The naming comes from the term loadings used in factor analysis to designate the values generating the common covariance matrix. Since in standard PCA the loadings are equal to the coefficients, the term loadings has been used for the coefficients. This is quite unfortunate, because in SPCA the coefficients are not equal to the loadings.



References

  1. Dimitris Bertsimas; Ryan Cory-Wright; Jean Pauphilet (2020). "Solving Large-Scale Sparse PCA to Certifiable (Near) Optimality". arXiv:2005.05195 [math.OC].
  2. Andreas M. Tillmann; Marc E. Pfetsch (2013). "The Computational Complexity of the Restricted Isometry Property, the Nullspace Property, and Related Concepts in Compressed Sensing". IEEE Transactions on Information Theory. 60 (2): 1248–1259. arXiv:1205.2081. CiteSeerX 10.1.1.760.2559. doi:10.1109/TIT.2013.2290112. S2CID 2788088.
  3. Moghaddam, Baback; Weiss, Yair; Avidan, Shai (2007-09-07). Schölkopf, Bernhard; Platt, John; Hofmann, Thomas (eds.). Advances in Neural Information Processing Systems 19: Proceedings of the 2006 Conference. The MIT Press. pp. 915–922. doi:10.7551/mitpress/7503.001.0001. ISBN   978-0-262-25691-9.
  4. Hui Zou; Trevor Hastie; Robert Tibshirani (2006). "Sparse principal component analysis" (PDF). Journal of Computational and Graphical Statistics . 15 (2): 262–286. CiteSeerX   10.1.1.62.580 . doi:10.1198/106186006x113430. S2CID   5730904.
  5. Fan Chen; Karl Rohe (2021). "A New Basis for Sparse Principal Component Analysis". Journal of Computational and Graphical Statistics . arXiv: 2007.00596 . doi:10.1080/10618600.2023.2256502.
  6. Alexandre d'Aspremont; Laurent El Ghaoui; Michael I. Jordan; Gert R. G. Lanckriet (2007). "A Direct Formulation for Sparse PCA Using Semidefinite Programming" (PDF). SIAM Review. 49 (3): 434–448. arXiv:cs/0406021. doi:10.1137/050645506. S2CID 5490061.
  7. Michel Journee; Yurii Nesterov; Peter Richtarik; Rodolphe Sepulchre (2010). "Generalized Power Method for Sparse Principal Component Analysis" (PDF). Journal of Machine Learning Research . 11: 517–553. arXiv: 0811.4724 . Bibcode:2008arXiv0811.4724J. CORE Discussion Paper 2008/70.
  8. Peter Richtarik; Majid Jahani; S. Damla Ahipasaoglu; Martin Takac (2021). "Alternating Maximization: Unifying Framework for 8 Sparse PCA Formulations and Efficient Parallel Codes". Optimization and Engineering . 22 (3): 1493–1519. arXiv: 1212.4137 . doi:10.1007/s11081-020-09562-3. S2CID   2549610.
  9. Baback Moghaddam; Yair Weiss; Shai Avidan (2005). "Spectral Bounds for Sparse PCA: Exact and Greedy Algorithms" (PDF). Advances in Neural Information Processing Systems. Vol. 18. MIT Press.
  10. Lauren Berk; Dimitris Bertsimas (2019). "Certifiably optimal sparse principal component analysis". Mathematical Programming Computation . Springer. 11 (3): 381–420. doi:10.1007/s12532-018-0153-6. hdl: 1721.1/131566 . S2CID   126998398.
  11. Yue Guan; Jennifer Dy (2009). "Sparse Probabilistic Principal Component Analysis" (PDF). Journal of Machine Learning Research Workshop and Conference Proceedings . 5: 185.
  12. Hui Zou; Lingzhou Xue (2018). "A Selective Overview of Sparse Principal Component Analysis". Proceedings of the IEEE . 106 (8): 1311–1320. doi: 10.1109/jproc.2018.2846588 .
  13. Dimitris Bertsimas; Ryan Cory-Wright (2020). "On polyhedral and second-order cone decompositions of semidefinite optimization problems". Operations Research Letters . Elsevier. 48 (1): 78–85. arXiv: 1910.03143 . doi: 10.1016/j.orl.2019.12.003 .
  14. Quentin Berthet; Philippe Rigollet (2013). "Optimal Detection of Sparse Principal Components in High Dimension". Annals of Statistics . 41 (1): 1780–1815. arXiv: 1202.5070 . doi:10.1214/13-aos1127. S2CID   7162068.
  15. amanpg R package: https://cran.r-project.org/web/packages/amanpg/index.html
  16. elasticnet R package: https://cran.r-project.org/web/packages/elasticnet/index.html
  17. epca R package: https://cran.r-project.org/web/packages/epca/index.html
  18. nsprcomp R package: https://cran.r-project.org/web/packages/nsprcomp/index.html
  19. scikit-learn SparsePCA documentation: http://scikit-learn.org/stable/modules/generated/sklearn.decomposition.SparsePCA.html