Sparse principal component analysis (SPCA or sparse PCA) is a technique used in statistical analysis and, in particular, in the analysis of multivariate data sets. It extends the classic method of principal component analysis (PCA) for the reduction of dimensionality of data by introducing sparsity structures to the input variables.
A particular disadvantage of ordinary PCA is that the principal components are usually linear combinations of all input variables. SPCA overcomes this disadvantage by finding components that are linear combinations of just a few input variables (SPCs). This means that some of the coefficients of the linear combinations defining the SPCs, called loadings, [note 1] are equal to zero. The number of nonzero loadings is called the cardinality of the SPC.
Consider a data matrix, , where each of the columns represent an input variable, and each of the rows represents an independent sample from data population. One assumes each column of has mean zero, otherwise one can subtract column-wise mean from each element of . Let be the empirical covariance matrix of , which has dimension .
Given an integer with , the sparse PCA problem can be formulated as maximizing the variance along a direction represented by vector while constraining its cardinality:
The first constraint specifies that v is a unit vector. In the second constraint, represents the pseudo-norm of v, which is defined as the number of its non-zero components. So the second constraint specifies that the number of non-zero components in v is less than or equal to k, which is typically an integer that is much smaller than dimension p. The optimal value of Eq. 1 is known as the k-sparse largest eigenvalue.
If one takes k=p, the problem reduces to the ordinary PCA, and the optimal value becomes the largest eigenvalue of covariance matrix Σ.
After finding the optimal solution v, one deflates Σ to obtain a new matrix
and iterate this process to obtain further principal components. However, unlike PCA, sparse PCA cannot guarantee that different principal components are orthogonal. In order to achieve orthogonality, additional constraints must be enforced.
The following equivalent definition is in matrix form. Let be a p×p symmetric matrix, one can rewrite the sparse PCA problem as
Tr is the matrix trace, and represents the non-zero elements in matrix V. The last line specifies that V has matrix rank one and is positive semidefinite. The last line means that one has , so Eq. 2 is equivalent to Eq. 1 .
Moreover, the rank constraint in this formulation is actually redundant, and therefore sparse PCA can be cast as the following mixed-integer semidefinite program [1]
Because of the cardinality constraint, the maximization problem is hard to solve exactly, especially when dimension p is high. In fact, the sparse PCA problem in Eq. 1 is NP-hard in the strong sense. [2]
As most sparse problems, variable selection in SPCA is a computationally intractable nonconvex NP-hard problem, [3] therefore greedy sub-optimal algorithms are required to find solutions.
Several alternative approaches (of Eq. 1 ) have been proposed, including
The methodological and theoretical developments of Sparse PCA as well as its applications in scientific studies are recently reviewed in a survey paper. [12]
It has been proposed that sparse PCA can be approximated by semidefinite programming (SDP). [6] If one drops the rank constraint and relaxes the cardinality constraint by a 1-norm convex constraint, one gets a semidefinite programming relaxation, which can be solved efficiently in polynomial time:
In the second constraint, is a p×1 vector of ones, and |V| is the matrix whose elements are the absolute values of the elements of V.
The optimal solution to the relaxed problem Eq. 3 is not guaranteed to have rank one. In that case, can be truncated to retain only the dominant eigenvector.
While the semidefinite program does not scale beyond n=300 covariates, it has been shown that a second-order cone relaxation of the semidefinite relaxation is almost as tight and successfully solves problems with n=1000s of covariates [13]
Suppose ordinary PCA is applied to a dataset where each input variable represents a different asset, it may generate principal components that are weighted combination of all the assets. In contrast, sparse PCA would produce principal components that are weighted combination of only a few input assets, so one can easily interpret its meaning. Furthermore, if one uses a trading strategy based on these principal components, fewer assets imply less transaction costs.
Consider a dataset where each input variable corresponds to a specific gene. Sparse PCA can produce a principal component that involves only a few genes, so researchers can focus on these specific genes for further analysis.
Contemporary datasets often have the number of input variables () comparable with or even much larger than the number of samples (). It has been shown that if does not converge to zero, the classical PCA is not consistent. In other words, if we let in Eq. 1 , then the optimal value does not converge to the largest eigenvalue of data population when the sample size , and the optimal solution does not converge to the direction of maximum variance. But sparse PCA can retain consistency even if
The k-sparse largest eigenvalue (the optimal value of Eq. 1 ) can be used to discriminate an isometric model, where every direction has the same variance, from a spiked covariance model in high-dimensional setting. [14] Consider a hypothesis test where the null hypothesis specifies that data are generated from a multivariate normal distribution with mean 0 and covariance equal to an identity matrix, and the alternative hypothesis specifies that data is generated from a spiked model with signal strength :
where has only k non-zero coordinates. The largest k-sparse eigenvalue can discriminate the two hypotheses if and only if .
Since computing k-sparse eigenvalue is NP-hard, one can approximate it by the optimal value of semidefinite programming relaxation ( Eq. 3 ). If that case, we can discriminate the two hypotheses if . The additional term cannot be improved by any other polynomial time algorithm if the planted clique conjecture holds.
In mathematics, a symmetric matrix with real entries is positive-definite if the real number is positive for every nonzero real column vector where is the row vector transpose of More generally, a Hermitian matrix is positive-definite if the real number is positive for every nonzero complex column vector where denotes the conjugate transpose of
In probability theory and statistics, the multivariate normal distribution, multivariate Gaussian distribution, or joint normal distribution is a generalization of the one-dimensional (univariate) normal distribution to higher dimensions. One definition is that a random vector is said to be k-variate normally distributed if every linear combination of its k components has a univariate normal distribution. Its importance derives mainly from the multivariate central limit theorem. The multivariate normal distribution is often used to describe, at least approximately, any set of (possibly) correlated real-valued random variables, each of which clusters around a mean value.
Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.
Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. For example, it is possible that variations in six observed variables mainly reflect the variations in two unobserved (underlying) variables. Factor analysis searches for such joint variations in response to unobserved latent variables. The observed variables are modelled as linear combinations of the potential factors plus "error" terms, hence factor analysis can be thought of as a special case of errors-in-variables models.
In statistics, canonical-correlation analysis (CCA), also called canonical variates analysis, is a way of inferring information from cross-covariance matrices. If we have two vectors X = (X1, ..., Xn) and Y = (Y1, ..., Ym) of random variables, and there are correlations among the variables, then canonical-correlation analysis will find linear combinations of X and Y that have a maximum correlation with each other. T. R. Knapp notes that "virtually all of the commonly encountered parametric tests of significance can be treated as special cases of canonical-correlation analysis, which is the general procedure for investigating the relationships between two sets of variables." The method was first introduced by Harold Hotelling in 1936, although in the context of angles between flats the mathematical concept was published by Camille Jordan in 1875.
Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.
In mathematics, in particular functional analysis, the singular values of a compact operator acting between Hilbert spaces and , are the square roots of the eigenvalues of the self-adjoint operator .
In statistics, the theory of minimum norm quadratic unbiased estimation (MINQUE) was developed by C. R. Rao. MINQUE is a theory alongside other estimation methods in estimation theory, such as the method of moments or maximum likelihood estimation. Similar to the theory of best linear unbiased estimation, MINQUE is specifically concerned with linear regression models. The method was originally conceived to estimate heteroscedastic error variance in multiple linear regression. MINQUE estimators also provide an alternative to maximum likelihood estimators or restricted maximum likelihood estimators for variance components in mixed effects models. MINQUE estimators are quadratic forms of the response variable and are used to estimate a linear function of the variances.
In the field of mathematics, norms are defined for elements within a vector space. Specifically, when the vector space comprises matrices, such norms are referred to as matrix norms. Matrix norms differ from vector norms in that they must also interact with matrix multiplication.
Semidefinite programming (SDP) is a subfield of mathematical programming concerned with the optimization of a linear objective function over the intersection of the cone of positive semidefinite matrices with an affine space, i.e., a spectrahedron.
In the field of multivariate statistics, kernel principal component analysis (kernel PCA) is an extension of principal component analysis (PCA) using techniques of kernel methods. Using a kernel, the originally linear operations of PCA are performed in a reproducing kernel Hilbert space.
A second-order cone program (SOCP) is a convex optimization problem of the form
In statistics, principal component regression (PCR) is a regression analysis technique that is based on principal component analysis (PCA). More specifically, PCR is used for estimating the unknown regression coefficients in a standard linear regression model.
In statistical theory, the field of high-dimensional statistics studies data whose dimension is larger than typically considered in classical multivariate analysis. The area arose owing to the emergence of many modern data sets in which the dimension of the data vectors may be comparable to, or even larger than, the sample size, so that justification for the use of traditional techniques, often based on asymptotic arguments with the dimension held fixed as the sample size increased, was lacking.
A sum-of-squares optimization program is an optimization problem with a linear cost function and a particular type of constraint on the decision variables. These constraints are of the form that when the decision variables are used as coefficients in certain polynomials, those polynomials should have the polynomial SOS property. When fixing the maximum degree of the polynomials involved, sum-of-squares optimization is also known as the Lasserre hierarchy of relaxations in semidefinite programming.
For certain applications in linear algebra, it is useful to know properties of the probability distribution of the largest eigenvalue of a finite sum of random matrices. Suppose is a finite sequence of random matrices. Analogous to the well-known Chernoff bound for sums of scalars, a bound on the following is sought for a given parameter t:
In mathematics, low-rank approximation is a minimization problem, in which the cost function measures the fit between a given matrix and an approximating matrix, subject to a constraint that the approximating matrix has reduced rank. The problem is used for mathematical modeling and data compression. The rank constraint is related to a constraint on the complexity of a model that fits the data. In applications, often there are other constraints on the approximating matrix apart from the rank constraint, e.g., non-negativity and Hankel structure.
Matrix completion is the task of filling in the missing entries of a partially observed matrix, which is equivalent to performing data imputation in statistics. A wide range of datasets are naturally organized in matrix form. One example is the movie-ratings matrix, as appears in the Netflix problem: Given a ratings matrix in which each entry represents the rating of movie by customer , if customer has watched movie and is otherwise missing, we would like to predict the remaining entries in order to make good recommendations to customers on what to watch next. Another example is the document-term matrix: The frequencies of words used in a collection of documents can be represented as a matrix, where each entry corresponds to the number of times the associated term appears in the indicated document.
Functional principal component analysis (FPCA) is a statistical method for investigating the dominant modes of variation of functional data. Using this method, a random function is represented in the eigenbasis, which is an orthonormal basis of the Hilbert space L2 that consists of the eigenfunctions of the autocovariance operator. FPCA represents functional data in the most parsimonious way, in the sense that when using a fixed number of basis functions, the eigenfunction basis explains more variation than any other basis expansion. FPCA can be applied for representing random functions, or in functional regression and classification.
L1-norm principal component analysis (L1-PCA) is a general method for multivariate data analysis. L1-PCA is often preferred over standard L2-norm principal component analysis (PCA) when the analyzed data may contain outliers.