L1-norm principal component analysis

Figure: L1-PCA compared with PCA. Nominal data (blue points); outlier (red point); PC (black line); L1-PC (red line); nominal maximum-variance line (dotted line).

L1-norm principal component analysis (L1-PCA) is a general method for multivariate data analysis. [1] L1-PCA is often preferred over standard L2-norm principal component analysis (PCA) when the analyzed data may contain outliers (faulty values or corruptions). [2] [3] [4]


Both L1-PCA and standard PCA seek a collection of orthogonal directions (principal components) that define a subspace wherein data representation is maximized according to the selected criterion. [5] [6] [7] Standard PCA quantifies data representation as the aggregate of the squared L2-norms of the data-point projections onto the subspace or, equivalently, minimizes the aggregate squared Euclidean distance of the original points from their subspace-projected representations. L1-PCA uses instead the aggregate of the L1-norms of the data-point projections onto the subspace. [8] In PCA and L1-PCA, the number of principal components (PCs) is lower than the rank of the analyzed matrix, which coincides with the dimensionality of the space spanned by the original data points. Therefore, PCA and L1-PCA are commonly employed for dimensionality reduction, for the purpose of data denoising or compression. Among the advantages of standard PCA that have contributed to its high popularity are its low-cost computational implementation by means of singular-value decomposition (SVD) [9] and its statistical optimality when the data set is generated by a true multivariate normal data source.
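
As an illustration of the SVD computation mentioned above, the following minimal sketch (in Python with NumPy; the toy data and variable names are illustrative assumptions, not taken from the cited sources) obtains K standard principal components as the K dominant left singular vectors of a column-wise data matrix; mean-centering is omitted to match the formulation used later in this article.

    import numpy as np

    # Toy data matrix: N = 5 two-dimensional points stored as columns (D x N),
    # matching the convention used in the Formulation section below.
    X = np.array([[1.0, 2.0, 3.0, 4.0, 5.0],
                  [1.1, 1.9, 3.2, 3.8, 5.1]])
    K = 1  # number of principal components sought

    # Standard (L2) PCA via the SVD: keep the K dominant left singular vectors.
    U, S, Vt = np.linalg.svd(X, full_matrices=False)
    Q_pca = U[:, :K]

    # Aggregate representation captured by the subspace, ||X^T Q||_F.
    print(np.linalg.norm(X.T @ Q_pca))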

However, in modern big data sets, data often include corrupted, faulty points, commonly referred to as outliers. [10] Standard PCA is known to be sensitive to outliers, even when they appear as a small fraction of the processed data. [11] The reason is that the L2-norm formulation of L2-PCA places squared emphasis on the magnitude of each coordinate of each data point, ultimately overemphasizing peripheral points, such as outliers. On the other hand, following an L1-norm formulation, L1-PCA places linear emphasis on the coordinates of each data point, effectively restraining outliers. [12]

Formulation

Consider any data matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ consisting of $N$ $D$-dimensional data points. Define $r = \operatorname{rank}(\mathbf{X})$. For integer $K$ such that $1 \leq K \leq r$, L1-PCA is formulated as: [1]

$$
\max_{\mathbf{Q} \in \mathbb{R}^{D \times K},\ \mathbf{Q}^\top \mathbf{Q} = \mathbf{I}_K} \left\| \mathbf{X}^\top \mathbf{Q} \right\|_1 \qquad (1)
$$

For $K = 1$, (1) simplifies to finding the L1-norm principal component (L1-PC) of $\mathbf{X}$ by

$$
\max_{\mathbf{q} \in \mathbb{R}^{D},\ \|\mathbf{q}\|_2 = 1} \left\| \mathbf{X}^\top \mathbf{q} \right\|_1 \qquad (2)
$$

In (1)-(2), the L1-norm $\|\cdot\|_1$ returns the sum of the absolute entries of its argument and the L2-norm $\|\cdot\|_2$ returns the square root of the sum of the squared entries of its argument. If one substitutes $\|\mathbf{X}^\top \mathbf{Q}\|_1$ in (1) by the Frobenius/L2-norm $\|\mathbf{X}^\top \mathbf{Q}\|_F$, then the problem becomes standard PCA and it is solved by the matrix $\mathbf{Q}$ that contains the $K$ dominant left singular vectors of $\mathbf{X}$ (i.e., the singular vectors that correspond to the $K$ highest singular values).

The maximization metric in (2) can be expanded as

$$
\left\| \mathbf{X}^\top \mathbf{q} \right\|_1 = \sum_{n=1}^{N} \left| \mathbf{x}_n^\top \mathbf{q} \right| = \max_{\mathbf{b} \in \{\pm 1\}^{N}} \mathbf{b}^\top \mathbf{X}^\top \mathbf{q} \qquad (3)
$$
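
For a small example, the identity in (3) can be verified numerically; the sketch below (random toy data and illustrative names) enumerates every antipodal binary vector b for N = 6 and compares the maximum of b^T X^T q with ||X^T q||_1.

    import numpy as np
    from itertools import product

    rng = np.random.default_rng(0)
    D, N = 2, 6
    X = rng.standard_normal((D, N))   # D x N data matrix, as in the Formulation
    q = rng.standard_normal(D)
    q /= np.linalg.norm(q)            # unit-norm direction

    lhs = np.sum(np.abs(X.T @ q))     # ||X^T q||_1

    # Right-hand side of (3): maximize b^T (X^T q) over all b in {-1, +1}^N.
    rhs = max(np.array(b) @ (X.T @ q) for b in product([-1, 1], repeat=N))

    print(np.isclose(lhs, rhs))       # True; the maximizing b is sgn(X^T q)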

Solution

For any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $m \geq n$, define $\Phi(\mathbf{A})$ as the nearest (in the Frobenius/L2-norm sense) matrix to $\mathbf{A}$ that has orthonormal columns. That is, define

$$
\Phi(\mathbf{A}) = \underset{\mathbf{Q} \in \mathbb{R}^{m \times n},\ \mathbf{Q}^\top \mathbf{Q} = \mathbf{I}_n}{\operatorname{argmin}} \left\| \mathbf{A} - \mathbf{Q} \right\|_F \qquad (4)
$$

The Procrustes theorem [13] [14] states that, if $\mathbf{A}$ has the (thin) SVD $\mathbf{A} = \mathbf{U}_{m \times n} \boldsymbol{\Sigma}_{n \times n} \mathbf{V}_{n \times n}^\top$, then $\Phi(\mathbf{A}) = \mathbf{U} \mathbf{V}^\top$.
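
A minimal sketch of $\Phi(\cdot)$ computed via the SVD, as stated by the Procrustes theorem above (the function name phi and the test matrix are illustrative):

    import numpy as np

    def phi(A):
        # Nearest matrix with orthonormal columns: Phi(A) = U V^T from the SVD of A.
        U, _, Vt = np.linalg.svd(A, full_matrices=False)
        return U @ Vt

    rng = np.random.default_rng(1)
    A = rng.standard_normal((5, 2))          # m x n with m >= n
    Q = phi(A)
    print(np.allclose(Q.T @ Q, np.eye(2)))   # columns of Phi(A) are orthonormal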

Markopoulos, Karystinos, and Pados [1] showed that, if $\mathbf{B}_{\text{BNM}}$ is the exact solution to the binary nuclear-norm maximization (BNM) problem

$$
\mathbf{B}_{\text{BNM}} = \underset{\mathbf{B} \in \{\pm 1\}^{N \times K}}{\operatorname{argmax}} \left\| \mathbf{X} \mathbf{B} \right\|_* \qquad (5)
$$

then

$$
\mathbf{Q}_{L1} = \Phi\left( \mathbf{X} \mathbf{B}_{\text{BNM}} \right) \qquad (6)
$$

is the exact solution to L1-PCA in (1). The nuclear norm $\|\cdot\|_*$ in (5) returns the summation of the singular values of its matrix argument and can be calculated by means of standard SVD. Moreover, it holds that, given the solution $\mathbf{Q}_{L1}$ to L1-PCA in (1), the solution to BNM can be obtained as

$$
\mathbf{B}_{\text{BNM}} = \operatorname{sgn}\left( \mathbf{X}^\top \mathbf{Q}_{L1} \right) \qquad (7)
$$

where $\operatorname{sgn}(\cdot)$ returns the $\pm 1$ sign matrix of its matrix argument (with no loss of generality, we can consider $\operatorname{sgn}(0) = 1$). In addition, it follows that $\|\mathbf{X}^\top \mathbf{Q}_{L1}\|_1 = \|\mathbf{X} \mathbf{B}_{\text{BNM}}\|_*$. BNM in (5) is a combinatorial problem over antipodal binary variables, so its exact solution can be found through exhaustive evaluation of all $2^{NK}$ elements of its feasibility set, with asymptotic cost $\mathcal{O}(2^{NK})$. Accordingly, L1-PCA can also be solved, through BNM, with cost $\mathcal{O}(2^{NK})$ (exponential in the product of the number of data points with the number of the sought-after components); for example, for $N = 10$ points and $K = 2$ components, exhaustive search already evaluates $2^{20} \approx 10^6$ candidate matrices. It turns out that L1-PCA can be solved optimally (exactly) with complexity that is polynomial in $N$ for fixed data dimension $D$. [1]

For the special case of $K = 1$ (single L1-PC of $\mathbf{X}$), BNM takes the binary-quadratic-maximization (BQM) form

$$
\mathbf{b}_{\text{BNM}} = \underset{\mathbf{b} \in \{\pm 1\}^{N}}{\operatorname{argmax}} \; \mathbf{b}^\top \mathbf{X}^\top \mathbf{X} \mathbf{b} \qquad (8)
$$

The transition from (5) to (8) for $K = 1$ holds true because, for every $\mathbf{b} \in \{\pm 1\}^N$, the unique singular value of $\mathbf{X} \mathbf{b}$ is equal to $\|\mathbf{X} \mathbf{b}\|_2 = \sqrt{\mathbf{b}^\top \mathbf{X}^\top \mathbf{X} \mathbf{b}}$. Then, if $\mathbf{b}_{\text{BNM}}$ is the solution to BQM in (8), it holds that

$$
\mathbf{q}_{L1} = \frac{\mathbf{X} \mathbf{b}_{\text{BNM}}}{\left\| \mathbf{X} \mathbf{b}_{\text{BNM}} \right\|_2} \qquad (9)
$$

is the exact L1-PC of $\mathbf{X}$, as defined in (2). In addition, it holds that $\mathbf{b}_{\text{BNM}} = \operatorname{sgn}(\mathbf{X}^\top \mathbf{q}_{L1})$ and $\|\mathbf{X}^\top \mathbf{q}_{L1}\|_1 = \|\mathbf{X} \mathbf{b}_{\text{BNM}}\|_2$.
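
For very small N, the single L1-PC can be computed exactly by enumerating all 2^N antipodal binary vectors, following (8)-(9). The sketch below (illustrative function and variable names) also verifies the two relations stated in the previous sentence.

    import numpy as np
    from itertools import product

    def l1pc_exhaustive(X):
        # Exact single L1-PC: solve the BQM in (8) by exhaustive search, then apply (9).
        N = X.shape[1]
        G = X.T @ X                                # N x N Gram matrix
        b_best = max((np.array(b) for b in product([-1, 1], repeat=N)),
                     key=lambda b: b @ G @ b)
        q = X @ b_best
        return q / np.linalg.norm(q), b_best

    rng = np.random.default_rng(2)
    X = rng.standard_normal((3, 8))                # toy example: D = 3, N = 8
    q, b = l1pc_exhaustive(X)

    print(np.array_equal(b, np.sign(X.T @ q)))                         # b = sgn(X^T q)
    print(np.isclose(np.sum(np.abs(X.T @ q)), np.linalg.norm(X @ b)))  # ||X^T q||_1 = ||X b||_2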

Algorithms

Exact solution of exponential complexity

As shown above, the exact solution to L1-PCA can be obtained by the following two-step process:

1. Solve the BNM problem in (5) to obtain $\mathbf{B}_{\text{BNM}}$.
2. Apply SVD on $\mathbf{X} \mathbf{B}_{\text{BNM}}$ to obtain $\mathbf{Q}_{L1} = \Phi(\mathbf{X} \mathbf{B}_{\text{BNM}})$.

BNM in (5) can be solved by exhaustive search over the domain of $\mathbf{B}$ with cost $\mathcal{O}(2^{NK})$, as shown in the sketch below.
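
A minimal sketch of this two-step process for toy-sized problems (illustrative function name and data; the enumeration is feasible only when 2^(NK) is small):

    import numpy as np
    from itertools import product

    def l1pca_exhaustive(X, K):
        # Step 1: solve the BNM in (5) by enumerating all 2^(NK) binary matrices B.
        N = X.shape[1]
        best_B, best_val = None, -np.inf
        for bits in product([-1.0, 1.0], repeat=N * K):
            B = np.array(bits).reshape(N, K)
            val = np.linalg.norm(X @ B, ord="nuc")   # nuclear norm = sum of singular values
            if val > best_val:
                best_B, best_val = B, val
        # Step 2: Q = Phi(X B) via the SVD, as in (6).
        U, _, Vt = np.linalg.svd(X @ best_B, full_matrices=False)
        return U @ Vt

    rng = np.random.default_rng(3)
    X = rng.standard_normal((3, 5))                  # D = 3, N = 5
    Q = l1pca_exhaustive(X, K=2)
    print(np.sum(np.abs(X.T @ Q)))                   # the L1-PCA metric ||X^T Q||_1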

Exact solution of polynomial complexity

Also, L1-PCA can be solved optimally (exactly) with cost that is polynomial in $N$, when the rank of $\mathbf{X}$ is constant with respect to $N$ (always true for finite data dimension $D$). [1] [15]

Approximate efficient solvers

In 2008, Kwak [12] proposed an iterative algorithm for the approximate solution of L1-PCA for $K = 1$. This iterative method was later generalized for multiple components. [16] Another approximate efficient solver was proposed by McCoy and Tropp [17] by means of semi-definite programming (SDP). Most recently, L1-PCA (and BNM in (5)) were solved efficiently by means of bit-flipping iterations (the L1-BF algorithm). [8] [18]
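
The following sketch illustrates a fixed-point iteration of the kind proposed by Kwak for K = 1; it is a schematic rendering under the convention sgn(0) = 1, not a verbatim transcription of the published algorithm (function and parameter names are assumptions).

    import numpy as np

    def l1pc_fixed_point(X, n_iter=200, tol=1e-9, seed=0):
        # Iterate q <- X sgn(X^T q), normalized; each step does not decrease ||X^T q||_1.
        rng = np.random.default_rng(seed)
        q = rng.standard_normal(X.shape[0])
        q /= np.linalg.norm(q)
        for _ in range(n_iter):
            s = np.sign(X.T @ q)
            s[s == 0] = 1.0                      # convention sgn(0) = 1
            q_new = X @ s
            q_new /= np.linalg.norm(q_new)
            if np.linalg.norm(q_new - q) < tol:  # fixed point reached
                return q_new
            q = q_new
        return q

Iterations of this type converge to a locally optimal direction in general, so multiple initializations (or initialization with the standard L2 PC) are commonly used.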

L1-BF algorithm

    function L1BF(X, K):
        Initialize B ∈ {±1}^(N×K) and L ← {1, 2, …, NK}
        Set ω ← ‖XB‖_* and t ← 0
        Until termination (or T_max iterations)
            t ← t + 1, B_prev ← B
            For each index ℓ ∈ L
                Map ℓ to an entry (n, k) of B
                B′ ← B with entry (n, k) negated     // flip bit
                ω′ ← ‖XB′‖_*                         // calculated by SVD or faster (see [8])
                if ω′ > ω
                    B ← B′, ω ← ω′
                    L ← L \ {ℓ}
                end
            if B = B_prev                            // no bit was flipped
                if L = {1, 2, …, NK}
                    terminate
                else
                    L ← {1, 2, …, NK}

The computational cost of L1-BF is polynomial in $N$, $D$, and $K$. [8]
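
For illustration, the sketch below implements a simplified greedy bit-flipping search for the BNM problem in (5). It follows the same general idea as L1-BF but it is not the optimized algorithm of [8]; the function name, initialization, and termination rule are assumptions.

    import numpy as np

    def l1pca_bitflip(X, K, max_iter=100, seed=0):
        # Greedily flip the single entry of B that most increases ||X B||_*;
        # stop when no single flip improves the metric.
        rng = np.random.default_rng(seed)
        N = X.shape[1]
        B = rng.choice([-1.0, 1.0], size=(N, K))
        nuc = np.linalg.norm(X @ B, ord="nuc")
        for _ in range(max_iter):
            best_gain, best_idx = 0.0, None
            for n in range(N):
                for k in range(K):
                    B[n, k] = -B[n, k]               # tentatively flip bit (n, k)
                    gain = np.linalg.norm(X @ B, ord="nuc") - nuc
                    B[n, k] = -B[n, k]               # undo the flip
                    if gain > best_gain:
                        best_gain, best_idx = gain, (n, k)
            if best_idx is None:                     # no flip helps: local optimum
                break
            B[best_idx] = -B[best_idx]
            nuc += best_gain
        U, _, Vt = np.linalg.svd(X @ B, full_matrices=False)
        return U @ Vt                                # Q = Phi(X B)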

Complex data

L1-PCA has also been generalized to process complex data. For complex L1-PCA, two efficient algorithms were proposed in 2018. [19]

Tensor data

L1-PCA has also been extended to the analysis of tensor data, in the form of L1-Tucker, the L1-norm robust analogue of the standard Tucker decomposition. [20] Two algorithms for the solution of L1-Tucker are L1-HOSVD and L1-HOOI. [20] [21] [22]

Code

MATLAB code for L1-PCA is available at MathWorks [23] and other repositories. [18]


References

  1. Markopoulos, Panos P.; Karystinos, George N.; Pados, Dimitris A. (October 2014). "Optimal Algorithms for L1-subspace Signal Processing". IEEE Transactions on Signal Processing. 62 (19): 5046–5058. arXiv:1405.6785. Bibcode:2014ITSP...62.5046M. doi:10.1109/TSP.2014.2338077. S2CID 1494171.
  2. Barrodale, I. (1968). "L1 Approximation and the Analysis of Data". Applied Statistics. 17 (1): 51–57. doi:10.2307/2985267. JSTOR 2985267.
  3. Barnett, Vic; Lewis, Toby (1994). Outliers in Statistical Data (3rd ed.). Chichester: Wiley. ISBN 978-0471930945.
  4. Kanade, T.; Ke, Qifa (June 2005). "Robust L1 Norm Factorization in the Presence of Outliers and Missing Data by Alternative Convex Programming". 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). 1. IEEE. p. 739. CiteSeerX 10.1.1.63.4605. doi:10.1109/CVPR.2005.309. ISBN 978-0-7695-2372-9. S2CID 17144854.
  5. Jolliffe, I. T. (2004). Principal Component Analysis (2nd ed.). New York: Springer. ISBN 978-0387954424.
  6. Bishop, Christopher M. (2007). Pattern Recognition and Machine Learning (corrected printing ed.). New York: Springer. ISBN 978-0-387-31073-2.
  7. Pearson, Karl (8 June 2010). "On Lines and Planes of Closest Fit to Systems of Points in Space". The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. 2 (11): 559–572. doi:10.1080/14786440109462720.
  8. Markopoulos, Panos P.; Kundu, Sandipan; Chamadia, Shubham; Pados, Dimitris A. (15 August 2017). "Efficient L1-Norm Principal-Component Analysis via Bit Flipping". IEEE Transactions on Signal Processing. 65 (16): 4252–4264. arXiv:1610.01959. Bibcode:2017ITSP...65.4252M. doi:10.1109/TSP.2017.2708023. S2CID 7931130.
  9. Golub, Gene H. (April 1973). "Some Modified Matrix Eigenvalue Problems". SIAM Review. 15 (2): 318–334. CiteSeerX 10.1.1.454.9868. doi:10.1137/1015032.
  10. Barnett, Vic; Lewis, Toby (1994). Outliers in Statistical Data (3rd ed.). Chichester: Wiley. ISBN 978-0471930945.
  11. Candès, Emmanuel J.; Li, Xiaodong; Ma, Yi; Wright, John (1 May 2011). "Robust principal component analysis?". Journal of the ACM. 58 (3): 1–37. arXiv:0912.3599. doi:10.1145/1970392.1970395. S2CID 7128002.
  12. Kwak, N. (September 2008). "Principal Component Analysis Based on L1-Norm Maximization". IEEE Transactions on Pattern Analysis and Machine Intelligence. 30 (9): 1672–1680. CiteSeerX 10.1.1.333.1176. doi:10.1109/TPAMI.2008.114. PMID 18617723. S2CID 11882870.
  13. Eldén, Lars; Park, Haesun (1 June 1999). "A Procrustes problem on the Stiefel manifold". Numerische Mathematik. 82 (4): 599–619. CiteSeerX 10.1.1.54.3580. doi:10.1007/s002110050432. S2CID 206895591.
  14. Schönemann, Peter H. (March 1966). "A generalized solution of the orthogonal procrustes problem". Psychometrika. 31 (1): 1–10. doi:10.1007/BF02289451. hdl:10338.dmlcz/103138. S2CID 121676935.
  15. Markopoulos, P. P.; Kundu, S.; Chamadia, S.; Tsagkarakis, N.; Pados, D. A. (2018). "Outlier-Resistant Data Processing with L1-Norm Principal Component Analysis". Advances in Principal Component Analysis. p. 121. doi:10.1007/978-981-10-6704-4_6. ISBN 978-981-10-6703-7.
  16. Nie, F.; Huang, H.; Ding, C.; Luo, Dijun; Wang, H. (July 2011). "Robust principal component analysis with non-greedy l1-norm maximization". 22nd International Joint Conference on Artificial Intelligence: 1433–1438.
  17. McCoy, Michael; Tropp, Joel A. (2011). "Two proposals for robust PCA using semidefinite programming". Electronic Journal of Statistics. 5: 1123–1160. arXiv:1012.1086. doi:10.1214/11-EJS636. S2CID 14102421.
  18. Markopoulos, P. P. "Software Repository". Retrieved May 21, 2018.
  19. Tsagkarakis, Nicholas; Markopoulos, Panos P.; Sklivanitis, George; Pados, Dimitris A. (15 June 2018). "L1-Norm Principal-Component Analysis of Complex Data". IEEE Transactions on Signal Processing. 66 (12): 3256–3267. arXiv:1708.01249. Bibcode:2018ITSP...66.3256T. doi:10.1109/TSP.2018.2821641. S2CID 21011653.
  20. Chachlakis, Dimitris G.; Prater-Bennette, Ashley; Markopoulos, Panos P. (22 November 2019). "L1-norm Tucker Tensor Decomposition". IEEE Access. 7: 178454–178465. arXiv:1904.06455. doi:10.1109/ACCESS.2019.2955134.
  21. Markopoulos, Panos P.; Chachlakis, Dimitris G.; Prater-Bennette, Ashley (21 February 2019). "L1-norm Higher-Order Singular-Value Decomposition". Proc. 2018 IEEE Global Conference on Signal and Information Processing: 1353–1357. doi:10.1109/GlobalSIP.2018.8646385. ISBN 978-1-7281-1295-4. S2CID 67874182.
  22. Markopoulos, Panos P.; Chachlakis, Dimitris G.; Papalexakis, Evangelos (April 2018). "The Exact Solution to Rank-1 L1-Norm TUCKER2 Decomposition". IEEE Signal Processing Letters. 25 (4): 511–515. arXiv:1710.11306. Bibcode:2018ISPL...25..511M. doi:10.1109/LSP.2018.2790901. S2CID 3693326.
  23. "L1-PCA TOOLBOX". Retrieved May 21, 2018.