L1-norm principal component analysis

Figure: L1-PCA compared with PCA. Nominal data (blue points); outlier (red point); PC (black line); L1-PC (red line); nominal maximum-variance line (dotted line).

L1-norm principal component analysis (L1-PCA) is a general method for multivariate data analysis. [1] L1-PCA is often preferred over standard L2-norm principal component analysis (PCA) when the analyzed data may contain outliers (faulty values or corruptions), because it is considered more robust to such contamination. [2] [3] [4]

Both L1-PCA and standard PCA seek a collection of orthogonal directions (principal components) that define a subspace wherein data representation is maximized according to the selected criterion. [5] [6] [7] Standard PCA quantifies data representation as the aggregate of the squared L2-norms of the data-point projections onto the subspace; maximizing this quantity is equivalent to minimizing the aggregate squared Euclidean distance of the original points from their subspace-projected representations. L1-PCA instead uses the aggregate of the L1-norms of the data-point projections onto the subspace. [8] In both PCA and L1-PCA, the number of principal components (PCs) is lower than the rank of the analyzed matrix, which coincides with the dimensionality of the space spanned by the original data points. Therefore, PCA and L1-PCA are commonly employed for dimensionality reduction, for the purpose of data denoising or compression. Among the advantages of standard PCA that contributed to its high popularity are its low-cost computational implementation by means of singular-value decomposition (SVD) [9] and its statistical optimality when the data set is generated by a true multivariate normal data source.
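
To make the two criteria concrete, the following Python/NumPy sketch (an illustrative example, not taken from the cited references; the variable names are arbitrary) builds the subspace spanned by the K dominant left singular vectors of a data matrix and evaluates both the aggregate squared L2 criterion of standard PCA and the aggregate L1 criterion used by L1-PCA on that subspace:

import numpy as np

rng = np.random.default_rng(0)
D, N, K = 5, 100, 2              # dimension, number of points, number of components
X = rng.standard_normal((D, N))  # data matrix with one point per column

# Standard PCA subspace: K dominant left singular vectors of X
U, _, _ = np.linalg.svd(X, full_matrices=False)
Q = U[:, :K]                     # D x K orthonormal basis

proj = Q.T @ X                   # K x N coordinates of the projections
l2_criterion = np.sum(proj**2)        # aggregate squared L2-norm (maximized by PCA)
l1_criterion = np.sum(np.abs(proj))   # aggregate L1-norm (maximized by L1-PCA)
print(l2_criterion, l1_criterion)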

However, in modern big data sets, data often include corrupted, faulty points, commonly referred to as outliers. [10] Standard PCA is known to be sensitive to outliers, even when they appear as a small fraction of the processed data. [11] The reason is that the L2-norm formulation of L2-PCA places squared emphasis on the magnitude of each coordinate of each data point, ultimately overemphasizing peripheral points, such as outliers. On the other hand, following an L1-norm formulation, L1-PCA places linear emphasis on the coordinates of each data point, effectively restraining outliers. [12]
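
As a simple numerical illustration of this difference in emphasis (not from the cited references): if the projections of three data points onto a candidate direction are 1, 1, and 10, with the value 10 produced by an outlier, the squared (L2) criterion assigns the outlier 100 out of a total of 102 (about 98% of the weight), whereas the absolute-value (L1) criterion assigns it 10 out of 12 (about 83%); a single corrupted point therefore dominates the squared criterion far more strongly.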

Formulation

Consider any matrix $\mathbf{X} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_N] \in \mathbb{R}^{D \times N}$ consisting of $N$ $D$-dimensional data points. Define $d = \min\{D, N\}$. For integer $K$ such that $1 \leq K \leq d$, L1-PCA is formulated as: [1]

$$\max_{\mathbf{Q} \in \mathbb{R}^{D \times K},\ \mathbf{Q}^\top \mathbf{Q} = \mathbf{I}_K} \|\mathbf{Q}^\top \mathbf{X}\|_1. \qquad (1)$$

For $K = 1$, (1) simplifies to finding the L1-norm principal component (L1-PC) of $\mathbf{X}$ by

$$\max_{\mathbf{q} \in \mathbb{R}^{D},\ \|\mathbf{q}\|_2 = 1} \|\mathbf{X}^\top \mathbf{q}\|_1. \qquad (2)$$

In (1)-(2), the L1-norm $\|\cdot\|_1$ returns the sum of the absolute entries of its argument and the L2-norm $\|\cdot\|_2$ returns the square root of the sum of the squared entries of its argument. If one substitutes $\|\mathbf{Q}^\top \mathbf{X}\|_1$ in (1) by the Frobenius/L2-norm $\|\mathbf{Q}^\top \mathbf{X}\|_F$, then the problem becomes standard PCA and it is solved by the matrix $\mathbf{Q}$ that contains the $K$ dominant left singular vectors of $\mathbf{X}$ (i.e., the singular vectors that correspond to the $K$ highest singular values).

The maximization metric in (2) can be expanded as

$$\|\mathbf{X}^\top \mathbf{q}\|_1 = \sum_{n=1}^{N} |\mathbf{x}_n^\top \mathbf{q}| = \max_{\mathbf{b} \in \{\pm 1\}^{N}} \mathbf{b}^\top \mathbf{X}^\top \mathbf{q}. \qquad (3)$$
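
This identity can be checked numerically with a brief Python/NumPy sketch (illustrative, with arbitrary variable names): for small $N$, the maximum of $\mathbf{b}^\top \mathbf{X}^\top \mathbf{q}$ over all sign vectors equals $\|\mathbf{X}^\top \mathbf{q}\|_1$.

import itertools
import numpy as np

rng = np.random.default_rng(1)
D, N = 3, 6
X = rng.standard_normal((D, N))
q = rng.standard_normal(D)
q /= np.linalg.norm(q)                      # unit-norm direction

l1_metric = np.sum(np.abs(X.T @ q))         # ||X^T q||_1

# Exhaustive maximization over the 2^N antipodal sign vectors b
best = max(np.array(b) @ (X.T @ q)
           for b in itertools.product((-1, 1), repeat=N))

assert np.isclose(l1_metric, best)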

Solution

For any matrix $\mathbf{A} \in \mathbb{R}^{m \times n}$ with $m \geq n$, define $\Phi(\mathbf{A})$ as the nearest (in the L2-norm sense) matrix to $\mathbf{A}$ that has orthonormal columns. That is, define

$$\Phi(\mathbf{A}) = \underset{\mathbf{Q} \in \mathbb{R}^{m \times n},\ \mathbf{Q}^\top \mathbf{Q} = \mathbf{I}_n}{\operatorname{argmin}} \|\mathbf{A} - \mathbf{Q}\|_F. \qquad (4)$$

The Procrustes theorem [13] [14] states that if $\mathbf{A}$ has SVD $\mathbf{A} = \mathbf{U} \boldsymbol{\Sigma} \mathbf{V}^\top$, then $\Phi(\mathbf{A}) = \mathbf{U} \mathbf{V}^\top$.
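
A minimal Python/NumPy sketch of this Procrustes step (illustrative; the function name phi is an arbitrary choice):

import numpy as np

def phi(A):
    """Nearest matrix with orthonormal columns to A, via the thin SVD A = U S V^T."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

A = np.random.default_rng(2).standard_normal((5, 2))
Q = phi(A)
print(np.allclose(Q.T @ Q, np.eye(2)))   # columns are orthonormal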

Markopoulos, Karystinos, and Pados [1] showed that, if $\mathbf{B}_{\text{BNM}}$ is the exact solution to the binary nuclear-norm maximization (BNM) problem

$$\max_{\mathbf{B} \in \{\pm 1\}^{N \times K}} \|\mathbf{X} \mathbf{B}\|_*, \qquad (5)$$

then

$$\mathbf{Q}_{\text{L1}} = \Phi(\mathbf{X} \mathbf{B}_{\text{BNM}}) \qquad (6)$$

is the exact solution to L1-PCA in (1). The nuclear norm $\|\cdot\|_*$ in (5) returns the summation of the singular values of its matrix argument and can be calculated by means of standard SVD. Moreover, it holds that, given the solution to L1-PCA, $\mathbf{Q}_{\text{L1}}$, the solution to BNM can be obtained as

$$\mathbf{B}_{\text{BNM}} = \operatorname{sgn}(\mathbf{X}^\top \mathbf{Q}_{\text{L1}}), \qquad (7)$$

where $\operatorname{sgn}(\cdot)$ returns the $\pm 1$-sign matrix of its matrix argument (with no loss of generality, we can consider $\operatorname{sgn}(0) = 1$). In addition, it follows that $\|\mathbf{X} \mathbf{B}_{\text{BNM}}\|_* = \|\mathbf{Q}_{\text{L1}}^\top \mathbf{X}\|_1$. BNM in (5) is a combinatorial problem over $NK$ antipodal binary variables. Therefore, its exact solution can be found through exhaustive evaluation of all $2^{NK}$ elements of its feasibility set, with asymptotic cost $\mathcal{O}(2^{NK})$. Therefore, L1-PCA can also be solved, through BNM, with cost $\mathcal{O}(2^{NK})$ (exponential in the product of the number of data points with the number of the sought-after components). It turns out that L1-PCA can be solved optimally (exactly) with polynomial complexity in $N$ for fixed data dimension $D$, with cost $\mathcal{O}(N^{dK - K + 1})$, $d = \operatorname{rank}(\mathbf{X})$. [1]

For the special case of $K = 1$ (single L1-PC of $\mathbf{X}$), BNM takes the binary-quadratic-maximization (BQM) form

$$\max_{\mathbf{b} \in \{\pm 1\}^{N}} \mathbf{b}^\top \mathbf{X}^\top \mathbf{X} \mathbf{b}. \qquad (8)$$

The transition from (5) to (8) for $K = 1$ holds true, since the unique singular value of $\mathbf{X}\mathbf{b}$ is equal to $\|\mathbf{X}\mathbf{b}\|_2 = \sqrt{\mathbf{b}^\top \mathbf{X}^\top \mathbf{X} \mathbf{b}}$, for every $\mathbf{b}$. Then, if $\mathbf{b}_{\text{BNM}}$ is the solution to BQM in (8), it holds that

$$\mathbf{q}_{\text{L1}} = \frac{\mathbf{X}\mathbf{b}_{\text{BNM}}}{\|\mathbf{X}\mathbf{b}_{\text{BNM}}\|_2} \qquad (9)$$

is the exact L1-PC of $\mathbf{X}$, as defined in (2). In addition, it holds that $\mathbf{b}_{\text{BNM}} = \operatorname{sgn}(\mathbf{X}^\top \mathbf{q}_{\text{L1}})$ and $\|\mathbf{X}^\top \mathbf{q}_{\text{L1}}\|_1 = \|\mathbf{X}\mathbf{b}_{\text{BNM}}\|_2$.
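
A compact Python/NumPy sketch of this $K = 1$ case (illustrative function and variable names; the exhaustive search is exponential in N and only meant for tiny examples):

import itertools
import numpy as np

def l1pc_exact(X):
    """Exact single L1-PC of X (D x N) via exhaustive BQM (eq. (8)) and eq. (9)."""
    D, N = X.shape
    G = X.T @ X                                   # N x N Gram matrix
    best_val, best_b = -np.inf, None
    for bits in itertools.product((-1.0, 1.0), repeat=N):
        b = np.array(bits)
        val = b @ G @ b                           # b^T X^T X b
        if val > best_val:
            best_val, best_b = val, b
    q = X @ best_b
    return q / np.linalg.norm(q)                  # eq. (9)

X = np.random.default_rng(4).standard_normal((2, 8))
q = l1pc_exact(X)
print(np.sum(np.abs(X.T @ q)))                    # the maximized metric of (2)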

Algorithms

Exact solution of exponential complexity

As shown above, the exact solution to L1-PCA can be obtained by the following two-step process:

1. Solve the problem in (5) to obtain $\mathbf{B}_{\text{BNM}}$.
2. Apply SVD (the Procrustes step $\Phi$) on $\mathbf{X}\mathbf{B}_{\text{BNM}}$ to obtain $\mathbf{Q}_{\text{L1}} = \Phi(\mathbf{X}\mathbf{B}_{\text{BNM}})$.

BNM in (5) can be solved by exhaustive search over the domain of $\mathbf{B}$ with cost $\mathcal{O}(2^{NK})$, as in the sketch below.
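
The two-step process above can be written directly in Python/NumPy for a small toy example (illustrative function and variable names; the exhaustive loop over all 2^{NK} sign matrices is only feasible for very small N and K):

import itertools
import numpy as np

def phi(A):
    """Nearest matrix with orthonormal columns (Procrustes step, eq. (4))."""
    U, _, Vt = np.linalg.svd(A, full_matrices=False)
    return U @ Vt

def l1pca_exact(X, K):
    """Exact L1-PCA of X (D x N) via exhaustive binary nuclear-norm maximization (eq. (5))."""
    D, N = X.shape
    best_val, best_B = -np.inf, None
    for bits in itertools.product((-1.0, 1.0), repeat=N * K):
        B = np.array(bits).reshape(N, K)
        val = np.sum(np.linalg.svd(X @ B, compute_uv=False))   # nuclear norm of X B
        if val > best_val:
            best_val, best_B = val, B
    return phi(X @ best_B)                                      # eq. (6)

X = np.random.default_rng(3).standard_normal((3, 5))
Q = l1pca_exact(X, K=2)
print(np.sum(np.abs(Q.T @ X)))    # the maximized L1-PCA metric of (1)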

Exact solution of polynomial complexity

Also, L1-PCA can be solved optimally with cost $\mathcal{O}(N^{dK - K + 1})$, $d = \operatorname{rank}(\mathbf{X})$, when $d$ is constant with respect to $N$ (always true for finite data dimension $D$). [1] [15]

Approximate efficient solvers

In 2008, Kwak [12] proposed an iterative algorithm for the approximate solution of L1-PCA for $K = 1$. This iterative method was later generalized for $K > 1$ components. [16] Another approximate efficient solver was proposed by McCoy and Tropp [17] by means of semi-definite programming (SDP). Most recently, L1-PCA (and BNM in (5)) were solved efficiently by means of bit-flipping iterations (L1-BF algorithm). [8]
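
A minimal Python/NumPy sketch of a fixed-point iteration of this type for K = 1 (in the spirit of Kwak's method [12]; the initialization, convergence check, and variable names here are illustrative choices, not the published algorithm verbatim):

import numpy as np

def l1pc_fixed_point(X, iters=100, tol=1e-9):
    """Approximate single L1-PC of X (D x N) by a sign fixed-point iteration."""
    q = X[:, 0] / np.linalg.norm(X[:, 0])        # simple initialization
    for _ in range(iters):
        q_new = X @ np.sign(X.T @ q)             # weight points by the sign of their projection
        q_new /= np.linalg.norm(q_new)
        if np.linalg.norm(q_new - q) < tol:
            break
        q = q_new
    return q

X = np.random.default_rng(5).standard_normal((4, 50))
q = l1pc_fixed_point(X)
print(np.sum(np.abs(X.T @ q)))                   # attained L1-PCA metric (local maximum)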

L1-BF algorithm

The following pseudocode outlines the bit-flipping procedure (see [8] for the complete algorithm, including initialization and re-initialization details):

function L1BF(X, K):
    Initialize B ∈ {±1}^(N×K) and compute the metric ω ← ‖X B‖_*
    Set the candidate-bit list L ← {(n, k) : n = 1, …, N, k = 1, …, K}
    Until termination (or a maximum number of iterations):
        For each candidate bit (n, k) in L:
            B(n, k) ← −B(n, k)                  // flip bit
            a ← ‖X B‖_*                         // calculated by SVD or faster (see [8])
            if a > ω:
                keep the flip and set ω ← a
            else:
                B(n, k) ← −B(n, k)              // undo the flip
            end
        if no bit was flipped in the last pass  // no bit was flipped
            terminate
    Return Q_L1 ← Φ(X B)

The computational cost of L1-BF is polynomial in $N$, $D$, and $K$ (see [8] for the exact expression).
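
For concreteness, a runnable Python/NumPy version of the sweep-style bit-flipping sketch above (illustrative only; the published L1-BF algorithm [8] includes additional initialization, ordering, and termination refinements):

import numpy as np

def nuclear_norm(A):
    return np.sum(np.linalg.svd(A, compute_uv=False))

def l1pca_bitflip(X, K, max_iters=100):
    """Approximate L1-PCA of X (D x N) by greedy bit flipping on B in {+-1}^(N x K)."""
    D, N = X.shape
    U, _, _ = np.linalg.svd(X, full_matrices=False)
    B = np.sign(X.T @ U[:, :K])                  # initialize from the standard PCA subspace
    B[B == 0] = 1.0
    metric = nuclear_norm(X @ B)
    for _ in range(max_iters):
        flipped = False
        for n in range(N):
            for k in range(K):
                B[n, k] = -B[n, k]               # tentatively flip bit (n, k)
                new_metric = nuclear_norm(X @ B)
                if new_metric > metric:
                    metric = new_metric          # keep the flip
                    flipped = True
                else:
                    B[n, k] = -B[n, k]           # undo the flip
        if not flipped:                          # no bit improved the metric
            break
    Uq, _, Vt = np.linalg.svd(X @ B, full_matrices=False)
    return Uq @ Vt                               # Q_L1 = Phi(X B)

X = np.random.default_rng(6).standard_normal((4, 30))
Q = l1pca_bitflip(X, K=2)
print(np.sum(np.abs(Q.T @ X)))                   # attained L1-PCA metric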

Complex data

L1-PCA has also been generalized to process complex data. For complex L1-PCA, two efficient algorithms were proposed in 2018. [18]

Tensor data

L1-PCA has also been extended for the analysis of tensor data, in the form of L1-Tucker, the L1-norm robust analogue of the standard Tucker decomposition. [19] Two algorithms for the solution of L1-Tucker are L1-HOSVD and L1-HOOI. [19] [20] [21]

Code

MATLAB code for L1-PCA is available at MathWorks. [22]

References

  1. Markopoulos, Panos P.; Karystinos, George N.; Pados, Dimitris A. (October 2014). "Optimal Algorithms for L1-subspace Signal Processing". IEEE Transactions on Signal Processing. 62 (19): 5046–5058. arXiv: 1405.6785 . Bibcode:2014ITSP...62.5046M. doi:10.1109/TSP.2014.2338077. S2CID   1494171.
  2. Barrodale, I. (1968). "L1 Approximation and the Analysis of Data". Applied Statistics. 17 (1): 51–57. doi:10.2307/2985267. JSTOR   2985267.
  3. Barnett, Vic; Lewis, Toby (1994). Outliers in statistical data (3. ed.). Chichester [u.a.]: Wiley. ISBN   978-0471930945.
  4. Kanade, T.; Ke, Qifa (June 2005). "Robust L₁ Norm Factorization in the Presence of Outliers and Missing Data by Alternative Convex Programming". 2005 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'05). Vol. 1. IEEE. pp. 739–746. CiteSeerX   10.1.1.63.4605 . doi:10.1109/CVPR.2005.309. ISBN   978-0-7695-2372-9. S2CID   17144854.
  5. Jolliffe, I.T. (2004). Principal component analysis (2nd ed.). New York: Springer. ISBN   978-0387954424.
  6. Bishop, Christopher M. (2007). Pattern recognition and machine learning (Corr. printing. ed.). New York: Springer. ISBN   978-0-387-31073-2.
  7. Pearson, Karl (8 June 2010). "On Lines and Planes of Closest Fit to Systems of Points in Space". The London, Edinburgh, and Dublin Philosophical Magazine and Journal of Science. 2 (11): 559–572. doi:10.1080/14786440109462720. S2CID   125037489.
  8. Markopoulos, Panos P.; Kundu, Sandipan; Chamadia, Shubham; Pados, Dimitris A. (15 August 2017). "Efficient L1-Norm Principal-Component Analysis via Bit Flipping". IEEE Transactions on Signal Processing. 65 (16): 4252–4264. arXiv: 1610.01959 . Bibcode:2017ITSP...65.4252M. doi:10.1109/TSP.2017.2708023. S2CID   7931130.
  9. Golub, Gene H. (April 1973). "Some Modified Matrix Eigenvalue Problems". SIAM Review. 15 (2): 318–334. CiteSeerX   10.1.1.454.9868 . doi:10.1137/1015032.
  10. Barnett, Vic; Lewis, Toby (1994). Outliers in statistical data (3. ed.). Chichester [u.a.]: Wiley. ISBN   978-0471930945.
  11. Candès, Emmanuel J.; Li, Xiaodong; Ma, Yi; Wright, John (1 May 2011). "Robust principal component analysis?". Journal of the ACM. 58 (3): 1–37. arXiv: 0912.3599 . doi:10.1145/1970392.1970395. S2CID   7128002.
  12. Kwak, N. (September 2008). "Principal Component Analysis Based on L1-Norm Maximization". IEEE Transactions on Pattern Analysis and Machine Intelligence. 30 (9): 1672–1680. CiteSeerX   10.1.1.333.1176 . doi:10.1109/TPAMI.2008.114. PMID   18617723. S2CID   11882870.
  13. Eldén, Lars; Park, Haesun (1 June 1999). "A Procrustes problem on the Stiefel manifold". Numerische Mathematik. 82 (4): 599–619. CiteSeerX   10.1.1.54.3580 . doi:10.1007/s002110050432. S2CID   206895591.
  14. Schönemann, Peter H. (March 1966). "A generalized solution of the orthogonal procrustes problem". Psychometrika. 31 (1): 1–10. doi:10.1007/BF02289451. hdl: 10338.dmlcz/103138 . S2CID   121676935.
  15. Markopoulos, PP; Kundu, S; Chamadia, S; Tsagkarakis, N; Pados, DA (2018). "Outlier-Resistant Data Processing with L1-Norm Principal Component Analysis". Advances in Principal Component Analysis. pp. 121–135. doi:10.1007/978-981-10-6704-4_6. ISBN   978-981-10-6703-7.
  16. Nie, F; Huang, H; Ding, C; Luo, Dijun; Wang, H (July 2011). "Robust principal component analysis with non-greedy l1-norm maximization". 22nd International Joint Conference on Artificial Intelligence: 1433–1438.
  17. McCoy, Michael; Tropp, Joel A. (2011). "Two proposals for robust PCA using semidefinite programming". Electronic Journal of Statistics. 5: 1123–1160. arXiv: 1012.1086 . doi:10.1214/11-EJS636. S2CID   14102421.
  18. Tsagkarakis, Nicholas; Markopoulos, Panos P.; Sklivanitis, George; Pados, Dimitris A. (15 June 2018). "L1-Norm Principal-Component Analysis of Complex Data". IEEE Transactions on Signal Processing. 66 (12): 3256–3267. arXiv: 1708.01249 . Bibcode:2018ITSP...66.3256T. doi:10.1109/TSP.2018.2821641. S2CID   21011653.
  19. Chachlakis, Dimitris G.; Prater-Bennette, Ashley; Markopoulos, Panos P. (22 November 2019). "L1-norm Tucker Tensor Decomposition". IEEE Access. 7: 178454–178465. arXiv: 1904.06455 . doi: 10.1109/ACCESS.2019.2955134 .
  20. Markopoulos, Panos P.; Chachlakis, Dimitris G.; Prater-Bennette, Ashley (21 February 2019). "L1-Norm Higher-Order Singular-Value Decomposition". 2018 IEEE Global Conference on Signal and Information Processing (GlobalSIP). pp. 1353–1357. doi:10.1109/GlobalSIP.2018.8646385. ISBN   978-1-7281-1295-4. S2CID   67874182.
  21. Markopoulos, Panos P.; Chachlakis, Dimitris G.; Papalexakis, Evangelos (April 2018). "The Exact Solution to Rank-1 L1-Norm TUCKER2 Decomposition". IEEE Signal Processing Letters. 25 (4): 511–515. arXiv: 1710.11306 . Bibcode:2018ISPL...25..511M. doi:10.1109/LSP.2018.2790901. S2CID   3693326.
  22. "L1-PCA TOOLBOX" . Retrieved May 21, 2018.