Random projection

Last updated March 27, 2024

In mathematics and statistics, random projection is a technique used to reduce the dimensionality of a set of points which lie in Euclidean space. According to theoretical results, random projection preserves distances well, but empirical results are sparse.^[1] They have been applied to many natural language tasks under the name random indexing.

Dimensionality reduction

Dimensionality reduction, as the name suggests, is reducing the number of random variables using various mathematical methods from statistics and machine learning. Dimensionality reduction is often used to reduce the problem of managing and manipulating large data sets. Dimensionality reduction techniques generally use linear transformations in determining the intrinsic dimensionality of the manifold as well as extracting its principal directions. For this purpose there are various related techniques, including: principal component analysis, linear discriminant analysis, canonical correlation analysis, discrete cosine transform, random projection, etc.

Random projection is a simple and computationally efficient way to reduce the dimensionality of data by trading a controlled amount of error for faster processing times and smaller model sizes. The dimensions and distribution of random projection matrices are controlled so as to approximately preserve the pairwise distances between any two samples of the dataset.

Method

The core idea behind random projection is given in the Johnson-Lindenstrauss lemma,^[2] which states that if points in a vector space are of sufficiently high dimension, then they may be projected into a suitable lower-dimensional space in a way which approximately preserves pairwise distances between the points with high probability.

In random projection, the original d-dimensional data is projected to a k-dimensional (k << d) subspace, using a random $k\times d$ - dimensional matrix R whose columns have unit lengths.^{[ citation needed ]} Using matrix notation: If $X_{d\times N}$ is the original set of N d-dimensional observations, then $X_{k\times N}^{RP}=R_{k\times d}X_{d\times N}$ is the projection of the data onto a lower k-dimensional subspace. Random projection is computationally simple: form the random matrix "R" and project the $d\times N$ data matrix X onto K dimensions of order $O(dkN)$ . If the data matrix X is sparse with about c nonzero entries per column, then the complexity of this operation is of order $O(ckN)$ .^[3]

Gaussian random projection

The random matrix R can be generated using a Gaussian distribution. The first row is a random unit vector uniformly chosen from $S^{d-1}$ . The second row is a random unit vector from the space orthogonal to the first row, the third row is a random unit vector from the space orthogonal to the first two rows, and so on. In this way of choosing R, and the following properties are satisfied:

Spherical symmetry: For any orthogonal matrix $A\in O(d)$ , RA and R have the same distribution.
Orthogonality: The rows of R are orthogonal to each other.
Normality: The rows of R are unit-length vectors.

More computationally efficient random projections

Achlioptas^[4] has shown that the Gaussian distribution can be replaced by a much simpler distribution such as

R_{i,j}={\sqrt {3}}\times {\begin{cases}+1&{\text{with probability }}{\frac {1}{6}}\\0&{\text{with probability }}{\frac {2}{3}}\\-1&{\text{with probability }}{\frac {1}{6}}\end{cases}}

This is efficient for database applications because the computations can be performed using integer arithmetic. More related study is conducted in.^[5]

It was later shown how to use integer arithmetic while making the distribution even sparser, having very few nonzeroes per column, in work on the Sparse JL Transform.^[6] This is advantageous since a sparse embedding matrix means being able to project the data to lower dimension even faster.

Random Projection with Quantization

Random projection can be further condensed by quantization (discretization), with 1-bit (sign random projection) or multi-bits. It is the building block of SimHash,^[7] RP tree,^[8] and other memory efficient estimation and learning methods.^[9]^[10]

Large quasiorthogonal bases

The Johnson-Lindenstrauss lemma states that large sets of vectors in a high-dimensional space can be linearly mapped in a space of much lower (but still high) dimension n with approximate preservation of distances. One of the explanations of this effect is the exponentially high quasiorthogonal dimension of n-dimensional Euclidean space.^[11] There are exponentially large (in dimension n) sets of almost orthogonal vectors (with small value of inner products) in n–dimensional Euclidean space. This observation is useful in indexing of high-dimensional data.^[12]

Quasiorthogonality of large random sets is important for methods of random approximation in machine learning. In high dimensions, exponentially large numbers of randomly and independently chosen vectors from equidistribution on a sphere (and from many other distributions) are almost orthogonal with probability close to one.^[13] This implies that in order to represent an element of such a high-dimensional space by linear combinations of randomly and independently chosen vectors, it may often be necessary to generate samples of exponentially large length if we use bounded coefficients in linear combinations. On the other hand, if coefficients with arbitrarily large values are allowed, the number of randomly generated elements that are sufficient for approximation is even less than dimension of the data space.

Implementations

RandPro - An R package for random projection ^[14]^[15]
sklearn.random_projection - A module for random projection from the scikit-learn Python library
Weka implementation

Related Research Articles

<span class="mw-page-title-main">Principal component analysis</span> Method of data analysis

Principal component analysis (PCA) is a linear dimensionality reduction technique with applications in exploratory data analysis, visualization and data preprocessing.

In linear algebra, an orthogonal matrix, or orthonormal matrix, is a real square matrix whose columns and rows are orthonormal vectors.

Nonlinear dimensionality reduction, also known as manifold learning, refers to various related techniques that aim to project high-dimensional data onto lower-dimensional latent manifolds, with the goal of either visualizing the data in the low-dimensional space, or learning the mapping itself. The techniques described below can be understood as generalizations of linear decomposition methods used for dimensionality reduction, such as singular value decomposition and principal component analysis.

In linear algebra and functional analysis, a projection is a linear transformation $from a vector space to itself such that . That is, whenever is applied twice to any vector, it gives the same result as if it were applied once. It leaves its image unchanged. This definition of "projection" formalizes and generalizes the idea of graphical projection. One can also consider the effect of a projection on a geometrical object by examining the effect of the projection on points in the object.$

Dimensionality reduction, or dimension reduction, is the transformation of data from a high-dimensional space into a low-dimensional space so that the low-dimensional representation retains some meaningful properties of the original data, ideally close to its intrinsic dimension. Working in high-dimensional spaces can be undesirable for many reasons; raw data are often sparse as a consequence of the curse of dimensionality, and analyzing the data is usually computationally intractable. Dimensionality reduction is common in fields that deal with large numbers of observations and/or large numbers of variables, such as signal processing, speech recognition, neuroinformatics, and bioinformatics.

Latent semantic analysis (LSA) is a technique in natural language processing, in particular distributional semantics, of analyzing relationships between a set of documents and the terms they contain by producing a set of concepts related to the documents and terms. LSA assumes that words that are close in meaning will occur in similar pieces of text. A matrix containing word counts per document is constructed from a large piece of text and a mathematical technique called singular value decomposition (SVD) is used to reduce the number of rows while preserving the similarity structure among columns. Documents are then compared by cosine similarity between any two columns. Values close to 1 represent very similar documents while values close to 0 represent very dissimilar documents.

The curse of dimensionality refers to various phenomena that arise when analyzing and organizing data in high-dimensional spaces that do not occur in low-dimensional settings such as the three-dimensional physical space of everyday experience. The expression was coined by Richard E. Bellman when considering problems in dynamic programming. The curse generally refers to issues that arise when the number of datapoints is small relative to the intrinsic dimension of the data.

In statistics, the k-nearest neighbors algorithm (k-NN) is a non-parametric supervised learning method first developed by Evelyn Fix and Joseph Hodges in 1951, and later expanded by Thomas Cover. It is used for classification and regression. In both cases, the input consists of the k closest training examples in a data set. The output depends on whether k-NN is used for classification or regression:

Biclustering, block clustering, Co-clustering or two-mode clustering is a data mining technique which allows simultaneous clustering of the rows and columns of a matrix. The term was first introduced by Boris Mirkin to name a technique introduced many years earlier, in 1972, by John A. Hartigan.

Non-negative matrix factorization, also non-negative matrix approximation is a group of algorithms in multivariate analysis and linear algebra where a matrix $V$ is factorized into (usually) two matrices $W$ and $H$ , with the property that all three matrices have no negative elements. This non-negativity makes the resulting matrices easier to inspect. Also, in applications such as processing of audio spectrograms or muscular activity, non-negativity is inherent to the data being considered. Since the problem is not exactly solvable in general, it is commonly approximated numerically.

An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data. An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.

In computer science, locality-sensitive hashing (LSH) is a fuzzy hashing technique that hashes similar input items into the same "buckets" with high probability. Since similar items end up in the same buckets, this technique can be used for data clustering and nearest neighbor search. It differs from conventional hashing techniques in that hash collisions are maximized, not minimized. Alternatively, the technique can be seen as a way to reduce the dimensionality of high-dimensional data; high-dimensional input items can be reduced to low-dimensional versions while preserving relative distances between items.

<span class="mw-page-title-main">Spectral clustering</span> Clustering methods

In multivariate statistics, spectral clustering techniques make use of the spectrum (eigenvalues) of the similarity matrix of the data to perform dimensionality reduction before clustering in fewer dimensions. The similarity matrix is provided as an input and consists of a quantitative assessment of the relative similarity of each pair of points in the dataset.

In mathematics, the Johnson–Lindenstrauss lemma is a result named after William B. Johnson and Joram Lindenstrauss concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. In the classical proof of the lemma, the embedding is a random orthogonal projection.

In computer science, lattice problems are a class of optimization problems related to mathematical objects called lattices. The conjectured intractability of such problems is central to the construction of secure lattice-based cryptosystems: lattice problems are an example of NP-hard problems which have been shown to be average-case hard, providing a test case for the security of cryptographic algorithms. In addition, some lattice problems which are worst-case hard can be used as a basis for extremely secure cryptographic schemes. The use of worst-case hardness in such schemes makes them among the very few schemes that are very likely secure even against quantum computers. For applications in such cryptosystems, lattices over vector space or free modules are generally considered.

Sparse distributed memory (SDM) is a mathematical model of human long-term memory introduced by Pentti Kanerva in 1988 while he was at NASA Ames Research Center.

In data mining and machine learning, $k$ $q$ -flats algorithm is an iterative method which aims to partition $m$ observations into $k$ clusters where each cluster is close to a $q$ -flat, where $q$ is a given integer.

<span class="mw-page-title-main">Matrix completion</span>

Matrix completion is the task of filling in the missing entries of a partially observed matrix, which is equivalent to performing data imputation in statistics. A wide range of datasets are naturally organized in matrix form. One example is the movie-ratings matrix, as appears in the Netflix problem: Given a ratings matrix in which each entry $represents the rating of movie by customer, if customer has watched movie and is otherwise missing, we would like to predict the remaining entries in order to make good recommendations to customers on what to watch next. Another example is the document-term matrix: The frequencies of words used in a collection of documents can be represented as a matrix, where each entry corresponds to the number of times the associated term appears in the indicated document.$

Sparse dictionary learning is a representation learning method which aims at finding a sparse representation of the input data in the form of a linear combination of basic elements as well as those basic elements themselves. These elements are called atoms and they compose a dictionary. Atoms in the dictionary are not required to be orthogonal, and they may be an over-complete spanning set. This problem setup also allows the dimensionality of the signals being represented to be higher than the one of the signals being observed. The above two properties lead to having seemingly redundant atoms that allow multiple representations of the same signal but also provide an improvement in sparsity and flexibility of the representation.

In statistics, machine learning and algorithms, a tensor sketch is a type of dimensionality reduction that is particularly efficient when applied to vectors that have tensor structure. Such a sketch can be used to speed up explicit kernel methods, bilinear pooling in neural networks and is a cornerstone in many numerical linear algebra algorithms.

References

↑ Ella, Bingham; Heikki, Mannila (2001). "Random projection in dimensionality reduction: Applications to image and text data". KDD-2001: Proceedings of the Seventh ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: Association for Computing Machinery. pp. 245–250. CiteSeerX 10.1.1.24.5135 . doi:10.1145/502512.502546.
↑ Johnson, William B.; Lindenstrauss, Joram (1984). "Extensions of Lipschitz mappings into a Hilbert space". Conference in Modern Analysis and Probability (New Haven, Conn., 1982) . Contemporary Mathematics. Vol. 26. Providence, RI: American Mathematical Society. pp. 189–206. doi:10.1090/conm/026/737400. ISBN 9780821850305. MR 0737400. S2CID 117819162..
↑ Bingham, Ella; Mannila, Heikki (May 6, 2014). "Random projection in dimensionality reduction: Applications to image and text data" (PDF).
↑ Achlioptas, Dimitris (2001). "Database-friendly random projections". Proceedings of the twentieth ACM SIGMOD-SIGACT-SIGART symposium on Principles of database systems - PODS '01. pp. 274–281. CiteSeerX 10.1.1.28.6652 . doi:10.1145/375551.375608. ISBN 978-1581133615. S2CID 2640788.
↑ Li, Ping; Hastie, Trevor; Church, Kenneth (2006). "Very sparse random projections". Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining. pp. 287–296. doi:10.1145/1150402.1150436. ISBN 1595933395. S2CID 7995734.
↑ Kane, Daniel M.; Nelson, Jelani (2014). "Sparser Johnson-Lindenstrauss Transforms". Journal of the ACM . 61 (1): 1–23. arXiv: 1012.1577 . doi:10.1145/2559902. MR 3167920. S2CID 7821848.
↑ Charikar, Moses (2002). "Similarity estimation techniques from rounding algorithms". Proceedings of the thiry-fourth annual ACM symposium on Theory of computing. Vol. 1. pp. 380–388. doi:10.1145/509907.509965. ISBN 1581134959. S2CID 4229473.
↑ Freund, Yoav; Dasgupta, Sanjoy; Kabra, Mayank; Verma, Nakul (2007). "Learning the structure of manifolds using random projections". 20th International Conference on Neural Information Processing Systems. 1 (1): 473–480.
↑ Boufounos, Petros; Baraniuk, Richard (2008). "1-bit Compressive Sensing". 42nd Annual Conference on Information Sciences and Systems. 1 (1): 16–21. doi:10.1109/CISS.2008.4558487. S2CID 206563812.
↑ Li, Xiaoyun; Li, Ping (2019). "Generalization error analysis of quantized compressive learning". 33rd International Conference on Neural Information Processing Systems. 1: 15150–15160.
↑ Kainen, Paul C.; Kůrková, Věra (1993), "Quasiorthogonal dimension of Euclidean spaces", Applied Mathematics Letters, 6 (3): 7–10, doi: 10.1016/0893-9659(93)90023-G , MR 1347278
↑ Hecht-Nielsen, R. (1994). "Context vectors: General-purpose approximate meaning representations self-organized from raw data". In Zurada, Jacek M.; Marks, Robert Jackson; Robinson, Charles J. (eds.). Computational Intelligence: Imitating Life. IEEE. pp. 43–56. ISBN 978-0-7803-1104-6.
↑ Gorban, Alexander N.; Tyukin, Ivan Y.; Prokhorov, Danil V.; Sofeikov, Konstantin I. (2016). "Approximation with Random Bases: Pro et Contra". Information Sciences . 364–365: 129–145. arXiv: 1506.04631 . doi:10.1016/j.ins.2015.09.021. S2CID 2239376.
↑ Ravindran, Siddharth (2020). "A Data-Independent Reusable Projection (DIRP) Technique for Dimension Reduction in Big Data Classification Using k-Nearest Neighbor (k-NN)". National Academy Science Letters. 43: 13–21. doi:10.1007/s40009-018-0771-6. S2CID 91946077.
↑ Siddharth, R.; Aghila, G. (July 2020). "RandPro- A practical implementation of random projection-based feature extraction for high dimensional multivariate data analysis in R". SoftwareX. 12: 100629. Bibcode:2020SoftX..1200629S. doi: 10.1016/j.softx.2020.100629 .