In statistics, machine learning and algorithms, a tensor sketch is a type of dimensionality reduction that is particularly efficient when applied to vectors that have tensor structure. [1] [2] Such a sketch can be used to speed up explicit kernel methods and bilinear pooling in neural networks, and it is a cornerstone in many numerical linear algebra algorithms. [3]
Mathematically, a dimensionality reduction or sketching matrix is a matrix $M \in \mathbb{R}^{k \times d}$, where $k < d$, such that for any vector $x \in \mathbb{R}^d$,
$$\big| \|Mx\|_2 - \|x\|_2 \big| < \varepsilon \|x\|_2$$
with high probability. In other words, $M$ preserves the norm of vectors up to a small error $\varepsilon$.
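For illustration (not part of the constructions discussed below), a dense Gaussian matrix scaled by $1/\sqrt{k}$ is a standard example of such a sketching matrix. The following minimal NumPy sketch, with arbitrarily chosen sizes, checks the norm-preservation property empirically.

```python
import numpy as np

# Illustrative sizes only; any k << d with k large enough for the desired accuracy works.
rng = np.random.default_rng(0)
d, k = 10_000, 500

M = rng.standard_normal((k, d)) / np.sqrt(k)   # dense Gaussian sketching matrix
x = rng.standard_normal(d)

# The ratio below is close to 1, i.e. ||Mx|| is approximately ||x|| up to a small error.
print(np.linalg.norm(M @ x) / np.linalg.norm(x))
```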
A tensor sketch has the extra property that if $x = y \otimes z$ for some vectors $y \in \mathbb{R}^{d_1}$ and $z \in \mathbb{R}^{d_2}$ such that $d_1 d_2 = d$, the transformation $Mx$ can be computed more efficiently. Here $\otimes$ denotes the Kronecker product, rather than the outer product, though the two are related by a flattening.
The speedup is achieved by first rewriting $M(y \otimes z) = M^{(1)} y \circ M^{(2)} z$, where $\circ$ denotes the elementwise (Hadamard) product and $M^{(1)} \in \mathbb{R}^{k \times d_1}$, $M^{(2)} \in \mathbb{R}^{k \times d_2}$ are suitable smaller matrices. Each of $M^{(1)} y$ and $M^{(2)} z$ can be computed in time $O(k d_1)$ and $O(k d_2)$, respectively; including the Hadamard product gives overall time $O(k(d_1 + d_2))$. In most use cases this method is significantly faster than the full $Mx$, which requires $O(k d_1 d_2)$ time.
For higher-order tensors, such as $x = y \otimes z \otimes t$, the savings are even more impressive.
The term tensor sketch was coined in 2013 [4], describing a technique by Rasmus Pagh [5] from the same year. Originally it was understood in terms of using the fast Fourier transform to perform fast convolution of count sketches. Later research generalized it to a much larger class of dimensionality reductions via tensor random embeddings.
Tensor random embeddings were introduced in 2010 in a paper [6] on differential privacy and were first analyzed by Rudelson et al. in 2012 in the context of sparse recovery. [7]
Avron et al. [8] were the first to study the subspace embedding properties of tensor sketches, with a particular focus on applications to polynomial kernels. In this context, the sketch is required not only to preserve the norm of each individual vector with a certain probability, but to preserve the norm of all vectors in each individual linear subspace. This is a much stronger property, and it requires larger sketch sizes, but it allows kernel methods to be used very broadly, as explored in the book by David Woodruff. [3]
The face-splitting product is defined as the tensor product of the rows of two matrices (it was proposed by V. Slyusar [9] in 1996 [10] [11] [12] [13] [14] for radar and digital antenna array applications). More directly, let $C \in \mathbb{R}^{k \times d_1}$ and $D \in \mathbb{R}^{k \times d_2}$ be two matrices. Then the face-splitting product is [10] [11] [12] [13]
$$C \bullet D = \begin{bmatrix} C_1 \otimes D_1 \\ C_2 \otimes D_2 \\ \vdots \\ C_k \otimes D_k \end{bmatrix},$$
where $C_i$ and $D_i$ denote the $i$th rows of $C$ and $D$. The reason this product is useful is the following identity:
$$(C \bullet D)(x \otimes y) = Cx \circ Dy,$$
where $\circ$ is the element-wise (Hadamard) product. Since this operation can be computed in linear time, $C \bullet D$ can be applied to vectors with tensor structure much faster than ordinary matrices.
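This identity can be checked numerically. The following minimal NumPy sketch (Gaussian matrices and small sizes chosen purely for illustration) builds the face-splitting product explicitly and verifies that applying it to a Kronecker product equals the Hadamard product of the two smaller matrix–vector products.

```python
import numpy as np

rng = np.random.default_rng(1)
k, d1, d2 = 7, 5, 6                            # small illustrative sizes

C = rng.standard_normal((k, d1))
D = rng.standard_normal((k, d2))
x = rng.standard_normal(d1)
y = rng.standard_normal(d2)

# Face-splitting product: the i-th row is the Kronecker product of the i-th rows of C and D.
face_split = np.vstack([np.kron(C[i], D[i]) for i in range(k)])

slow = face_split @ np.kron(x, y)              # O(k * d1 * d2): materializes the tensor
fast = (C @ x) * (D @ y)                       # O(k * (d1 + d2)): never forms x ⊗ y

print(np.allclose(slow, fast))                 # True
```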
The tensor sketch of Pham and Pagh [4] computes $C^{(1)} x \ast C^{(2)} y$, where $C^{(1)}$ and $C^{(2)}$ are independent count sketch matrices and $\ast$ is vector convolution. They show that, amazingly, this equals $C(x \otimes y)$ – a count sketch of the tensor product!
It turns out that this relation can be seen in terms of the face-splitting product as
$$C^{(1)} x \ast C^{(2)} y = \mathcal{F}^{-1}\big(\mathcal{F} C^{(1)} x \circ \mathcal{F} C^{(2)} y\big),$$
where $\mathcal{F}$ is the discrete Fourier transform matrix, by the convolution theorem. Since $\mathcal{F}$ is unitary (up to scaling), applying $\mathcal{F}^{-1}$ doesn't impact the norm of the result and may be ignored. What's left is that $C \sim (\mathcal{F} C^{(1)}) \bullet (\mathcal{F} C^{(2)})$.
On the other hand,
$$\mathcal{F} C^{(1)} x \circ \mathcal{F} C^{(2)} y = \big((\mathcal{F} C^{(1)}) \bullet (\mathcal{F} C^{(2)})\big)(x \otimes y).$$
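A minimal NumPy illustration of this construction (hash functions and sizes chosen arbitrarily): two independent count sketches are combined by FFT-based circular convolution, and the result is compared against a count sketch of the full Kronecker product built from the derived hash and sign functions.

```python
import numpy as np

rng = np.random.default_rng(2)
d, m = 100, 16                                 # input dimension and sketch size (arbitrary)

def count_sketch_params(d, m, rng):
    return rng.integers(0, m, size=d), rng.choice([-1.0, 1.0], size=d)   # bucket and sign hashes

def count_sketch(v, h, s, m):
    out = np.zeros(m)
    np.add.at(out, h, s * v)                   # scatter-add each coordinate into its bucket
    return out

h1, s1 = count_sketch_params(d, m, rng)
h2, s2 = count_sketch_params(d, m, rng)
x, y = rng.standard_normal(d), rng.standard_normal(d)

# Tensor sketch: FFT-based circular convolution of the two count sketches.
cx, cy = count_sketch(x, h1, s1, m), count_sketch(y, h2, s2, m)
ts = np.real(np.fft.ifft(np.fft.fft(cx) * np.fft.fft(cy)))

# Count sketch of x ⊗ y, using the combined hash (h1 + h2 mod m) and sign (s1 * s2).
H = (h1[:, None] + h2[None, :]) % m
S = s1[:, None] * s2[None, :]
direct = np.zeros(m)
np.add.at(direct, H.ravel(), (S * np.outer(x, y)).ravel())

print(np.allclose(ts, direct))                 # True
```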
The problem with the original tensor sketch algorithm was that it used count sketch matrices, which aren't always very good dimensionality reductions.
In 2020 [15] it was shown that any matrices with sufficiently random, independent rows suffice to create a tensor sketch. This allows using matrices with stronger guarantees, such as real Gaussian Johnson–Lindenstrauss matrices.
In particular, they prove a general concentration theorem for such face-splitting constructions. For example, if the entries of the matrices are $\pm 1$, a sketch size on the order of $\varepsilon^{-2}\log(1/\delta) + \varepsilon^{-1}(\log(1/\delta))^c$ suffices (up to factors depending on $c$, the number of vectors in the tensor product), which matches the normal Johnson–Lindenstrauss bound of $m = O(\varepsilon^{-2}\log(1/\delta))$ when $\varepsilon$ is small.
The paper [15] also shows that the dependency on $(\log(1/\delta))^c$ is necessary for constructions using tensor randomized projections with Gaussian entries.
Because of the exponential dependence on $c$ in tensor sketches based on the face-splitting product, a different approach was developed in 2020 [15] which applies
$$M(x_1 \otimes x_2 \otimes \cdots \otimes x_c) = M^{(1)}\big(x_1 \otimes M^{(2)}(x_2 \otimes \cdots \otimes M^{(c-1)}(x_{c-1} \otimes x_c)\cdots)\big).$$
We can achieve such an $M$ by letting
$$M = M^{(1)}\,(I_d \otimes M^{(2)})\,(I_{d^2} \otimes M^{(3)}) \cdots (I_{d^{c-2}} \otimes M^{(c-1)}),$$
where $M^{(c-1)} \in \mathbb{R}^{m \times d^2}$ and $M^{(i)} \in \mathbb{R}^{m \times dm}$ for $i < c - 1$.
With this method, we only apply the general tensor sketch method to order-2 tensors, which avoids the exponential dependence on $c$ in the number of rows.
It can be proved [15] that combining dimensionality reductions like this only increases $\varepsilon$ by a factor of $\sqrt{c}$.
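The recursion can be illustrated for an order-3 tensor with the following sketch, which uses Gaussian face-splitting blocks as stand-ins for the order-2 sketches of [15] (sizes arbitrary, normalization constants omitted): $x_2 \otimes x_3$ is compressed to $m$ dimensions first, then $x_1 \otimes (\text{result})$, so no intermediate vector ever has more than $dm$ entries.

```python
import numpy as np

rng = np.random.default_rng(3)
d, m = 8, 16                                   # input dimension and sketch size (arbitrary)

def face_split(C, D):
    """Row-wise Kronecker (face-splitting) product."""
    return np.vstack([np.kron(C[i], D[i]) for i in range(C.shape[0])])

# Two order-2 sketches, each built as a face-splitting product of Gaussian blocks.
A2, B2 = rng.standard_normal((m, d)), rng.standard_normal((m, d))   # sketches x2 ⊗ x3
A1, B1 = rng.standard_normal((m, d)), rng.standard_normal((m, m))   # sketches x1 ⊗ (previous result)

x1, x2, x3 = (rng.standard_normal(d) for _ in range(3))

# Fast recursive evaluation: never materializes anything larger than d*m entries.
inner = (A2 @ x2) * (B2 @ x3)                  # sketch of x2 ⊗ x3
fast = (A1 @ x1) * (B1 @ inner)                # sketch of x1 ⊗ inner

# The equivalent explicit linear map applied to the full order-3 tensor (an m x d^3 matrix).
M = face_split(A1, B1 @ face_split(A2, B2))
slow = M @ np.kron(x1, np.kron(x2, x3))

print(np.allclose(fast, slow))                 # True
```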
The fast Johnson–Lindenstrauss transform is a dimensionality reduction matrix that can be applied in near-linear time. Given a general matrix $M \in \mathbb{R}^{k \times d}$, computing the matrix–vector product $Mx$ takes $O(kd)$ time. The Fast Johnson–Lindenstrauss Transform (FJLT) [16] was introduced by Ailon and Chazelle in 2006. A version of this method takes $M = SHD$, where $S$ is a sampling matrix that keeps $k$ of the $d$ coordinates, $H$ is a $d \times d$ Hadamard matrix, and $D$ is a $d \times d$ diagonal matrix with independent random $\pm 1$ values on the diagonal.
The matrix–vector multiplication $Mx$ can then be computed in $O(d \log d)$ time using the fast Hadamard transform.
If the diagonal matrix is replaced by one which has a tensor product of values on the diagonal, instead of the values being fully independent, it is possible to compute $SHD(x \otimes y)$ fast.
For an example of this, let $r, s \in \{-1, 1\}^d$ be two independent vectors of random signs and let $D$ be a diagonal matrix with $r \otimes s$ on the diagonal. We can then split up $SHD(x \otimes y)$ as follows:
$$S H_{d^2} D (x \otimes y) = S (H_d \otimes H_d)\big((r \circ x) \otimes (s \circ y)\big) = S \big(H_d D_r x \otimes H_d D_s y\big),$$
using that the Hadamard matrix satisfies $H_{d^2} = H_d \otimes H_d$, and where $D_r$ and $D_s$ are the diagonal matrices with $r$ and $s$ on the diagonal. In other words, $SHD$ splits up into two fast Johnson–Lindenstrauss transformations, and the total reduction takes time $O(d \log d + k)$ rather than $O(d^2 \log d)$ as with the direct approach.
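The splitting can be checked with a small NumPy sketch; it assumes SciPy for the Sylvester-type Hadamard matrix (which satisfies $H_{d^2} = H_d \otimes H_d$), and sampling and normalization are simplified for illustration.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(4)
d, k = 8, 5                                    # d must be a power of two; k sampled coordinates

H = hadamard(d)                                # Sylvester construction, so kron(H, H) is the size-d^2 Hadamard matrix
r, s = rng.choice([-1, 1], size=d), rng.choice([-1, 1], size=d)
x, y = rng.standard_normal(d), rng.standard_normal(d)

# Direct application of S H D to the d^2-dimensional Kronecker product.
big = np.kron(H, H) @ (np.kron(r, s) * np.kron(x, y))
rows = rng.integers(0, d * d, size=k)          # S: sample k coordinates (scaling omitted)
direct = big[rows]

# Split form: two independent d-dimensional transforms, then sample products of entries.
u, v = H @ (r * x), H @ (s * y)
i, j = rows // d, rows % d                     # coordinate (i, j) inside the Kronecker product
fast = u[i] * v[j]

print(np.allclose(direct, fast))               # True
```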
The same approach can be extended to compute higher-degree products, such as $SHD(x \otimes y \otimes z)$.
Ahle et al. [15] show that sketches of this form, with sufficiently many rows, satisfy $\big|\|Mx\|_2 - \|x\|_2\big| \le \varepsilon \|x\|_2$ for any vector $x$ with probability $1 - \delta$, while still allowing fast multiplication with degree-$c$ tensors.
Jin et al. [17], the same year, showed a similar result for the more general class of matrices satisfying the restricted isometry property (RIP), which includes the subsampled Hadamard matrices. They showed that such matrices allow splitting into tensors, provided the number of rows is large enough. In the case of degree-2 tensors this matches the previous result.
These fast constructions can again be combined with the recursive approach mentioned above, giving the fastest overall tensor sketch.
It is also possible to do so-called "data-aware" tensor sketching. Instead of applying a random matrix to the data, the data points are sampled independently with a certain probability depending on the norm of the point. [18]
Kernel methods are popular in machine learning because they give the algorithm designer the freedom to design a "feature space" in which to measure the similarity of their data points. A simple kernel-based binary classifier is based on the following computation:
$$\hat{y}(x') = \operatorname{sgn}\left(\sum_{i=1}^n y_i \, k(x_i, x')\right),$$
where $x_i \in \mathbb{R}^d$ are the data points, $y_i$ is the label of the $i$th point (either −1 or +1), and $\hat{y}(x')$ is the prediction of the class of $x'$. The function $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ is the kernel. Typical examples are the radial basis function kernel, $k(x, x') = \exp(-\|x - x'\|_2^2)$, and polynomial kernels such as $k(x, x') = \langle x, x' \rangle^3$.
When used this way, the kernel method is called "implicit". Sometimes it is faster to do an "explicit" kernel method, in which a pair of functions $f, g : \mathbb{R}^d \to \mathbb{R}^D$ is found such that $k(x, x') = \langle f(x), g(x') \rangle$. This allows the above computation to be expressed as
$$\hat{y}(x') = \operatorname{sgn}\left(\left\langle \sum_{i=1}^n y_i f(x_i),\; g(x') \right\rangle\right) = \operatorname{sgn}\big(\langle v, g(x') \rangle\big),$$
where the value $v = \sum_{i=1}^n y_i f(x_i)$ can be computed in advance.
The problem with this method is that the feature space can be very large. That is, $D \gg d$. For example, for the polynomial kernel $k(x, x') = \langle x, x' \rangle^3$ we get $f(x) = x \otimes x \otimes x$ and $g(x') = x' \otimes x' \otimes x'$, where $\otimes$ is the tensor product and where $D = d^3$. If $d$ is already large, $D$ can be much larger than the number of data points ($n$) and so the explicit method is inefficient.
The idea of tensor sketch is that we can compute approximate functions $f', g' : \mathbb{R}^d \to \mathbb{R}^t$, where $t$ can even be smaller than $d$, and which still have the property that $\langle f'(x), g'(x') \rangle \approx k(x, x')$.
This method was shown in 2020 [15] to work even for high degree polynomials and radial basis function kernels.
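As a hedged illustration of the explicit approach, the following sketch approximates the degree-3 polynomial kernel $\langle x, x' \rangle^3$ by tensor-sketching each data point with the count-sketch/FFT construction described above. All parameters are arbitrary, and the agreement improves as the sketch size $m$ grows.

```python
import numpy as np

rng = np.random.default_rng(5)
d, m, c = 50, 16_384, 3                        # input dim, sketch size, polynomial degree (arbitrary)

# c independent count sketch hash/sign functions.
hs = [rng.integers(0, m, size=d) for _ in range(c)]
ss = [rng.choice([-1.0, 1.0], size=d) for _ in range(c)]

def tensor_sketch(v):
    """Approximate feature map f'(v), so that <f'(x), f'(x')> is roughly <x, x'>^c."""
    prod = np.ones(m, dtype=complex)
    for h, s in zip(hs, ss):
        cs = np.zeros(m)
        np.add.at(cs, h, s * v)                # count sketch of v
        prod *= np.fft.fft(cs)                 # convolution done in the frequency domain
    return np.real(np.fft.ifft(prod))

x = rng.standard_normal(d); x /= np.linalg.norm(x)
y = x + 0.1 * rng.standard_normal(d); y /= np.linalg.norm(y)

print(np.dot(tensor_sketch(x), tensor_sketch(y)))   # approximately equal to ...
print(np.dot(x, y) ** 3)                            # ... the exact kernel value
```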
Assume we have two large datasets, represented as matrices $X, Y \in \mathbb{R}^{n \times d}$, and we want to find the pairs of rows $(i, j)$ with the largest inner products $\langle X_i, Y_j \rangle$. We could compute $Z = X Y^T \in \mathbb{R}^{n \times n}$ and simply look at all $n^2$ possibilities. However, this would take at least $n^2$ time, and probably closer to $n^2 d$ using standard matrix multiplication techniques.
The idea of Compressed Matrix Multiplication is the general identity
$$X Y^T = \sum_{k=1}^{d} X_{\cdot k} \otimes Y_{\cdot k},$$
where $X_{\cdot k}$ and $Y_{\cdot k}$ denote the $k$th columns of $X$ and $Y$, and $\otimes$ is the tensor (outer) product. Since we can compute a (linear) sketch of each $X_{\cdot k} \otimes Y_{\cdot k}$ efficiently, we can sum those up to get an approximation of the complete product.
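A hedged NumPy illustration of this idea (sizes and the planted heavy entry are arbitrary choices): each column outer product is tensor-sketched with the count-sketch/FFT construction, the sketches are summed, and an individual entry of $XY^T$ is then estimated from the resulting $m$-dimensional sketch.

```python
import numpy as np

rng = np.random.default_rng(6)
n, d, m = 200, 30, 4096                        # matrix sizes and sketch size (arbitrary)

X, Y = rng.standard_normal((n, d)), rng.standard_normal((n, d))
X[3] = Y[17] = 10 * rng.standard_normal(d)     # plant one pair of rows with a large inner product

# Count sketch hash/sign functions for the row indices of X and of Y.
h1, s1 = rng.integers(0, m, size=n), rng.choice([-1.0, 1.0], size=n)
h2, s2 = rng.integers(0, m, size=n), rng.choice([-1.0, 1.0], size=n)

def count_sketch(v, h, s):
    out = np.zeros(m)
    np.add.at(out, h, s * v)
    return out

# Sum, in the Fourier domain, the tensor sketches of the d column outer products.
acc = np.zeros(m, dtype=complex)
for k in range(d):
    acc += np.fft.fft(count_sketch(X[:, k], h1, s1)) * np.fft.fft(count_sketch(Y[:, k], h2, s2))
sketch = np.real(np.fft.ifft(acc))             # a count sketch of vec(X Y^T)

# Estimate entry (i, j) of X Y^T from the m-dimensional sketch.
i, j = 3, 17
print(s1[i] * s2[j] * sketch[(h1[i] + h2[j]) % m], X[i] @ Y[j])   # approximately equal
```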
Bilinear pooling is the technique of taking two input vectors, $x$ and $y$, from different sources, and using the tensor product $x \otimes y$ as the input layer to a neural network.
In [19] the authors considered using tensor sketch to reduce the number of variables needed.
In 2017 another paper [20] took the FFT of the input features before combining them with an element-wise product. This again corresponds to the original tensor sketch.