Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph. [1] Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors (perhaps over a network).
Directly applying the mathematical definition of matrix multiplication gives an algorithm that takes time on the order of n3 field operations to multiply two n × n matrices over that field (Θ(n3) in big O notation). Better asymptotic bounds on the time required to multiply matrices have been known since the Strassen's algorithm in the 1960s, but the optimal time (that is, the computational complexity of matrix multiplication) remains unknown. As of October 2022 [update] , the best announced bound on the asymptotic complexity of a matrix multiplication algorithm is O(n2.37188) time, given by Duan, Wu and Zhou [2] announced in a preprint. This improves on the bound of O(n2.3728596) time, given by Josh Alman and Virginia Vassilevska Williams. [3] [4] However, this algorithm is a galactic algorithm because of the large constants and cannot be realized practically.
The definition of matrix multiplication is that if C = AB for an n × m matrix A and an m × p matrix B, then C is an n × p matrix with entries
From this, a simple algorithm can be constructed which loops over the indices i from 1 through n and j from 1 through p, computing the above using a nested loop:
This algorithm takes time Θ(nmp) (in asymptotic notation). [1] A common simplification for the purpose of algorithm analysis is to assume that the inputs are all square matrices of size n × n, in which case the running time is Θ(n3), i.e., cubic in the size of the dimension. [5]
The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or asymptotic running time. However, the order can have a considerable impact on practical performance due to the memory access patterns and cache use of the algorithm; [1] which order is best also depends on whether the matrices are stored in row-major order, column-major order, or a mix of both.
In particular, in the idealized case of a fully associative cache consisting of M bytes and b bytes per cache line (i.e. M/b cache lines), the above algorithm is sub-optimal for A and B stored in row-major order. When n > M/b, every iteration of the inner loop (a simultaneous sweep through a row of A and a column of B) incurs a cache miss when accessing an element of B. This means that the algorithm incurs Θ(n3) cache misses in the worst case. As of 2010 [update] , the speed of memories compared to that of processors is such that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices. [6]
The optimal variant of the iterative algorithm for A and B in row-major layout is a tiled version, where the matrix is implicitly divided into square tiles of size √M by √M: [6] [7]
In the idealized cache model, this algorithm incurs only Θ(n3/b√M) cache misses; the divisor b√M amounts to several orders of magnitude on modern machines, so that the actual calculations dominate the running time, rather than the cache misses. [6]
An alternative to the iterative algorithm is the divide-and-conquer algorithm for matrix multiplication. This relies on the block partitioning
which works for all square matrices whose dimensions are powers of two, i.e., the shapes are 2n × 2n for some n. The matrix product is now
which consists of eight multiplications of pairs of submatrices, followed by an addition step. The divide-and-conquer algorithm computes the smaller multiplications recursively, using the scalar multiplication c11 = a11b11 as its base case.
The complexity of this algorithm as a function of n is given by the recurrence [5]
accounting for the eight recursive calls on matrices of size n/2 and Θ(n2) to sum the four pairs of resulting matrices element-wise. Application of the master theorem for divide-and-conquer recurrences shows this recursion to have the solution Θ(n3), the same as the iterative algorithm. [5]
A variant of this algorithm that works for matrices of arbitrary shapes and is faster in practice [6] splits matrices in two instead of four submatrices, as follows. [8] Splitting a matrix now means dividing it into two parts of equal size, or as close to equal sizes as possible in the case of odd dimensions.
The cache miss rate of recursive matrix multiplication is the same as that of a tiled iterative version, but unlike that algorithm, the recursive algorithm is cache-oblivious: [8] there is no tuning parameter required to get optimal cache performance, and it behaves well in a multiprogramming environment where cache sizes are effectively dynamic due to other processes taking up cache space. [6] (The simple iterative algorithm is cache-oblivious as well, but much slower in practice if the matrix layout is not adapted to the algorithm.)
The number of cache misses incurred by this algorithm, on a machine with M lines of ideal cache, each of size b bytes, is bounded by [8] : 13
Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was Strassen's algorithm, devised by Volker Strassen in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two 2 × 2-matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of . Strassen's algorithm is more complex, and the numerical stability is reduced compared to the naïve algorithm, [9] but it is faster in cases where n > 100 or so [1] and appears in several libraries, such as BLAS. [10] It is very useful for large matrices over exact domains such as finite fields, where numerical stability is not an issue.
Since Strassen's algorithm is actually used in practical numerical software and computer algebra systems improving on the constants hidden in the big O notation has its merits. A table which compares key aspects of improved version based on recursive multiplication of 2x2-block matrices via 7 block matrix multiplications follows. As usual gives the dimensions of the matrix and designates the memory size.
Year | Reference | #matrix multiplications per step | #matrix additions per step | total arithmetic operations | total I/O-complexity |
---|---|---|---|---|---|
1969 | Strassen [11] | 7 | 18 | ||
1971 | Winograd [12] | 7 | 15 | ||
2017 | Karstadt, Schwartz [13] | 7 | 12 |
It is known that a Strassen-like algorithm with a 2x2-block matrix step requires at least 7 block matrix multiplications. In 1976 Probert [14] showed that such an algorithm requires at least 15 additions (including subtractions), however a hidden assumption was that the blocks and the 2x2-blockmatrix are represented in the same basis. Karstadt and Schwartz computed in different bases and traded 3 additions for less expensive basis transformations. They also proved that one cannot go below 12 additions per step using different bases. In subsequent work Beniamini et el. [15] applied this base change trick to more general decompositions than 2x2-blockmatrices and improved leading constant for their run times.
It is an open question in theoretical computer science how well Strassen's algorithm can be improved in terms of asymptotic complexity. The matrix multiplication exponent, usually denoted , is the smallest real number for which any matrix over a field can be multiplied together using field operations. The current best bound on is , by Josh Alman and Virginia Vassilevska Williams. [3] This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by Don Coppersmith and Shmuel Winograd in 1990. [16] The conceptual idea of these algorithms are similar to Strassen's algorithm: a way is devised for multiplying two k × k-matrices with fewer than k3 multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the Big O notation is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers. [17] [18]
Freivalds' algorithm is a simple Monte Carlo algorithm that, given matrices A, B and C, verifies in Θ(n2) time if AB = C.
In 2022, DeepMind introduced AlphaTensor, a neural network that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans and some that were not. [19] Operations were restricted to the non-commutative ground field (normal arithmetic) and finite field (mod 2 arithmetic). The best "practical" (explicit low-rank decomposition of a matrix multiplication tensor) algorithm found ran in O(n2.778). [20] Finding low-rank decompositions of such tensors (and beyond) is NP-hard; optimal multiplication even for 3x3 matrices remains unknown, even in commutative field. [20] On 4x4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen’s algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5x5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent 4x4 algorithm, and separately tweaked Deepmind's 96-step 5x5 algorithm down to 95 steps in mod 2 arithmetic and to 97 [21] in normal arithmetic. [22] Some algorithms were never discovered before, e.g. (4, 5, 5) got improved to 76 from 80 in normal and mod 2 arithmetics.
The divide-and-conquer algorithm sketched earlier can be parallelized in two ways for shared-memory multiprocessors. These are based on the fact that the eight recursive matrix multiplications in
can be performed independently of each other, as can the four summations (although the algorithm needs to "join" the multiplications before doing the summations). Exploiting the full parallelism of the problem, one obtains an algorithm that can be expressed in fork–join style pseudocode: [23]
Proceduremultiply(C, A, B):
Procedureadd(C, T) adds T into C, element-wise:
Here, fork is a keyword that signal a computation may be run in parallel with the rest of the function call, while join waits for all previously "forked" computations to complete. partition achieves its goal by pointer manipulation only.
This algorithm has a critical path length of Θ(log2n) steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible speedup of Θ(n3/log2n) on any real computer. The algorithm isn't practical due to the communication cost inherent in moving data to and from the temporary matrix T, but a more practical variant achieves Θ(n2) speedup, without using a temporary matrix. [23]
On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed memory multi-node machine it is the amount transferred between nodes; in either case it is called the communication bandwidth. The naïve algorithm using three nested loops uses Ω(n3) communication bandwidth.
Cannon's algorithm, also known as the 2D algorithm, is a communication-avoiding algorithm that partitions each input matrix into a block matrix whose elements are submatrices of size √M/3 by √M/3, where M is the size of fast memory. [24] The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to O(n3/√M), which is asymptotically optimal (for algorithms performing Ω(n3) computation). [25] [26]
In a distributed setting with p processors arranged in a √p by √p 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting O(n2/√p) words, which is asymptotically optimal assuming that each node stores the minimum O(n2/p) elements. [26] This can be improved by the 3D algorithm, which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row. [27] This algorithm transmits O(n2/p2/3) words per processor, which is asymptotically optimal. [26] However, this requires replicating each input matrix element p1/3 times, and so requires a factor of p1/3 more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime. [27] "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth. [28] On modern distributed computing environments such as MapReduce, specialized multiplication algorithms have been developed. [29]
There are a variety of algorithms for multiplication on meshes. For multiplication of two n×n on a standard two-dimensional mesh using the 2D Cannon's algorithm, one can complete the multiplication in 3n-2 steps although this is reduced to half this number for repeated computations. [30] The standard array is inefficient because the data from the two matrices does not arrive simultaneously and it must be padded with zeroes.
The result is even faster on a two-layered cross-wired mesh, where only 2n-1 steps are needed. [31] The performance improves further for repeated computations leading to 100% efficiency. [32] The cross-wired mesh array may be seen as a special case of a non-planar (i.e. multilayered) processing structure. [33]
In mathematics, the determinant is a scalar value that is a function of the entries of a square matrix. The determinant of a matrix A is commonly denoted det(A), det A, or |A|. Its value characterizes some properties of the matrix and the linear map represented by the matrix. In particular, the determinant is nonzero if and only if the matrix is invertible and the linear map represented by the matrix is an isomorphism. The determinant of a product of matrices is the product of their determinants.
In mathematics and computer programming, exponentiating by squaring is a general method for fast computation of large positive integer powers of a number, or more generally of an element of a semigroup, like a polynomial or a square matrix. Some variants are commonly referred to as square-and-multiply algorithms or binary exponentiation. These can be of quite general use, for example in modular arithmetic or powering of matrices. For semigroups for which additive notation is commonly used, like elliptic curves used in cryptography, this method is also referred to as double-and-add.
In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The resulting matrix, known as the matrix product, has the number of rows of the first and the number of columns of the second matrix. The product of matrices A and B is denoted as AB.
Dynamic programming is both a mathematical optimization method and an algorithmic paradigm. The method was developed by Richard Bellman in the 1950s and has found applications in numerous fields, from aerospace engineering to economics.
In linear algebra, the Cholesky decomposition or Cholesky factorization is a decomposition of a Hermitian, positive-definite matrix into the product of a lower triangular matrix and its conjugate transpose, which is useful for efficient numerical solutions, e.g., Monte Carlo simulations. It was discovered by André-Louis Cholesky for real matrices, and posthumously published in 1924. When it is applicable, the Cholesky decomposition is roughly twice as efficient as the LU decomposition for solving systems of linear equations.
In computer science, divide and conquer is an algorithm design paradigm. A divide-and-conquer algorithm recursively breaks down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved directly. The solutions to the sub-problems are then combined to give a solution to the original problem.
In linear algebra, an n-by-n square matrix A is called invertible if there exists an n-by-n square matrix B such that
In mathematics, scalar multiplication is one of the basic operations defining a vector space in linear algebra. In common geometrical contexts, scalar multiplication of a real Euclidean vector by a positive real number multiplies the magnitude of the vector without changing its direction. Scalar multiplication is the multiplication of a vector by a scalar, and is to be distinguished from inner product of two vectors.
Toom–Cook, sometimes known as Toom-3, named after Andrei Toom, who introduced the new algorithm with its low complexity, and Stephen Cook, who cleaned the description of it, is a multiplication algorithm for large integers.
In mathematics, a block matrix or a partitioned matrix is a matrix that is interpreted as having been broken into sections called blocks or submatrices. Intuitively, a matrix interpreted as a block matrix can be visualized as the original matrix with a collection of horizontal and vertical lines, which break it up, or partition it, into a collection of smaller matrices. Any matrix may be interpreted as a block matrix in one or more ways, with each interpretation defined by how its rows and columns are partitioned.
In linear algebra, a tridiagonal matrix is a band matrix that has nonzero elements only on the main diagonal, the subdiagonal/lower diagonal, and the supradiagonal/upper diagonal. For example, the following matrix is tridiagonal:
In linear algebra, the Strassen algorithm, named after Volker Strassen, is an algorithm for matrix multiplication. It is faster than the standard matrix multiplication algorithm for large matrices, with a better asymptotic complexity, although the naive algorithm is often better for smaller matrices. The Strassen algorithm is slower than the fastest known algorithms for extremely large matrices, but such galactic algorithms are not useful in practice, as they are much slower for matrices of practical size. For small matrices even faster algorithms exist.
The Schönhage–Strassen algorithm is an asymptotically fast multiplication algorithm for large integers, published by Arnold Schönhage and Volker Strassen in 1971. It works by recursively applying fast Fourier transform (FFT) over the integers modulo 2n+1. The run-time bit complexity to multiply two n-digit numbers using the algorithm is in big O notation.
Matrix chain multiplication is an optimization problem concerning the most efficient way to multiply a given sequence of matrices. The problem is not actually to perform the multiplications, but merely to decide the sequence of the matrix multiplications involved. The problem may be solved using dynamic programming.
In computing, a cache-oblivious algorithm is an algorithm designed to take advantage of a processor cache without having the size of the cache as an explicit parameter. An optimal cache-oblivious algorithm is a cache-oblivious algorithm that uses the cache optimally. Thus, a cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes. Cache-oblivious algorithms are contrasted with explicit loop tiling, which explicitly breaks a problem into blocks that are optimally sized for a given cache.
In numerical analysis and linear algebra, lower–upper (LU) decomposition or factorization factors a matrix as the product of a lower triangular matrix and an upper triangular matrix. The product sometimes includes a permutation matrix as well. LU decomposition can be viewed as the matrix form of Gaussian elimination. Computers usually solve square systems of linear equations using LU decomposition, and it is also a key step when inverting a matrix or computing the determinant of a matrix. The LU decomposition was introduced by the Polish astronomer Tadeusz Banachiewicz in 1938. To quote: "It appears that Gauss and Doolittle applied the method [of elimination] only to symmetric equations. More recent authors, for example, Aitken, Banachiewicz, Dwyer, and Crout … have emphasized the use of the method, or variations of it, in connection with non-symmetric problems … Banachiewicz … saw the point … that the basic problem is really one of matrix factorization, or “decomposition” as he called it." It's also referred to as LR decomposition.
The following tables list the computational complexity of various algorithms for common mathematical operations.
In theoretical computer science, the computational complexity of matrix multiplication dictates how quickly the operation of matrix multiplication can be performed. Matrix multiplication algorithms are a central subroutine in theoretical and numerical algorithms for numerical linear algebra and optimization, so finding the fastest algorithm for matrix multiplication is of major practical relevance.
Multidimensional Digital Signal Processing (MDSP) refers to the extension of Digital signal processing (DSP) techniques to signals that vary in more than one dimension. While conventional DSP typically deals with one-dimensional data, such as time-varying audio signals, MDSP involves processing signals in two or more dimensions. Many of the principles from one-dimensional DSP, such as Fourier transforms and filter design, have analogous counterparts in multidimensional signal processing.
Communication-avoiding algorithms minimize movement of data within a memory hierarchy for improving its running-time and energy consumption. These minimize the total of two costs : arithmetic and communication. Communication, in this context refers to moving data, either between levels of memory or between multiple processors over a network. It is much more expensive than arithmetic.
The Coppersmith–Winograd algorithm is not practical, due to the very large hidden constant in the upper bound on the number of multiplications required.
Even if someone manages to prove one of the conjectures—thereby demonstrating that ω = 2—the wreath product approach is unlikely to be applicable to the large matrix problems that arise in practice. [...] the input matrices must be astronomically large for the difference in time to be apparent.