Matrix multiplication algorithm


Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition and in seemingly unrelated problems such as counting the paths through a graph. [1] Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors (perhaps over a network).


Directly applying the mathematical definition of matrix multiplication gives an algorithm that takes time on the order of n^3 field operations to multiply two n × n matrices over that field (Θ(n^3) in big O notation). Better asymptotic bounds on the time required to multiply matrices have been known since Strassen's algorithm in the 1960s, but the optimal time (that is, the computational complexity of matrix multiplication) remains unknown. As of October 2022, the best announced bound on the asymptotic complexity of a matrix multiplication algorithm is O(n^2.37188) time, announced in a preprint by Duan, Wu and Zhou. [2] This improves on the bound of O(n^2.3728596) time given by Josh Alman and Virginia Vassilevska Williams. [3] [4] However, this is a galactic algorithm because of its large constant factors and cannot be realized practically.

Iterative algorithm

The definition of matrix multiplication is that if C = AB for an n × m matrix A and an m × p matrix B, then C is an n × p matrix with entries

cij = ai1 b1j + ai2 b2j + … + aim bmj = Σ (k = 1 to m) aik bkj,  for i = 1, …, n and j = 1, …, p.

From this, a simple algorithm can be constructed which loops over the indices i from 1 through n and j from 1 through p, computing the above using a nested loop:

  • Input: matrices A and B
  • Let C be a new matrix of the appropriate size
  • For i from 1 to n:
    • For j from 1 to p:
      • Let sum = 0
      • For k from 1 to m:
        • Set sum ← sum + Aik × Bkj
      • Set Cij ← sum
  • Return C

This algorithm takes time Θ(nmp) (in asymptotic notation). [1] A common simplification for the purpose of algorithm analysis is to assume that the inputs are all square matrices of size n × n, in which case the running time is Θ(n^3), i.e., cubic in the size of the dimension. [5]
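As a concrete illustration, the following is a minimal Python sketch of this triple loop, with matrices stored as lists of lists; the function name matmul_naive and the sample matrices are illustrative, not part of the source.

def matmul_naive(A, B):
    """Multiply an n × m matrix A by an m × p matrix B, both lists of lists."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for j in range(p):
            s = 0
            for k in range(m):
                s += A[i][k] * B[k][j]   # sum <- sum + Aik × Bkj
            C[i][j] = s
    return C

# Example: a 2 × 3 matrix times a 3 × 2 matrix.
print(matmul_naive([[1, 2, 3], [4, 5, 6]],
                   [[7, 8], [9, 10], [11, 12]]))   # [[58, 64], [139, 154]]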

Cache behavior

Illustration of row- and column-major order

The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or asymptotic running time. However, the order can have a considerable impact on practical performance due to the memory access patterns and cache use of the algorithm; [1] which order is best also depends on whether the matrices are stored in row-major order, column-major order, or a mix of both.

In particular, in the idealized case of a fully associative cache consisting of M bytes and b bytes per cache line (i.e. M/b cache lines), the above algorithm is sub-optimal for A and B stored in row-major order. When n > M/b, every iteration of the inner loop (a simultaneous sweep through a row of A and a column of B) incurs a cache miss when accessing an element of B. This means that the algorithm incurs Θ(n^3) cache misses in the worst case. As of 2010, the speed of memories compared to that of processors is such that the cache misses, rather than the actual calculations, dominate the running time for sizable matrices. [6]
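For example, when both A and B are stored row-major, a hypothetical reordering of the loops to i–k–j (a sketch of the idea, not prescribed by the source) makes the innermost loop sweep along rows of B and C instead of down a column of B, which is typically far kinder to the cache:

def matmul_ikj(A, B):
    """Same arithmetic as the naive algorithm, but with the j and k loops
    swapped; the innermost loop now walks rows of B and C sequentially,
    which suits row-major storage."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            a_ik = A[i][k]
            for j in range(p):
                C[i][j] += a_ik * B[k][j]
    return C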

The optimal variant of the iterative algorithm for A and B in row-major layout is a tiled version, where the matrix is implicitly divided into square tiles of size √M by √M: [6] [7]

  • Input: matrices A and B
  • Let C be a new matrix of the appropriate size
  • Pick a tile size T = Θ(√M)
  • For I from 1 to n in steps of T:
    • For J from 1 to p in steps of T:
      • For K from 1 to m in steps of T:
        • Multiply the blocks A[I:I+T, K:K+T] and B[K:K+T, J:J+T] into C[I:I+T, J:J+T], that is:
        • For i from I to min(I + T, n):
          • For j from J to min(J + T, p):
            • Let sum = 0
            • For k from K to min(K + T, m):
              • Set sum ← sum + Aik × Bkj
            • Set Cij ← Cij + sum
  • Return C

In the idealized cache model, this algorithm incurs only Θ(n^3/(b√M)) cache misses; the divisor b√M amounts to several orders of magnitude on modern machines, so that the actual calculations dominate the running time, rather than the cache misses. [6]
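A Python sketch of the tiled loop nest follows. The default tile size T = 64 is an arbitrary illustrative value; in practice it would be tuned so that three T × T tiles fit in cache.

def matmul_tiled(A, B, T=64):
    """Tiled (blocked) multiplication of an n × m matrix A by an m × p matrix B.
    The tile size T should be on the order of √M for a cache of M elements;
    tiles are clipped at the matrix borders."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for I in range(0, n, T):
        for J in range(0, p, T):
            for K in range(0, m, T):
                # Multiply the A[I:I+T, K:K+T] and B[K:K+T, J:J+T] tiles into C[I:I+T, J:J+T].
                for i in range(I, min(I + T, n)):
                    for j in range(J, min(J + T, p)):
                        s = 0
                        for k in range(K, min(K + T, m)):
                            s += A[i][k] * B[k][j]
                        C[i][j] += s
    return C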

Divide-and-conquer algorithm

An alternative to the iterative algorithm is the divide-and-conquer algorithm for matrix multiplication. This relies on the block partitioning

A = ( A11  A12 ),   B = ( B11  B12 ),   C = ( C11  C12 ),
    ( A21  A22 )        ( B21  B22 )        ( C21  C22 )

which works for all square matrices whose dimensions are powers of two, i.e., the shapes are 2^n × 2^n for some n. The matrix product is now

C11 = A11B11 + A12B21
C12 = A11B12 + A12B22
C21 = A21B11 + A22B21
C22 = A21B12 + A22B22

which consists of eight multiplications of pairs of submatrices, followed by an addition step. The divide-and-conquer algorithm computes the smaller multiplications recursively, using the scalar multiplication c11 = a11b11 as its base case.

The complexity of this algorithm as a function of n is given by the recurrence [5]

T(1) = Θ(1)
T(n) = 8T(n/2) + Θ(n^2),

accounting for the eight recursive calls on matrices of size n/2 and Θ(n^2) to sum the four pairs of resulting matrices element-wise. Application of the master theorem for divide-and-conquer recurrences shows this recursion to have the solution T(n) = Θ(n^3), the same as the iterative algorithm. [5]
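A compact recursive sketch in Python, assuming square matrices whose dimension is a power of two (all names are illustrative):

def matmul_dc(A, B):
    """Divide-and-conquer multiplication of two n × n matrices, n a power of two:
    eight recursive half-size products followed by four block additions,
    i.e. T(n) = 8T(n/2) + Θ(n^2)."""
    n = len(A)
    if n == 1:
        return [[A[0][0] * B[0][0]]]        # base case: scalar multiplication
    h = n // 2
    def quad(M):                            # split M into four h × h blocks
        return ([r[:h] for r in M[:h]], [r[h:] for r in M[:h]],
                [r[:h] for r in M[h:]], [r[h:] for r in M[h:]])
    def add(X, Y):                          # element-wise block addition
        return [[x + y for x, y in zip(rx, ry)] for rx, ry in zip(X, Y)]
    A11, A12, A21, A22 = quad(A)
    B11, B12, B21, B22 = quad(B)
    C11 = add(matmul_dc(A11, B11), matmul_dc(A12, B21))
    C12 = add(matmul_dc(A11, B12), matmul_dc(A12, B22))
    C21 = add(matmul_dc(A21, B11), matmul_dc(A22, B21))
    C22 = add(matmul_dc(A21, B12), matmul_dc(A22, B22))
    # Reassemble the four result blocks into one matrix.
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + [r1 + r2 for r1, r2 in zip(C21, C22)]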

Non-square matrices

A variant of this algorithm that works for matrices of arbitrary shapes and is faster in practice [6] splits matrices into two instead of four submatrices, as follows; [8] a short sketch of the recursion is given after the list. Splitting a matrix here means dividing it into two parts of equal size, or as close to equal sizes as possible in the case of odd dimensions.

  • Inputs: matrices A of size n × m, B of size m × p.
  • Base case: if max(n, m, p) is below some threshold, use an unrolled version of the iterative algorithm.
  • Recursive cases:
  • If max(n, m, p) = n, split A horizontally into an upper block A1 and a lower block A2; then AB is A1B stacked above A2B.
  • Else, if max(n, m, p) = p, split B vertically into a left block B1 and a right block B2; then AB is AB1 placed beside AB2.
  • Otherwise, max(n, m, p) = m. Split A vertically into A1 and A2 and B horizontally into B1 and B2; then AB = A1B1 + A2B2.
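The following is a minimal sketch of this recursion in Python, with matrices as lists of lists; the threshold of 32 and the simple triple-loop base case are illustrative choices rather than part of the source algorithm.

def matmul_rec(A, B, threshold=32):
    """Cache-oblivious recursive multiplication of an n × m matrix A by an
    m × p matrix B: always split the largest of the three dimensions in half."""
    n, m, p = len(A), len(B), len(B[0])
    if max(n, m, p) <= threshold:
        # Base case: plain iterative multiplication on a small block.
        C = [[0] * p for _ in range(n)]
        for i in range(n):
            for k in range(m):
                for j in range(p):
                    C[i][j] += A[i][k] * B[k][j]
        return C
    if n >= m and n >= p:   # split A horizontally: AB = [A1·B stacked above A2·B]
        h = n // 2
        return matmul_rec(A[:h], B, threshold) + matmul_rec(A[h:], B, threshold)
    if p >= m:              # split B vertically: AB = [A·B1 beside A·B2]
        h = p // 2
        C1 = matmul_rec(A, [row[:h] for row in B], threshold)
        C2 = matmul_rec(A, [row[h:] for row in B], threshold)
        return [r1 + r2 for r1, r2 in zip(C1, C2)]
    h = m // 2              # split A vertically and B horizontally: AB = A1·B1 + A2·B2
    C1 = matmul_rec([row[:h] for row in A], B[:h], threshold)
    C2 = matmul_rec([row[h:] for row in A], B[h:], threshold)
    return [[x + y for x, y in zip(r1, r2)] for r1, r2 in zip(C1, C2)]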

Cache behavior

The cache miss rate of recursive matrix multiplication is the same as that of a tiled iterative version, but unlike that algorithm, the recursive algorithm is cache-oblivious: [8] there is no tuning parameter required to get optimal cache performance, and it behaves well in a multiprogramming environment where cache sizes are effectively dynamic due to other processes taking up cache space. [6] (The simple iterative algorithm is cache-oblivious as well, but much slower in practice if the matrix layout is not adapted to the algorithm.)

The number of cache misses incurred by this algorithm, on a machine with M lines of ideal cache, each of size b bytes, is bounded by [8] :13

Θ(m + n + p + (mn + np + mp)/b + mnp/(b√M)).
Sub-cubic algorithms

Improvement over time of the estimates of the exponent ω in the O(n^ω) computational complexity of matrix multiplication.

Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was Strassen's algorithm, devised by Volker Strassen in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two 2 × 2 matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of O(n^(log2 7)) ≈ O(n^2.807). Strassen's algorithm is more complex, and the numerical stability is reduced compared to the naïve algorithm, [9] but it is faster in cases where n > 100 or so [1] and appears in several libraries, such as BLAS. [10] It is very useful for large matrices over exact domains such as finite fields, where numerical stability is not an issue.
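One Strassen step on a 2 × 2 matrix can be written out directly; the sketch below uses scalar entries, but the same seven products apply when the entries are themselves submatrices with +, − and × defined.

def strassen_2x2(A, B):
    """One Strassen step: 7 multiplications and 18 additions/subtractions,
    instead of the usual 8 multiplications and 4 additions."""
    (a11, a12), (a21, a22) = A
    (b11, b12), (b21, b22) = B
    m1 = (a11 + a22) * (b11 + b22)
    m2 = (a21 + a22) * b11
    m3 = a11 * (b12 - b22)
    m4 = a22 * (b21 - b11)
    m5 = (a11 + a12) * b22
    m6 = (a21 - a11) * (b11 + b12)
    m7 = (a12 - a22) * (b21 + b22)
    return [[m1 + m4 - m5 + m7, m3 + m5],
            [m2 + m4, m1 - m2 + m3 + m6]]

print(strassen_2x2([[1, 2], [3, 4]], [[5, 6], [7, 8]]))   # [[19, 22], [43, 50]]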

Since Strassen's algorithm is actually used in practical numerical software and computer algebra systems, improving on the constants hidden in the big O notation has its merits. A table comparing key aspects of improved versions based on recursive multiplication of 2×2 block matrices via 7 block matrix multiplications follows. As usual, n gives the dimensions of the matrices and M designates the memory size.

Progress for Strassen-like recursive 2×2-block matrix multiplication

Year | Reference | Matrix multiplications per step | Matrix additions per step
1969 | Strassen [11] | 7 | 18
1971 | Winograd [12] | 7 | 15
2017 | Karstadt, Schwartz [13] | 7 | 12

It is known that a Strassen-like algorithm with a 2×2-block matrix step requires at least 7 block matrix multiplications. In 1976 Probert [14] showed that such an algorithm requires at least 15 additions (including subtractions); however, a hidden assumption was that the blocks and the 2×2 block matrix are represented in the same basis. Karstadt and Schwartz computed in different bases and traded 3 additions for less expensive basis transformations. They also proved that one cannot go below 12 additions per step using different bases. In subsequent work, Beniamini et al. [15] applied this base-change trick to more general decompositions than 2×2 block matrices and improved the leading constants of their run times.

It is an open question in theoretical computer science how well Strassen's algorithm can be improved in terms of asymptotic complexity. The matrix multiplication exponent, usually denoted ω, is the smallest real number such that two n × n matrices over a field can be multiplied using n^(ω + ε) field operations for every ε > 0. The current best bound on ω is ω < 2.3728596, by Josh Alman and Virginia Vassilevska Williams. [3] This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by Don Coppersmith and Shmuel Winograd in 1990. [16] The conceptual idea of these algorithms is similar to Strassen's algorithm: a way is devised for multiplying two k × k matrices with fewer than k^3 multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the big O notation is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers. [17] [18]

Freivalds' algorithm is a simple Monte Carlo algorithm that, given matrices A, B and C, verifies in Θ(n^2) time whether AB = C.
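A minimal sketch of Freivalds' check for square matrices stored as lists of lists; the number of rounds shown is an illustrative choice, and each round costs only three matrix–vector products.

import random

def freivalds(A, B, C, rounds=10):
    """Monte Carlo verification of A·B = C in O(n^2) time per round.
    If A·B != C, a single round detects this with probability at least 1/2
    over the random 0/1 vector r, so the error probability is at most 2**-rounds."""
    n = len(A)
    for _ in range(rounds):
        r = [random.randint(0, 1) for _ in range(n)]
        Br = [sum(B[i][j] * r[j] for j in range(n)) for i in range(n)]
        ABr = [sum(A[i][j] * Br[j] for j in range(n)) for i in range(n)]
        Cr = [sum(C[i][j] * r[j] for j in range(n)) for i in range(n)]
        if ABr != Cr:
            return False      # definitely AB != C
    return True               # AB = C with high probability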

AlphaTensor

In 2022, DeepMind introduced AlphaTensor, a neural network that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans and some that were not. [19] Operations were restricted to the non-commutative setting, both over the standard ground field (normal arithmetic) and over the finite field of order 2 (mod 2 arithmetic). The best "practical" algorithm found (an explicit low-rank decomposition of a matrix multiplication tensor) ran in O(n^2.778). [20] Finding low-rank decompositions of such tensors (and beyond) is NP-hard; the optimal multiplication count even for 3×3 matrices remains unknown, even over a commutative field. [20] On 4×4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen's algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5×5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers quickly found a similar independent 4×4 algorithm, and separately tweaked DeepMind's 96-step 5×5 algorithm down to 95 steps in mod 2 arithmetic and to 97 [21] in normal arithmetic. [22] Some algorithms were entirely new: for example, the (4, 5, 5) case was improved from 80 to 76 steps in both normal and mod 2 arithmetic.

Parallel and distributed algorithms

Shared-memory parallelism

The divide-and-conquer algorithm sketched earlier can be parallelized in two ways for shared-memory multiprocessors. These are based on the fact that the eight recursive matrix multiplications in

C11 = A11B11 + A12B21,  C12 = A11B12 + A12B22,
C21 = A21B11 + A22B21,  C22 = A21B12 + A22B22

can be performed independently of each other, as can the four summations (although the algorithm needs to "join" the multiplications before doing the summations). Exploiting the full parallelism of the problem, one obtains an algorithm that can be expressed in fork–join style pseudocode: [23]

Procedure multiply(C, A, B):

  • Base case: if n = 1, set c11 ← a11 × b11 (or multiply a small block matrix).
  • Otherwise, allocate space for a new matrix T of shape n × n, then:
    • Partition A into A11, A12, A21, A22.
    • Partition B into B11, B12, B21, B22.
    • Partition C into C11, C12, C21, C22.
    • Partition T into T11, T12, T21, T22.
    • Parallel execution:
      • Fork multiply(C11, A11, B11).
      • Fork multiply(C12, A11, B12).
      • Fork multiply(C21, A21, B11).
      • Fork multiply(C22, A21, B12).
      • Fork multiply(T11, A12, B21).
      • Fork multiply(T12, A12, B22).
      • Fork multiply(T21, A22, B21).
      • Fork multiply(T22, A22, B22).
    • Join (wait for parallel forks to complete).
    • add(C, T).
    • Deallocate T.

Procedure add(C, T) adds T into C, element-wise:

  • Base case: if n = 1, set c11 ← c11 + t11 (or do a short loop, perhaps unrolled).
  • Otherwise:
    • Partition C into C11, C12, C21, C22.
    • Partition T into T11, T12, T21, T22.
    • In parallel:
      • Fork add(C11, T11).
      • Fork add(C12, T12).
      • Fork add(C21, T21).
      • Fork add(C22, T22).
    • Join.

Here, fork is a keyword that signals that a computation may be run in parallel with the rest of the function call, while join waits for all previously "forked" computations to complete. Partition achieves its goal by pointer manipulation only.

This algorithm has a critical path length of Θ(log^2 n) steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible speedup of Θ(n^3/log^2 n) on any real computer. The algorithm isn't practical due to the communication cost inherent in moving data to and from the temporary matrix T, but a more practical variant achieves Θ(n^2) speedup, without using a temporary matrix. [23]
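For illustration only, here is a sketch of one level of this fork–join scheme in Python using a thread pool; in CPython the threads demonstrate the structure rather than real speedup (which would require processes or a runtime without a global interpreter lock), and all names are illustrative.

from concurrent.futures import ThreadPoolExecutor

def seq_multiply(A, B):
    """Sequential kernel used for the eight block products."""
    n, m, p = len(A), len(B), len(B[0])
    C = [[0] * p for _ in range(n)]
    for i in range(n):
        for k in range(m):
            for j in range(p):
                C[i][j] += A[i][k] * B[k][j]
    return C

def parallel_multiply(A, B):
    """One level of the fork-join scheme for n × n matrices with n even:
    the eight half-size products are submitted as pool tasks ('fork'),
    the .result() calls are the 'join', and the temporary blocks T are
    then added into the corresponding blocks of C."""
    n = len(A); h = n // 2
    A11 = [r[:h] for r in A[:h]]; A12 = [r[h:] for r in A[:h]]
    A21 = [r[:h] for r in A[h:]]; A22 = [r[h:] for r in A[h:]]
    B11 = [r[:h] for r in B[:h]]; B12 = [r[h:] for r in B[:h]]
    B21 = [r[:h] for r in B[h:]]; B22 = [r[h:] for r in B[h:]]
    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(seq_multiply, X, Y) for X, Y in          # fork
                   [(A11, B11), (A11, B12), (A21, B11), (A21, B12),
                    (A12, B21), (A12, B22), (A22, B21), (A22, B22)]]
        C11, C12, C21, C22, T11, T12, T21, T22 = [f.result() for f in futures]  # join
    for Cb, Tb in [(C11, T11), (C12, T12), (C21, T21), (C22, T22)]:     # add(C, T)
        for i in range(h):
            for j in range(h):
                Cb[i][j] += Tb[i][j]
    return [r1 + r2 for r1, r2 in zip(C11, C12)] + [r1 + r2 for r1, r2 in zip(C21, C22)]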

Block matrix multiplication. In the 2D algorithm, each processor is responsible for one submatrix of C. In the 3D algorithm, every pair of submatrices from A and B that is multiplied is assigned to one processor.

Communication-avoiding and distributed algorithms

On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed-memory multi-node machine it is the amount transferred between nodes; in either case it is called the communication bandwidth. The naïve algorithm using three nested loops uses Ω(n^3) communication bandwidth.

Cannon's algorithm, also known as the 2D algorithm, is a communication-avoiding algorithm that partitions each input matrix into a block matrix whose elements are submatrices of size √(M/3) by √(M/3), where M is the size of fast memory. [24] The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to O(n^3/√M), which is asymptotically optimal (for algorithms performing Ω(n^3) computation). [25] [26]

In a distributed setting with p processors arranged in a √p by √p 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting O(n^2/√p) words, which is asymptotically optimal assuming that each node stores the minimum O(n^2/p) elements. [26] This can be improved by the 3D algorithm, which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row. [27] This algorithm transmits O(n^2/p^(2/3)) words per processor, which is asymptotically optimal. [26] However, this requires replicating each input matrix element p^(1/3) times, and so requires a factor of p^(1/3) more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime. [27] "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth. [28] On modern distributed computing environments such as MapReduce, specialized multiplication algorithms have been developed. [29]

Algorithms for meshes

Matrix multiplication completed in 2n−1 steps for two n × n matrices on a cross-wired mesh.

There are a variety of algorithms for multiplication on meshes. For multiplication of two n × n matrices on a standard two-dimensional mesh using the 2D Cannon's algorithm, one can complete the multiplication in 3n−2 steps, although this is reduced to half this number for repeated computations. [30] The standard array is inefficient because the data from the two matrices does not arrive simultaneously and it must be padded with zeroes.

The result is even faster on a two-layered cross-wired mesh, where only 2n-1 steps are needed. [31] The performance improves further for repeated computations leading to 100% efficiency. [32] The cross-wired mesh array may be seen as a special case of a non-planar (i.e. multilayered) processing structure. [33]

See also


References

  1. Skiena, Steven (2012). "Sorting and Searching". The Algorithm Design Manual. Springer. pp. 45–46, 401–403. doi:10.1007/978-1-84800-070-4_4. ISBN 978-1-84800-069-8.
  2. Duan, Ran; Wu, Hongxun; Zhou, Renfei (2022), Faster Matrix Multiplication via Asymmetric Hashing, arXiv: 2210.10173
  3. Alman, Josh; Williams, Virginia Vassilevska (2020), "A Refined Laser Method and Faster Matrix Multiplication", 32nd Annual ACM-SIAM Symposium on Discrete Algorithms (SODA 2021), arXiv: 2010.05846
  4. Hartnett, Kevin (23 March 2021). "Matrix Multiplication Inches Closer to Mythic Goal". Quanta Magazine. Retrieved 2021-04-01.
  5. Cormen, Thomas H.; Leiserson, Charles E.; Rivest, Ronald L.; Stein, Clifford (2009) [1990]. Introduction to Algorithms (3rd ed.). MIT Press and McGraw-Hill. pp. 75–79. ISBN 0-262-03384-4.
  6. Amarasinghe, Saman; Leiserson, Charles (2010). "6.172 Performance Engineering of Software Systems, Lecture 8". MIT OpenCourseWare. Massachusetts Institute of Technology. Retrieved 27 January 2015.
  7. Lam, Monica S.; Rothberg, Edward E.; Wolf, Michael E. (1991). The Cache Performance and Optimizations of Blocked Algorithms. ASPLOS91: 4th Int'l Conference on Architecture Support for Programming Languages & Operating Systems. doi:10.1145/106972.106981. ISBN   978-0-89791-380-5.
  8. Prokop, Harald (1999). Cache-Oblivious Algorithms (PDF) (Master's). MIT. hdl:1721.1/80568.
  9. Miller, Webb (1975), "Computational complexity and numerical stability", SIAM News, 4 (2): 97–107, CiteSeerX   10.1.1.148.9947 , doi:10.1137/0204009
  10. Press, William H.; Flannery, Brian P.; Teukolsky, Saul A.; Vetterling, William T. (2007). Numerical Recipes: The Art of Scientific Computing (3rd ed.). Cambridge University Press. p.  108. ISBN   978-0-521-88068-8.
  11. Strassen, Volker (1969). "Gaussian Elimination is not Optimal". Numer. Math. 13 (4): 354–356. doi:10.1007/BF02165411. S2CID   121656251.
  12. Winograd, Shmuel (1971). "On multiplication of 2×2 matrices". Linear Algebra and Its Applications. 4 (4): 381–388. doi: 10.1016/0024-3795(71)90009-7 .
  13. Karstadt, Elaye; Schwartz, Oded (July 2017). "Matrix Multiplication, a Little Faster". Proceedings of the 29th ACM Symposium on Parallelism in Algorithms and Architectures. SPAA '17. pp. 101–110. doi:10.1145/3087556.3087579.
  14. Probert, Robert L. (1976). "On the additive complexity of matrix multiplication". SIAM J. Comput. 5 (2): 187–203. doi:10.1137/0205016.
  15. Beniamini, Gal; Cheng, Nathan; Holtz, Olga; Karstadt, Elaye; Schwartz, Oded (2020). "Sparsifying the Operators of Fast Matrix Multiplication Algorithms". arXiv: 2008.03759 [cs.DS].
  16. Coppersmith, Don; Winograd, Shmuel (1990), "Matrix multiplication via arithmetic progressions" (PDF), Journal of Symbolic Computation, 9 (3): 251, doi: 10.1016/S0747-7171(08)80013-2
  17. Iliopoulos, Costas S. (1989), "Worst-case complexity bounds on algorithms for computing the canonical structure of finite abelian groups and the Hermite and Smith normal forms of an integer matrix" (PDF), SIAM Journal on Computing, 18 (4): 658–669, CiteSeerX   10.1.1.531.9309 , doi:10.1137/0218045, MR   1004789, archived from the original (PDF) on 2014-03-05, retrieved 2015-01-16, The Coppersmith–Winograd algorithm is not practical, due to the very large hidden constant in the upper bound on the number of multiplications required.
  18. Robinson, Sara (November 2005), "Toward an Optimal Algorithm for Matrix Multiplication" (PDF), SIAM News, 38 (9), Even if someone manages to prove one of the conjectures—thereby demonstrating that ω = 2—the wreath product approach is unlikely to be applicable to the large matrix problems that arise in practice. [...] the input matrices must be astronomically large for the difference in time to be apparent.
  19. "Discovering novel algorithms with AlphaTensor". www.deepmind.com. Retrieved 2022-11-01.
  20. Fawzi, Alhussein; Balog, Matej; Huang, Aja; Hubert, Thomas; Romera-Paredes, Bernardino; Barekatain, Mohammadamin; Novikov, Alexander; R. Ruiz, Francisco J.; Schrittwieser, Julian; Swirszcz, Grzegorz; Silver, David; Hassabis, Demis; Kohli, Pushmeet (October 2022). "Discovering faster matrix multiplication algorithms with reinforcement learning". Nature. 610 (7930): 47–53. Bibcode:2022Natur.610...47F. doi:10.1038/s41586-022-05172-4. ISSN 1476-4687. PMC 9534758. PMID 36198780.
  21. Kauers, Manuel; Moosbauer, Jakob (2022-12-02). "Flip Graphs for Matrix Multiplication". arXiv: 2212.01175 [cs.SC].
  22. Brubaker, Ben (23 November 2022). "AI Reveals New Possibilities in Matrix Multiplication". Quanta Magazine. Retrieved 26 November 2022.
  23. Randall, Keith H. (1998). Cilk: Efficient Multithreaded Computing (PDF) (Ph.D.). Massachusetts Institute of Technology. pp. 54–57. hdl:1721.1/47519.
  24. Cannon, Lynn Elliot (14 July 1969). A cellular computer to implement the Kalman Filter Algorithm (Ph.D.). Montana State University.
  25. Hong, J. W.; Kung, H. T. (1981). "I/O complexity: The red-blue pebble game" (PDF). Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81. pp. 326–333. doi:10.1145/800076.802486. S2CID   8410593. Archived (PDF) from the original on December 15, 2019.
  26. Irony, Dror; Toledo, Sivan; Tiskin, Alexander (September 2004). "Communication lower bounds for distributed-memory matrix multiplication". J. Parallel Distrib. Comput. 64 (9): 1017–26. CiteSeerX 10.1.1.20.7034. doi:10.1016/j.jpdc.2004.03.021.
  27. Agarwal, R.C.; Balle, S. M.; Gustavson, F. G.; Joshi, M.; Palkar, P. (September 1995). "A three-dimensional approach to parallel matrix multiplication". IBM J. Res. Dev. 39 (5): 575–582. CiteSeerX 10.1.1.44.3404. doi:10.1147/rd.395.0575.
  28. Solomonik, Edgar; Demmel, James (2011). "Communication-optimal parallel 2.5D matrix multiplication and LU factorization algorithms" (PDF). Proceedings of the 17th International Conference on Parallel Processing. Vol. Part II. pp. 90–109. doi:10.1007/978-3-642-23397-5_10. ISBN   978-3-642-23397-5.
  29. Bosagh Zadeh, Reza; Carlsson, Gunnar (2013). "Dimension Independent Matrix Square Using MapReduce" (PDF). arXiv: 1304.1467 . Bibcode:2013arXiv1304.1467B . Retrieved 12 July 2014.
  30. Bae, S.E.; Shinn, T.-W.; Takaoka, T. (2014). "A faster parallel algorithm for matrix multiplication on a mesh array". Procedia Computer Science. 29: 2230–40. doi: 10.1016/j.procs.2014.05.208 .
  31. Kak, S (1988). "A two-layered mesh array for matrix multiplication". Parallel Computing. 6 (3): 383–5. CiteSeerX   10.1.1.88.8527 . doi:10.1016/0167-8191(88)90078-6.
  32. Kak, S. (2014). "Efficiency of matrix multiplication on the cross-wired mesh array". arXiv: 1411.3273 [cs.DC].
  33. Kak, S (1988). "Multilayered array computing". Information Sciences. 45 (3): 347–365. CiteSeerX   10.1.1.90.4753 . doi:10.1016/0020-0255(88)90010-2.

Further reading