Communication-avoiding algorithm


Communication-avoiding algorithms minimize the movement of data within a memory hierarchy in order to improve running time and energy consumption. They minimize the total of two costs (in terms of time and energy): arithmetic and communication. Communication, in this context, refers to moving data, either between levels of memory or between multiple processors over a network, and it is much more expensive than arithmetic. [1]


Formal theory

Two-level memory model

A common computational model for analyzing communication-avoiding algorithms is the two-level memory model: a single processor has a fast memory (for example, a cache) of size M words and a slow memory of unbounded size; arithmetic can be performed only on data residing in fast memory; and the communication cost of an algorithm is the number of words moved between slow and fast memory.
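A minimal Python sketch of this model is given below; it is only an illustration (the class and method names are invented here), tracking which words are resident in fast memory and counting every word moved to or from slow memory.

 # Minimal sketch of the two-level memory model (names are illustrative).
 class TwoLevelMemory:
     def __init__(self, M):
         self.M = M           # capacity of fast memory, in words
         self.fast = set()    # identifiers of words currently in fast memory
         self.traffic = 0     # words moved between slow and fast memory

     def load(self, word):
         """Bring one word from slow memory into fast memory."""
         if word in self.fast:
             return           # already resident: no communication needed
         if len(self.fast) >= self.M:
             self.fast.pop()  # evict an arbitrary word (a real cache would use a policy)
         self.fast.add(word)
         self.traffic += 1    # one word of communication

     def store(self, word):
         """Write one word from fast memory back to slow memory."""
         self.traffic += 1    # one word of communication

 # Example usage: mem = TwoLevelMemory(M=1024); mem.load(('A', 0, 0)); print(mem.traffic)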

Matrix multiplication

Corollary 6.2 of Hong and Kung [2] gives a lower bound for matrix multiplication:

Theorem — Given matrices A, B, C of sizes n × m, m × k and n × k respectively, computing C = C + AB has communication complexity Ω(nmk/√M).

This lower bound is achievable by tiling matrix multiplication.
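Concretely (a sketch, not a sourced statement: W below denotes the number of words moved, and the block size b is assumed to satisfy 3b² ≤ M as in the example later in this article), a tiled algorithm moves

 W_{\text{tiled}} \;=\; O\!\left(\frac{nmk}{b}\right) \;=\; O\!\left(\frac{nmk}{\sqrt{M}}\right) \qquad\text{for } b = \Theta(\sqrt{M}),

in LaTeX notation, which matches the lower bound up to a constant factor.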

More general results for other numerical linear algebra operations can be found in [3]. The following proof is from [4].

Proof

We can draw the computation graph of C = C + AB as an n × k × m box of lattice points, each point of the form (i, j, l). Since C(i,j) = C(i,j) + Σ_l A(i,l)·B(l,j), computing C requires the processor to have access to each point in the box at least once. So the problem becomes covering the lattice points with a minimal amount of communication.

If M is large, then we can simply load all nm + mk + nk entries, compute, then write the nk entries of C back. This is uninteresting.

If M is small, then we can divide any minimal-communication algorithm into separate segments, where each segment performs exactly M reads into cache and any number of writes from cache.

During each segment, the processor has access to at most 2M different entries of A, B and C: the at most M words already resident in cache at the start of the segment, plus the M words it reads.

Let S be the set of lattice points covered during this segment, and let S_A, S_B, S_C be its projections onto the three coordinate planes, i.e. the entries of A, B and C it touches. Then by the Loomis–Whitney inequality,

|S| ≤ √(|S_A| · |S_B| · |S_C|)

with the constraint |S_A|, |S_B|, |S_C| ≤ 2M.

By the inequality of arithmetic and geometric means, we have |S| ≤ 2M·√(2M), with the extremum reached when |S_A| = |S_B| = |S_C| = 2M.

Thus the arithmetic intensity, the number of lattice points covered per word of communication, is bounded above by 2M·√(2M) / M = 2√(2M), and so the communication is bounded below by nmk / (2√(2M)) = Ω(nmk/√M).
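Spelled out as a segment count (a reconstruction of the standard argument, in LaTeX notation, with W denoting the total number of reads from slow memory):

 \#\text{segments} \;\ge\; \frac{nmk}{2M\sqrt{2M}},
 \qquad
 W \;\ge\; M \cdot \frac{nmk}{2M\sqrt{2M}}
   \;=\; \frac{nmk}{2\sqrt{2M}}
   \;=\; \Omega\!\left(\frac{nmk}{\sqrt{M}}\right).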

Direct computation verifies that the tiled matrix multiplication algorithm attains this lower bound.

Motivation

Consider the following running-time model: [5]

 Measure of computation = time per FLOP = γ
 Measure of communication = time per word of data moved = β

⇒ Total running time = γ·(no. of FLOPs) + β·(no. of words moved)
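For a feel of the relative magnitudes, the following sketch plugs illustrative values into this model; the values of γ and β below are assumptions chosen only for the example, not measurements from the cited report.

 # Evaluate the running-time model T = gamma*F + beta*W with assumed constants.
 gamma = 1e-10          # assumed time per FLOP, in seconds (hypothetical value)
 beta = 1e-8            # assumed time per word moved, in seconds (hypothetical value)

 n = 4096
 flops = 2 * n**3               # arithmetic for an n-by-n matrix multiplication
 words = n**3 + 3 * n**2        # communication of the naive algorithm (derived below)

 arithmetic_time = gamma * flops
 communication_time = beta * words
 print(arithmetic_time, communication_time)   # the communication term dominates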

Since β ≫ γ as measured in time and energy, communication cost dominates computation cost. Technological trends [6] indicate that the relative cost of communication is increasing on a variety of platforms, from cloud computing to supercomputers to mobile devices. The report also predicts that the gap between DRAM access time and FLOPs will increase 100× over the coming decade to balance power usage between processors and DRAM. [1]

Annual improvement rates:

 FLOP rate (γ): 59% / year
 DRAM bandwidth (β): 23% / year
 Network bandwidth (β): 26% / year
[Figure: Energy cost of data movement in 2010, on-chip vs off-chip]

Energy consumption increases by orders of magnitude as data is moved farther from the processor in the memory hierarchy. [7]

United States president Barack Obama cited communication-avoiding algorithms in the FY 2012 Department of Energy budget request to Congress: [1]

New Algorithm Improves Performance and Accuracy on Extreme-Scale Computing Systems. On modern computer architectures, communication between processors takes longer than the performance of a floating-point arithmetic operation by a given processor. ASCR researchers have developed a new method, derived from commonly used linear algebra methods, to minimize communications between processors and the memory hierarchy, by reformulating the communication patterns specified within the algorithm. This method has been implemented in the TRILINOS framework, a highly-regarded suite of software, which provides functionality for researchers around the world to solve large scale, complex multi-physics problems.

Objectives

Communication-avoiding algorithms are designed with the following objectives:

 Reorganize algorithms to reduce the communication between levels of the memory hierarchy and between processors.
 Attain the theoretical lower bound on communication when possible.

The following simple example [1] demonstrates how these are achieved.

Matrix multiplication example

Let A, B and C be square matrices of order n × n. The following naive algorithm implements C = C + A * B:


 for i = 1 to n
     for j = 1 to n
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)

Arithmetic cost (time complexity): n²(2n − 1) for sufficiently large n, or O(n³).
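A direct Python transcription of this pseudocode (shown only as an illustration; production code would call an optimized BLAS routine instead):

 def naive_matmul(A, B, C):
     """C = C + A*B for square n-by-n matrices stored as lists of lists."""
     n = len(A)
     for i in range(n):
         for j in range(n):
             for k in range(n):
                 C[i][j] += A[i][k] * B[k][j]
     return C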

Rewriting this algorithm with the communication cost labelled at each step:

 for i = 1 to n
     {read row i of A into fast memory}               - n² reads
     for j = 1 to n
         {read C(i,j) into fast memory}               - n² reads
         {read column j of B into fast memory}        - n³ reads
         for k = 1 to n
             C(i,j) = C(i,j) + A(i,k) * B(k,j)
         {write C(i,j) back to slow memory}           - n² writes

Fast memory may be defined as the local processor memory (CPU cache) of size M and slow memory may be defined as the DRAM.

Communication cost (reads/writes): n³ + 3n² or O(n³)
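These counts can be tallied directly from the labels above; the short sketch below simply evaluates them (it does not simulate a cache), under the stated assumption that fast memory holds a row of A, a column of B and one entry of C at a time.

 def naive_communication(n):
     """Words moved by the labelled naive algorithm above."""
     reads = n * n        # row i of A, read once per i           -> n^2
     reads += n * n       # C(i,j), read once per (i,j)           -> n^2
     reads += n * n * n   # column j of B, read once per (i,j)    -> n^3
     writes = n * n       # C(i,j), written back once per (i,j)   -> n^2
     return reads + writes                              # = n^3 + 3n^2

 print(naive_communication(1024))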

Since total running time = γ·O(n³) + β·O(n³) and β ≫ γ, the communication cost is dominant. The blocked (tiled) matrix multiplication algorithm [1] reduces this dominant term:

Blocked (tiled) matrix multiplication

Consider A, B and C to be n/b-by-n/b matrices of b-by-b sub-blocks where b is called the block size; assume three b-by-b blocks fit in fast memory.


 for i = 1 to n/b
     for j = 1 to n/b
         {read block C(i,j) into fast memory}           - b² × (n/b)² = n² reads
         for k = 1 to n/b
             {read block A(i,k) into fast memory}       - b² × (n/b)³ = n³/b reads
             {read block B(k,j) into fast memory}       - b² × (n/b)³ = n³/b reads
             C(i,j) = C(i,j) + A(i,k) * B(k,j)          - {do a matrix multiply on blocks}
         {write block C(i,j) back to slow memory}       - b² × (n/b)² = n² writes

Communication cost: 2n³/b + 2n² reads/writes ≪ 2n³ arithmetic cost
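A Python sketch of the blocked algorithm, which also tallies the block reads and writes listed above, is given below; it assumes b divides n, and the bookkeeping variable words is introduced here only for illustration.

 def blocked_matmul(A, B, C, b):
     """C = C + A*B with b-by-b blocks; returns (C, words moved)."""
     n = len(A)
     assert n % b == 0, "this sketch assumes the block size divides n"
     words = 0
     for i in range(0, n, b):
         for j in range(0, n, b):
             words += b * b                  # read block C(i,j)
             for k in range(0, n, b):
                 words += 2 * b * b          # read blocks A(i,k) and B(k,j)
                 for ii in range(i, i + b):  # multiply the two blocks in fast memory
                     for kk in range(k, k + b):
                         a = A[ii][kk]
                         for jj in range(j, j + b):
                             C[ii][jj] += a * B[kk][jj]
             words += b * b                  # write block C(i,j) back
     return C, words                         # words = 2*n^3/b + 2*n^2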

Making b as large as possible, so that the three b-by-b blocks still fit in fast memory:

3b² ≤ M

we achieve the following communication cost, which attains the lower bound (up to a constant factor):

2√3·n³/√M + 2n² reads/writes, or Ω(no. of FLOPs / √M)
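As a worked numeric check, the snippet below compares the naive and blocked communication volumes with this bound; the values of n and M are assumptions picked only so the numbers come out round.

 import math

 n = 4096
 M = 3 * 64**2                   # assumed fast-memory capacity in words (3*b^2 with b = 64)
 b = math.isqrt(M // 3)          # largest block size with 3*b^2 <= M

 naive = n**3 + 3 * n**2                  # naive algorithm
 blocked = 2 * n**3 // b + 2 * n**2       # blocked algorithm
 attained = 2 * math.sqrt(3) * n**3 / math.sqrt(M) + 2 * n**2   # 2*sqrt(3)*n^3/sqrt(M) + 2n^2
 print(b, naive, blocked, round(attained))   # blocked and attained coincide when 3*b^2 = M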

Previous approaches for reducing communication

Most of the approaches investigated in the past to address this problem rely on scheduling or tuning techniques that aim at overlapping communication with computation. However, this approach can lead to an improvement of at most a factor of two. Ghosting is a different technique for reducing communication, in which a processor redundantly stores and computes data from neighboring processors for future computations. Cache-oblivious algorithms represent a different approach introduced in 1999 for fast Fourier transforms, [8] and then extended to graph algorithms, dynamic programming, etc. They were also applied to several operations in linear algebra, [9] [10] [11] such as dense LU and QR factorizations. The design of architecture-specific algorithms is another approach that can be used for reducing the communication in parallel algorithms, and there are many examples in the literature of algorithms that are adapted to a given communication topology. [12]



References

  1. Demmel, Jim. "Communication avoiding algorithms". 2012 SC Companion: High Performance Computing, Networking Storage and Analysis. IEEE, 2012.
  2. Jia-Wei, Hong; Kung, H. T. (1981). "I/O complexity". Proceedings of the thirteenth annual ACM symposium on Theory of computing - STOC '81. New York, New York, USA: ACM Press. pp. 326–333. doi:10.1145/800076.802486. S2CID   8410593.
  3. Ballard, G.; Carson, E.; Demmel, J.; Hoemmen, M.; Knight, N.; Schwartz, O. (May 2014). "Communication lower bounds and optimal algorithms for numerical linear algebra". Acta Numerica. 23: 1–155. doi:10.1017/s0962492914000038. ISSN   0962-4929. S2CID   122513943.
  4. Demmel, James; Dinh, Grace (2018-04-24). "Communication-Optimal Convolutional Neural Nets". arXiv: 1802.06905 [cs.DS].
  5. Demmel, James, and Kathy Yelick. "Communication Avoiding (CA) and Other Innovative Algorithms". The Berkeley Par Lab: Progress in the Parallel Computing Landscape: 243–250.
  6. Bergman, Keren, et al. "Exascale computing study: Technology challenges in exascale computing systems." Defense Advanced Research Projects Agency Information Processing Techniques Office (DARPA IPTO), Tech. Rep 15 (2008).
  7. Shalf, John, Sudip Dosanjh, and John Morrison. "Exascale computing technology challenges". High Performance Computing for Computational Science–VECPAR 2010. Springer Berlin Heidelberg, 2011. 1–25.
  8. M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, "Cacheoblivious algorithms", In FOCS '99: Proceedings of the 40th Annual Symposium on Foundations of Computer Science, 1999. IEEE Computer Society.
  9. S. Toledo, "Locality of reference in LU Decomposition with partial pivoting," SIAM J. Matrix Anal. Appl., vol. 18, no. 4, 1997.
  10. F. Gustavson, "Recursion Leads to Automatic Variable Blocking for Dense Linear-Algebra Algorithms," IBM Journal of Research and Development, vol. 41, no. 6, pp. 737–755, 1997.
  11. E. Elmroth, F. Gustavson, I. Jonsson, and B. Kagstrom, "Recursive blocked algorithms and hybrid data structures for dense matrix library software," SIAM Review, vol. 46, no. 1, pp. 3–45, 2004.
  12. Grigori, Laura. "Introduction to communication avoiding linear algebra algorithms in high performance computing".