SPIKE algorithm

Last updated

The SPIKE algorithm is a hybrid parallel solver for banded linear systems developed by Eric Polizzi and Ahmed Sameh ^

Contents

Overview

The SPIKE algorithm deals with a linear system AX = F, where A is a banded matrix of bandwidth much less than , and F is an matrix containing right-hand sides. It is divided into a preprocessing stage and a postprocessing stage.

Preprocessing stage

In the preprocessing stage, the linear system AX = F is partitioned into a block tridiagonal form

Assume, for the time being, that the diagonal blocks Aj (j = 1,...,p with p 2) are nonsingular. Define a block diagonal matrix

D = diag(A1,...,Ap),

then D is also nonsingular. Left-multiplying D−1 to both sides of the system gives

which is to be solved in the postprocessing stage. Left-multiplication by D−1 is equivalent to solving systems of the form

Aj[VjWjGj] = [BjCjFj]

(omitting W1 and C1 for , and Vp and Bp for ), which can be carried out in parallel.

Due to the banded nature of A, only a few leftmost columns of each Vj and a few rightmost columns of each Wj can be nonzero. These columns are called the spikes.

Postprocessing stage

Without loss of generality, assume that each spike contains exactly columns ( is much less than ) (pad the spike with columns of zeroes if necessary). Partition the spikes in all Vj and Wj into

and

where V (t)
j
 
, V (b)
j
 
, W (t)
j
 
and W (b)
j
 
are of dimensions . Partition similarly all Xj and Gj into

and

Notice that the system produced by the preprocessing stage can be reduced to a block pentadiagonal system of much smaller size (recall that is much less than )

which we call the reduced system and denote by S̃X̃ = G̃.

Once all X (t)
j
 
and X (b)
j
 
are found, all Xj can be recovered with perfect parallelism via

SPIKE as a polyalgorithmic banded linear system solver

Despite being logically divided into two stages, computationally, the SPIKE algorithm comprises three stages:

  1. factorizing the diagonal blocks,
  2. computing the spikes,
  3. solving the reduced system.

Each of these stages can be accomplished in several ways, allowing a multitude of variants. Two notable variants are the recursive SPIKE algorithm for non-diagonally-dominant cases and the truncated SPIKE algorithm for diagonally-dominant cases. Depending on the variant, a system can be solved either exactly or approximately. In the latter case, SPIKE is used as a preconditioner for iterative schemes like Krylov subspace methods and iterative refinement.

Recursive SPIKE

Preprocessing stage

The first step of the preprocessing stage is to factorize the diagonal blocks Aj. For numerical stability, one can use LAPACK's XGBTRF routines to LU factorize them with partial pivoting. Alternatively, one can also factorize them without partial pivoting but with a "diagonal boosting" strategy. The latter method tackles the issue of singular diagonal blocks.

In concrete terms, the diagonal boosting strategy is as follows. Let 0ε denote a configurable "machine zero". In each step of LU factorization, we require that the pivot satisfy the condition

|pivot| > 0εA1.

If the pivot does not satisfy the condition, it is then boosted by

where ε is a positive parameter depending on the machine's unit roundoff, and the factorization continues with the boosted pivot. This can be achieved by modified versions of ScaLAPACK's XDBTRF routines. After the diagonal blocks are factorized, the spikes are computed and passed on to the postprocessing stage.

Postprocessing stage

The two-partition case

In the two-partition case, i.e., when p = 2, the reduced system S̃X̃ = G̃ has the form

An even smaller system can be extracted from the center:

which can be solved using the block LU factorization

Once X (b)
1
 
and X (t)
2
 
are found, X (t)
1
 
and X (b)
2
 
can be computed via

X (t)
1
 
= G (t)
1
 
V (t)
1
 
X (t)
2
 
,
X (b)
2
 
= G (b)
2
 
W (b)
2
 
X (b)
1
 
.
The multiple-partition case

Assume that p is a power of two, i.e., p = 2d. Consider a block diagonal matrix

D̃1 = diag(D̃ [1]
1
 
,...,D̃ [1]
p/2
 
)

where

for k = 1,...,p/2. Notice that D̃1 essentially consists of diagonal blocks of order 4m extracted from S̃. Now we factorize S̃ as

S̃ = D̃1S̃2.

The new matrix S̃2 has the form

Its structure is very similar to that of S̃2, only differing in the number of spikes and their height (their width stays the same at m). Thus, a similar factorization step can be performed on S̃2 to produce

S̃2 = D̃2S̃3

and

S̃ = D̃1D̃2S̃3.

Such factorization steps can be performed recursively. After d 1 steps, we obtain the factorization

S̃ = D̃1D̃d1S̃d,

where S̃d has only two spikes. The reduced system will then be solved via

X̃ = S̃ 1
d
 
D̃ 1
d1
 
D̃ 1
1
 
G̃
.

The block LU factorization technique in the two-partition case can be used to handle the solving steps involving D̃1, ..., D̃d1 and S̃d for they essentially solve multiple independent systems of generalized two-partition forms.

Generalization to cases where p is not a power of two is almost trivial.

Truncated SPIKE

When A is diagonally-dominant, in the reduced system

the blocks V (t)
j
 
and W (b)
j
 
are often negligible. With them omitted, the reduced system becomes block diagonal

and can be easily solved in parallel .

The truncated SPIKE algorithm can be wrapped inside some outer iterative scheme (e.g., BiCGSTAB or iterative refinement) to improve the accuracy of the solution.

SPIKE for tridiagonal systems

The first SPIKE partitioning and algorithm was presented in and was designed as the means to improve the stability properties of a parallel Givens rotations-based solver for tridiagonal systems. A version of the algorithm, termed g-Spike, that is based on serial Givens rotations applied independently on each block was designed for the NVIDIA GPU . A SPIKE-based algorithm for the GPU that is based on a special block diagonal pivoting strategy is described in .

SPIKE as a preconditioner

The SPIKE algorithm can also function as a preconditioner for iterative methods for solving linear systems. To solve a linear system Ax = b using a SPIKE-preconditioned iterative solver, one extracts center bands from A to form a banded preconditioner M and solves linear systems involving M in each iteration with the SPIKE algorithm.

In order for the preconditioner to be effective, row and/or column permutation is usually necessary to move "heavy" elements of A close to the diagonal so that they are covered by the preconditioner. This can be accomplished by computing the weighted spectral reordering of A.

The SPIKE algorithm can be generalized by not restricting the preconditioner to be strictly banded. In particular, the diagonal block in each partition can be a general matrix and thus handled by a direct general linear system solver rather than a banded solver. This enhances the preconditioner, and hence allows better chance of convergence and reduces the number of iterations.

Implementations

Intel offers an implementation of the SPIKE algorithm under the name Intel Adaptive Spike-Based Solver . Tridiagonal solvers have also been developed for the NVIDIA GPU and the Xeon Phi co-processors. The method in is the basis for a tridiagonal solver in the cuSPARSE library. [1] The Givens rotations based solver was also implemented for the GPU and the Intel Xeon Phi. [2]

Related Research Articles

<span class="mw-page-title-main">Matrix addition</span> Notions of sums for matrices in linear algebra

In mathematics, matrix addition is the operation of adding two matrices by adding the corresponding entries together.

<span class="mw-page-title-main">Matrix multiplication</span> Mathematical operation in linear algebra

In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the second matrix. The resulting matrix, known as the matrix product, has the number of rows of the first and the number of columns of the second matrix. The product of matrices A and B is denoted as AB.

In linear algebra, a Toeplitz matrix or diagonal-constant matrix, named after Otto Toeplitz, is a matrix in which each descending diagonal from left to right is constant. For instance, the following matrix is a Toeplitz matrix:

In linear algebra, a diagonal matrix is a matrix in which the entries outside the main diagonal are all zero; the term usually refers to square matrices. Elements of the main diagonal can either be zero or nonzero. An example of a 2×2 diagonal matrix is , while an example of a 3×3 diagonal matrix is. An identity matrix of any size, or any multiple of it, is a diagonal matrix.

Levinson recursion or Levinson–Durbin recursion is a procedure in linear algebra to recursively calculate the solution to an equation involving a Toeplitz matrix. The algorithm runs in Θ(n2) time, which is a strong improvement over Gauss–Jordan elimination, which runs in Θ(n3).

In linear algebra, a square matrix  is called diagonalizable or non-defective if it is similar to a diagonal matrix, i.e., if there exists an invertible matrix  and a diagonal matrix such that , or equivalently . For a finite-dimensional vector space , a linear map  is called diagonalizable if there exists an ordered basis of  consisting of eigenvectors of . These definitions are equivalent: if  has a matrix representation as above, then the column vectors of  form a basis consisting of eigenvectors of , and the diagonal entries of  are the corresponding eigenvalues of ; with respect to this eigenvector basis,  is represented by .Diagonalization is the process of finding the above  and .

In linear algebra, a QR decomposition, also known as a QR factorization or QU factorization, is a decomposition of a matrix A into a product A = QR of an orthonormal matrix Q and an upper triangular matrix R. QR decomposition is often used to solve the linear least squares problem and is the basis for a particular eigenvalue algorithm, the QR algorithm.

In linear algebra, a Vandermonde matrix, named after Alexandre-Théophile Vandermonde, is a matrix with the terms of a geometric progression in each row: an matrix

In mathematics, a triangular matrix is a special kind of square matrix. A square matrix is called lower triangular if all the entries above the main diagonal are zero. Similarly, a square matrix is called upper triangular if all the entries below the main diagonal are zero.

In mathematics, a block matrix or a partitioned matrix is a matrix that is interpreted as having been broken into sections called blocks or submatrices. Intuitively, a matrix interpreted as a block matrix can be visualized as the original matrix with a collection of horizontal and vertical lines, which break it up, or partition it, into a collection of smaller matrices. Any matrix may be interpreted as a block matrix in one or more ways, with each interpretation defined by how its rows and columns are partitioned.

<span class="mw-page-title-main">Total least squares</span>

In applied statistics, total least squares is a type of errors-in-variables regression, a least squares data modeling technique in which observational errors on both dependent and independent variables are taken into account. It is a generalization of Deming regression and also of orthogonal regression, and can be applied to both linear and non-linear models.

In signal processing, the Wiener filter is a filter used to produce an estimate of a desired or target random process by linear time-invariant (LTI) filtering of an observed noisy process, assuming known stationary signal and noise spectra, and additive noise. The Wiener filter minimizes the mean square error between the estimated random process and the desired process.

In mathematics, particularly matrix theory, a band matrix or banded matrix is a sparse matrix whose non-zero entries are confined to a diagonal band, comprising the main diagonal and zero or more diagonals on either side.

In numerical linear algebra, the Jacobi method is an iterative algorithm for determining the solutions of a strictly diagonally dominant system of linear equations. Each diagonal element is solved for, and an approximate value is plugged in. The process is then iterated until it converges. This algorithm is a stripped-down version of the Jacobi transformation method of matrix diagonalization. The method is named after Carl Gustav Jacob Jacobi.

In numerical linear algebra, the tridiagonal matrix algorithm, also known as the Thomas algorithm, is a simplified form of Gaussian elimination that can be used to solve tridiagonal systems of equations. A tridiagonal system for n unknowns may be written as

In numerical analysis and linear algebra, lower–upper (LU) decomposition or factorization factors a matrix as the product of a lower triangular matrix and an upper triangular matrix. The product sometimes includes a permutation matrix as well. LU decomposition can be viewed as the matrix form of Gaussian elimination. Computers usually solve square systems of linear equations using LU decomposition, and it is also a key step when inverting a matrix or computing the determinant of a matrix. The LU decomposition was introduced by the Polish astronomer Tadeusz Banachiewicz in 1938. To quote: "It appears that Gauss and Doolittle applied the method [of elimination] only to symmetric equations. More recent authors, for example, Aitken, Banachiewicz, Dwyer, and Crout … have emphasized the use of the method, or variations of it, in connection with non-symmetric problems … Banachiewicz … saw the point … that the basic problem is really one of matrix factorization, or “decomposition” as he called it." It's also referred to as LR decomposition.

In mathematical systems theory, a multidimensional system or m-D system is a system in which not only one independent variable exists, but there are several independent variables.

In numerical linear algebra, the conjugate gradient method is an iterative method for numerically solving the linear system

In mathematical optimization, the revised simplex method is a variant of George Dantzig's simplex method for linear programming.

Polynomial matrices are widely studied in the fields of systems theory and control theory and have seen other uses relating to stable polynomials. In stability theory, Spectral Factorization has been used to find determinantal matrix representations for bivariate stable polynomials and real zero polynomials. A key tool used to study these is a matrix factorization known as either the Polynomial Matrix Spectral Factorization or the Matrix Fejer–Riesz Theorem.

References

  1. NVIDIA, Accessed October 28, 2014. CUDA Toolkit Documentation v. 6.5: cuSPARSE, http://docs.nvidia.com/cuda/cusparse.
  2. Venetis, Ioannis; Sobczyk, Aleksandros; Kouris, Alexandros; Nakos, Alexandros; Nikoloutsakos, Nikolaos; Gallopoulos, Efstratios (2015-09-03). "A general tridiagonal solver for coprocessors: Adapting g-Spike for the Intel Xeon Phi" via ResearchGate.
  1. ^ Polizzi, E.; Sameh, A. H. (2006). "A parallel hybrid banded system solver: the SPIKE algorithm". Parallel Computing. 32 (2): 177–194. doi:10.1016/j.parco.2005.07.005.
  2. ^ Polizzi, E.; Sameh, A. H. (2007). "SPIKE: A parallel environment for solving banded linear systems". Computers & Fluids. 36: 113–141. doi:10.1016/j.compfluid.2005.07.005.
  3. ^ Mikkelsen, C. C. K.; Manguoglu, M. (2008). "Analysis of the Truncated SPIKE Algorithm". SIAM J. Matrix Anal. Appl. 30 (4): 1500–1519. CiteSeerX   10.1.1.514.8748 . doi:10.1137/080719571.
  4. ^ Manguoglu, M.; Sameh, A. H.; Schenk, O. (2009). "PSPIKE: A Parallel Hybrid Sparse Linear System Solver". Euro-Par 2009 Parallel Processing. Lecture Notes in Computer Science. Vol. 5704. pp. 797–808. Bibcode:2009LNCS.5704..797M. doi: 10.1007/978-3-642-03869-3_74 . ISBN   978-3-642-03868-6.
  5. ^ "Intel Adaptive Spike-Based Solver - Intel Software Network" . Retrieved 2009-03-23.
  6. ^ Sameh, A. H.; Kuck, D. J. (1978). "On Stable Parallel Linear System Solvers". Journal of the ACM. 25: 81–91. doi: 10.1145/322047.322054 . S2CID   17109524.
  7. ^ Venetis, I.E.; Kouris, A.; Sobczyk, A.; Gallopoulos, E.; Sameh, A. H. (2015). "A direct tridiagonal solver based on Givens rotations for GPU architectures". Parallel Computing. 25: 101–116. doi:10.1016/j.parco.2015.03.008.
  8. ^ Chang, L.-W.; Stratton, J.; Kim, H.; Hwu, W.-M. (2012). "A scalable, numerically stable, high-performance tridiagonal solver using GPUs". Proc. Int'l. Conf. High Performance Computing, Networking Storage and Analysis (SC'12). Los Alamitos, CA, USA: IEEE Computer Soc. Press: 27:1–27:11. ISBN   978-1-4673-0804-5.

Further reading