SPIKE Algorithm

The SPIKE algorithm is a hybrid parallel solver for banded linear systems developed by Eric Polizzi and Ahmed Sameh.


Overview

The SPIKE algorithm deals with a linear system \boldsymbol{A}\boldsymbol{X}=\boldsymbol{F}, where \boldsymbol{A} is a banded n\times n matrix of bandwidth much less than n, and \boldsymbol{F} is an n\times s matrix containing s right-hand sides. It is divided into a preprocessing stage and a postprocessing stage.


Preprocessing stage

In the preprocessing stage, the linear system \boldsymbol{A}\boldsymbol{X}=\boldsymbol{F} is partitioned into a block tridiagonal form
: \begin{bmatrix} \boldsymbol{A}_1 & \boldsymbol{B}_1\\ \boldsymbol{C}_2 & \boldsymbol{A}_2 & \boldsymbol{B}_2\\ & \ddots & \ddots & \ddots\\ & & \boldsymbol{C}_{p-1} & \boldsymbol{A}_{p-1} & \boldsymbol{B}_{p-1}\\ & & & \boldsymbol{C}_p & \boldsymbol{A}_p \end{bmatrix} \begin{bmatrix} \boldsymbol{X}_1\\ \boldsymbol{X}_2\\ \vdots\\ \boldsymbol{X}_{p-1}\\ \boldsymbol{X}_p \end{bmatrix} = \begin{bmatrix} \boldsymbol{F}_1\\ \boldsymbol{F}_2\\ \vdots\\ \boldsymbol{F}_{p-1}\\ \boldsymbol{F}_p \end{bmatrix}.
Assume, for the time being, that the diagonal blocks \boldsymbol{A}_j (j=1,\ldots,p) are nonsingular. Define the block diagonal matrix
: \boldsymbol{D}=\operatorname{diag}(\boldsymbol{A}_1,\ldots,\boldsymbol{A}_p);
then \boldsymbol{D} is also nonsingular. Left-multiplying both sides of the system by \boldsymbol{D}^{-1} gives
: \begin{bmatrix} \boldsymbol{I} & \boldsymbol{V}_1\\ \boldsymbol{W}_2 & \boldsymbol{I} & \boldsymbol{V}_2\\ & \ddots & \ddots & \ddots\\ & & \boldsymbol{W}_{p-1} & \boldsymbol{I} & \boldsymbol{V}_{p-1}\\ & & & \boldsymbol{W}_p & \boldsymbol{I} \end{bmatrix} \begin{bmatrix} \boldsymbol{X}_1\\ \boldsymbol{X}_2\\ \vdots\\ \boldsymbol{X}_{p-1}\\ \boldsymbol{X}_p \end{bmatrix} = \begin{bmatrix} \boldsymbol{G}_1\\ \boldsymbol{G}_2\\ \vdots\\ \boldsymbol{G}_{p-1}\\ \boldsymbol{G}_p \end{bmatrix},
which is to be solved in the postprocessing stage. Left-multiplication by \boldsymbol{D}^{-1} is equivalent to solving p systems of the form
: \boldsymbol{A}_j\boldsymbol{V}_j=\begin{bmatrix}\boldsymbol{0}\\ \boldsymbol{B}_j\end{bmatrix},\quad \boldsymbol{A}_j\boldsymbol{W}_j=\begin{bmatrix}\boldsymbol{C}_j\\ \boldsymbol{0}\end{bmatrix},\quad \boldsymbol{A}_j\boldsymbol{G}_j=\boldsymbol{F}_j
(omitting \boldsymbol{W}_1 and \boldsymbol{C}_1 for j=1, and \boldsymbol{V}_p and \boldsymbol{B}_p for j=p), which can be carried out in parallel. Due to the banded nature of \boldsymbol{A}, only a few leftmost columns of each \boldsymbol{V}_j and a few rightmost columns of each \boldsymbol{W}_j can be nonzero. These columns are called the ''spikes''.


Postprocessing stage

Without loss of generality, assume that each spike contains exactly m columns (m is much less than n), padding the spike with columns of zeroes if necessary. Partition the spikes in all \boldsymbol{V}_j and \boldsymbol{W}_j into
: \boldsymbol{V}_j=\begin{bmatrix}\boldsymbol{V}_j^{(t)}\\ \boldsymbol{V}_j'\\ \boldsymbol{V}_j^{(b)}\end{bmatrix} \quad\text{and}\quad \boldsymbol{W}_j=\begin{bmatrix}\boldsymbol{W}_j^{(t)}\\ \boldsymbol{W}_j'\\ \boldsymbol{W}_j^{(b)}\end{bmatrix},
where \boldsymbol{V}_j^{(t)}, \boldsymbol{V}_j^{(b)}, \boldsymbol{W}_j^{(t)} and \boldsymbol{W}_j^{(b)} are of dimensions m\times m. Partition similarly all \boldsymbol{X}_j and \boldsymbol{G}_j into
: \boldsymbol{X}_j=\begin{bmatrix}\boldsymbol{X}_j^{(t)}\\ \boldsymbol{X}_j'\\ \boldsymbol{X}_j^{(b)}\end{bmatrix} \quad\text{and}\quad \boldsymbol{G}_j=\begin{bmatrix}\boldsymbol{G}_j^{(t)}\\ \boldsymbol{G}_j'\\ \boldsymbol{G}_j^{(b)}\end{bmatrix}.
Notice that the system produced by the preprocessing stage can be reduced to a block pentadiagonal system of much smaller size (recall that m is much less than n)
: \begin{bmatrix} \boldsymbol{I}_m & \boldsymbol{0} & \boldsymbol{V}_1^{(t)}\\ \boldsymbol{0} & \boldsymbol{I}_m & \boldsymbol{V}_1^{(b)} & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{W}_2^{(t)} & \boldsymbol{I}_m & \boldsymbol{0} & \boldsymbol{V}_2^{(t)}\\ & \boldsymbol{W}_2^{(b)} & \boldsymbol{0} & \boldsymbol{I}_m & \boldsymbol{V}_2^{(b)} & \boldsymbol{0} \\ & & \ddots & \ddots & \ddots & \ddots & \ddots\\ & & & \boldsymbol{0} & \boldsymbol{W}_{p-1}^{(t)} & \boldsymbol{I}_m & \boldsymbol{0} & \boldsymbol{V}_{p-1}^{(t)}\\ & & & & \boldsymbol{W}_{p-1}^{(b)} & \boldsymbol{0} & \boldsymbol{I}_m & \boldsymbol{V}_{p-1}^{(b)} & \boldsymbol{0}\\ & & & & & \boldsymbol{0} & \boldsymbol{W}_p^{(t)} & \boldsymbol{I}_m & \boldsymbol{0}\\ & & & & & & \boldsymbol{W}_p^{(b)} & \boldsymbol{0} & \boldsymbol{I}_m \end{bmatrix} \begin{bmatrix} \boldsymbol{X}_1^{(t)}\\ \boldsymbol{X}_1^{(b)}\\ \boldsymbol{X}_2^{(t)}\\ \boldsymbol{X}_2^{(b)}\\ \vdots\\ \boldsymbol{X}_{p-1}^{(t)}\\ \boldsymbol{X}_{p-1}^{(b)}\\ \boldsymbol{X}_p^{(t)}\\ \boldsymbol{X}_p^{(b)} \end{bmatrix} = \begin{bmatrix} \boldsymbol{G}_1^{(t)}\\ \boldsymbol{G}_1^{(b)}\\ \boldsymbol{G}_2^{(t)}\\ \boldsymbol{G}_2^{(b)}\\ \vdots\\ \boldsymbol{G}_{p-1}^{(t)}\\ \boldsymbol{G}_{p-1}^{(b)}\\ \boldsymbol{G}_p^{(t)}\\ \boldsymbol{G}_p^{(b)} \end{bmatrix}\text{,}
which we call the ''reduced system'' and denote by \tilde{\boldsymbol{S}}\tilde{\boldsymbol{X}}=\tilde{\boldsymbol{G}}. Once all \boldsymbol{X}_j^{(t)} and \boldsymbol{X}_j^{(b)} are found, all \boldsymbol{X}_j' can be recovered with perfect parallelism via
: \begin{cases} \boldsymbol{X}_1'=\boldsymbol{G}_1'-\boldsymbol{V}_1'\boldsymbol{X}_2^{(t)}\text{,}\\ \boldsymbol{X}_j'=\boldsymbol{G}_j'-\boldsymbol{V}_j'\boldsymbol{X}_{j+1}^{(t)}-\boldsymbol{W}_j'\boldsymbol{X}_{j-1}^{(b)}\text{,} & j=2,\ldots,p-1\text{,}\\ \boldsymbol{X}_p'=\boldsymbol{G}_p'-\boldsymbol{W}_p'\boldsymbol{X}_{p-1}^{(b)}\text{.} \end{cases}
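The recovery formulas translate directly into code. In the following illustrative sketch (the names Vm, Wm, Gm for the middle rows of the spikes and of G, and Xt, Xb for the tips obtained from the reduced system, are our own), each middle part is computed independently:

```python
import numpy as np

def recover_middles(Vm, Wm, Gm, Xt, Xb):
    """Apply X'_j = G'_j - V'_j X^(t)_{j+1} - W'_j X^(b)_{j-1},
    with the V' term absent for the last partition and the W' term
    absent for the first; each iteration is independent (parallel)."""
    p = len(Gm)
    Xm = []
    for j in range(p):
        x = Gm[j].copy()
        if j < p - 1:
            x -= Vm[j] @ Xt[j + 1]   # coupling to next partition's top tip
        if j > 0:
            x -= Wm[j] @ Xb[j - 1]   # coupling to previous partition's bottom tip
        Xm.append(x)
    return Xm
```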


SPIKE as a polyalgorithmic banded linear system solver

Despite being logically divided into two stages, computationally the SPIKE algorithm comprises three stages:
# factorizing the diagonal blocks,
# computing the spikes,
# solving the reduced system.
Each of these stages can be accomplished in several ways, allowing a multitude of variants. Two notable variants are the ''recursive SPIKE'' algorithm for non-diagonally-dominant cases and the ''truncated SPIKE'' algorithm for diagonally-dominant cases. Depending on the variant, a system can be solved either exactly or approximately. In the latter case, SPIKE is used as a preconditioner for iterative schemes like Krylov subspace methods and iterative refinement.


Recursive SPIKE


Preprocessing stage

The first step of the preprocessing stage is to factorize the diagonal blocks \boldsymbol{A}_j. For numerical stability, one can use LAPACK's XGBTRF routines to LU factorize them with partial pivoting. Alternatively, one can also factorize them without partial pivoting but with a "diagonal boosting" strategy. The latter method tackles the issue of singular diagonal blocks. In concrete terms, the diagonal boosting strategy is as follows. Let \epsilon denote a configurable "machine zero", a positive parameter depending on the machine's unit roundoff. In each step of LU factorization, we require that the pivot satisfy the condition
: \lvert\mathrm{pivot}\rvert > \epsilon\lVert\boldsymbol{A}_j\rVert_1\text{.}
If the pivot does not satisfy the condition, it is then boosted by
: \mathrm{pivot}= \begin{cases} \mathrm{pivot}+\epsilon\lVert\boldsymbol{A}_j\rVert_1 & \text{if }\mathrm{pivot}\geq 0\text{,}\\ \mathrm{pivot}-\epsilon\lVert\boldsymbol{A}_j\rVert_1 & \text{if }\mathrm{pivot}<0\text{,} \end{cases}
and the factorization continues with the boosted pivot. This can be achieved by modified versions of ScaLAPACK's XDBTRF routines. After the diagonal blocks are factorized, the spikes are computed and passed on to the postprocessing stage.
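The diagonal boosting strategy can be illustrated with a minimal, unpivoted Doolittle LU in NumPy. This is a toy version of our own, not the modified ScaLAPACK routine; eps plays the role of the roundoff-dependent parameter above.

```python
import numpy as np

def lu_boosted(A, eps=1e-12):
    """LU without pivoting; any pivot with |pivot| < eps*||A||_1 is
    boosted away from zero, so the factorization never breaks down.
    Returns (L, U) with L unit lower triangular and U upper triangular."""
    U = A.astype(float).copy()
    n = U.shape[0]
    norm1 = np.linalg.norm(U, 1)            # matrix 1-norm of the block
    L = np.eye(n)
    for k in range(n):
        if abs(U[k, k]) < eps * norm1:      # pivot too small: boost it
            U[k, k] += eps * norm1 if U[k, k] >= 0 else -eps * norm1
        for i in range(k + 1, n):
            L[i, k] = U[i, k] / U[k, k]
            U[i, k:] -= L[i, k] * U[k, k:]  # eliminate below the pivot
    return L, np.triu(U)
```

When boosting triggers, L times U equals a perturbation of the block of size O(eps*||A||_1), which is why this variant is typically paired with iterative refinement.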


Postprocessing stage


The two-partition case

In the two-partition case, i.e., when p=2, the reduced system has the form
: \begin{bmatrix} \boldsymbol{I}_m & \boldsymbol{0} & \boldsymbol{V}_1^{(t)} & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{I}_m & \boldsymbol{V}_1^{(b)} & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{W}_2^{(t)} & \boldsymbol{I}_m & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{W}_2^{(b)} & \boldsymbol{0} & \boldsymbol{I}_m \end{bmatrix} \begin{bmatrix} \boldsymbol{X}_1^{(t)}\\ \boldsymbol{X}_1^{(b)}\\ \boldsymbol{X}_2^{(t)}\\ \boldsymbol{X}_2^{(b)} \end{bmatrix} = \begin{bmatrix} \boldsymbol{G}_1^{(t)}\\ \boldsymbol{G}_1^{(b)}\\ \boldsymbol{G}_2^{(t)}\\ \boldsymbol{G}_2^{(b)} \end{bmatrix}\text{.}
An even smaller system can be extracted from the center:
: \begin{bmatrix} \boldsymbol{I}_m & \boldsymbol{V}_1^{(b)}\\ \boldsymbol{W}_2^{(t)} & \boldsymbol{I}_m \end{bmatrix} \begin{bmatrix} \boldsymbol{X}_1^{(b)}\\ \boldsymbol{X}_2^{(t)} \end{bmatrix} = \begin{bmatrix} \boldsymbol{G}_1^{(b)}\\ \boldsymbol{G}_2^{(t)} \end{bmatrix}\text{,}
which can be solved using the block LU factorization
: \begin{bmatrix} \boldsymbol{I}_m & \boldsymbol{V}_1^{(b)}\\ \boldsymbol{W}_2^{(t)} & \boldsymbol{I}_m \end{bmatrix} = \begin{bmatrix} \boldsymbol{I}_m\\ \boldsymbol{W}_2^{(t)} & \boldsymbol{I}_m \end{bmatrix} \begin{bmatrix} \boldsymbol{I}_m & \boldsymbol{V}_1^{(b)}\\ & \boldsymbol{I}_m-\boldsymbol{W}_2^{(t)}\boldsymbol{V}_1^{(b)} \end{bmatrix}\text{.}
Once \boldsymbol{X}_1^{(b)} and \boldsymbol{X}_2^{(t)} are found, \boldsymbol{X}_1^{(t)} and \boldsymbol{X}_2^{(b)} can be computed via
: \boldsymbol{X}_1^{(t)}=\boldsymbol{G}_1^{(t)}-\boldsymbol{V}_1^{(t)}\boldsymbol{X}_2^{(t)},
: \boldsymbol{X}_2^{(b)}=\boldsymbol{G}_2^{(b)}-\boldsymbol{W}_2^{(b)}\boldsymbol{X}_1^{(b)}.
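For illustration, the whole two-partition solve, extraction of the center system, block elimination via the Schur complement, and recovery of the outer tips, can be sketched as follows (all names are ours):

```python
import numpy as np

def solve_two_partition(V1t, V1b, W2t, W2b, G1t, G1b, G2t, G2b):
    """Solve the p=2 reduced system via its center 2x2 block system."""
    m = V1b.shape[0]
    # Center system [[I, V1b], [W2t, I]] [x1b; x2t] = [g1b; g2t]:
    # eliminating the (2,1) block leaves the Schur complement I - W2t V1b.
    S = np.eye(m) - W2t @ V1b
    x2t = np.linalg.solve(S, G2t - W2t @ G1b)
    x1b = G1b - V1b @ x2t
    # The outer tips then follow directly.
    x1t = G1t - V1t @ x2t
    x2b = G2b - W2b @ x1b
    return x1t, x1b, x2t, x2b
```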


The multiple-partition case

Assume that p is a power of two, i.e., p=2^d. Consider the block diagonal matrix
: \tilde{\boldsymbol{D}}_1=\operatorname{diag}\bigl(\tilde{\boldsymbol{D}}_1^{[1]},\ldots,\tilde{\boldsymbol{D}}_{p/2}^{[1]}\bigr),
where
: \tilde{\boldsymbol{D}}_k^{[1]}= \begin{bmatrix} \boldsymbol{I}_m & \boldsymbol{0} & \boldsymbol{V}_{2k-1}^{(t)} & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{I}_m & \boldsymbol{V}_{2k-1}^{(b)} & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{W}_{2k}^{(t)} & \boldsymbol{I}_m & \boldsymbol{0}\\ \boldsymbol{0} & \boldsymbol{W}_{2k}^{(b)} & \boldsymbol{0} & \boldsymbol{I}_m \end{bmatrix}
for k=1,\ldots,p/2. Notice that \tilde{\boldsymbol{D}}_1 essentially consists of diagonal blocks of order 4m extracted from \tilde{\boldsymbol{S}}. Now we factorize \tilde{\boldsymbol{S}} as
: \tilde{\boldsymbol{S}}=\tilde{\boldsymbol{D}}_1\tilde{\boldsymbol{S}}_2.
The structure of the new matrix \tilde{\boldsymbol{S}}_2 is very similar to that of \tilde{\boldsymbol{S}}: it has half as many partitions, each pair of merged partitions contributing spikes of twice the height, while the spikes' width stays the same at m. Thus, a similar factorization step can be performed on \tilde{\boldsymbol{S}}_2 to produce
: \tilde{\boldsymbol{S}}_2=\tilde{\boldsymbol{D}}_2\tilde{\boldsymbol{S}}_3 \quad\text{and}\quad \tilde{\boldsymbol{S}}=\tilde{\boldsymbol{D}}_1\tilde{\boldsymbol{D}}_2\tilde{\boldsymbol{S}}_3.
Such factorization steps can be performed recursively. After d-1 steps, we obtain the factorization
: \tilde{\boldsymbol{S}}=\tilde{\boldsymbol{D}}_1\cdots\tilde{\boldsymbol{D}}_{d-1}\tilde{\boldsymbol{S}}_d,
where \tilde{\boldsymbol{S}}_d has only two spikes. The reduced system will then be solved via
: \tilde{\boldsymbol{X}}=\tilde{\boldsymbol{S}}_d^{-1}\tilde{\boldsymbol{D}}_{d-1}^{-1}\cdots\tilde{\boldsymbol{D}}_1^{-1}\tilde{\boldsymbol{G}}.
The block LU factorization technique in the two-partition case can be used to handle the solving steps involving \tilde{\boldsymbol{D}}_1, \ldots, \tilde{\boldsymbol{D}}_{d-1} and \tilde{\boldsymbol{S}}_d, for they essentially solve multiple independent systems of generalized two-partition forms. Generalization to cases where p is not a power of two is almost trivial.
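A single level of this recursion can be sketched concretely: merging one pair of adjacent partitions amounts to solving with the pair's 4m x 4m diagonal block against the columns that couple the pair to its neighbours, producing new spikes of height 4m. The sketch below is illustrative, with names of our own choosing:

```python
import numpy as np

def merge_pair(V1t, V1b, W2t, W2b, V2t, V2b, W1t, W1b):
    """Merge partitions (2k-1, 2k): D is the pair's 4m x 4m reduced
    block; V2t/V2b couple the pair rightward, W1t/W1b leftward.
    Returns the pair's new right and left spikes, each of height 4m."""
    m = V1t.shape[0]
    I, Z = np.eye(m), np.zeros((m, m))
    D = np.block([[I, Z, V1t, Z],
                  [Z, I, V1b, Z],
                  [Z, W2t, I, Z],
                  [Z, W2b, Z, I]])
    Vcol = np.vstack([Z, Z, V2t, V2b])   # right coupling: lower blocks only
    Wcol = np.vstack([W1t, W1b, Z, Z])   # left coupling: upper blocks only
    Vnew = np.linalg.solve(D, Vcol)      # new right spike of the merged pair
    Wnew = np.linalg.solve(D, Wcol)      # new left spike of the merged pair
    return Vnew, Wnew
```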


Truncated SPIKE

When \boldsymbol{A} is diagonally dominant, the blocks \boldsymbol{V}_j^{(t)} and \boldsymbol{W}_j^{(b)} in the reduced system are often negligible. With them omitted, the reduced system becomes block diagonal
: \begin{bmatrix} \boldsymbol{I}_m\\ & \boldsymbol{I}_m & \boldsymbol{V}_1^{(b)}\\ & \boldsymbol{W}_2^{(t)} & \boldsymbol{I}_m\\ & & & \boldsymbol{I}_m & \boldsymbol{V}_2^{(b)}\\ & & & \ddots & \ddots & \ddots\\ & & & & \boldsymbol{W}_{p-1}^{(t)} & \boldsymbol{I}_m\\ & & & & & & \boldsymbol{I}_m & \boldsymbol{V}_{p-1}^{(b)}\\ & & & & & & \boldsymbol{W}_p^{(t)} & \boldsymbol{I}_m\\ & & & & & & & & \boldsymbol{I}_m \end{bmatrix} \begin{bmatrix} \boldsymbol{X}_1^{(t)}\\ \boldsymbol{X}_1^{(b)}\\ \boldsymbol{X}_2^{(t)}\\ \boldsymbol{X}_2^{(b)}\\ \vdots\\ \boldsymbol{X}_{p-1}^{(t)}\\ \boldsymbol{X}_{p-1}^{(b)}\\ \boldsymbol{X}_p^{(t)}\\ \boldsymbol{X}_p^{(b)} \end{bmatrix} = \begin{bmatrix} \boldsymbol{G}_1^{(t)}\\ \boldsymbol{G}_1^{(b)}\\ \boldsymbol{G}_2^{(t)}\\ \boldsymbol{G}_2^{(b)}\\ \vdots\\ \boldsymbol{G}_{p-1}^{(t)}\\ \boldsymbol{G}_{p-1}^{(b)}\\ \boldsymbol{G}_p^{(t)}\\ \boldsymbol{G}_p^{(b)} \end{bmatrix}
and can be easily solved in parallel. The truncated SPIKE algorithm can be wrapped inside an outer iterative scheme (e.g., BiCGSTAB or iterative refinement) to improve the accuracy of the solution.
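Because the truncated reduced system decouples, each partition interface yields an independent 2m x 2m system, while the outermost tips are obtained trivially. A minimal sketch of this decoupled solve, with names of our own:

```python
import numpy as np

def truncated_reduced_solve(Vb, Wt, Gt, Gb):
    """Vb[j]: bottom tip of spike V_{j+1}; Wt[j]: top tip of spike W_{j+2}
    (j = 0..p-2, zero-based).  Gt/Gb: top/bottom tips of the G_j.
    Each interface system solves independently (i.e., in parallel)."""
    p, m = len(Gt), Gt[0].shape[0]
    Xt, Xb = [None] * p, [None] * p
    Xt[0] = Gt[0]            # first equation is already decoupled
    Xb[-1] = Gb[-1]          # so is the last one
    for j in range(p - 1):   # one small system per partition interface
        M = np.block([[np.eye(m), Vb[j]], [Wt[j], np.eye(m)]])
        sol = np.linalg.solve(M, np.concatenate([Gb[j], Gt[j + 1]]))
        Xb[j], Xt[j + 1] = sol[:m], sol[m:]
    return Xt, Xb
```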


SPIKE for tridiagonal systems

The first SPIKE partitioning and algorithm was designed as a means to improve the stability properties of a parallel solver for tridiagonal systems based on Givens rotations. A version of the algorithm, termed g-Spike, which is based on serial Givens rotations applied independently on each block, was later designed for the NVIDIA GPU. A SPIKE-based algorithm for the GPU that is based on a special block diagonal pivoting strategy has also been described.


SPIKE as a preconditioner

The SPIKE algorithm can also function as a preconditioner for iterative methods for solving linear systems. To solve a linear system with a SPIKE-preconditioned iterative solver, one extracts a center band from the coefficient matrix to form a banded preconditioner and, in each iteration, solves linear systems involving that preconditioner with the SPIKE algorithm. In order for the preconditioner to be effective, row and/or column permutation is usually necessary to move "heavy" elements of the matrix close to the diagonal so that they are covered by the preconditioner. This can be accomplished by computing a weighted spectral reordering of the matrix. The SPIKE algorithm can be generalized by not restricting the preconditioner to be strictly banded. In particular, the diagonal block in each partition can be a general matrix and thus handled by a direct general linear system solver rather than a banded solver. This enhances the preconditioner, improving the chance of convergence and reducing the number of iterations.
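A minimal sketch of the idea, using a simple preconditioned Richardson iteration in place of a full Krylov method, and a dense solve in place of SPIKE for the banded preconditioner (all names are ours):

```python
import numpy as np

def banded_preconditioner(A, k):
    """Keep only the diagonals of A within bandwidth k (the center band)."""
    n = A.shape[0]
    mask = np.abs(np.subtract.outer(np.arange(n), np.arange(n))) <= k
    return np.where(mask, A, 0.0)

def precond_richardson(A, b, k=1, iters=60):
    """x <- x + M^{-1}(b - A x), with M the banded part of A.  In a real
    SPIKE-preconditioned solver the solve with M would use SPIKE."""
    M = banded_preconditioner(A, k)
    x = np.zeros_like(b)
    for _ in range(iters):
        r = b - A @ x                    # current residual
        x = x + np.linalg.solve(M, r)    # preconditioned correction
    return x
```

Convergence hinges on the elements outside the band being weak relative to the band, which is exactly what the reordering step is meant to arrange.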


Implementations

Intel offers an implementation of the SPIKE algorithm under the name ''Intel Adaptive Spike-Based Solver''. Tridiagonal solvers have also been developed for the NVIDIA GPU and the Intel Xeon Phi co-processors; one such method is the basis for a tridiagonal solver in the cuSPARSE library. The Givens-rotations-based solver was also implemented for the GPU and the Intel Xeon Phi.



