Cache-oblivious Matrix Multiplication
Because matrix multiplication is such a central operation in many numerical algorithms, much work has been invested in making matrix multiplication algorithms efficient. Applications of matrix multiplication in computational problems are found in many fields including scientific computing and pattern recognition, and in seemingly unrelated problems such as counting the paths through a graph. Many different algorithms have been designed for multiplying matrices on different types of hardware, including parallel and distributed systems, where the computational work is spread over multiple processors (perhaps over a network).

Directly applying the mathematical definition of matrix multiplication gives an algorithm that takes time on the order of n^3
field operations to multiply two n \times n matrices over that field (\Theta(n^3) in big O notation). Better asymptotic bounds on the time required to multiply matrices have been known since Strassen's algorithm in the 1960s, but the optimal time (that is, the computational complexity of matrix multiplication) remains unknown. The best announced bound on the asymptotic complexity of a matrix multiplication algorithm is due to Duan, Wu, and Zhou, announced in a 2022 preprint; it improves on the O(n^{2.3728596})-time bound given by Josh Alman and Virginia Vassilevska Williams. However, this algorithm is a galactic algorithm because of the large constants and cannot be realized practically.


Iterative algorithm

The definition of matrix multiplication is that if C = AB for an n \times m matrix A and an m \times p matrix B, then C is an n \times p matrix with entries

:c_{ij} = \sum_{k=1}^m a_{ik} b_{kj}.

From this, a simple algorithm can be constructed which loops over the indices i from 1 through n and j from 1 through p, computing the above using a nested loop:

* Input: matrices A and B
* Let C be a new matrix of the appropriate size
* For i from 1 to n:
** For j from 1 to p:
*** Let sum = 0
*** For k from 1 to m:
**** Set sum ← sum + A_{ik} × B_{kj}
*** Set C_{ij} ← sum
* Return C

This algorithm takes time \Theta(nmp) (in asymptotic notation). A common simplification for the purpose of algorithm analysis is to assume that the inputs are all square matrices of size n \times n, in which case the running time is \Theta(n^3), i.e., cubic in the dimension n.
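As a concrete illustration, here is a minimal C sketch of the triple loop above for square matrices stored in row-major order (the function name and calling convention are illustrative, not taken from any particular library):

#include <stddef.h>

/* Naive Theta(n^3) multiplication of two n-by-n matrices stored in
 * row-major order: C = A * B.  Illustrative sketch only. */
void matmul_naive(size_t n, const double *A, const double *B, double *C)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[i * n + k] * B[k * n + j];
            C[i * n + j] = sum;
        }
    }
}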


Cache behavior

The three loops in iterative matrix multiplication can be arbitrarily swapped with each other without an effect on correctness or asymptotic running time. However, the order can have a considerable impact on practical performance due to the memory access patterns and cache use of the algorithm; which order is best also depends on whether the matrices are stored in row-major order, column-major order, or a mix of both.

In particular, in the idealized case of a fully associative cache consisting of M bytes with b bytes per cache line (i.e. M/b cache lines), the above algorithm is sub-optimal for A and B stored in row-major order. When n > M/b, every iteration of the inner loop (a simultaneous sweep through a row of A and a column of B) incurs a cache miss when accessing an element of B. This means that the algorithm incurs \Theta(n^3) cache misses in the worst case. The speed of main memory relative to that of processors is such that these cache misses, rather than the actual calculations, dominate the running time for sizable matrices.

The optimal variant of the iterative algorithm for A and B in row-major layout is a ''tiled'' version, where the matrix is implicitly divided into square tiles of size \sqrt{M} by \sqrt{M}:

* Input: matrices A and B
* Let C be a new matrix of the appropriate size
* Pick a tile size T = \Theta(\sqrt{M})
* For I from 1 to n in steps of T:
** For J from 1 to p in steps of T:
*** For K from 1 to m in steps of T:
**** Multiply the tile of A starting at (I, K) by the tile of B starting at (K, J) into the tile of C starting at (I, J), that is:
**** For i from I to min(I + T, n):
***** For j from J to min(J + T, p):
****** Let sum = 0
****** For k from K to min(K + T, m):
******* Set sum ← sum + A_{ik} × B_{kj}
****** Set C_{ij} ← C_{ij} + sum
* Return C

In the idealized cache model, this algorithm incurs only \Theta(n^3 / (b\sqrt{M})) cache misses; the divisor b\sqrt{M} amounts to several orders of magnitude on modern machines, so that the actual calculations dominate the running time, rather than the cache misses.
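Under the same assumptions (square matrices in row-major order), a minimal C sketch of the tiled variant might look as follows; TILE is an illustrative constant standing in for a tile size of order \sqrt{M}, and in practice it would be tuned to the cache:

#include <stddef.h>

#define TILE 64  /* illustrative tile size, not a tuned value */

static size_t min_sz(size_t a, size_t b) { return a < b ? a : b; }

/* Tiled ("blocked") Theta(n^3) multiplication, C = A * B, row-major.
 * C must be zero-initialized by the caller; tile products are accumulated. */
void matmul_tiled(size_t n, const double *A, const double *B, double *C)
{
    for (size_t I = 0; I < n; I += TILE)
        for (size_t J = 0; J < n; J += TILE)
            for (size_t K = 0; K < n; K += TILE)
                /* multiply one pair of tiles, accumulating into the C tile */
                for (size_t i = I; i < min_sz(I + TILE, n); i++)
                    for (size_t j = J; j < min_sz(J + TILE, n); j++) {
                        double sum = 0.0;
                        for (size_t k = K; k < min_sz(K + TILE, n); k++)
                            sum += A[i * n + k] * B[k * n + j];
                        C[i * n + j] += sum;
                    }
}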


Divide-and-conquer algorithm

An alternative to the iterative algorithm is the divide-and-conquer algorithm for matrix multiplication. This relies on the block partitioning

:C = \begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix},\, A = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix},\, B = \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix},

which works for all square matrices whose dimensions are powers of two, i.e., the shapes are 2^k \times 2^k for some k. The matrix product is now

:\begin{pmatrix} C_{11} & C_{12} \\ C_{21} & C_{22} \end{pmatrix} = \begin{pmatrix} A_{11} & A_{12} \\ A_{21} & A_{22} \end{pmatrix} \begin{pmatrix} B_{11} & B_{12} \\ B_{21} & B_{22} \end{pmatrix} = \begin{pmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{pmatrix}

which consists of eight multiplications of pairs of submatrices, followed by an addition step. The divide-and-conquer algorithm computes the smaller multiplications recursively, using scalar multiplication as its base case. The complexity of this algorithm as a function of n is given by the recurrence

:T(1) = \Theta(1);
:T(n) = 8T(n/2) + \Theta(n^2),

accounting for the eight recursive calls on matrices of size n/2 and \Theta(n^2) work to sum the four pairs of resulting matrices element-wise. Application of the master theorem for divide-and-conquer recurrences shows this recursion to have the solution \Theta(n^3), the same as the iterative algorithm.
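A compact sketch of this recursion in C, assuming n is a power of two, row-major storage with a shared leading dimension ld, and that the caller has zero-initialized C (all names are illustrative):

#include <stddef.h>

/* Divide-and-conquer multiplication, C += A * B, for n a power of two.
 * ld is the leading dimension (row stride) shared by A, B and C. */
void matmul_dc(size_t n, size_t ld,
               const double *A, const double *B, double *C)
{
    if (n == 1) {                       /* base case: scalar multiplication */
        C[0] += A[0] * B[0];
        return;
    }
    size_t h = n / 2;                   /* quadrant offsets within each matrix */
    const double *A11 = A, *A12 = A + h, *A21 = A + h * ld, *A22 = A + h * ld + h;
    const double *B11 = B, *B12 = B + h, *B21 = B + h * ld, *B22 = B + h * ld + h;
    double *C11 = C, *C12 = C + h, *C21 = C + h * ld, *C22 = C + h * ld + h;

    /* eight recursive multiplications, accumulated into the four quadrants of C */
    matmul_dc(h, ld, A11, B11, C11);  matmul_dc(h, ld, A12, B21, C11);
    matmul_dc(h, ld, A11, B12, C12);  matmul_dc(h, ld, A12, B22, C12);
    matmul_dc(h, ld, A21, B11, C21);  matmul_dc(h, ld, A22, B21, C21);
    matmul_dc(h, ld, A21, B12, C22);  matmul_dc(h, ld, A22, B22, C22);
}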


Non-square matrices

A variant of this algorithm that works for matrices of arbitrary shapes and is faster in practice splits matrices in two instead of four submatrices, as follows. Splitting a matrix now means dividing it into two parts of equal size, or as close to equal sizes as possible in the case of odd dimensions.

* Inputs: matrices A of size n \times m, B of size m \times p.
* Base case: if max(n, m, p) is below some threshold, use an unrolled version of the iterative algorithm.
* Recursive cases:
:* If max(n, m, p) = n, split A horizontally:
::C = \begin{pmatrix} A_1 \\ A_2 \end{pmatrix} B = \begin{pmatrix} A_1 B \\ A_2 B \end{pmatrix}
:* Else, if max(n, m, p) = p, split B vertically:
::C = A \begin{pmatrix} B_1 & B_2 \end{pmatrix} = \begin{pmatrix} A B_1 & A B_2 \end{pmatrix}
:* Otherwise, max(n, m, p) = m. Split A vertically and B horizontally:
::C = \begin{pmatrix} A_1 & A_2 \end{pmatrix} \begin{pmatrix} B_1 \\ B_2 \end{pmatrix} = A_1 B_1 + A_2 B_2
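The following C sketch follows the same splitting rule, with leading dimensions passed explicitly so that submatrices are just pointer offsets; THRESHOLD and all names are illustrative choices, not prescribed by the algorithm:

#include <stddef.h>

#define THRESHOLD 32  /* illustrative cutoff for the base case */

/* Recursive multiplication for arbitrary shapes: C (n x p) += A (n x m) * B (m x p).
 * Row-major with leading dimensions lda, ldb, ldc; C zero-initialized by caller. */
void matmul_rect(size_t n, size_t m, size_t p,
                 const double *A, size_t lda,
                 const double *B, size_t ldb,
                 double *C, size_t ldc)
{
    if (n <= THRESHOLD && m <= THRESHOLD && p <= THRESHOLD) {
        for (size_t i = 0; i < n; i++)         /* base case: plain triple loop */
            for (size_t j = 0; j < p; j++) {
                double sum = 0.0;
                for (size_t k = 0; k < m; k++)
                    sum += A[i * lda + k] * B[k * ldb + j];
                C[i * ldc + j] += sum;
            }
        return;
    }
    if (n >= m && n >= p) {                    /* split A (and C) horizontally */
        size_t h = n / 2;
        matmul_rect(h,     m, p, A,           lda, B, ldb, C,           ldc);
        matmul_rect(n - h, m, p, A + h * lda, lda, B, ldb, C + h * ldc, ldc);
    } else if (p >= m) {                       /* split B (and C) vertically */
        size_t h = p / 2;
        matmul_rect(n, m, h,     A, lda, B,     ldb, C,     ldc);
        matmul_rect(n, m, p - h, A, lda, B + h, ldb, C + h, ldc);
    } else {                                   /* split along m: C += A1*B1 + A2*B2 */
        size_t h = m / 2;
        matmul_rect(n, h,     p, A,     lda, B,           ldb, C, ldc);
        matmul_rect(n, m - h, p, A + h, lda, B + h * ldb, ldb, C, ldc);
    }
}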


Cache behavior

The cache miss rate of recursive matrix multiplication is the same as that of a tiled iterative version, but unlike that algorithm, the recursive algorithm is cache-oblivious: there is no tuning parameter required to get optimal cache performance, and it behaves well in a multiprogramming environment where cache sizes are effectively dynamic due to other processes taking up cache space. (The simple iterative algorithm is cache-oblivious as well, but much slower in practice if the matrix layout is not adapted to the algorithm.) The number of cache misses incurred by this algorithm, on a machine with M/b lines of ideal cache, each of size b bytes, is bounded by

:\Theta \left(m + n + p + \frac{mn + np + mp}{b} + \frac{mnp}{b\sqrt{M}} \right)


Sub-cubic algorithms

Algorithms exist that provide better running times than the straightforward ones. The first to be discovered was Strassen's algorithm, devised by Volker Strassen in 1969 and often referred to as "fast matrix multiplication". It is based on a way of multiplying two 2 \times 2 matrices which requires only 7 multiplications (instead of the usual 8), at the expense of several additional addition and subtraction operations. Applying this recursively gives an algorithm with a multiplicative cost of O(n^{\log_2 7}) \approx O(n^{2.807}). Strassen's algorithm is more complex, and its numerical stability is reduced compared to the naïve algorithm, but it is faster in cases where n > 100 or so and appears in several libraries, such as BLAS. It is very useful for large matrices over exact domains such as finite fields, where numerical stability is not an issue.
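For concreteness, one standard way of writing the seven block products and the recombination used in Strassen's construction, on the 2 \times 2 block partition from the divide-and-conquer section above, is:

:M_1 = (A_{11} + A_{22})(B_{11} + B_{22})
:M_2 = (A_{21} + A_{22}) B_{11}
:M_3 = A_{11} (B_{12} - B_{22})
:M_4 = A_{22} (B_{21} - B_{11})
:M_5 = (A_{11} + A_{12}) B_{22}
:M_6 = (A_{21} - A_{11})(B_{11} + B_{12})
:M_7 = (A_{12} - A_{22})(B_{21} + B_{22})
:C_{11} = M_1 + M_4 - M_5 + M_7, \quad C_{12} = M_3 + M_5, \quad C_{21} = M_2 + M_4, \quad C_{22} = M_1 - M_2 + M_3 + M_6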
It is an open question in theoretical computer science how well Strassen's algorithm can be improved in terms of asymptotic complexity. The ''matrix multiplication exponent'', usually denoted \omega, is the smallest real number for which any two n \times n matrices over a field can be multiplied together using n^{\omega + o(1)} field operations. The best published bound on \omega is \omega < 2.3728596, by Josh Alman and Virginia Vassilevska Williams (the 2022 preprint by Duan, Wu, and Zhou mentioned above announces a further improvement). This algorithm, like all other recent algorithms in this line of research, is a generalization of the Coppersmith–Winograd algorithm, which was given by Don Coppersmith and Shmuel Winograd in 1990. The conceptual idea of these algorithms is similar to Strassen's algorithm: a way is devised for multiplying two k \times k matrices with fewer than k^3 multiplications, and this technique is applied recursively. However, the constant coefficient hidden by the big O notation is so large that these algorithms are only worthwhile for matrices that are too large to handle on present-day computers. Freivalds' algorithm is a simple Monte Carlo algorithm that, given matrices A, B, and C, verifies in \Theta(n^2) time whether AB = C.
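A minimal sketch of Freivalds' check in C, assuming square matrices in row-major order; the use of rand() for the random 0/1 vector and of a small tolerance for the floating-point comparison are simplifying assumptions (the analysis is usually stated over exact arithmetic, and a single probe errs with probability at most 1/2, so the test is normally repeated to drive the error probability down):

#include <stdlib.h>
#include <math.h>

/* One probe of Freivalds' verification: returns 1 if A*B == C is consistent
 * with a random 0/1 vector r, 0 if a discrepancy is found.  Runs in Theta(n^2)
 * time, since only matrix-vector products are formed. */
int freivalds_check(size_t n, const double *A, const double *B, const double *C)
{
    double *r  = malloc(n * sizeof *r),   *Br = malloc(n * sizeof *Br);
    double *ABr = malloc(n * sizeof *ABr), *Cr = malloc(n * sizeof *Cr);
    int ok = 1;

    for (size_t i = 0; i < n; i++)
        r[i] = (double)(rand() & 1);            /* random entries in {0, 1} */

    for (size_t i = 0; i < n; i++) {            /* Br = B r  and  Cr = C r */
        double sb = 0.0, sc = 0.0;
        for (size_t j = 0; j < n; j++) {
            sb += B[i * n + j] * r[j];
            sc += C[i * n + j] * r[j];
        }
        Br[i] = sb;
        Cr[i] = sc;
    }
    for (size_t i = 0; i < n; i++) {            /* ABr = A (B r) */
        double s = 0.0;
        for (size_t j = 0; j < n; j++)
            s += A[i * n + j] * Br[j];
        ABr[i] = s;
    }
    for (size_t i = 0; i < n; i++)              /* compare A(Br) with Cr */
        if (fabs(ABr[i] - Cr[i]) > 1e-9) { ok = 0; break; }

    free(r); free(Br); free(ABr); free(Cr);
    return ok;
}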


AlphaTensor

In 2022, DeepMind introduced AlphaTensor, a neural network that used a single-player game analogy to invent thousands of matrix multiplication algorithms, including some previously discovered by humans. Operations were restricted to the finite field \mathbb Z/2\mathbb Z. The best "practical" algorithm found (an explicit low-rank decomposition of a matrix multiplication tensor) runs in O(n^{2.778}) time. Finding low-rank decompositions of such tensors (and beyond) is NP-hard; optimal multiplication even of 3 × 3 matrices remains unknown. On 4 × 4 matrices, AlphaTensor unexpectedly discovered a solution with 47 multiplication steps, an improvement over the 49 required with Strassen's algorithm of 1969, albeit restricted to mod 2 arithmetic. Similarly, AlphaTensor solved 5 × 5 matrices with 96 rather than Strassen's 98 steps. Based on the surprising discovery that such improvements exist, other researchers were quickly able to find a similar independent 4 × 4 algorithm, and separately tweaked DeepMind's 96-step 5 × 5 algorithm down to 95 steps.


Parallel and distributed algorithms


Shared-memory parallelism

The divide-and-conquer algorithm sketched earlier can be parallelized in two ways for shared-memory multiprocessors. These are based on the fact that the eight recursive matrix multiplications in

:\begin{pmatrix} A_{11} B_{11} + A_{12} B_{21} & A_{11} B_{12} + A_{12} B_{22} \\ A_{21} B_{11} + A_{22} B_{21} & A_{21} B_{12} + A_{22} B_{22} \end{pmatrix}

can be performed independently of each other, as can the four summations (although the algorithm needs to "join" the multiplications before doing the summations). Exploiting the full parallelism of the problem, one obtains an algorithm that can be expressed in fork–join style pseudocode:

Procedure multiply(C, A, B):
* Base case: if n = 1, set c_{11} ← a_{11} × b_{11} (or multiply a small block matrix).
* Otherwise, allocate space for a new matrix T of shape n \times n, then:
** Partition A into A_{11}, A_{12}, A_{21}, A_{22}.
** Partition B into B_{11}, B_{12}, B_{21}, B_{22}.
** Partition C into C_{11}, C_{12}, C_{21}, C_{22}.
** Partition T into T_{11}, T_{12}, T_{21}, T_{22}.
** Parallel execution:
*** ''Fork'' multiply(C_{11}, A_{11}, B_{11}).
*** ''Fork'' multiply(C_{12}, A_{11}, B_{12}).
*** ''Fork'' multiply(C_{21}, A_{21}, B_{11}).
*** ''Fork'' multiply(C_{22}, A_{21}, B_{12}).
*** ''Fork'' multiply(T_{11}, A_{12}, B_{21}).
*** ''Fork'' multiply(T_{12}, A_{12}, B_{22}).
*** ''Fork'' multiply(T_{21}, A_{22}, B_{21}).
*** ''Fork'' multiply(T_{22}, A_{22}, B_{22}).
** ''Join'' (wait for parallel forks to complete).
** add(C, T).
** Deallocate T.

Procedure add(C, T) adds T into C, element-wise:
* Base case: if n = 1, set c_{11} ← c_{11} + t_{11} (or do a short loop, perhaps unrolled).
* Otherwise:
** Partition C into C_{11}, C_{12}, C_{21}, C_{22}.
** Partition T into T_{11}, T_{12}, T_{21}, T_{22}.
** In parallel:
*** ''Fork'' add(C_{11}, T_{11}).
*** ''Fork'' add(C_{12}, T_{12}).
*** ''Fork'' add(C_{21}, T_{21}).
*** ''Fork'' add(C_{22}, T_{22}).
** ''Join''.

Here, ''fork'' is a keyword that signals a computation may be run in parallel with the rest of the function call, while ''join'' waits for all previously "forked" computations to complete. Partitioning achieves its goal by pointer manipulation only. This algorithm has a critical path length of \Theta(\log^2 n) steps, meaning it takes that much time on an ideal machine with an infinite number of processors; therefore, it has a maximum possible speedup of \Theta(n^3 / \log^2 n) on any real computer. The algorithm isn't practical due to the communication cost inherent in moving data to and from the temporary matrix T, but a more practical variant achieves \Theta(n^2) speedup, without using a temporary matrix.
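In practice, a much simpler way to use a shared-memory machine is to parallelize only the outermost loop of the iterative algorithm, for example with OpenMP. The following C sketch (compile with -fopenmp) illustrates that approach rather than the fork–join recursion above; all names are illustrative:

#include <stddef.h>

/* Row-parallel naive multiplication: each thread computes a band of rows of C.
 * Iterations of the outer loop are independent, so no synchronization is needed. */
void matmul_parallel(size_t n, const double *A, const double *B, double *C)
{
    #pragma omp parallel for schedule(static)
    for (long i = 0; i < (long)n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;
            for (size_t k = 0; k < n; k++)
                sum += A[(size_t)i * n + k] * B[k * n + j];
            C[(size_t)i * n + j] = sum;
        }
    }
}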


Communication-avoiding and distributed algorithms

On modern architectures with hierarchical memory, the cost of loading and storing input matrix elements tends to dominate the cost of arithmetic. On a single machine this is the amount of data transferred between RAM and cache, while on a distributed-memory multi-node machine it is the amount transferred between nodes; in either case it is called the ''communication bandwidth''. The naïve algorithm using three nested loops uses \Omega(n^3) communication bandwidth.

Cannon's algorithm, also known as the ''2D algorithm'', is a communication-avoiding algorithm that partitions each input matrix into a block matrix whose elements are submatrices of size \sqrt{M/3} by \sqrt{M/3}, where M is the size of fast memory. The naïve algorithm is then used over the block matrices, computing products of submatrices entirely in fast memory. This reduces communication bandwidth to O(n^3/\sqrt{M}), which is asymptotically optimal (for algorithms performing \Omega(n^3) computation).

In a distributed setting with p processors arranged in a \sqrt{p} by \sqrt{p} 2D mesh, one submatrix of the result can be assigned to each processor, and the product can be computed with each processor transmitting O(n^2/\sqrt{p}) words, which is asymptotically optimal assuming that each node stores the minimum O(n^2/p) elements. This can be improved by the ''3D algorithm'', which arranges the processors in a 3D cube mesh, assigning every product of two input submatrices to a single processor. The result submatrices are then generated by performing a reduction over each row. This algorithm transmits O(n^2/p^{2/3}) words per processor, which is asymptotically optimal. However, this requires replicating each input matrix element p^{1/3} times, and so requires a factor of p^{1/3} more memory than is needed to store the inputs. This algorithm can be combined with Strassen to further reduce runtime. "2.5D" algorithms provide a continuous tradeoff between memory usage and communication bandwidth. On modern distributed computing environments such as MapReduce, specialized multiplication algorithms have been developed.


Algorithms for meshes

There are a variety of algorithms for multiplication on meshes. For multiplication of two ''n''×''n'' matrices on a standard two-dimensional mesh using the 2D Cannon's algorithm, one can complete the multiplication in 3''n'' − 2 steps, although this is reduced to half this number for repeated computations. The standard array is inefficient because the data from the two matrices does not arrive simultaneously and it must be padded with zeroes. The result is even faster on a two-layered cross-wired mesh, where only 2''n'' − 1 steps are needed. The performance improves further for repeated computations, leading to 100% efficiency. The cross-wired mesh array may be seen as a special case of a non-planar (i.e. multilayered) processing structure.


See also

* Computational complexity of mathematical operations
* Computational complexity of matrix multiplication
* CYK algorithm § Valiant's algorithm
* Matrix chain multiplication
* Sparse matrix–vector multiplication
* Method of Four Russians


References


Further reading

* How To Optimize GEMM