In-place matrix transposition, also called in-situ matrix transposition, is the problem of
transposing an ''N''×''M'' matrix in-place in computer memory, ideally with ''O''(1) (bounded) additional storage, or at most with additional storage much less than ''NM''. Typically, the matrix is assumed to be stored in
row-major or column-major order (i.e., contiguous rows or columns, respectively, arranged consecutively).
Performing an in-place transpose (in-situ transpose) is most difficult when ''N'' ≠ ''M'', i.e. for a non-square (rectangular) matrix, where it involves a complex
permutation of the data elements, with many
cycles of length greater than 2. In contrast, for a square matrix (''N'' = ''M''), all of the cycles are of length 1 or 2, and the transpose can be achieved by a simple loop to swap the upper triangle of the matrix with the lower triangle. Further complications arise if one wishes to maximize
memory locality in order to improve
cache line utilization or to operate out-of-core
(where the matrix does not fit into main memory), since transposes inherently involve non-consecutive memory accesses.
The problem of non-square in-place transposition has been studied since at least the late 1950s, and several algorithms are known, including several which attempt to optimize locality for cache, out-of-core, or similar memory-related contexts.
Background
On a
computer, one can often avoid explicitly transposing a matrix in memory
by simply accessing the same data in a different order. For example,
software libraries for
linear algebra, such as
BLAS, typically provide options to specify that certain matrices are to be interpreted in transposed order to avoid data movement.
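For instance, with a CBLAS-style interface (a minimal sketch; it assumes a CBLAS implementation such as the Netlib reference CBLAS or OpenBLAS is linked, and the matrices shown are arbitrary), a caller can request that an operand be read in transposed order without rearranging any data:

 #include <cblas.h>
 #include <stdio.h>
 
 int main(void) {
     /* A is stored as a 2x3 row-major matrix; it is used below as its 3x2
        transpose without moving any elements, by passing CblasTrans. */
     double A[2 * 3] = {1, 2, 3,
                        4, 5, 6};
     double B[2 * 2] = {1, 0,
                        0, 1};              /* 2x2 identity */
     double C[3 * 2] = {0};
 
     /* C (3x2) = A^T (3x2) * B (2x2) */
     cblas_dgemm(CblasRowMajor, CblasTrans, CblasNoTrans,
                 3, 2, 2,                   /* rows of A^T, columns of B, inner dimension */
                 1.0, A, 3,                 /* lda = row length of A as stored */
                 B, 2, 0.0, C, 2);
 
     for (int i = 0; i < 3; ++i)
         printf("%g %g\n", C[2 * i], C[2 * i + 1]);   /* prints A^T */
     return 0;
 }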
However, there remain a number of circumstances in which it is necessary or desirable to physically reorder a matrix in memory to its transposed ordering. For example, with a matrix stored in
row-major order, the rows of the matrix are contiguous in memory and the columns are discontiguous. If repeated operations need to be performed on the columns, for example in a
fast Fourier transform
algorithm (e.g. Frigo & Johnson, 2005), transposing the matrix in memory (to make the columns contiguous) may improve performance by increasing
memory locality. Since these situations normally coincide with the case of very large matrices (which exceed the cache size), performing the transposition in-place with minimal additional storage becomes desirable.
Also, as a purely mathematical problem, in-place transposition involves a number of interesting
number theory
puzzles that have been worked out over the course of several decades.
Example
For example, consider the 2×4 matrix:
:\begin{pmatrix} 11 & 12 & 13 & 14 \\ 21 & 22 & 23 & 24 \end{pmatrix}
In row-major format, this would be stored in computer memory as the sequence (11, 12, 13, 14, 21, 22, 23, 24), i.e. the two rows stored consecutively. If we transpose this, we obtain the 4×2 matrix:
:\begin{pmatrix} 11 & 21 \\ 12 & 22 \\ 13 & 23 \\ 14 & 24 \end{pmatrix}
which is stored in computer memory as the sequence (11, 21, 12, 22, 13, 23, 14, 24).
If we number the storage locations 0 to 7, from left to right, then this permutation consists of four cycles:
:(0), (1 2 4), (3 6 5), (7)
That is, the value in position 0 stays in position 0 (a cycle of length 1, requiring no data motion). Next, the value in position 1 (the element 12) moves to position 2, the value in position 2 (the element 13) moves to position 4, and the value in position 4 (the element 21) moves back to position 1, closing the cycle (1 2 4). Likewise, the value in position 7 stays in place, and the values in positions 3, 6 and 5 (the elements 14, 23 and 22) form the cycle (3 6 5).
Properties of the permutation
In the following, we assume that the ''N''×''M'' matrix is stored in row-major order with zero-based indices. This means that the (''n'',''m'') element, for ''n'' = 0,...,''N''−1 and ''m'' = 0,...,''M''−1, is stored at an address ''a'' = ''Mn'' + ''m'' (plus some offset in memory, which we ignore). In the transposed ''M''×''N'' matrix, the corresponding (''m'',''n'') element is stored at the address ''a' '' = ''Nm'' + ''n'', again in row-major order. We define the ''transposition permutation'' to be the function ''a' '' = ''P''(''a'') such that:
:P(Mn + m) = Nm + n
for all 0 ≤ ''n'' < ''N'' and 0 ≤ ''m'' < ''M''.
This defines a permutation on the numbers ''a'' = 0, 1, ..., ''MN''−1.
It turns out that one can define simple formulas for ''P'' and its inverse (Cate & Twigg, 1977). First:
:P(a) = \begin{cases} MN - 1 & \text{if } a = MN - 1, \\ Na \bmod (MN - 1) & \text{otherwise,} \end{cases}
where "mod" is the
modulo operation.
If 0 ≤ ''a'' = ''Mn'' + ''m'' < ''MN'' − 1, then ''Na'' mod (''MN''−1) = (''MNn'' + ''Nm'') mod (''MN''−1) = ''n'' + ''Nm'', using the fact that ''MNx'' mod (''MN''−1) = ((''MN''−1)''x'' + ''x'') mod (''MN''−1) = ''x'' for 0 ≤ ''x'' < ''MN'' − 1. (The first (''a'' = 0) and last (''a'' = ''MN''−1) elements are always left invariant under transposition.)
Second, the inverse permutation is given by:
:P^{-1}(a') = \begin{cases} MN - 1 & \text{if } a' = MN - 1, \\ Ma' \bmod (MN - 1) & \text{otherwise.} \end{cases}
(This is just a consequence of the fact that the inverse of an ''N''×''M'' transpose is an ''M''×''N'' transpose, although it is also easy to show explicitly that ''P''<sup>−1</sup> composed with ''P'' gives the identity.)
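These formulas can be checked directly against the 2×4 example above. The following C sketch (self-contained; the function names are illustrative only) prints P(''a'') and verifies that P<sup>−1</sup>(P(''a'')) = ''a'' for every address, reproducing the cycles (0), (1 2 4), (3 6 5), (7):

 #include <stdio.h>
 
 /* Transposition permutation for an N x M row-major matrix. */
 static int P(int a, int N, int M) {
     if (a == M * N - 1) return a;                 /* last element is fixed */
     return (int)(((long long)N * a) % (M * N - 1));
 }
 
 /* Inverse permutation (an M x N transpose). */
 static int Pinv(int a, int N, int M) {
     if (a == M * N - 1) return a;
     return (int)(((long long)M * a) % (M * N - 1));
 }
 
 int main(void) {
     const int N = 2, M = 4;                       /* the 2x4 example above */
     for (int a = 0; a < N * M; ++a)
         printf("P(%d) = %d, Pinv(P(%d)) = %d\n",
                a, P(a, N, M), a, Pinv(P(a, N, M), N, M));
     return 0;
 }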
As proved by Cate & Twigg (1977), the number of
fixed points (cycles of length 1) of the permutation is precisely 1 + gcd(''N''−1, ''M''−1), where gcd is the greatest common divisor. For example, with ''N'' = ''M'' the number of fixed points is simply ''N'' (the diagonal of the matrix). If ''N''−1 and ''M''−1 are coprime, on the other hand, the only two fixed points are the upper-left and lower-right corners of the matrix.
The number of cycles of any length ''k''>1 is given by (Cate & Twigg, 1977):
:\frac{1}{k} \sum_{d \mid k} \mu(k/d) \gcd(N^d - 1, MN - 1)
where μ is the
Möbius function and the sum is over the divisors ''d'' of ''k''.
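As a consistency check, the following self-contained C sketch evaluates this formula (as reconstructed above) for the 2×4 example and prints 2, matching the two length-3 cycles (1 2 4) and (3 6 5) found earlier:

 #include <stdio.h>
 
 static long long gcdll(long long a, long long b) {
     while (b) { long long t = a % b; a = b; b = t; }
     return a;
 }
 
 /* Möbius function by trial factorization (adequate for small k). */
 static int moebius(int n) {
     int mu = 1;
     for (int p = 2; p * p <= n; ++p) {
         if (n % p == 0) {
             n /= p;
             if (n % p == 0) return 0;             /* squared prime factor */
             mu = -mu;
         }
     }
     if (n > 1) mu = -mu;
     return mu;
 }
 
 /* Number of cycles of length k > 1 of the transposition permutation. */
 static long long cycles_of_length(int k, long long N, long long MN1) {
     long long sum = 0;
     for (int d = 1; d <= k; ++d) {
         if (k % d != 0) continue;
         long long Nd = 1;
         for (int i = 0; i < d; ++i) Nd *= N;      /* N^d (small cases only) */
         sum += moebius(k / d) * gcdll(Nd - 1, MN1);
     }
     return sum / k;
 }
 
 int main(void) {
     /* 2x4 example: two cycles of length 3, namely (1 2 4) and (3 6 5). */
     printf("%lld\n", cycles_of_length(3, 2, 2 * 4 - 1));
     return 0;
 }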
Furthermore, the cycle containing ''a''=1 (i.e. the second element of the first row of the matrix) is always a cycle of maximum length ''L'', and the lengths ''k'' of all other cycles must be divisors of ''L'' (Cate & Twigg, 1977).
For a given cycle ''C'', every element ''x'' ∈ ''C'' has the same greatest common divisor ''d'' = gcd(''x'', ''MN''−1). To see this, let ''s'' be the smallest element of the cycle, and let ''d'' = gcd(''s'', ''MN''−1). From the definition of the permutation ''P'' above, every other element ''x'' of the cycle is obtained by repeatedly multiplying ''s'' by ''N'' modulo ''MN''−1, and therefore every other element is divisible by ''d''. But, since ''N'' and ''MN''−1 are coprime, ''x'' cannot be divisible by any factor of ''MN''−1 larger than ''d'', and hence ''d'' = gcd(''x'', ''MN''−1).
This theorem is useful in searching for cycles of the permutation, since an efficient search can look only at multiples of divisors of ''MN''−1 (Brenner, 1973).
Laflin & Brebner (1970) pointed out that the cycles often come in pairs, which is exploited by several algorithms that permute pairs of cycles at a time. In particular, let ''s'' be the smallest element of some cycle ''C'' of length ''k''. It follows that ''MN''−1−''s'' is also an element of a cycle of length ''k'' (possibly the same cycle).
The length ''k'' of the cycle containing ''s'' is the smallest ''k'' > 0 such that ''N''<sup>''k''</sup>''s'' = ''s'' (mod ''MN''−1). Clearly, this is the same as the smallest ''k'' > 0 such that ''N''<sup>''k''</sup>(''MN''−1−''s'') = ''MN''−1−''s'' (mod ''MN''−1), since we are just multiplying both sides by −1, and ''MN''−1−''s'' = −''s'' (mod ''MN''−1).
Algorithms
The following briefly summarizes the published algorithms to perform in-place matrix transposition.
Source code implementing some of these algorithms can be found in the references below.
Accessor transpose
Because physically transposing a matrix is computationally expensive, instead of moving values in memory, the access path may be transposed. It is trivial to perform this operation for CPU access, as the access paths of iterators must simply be exchanged; however, hardware acceleration may require that the data still be physically realigned.
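As a minimal illustration (the function names are illustrative, not from any particular library), a transposed "view" of a row-major matrix can be obtained simply by exchanging the roles of the two indices in the accessor, leaving the stored data untouched:

 #include <stdio.h>
 
 /* Element (n,m) of an N-row, M-column row-major matrix. */
 static double get(const double *A, int M, int n, int m) {
     return A[n * M + m];
 }
 
 /* Element (m,n) of the transpose, without moving any data:
    the index arguments are simply exchanged in the access path. */
 static double get_transposed(const double *A, int M, int m, int n) {
     return A[n * M + m];
 }
 
 int main(void) {
     double A[2 * 4] = {11, 12, 13, 14,
                        21, 22, 23, 24};
     /* Element (2,1) of the 4x2 transpose is element (1,2) of A, i.e. 23. */
     printf("%g\n", get_transposed(A, 4, 2, 1));
     printf("%g\n", get(A, 4, 1, 2));
     return 0;
 }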
Square matrices
For a square ''N''×''N'' matrix ''A''<sub>''n'',''m''</sub> = ''A''(''n'',''m''), in-place transposition is easy because all of the cycles have length 1 (the diagonal elements ''A''<sub>''n'',''n''</sub>) or length 2 (the upper triangle is swapped with the lower triangle).
Pseudocode
to accomplish this (assuming zero-based array indices) is:
 for n = 0 to N - 1
     for m = n + 1 to N - 1
         swap A(n,m) with A(m,n)
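A direct C rendering of this pseudocode might look as follows (a sketch; the matrix is assumed to be stored in row-major order):

 #include <stddef.h>
 
 /* In-place transpose of a square N x N row-major matrix:
    swap the upper triangle with the lower triangle. */
 void transpose_square(double *A, size_t N) {
     for (size_t n = 0; n + 1 < N; ++n) {
         for (size_t m = n + 1; m < N; ++m) {
             double tmp = A[n * N + m];
             A[n * N + m] = A[m * N + n];
             A[m * N + n] = tmp;
         }
     }
 }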
This type of implementation, while simple, can exhibit poor performance due to poor cache-line utilization, especially when ''N'' is a
power of two (due to cache-line conflicts in a CPU cache with limited associativity). The reason for this is that, as ''m'' is incremented in the inner loop, the memory address corresponding to ''A''(''n'',''m'') or ''A''(''m'',''n'') jumps discontiguously by ''N'' in memory (depending on whether the array is in column-major or row-major format, respectively). That is, the algorithm does not exploit locality of reference.
One solution to improve the cache utilization is to "block" the algorithm to operate on several numbers at once, in blocks given by the cache-line size; unfortunately, this means that the algorithm depends on the size of the cache line (it is "cache-aware"), and on a modern computer with multiple levels of cache it requires multiple levels of machine-dependent blocking. Instead, it has been suggested (Frigo ''et al.'', 1999) that better performance can be obtained by a
recursive algorithm: divide the matrix into four submatrices of roughly equal size, transposing the two submatrices along the diagonal recursively, and transposing and swapping the two submatrices above and below the diagonal. (When ''N'' is sufficiently small, the simple algorithm above is used as a base case, as naively recursing all the way down to ''N''=1 would have excessive function-call overhead.) This is a cache-oblivious algorithm, in the sense that it can exploit the cache line without the cache-line size being an explicit parameter.
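The following C sketch illustrates this recursive scheme for the square case (the base-case sizes are arbitrary illustrative cutoffs, not values from Frigo et al.):

 #include <stddef.h>
 
 /* Transpose-and-swap an off-diagonal block (rows [r0,r1), columns [c0,c1),
    lying entirely above the diagonal) with its mirror block, recursively. */
 static void swap_blocks(double *A, size_t N,
                         size_t r0, size_t r1, size_t c0, size_t c1) {
     size_t dr = r1 - r0, dc = c1 - c0;
     if (dr + dc <= 32) {                        /* small base case */
         for (size_t r = r0; r < r1; ++r)
             for (size_t c = c0; c < c1; ++c) {
                 double tmp = A[r * N + c];
                 A[r * N + c] = A[c * N + r];
                 A[c * N + r] = tmp;
             }
         return;
     }
     if (dr >= dc) {                             /* split the longer dimension */
         size_t rm = r0 + dr / 2;
         swap_blocks(A, N, r0, rm, c0, c1);
         swap_blocks(A, N, rm, r1, c0, c1);
     } else {
         size_t cm = c0 + dc / 2;
         swap_blocks(A, N, r0, r1, c0, cm);
         swap_blocks(A, N, r0, r1, cm, c1);
     }
 }
 
 /* Recursively transpose the diagonal block [b0,b1) x [b0,b1) in place. */
 static void transpose_diag(double *A, size_t N, size_t b0, size_t b1) {
     size_t d = b1 - b0;
     if (d <= 16) {                              /* base case: simple triangle swap */
         for (size_t n = b0; n + 1 < b1; ++n)
             for (size_t m = n + 1; m < b1; ++m) {
                 double tmp = A[n * N + m];
                 A[n * N + m] = A[m * N + n];
                 A[m * N + n] = tmp;
             }
         return;
     }
     size_t mid = b0 + d / 2;
     transpose_diag(A, N, b0, mid);              /* upper-left diagonal block */
     transpose_diag(A, N, mid, b1);              /* lower-right diagonal block */
     swap_blocks(A, N, b0, mid, mid, b1);        /* transpose and swap the off-diagonal pair */
 }
 
 /* Entry point: transpose the whole N x N row-major matrix in place. */
 void transpose_cache_oblivious(double *A, size_t N) {
     transpose_diag(A, N, 0, N);
 }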
Non-square matrices: Following the cycles
For non-square matrices, the algorithms are more complex. Many of the algorithms prior to 1980 could be described as "follow-the-cycles" algorithms. That is, they loop over the cycles, moving the data from one location to the next in the cycle. In pseudocode form:
 for each length>1 cycle ''C'' of the permutation
     pick a starting address ''s'' in ''C''
     let ''D'' = data at ''s''
     let ''x'' = predecessor of ''s'' in the cycle
     while ''x'' ≠ ''s''
         move data from ''x'' to successor of ''x''
         let ''x'' = predecessor of ''x''
     move data ''D'' to successor of ''s''
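A straightforward C realization of this idea is sketched below. It walks each cycle forward, carrying a single displaced value (equivalent to the backward walk in the pseudocode), and uses one "visited" flag per element, i.e. the simple ''O''(''MN'') auxiliary-storage scheme discussed below, rather than the cycle-search techniques of the published algorithms:

 #include <stdbool.h>
 #include <stdlib.h>
 
 /* In-place transpose of an N x M row-major matrix A; the result is the
    M x N transpose, also in row-major order. Follows the cycles of the
    permutation a -> N*a mod (MN-1). Returns false if the flag array
    cannot be allocated. */
 bool transpose_in_place(double *A, size_t N, size_t M) {
     size_t size = N * M;
     if (size < 2) return true;
     bool *visited = calloc(size, sizeof(bool));
     if (!visited) return false;
 
     /* Elements 0 and size-1 are always fixed, so they are skipped. */
     for (size_t start = 1; start + 1 < size; ++start) {
         if (visited[start]) continue;
         size_t a = start;
         double carried = A[a];                  /* value displaced so far */
         do {
             size_t dest = (a * N) % (size - 1); /* P(a) = N*a mod (MN-1) */
             double tmp = A[dest];
             A[dest] = carried;                  /* drop the carried value at its destination */
             carried = tmp;                      /* pick up the value that was there */
             visited[a] = true;
             a = dest;
         } while (a != start);
     }
     free(visited);
     return true;
 }

Replacing the visited array with the cycle-search schemes described below reduces the auxiliary storage at the cost of extra arithmetic.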
The differences between the algorithms lie mainly in how they locate the cycles, how they find the starting addresses in each cycle, and how they ensure that each cycle is moved exactly once. Typically, as discussed above, the cycles are moved in pairs, since ''s'' and ''MN''−1−''s'' are in cycles of the same length (possibly the same cycle). Sometimes, a small scratch array, typically of length ''M''+''N'' (e.g. Brenner, 1973; Cate & Twigg, 1977) is used to keep track of a subset of locations in the array that have been visited, to accelerate the algorithm.
In order to determine whether a given cycle has been moved already, the simplest scheme would be to use ''O''(''MN'') auxiliary storage, one
bit per element, to indicate whether a given element has been moved. To use only ''O''(''M''+''N'') or even ''O''(log ''MN'') auxiliary storage, more-complex algorithms are required, and the known algorithms have a worst-case linearithmic computational cost of ''O''(''MN'' log ''MN'') at best, as first proved by
Knuth (Fich ''et al.'', 1995; Gustavson & Swirszcz, 2007).
Such algorithms are designed to move each
data element
exactly once. However, they also involve a considerable amount of arithmetic to compute the cycles, and require heavily non-consecutive memory accesses since the adjacent elements of the cycles differ by multiplicative factors of ''N'', as discussed above.
Improving memory locality at the cost of greater total data movement
Several algorithms have been designed to achieve greater memory locality at the cost of greater data movement, as well as slightly greater storage requirements. That is, they may move each data element more than once, but they involve more consecutive memory access (greater spatial locality), which can improve performance on modern CPUs that rely on caches, as well as on
SIMD
architectures optimized for processing consecutive data blocks. The oldest context in which the spatial locality of transposition seems to have been studied is for out-of-core operation (by Alltop, 1975), where the matrix is too large to fit into main memory ("
core").
For example, if ''d'' =
gcd(''N'',''M'') is not small, one can perform the transposition using a small amount (''NM''/''d'') of additional storage, with at most three passes over the array (Alltop, 1975; Dow, 1995). Two of the passes involve a sequence of separate, small transpositions (which can be performed efficiently out of place using a small buffer) and one involves an in-place ''d''×''d'' square transposition of
blocks (which is efficient since the blocks being moved are large and consecutive, and the cycles are of length at most 2). This is further simplified if N is a multiple of M (or vice versa), since only one of the two out-of-place passes is required.
Another algorithm for non-
coprime dimensions, involving multiple subsidiary transpositions, was described by Catanzaro et al. (2014). For the case where the difference |''N''−''M''| is small, Dow (1995) describes another algorithm requiring |''N''−''M''|·min(''N'',''M'') additional storage, involving a min(''N'',''M'')×min(''N'',''M'') square transpose preceded or followed by a small out-of-place transpose. Frigo & Johnson (2005) describe the adaptation of these algorithms to use cache-oblivious techniques for general-purpose CPUs relying on cache lines to exploit spatial locality.
Work on out-of-core matrix transposition, where the matrix does not fit in main memory and must be stored largely on a
hard disk, has focused largely on the ''N'' = ''M'' square-matrix case, with some exceptions (e.g. Alltop, 1975). Reviews of out-of-core algorithms, especially as applied to parallel computing, can be found in e.g. Suh & Prasanna (2002) and Krishnamoorthy et al. (2004).
References
* P. F. Windley, "Transposing matrices in a digital computer," ''Computer Journal'' 2, p. 47-48 (1959).
* G. Pall, and E. Seiden, "A problem in Abelian Groups, with application to the transposition of a matrix on an electronic computer," ''Math. Comp.'' 14, p. 189-192 (1960).
* J. Boothroyd, "Algorithm 302: Transpose vector stored array," ''Communications of the ACM'' 10 (5), p. 292-293 (1967).
* Susan Laflin and M. A. Brebner, "Algorithm 380: in-situ transposition of a rectangular matrix," ''Communications of the ACM'' 13 (5), p. 324-326 (1970).
Source code
* Norman Brenner, "Algorithm 467: matrix transposition in place," ''Communications of the ACM'' 16 (11), p. 692-694 (1973).
Source code
* W. O. Alltop, "A computer algorithm for transposing nonsquare matrices," ''IEEE Trans. Comput.'' 24 (10), p. 1038-1040 (1975).
* Esko G. Cate and David W. Twigg, "Algorithm 513: Analysis of In-Situ Transposition," ''ACM Transactions on Mathematical Software'' 3 (1), p. 104-110 (1977).
Source code
* Bryan Catanzaro, Alexander Keller, and Michael Garland, "A decomposition for in-place matrix transposition," Proceedings of the 19th ACM SIGPLAN symposium on Principles and practice of parallel programming (PPoPP '14), pp. 193–206 (2014).
* Murray Dow, "Transposing a matrix on a vector computer," ''Parallel Computing'' 21 (12), p. 1997-2005 (1995).
* Donald E. Knuth, ''The Art of Computer Programming, Volume 1: Fundamental Algorithms'', third edition, section 1.3.3 exercise 12 (Addison-Wesley: New York, 1997).
* M. Frigo, C. E. Leiserson, H. Prokop, and S. Ramachandran, "Cache-oblivious algorithms," in ''Proceedings of the 40th IEEE Symposium on Foundations of Computer Science'' (FOCS 99), p. 285-297 (1999).
* J. Suh and V. K. Prasanna, "An efficient algorithm for out-of-core matrix transposition," ''IEEE Trans. Computers'' 51 (4), p. 420-438 (2002).
* S. Krishnamoorthy, G. Baumgartner, D. Cociorva, C.-C. Lam, and P. Sadayappan, "Efficient parallel out-of-core matrix transposition," ''International Journal of High Performance Computing and Networking'' 2 (2-4), p. 110-119 (2004).
* M. Frigo and S. G. Johnson, "The Design and Implementation of FFTW3," ''Proceedings of the IEEE'' 93 (2), p. 216-231 (2005).
Source code of the FFTW library, which includes optimized serial and parallel square and non-square transposes, in addition to FFTs.
* Faith E. Fich, J. Ian Munro, and Patricio V. Poblete, "Permuting in place," ''SIAM Journal on Computing'' 24 (2), p. 266-278 (1995).
* Fred G. Gustavson and Tadeusz Swirszcz, "In-place transposition of rectangular matrices," ''Lecture Notes in Computer Science'' 4699, p. 560-569 (2007), from the Proceedings of the 2006 Workshop on State-of-the-Art [''sic''] in Scientific and Parallel Computing (PARA 2006) (Umeå, Sweden, June 2006).
External links
Source code
* OFFT - recursive block in-place transpose of square matrices, in Fortran
* Jason Stratos Papadopoulos, blocked in-place transpose of square matrices, in C, ''sci.math.num-analysis'' newsgroup (April 7, 1998).
* See "Source code" links in the references section above, for additional code to perform in-place transposes of both square and non-square matrices.
* libmarshal - blocked in-place transpose of rectangular matrices for GPUs.