computing Computing is any goal-oriented activity requiring, benefiting from, or creating computing machinery. It includes the study and experimentation of algorithmic processes, and development of both hardware and software. Computing has scientific, ...

, a cache-oblivious algorithm (or cache-transcendent algorithm) is an

algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...

designed to take advantage of a processor

cache Cache, caching, or caché may refer to: Places United States * Cache, Idaho, an unincorporated community * Cache, Illinois, an unincorporated community * Cache, Oklahoma, a city in Comanche County * Cache, Utah, Cache County, Utah * Cache County ...

without having the size of the cache (or the length of the

cache line A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, whi ...

s, etc.) as an explicit parameter. An optimal cache-oblivious algorithm is a cache-oblivious algorithm that uses the cache optimally (in an asymptotic sense, ignoring constant factors). Thus, a cache-oblivious algorithm is designed to perform well, without modification, on multiple machines with different cache sizes, or for a memory hierarchy with different levels of cache having different sizes. Cache-oblivious algorithms are contrasted with explicit ''

loop tiling In computer science and particularly in compiler design, loop nest optimization (LNO) is an optimization technique that applies a set of loop transformations for the purpose of locality optimization or parallelization or another loop overhead redu ...

'', which explicitly breaks a problem into blocks that are optimally sized for a given cache. Optimal cache-oblivious algorithms are known for

matrix multiplication In mathematics, particularly in linear algebra, matrix multiplication is a binary operation that produces a matrix from two matrices. For matrix multiplication, the number of columns in the first matrix must be equal to the number of rows in the s ...

matrix transposition In linear algebra, the transpose of a matrix is an operator which flips a matrix over its diagonal; that is, it switches the row and column indices of the matrix by producing another matrix, often denoted by (among other notations). The t ...

, sorting, and several other problems. Some more general algorithms, such as Cooley–Tukey FFT, are optimally cache-oblivious under certain choices of parameters. As these algorithms are only optimal in an asymptotic sense (ignoring constant factors), further machine-specific

tuning Tuning can refer to: Common uses * Tuning, the process of tuning a tuned amplifier or other electronic component * Musical tuning, musical systems of tuning, and the act of tuning an instrument or voice ** Guitar tunings ** Piano tuning, adjusti ...

may be required to obtain nearly optimal performance in an absolute sense. The goal of cache-oblivious algorithms is to reduce the amount of such tuning that is required. Typically, a cache-oblivious algorithm works by a

recursive Recursion (adjective: ''recursive'') occurs when a thing is defined in terms of itself or of its type. Recursion is used in a variety of disciplines ranging from linguistics to logic. The most common application of recursion is in mathematics ...

divide-and-conquer algorithm In computer science, divide and conquer is an algorithm design paradigm. A divide-and-conquer algorithm recursively breaks down a problem into two or more sub-problems of the same or related type, until these become simple enough to be solved dire ...

, where the problem is divided into smaller and smaller subproblems. Eventually, one reaches a subproblem size that fits into the cache, regardless of the cache size. For example, an optimal cache-oblivious matrix multiplication is obtained by recursively dividing each matrix into four sub-matrices to be multiplied, multiplying the submatrices in a

depth-first Depth-first search (DFS) is an algorithm for traversing or searching tree or graph data structures. The algorithm starts at the root node (selecting some arbitrary node as the root node in the case of a graph) and explores as far as possible alon ...

fashion. In tuning for a specific machine, one may use a

hybrid algorithm {{Unreferenced, date=May 2014 A hybrid algorithm is an algorithm that combines two or more other algorithms that solve the same problem, and is mostly used in programming languages like C++, either choosing one (depending on the data), or switching ...

which uses loop tiling tuned for the specific cache sizes at the bottom level but otherwise uses the cache-oblivious algorithm.

History

The idea (and name) for cache-oblivious algorithms was conceived by

Charles E. Leiserson Charles Eric Leiserson is a computer scientist, specializing in the theory of parallel computing and distributed computing, and particularly practical applications thereof. As part of this effort, he developed the Cilk multithreaded language. ...

as early as 1996 and first published by

Harald Prokop Harald Prokop is the Chief Technical Officer of SCVNGR, parent company of LevelUp, as of April 2012. He worked at Akamai Technologies from 1999 to 2012, where he was Senior Vice President of Engineering and is also known for having elucidated t ...

in his master's thesis at the

Massachusetts Institute of Technology The Massachusetts Institute of Technology (MIT) is a private land-grant research university in Cambridge, Massachusetts. Established in 1861, MIT has played a key role in the development of modern technology and science, and is one of the ...

in 1999.Harald Prokop
Cache-Oblivious Algorithms
Masters thesis, MIT. 1999. There were many predecessors, typically analyzing specific problems; these are discussed in detail in Frigo et al. 1999. Early examples cited include Singleton 1969 for a recursive Fast Fourier Transform, similar ideas in Aggarwal et al. 1987, Frigo 1996 for matrix multiplication and LU decomposition, and Todd Veldhuizen 1996 for matrix algorithms in the

Blitz++ Blitz++ is a high-performance vector mathematics library written in C++. This library is intended for use in scientific applications that might otherwise be implemented with Fortran or MATLAB MATLAB (an abbreviation of "MATrix LABoratory") ...

library.

Idealized cache model

In general, a

program Program, programme, programmer, or programming may refer to: Business and management * Program management, the process of managing several related projects * Time management * Program, a part of planning Arts and entertainment Audio * Progra ...

can be made more cache-conscious: *''

Temporal locality In computer science, locality of reference, also known as the principle of locality, is the tendency of a processor to access the same set of memory locations repetitively over a short period of time. There are two basic types of reference localit ...

'', where the algorithm fetches the same pieces of memory multiple times; *''Spatial locality'', where the subsequent memory accesses are adjacent or nearby

memory address In computing, a memory address is a reference to a specific memory location used at various levels by software and hardware. Memory addresses are fixed-length sequences of digits conventionally displayed and manipulated as unsigned integers. ...

es. Cache-oblivious algorithms are typically analyzed using an idealized model of the cache, sometimes called the cache-oblivious model. This model is much easier to analyze than a real cache's characteristics (which have complicated associativity, replacement policies, etc.), but in many cases is within a constant factor of a more realistic cache's performance. It is different than the

external memory model In computing, external memory algorithms or out-of-core algorithms are algorithms that are designed to process data that are too large to fit into a computer's main memory at once. Such algorithms must be optimized to efficiently fetch and access da ...

because cache-oblivious algorithms do not know the block size or the

size. In particular, the cache-oblivious model is an

abstract machine An abstract machine is a computer science theoretical model that allows for a detailed and precise analysis of how a computer system functions. It is analogous to a mathematical function in that it receives inputs and produces outputs based on pr ...

(i.e., a theoretical

model of computation In computer science, and more specifically in computability theory and computational complexity theory, a model of computation is a model which describes how an output of a mathematical function is computed given an input. A model describes how ...

). It is similar to the RAM machine model which replaces the

Turing machine A Turing machine is a mathematical model of computation describing an abstract machine that manipulates symbols on a strip of tape according to a table of rules. Despite the model's simplicity, it is capable of implementing any computer algori ...

's infinite tape with an infinite array. Each location within the array can be accessed in

O(1)

time, similar to the

random-access memory Random-access memory (RAM; ) is a form of computer memory that can be read and changed in any order, typically used to store working data and machine code. A random-access memory device allows data items to be read or written in almost the ...

on a real computer. Unlike the RAM machine model, it also introduces a cache: the second level of storage between the RAM and the CPU. The other differences between the two models are listed below. In the cache-oblivious model: External memory

*Memory is broken into blocks of

B

objects each. *A load or a store between main memory and a CPU register may now be serviced from the cache. *If a load or a store cannot be serviced from the cache, it is called a ''cache miss''. *A cache miss results in one block being loaded from the main memory into the cache. Namely, if the CPU tries to access word

w

and

x

is the line containing

w

, then

x

is loaded into the cache. If the cache was previously full, then a line will be evicted as well (see replacement policy below). *The cache holds

M

objects, where

M = \Omega(B^2)

. This is also known as the ''tall cache assumption''. *The cache is fully associative: each line can be loaded into any location in the cache. *The replacement policy is optimal. In other words, the cache is assumed to be given the entire sequence of memory accesses during algorithm execution. If it needs to evict a line at time

t

, it will look into its sequence of future requests and evict the line whose first access is furthest in the future. This can be emulated in practice with the Least Recently Used policy, which is shown to be within a small constant factor of the offline optimal replacement strategyDaniel Sleator, Robert Tarjan
Amortized Efficiency of List Update and Paging Rules
In ''Communications of the ACM'', Volume 28, Number 2, pp. 202–208. Feb 1985. To measure the complexity of an algorithm that executes within the cache-oblivious model, we measure the number of

cache miss In computing, a cache ( ) is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewher ...

es that the algorithm experiences. Because the model captures the fact that accessing elements in the

is much faster than accessing things in main memory, the

running time In computer science, the time complexity is the computational complexity that describes the amount of computer time it takes to run an algorithm. Time complexity is commonly estimated by counting the number of elementary operations performed by t ...

of the algorithm is defined only by the number of memory transfers between the cache and main memory. This is similar to the

, which all of the features above, but cache-oblivious algorithms are independent of cache parameters (

B

and

M

Erik Demaine Erik D. Demaine (born February 28, 1981) is a professor of computer science at the Massachusetts Institute of Technology and a former child prodigy. Early life and education Demaine was born in Halifax, Nova Scotia, to artist sculptor Marti ...

Cache-Oblivious Algorithms and Data Structures
in Lecture Notes from the EEF Summer School on Massive Data Sets, BRICS, University of Aarhus, Denmark, June 27–July 1, 2002. The benefit of such an algorithm is that what is efficient on a cache-oblivious machine is likely to be efficient across many real machines without fine-tuning for particular real machine parameters. For many problems, an optimal cache-oblivious algorithm will also be optimal for a machine with more than two memory hierarchy levels.

Examples

The simplest cache-oblivious algorithm presented in Frigo et al. is an out-of-place

matrix transpose In linear algebra, the transpose of a matrix is an operator which flips a matrix over its diagonal; that is, it switches the row and column indices of the matrix by producing another matrix, often denoted by (among other notations). The tr ...

operation ( in-place algorithms have also been devised for transposition, but are much more complicated for non-square matrices). Given ''m''×''n'' array A and ''n''×''m'' array B, we would like to store the transpose of in . The naive solution traverses one array in row-major order and another in column-major. The result is that when the matrices are large, we get a cache miss on every step of the column-wise traversal. The total number of cache misses is

\Theta(mn)

The cache-oblivious algorithm has optimal work complexity

O(mn)

and optimal cache complexity

O(1+mn/B)

. The basic idea is to reduce the transpose of two large matrices into the transpose of small (sub)matrices. We do this by dividing the matrices in half along their larger dimension until we just have to perform the transpose of a matrix that will fit into the cache. Because the cache size is not known to the algorithm, the matrices will continue to be divided recursively even after this point, but these further subdivisions will be in cache. Once the dimensions and are small enough so an ''input'' array of size

m \times n

and an output array of size

n \times m

fit into the cache, both row-major and column-major traversals result in

O(mn)

work and

O(mn/B)

cache misses. By using this divide and conquer approach we can achieve the same level of complexity for the overall matrix. (In principle, one could continue dividing the matrices until a base case of size 1×1 is reached, but in practice one uses a larger base case (e.g. 16×16) in order to

amortize Amortization or amortisation may refer to: * The process by which loan principal decreases over the life of an amortizing loan * Amortization (accounting), the expensing of acquisition cost minus the residual value of intangible assets in a syste ...

the overhead of the recursive subroutine calls.) Most cache-oblivious algorithms rely on a divide-and-conquer approach. They reduce the problem, so that it eventually fits in cache no matter how small the cache is, and end the recursion at some small size determined by the function-call overhead and similar cache-unrelated optimizations, and then use some cache-efficient access pattern to merge the results of these small, solved problems. Like

external sorting External sorting is a class of sorting algorithms that can handle massive amounts of data. External sorting is required when the data being sorted do not fit into the main memory of a computing device (usually RAM) and instead they must reside in t ...

in the

, cache-oblivious sorting is possible in two variants:

funnelsort Funnelsort is a comparison-based sorting algorithm. It is similar to mergesort, but it is a cache-oblivious algorithm, designed for a setting where the number of elements to sort is too large to fit in a cache where operations are done. It wa ...

, which resembles

mergesort In computer science, merge sort (also commonly spelled as mergesort) is an efficient, general-purpose, and comparison-based sorting algorithm. Most implementations produce a stable sort, which means that the order of equal elements is the same ...

, and cache-oblivious distribution sort, which resembles

quicksort Quicksort is an efficient, general-purpose sorting algorithm. Quicksort was developed by British computer scientist Tony Hoare in 1959 and published in 1961, it is still a commonly used algorithm for sorting. Overall, it is slightly faster than ...

. Like their external memory counterparts, both achieve a

O\left(\tfrac \log_ \tfrac\right)

, which matches a

lower bound In mathematics, particularly in order theory, an upper bound or majorant of a subset of some preordered set is an element of that is greater than or equal to every element of . Dually, a lower bound or minorant of is defined to be an eleme ...

and is thus

asymptotically optimal In computer science, an algorithm is said to be asymptotically optimal if, roughly speaking, for large inputs it performs at worst a constant factor (independent of the input size) worse than the best possible algorithm. It is a term commonly en ...

Practicality

An empirical comparison of 2 RAM-based, 1 cache-aware, and 2 cache-oblivious algorithms implementing priority queues found that: * Cache-oblivious algorithms performed worse than RAM-based and cache-aware algorithms when data fits into main memory. * The cache-aware algorithm did not seem significantly more complex to implement than the cache-oblivious algorithms, and offered the best performance in all cases tested in the study. * Cache oblivious algorithms outperformed RAM-based algorithms when data size exceeded the size of main memory. Another study compared

hash table In computing, a hash table, also known as hash map, is a data structure that implements an associative array or dictionary. It is an abstract data type that maps keys to values. A hash table uses a hash function to compute an ''index'', ...

s (as RAM-based or cache-unaware),

B-tree In computer science, a B-tree is a self-balancing tree data structure that maintains sorted data and allows searches, sequential access, insertions, and deletions in logarithmic time. The B-tree generalizes the binary search tree, allowing for ...

s (as cache-aware), and a cache-oblivious data structure referred to as a "Bender set". For both execution time and memory usage, the hash table was best, followed by the B-tree, with the Bender set the worst in all cases. The memory usage for all tests did not exceed main memory. The hash tables were described as easy to implement, while the Bender set "required a greater amount of effort to implement correctly".

References

{{Reflist Models of computation Cache (computing) Analysis of algorithms

History

Idealized cache model

Examples

Practicality

See also

References