Cache prefetching is a technique used by computer processors to boost execution performance by fetching instructions or data from their original storage in slower memory to a faster local memory before it is actually needed (hence the term 'prefetch').
Most modern computer processors have fast and local
cache memory
In computing, a cache ( ) is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewher ...
in which prefetched data is held until it is required. The source for the prefetch operation is usually
main memory. Because of their design, accessing
cache memories is typically much faster than accessing
main memory, so prefetching data and then accessing it from caches is usually many orders of magnitude faster than accessing it directly from
main memory. Prefetching can be done with non-blocking
cache control instruction
In computing, a cache control instruction is a hint embedded in the instruction stream of a processor intended to improve the performance of hardware caches, using foreknowledge of the memory access pattern supplied by the programmer or compiler ...
s.
Data vs. instruction cache prefetching
Cache prefetching can either fetch data or instructions into cache.
* Data prefetching fetches data before it is needed. Because data access patterns show less regularity than instruction patterns, accurate data prefetching is generally more challenging than instruction prefetching.
* Instruction prefetching fetches instructions before they need to be executed. The first mainstream microprocessors to use some form of instruction prefetch were the
Intel
Intel Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California. It is the world's largest semiconductor chip manufacturer by revenue, and is one of the developers of the x86 seri ...
8086
The 8086 (also called iAPX 86) is a 16-bit microprocessor chip designed by Intel between early 1976 and June 8, 1978, when it was released. The Intel 8088, released July 1, 1979, is a slightly modified chip with an external 8-bit data bus (allowi ...
(six bytes) and the
Motorola
Motorola, Inc. () was an American multinational telecommunications company based in Schaumburg, Illinois, United States. After having lost $4.3 billion from 2007 to 2009, the company split into two independent public companies, Motorol ...
68000
The Motorola 68000 (sometimes shortened to Motorola 68k or m68k and usually pronounced "sixty-eight-thousand") is a 16/32-bit complex instruction set computer (CISC) microprocessor, introduced in 1979 by Motorola Semiconductor Products Secto ...
(four bytes). In recent years, all high-performance processors use prefetching techniques.
Hardware vs. software cache prefetching
Cache prefetching can be accomplished either by hardware or by software.
* Hardware based prefetching is typically accomplished by having a dedicated hardware mechanism in the processor that watches the stream of instructions or data being requested by the executing program, recognizes the next few elements that the program might need based on this stream and prefetches into the processor's cache.
* Software based prefetching is typically accomplished by having the compiler analyze the code and insert additional "prefetch" instructions in the program during compilation itself.
Methods of hardware prefetching
Stream buffers
* Stream buffers were developed based on the concept of "one block lookahead (OBL) scheme" proposed by
Alan Jay Smith.
* Stream
buffers are one of the most common hardware based prefetching techniques in use. This technique was originally proposed by
Norman Jouppi
Norman Paul Jouppi is an American electrical engineer and computer scientist.
Career
Jouppi was one of the computer architects at the MIPS Stanford University Project (under John L. Hennessy), an early RISC project. He received his master's d ...
in 1990
and many variations of this method have been developed since.
The basic idea is that the
cache miss
In computing, a cache ( ) is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewher ...
address (and
subsequent addresses) are fetched into a separate buffer of depth
. This buffer is called a stream buffer and is separate from cache. The processor then consumes data/instructions from the stream buffer if the address associated with the prefetched blocks match the requested address generated by the program executing on the processor. The figure below illustrates this setup:
* Whenever the prefetch mechanism detects a miss on a memory block, say A, it allocates a stream to begin prefetching successive blocks from the missed block onward. If the stream buffer can hold 4 blocks, then we would prefetch A+1, A+2, A+3, A+4 and hold those in the allocated stream buffer. If the processor consumes A+1 next, then it shall be moved "up" from the stream buffer to the processor's cache. The first entry of the stream buffer would now be A+2 and so on. This pattern of prefetching successive blocks is called Sequential Prefetching. It is mainly used when contiguous locations are to be prefetched. For example, it is used when prefetching instructions.
* This mechanism can be scaled up by adding multiple such 'stream buffers' - each of which would maintain a separate prefetch stream. For each new miss, there would be a new stream buffer allocated and it would operate in a similar way as described above.
* The ideal depth of the stream buffer is something that is subject to experimentation against various benchmarks
and depends on the rest of the
microarchitecture involved.
Strided prefetching
This type of prefetching monitors the delta between the addresses of the memory accesses and looks for patterns within it.
Regular strides
In this pattern, consecutive memory accesses are made to blocks that are
addresses apart.
In this case, the prefetcher calculates the
and uses it to compute the memory address for prefetching. Eg: If the
is 4, the address to be prefetched would A+4.
Irregular strides
In this case, the delta between the addresses of consecutive memory accesses is variable but still follows a pattern. Some prefetchers designs
exploit this property to predict and prefetch for future accesses.
Temporal prefetching
This class of prefetchers look for memory access streams that repeat over time. Eg. In this stream of memory accesses: N, A, B, C, E, G, H, A, B, C, I, J, K, A, B, C, L, M, N, O, A, B, C, ....; the stream A,B,C is repeating over time. Other design variation have tried to provide more efficient, performant implementations.
Collaborative prefetching
Computer applications generate a variety of access patterns. The processor and memory subsystem architectures used to execute these applications further disambiguate the memory access patterns they generate. Hence, the effectiveness and efficiency of prefetching schemes often depend the application and the architectures used to execute them. Recent research has focussed on building collaborative mechanisms to synergistically use multiple prefetching schemes for better prefetching coverage and accuracy.
Methods of software prefetching
Compiler directed prefetching
Compiler directed prefetching is widely used within loops with a large number of iterations. In this technique, the compiler predicts future cache misses and inserts a prefetch instruction based on the
miss penalty and execution time of the instructions.
These prefetches are non-blocking memory operations, i.e. these memory accesses do not interfere with actual memory accesses. They do not change the state of the processor or cause page faults.
One main advantage of software prefetching is that it reduces the number of compulsory cache misses.
The following example shows how a prefetch instruction would be added into code to improve
cache performance.
Consider a for loop as shown below:
for (int i=0; i<1024; i++)
At each iteration, the i
th element of the array "array1" is accessed. Therefore, we can prefetch the elements that are going to be accessed in future iterations by inserting a "prefetch" instruction as shown below:
for (int i=0; i<1024; i++)
Here, the prefetch stride,
depends on two factors, the cache miss penalty and the time it takes to execute a single iteration of the ''for'' loop. For instance, if one iteration of the loop takes 7 cycles to execute, and the cache miss penalty is 49 cycles then we should have
- which means that we prefetch 7 elements ahead. With the first iteration, i will be 0, so we prefetch the 7th element. Now, with this arrangement, the first 7 accesses (i=0->6) will still be misses (under the simplifying assumption that each element of array1 is in a separate cache line of its own).
Comparison of hardware and software prefetching
* While software prefetching requires programmer or
compiler
In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs tha ...
intervention, hardware prefetching requires special hardware mechanisms.
* Software prefetching works well only with loops where there is regular array access as the programmer has to hand code the prefetch instructions, whereas hardware prefetchers work dynamically based on the program's behavior at
runtime.
* Hardware prefetching also has less CPU overhead when compared to software prefetching.
Metrics of cache prefetching
There are three main metrics to judge cache prefetching
Coverage
Coverage is the fraction of total misses that are eliminated because of prefetching, i.e.
,
where,
Accuracy
Accuracy is the fraction of total prefetches that were useful - i.e. the ratio of the number of memory addresses prefetched were actually referenced by the program to the total prefetches done.
While it appears that having perfect accuracy might imply that there are no misses, this is not the case. The prefetches themselves might result in new misses if the prefetched blocks are placed directly into the cache. Although these may be a small fraction of the total number of misses we might see without any prefetching, this is a non-zero number of misses.
Timeliness
The qualitative definition of timeliness is how early a block is prefetched versus when it is actually referenced. An example to further explain timeliness is as follows :
Consider a for loop where each iteration takes 3 cycles to execute and the 'prefetch' operation takes 12 cycles. This implies that for the prefetched data to be useful, we must start the prefetch
iterations prior to its usage to maintain timeliness.
See also
*
Prefetch input queue
Fetching the instruction opcodes from program memory well in advance is known as prefetching and it is served by using prefetch input queue (PIQ).The pre-fetched instructions are stored in data structure - namely a queue. The fetching of opcodes ...
*
Link prefetching Link prefetching allows web browsers to pre-load resources. This speeds up both the loading and rendering of web pages. Prefetching was first introduced in HTML5.
Prefetching is accomplished through hints in web pages. These hints are used by the ...
*
Prefetcher
The Prefetcher is a component of Microsoft Windows which was introduced in Windows XP. It is a component of the Memory Manager that can speed up the Windows boot process and shorten the amount of time it takes to start up programs. It accomplish ...
*
Cache control instruction
In computing, a cache control instruction is a hint embedded in the instruction stream of a processor intended to improve the performance of hardware caches, using foreknowledge of the memory access pattern supplied by the programmer or compiler ...
References
{{Reflist, 30em
Cache (computing)