Cache Control Instructions

In computing, a cache control instruction is a hint embedded in the instruction stream of a processor, intended to improve the performance of hardware caches by using foreknowledge of the memory access pattern supplied by the programmer or compiler. Such hints may reduce cache pollution, reduce bandwidth requirements, and hide memory latency by providing better control over the working set. Most cache control instructions do not affect the semantics of a program, although some can.


Examples

Several such instructions, with variants, are supported by several processor instruction set architectures, such as ARM, MIPS, PowerPC, and x86.


Prefetch

Also termed ''data cache block touch'', the effect is to request loading the cache line associated with a given address. This is performed by the PREFETCH instruction in the x86 instruction set. Some variants bypass higher levels of the cache hierarchy, which is useful in a 'streaming' context for data that is traversed once rather than held in the working set. The prefetch should occur sufficiently far ahead in time to mitigate the latency of memory access, for example in a loop traversing memory linearly. The GNU Compiler Collection intrinsic function __builtin_prefetch can be used to invoke this in the programming languages C or C++.
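
A minimal sketch in C of how this intrinsic might be used in a linear traversal; the prefetch distance of 16 elements and the locality argument are illustrative values that would need tuning for a real machine:

    #include <stddef.h>

    /* Sum an array, prefetching a fixed distance ahead of the current element
       so that the data has (hopefully) arrived by the time it is needed. */
    double sum(const double *a, size_t n)
    {
        double total = 0.0;
        for (size_t i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16], 0, 0);  /* read access, low temporal locality */
            total += a[i];
        }
        return total;
    }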


Instruction prefetch

A variant of prefetch that targets the instruction cache rather than the data cache.


Data cache block allocate zero

This hint is used to prepare cache lines before overwriting their contents completely. Because the whole line will be overwritten, the CPU need not load anything from main memory. The semantic effect is equivalent to an aligned memset of a cache-line-sized block to zero, but the operation is effectively free.
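
As an illustrative sketch: on PowerPC this hint corresponds to the dcbz instruction, which zeroes an entire cache line without fetching it from memory. The GCC-style inline-assembly wrapper and the 128-byte line size below are assumptions that would need checking against the target core:

    #include <stddef.h>
    #include <stdint.h>

    #define CACHE_LINE 128  /* assumed line size; it varies between implementations */

    /* Zero a line-aligned, line-multiple buffer without reading it from main memory. */
    static void zero_lines(void *buf, size_t bytes)
    {
        for (size_t off = 0; off < bytes; off += CACHE_LINE) {
            uint8_t *p = (uint8_t *)buf + off;
            __asm__ volatile("dcbz 0,%0" : : "r"(p) : "memory");
        }
    }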


Data cache block invalidate

This hint is used to discard cache lines without writing their contents back to main memory. Care is needed, since incorrect results are possible if modified data is discarded: unlike most other cache hints, this significantly changes the semantics of the program. It is used in conjunction with allocate zero for managing temporary data, saving unneeded main-memory bandwidth and cache pollution.


Data cache block flush

This hint requests the immediate eviction of a cache line, making way for future allocations. It is used when it is known that data is no longer part of the working set.
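
A brief sketch of the x86 equivalent, assuming a compiler that provides the SSE2 intrinsics (_mm_clflush writes a line back if dirty and evicts it); the 64-byte line size is a typical value rather than something guaranteed by the architecture:

    #include <stddef.h>
    #include <emmintrin.h>  /* _mm_clflush, _mm_mfence */

    #define LINE 64  /* common x86 cache-line size; strictly it should be queried at run time */

    /* Evict a buffer from the cache hierarchy after its last use. */
    static void flush_buffer(const void *buf, size_t bytes)
    {
        const char *p = buf;
        for (size_t off = 0; off < bytes; off += LINE)
            _mm_clflush(p + off);
        _mm_mfence();  /* order the flushes relative to later memory operations */
    }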


Other hints

Some processors support a variant of load–store instructions that also imply cache hints. An example is load last in the PowerPC instruction set, which suggests that data will only be used once, i.e., the cache line in question may be pushed to the head of the eviction queue, whilst keeping it in use if still directly needed.


Alternatives


Automatic prefetch

In recent times, cache control instructions have become less popular as increasingly advanced application processor designs from Intel and ARM devote more transistors to accelerating code written in traditional languages, e.g., performing automatic prefetch, with hardware that detects linear access patterns on the fly. However, the techniques may remain valid for throughput-oriented processors, which have a different throughput-versus-latency tradeoff and may prefer to devote more area to execution units.


Scratchpad memory

Some processors support scratchpad memory, into which temporaries may be put, and direct memory access (DMA) to transfer data to and from main memory when needed. This approach is used by the Cell processor and some embedded systems. These allow greater control over memory traffic and locality (as the working set is managed by explicit transfers), and eliminate the need for expensive cache coherency in a manycore machine.

The disadvantage is that significantly different programming techniques are required. It is very hard to adapt programs written in traditional languages such as C and C++, which present the programmer with a uniform view of a large address space (an illusion simulated by caches). A traditional microprocessor can more easily run legacy code, which may then be accelerated by cache control instructions, whilst a scratchpad-based machine requires dedicated coding from the ground up to even function. Cache control instructions are specific to a certain cache-line size, which in practice may vary between generations of processors in the same architectural family. Caches may also help coalesce reads and writes from less predictable access patterns (e.g., during texture mapping), whilst scratchpad DMA requires reworking algorithms for more predictable 'linear' traversals; as such, scratchpads are generally harder to use with traditional programming models, although dataflow models (such as TensorFlow) might be more suitable.


Vector fetch

Vector processors (for example, modern graphics processing units (GPUs) and Xeon Phi) use massive parallelism to achieve high throughput whilst working around memory latency (reducing the need for prefetching). Many read operations are issued in parallel, for subsequent invocations of a compute kernel; calculations may be put on hold awaiting future data, whilst the execution units are devoted to working on data from past requests that has already arrived. This is easier for programmers to leverage in conjunction with the appropriate programming models (compute kernels), but harder to apply to general-purpose programming. The disadvantage is that many copies of temporary state may be held in the local memory of a processing element, awaiting data in flight.

