
In
computer architecture, a trace cache or execution trace cache is a specialized
instruction cache which stores the dynamic stream of
instructions known as trace. It helps in increasing the instruction fetch
bandwidth and decreasing power consumption (in the case of
Intel
Intel Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California, and Delaware General Corporation Law, incorporated in Delaware. Intel designs, manufactures, and sells computer compo ...
Pentium 4
Pentium 4 is a series of single-core central processing unit, CPUs for Desktop computer, desktops, laptops and entry-level Server (computing), servers manufactured by Intel. The processors were shipped from November 20, 2000 until August 8, 20 ...
) by storing traces of instructions that have already been fetched and decoded.
A trace processor is an architecture designed around the trace cache and processes the instructions at trace level granularity. The formal mathematical theory of traces is described by
trace monoids.
Background
The earliest academic publication of trace cache was "Trace Cache: a Low Latency Approach to High Bandwidth Instruction Fetching".
This widely acknowledged paper was presented by Eric Rotenberg, Steve Bennett, and Jim Smith at 1996
International Symposium on Microarchitecture (MICRO) conference. An earlier publication is US patent 5381533, by Alex Peleg and Uri Weiser of Intel, "Dynamic flow instruction cache memory organized around trace segments independent of virtual address line", a continuation of an application filed in 1992, later abandoned.
Necessity
Wider
superscalar processor
A superscalar processor (or multiple-issue processor) is a CPU that implements a form of parallelism called instruction-level parallelism within a single processor. In contrast to a scalar processor, which can execute at most one single ins ...
s demand multiple instructions to be fetched in a single cycle for higher performance. Instructions to be fetched are not always in contiguous memory locations (
basic block
In compiler construction, a basic block is a straight-line code sequence with no branches in except to the entry and no branches out except at the exit. This restricted form makes a basic block highly amenable to analysis. Compilers usually decom ...
s) because of
branch
A branch, also called a ramus in botany, is a stem that grows off from another stem, or when structures like veins in leaves are divided into smaller veins.
History and etymology
In Old English, there are numerous words for branch, includ ...
and
jump instructions. So processors need additional logic and hardware support to fetch and align such instructions from non-contiguous basic blocks. If multiple branches are predicted as ''not-taken'', then processors can fetch instructions from multiple contiguous basic blocks in a single cycle. However, if any of the branches is predicted as ''taken'', then processor should fetch instructions from the taken path in that same cycle. This limits the fetch capability of a processor.

Consider these four basic blocks (
A
,
B
,
C
,
D
) as shown in the figure that correspond to a simple
if-else loop. These blocks will be stored
contiguously as
ABCD
in the memory. If the branch
D
is predicted ''not-taken,'' the fetch unit can fetch the basic blocks
A
,
B
,
C
which are placed contiguously. However, if
D
is predicted ''taken'', the fetch unit has to fetch
A
,
B
,
D
which are non-contiguously placed. Hence, fetching these blocks which are non contiguously placed, in a single cycle will be very difficult. So, in situations like these, the trace cache comes in aid to the processor.
Once fetched, the trace cache stores the instructions in their dynamic sequence. When these instructions are encountered again, the trace cache allows the instruction fetch unit of a processor to fetch several basic blocks from it without having to worry about branches in the execution flow. Instructions will be stored in the trace cache either after they have been decoded, or as they are retired. However, instruction sequence is speculative if they are stored just after decode stage.
Trace structure
A trace, also called a dynamic instruction sequence, is an entry in the trace cache. It can be characterized by ''maximum number of instructions'' and ''maximum basic blocks''. Traces can start at any dynamic instruction. Multiple traces can have same starting instruction i.e., same starting
program counter
The program counter (PC), commonly called the instruction pointer (IP) in Intel x86 and Itanium microprocessors, and sometimes called the instruction address register (IAR), the instruction counter, or just part of the instruction sequencer, ...
(PC) and instructions from different basic blocks as per the branch outcomes. For the figure above, ABC and ABD are valid traces. They both start at the same PC (address of A) and have different basic blocks as per D's prediction.
Traces usually terminate when one of the following occurs:
# Trace got filled with allowable ''maximum number of instructions''
# Trace has allowable maximum basic blocks
# Return instructions
# Indirect branches
# System calls
Trace control information
A single trace will have following information:
* Starting PC - PC of the first instruction in trace
* Branch flag - ( ''maximum basic blocks -1'') branch predictions
* Branch mask - number of branches in the trace and whether trace ends in a branch or not
* Trace fall through - Next PC if last instruction is ''not-taken'' branch or not a branch
* Trace target - address of last branch's taken target
Trace cache design
Following are the factors that need to be considered while designing a trace cache.
* Trace selection policies - ''maximum number of instructions'' and ''maximum basic blocks in a trace''
*
Associativity
In mathematics, the associative property is a property of some binary operations that rearranging the parentheses in an expression will not change the result. In propositional logic, associativity is a Validity (logic), valid rule of replaceme ...
- number of ways a cache can have
* Cache indexing method - concatenation or
XOR with PC bits
* Path associativity - traces with same starting PC but with different basic blocks can be mapped to different sets
* Trace cache fill choices -
*# After decode stage (speculative)
*# After retire stage
A trace cache is not on the critical path of instruction fetch
[Leon Gu; Dipti Motiani (October 2003)]
"Trace Cache"
(PDF). Retrieved2013-10-06.
Hit/miss logic
Trace lines are stored in the trace cache based on the PC of the first instruction in the trace and a set of branch predictions. This allows for storing different trace paths that start on the same address, each representing different branch outcomes. This method of tagging helps to provide path associativity to the trace cache. Other method can include having only starting PC as tag in trace cache. In the instruction fetch stage of a
pipeline
A pipeline is a system of Pipe (fluid conveyance), pipes for long-distance transportation of a liquid or gas, typically to a market area for consumption. The latest data from 2014 gives a total of slightly less than of pipeline in 120 countries ...
, the current PC along with a set of branch predictions is checked in the trace cache for a
hit. If there is a hit, a trace line is supplied to fetch unit which does not have to go to a regular cache or to memory for these instructions. The trace cache continues to feed the fetch unit until the trace line ends or until there is a
misprediction in the pipeline. If there is a miss, a new trace starts to be built.
The Pentium 4's execution trace cache stores
micro-operations
In computer central processing units, micro-operations (also known as micro-ops or μops, historically also as micro-actions) are detailed low-level instructions used in some designs to implement complex machine instructions (sometimes termed ma ...
resulting from decoding
x86 instructions
The x86 instruction set refers to the set of instructions that x86-compatible microprocessors support. The instructions are usually part of an executable program, often stored as a computer file and executed on the processor.
The x86 instruction ...
, providing also the functionality of a micro-operation cache. Having this, the next time an instruction is needed, it does not have to be decoded into micro-ops again.
Disadvantages
The disadvantages of trace cache are:
# Redundant instruction storage between trace cache and instruction cache and within trace cache itself.
# Power inefficiency and hardware complexity
Execution trace cache
Within the L1 cache of the
NetBurst
The NetBurst microarchitecture, called P68 inside Intel, was the successor to the P6 microarchitecture in the x86 family of central processing units (CPUs) made by Intel. The first CPU to use this architecture was the Willamette-core Pentium ...
CPUs, Intel incorporated its execution trace cache.
It stores decoded
micro-operation
In computer central processing units, micro-operations (also known as micro-ops or μops, historically also as micro-actions) are detailed low-level instructions used in some designs to implement complex machine instructions (sometimes termed ma ...
s, so that when executing a new instruction, instead of fetching and decoding the instruction again, the CPU directly accesses the decoded micro-ops from the trace cache, thereby saving considerable time. Moreover, the micro-ops are cached in their predicted path of execution, which means that when instructions are fetched by the CPU from the cache, they are already present in the correct order of execution. Intel later introduced a similar but simpler concept with
Sandy Bridge
Sandy Bridge is the List of Intel codenames, codename for Intel's 32 nm process, 32 nm microarchitecture used in the second generation of the Intel Core, Intel Core processors (Intel Core i7, Core i7, Intel Core i5, i5, Intel Core i3, i3). The Sa ...
called
micro-operation cache
A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, whic ...
(UOP cache).
See also
*
Branch trace Branch trace is a computer program debugging tool or analysis technique. It is an abbreviated instruction trace in which only the successful Branch (computer science), branch instructions are recorded. On IBM System/360 this was implemented as part ...
*
CPU cache
A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, whi ...
*
Instruction cycle The instruction cycle (also known as the fetch–decode–execute cycle, or simply the fetch–execute cycle) is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instructions ...
*
Trace monoid
References
{{Reflist
Cache (computing)
Trace theory