In-order Processor
   HOME

TheInfoList



OR:

In computer engineering, out-of-order execution (or more formally dynamic execution) is a paradigm used in most high-performance
central processing unit A central processing unit (CPU), also called a central processor, main processor or just Processor (computing), processor, is the electronic circuitry that executes Instruction (computing), instructions comprising a computer program. The CPU per ...
s to make use of
instruction cycle The instruction cycle (also known as the fetch–decode–execute cycle, or simply the fetch-execute cycle) is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instruction ...
s that would otherwise be wasted. In this paradigm, a processor executes
instructions Instruction or instructions may refer to: Computing * Instruction, one operation of a processor within a computer architecture instruction set * Computer program, a collection of instructions Music * Instruction (band), a 2002 rock band from Ne ...
in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.


History

Out-of-order execution is a restricted form of
data flow In computing, dataflow is a broad concept, which has various meanings depending on the application and context. In the context of software architecture, data flow relates to stream processing or reactive programming. Software architecture Dataf ...
computation, which was a major research area in
computer architecture In computer engineering, computer architecture is a description of the structure of a computer system made from component parts. It can sometimes be a high-level description that ignores details of the implementation. At a more detailed level, the ...
in the 1970s and early 1980s. The first machine to use out-of-order execution was the
CDC 6600 The CDC 6600 was the flagship of the 6000 series of mainframe computer systems manufactured by Control Data Corporation. Generally considered to be the first successful supercomputer, it outperformed the industry's prior recordholder, the IBM ...
(1964), designed by James E. Thornton, which uses a
scoreboard A scoreboard is a large board for publicly displaying the score in a game. Most levels of sport from high school and above use at least one scoreboard for keeping score, measuring time, and displaying statistics. Scoreboards in the past used ...
to avoid conflicts. It permits an instruction to execute if its source operand (read) addresses aren't to be written to by any unexecuted earlier instruction (true dependency) and the destination (write) address not be an address used by any unexecuted earlier instruction (false dependency). The 6600 lacks the means to avoid stalling an
execution unit In computer engineering, an execution unit (E-unit or EU) is a part of the central processing unit (CPU) that performs the operations and calculations as instructed by the computer program. It may have its own internal control sequence unit (not ...
on false dependencies ( write after write (WAW) and write after read (WAR) conflicts, respectively termed "first order conflict" and "third order conflict" by Thornton, who termed true dependencies (
read after write Direct read after write is a procedure that compares data recorded onto a medium against the source Source may refer to: Research * Historical document * Historical source * Source (intelligence) or sub source, typically a confidential provider ...
(RAW)) as "second order conflict") because each address has only a single location referable by it. The WAW is worse than WAR for the 6600, because when an execution unit encounters a WAR, the other execution units still receive and execute instructions, but upon a WAW the assignment of instructions to execution units stops, and they can not receive any further instructions until the WAW-causing instruction's destination register has been written to by earlier instruction. About two years later, the
IBM System/360 Model 91 The IBM System/360 Model 91 was announced in 1964 as a competitor to the CDC 6600. Functionally, the Model 91 ran like any other large-scale System/360, but the internal organization was the most advanced of the System/360 line, and it was the ...
(1966) introduced
register renaming In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a partic ...
with Tomasulo's algorithm, which dissolves false dependencies (WAW and WAR), making full out-of-order execution possible. An instruction addressing a write into a register ''rn'' can be executed before an earlier instruction using the register ''rn'' is executed, by actually writing into an alternative (renamed) register ''alt-rn'', which is turned into a normal ("architectural") register ''rn'' only when all the earlier instructions addressing ''rn'' have been executed, but until then ''rn'' is given for earlier instructions and ''alt-rn'' for later ones addressing ''rn''. Another advantage the Model 91 has over the 6600 is the ability to execute out-of-order the instructions ''at the same
execution unit In computer engineering, an execution unit (E-unit or EU) is a part of the central processing unit (CPU) that performs the operations and calculations as instructed by the computer program. It may have its own internal control sequence unit (not ...
'', not just between the units like the 6600. This is accomplished by
reservation station A unified reservation station, also known as unified scheduler, is a decentralized feature of the microarchitecture of a CPU that allows for register renaming, and is used by the Tomasulo algorithm for dynamic instruction scheduling. Reservat ...
s, from which instructions go to the execution unit when ready, as opposed to the FIFO queue of each execution unit of the 6600. Only the floating-point registers of the Model 91 are renamed, making it subject to the same WAW and WAR limitations as the CDC 6600 when running fixed-point code. The 91 and 6600 both also suffer from imprecise exceptions, which needed to be solved before out-of-order execution would be used outside supercomputers. To have precise exceptions, the proper in-order state of the program's execution must be available upon an exception. By 1985 various approaches were developed as described by James E. Smith and Andrew R. Pleszkun.
(Expanded version published in May 1988 a
''Implementing Precise Interrupts in Pipelined Processors''
)
The CDC Cyber 205 was a precursor, as upon a virtual memory interrupt the entire state of the processor (including the information on the partially executed instructions) is saved into an ''invisible exchange package'', so that it can resume at the same state of execution. However to make all exceptions precise, there has to be a way to cancel the effects of instructions. The CDC Cyber 990 (1984) implements precise interrupts by using a history buffer, which holds the old (overwritten) values of registers that are restored when an exception necessitates the reverting of instructions. Smith simulated that adding a reorder buffer (or history buffer or equivalent) to Cray-1S would reduce the performance of executing the first 14 Livermore loops (unvectorized) by only 3%. Important academic research in this subject was led by Yale Patt with his HPSm simulator. With the
POWER1 The POWER1 is a multi-chip CPU developed and fabricated by IBM that implemented the POWER instruction set architecture (ISA). It was originally known as the RISC System/6000 CPU or, when in an abbreviated form, the RS/6000 CPU, before introdu ...
(1990) IBM returned to out-of-order execution. It was the first processor to combine register renaming (though again only floating-point registers) with precise exceptions. It uses a ''physical register file'' (i.e. a dynamically remapped file with both uncommitted and committed values) instead of a datafull reorder buffer, but the ability to cancel instructions is needed only in the branch unit, which implements a history buffer (named ''program counter stack'' by IBM) to undo changes to count, link, and condition registers. The reordering capability of even the floating-point instructions is still very limited; due to POWER1's inability to reorder floating-point arithmetic instructions (results became available in-order), their destination registers aren't renamed. POWER1 also doesn't have
reservation station A unified reservation station, also known as unified scheduler, is a decentralized feature of the microarchitecture of a CPU that allows for register renaming, and is used by the Tomasulo algorithm for dynamic instruction scheduling. Reservat ...
s needed for out-of-order use of a same execution unit. The next year IBM's
ES/9000 The IBM System/390 is a discontinued mainframe product family implementing the ESA/390, the fifth generation of the System/360 instruction set architecture. The first computers to use the ESA/390 were the Enterprise System/9000 (ES/9000) ...
model 900 had register renaming also for the general-purpose registers. It also has
reservation station A unified reservation station, also known as unified scheduler, is a decentralized feature of the microarchitecture of a CPU that allows for register renaming, and is used by the Tomasulo algorithm for dynamic instruction scheduling. Reservat ...
s with six entries for the dual integer unit (each cycle, from the six instructions up to two can be selected and then executed) and six entries for the FPU. Other units have simple FIFO queues. The re-ordering distance is up to 32 instructions. The first superscalar single-chip processors (
Intel i960 Intel's i960 (or 80960) was a RISC-based microprocessor design that became popular during the early 1990s as an embedded microcontroller. It became a best-selling CPU in that segment, along with the competing AMD 29000. In spite of its success, ...
CA in 1989) used a simple
scoreboarding Scoreboarding is a centralized method, first used in the CDC 6600 computer, for dynamically scheduling instructions so that they can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data depende ...
scheduling like the CDC 6600 had quarter of a century earlier, but in 1992-1996 a rapid advancement of techniques, enabled by increasing transistor counts, saw proliferation down to personal computers.
Motorola 88110 The MC88110 was a microprocessor developed by Motorola that implemented the 88000 instruction set architecture (ISA). The MC88110 was a second-generation implementation of the 88000 ISA, succeeding the MC88100. It was designed for use in personal ...
(1992) used a history buffer to revert instructions. Loads could be executed ahead of preceding stores. While stores and branches were waiting to start execution, subsequent instructions of other types could keep flowing through all the pipeline stages, including writeback. The 12-entry capacity of the history buffer placed a limit on the reorder distance.
PowerPC 601 The PowerPC 600 family was the first family of PowerPC processors built. They were designed at the Somerset facility in Austin, Texas, jointly funded and staffed by engineers from IBM and Motorola as a part of the AIM alliance. Somerset was opened ...
(1993) was an evolution of the
RISC Single Chip The RISC Single Chip, or RSC, is a single-chip microprocessor developed and fabricated by International Business Machines (IBM). The RSC was a feature-reduced single-chip implementation of the POWER1, a multi-chip central processing unit (CPU) w ...
, itself a simplification of POWER1. The 601 permitted branch and floating-point instructions to overtake the integer instructions already in the fetched-instruction-queue, the lowest four entries of which were scanned for dispatchability. In the case of a cache miss, loads and stores could be reordered. Only the link and count register could be renamed. In the fall of 1994 NexGen and IBM with Motorola brought the renaming of general-purpose registers to single-chip CPUs. NexGen's Nx586 was the first
x86 x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was intr ...
processor capable of out-of-order execution, accomplished with
micro-OPs In computer central processing units, micro-operations (also known as micro-ops or μops, historically also as micro-actions) are detailed low-level instructions used in some designs to implement complex machine instructions (sometimes termed m ...
. The re-ordering distance is up to 14 micro-OPs.
PowerPC 603 The PowerPC 600 family was the first family of PowerPC processors built. They were designed at the Somerset facility in Austin, Texas, jointly funded and staffed by engineers from IBM and Motorola as a part of the AIM alliance. Somerset was opened ...
renamed both the general-purpose and FP registers. Each of the four non-branch execution units can have one instruction wait in front of it without blocking the instruction flow to the other units. A five-entry re-order buffer lets no more than four instructions to overtake an unexecuted instruction. Due to a store buffer, a load can access cache ahead of a preceding store.
PowerPC 604 The PowerPC 600 family was the first family of PowerPC processors built. They were designed at the Somerset facility in Austin, Texas, jointly funded and staffed by engineers from IBM and Motorola as a part of the AIM alliance. Somerset was opened ...
(1995) was the first single-chip processor with
execution unit In computer engineering, an execution unit (E-unit or EU) is a part of the central processing unit (CPU) that performs the operations and calculations as instructed by the computer program. It may have its own internal control sequence unit (not ...
-level re-ordering, as three out of its six units each had a two-entry reservation station permitting the newer entry to execute before the older. The re-order buffer capacity is 16 instructions. A four-entry load queue and a six-entry store queue track the re-ordering of loads and stores upon cache misses.
HAL SPARC64 SPARC64 is a microprocessor developed by HAL Computer Systems and fabricated by Fujitsu. It implements the SPARC V9 instruction set architecture (ISA), the first microprocessor to do so. SPARC64 was HAL's first microprocessor and was the first ...
(1995) exceeded the re-ordering capacity of the
ES/9000 The IBM System/390 is a discontinued mainframe product family implementing the ESA/390, the fifth generation of the System/360 instruction set architecture. The first computers to use the ESA/390 were the Enterprise System/9000 (ES/9000) ...
model 900 by having three 8-entry reservation stations for integer, floating-point, and
address generation unit The address generation unit (AGU), sometimes also called address computation unit (ACU), is an execution unit inside central processing units (CPUs) that calculates addresses used by the CPU to access main memory. By having address calculations ...
, and a 12-entry reservation station for load/store, which permits greater reordering of cache/memory access than preceding processors. Up to 64 instructions can be in a re-ordered state at a time
Pentium Pro The Pentium Pro is a sixth-generation x86 microprocessor developed and manufactured by Intel and introduced on November 1, 1995. It introduced the P6 microarchitecture (sometimes termed i686) and was originally intended to replace the original ...
(1995) introduced a '' unified reservation station'', which at the 20 micro-OP capacity permitted very flexible re-ordering, backed by a 40-entry re-order buffer. Loads can be re-ordered ahead of both loads and stores. The practically attainable per-cycle rate of execution rose more as full out-of-order execution was further adopted by
SGI SGI may refer to: Companies *Saskatchewan Government Insurance *Scientific Games International, a gambling company *Silicon Graphics, Inc., a former manufacturer of high-performance computing products *Silicon Graphics International, formerly Rac ...
/ MIPS (
R10000 The R10000, code-named "T5", is a RISC microprocessor implementation of the MIPS IV instruction set architecture (ISA) developed by MIPS Technologies, Inc. (MTI), then a division of Silicon Graphics, Inc. (SGI). The chief designers are Chris Rowe ...
) and HP
PA-RISC PA-RISC is an instruction set architecture (ISA) developed by Hewlett-Packard. As the name implies, it is a reduced instruction set computer (RISC) architecture, where the PA stands for Precision Architecture. The design is also referred to as ...
(
PA-8000 The PA-8000 (PCX-U), code-named ''Onyx'', is a microprocessor developed and fabricated by Hewlett-Packard (HP) that implemented the PA-RISC 2.0 instruction set architecture (ISA). Hunt 1995 It was a completely new design with no circuitry deriv ...
) in 1996. The same year
Cyrix 6x86 The Cyrix 6x86 is a line of sixth-generation, 32-bit x86 microprocessors designed and released by Cyrix in 1995. Cyrix, being a fabless company, had the chips manufactured by IBM and SGS-Thomson. The 6x86 was made as a direct competitor to In ...
and
AMD K5 The K5 is AMD's first x86 processor to be developed entirely in-house. Introduced in March 1996, its primary competition was Intel's Pentium microprocessor. The K5 was an ambitious design, closer to a Pentium Pro than a Pentium regarding techn ...
brought advanced re-ordering techniques into mainstream
personal computer A personal computer (PC) is a multi-purpose microcomputer whose size, capabilities, and price make it feasible for individual use. Personal computers are intended to be operated directly by an end user, rather than by a computer expert or tec ...
s. Since
DEC Alpha Alpha (original name Alpha AXP) is a 64-bit reduced instruction set computer (RISC) instruction set architecture (ISA) developed by Digital Equipment Corporation (DEC). Alpha was designed to replace 32-bit VAX complex instruction set compute ...
gained out-of-order execution in 1998 (
Alpha 21264 The Alpha 21264 is a Digital Equipment Corporation RISC microprocessor launched on 19 October 1998. The 21264 implemented the Alpha instruction set architecture (ISA). Description The Alpha 21264 is a four-issue superscalar microprocessor with o ...
), the top-performing out-of-order processor cores have been unmatched by in-order cores other than HP/
Intel Intel Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California. It is the world's largest semiconductor chip manufacturer by revenue, and is one of the developers of the x86 seri ...
Itanium Itanium ( ) is a discontinued family of 64-bit Intel microprocessors that implement the Intel Itanium architecture (formerly called IA-64). Launched in June 2001, Intel marketed the processors for enterprise servers and high-performance comput ...
2 and IBM
POWER6 The POWER6 is a microprocessor developed by IBM that implemented the Power ISA v.2.03. When it became available in systems in 2007, it succeeded the POWER5+ as IBM's flagship Power microprocessor. It is claimed to be part of the eCLipz projec ...
, though the latter had an out-of-order floating-point unit. The other high-end in-order processors fell far behind, namely
Sun The Sun is the star at the center of the Solar System. It is a nearly perfect ball of hot plasma, heated to incandescence by nuclear fusion reactions in its core. The Sun radiates this energy mainly as light, ultraviolet, and infrared radi ...
's UltraSPARC III/ IV, and IBM's
mainframes A mainframe computer, informally called a mainframe or big iron, is a computer used primarily by large organizations for critical applications like bulk data processing for tasks such as censuses, industry and consumer statistics, enterprise ...
which had lost the out-of-order execution capability for the second time, remaining in-order into the z10 generation. Later big in-order processors were focused on multithreaded performance, but eventually the
SPARC T series The SPARC T-series family of RISC processors and server computers, based on the SPARC V9 architecture, was originally developed by Sun Microsystems, and later by Oracle Corporation after its acquisition of Sun. Its distinguishing feature from e ...
and
Xeon Phi Xeon Phi was a series of x86 manycore processors designed and made by Intel. It was intended for use in supercomputers, servers, and high-end workstations. Its architecture allowed use of standard programming languages and application program ...
changed to out-of-order execution in 2011 and 2016 respectively. Almost all processors for phones and other lower-end applications remained in-order until c. 2010. First, Qualcomm's
Scorpion Scorpions are predatory arachnids of the order Scorpiones. They have eight legs, and are easily recognized by a pair of grasping pincers and a narrow, segmented tail, often carried in a characteristic forward curve over the back and always en ...
(re-ordering distance of 32) shipped in
Snapdragon ''Antirrhinum'' is a genus of plants commonly known as dragon flowers, snapdragons and dog flower because of the flowers' fancied resemblance to the face of a dragon that opens and closes its mouth when laterally squeezed. They are native to r ...
, and a bit later
Arm In human anatomy, the arm refers to the upper limb in common usage, although academically the term specifically means the upper arm between the glenohumeral joint (shoulder joint) and the elbow joint. The distal part of the upper limb between th ...
's A9 succeeded A8. For low-end
x86 x86 (also known as 80x86 or the 8086 family) is a family of complex instruction set computer (CISC) instruction set architectures initially developed by Intel based on the Intel 8086 microprocessor and its 8088 variant. The 8086 was intr ...
personal computer A personal computer (PC) is a multi-purpose microcomputer whose size, capabilities, and price make it feasible for individual use. Personal computers are intended to be operated directly by an end user, rather than by a computer expert or tec ...
s in-order early Intel Atoms were first challenged by
AMD Advanced Micro Devices, Inc. (AMD) is an American multinational semiconductor company based in Santa Clara, California, that develops computer processors and related technologies for business and consumer markets. While it initially manufactur ...
's Bobcat, and in 2013 were succeeded by an out-of-order
Silvermont Silvermont is a microarchitecture for low-power Atom, Celeron and Pentium branded processors used in systems on a chip (SoCs) made by Intel. Silvermont forms the basis for a total of four SoC families: * ''Merrifield'' and ''Moorefield'' cons ...
. Because the complexity of out-of-order execution precludes achieving the lowest minimum power consumption, cost and size, in-order execution is still prevalent in microcontrollers and
embedded system An embedded system is a computer system—a combination of a computer processor, computer memory, and input/output peripheral devices—that has a dedicated function within a larger mechanical or electronic system. It is ''embedded'' ...
s, as well as in phone-class cores such as Arm's A55 and A510 in
big.LITTLE ARM big.LITTLE is a heterogeneous computing architecture developed by ARM Holdings, coupling relatively battery-saving and slower processor cores (''LITTLE'') with relatively more powerful and power-hungry ones (''big''). Typically, only one "s ...
configurations.


Basic concept

To appreciate OoO Execution it is useful to first describe in-order, to be able to make a comparison of the two. Instructions cannot be completed instantaneously: they take time (multiple cycles). Therefore, results will lag behind where they are needed. In-order still has to keep track of the dependencies. Its approach is however quite unsophisticated: stall, every time. OoO uses much more sophisticated data tracking techniques, as seen below.


In-order processors

In earlier processors, the processing of instructions is performed in an
instruction cycle The instruction cycle (also known as the fetch–decode–execute cycle, or simply the fetch-execute cycle) is the cycle that the central processing unit (CPU) follows from boot-up until the computer has shut down in order to process instruction ...
normally consisting of the following steps: # Instruction
fetch Fetch may refer to: Books * ''Fetch'', a 2012 book by Alan MacDonald and David Roberts * ''The Fetch'', a 2006 book by Chris Humphreys * ''The Fetch'', a 2009 book by Laura Whitcomb * ''The Fetch'', a 1991 book by Robert Holdstock * ''Fazbear ...
. # If input
operand In mathematics, an operand is the object of a mathematical operation, i.e., it is the object or quantity that is operated on. Example The following arithmetic expression shows an example of operators and operands: :3 + 6 = 9 In the above exam ...
s are available (in processor registers, for instance), the instruction is dispatched to the appropriate
functional unit In computer engineering, an execution unit (E-unit or EU) is a part of the central processing unit (CPU) that performs the operations and calculations as instructed by the computer program. It may have its own internal control sequence unit (not ...
. If one or more operands are unavailable during the current clock cycle (generally because they are being fetched from
memory Memory is the faculty of the mind by which data or information is encoded, stored, and retrieved when needed. It is the retention of information over time for the purpose of influencing future action. If past events could not be remembered ...
), the processor stalls until they are available. # The instruction is executed by the appropriate functional unit. # The functional unit writes the results back to the register file. Often, an in-order processor will have a straightforward "
bit vector A bit array (also known as bitmask, bit map, bit set, bit string, or bit vector) is an array data structure that compactly stores bits. It can be used to implement a simple set data structure. A bit array is effective at exploiting bit-level p ...
" which records which registers a pipeline that it will (eventually) write to. If any input operands have the corresponding bit set in this vector, the instruction stalls. Essentially, the vector performs a greatly simplified role of protecting against register hazards. Thus out-of-order execution uses 2D matrices whereas in-order execution uses a 1D vector for hazard avoidance.


Out-of-order processors

This new paradigm breaks up the processing of instructions into these steps: # Instruction fetch. # Instruction dispatch to an instruction queue (also called instruction buffer or
reservation station A unified reservation station, also known as unified scheduler, is a decentralized feature of the microarchitecture of a CPU that allows for register renaming, and is used by the Tomasulo algorithm for dynamic instruction scheduling. Reservat ...
s). # The instruction waits in the queue until its input operands are available. The instruction can leave the queue before older instructions. # The instruction is issued to the appropriate functional unit and executed by that unit. # The results are queued. # Only after all older instructions have their results written back to the register file, then this result is written back to the register file. This is called the graduation or retire stage. The key concept of OoOE processing is to allow the processor to avoid a class of stalls that occur when the data needed to perform an operation are unavailable. In the outline above, the OoOE processor avoids the stall that occurs in step (2) of the in-order processor when the instruction is not completely ready to be processed due to missing data. OoOE processors fill these "slots" in time with other instructions that ''are'' ready, then re-order the results at the end to make it appear that the instructions were processed as normal. The way the instructions are ordered in the original computer code is known as ''program order'', in the processor they are handled in ''data order'', the order in which the data, operands, become available in the processor's registers. Fairly complex circuitry is needed to convert from one ordering to the other and maintain a logical ordering of the output; the processor itself runs the instructions in seemingly random order. The benefit of OoOE processing grows as the
instruction pipeline In computer engineering, instruction pipelining or ILP is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing inco ...
deepens and the speed difference between main memory (or
cache memory In computing, a cache ( ) is a hardware or software component that stores data so that future requests for that data can be served faster; the data stored in a cache might be the result of an earlier computation or a copy of data stored elsewher ...
) and the processor widens. On modern machines, the processor runs many times faster than the memory, so during the time an in-order processor spends waiting for data to arrive, it could have processed a large number of instructions.


Dispatch and issue decoupling allows out-of-order issue

One of the differences created by the new paradigm is the creation of queues that allows the dispatch step to be decoupled from the issue step and the graduation stage to be decoupled from the execute stage. An early name for the paradigm was ''decoupled architecture''. In the earlier ''in-order'' processors, these stages operated in a fairly lock-step, pipelined fashion. The instructions of the program may not be run in the originally specified order, as long as the end result is correct. It separates the fetch and decode stages from the execute stage in a pipelined processor by using a
buffer Buffer may refer to: Science * Buffer gas, an inert or nonflammable gas * Buffer solution, a solution used to prevent changes in pH * Buffering agent, the weak acid or base in a buffer solution * Lysis buffer, in cell biology * Metal ion buffer * ...
. The buffer's purpose is to partition the memory access and execute functions in a computer program and achieve high-performance by exploiting the fine-grain parallelism between the two. In doing so, it effectively hides all
memory latency ''Memory latency'' is the time (the latency) between initiating a request for a byte or word in memory until it is retrieved by a processor. If the data are not in the processor's cache, it takes longer to obtain them, as the processor will hav ...
from the processor's perspective. A larger buffer can, in theory, increase throughput. However, if the processor has a
branch misprediction In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively. The purpose of the branch predictor is to improve the flow i ...
then the entire buffer may need to be flushed, wasting a lot of
clock cycle In electronics and especially synchronous digital circuits, a clock signal (historically also known as ''logic beat'') oscillates between a high and a low state and is used like a metronome to coordinate actions of digital circuits. A clock sig ...
s and reducing the effectiveness. Furthermore, larger buffers create more heat and use more die space. For this reason processor designers today favour a
multi-threaded In computer science, a thread of execution is the smallest sequence of programmed instructions that can be managed independently by a scheduler, which is typically a part of the operating system. The implementation of threads and processes dif ...
design approach. Decoupled architectures are generally thought of as not useful for general purpose computing as they do not handle control intensive code well. Control intensive code include such things as nested branches that occur frequently in
operating system An operating system (OS) is system software that manages computer hardware, software resources, and provides common services for computer programs. Time-sharing operating systems schedule tasks for efficient use of the system and may also i ...
kernels Kernel may refer to: Computing * Kernel (operating system), the central component of most operating systems * Kernel (image processing), a matrix used for image convolution * Compute kernel, in GPGPU programming * Kernel method, in machine learnin ...
. Decoupled architectures play an important role in scheduling in
very long instruction word Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units (CPU, processor) mostly allow programs to specify instructions to exe ...
(VLIW) architectures. To avoid false operand dependencies, which would decrease the frequency when instructions could be issued out of order, a technique called
register renaming In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a partic ...
is used. In this scheme, there are more physical registers than defined by the architecture. The physical registers are tagged so that multiple versions of the same architectural register can exist at the same time.


Execute and writeback decoupling allows program restart

The queue for results is necessary to resolve issues such as branch mispredictions and exceptions/traps. The results queue allows programs to be restarted after an exception, which requires the instructions to be completed in program order. The queue allows results to be discarded due to mispredictions on older branch instructions and exceptions taken on older instructions. The ability to issue instructions past branches that are yet to resolve is known as
speculative execution Speculative execution is an optimization technique where a computer system performs some task that may not be needed. Work is done before it is known whether it is actually needed, so as to prevent a delay that would have to be incurred by doing ...
.


Micro-architectural choices

* Are the instructions dispatched to a centralized queue or to multiple distributed queues? : IBM PowerPC processors use queues that are distributed among the different functional units while other out-of-order processors use a centralized queue. IBM uses the term ''reservation stations'' for their distributed queues. * Is there an actual results queue or are the results written directly into a register file? For the latter, the queueing function is handled by register maps that hold the register renaming information for each instruction in flight. :Early Intel out-of-order processors use a results queue called a re-order buffer, while most later out-of-order processors use register maps. :More precisely: Intel P6 family microprocessors have both a re-order buffer (ROB) and a register alias table (RAT). The ROB was motivated mainly by branch misprediction recovery. :The Intel P6 family is among the earliest OoOE microprocessors but were supplanted by the
NetBurst The NetBurst microarchitecture, called P68 inside Intel, was the successor to the P6 microarchitecture in the x86 family of central processing units (CPUs) made by Intel. The first CPU to use this architecture was the Willamette-core Pentium ...
architecture. Years later, Netburst proved to be a dead end due to its long pipeline that assumed the possibility of much higher operating frequencies. Materials were not able to match the design's ambitious clock targets due to thermal issues and later designs based on NetBurst, namely Tejas and Jayhawk, were cancelled. Intel reverted to the P6 design as the basis of the
Core Core or cores may refer to: Science and technology * Core (anatomy), everything except the appendages * Core (manufacturing), used in casting and molding * Core (optical fiber), the signal-carrying portion of an optical fiber * Core, the centra ...
and Nehalem microarchitectures. The succeeding
Sandy Bridge Sandy Bridge is the codename for Intel's 32 nm microarchitecture used in the second generation of the Intel Core processors (Core i7, i5, i3). The Sandy Bridge microarchitecture is the successor to Nehalem and Westmere microarchitecture. ...
, Ivy Bridge, and Haswell microarchitectures are a departure from the reordering techniques used in P6 and employ re-ordering techniques from the EV6 and the P4 but with a somewhat shorter pipeline.


See also

*
Dataflow architecture Dataflow architecture is a dataflow-based computer architecture that directly contrasts the traditional von Neumann architecture or control flow architecture. Dataflow architectures have no program counter, in concept: the executability and executi ...
* Memory fence * Replay system *
Scoreboarding Scoreboarding is a centralized method, first used in the CDC 6600 computer, for dynamically scheduling instructions so that they can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data depende ...
* Shelving buffer *
Tomasulo algorithm Tomasulo's algorithm is a computer architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution and enables more efficient use of multiple execution units. It was developed by Robert Tomasulo at IBM in ...


References

*


Further reading

* {{DEFAULTSORT:Out-Of-Order Execution Instruction processing