Deep Pipelining
   HOME

TheInfoList



OR:

In
computer engineering Computer engineering (CoE or CpE) is a branch of electrical engineering and computer science that integrates several fields of computer science and electronic engineering required to develop computer hardware and software. Computer engineers ...
, instruction pipelining or ILP is a technique for implementing
instruction-level parallelism Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution. Discu ...
within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming
instructions Instruction or instructions may refer to: Computing * Instruction, one operation of a processor within a computer architecture instruction set * Computer program, a collection of instructions Music * Instruction (band), a 2002 rock band from Ne ...
into a series of sequential steps (the eponymous "
pipeline Pipeline may refer to: Electronics, computers and computing * Pipeline (computing), a chain of data-processing stages or a CPU optimization found on ** Instruction pipelining, a technique for implementing instruction-level parallelism within a s ...
") performed by different processor units with different parts of instructions processed in parallel.


Concept and motivation

In a pipelined computer, instructions flow through the
central processing unit A central processing unit (CPU), also called a central processor, main processor or just processor, is the electronic circuitry that executes instructions comprising a computer program. The CPU performs basic arithmetic, logic, controlling, an ...
(CPU) in stages. For example, it might have one stage for each step of the von Neumann cycle: Fetch the instruction, fetch the operands, do the instruction, write the results. A pipelined computer usually has "pipeline registers" after each stage. These store information from the instruction and calculations so that the
logic gate A logic gate is an idealized or physical device implementing a Boolean function, a logical operation performed on one or more binary inputs that produces a single binary output. Depending on the context, the term may refer to an ideal logic gate, ...
s of the next stage can do the next step. This arrangement lets the CPU complete an instruction on each clock cycle. It is common for even-numbered stages to operate on one edge of the square-wave clock, while odd-numbered stages operate on the other edge. This allows more CPU
throughput Network throughput (or just throughput, when in context) refers to the rate of message delivery over a communication channel, such as Ethernet or packet radio, in a communication network. The data that these messages contain may be delivered ov ...
than a multicycle computer at a given
clock rate In computing, the clock rate or clock speed typically refers to the frequency at which the clock generator of a processor can generate pulses, which are used to synchronize the operations of its components, and is used as an indicator of the pr ...
, but may increase latency due to the added overhead of the pipelining process itself. Also, even though the electronic logic has a fixed maximum speed, a pipelined computer can be made faster or slower by varying the number of stages in the pipeline. With more stages, each stage does less work, and so the stage has fewer delays from the
logic gate A logic gate is an idealized or physical device implementing a Boolean function, a logical operation performed on one or more binary inputs that produces a single binary output. Depending on the context, the term may refer to an ideal logic gate, ...
s and could run at a higher clock rate. A pipelined model of computer is often the most economical, when cost is measured as logic gates per instruction per second. At each instant, an instruction is in only one pipeline stage, and on average, a pipeline stage is less costly than a multicycle computer. Also, when made well, most of the pipelined computer's logic is in use most of the time. In contrast, out of order computers usually have large amounts of idle logic at any given instant. Similar calculations usually show that a pipelined computer uses less energy per instruction. However, a pipelined computer is usually more complex and more costly than a comparable multicycle computer. It typically has more logic gates, registers and a more complex control unit. In a like way, it might use more total energy, while using less energy per instruction. Out of order CPUs can usually do more instructions per second because they can do several instructions at once. In a pipelined computer, the control unit arranges for the flow to start, continue, and stop as a program commands. The instruction data is usually passed in pipeline registers from one stage to the next, with a somewhat separated piece of control logic for each stage. The control unit also assures that the instruction in each stage does not harm the operation of instructions in other stages. For example, if two stages must use the same piece of data, the control logic assures that the uses are done in the correct sequence. When operating efficiently, a pipelined computer will have an instruction in each stage. It is then working on all of those instructions at the same time. It can finish about one instruction for each cycle of its clock. But when a program switches to a different sequence of instructions, the pipeline sometimes must discard the data in process and restart. This is called a "stall." Much of the design of a pipelined computer prevents interference between the stages and reduces stalls.


Number of steps

The number of dependent steps varies with the machine architecture. For example: * The 1956–61
IBM Stretch The IBM 7030, also known as Stretch, was IBM's first transistorized supercomputer. It was the fastest computer in the world from 1961 until the first CDC 6600 became operational in 1964."Designed by Seymour Cray, the CDC 6600 was almost three ti ...
project proposed the terms Fetch, Decode, and Execute that have become common. * The
classic RISC pipeline In the history of computing hardware, history of computer hardware, some early reduced instruction set computer central processing units (RISC CPUs) used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: ...
comprises: *# Instruction fetch *# Instruction decode and register fetch *# Execute *# Memory access *# Register write back * The
Atmel AVR AVR is a family of microcontrollers developed since 1996 by Atmel, acquired by Microchip Technology in 2016. These are modified Harvard architecture 8-bit Reduced instruction set computer, RISC single-chip microcontrollers. AVR was one of the f ...
and the
PIC microcontroller PIC (usually pronounced as ''"pick"'') is a family of microcontrollers made by Microchip Technology, derived from the PIC1650"PICmicro Family Tree", PIC16F Seminar Presentation originally developed by General Instrument's Microelectronics ...
each have a two-stage pipeline. * Many designs include pipelines as long as 7, 10 and even 20 stages (as in the
Intel Intel Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California. It is the world's largest semiconductor chip manufacturer by revenue, and is one of the developers of the x86 seri ...
Pentium 4 Pentium 4 is a series of single-core CPUs for desktops, laptops and entry-level servers manufactured by Intel. The processors were shipped from November 20, 2000 until August 8, 2008. The production of Netburst processors was active from 2000 ...
). * The later "Prescott" and "Cedar Mill"
Netburst The NetBurst microarchitecture, called P68 inside Intel, was the successor to the P6 microarchitecture in the x86 family of central processing units (CPUs) made by Intel. The first CPU to use this architecture was the Willamette-core Pentium ...
cores from Intel, used in the last Pentium 4 models and their
Pentium D Pentium D is a range of desktop 64-bit x86-64 processors based on the NetBurst microarchitecture, which is the dual-core variant of the Pentium 4 manufactured by Intel. Each CPU comprised two dies, each containing a single core, residing next to ...
and
Xeon Xeon ( ) is a brand of x86 microprocessors designed, manufactured, and marketed by Intel, targeted at the non-consumer workstation, server, and embedded system markets. It was introduced in June 1998. Xeon processors are based on the same arc ...
derivatives, have a long 31-stage pipeline. * The Xelerated X10q Network Processor has a pipeline more than a thousand stages long, although in this case 200 of these stages represent independent CPUs with individually programmed instructions. The remaining stages are used to coordinate accesses to memory and on-chip function units. As the pipeline is made "deeper" (with a greater number of dependent steps), a given step can be implemented with simpler circuitry, which may let the processor clock run faster. Such pipelines may be called ''superpipelines.'' A processor is said to be ''fully pipelined'' if it can fetch an instruction on every cycle. Thus, if some instructions or conditions require delays that inhibit fetching new instructions, the processor is not fully pipelined.


History

Seminal uses of pipelining were in the
ILLIAC II The ILLIAC II was a revolutionary super-computer built by the University of Illinois that became operational in 1962. Description The concept, proposed in 1958, pioneered Emitter-coupled logic (ECL) circuitry, pipelining, and transistor memory ...
project and the
IBM Stretch The IBM 7030, also known as Stretch, was IBM's first transistorized supercomputer. It was the fastest computer in the world from 1961 until the first CDC 6600 became operational in 1964."Designed by Seymour Cray, the CDC 6600 was almost three ti ...
project, though a simple version was used earlier in the Z1 in 1939 and the Z3 in 1941. (12 pages) Pipelining began in earnest in the late 1970s in
supercomputers A supercomputer is a computer with a high level of performance as compared to a general-purpose computer. The performance of a supercomputer is commonly measured in floating-point operations per second (FLOPS) instead of million instructions p ...
such as vector processors and array processors. One of the early supercomputers was the Cyber series built by Control Data Corporation. Its main architect,
Seymour Cray Seymour Roger Cray (September 28, 1925 – October 5, 1996
) was an American
Amdahl Corporation Amdahl Corporation was an information technology company which specialized in IBM mainframe-compatible computer products, some of which were regarded as supercomputers competing with those from Cray Research. Founded in 1970 by Gene Amdahl, a for ...
's 470 series general purpose mainframe had a 7-step pipeline, and a patented branch prediction circuit.


Hazards

The model of sequential execution assumes that each instruction completes before the next one begins; this assumption is not true on a pipelined processor. A situation where the expected result is problematic is known as a
hazard A hazard is a potential source of harm Harm is a moral and legal concept. Bernard Gert construes harm as any of the following: * pain * death * disability * mortality * loss of abil ity or freedom * loss of pleasure. Joel Feinberg giv ...
. Imagine the following two register instructions to a hypothetical processor: 1: add 1 to R5 2: copy R5 to R6 If the processor has the 5 steps listed in the initial illustration (the 'Basic five-stage pipeline' at the start of the article), instruction 1 would be fetched at time ''t''1 and its execution would be complete at ''t5''. Instruction 2 would be fetched at ''t2'' and would be complete at ''t6''. The first instruction might deposit the incremented number into R5 as its fifth step (register write back) at ''t5''. But the second instruction might get the number from R5 (to copy to R6) in its second step (instruction decode and register fetch) at time ''t3''. It seems that the first instruction would not have incremented the value by then. The above code invokes a hazard. Writing computer programs in a
compiled In computing, a compiler is a computer program that translates computer code written in one programming language (the ''source'' language) into another language (the ''target'' language). The name "compiler" is primarily used for programs that ...
language might not raise these concerns, as the compiler could be designed to generate machine code that avoids hazards.


Workarounds

In some early DSP and RISC processors, the documentation advises programmers to avoid such dependencies in adjacent and nearly adjacent instructions (called
delay slot In computer architecture, a delay slot is an instruction slot being executed without the effects of a preceding instruction. The most common form is a single arbitrary instruction located immediately after a branch instruction on a RISC or DSP ar ...
s), or declares that the second instruction uses an old value rather than the desired value (in the example above, the processor might counter-intuitively copy the unincremented value), or declares that the value it uses is undefined. The programmer may have unrelated work that the processor can do in the meantime; or, to ensure correct results, the programmer may insert NOPs into the code, partly negating the advantages of pipelining.


Solutions

Pipelined processors commonly use three techniques to work as expected when the programmer assumes that each instruction completes before the next one begins: *The pipeline could stall, or cease scheduling new instructions until the required values are available. This results in empty slots in the pipeline, or ''bubbles'', in which no work is performed. *An additional data path can be added that routes a computed value to a future instruction elsewhere in the pipeline before the instruction that produced it has been fully retired, a process called
operand forwarding Operand forwarding (or data forwarding) is an optimization in pipelined CPUs to limit performance deficits which occur due to pipeline stalls. A data hazard can lead to a pipeline stall when the current operation has to wait for the results of ...
. *The processor can locate other instructions which are not dependent on the current ones and which can be immediately executed without hazards, an optimization known as
out-of-order execution In computer engineering, out-of-order execution (or more formally dynamic execution) is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a proce ...
.


Branches

A branch out of the normal instruction sequence often involves a hazard. Unless the processor can give effect to the branch in a single time cycle, the pipeline will continue fetching instructions sequentially. Such instructions cannot be allowed to take effect because the programmer has diverted control to another part of the program. A conditional branch is even more problematic. The processor may or may not branch, depending on a calculation that has not yet occurred. Various processors may stall, may attempt
branch prediction In computer architecture, a branch predictor is a digital circuit that tries to guess which way a branch (e.g., an if–then–else structure) will go before this is known definitively. The purpose of the branch predictor is to improve the flow ...
, and may be able to begin to execute two different program sequences ( eager execution), each assuming the branch is or is not taken, discarding all work that pertains to the incorrect guess. A processor with an implementation of branch prediction that usually makes correct predictions can minimize the performance penalty from branching. However, if branches are predicted poorly, it may create more work for the processor, such as flushing from the pipeline the incorrect code path that has begun execution before resuming execution at the correct location. Programs written for a pipelined processor deliberately avoid branching to minimize possible loss of speed. For example, the programmer can handle the usual case with sequential execution and branch only on detecting unusual cases. Using programs such as
gcov Gcov is a source code coverage analysis and statement-by-statement profiling tool. Gcov generates exact counts of the number of times each statement in a program is executed and annotates source code to add instrumentation. Gcov comes as a stand ...
to analyze
code coverage In computer science, test coverage is a percentage measure of the degree to which the source code of a program is executed when a particular test suite is run. A program with high test coverage has more of its source code executed during testing, ...
lets the programmer measure how often particular branches are actually executed and gain insight with which to optimize the code. In some cases, a programmer can handle both the usual case and unusual case with
branch-free code A branch is an instruction in a computer program that can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order. ''Branch'' (or ''branching'', ''branc ...
.


Special situations

; Self-modifying programs : The technique of
self-modifying code In computer science, self-modifying code (SMC) is code that alters its own instructions while it is executing – usually to reduce the instruction path length and improve performance or simply to reduce otherwise repetitively similar code, ...
can be problematic on a pipelined processor. In this technique, one of the effects of a program is to modify its own upcoming instructions. If the processor has an
instruction cache A CPU cache is a hardware cache used by the central processing unit (CPU) of a computer to reduce the average cost (time or energy) to access data from the main memory. A cache is a smaller, faster memory, located closer to a processor core, which ...
, the original instruction may already have been copied into a
prefetch input queue Fetching the instruction opcodes from program memory well in advance is known as prefetching and it is served by using prefetch input queue (PIQ).The pre-fetched instructions are stored in data structure - namely a queue. The fetching of opcodes ...
and the modification will not take effect. Some processors such as the
Zilog Z280 The Zilog Z280 is a 16-bit microprocessor, an enhancement of the Zilog Z80 architecture, introduced in July 1987. It is basically the Z800, renamed, with slight improvements such as being fabricated in CMOS. It was a commercial failure. Zilog ...
can configure their on-chip cache memories for data-only fetches, or as part of their ordinary memory address space, and avoid such difficulties with self-modifying instructions. ; Uninterruptible instructions : An instruction may be uninterruptible to ensure its atomicity, such as when it swaps two items. A sequential processor permits
interrupt In digital computers, an interrupt (sometimes referred to as a trap) is a request for the processor to ''interrupt'' currently executing code (when permitted), so that the event can be processed in a timely manner. If the request is accepted, ...
s between instructions, but a pipelining processor overlaps instructions, so executing an uninterruptible instruction renders portions of ordinary instructions uninterruptible too. The
Cyrix coma bug The Cyrix coma bug is a design flaw in Cyrix 6x86 (introduced in 1996), 6x86L, and early 6x86MX processors that allows a non-privileged program to hang the computer. Discovery According to Andrew Balsa, around the time of the discovery of the ...
would
hang Hang or Hanging may refer to: People * Choe Hang (disambiguation), various people * Luciano Hang (born 1962/1963), Brazilian billionaire businessman * Ren Hang (disambiguation), various people Law * Hanging, a form of capital punishment Arts, e ...
a single-core system using an infinite loop in which an uninterruptible instruction was always in the pipeline.


Design considerations

; Speed : Pipelining keeps all portions of the processor occupied and increases the amount of useful work the processor can do in a given time. Pipelining typically reduces the processor's cycle time and increases the throughput of instructions. The speed advantage is diminished to the extent that execution encounters
hazards A hazard is a potential source of harm. Substances, events, or circumstances can constitute hazards when their nature would allow them, even just theoretically, to cause damage to health, life, property, or any other interest of value. The probabi ...
that require execution to slow below its ideal rate. A non-pipelined processor executes only a single instruction at a time. The start of the next instruction is delayed not based on hazards but unconditionally. : A pipelined processor's need to organize all its work into modular steps may require the duplication of registers, which increases the latency of some instructions. ; Economy : By making each dependent step simpler, pipelining can enable complex operations more economically than adding complex circuitry, such as for numerical calculations. However, a processor that declines to pursue increased speed with pipelining may be simpler and cheaper to manufacture. ; Predictability : Compared to environments where the programmer needs to avoid or work around hazards, use of a non-pipelined processor may make it easier to program and to train programmers. The non-pipelined processor also makes it easier to predict the exact timing of a given sequence of instructions.


Illustrated example

To the right is a generic pipeline with four stages: fetch, decode, execute and write-back. The top gray box is the list of instructions waiting to be executed, the bottom gray box is the list of instructions that have had their execution completed, and the middle white box is the pipeline. The execution is as follows:


Pipeline bubble

A pipelined processor may deal with hazards by stalling and creating a bubble in the pipeline, resulting in one or more cycles in which nothing useful happens. In the illustration at right, in cycle 3, the processor cannot decode the purple instruction, perhaps because the processor determines that decoding depends on results produced by the execution of the green instruction. The green instruction can proceed to the Execute stage and then to the Write-back stage as scheduled, but the purple instruction is stalled for one cycle at the Fetch stage. The blue instruction, which was due to be fetched during cycle 3, is stalled for one cycle, as is the red instruction after it. Because of the bubble (the blue ovals in the illustration), the processor's Decode circuitry is idle during cycle 3. Its Execute circuitry is idle during cycle 4 and its Write-back circuitry is idle during cycle 5. When the bubble moves out of the pipeline (at cycle 6), normal execution resumes. But everything now is one cycle late. It will take 8 cycles (cycle 1 through 8) rather than 7 to completely execute the four instructions shown in colors.


See also

*
Wait state A wait state is a delay experienced by a computer processor when accessing external memory or another device that is slow to respond. Computer microprocessors generally run much faster than the computer's other subsystems, which hold the data the ...
*
Classic RISC pipeline In the history of computing hardware, history of computer hardware, some early reduced instruction set computer central processing units (RISC CPUs) used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: ...


Notes


References


External links


Branch Prediction in the Pentium Family

ArsTechnica article on pipelining


{{DEFAULTSORT:Instruction Pipeline
Pipeline Pipeline may refer to: Electronics, computers and computing * Pipeline (computing), a chain of data-processing stages or a CPU optimization found on ** Instruction pipelining, a technique for implementing instruction-level parallelism within a s ...