Tomasulo Algorithm
   HOME

TheInfoList



OR:

Tomasulo's algorithm is a
computer architecture In computer engineering, computer architecture is a description of the structure of a computer system made from component parts. It can sometimes be a high-level description that ignores details of the implementation. At a more detailed level, the ...
hardware
algorithm In mathematics and computer science, an algorithm () is a finite sequence of rigorous instructions, typically used to solve a class of specific problems or to perform a computation. Algorithms are used as specifications for performing ...
for dynamic scheduling of instructions that allows
out-of-order execution In computer engineering, out-of-order execution (or more formally dynamic execution) is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a proces ...
and enables more efficient use of multiple execution units. It was developed by
Robert Tomasulo Robert Marco Tomasulo (October 31, 1934 – April 3, 2008) was a computer scientist, and the inventor of the Tomasulo algorithm. Tomasulo was the recipient of the 1997 Eckert–Mauchly Award " r the ingenious Tomasulo algorithm, which enabled ...
at IBM in 1967 and was first implemented in the
IBM System/360 Model 91 The IBM System/360 Model 91 was announced in 1964 as a competitor to the CDC 6600. Functionally, the Model 91 ran like any other large-scale System/360, but the internal organization was the most advanced of the System/360 line, and it was the ...
’s
floating point unit Floating may refer to: * a type of dental work performed on horse teeth * use of an isolation tank * the guitar-playing technique where chords are sustained rather than scratched * ''Floating'' (play), by Hugh Hughes * Floating (psychological ...
. The major innovations of Tomasulo’s algorithm include
register renaming In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a partic ...
in hardware,
reservation station A unified reservation station, also known as unified scheduler, is a decentralized feature of the microarchitecture of a CPU that allows for register renaming In computer architecture, register renaming is a technique that abstracts logical r ...
s for all execution units, and a common data bus (CDB) on which computed values broadcast to all reservation stations that may need them. These developments allow for improved parallel execution of instructions that would otherwise stall under the use of
scoreboarding Scoreboarding is a centralized method, first used in the CDC 6600 computer, for dynamically scheduling instructions so that they can execute out of order when there are no conflicts and the hardware is available. In a scoreboard, the data dependen ...
or other earlier algorithms.
Robert Tomasulo Robert Marco Tomasulo (October 31, 1934 – April 3, 2008) was a computer scientist, and the inventor of the Tomasulo algorithm. Tomasulo was the recipient of the 1997 Eckert–Mauchly Award " r the ingenious Tomasulo algorithm, which enabled ...
received the
Eckert–Mauchly Award The Eckert–Mauchly Award recognizes contributions to digital systems and computer architecture. It is known as the computer architecture community’s most prestigious award. First awarded in 1979, it was named for John Presper Eckert and Jo ...
in 1997 for his work on the algorithm.


Implementation concepts

The following are the concepts necessary to the implementation of Tomasulo's algorithm:


Common data bus

The Common Data Bus (CDB) connects reservation stations directly to functional units. According to Tomasulo it "preserves precedence while encouraging concurrency". This has two important effects: #Functional units can access the result of any operation without involving a floating-point-register, allowing multiple units waiting on a result to proceed without waiting to resolve contention for access to register file read ports. #Hazard Detection and control execution are distributed. The reservation stations control when an instruction can execute, rather than a single dedicated hazard unit.


Instruction order

Instructions are issued sequentially so that the effects of a sequence of instructions, such as exceptions raised by these instructions, occur in the same order as they would on an in-order processor, regardless of the fact that they are being executed out-of-order (i.e. non-sequentially).


Register renaming

Tomasulo's algorithm uses
register renaming In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a partic ...
to correctly perform out-of-order execution. All general-purpose and reservation station registers hold either a real value or a placeholder value. If a real value is unavailable to a destination register during the issue stage, a placeholder value is initially used. The placeholder value is a tag indicating which reservation station will produce the real value. When the unit finishes and broadcasts the result on the CDB, the placeholder will be replaced with the real value. Each functional unit has a single reservation station. Reservation stations hold information needed to execute a single instruction, including the operation and the operands. The functional unit begins processing when it is free and when all source operands needed for an instruction are real.


Exceptions

Practically speaking, there may be exceptions for which not enough status information about an exception is available, in which case the processor may raise a special exception, called an "imprecise" exception. Imprecise exceptions cannot occur in in-order implementations, as processor state is changed only in program order (see RISC Pipeline Exceptions). Programs that experience "precise" exceptions, where the specific instruction that took the exception can be determined, can restart or re-execute at the point of the exception. However, those that experience "imprecise" exceptions generally cannot restart or re-execute, as the system cannot determine the specific instruction that took the exception.


Instruction lifecycle

The three stages listed below are the stages through which each instruction passes from the time it is issued to the time its execution is complete.


Register legend

*Op - represents the operation being performed on operands *Qj, Qk - the reservation station that will produce the relevant source operand (0 indicates the value is in Vj, Vk) *Vj, Vk - the value of the source operands *A - used to hold the memory address information for a load or store *Busy - 1 if occupied, 0 if not occupied *Qi - (Only register unit) the reservation station whose result should be stored in this register (if blank or 0, no values are destined for this register)


Stage 1: issue

In the issue stage, instructions are issued for execution if all operands and reservation stations are ready or else they are stalled. Registers are renamed in this step, eliminating WAR and WAW hazards. *Retrieve the next instruction from the head of the instruction queue. If the instruction operands are currently in the registers, then **If a matching functional unit is available, issue the instruction. **Else, as there is no available functional unit, stall the instruction until a station or buffer is free. *Otherwise, we can assume the operands are not in the registers, and so use virtual values. The functional unit must calculate the real value to keep track of the functional units that produce the operand.


Stage 2: execute

In the execute stage, the instruction operations are carried out. Instructions are delayed in this step until all of their operands are available, eliminating RAW hazards. Program correctness is maintained through effective address calculation to prevent hazards through memory. *If one or more of the operands is not yet available then: wait for operand to become available on the CDB. *When all operands are available, then: if the instruction is a load or store **Compute the effective address when the base register is available, and place it in the load/store buffer ***If the instruction is a load then: execute as soon as the memory unit is available ***Else, if the instruction is a store then: wait for the value to be stored before sending it to the memory unit *Else, the instruction is an
arithmetic logic unit In computing, an arithmetic logic unit (ALU) is a combinational digital circuit that performs arithmetic and bitwise operations on integer binary numbers. This is in contrast to a floating-point unit (FPU), which operates on floating point num ...
(ALU) operation then: execute the instruction at the corresponding functional unit


Stage 3: write result

In the write Result stage, ALU operations results are written back to registers and store operations are written back to memory. * If the instruction was an ALU operation ** If the result is available, then: write it on the CDB and from there into the registers and any reservation stations waiting for this result * Else, if the instruction was a store then: write the data to memory during this step


Algorithm improvements

The concepts of reservation stations, register renaming, and the common data bus in Tomasulo's algorithm presents significant advancements in the design of high-performance computers. Reservation stations take on the responsibility of waiting for operands in the presence of data dependencies and other inconsistencies such as varying storage access time and circuit speeds, thus freeing up the functional units. This improvement overcomes long floating point delays and memory accesses. In particular the algorithm is more tolerant of cache misses. Additionally, programmers are freed from implementing optimized code. This is a result of the common data bus and reservation station working together to preserve dependencies as well as encouraging concurrency. By tracking operands for instructions in the reservation stations and register renaming in hardware the algorithm minimizes read-after-write (RAW) and eliminates write-after-write (WAW) and Write-after-Read (WAR)
computer architecture In computer engineering, computer architecture is a description of the structure of a computer system made from component parts. It can sometimes be a high-level description that ignores details of the implementation. At a more detailed level, the ...
hazard A hazard is a potential source of harm Harm is a moral and legal concept. Bernard Gert construes harm as any of the following: * pain * death * disability * mortality * loss of abil ity or freedom * loss of pleasure. Joel Feinberg giv ...
s. This improves performance by reducing wasted time that would otherwise be required for stalls. An equally important improvement in the algorithm is the design is not limited to a specific pipeline structure. This improvement allows the algorithm to be more widely adopted by multiple-issue processors. Additionally, the algorithm is easily extended to enable branch speculation.


Applications and legacy

Tomasulo's algorithm, outside of IBM, was unused for several years after its implementation in the System/360 Model 91 architecture. However, it saw a vast increase in usage during the 1990s for 3 reasons: # Once caches became commonplace, the Tomasulo algorithm's ability to maintain concurrency during unpredictable load times caused by cache misses became valuable in processors. # Dynamic scheduling and the branch speculation that the algorithm enables helped performance as processors issued more and more instructions. # Proliferation of mass-market software meant that programmers would not want to compile for a specific pipeline structure. The algorithm can function with any pipeline architecture and thus software requires few architecture-specific modifications. Many modern processors implement dynamic scheduling schemes that are derivative of Tomasulo's original algorithm, including popular
Intel Intel Corporation is an American multinational corporation and technology company headquartered in Santa Clara, California. It is the world's largest semiconductor chip manufacturer by revenue, and is one of the developers of the x86 seri ...
x86-64 x86-64 (also known as x64, x86_64, AMD64, and Intel 64) is a 64-bit version of the x86 instruction set, first released in 1999. It introduced two new modes of operation, 64-bit mode and compatibility mode, along with a new 4-level paging mod ...
chips.


See also

*
Re-order buffer A re-order buffer (ROB) is a hardware unit used in an extension to the Tomasulo algorithm to support out-of-order and speculative instruction execution. The extension forces instructions to be committed in-order. The buffer is a circular buffer ...
(ROB) *
Instruction-level parallelism Instruction-level parallelism (ILP) is the parallel or simultaneous execution of a sequence of instructions in a computer program. More specifically ILP refers to the average number of instructions run per step of this parallel execution. Disc ...
(ILP)


References


Further reading

*


External links

* {{webarchive, url=https://web.archive.org/web/20171225215322/http://www.cs.umd.edu/class/fall2001/cmsc411/projects/dynamic/tomasulo.html, title=Dynamic Scheduling - Tomasulo's Algorithm, date=December 25, 2017
HASE Java applet simulation of the Tomasulo's algorithm
Algorithms Instruction processing