Tomasulo's algorithm

Tomasulo's algorithm is a computer architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution and enables more efficient use of multiple execution units. It was developed by Robert Tomasulo at IBM in 1967 and was first implemented in the IBM System/360 Model 91’s floating point unit. [1]

The major innovations of Tomasulo’s algorithm include register renaming in hardware, reservation stations for all execution units, and a common data bus (CDB) on which computed values are broadcast to all reservation stations that may need them. These developments allow for improved parallel execution of instructions that would otherwise stall under scoreboarding or other earlier algorithms.

Robert Tomasulo received the Eckert–Mauchly Award in 1997 for his work on the algorithm. [2]

Implementation concepts

[Figure: Tomasulo's floating point unit]

The following are the concepts necessary to the implementation of Tomasulo's algorithm:

Common data bus

The Common Data Bus (CDB) connects reservation stations directly to functional units. According to Tomasulo it "preserves precedence while encouraging concurrency". [1] :33 This has two important effects:

  1. Functional units can access the result of any operation without involving a floating-point register, allowing multiple units that are waiting on a result to proceed without contending for register file read ports.
  2. Hazard detection and execution control are distributed: the reservation stations control when an instruction can execute, rather than a single dedicated hazard unit (see the sketch below).
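
As a rough sketch of how this broadcast might be modelled in C (all names, such as cdb_broadcast, are invented for illustration and are not from the source), each busy reservation station compares the tag carried on the bus against the tags of its missing operands and captures the value on a match, with no register-file port involved:

    #include <stdio.h>

    #define NUM_STATIONS 6

    /* Simplified reservation-station entry: qj/qk hold the tag of the
       producing station (0 means the operand is already in vj/vk). */
    typedef struct {
        int    busy;
        int    qj, qk;     /* producer tags for the two source operands */
        double vj, vk;     /* operand values, valid once the tag is 0   */
    } RS;

    static RS rs[NUM_STATIONS + 1];   /* index 0 unused so tag 0 can mean "ready" */

    /* Every station snoops the bus; whichever ones were waiting on `tag`
       capture the value without touching the register file. */
    static void cdb_broadcast(int tag, double value) {
        for (int x = 1; x <= NUM_STATIONS; x++) {
            if (!rs[x].busy) continue;
            if (rs[x].qj == tag) { rs[x].vj = value; rs[x].qj = 0; }
            if (rs[x].qk == tag) { rs[x].vk = value; rs[x].qk = 0; }
        }
    }

    int main(void) {
        /* Two stations are waiting on the result of station 3. */
        rs[1] = (RS){ .busy = 1, .qj = 3, .qk = 0, .vk = 2.0 };
        rs[2] = (RS){ .busy = 1, .qj = 0, .qk = 3, .vj = 5.0 };

        cdb_broadcast(3, 7.5);    /* station 3 finishes and broadcasts 7.5 */

        printf("RS1: vj=%.1f qj=%d\n", rs[1].vj, rs[1].qj);
        printf("RS2: vk=%.1f qk=%d\n", rs[2].vk, rs[2].qk);
        return 0;
    }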

Instruction order

Instructions are issued sequentially so that the effects of a sequence of instructions, such as the exceptions they raise, occur in the same order as they would on an in-order processor, even though the instructions themselves may execute out of order (i.e. non-sequentially).

Register renaming

Tomasulo's algorithm uses register renaming to correctly perform out-of-order execution. All general-purpose and reservation station registers hold either a real value or a placeholder value. If a real value is unavailable to a destination register during the issue stage, a placeholder value is initially used. The placeholder value is a tag indicating which reservation station will produce the real value. When the unit finishes and broadcasts the result on the CDB, the placeholder will be replaced with the real value.
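
As a toy illustration of the renaming effect (names are invented; this is not the Model 91's actual register-status logic), each instruction that writes a register simply overwrites the register's placeholder tag with its own reservation station at issue time, so only the newest producer's broadcast is accepted by the register and WAW hazards disappear:

    #define NUM_REGS 8

    /* Register-status entry: qi is the tag of the reservation station whose
       result is destined for this register (0 means the register holds a value). */
    typedef struct {
        int    qi;
        double value;
    } RegStat;

    static RegStat reg[NUM_REGS];

    /* Issue-time renaming for an instruction writing register rd from station r. */
    static void rename_dest(int rd, int r) {
        reg[rd].qi = r;            /* the newest producer claims the register */
    }

    /* Example: the adder (station 1) and then the multiplier (station 4) both
       target F2; after both issue, F2 waits only on station 4, so the adder's
       later broadcast can never overwrite the newer result. */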

Each functional unit has one or more reservation stations. Reservation stations hold information needed to execute a single instruction, including the operation and the operands. The functional unit begins processing when it is free and when all source operands needed for an instruction are real.

Exceptions

In practice, there may be cases where insufficient status information is available about an exception, in which case the processor may raise a special exception, called an imprecise exception. Imprecise exceptions cannot occur in in-order implementations, as processor state is changed only in program order (see Classic RISC pipeline § Exceptions).

Programs that experience precise exceptions, where the specific instruction that took the exception can be determined, can restart or re-execute at the point of the exception. However, those that experience imprecise exceptions generally cannot restart or re-execute, as the system cannot determine the specific instruction that took the exception.

Instruction lifecycle

Each instruction passes through the three stages listed below, from the time it is issued to the time its execution is complete.

Legend

Reservation Station Fields

  • Op - represents the operation being performed on operands
  • Qj, Qk - the reservation station that will produce the relevant source operand (0 indicates the value is in Vj, Vk)
  • Vj, Vk - the value of the source operands
  • A - used to hold the memory address information for a load or store
  • Busy - 1 if occupied, 0 if not occupied

Register Status Fields

  • Qi - the reservation station whose result should be stored in this register (if blank or 0, no values are destined for this register)
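
The legend fields map naturally onto a pair of record types. A minimal sketch in C follows (type and field names are illustrative, not taken from any particular implementation):

    /* One reservation-station entry, mirroring the legend above. */
    typedef struct {
        char   op[8];     /* Op: operation to perform on the operands                       */
        int    qj, qk;    /* Qj, Qk: producing stations for the sources (0 = value present) */
        double vj, vk;    /* Vj, Vk: source operand values                                  */
        long   a;         /* A: address information for a load or store                    */
        int    busy;      /* Busy: 1 if the entry is occupied, 0 otherwise                  */
    } ReservationStation;

    /* Per-register status, mirroring the legend above. */
    typedef struct {
        int qi;           /* Qi: station whose result is destined for this register (0 = none) */
    } RegisterStatus;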

Stage 1: issue

In the issue stage, an instruction is issued for execution if a matching reservation station is free; otherwise it is stalled until one becomes available. Operands that are not yet available are recorded as placeholder tags. Registers are renamed in this step, eliminating WAR and WAW hazards.

Pseudocode [3] :180
Instruction state: Issue
Wait until: Station r empty
Action or bookkeeping:
    if (RegisterStat[rs].Qi ≠ 0)
        {RS[r].Qj ← RegisterStat[rs].Qi}
    else
        {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0;}
    if (RegisterStat[rt].Qi ≠ 0)
        {RS[r].Qk ← RegisterStat[rt].Qi;}
    else
        {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0;}
    RS[r].Busy ← yes;
    RegisterStat[rd].Qi ← r;

Instruction state: Load or Store
Wait until: Buffer r empty
Action or bookkeeping:
    if (RegisterStat[rs].Qi ≠ 0)
        {RS[r].Qj ← RegisterStat[rs].Qi;}
    else
        {RS[r].Vj ← Regs[rs]; RS[r].Qj ← 0;}
    RS[r].A ← imm;
    RS[r].Busy ← yes;

Instruction state: Load only
Action or bookkeeping:
    RegisterStat[rt].Qi ← r;

Instruction state: Store only
Action or bookkeeping:
    if (RegisterStat[rt].Qi ≠ 0)
        {RS[r].Qk ← RegisterStat[rt].Qi;}
    else
        {RS[r].Vk ← Regs[rt]; RS[r].Qk ← 0;}
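
Read as C rather than table pseudocode, the first row above (issuing an arithmetic operation into a free reservation station r) might look like the following sketch; regs, reg_stat and the other names are illustrative, and tag 0 again means "operand value available":

    #include <string.h>

    #define NUM_RS   6
    #define NUM_REGS 16

    typedef struct {
        char   op[8];
        int    qj, qk;
        double vj, vk;
        int    busy;
    } Station;

    typedef struct { int qi; } RegStatus;

    static Station   rs[NUM_RS + 1];      /* index 0 reserved so tag 0 = "ready" */
    static RegStatus reg_stat[NUM_REGS];  /* RegisterStat in the pseudocode      */
    static double    regs[NUM_REGS];      /* architectural register file         */

    /* Issue "op rd, rs_src, rt_src" into the already-free station r. */
    static void issue_fp(const char *op, int r, int rd, int rs_src, int rt_src) {
        if (reg_stat[rs_src].qi != 0)          /* first source not ready yet:  */
            rs[r].qj = reg_stat[rs_src].qi;    /* record who will produce it   */
        else {
            rs[r].vj = regs[rs_src];           /* first source available now   */
            rs[r].qj = 0;
        }
        if (reg_stat[rt_src].qi != 0)
            rs[r].qk = reg_stat[rt_src].qi;
        else {
            rs[r].vk = regs[rt_src];
            rs[r].qk = 0;
        }
        strncpy(rs[r].op, op, sizeof rs[r].op - 1);
        rs[r].busy = 1;
        reg_stat[rd].qi = r;                   /* rename the destination register */
    }
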
[Figure: Example of Tomasulo's algorithm]

Stage 2: execute

In the execute stage, the instruction operations are carried out. Instructions are delayed in this step until all of their operands are available, eliminating RAW hazards. For loads and stores, program correctness is maintained by computing the effective address in program order, which prevents hazards through memory.

Pseudocode [3] :180
Instruction state: FP operation
Wait until: (RS[r].Qj = 0) and (RS[r].Qk = 0)
Action or bookkeeping:
    Compute result: operands are in Vj and Vk

Instruction state: Load/store step 1
Wait until: RS[r].Qj = 0 & r is head of load-store queue
Action or bookkeeping:
    RS[r].A ← RS[r].Vj + RS[r].A;

Instruction state: Load step 2
Wait until: Load step 1 complete
Action or bookkeeping:
    Read from Mem[RS[r].A]
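
A brief sketch, in the same illustrative C as above, of the execute-stage conditions: an arithmetic station may begin only once both of its source tags have been cleared by CDB broadcasts, and a load or store folds its base register into the address field that was filled with the immediate at issue time (the load/store-queue ordering check is omitted here):

    /* Minimal station fields needed for the execute-stage checks. */
    typedef struct {
        int    busy;
        int    qj, qk;    /* producer tags (0 = value present) */
        double vj, vk;
        long   a;         /* immediate at issue, effective address afterwards */
    } Station;

    /* FP operation: ready when (RS[r].Qj = 0) and (RS[r].Qk = 0). */
    static int fp_ready(const Station *s) {
        return s->busy && s->qj == 0 && s->qk == 0;
    }

    /* Load/store step 1: RS[r].A <- RS[r].Vj + RS[r].A. */
    static void compute_effective_address(Station *s) {
        s->a = (long)s->vj + s->a;
    }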

Stage 3: write result

In the write result stage, the results of ALU operations are written back to registers, and store operations write their data back to memory.

Pseudocode [3] :180
Instruction state: FP operation or load
Wait until: Execution complete at r & CDB available
Action or bookkeeping:
    ∀x (if (RegisterStat[x].Qi = r) {Regs[x] ← result; RegisterStat[x].Qi ← 0;});
    ∀x (if (RS[x].Qj = r) {RS[x].Vj ← result; RS[x].Qj ← 0;});
    ∀x (if (RS[x].Qk = r) {RS[x].Vk ← result; RS[x].Qk ← 0;});
    RS[r].Busy ← no;

Instruction state: Store
Wait until: Execution complete at r & RS[r].Qk = 0
Action or bookkeeping:
    Mem[RS[r].A] ← RS[r].Vk;
    RS[r].Busy ← no;

Algorithm improvements

The concepts of reservation stations, register renaming, and the common data bus in Tomasulo's algorithm represent significant advancements in the design of high-performance computers.

Reservation stations take on the responsibility of waiting for operands in the presence of data dependencies and other inconsistencies such as varying storage access times and circuit speeds, thus freeing up the functional units. This improvement overcomes long floating-point delays and memory accesses; in particular, the algorithm is more tolerant of cache misses. Programmers are also largely freed from hand-scheduling code around these latencies, because the common data bus and reservation stations work together to preserve dependencies while encouraging concurrency. [1] :33

By tracking operands for instructions in the reservation stations and renaming registers in hardware, the algorithm minimizes read-after-write (RAW) stalls and eliminates write-after-write (WAW) and write-after-read (WAR) hazards. This improves performance by reducing the time that would otherwise be wasted on stalls. [1] :33

An equally important property of the algorithm is that its design is not limited to a specific pipeline structure. This allows the algorithm to be more widely adopted by multiple-issue processors. Additionally, the algorithm is easily extended to support branch speculation. [3] :182

Applications and legacy

Outside of IBM, Tomasulo's algorithm went unused for several years after its implementation in the System/360 Model 91 architecture. However, it saw a vast increase in usage during the 1990s, for three reasons:

  1. Once caches became commonplace, the Tomasulo algorithm's ability to maintain concurrency during unpredictable load times caused by cache misses became valuable in processors.
  2. Dynamic scheduling and the branch speculation that the algorithm enables helped performance as processors issued more and more instructions.
  3. Proliferation of mass-market software meant that programmers would not want to compile for a specific pipeline structure. The algorithm can function with any pipeline architecture and thus software requires few architecture-specific modifications. [3] :183

Many modern processors implement dynamic scheduling schemes that are derivative of Tomasulo's original algorithm, including popular Intel x86-64 chips. [5] [6]

See also

  • Re-order buffer
  • Register renaming
  • Reservation station
  • Scoreboarding
  • Out-of-order execution
  • Instruction pipelining
  • Hazard (computer architecture)
  • Memory disambiguation

References

  1. Tomasulo, Robert Marco (January 1967). "An Efficient Algorithm for Exploiting Multiple Arithmetic Units". IBM Journal of Research and Development. 11 (1). IBM: 25–33. doi:10.1147/rd.111.0025. ISSN 0018-8646. S2CID 8445049.
  2. "Robert Tomasulo – Award Winner". ACM Awards. ACM. Retrieved 8 December 2014.
  3. Hennessy, John L.; Patterson, David A. (2012). Computer Architecture: A Quantitative Approach. Waltham, MA: Elsevier. ISBN 978-0123838728.
  4. "CSE P548 - Tomasulo" (PDF). washington.edu. University of Washington. 2006. Retrieved 8 December 2014.
  5. Intel 64 and IA-32 Architectures Software Developer's Manual (Report). Intel. September 2014. Retrieved 8 December 2014.
  6. Yoga, Adarsh. "Differences between Tomasulo's algorithm and dynamic scheduling in Intel Core microarchitecture". The boozier. Retrieved 4 April 2016.

Further reading