Scoreboarding

Last updated

Scoreboarding is a centralized method, first used in the CDC 6600 computer, for dynamically scheduling instructions so that they can execute out of order when there are no conflicts and the hardware is available. [1]

Contents

In a scoreboard, the data dependencies of every instruction are logged, tracked and strictly observed at all times. Instructions are released only when the scoreboard determines that there are no conflicts with previously issued ("in flight") instructions. If an instruction is stalled because it is unsafe to issue (or there are insufficient resources), the scoreboard monitors the flow of executing instructions until all dependencies have been resolved before the stalled instruction is issued. In essence: reads proceed on the absence of write hazards, and writes proceed in the absence of read hazards.

Scoreboarding is essentially a hardware implementation of the same underlying algorithm seen in dataflow languages, creating a Directed Acyclic Graph, where the same logic is applied in the programming language runtime.

Stages

Instructions are decoded in order and go through the following four stages.

  1. Issue: The system checks which registers will be read and written by this instruction and where conflicts WAR and RAW and WAW are detected. RAW and WAR hazards are recorded using a Dependency Matrix (constructed from SR NOR latches in the original 6600 design) as it will be needed in the following stages. Simultaneously, an entry is recorded in a second Matrix, which records the instruction order as a Directed Acyclic Graph. In order to avoid output dependencies (WAW – Write after Write) the instruction is stalled until instructions intending to write to the same register are completed. The instruction is also stalled when required functional units are currently busy. No instruction is ever issued unless it is fully trackable from start to finish.
  2. Read operands: After an instruction has been issued and correctly allocated to the required hardware module (named a Computation Unit in Thornton's book), the Unit waits until all operands become available. The read only proceeds when write dependencies (RAW – Read after Write) have been dropped from all other Units. To avoid Register File Port contention, a Priority Picker selects one Computational Unit (in the case where several Units are clear of hazards).
  3. Execution: When all operands have been fetched, the Computation Unit starts its execution. After the result is ready, the scoreboard is notified.
  4. Write Result: In this stage the result is ready but has not yet been written to its destination register. The write may not proceed until the Unit is clear of all (WAR – Write after Read) hazards. The only additional delays here are based on availability of register file ports: in the 6600 a Priority Picker was used to select one result per write port. Once written the unit is marked as no longer busy, and all hazards and state is dropped. Note that only in advanced (augmented, precise) scoreboards with "Shadow" capability will the Write Result phase be prevented (delayed). The original 6600 did not have this capability.

It is critical to note above that Reads only proceed in the absence of write hazards, and that writes proceed in the absence of Read hazards. This is logical but contraindicative to expectations. In particular, note that Writes must wait to write after read in order to give other units the opportunity to read the current value in a register, before overwriting it with the new one. Hence why writes must wait until the absence of WAR hazards.

Data structure

To control the execution of the instructions, the scoreboard maintains three status tables:

The original 6600 algorithm

The detailed algorithm for the scoreboard control, outlined in the original patent, is described below:

function issue(op, dst, src1, src2)     wait until (!Busy[FU] AND !Result[dst]); // FU can be any functional unit that can execute operation op     Busy[FU] ← Yes;     Op[FU] ← op;     Fi[FU] ← dst;     Fj[FU] ← src1;     Fk[FU] ← src2;     Qj[FU] ← Result[src1];     Qk[FU] ← Result[src2];     Rj[FU] ← Qj[FU] == 0;     Rk[FU] ← Qk[FU] == 0;     Result[dst] ← FU;
function read_operands(FU)     wait until (Rj[FU] AND Rk[FU]);     Rj[FU] ← No;     Rk[FU] ← No;
function execute(FU)     // Execute whatever FU must do
function write_back(FU)     wait until (∀f {(Fj[f]≠Fi[FU] OR Rj[f]=No) AND (Fk[f]≠Fi[FU] OR Rk[f]=No)})     foreach f do         if Qj[f]=FU then Rj[f] ← Yes;         if Qk[f]=FU then Rk[f] ← Yes;     Result[Fi[FU]] ← 0; // 0 means no FU generates the register's result     RegFile[Fi[FU]] ← computed value;     Busy[FU] ← No;

Remarks

Thornton's book pre-dates modern computing terminology.

Stalling only occurred at the issue stage, when "First Order" conflicts were detected. Some other techniques like Tomasulo algorithm additionally resolve WAW dependencies with register renaming. The original CDC 6600 likely did not have WAW hazard tracking simply because its designers had to deliver product, and then moved on to the 7600: stalling instead was the most expedient option. There is no technical reason why Register renaming should not be added to Scoreboards.

An analysis of both algorithms was carried out by Luke Leighton and a transformation process outlined which shows equivalence between the Tomasulo algorithm and the 6600 Scoreboard algorithm. [5] WAW hazards resolution is indeed missing from the original algorithm: the 6600 would stall at the first occurrence of a Write Hazard. [6]

See also

Related Research Articles

The control unit (CU) is a component of a computer's central processing unit (CPU) that directs the operation of the processor. A CU typically uses a binary decoder to convert coded instructions into timing and control signals that direct the operation of the other units.

<span class="mw-page-title-main">CDC 6600</span>

The CDC 6600 was the flagship of the 6000 series of mainframe computer systems manufactured by Control Data Corporation. Generally considered to be the first successful supercomputer, it outperformed the industry's prior recordholder, the IBM 7030 Stretch, by a factor of three. With performance of up to three megaFLOPS, the CDC 6600 was the world's fastest computer from 1964 to 1969, when it relinquished that status to its successor, the CDC 7600.

In computer engineering, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps performed by different processor units with different parts of instructions processed in parallel.

A re-order buffer (ROB) is a hardware unit used in an extension to the Tomasulo algorithm to support out-of-order and speculative instruction execution. The extension forces instructions to be committed in-order.

Tomasulo's algorithm is a computer architecture hardware algorithm for dynamic scheduling of instructions that allows out-of-order execution and enables more efficient use of multiple execution units. It was developed by Robert Tomasulo at IBM in 1967 and was first implemented in the IBM System/360 Model 91’s floating point unit.

In the domain of central processing unit (CPU) design, hazards are problems with the instruction pipeline in CPU microarchitectures when the next instruction cannot execute in the following clock cycle, and can potentially lead to incorrect computation results. Three common types of hazards are data hazards, structural hazards, and control hazards.

In the history of computer hardware, some early reduced instruction set computer central processing units used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later the notional CPU DLX invented for education.

In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a particular logical register, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and cannot be referenced directly but only via the canonical names.

<span class="mw-page-title-main">CDC 7600</span>

The CDC 7600 was the Seymour Cray-designed successor to the CDC 6600, extending Control Data's dominance of the supercomputer field into the 1970s. The 7600 ran at 36.4 MHz and had a 65 Kword primary memory using magnetic core and variable-size secondary memory. It was generally about ten times as fast as the CDC 6600 and could deliver about 10 MFLOPS on hand-compiled code, with a peak of 36 MFLOPS. In addition, in benchmark tests in early 1970 it was shown to be slightly faster than its IBM rival, the IBM System/360, Model 195. When the system was released in 1967, it sold for around $5 million in base configurations, and considerably more as options and features were added.

<span class="mw-page-title-main">Intel 8087</span> Floating-point microprocessor made by Intel

The Intel 8087, announced in 1980, was the first floating-point coprocessor for the 8086 line of microprocessors. The purpose of the chip was to speed up floating-point arithmetic operations, such as addition, subtraction, multiplication, division, and square root. It also computes transcendental functions such as exponential, logarithmic or trigonometric calculations. The performance enhancements were from approximately 20% to over 500%, depending on the specific application. The 8087 could perform about 50,000 FLOPS using around 2.4 watts.

<span class="mw-page-title-main">CDC STAR-100</span>

The CDC STAR-100 is a vector supercomputer that was designed, manufactured, and marketed by Control Data Corporation (CDC). It was one of the first machines to use a vector processor to improve performance on appropriate scientific applications. It was also the first supercomputer to use integrated circuits and the first to be equipped with one million words of computer memory.

In computer engineering, out-of-order execution is a paradigm used in most high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

In computer science, instruction scheduling is a compiler optimization used to improve instruction-level parallelism, which improves performance on machines with instruction pipelines. Put more simply, it tries to do the following without changing the meaning of the code:

In computing, a pipeline, also known as a data pipeline, is a set of data processing elements connected in series, where the output of one element is the input of the next one. The elements of a pipeline are often executed in parallel or in time-sliced fashion. Some amount of buffer storage is often inserted between elements.

In compiler theory, dependence analysis produces execution-order constraints between statements/instructions. Broadly speaking, a statement S2 depends on S1 if S1 must be executed before S2. Broadly, there are two classes of dependencies--control dependencies and data dependencies.

<span class="mw-page-title-main">CDC 6000 series</span>

The CDC 6000 series is a discontinued family of mainframe computers manufactured by Control Data Corporation in the 1960s. It consisted of the CDC 6200, CDC 6300, CDC 6400, CDC 6500, CDC 6600 and CDC 6700 computers, which were all extremely rapid and efficient for their time. Each is a large, solid-state, general-purpose, digital computer that performs scientific and business data processing as well as multiprogramming, multiprocessing, Remote Job Entry, time-sharing, and data management tasks under the control of the operating system called SCOPE. By 1970 there also was a time-sharing oriented operating system named KRONOS. They were part of the first generation of supercomputers. The 6600 was the flagship of Control Data's 6000 series.

<span class="mw-page-title-main">Reservation station</span>

A unified reservation station, also known as unified scheduler, is a decentralized feature of the microarchitecture of a CPU that allows for register renaming, and is used by the Tomasulo algorithm for dynamic instruction scheduling.

Memory disambiguation is a set of techniques employed by high-performance out-of-order execution microprocessors that execute memory access instructions out of program order. The mechanisms for performing memory disambiguation, implemented using digital logic inside the microprocessor core, detect true dependencies between memory operations at execution time and allow the processor to recover when a dependence has been violated. They also eliminate spurious memory dependencies and allow for greater instruction-level parallelism by allowing safe out-of-order execution of loads and stores.

Operand forwarding is an optimization in pipelined CPUs to limit performance deficits which occur due to pipeline stalls. A data hazard can lead to a pipeline stall when the current operation has to wait for the results of an earlier operation which has not yet finished.

Latency oriented processor architecture is the microarchitecture of a microprocessor designed to serve a serial computing thread with a low latency. This is typical of most central processing units (CPU) being developed since the 1970s. These architectures, in general, aim to execute as many instructions as possible belonging to a single serial thread, in a given window of time; however, the time to execute a single instruction completely from fetch to retire stages may vary from a few cycles to even a few hundred cycles in some cases. Latency oriented processor architectures are the opposite of throughput-oriented processors which concern themselves more with the total throughput of the system, rather than the service latencies for all individual threads that they work on.

References

  1. Thornton, James E. (1965). "Parallel operation in the control data 6600". Proceedings of the October 27–29, 1964, fall joint computer conference, part II: very high speed computer systems. AFIPS '64. San Francisco, California: ACM. pp. 33–40. doi: 10.1145/1464039.1464045 .
  2. Thornton (1970 , p. 125)
  3. Thornton (1970 , p. 126)
  4. Thornton 1970 , p. 127
  5. Transforming Tomasulo to Scoreboards
  6. Thornton, James (1970). Design of a Computer: The Control Data 6600 (PDF). p. 126. ISBN   9780673059536.