Delay slot

Last updated

In computer architecture, a delay slot is an instruction slot being executed without the effects of a preceding instruction. [1] The most common form is a single arbitrary instruction located immediately after a branch instruction on a RISC or DSP architecture; this instruction will execute even if the preceding branch is taken. This makes the instruction execute out-of-order compared to its location in the original assembler language code.

Contents

Modern processor designs generally do not use delay slots, and instead perform ever more complex forms of branch prediction. In these systems, the CPU immediately moves on to what it believes will be the correct side of the branch and thereby eliminates the need for the code to specify some unrelated instruction, which may not always be obvious at compile-time. If the assumption is wrong, and the other side of the branch has to be called, this can introduce a lengthy delay. This occurs rarely enough that the speed up of avoiding the delay slot is easily made up by the smaller number of wrong decisions.

Pipelining

A central processing unit generally performs instructions from the machine code using a four-step process; the instruction is first read from memory, then decoded to understand what needs to be performed, those actions are then executed, and finally, any results are written back to memory. In early designs, each of these stages was performed in series, so that instructions took some multiple of the machine's clock cycle to complete. For instance, in the Zilog Z80, the minimum number of clocks needed to complete an instruction was four, but could be as many as 23 clocks for some (rare) instructions. [2]

At any given stage of the instruction's processing, only one part of the chip is involved. For instance, during the execution stage, typically only the arithmetic logic unit (ALU) is active, while other units, like those that interact with main memory or decode the instruction, are idle. One way to improve the overall performance of a computer is through the use of an instruction pipeline. This adds some additional circuitry to hold the intermediate states of the instruction as it flows through the units. While this does not improve the cycle timing of any single instruction, the idea is to allow a second instruction to use the other CPU sub-units when the previous instruction has moved on. [3]

For instance, while one instruction is using the ALU, the next instruction from the program can be in the decoder, and a third can be fetched from memory. In this assembly line type arrangement, the total number of instructions processed at any time can be improved by up to the number of pipeline stages. In the Z80, for example, a four-stage pipeline could improve overall throughput by four times. However, due to the complexity of the instruction timing, this would not be easy to implement. The much simpler instruction set architecture (ISA) of the MOS 6502 allowed a two-stage pipeline to be included, which gave it performance that was about double that of the Z80 at any given clock speed. [4]

Branching problems

A major issue with the implementation of pipelines in early systems was that instructions had widely varying cycle counts. For instance, the instruction to add two values would often be offered in multiple versions, or opcodes , which varied on where they read in the data. One version of add might take the value found in one processor register and add it to the value in another, another version might add the value found in memory to a register, while another might add the value in one memory location to another memory location. Each of these instructions takes a different amount of bytes to represent it in memory, meaning they take different amounts of time to fetch, may require multiple trips through the memory interface to gather values, etc. This greatly complicates the pipeline logic. One of the goals of the RISC chip design concept was to remove these variants so that the pipeline logic was simplified, which leads to the classic RISC pipeline which completes one instruction every cycle.

However, there is one problem that comes up in pipeline systems that can slow performance. This occurs when the next instruction may change depending on the results of the last. In most systems, this happens when a branch occurs. For instance, consider the following pseudo-code:

 top:    read a number from memory and store it in a register    read another number and store it in a different register    add the two numbers into a third register    write the result to memory    read a number from memory and store it in another register    ...

In this case, the program is linear and can be easily pipelined. As soon as the first read instruction has been read and is being decoded, the second read instruction can be read from memory. When the first moves to execute, the add is being read from memory while the second read is decoding, and so forth. Although it still takes the same number of cycles to complete the first read, by the time it is complete the value from the second is ready and the CPU can immediately add them. In a non-pipelined processor the first four instructions will take 16 cycles to complete, in a pipelined one, it takes only five.

Now consider what occurs when a branch is added:

 top:    read a number from memory and store it in a register    read another number and store it in a different register    add the two numbers into a third register    if the result in the 3rd register is greater than 1000, then go back to top:    (if it is not) write the result to memory    read a number from memory and store it in another register    ...

In this example the outcome of the comparison on line four will cause the "next instruction" to change; sometimes it will be the following write to memory, and sometimes it will be the read from memory at the top. The processor's pipeline will normally have already read the next instruction, the write, by the time the ALU has calculated which path it will take. This is known as a branch hazard. If it has to return to the top, the write instruction has to be discarded and the read instruction read from memory instead. That takes one full instruction cycle, at a minimum, and results in the pipeline being empty for at least one instruction's time. This is known as a "pipeline stall" or "bubble", and, depending on the number of branches in the code, can have a noticeable impact on overall performance.

Branch delay slots

One strategy for dealing with this problem is to use a delay slot, which refers to the instruction slot after any instruction that needs more time to complete. In the examples above, the instruction that requires more time is the branch, which is by far the most common type of delay slot, and these are more commonly referred to as a branch delay slot.

In early implementations, the instruction following the branch would be filled with a no-operation, or NOP, simply to fill out the pipeline to ensure the timing was right such that by the time the NOP had been loaded from memory the branch was complete and the program counter could be updated with the correct value. This simple solution wastes the processing time available. More advanced solutions would instead try to identify another instruction, typically nearby in the code, to place in the delay slot so that useful work would be accomplished.

In the examples above, the read instruction at the end is completely independent, it does not rely on any other information and can be performed at any time. This makes it suitable for placement in the branch delay slot. Normally this would be handled automatically by the assembler program or compiler, which would re-order the instructions:

 read a number from memory and store it in a register  read another number and store it in a different register  add the two numbers into a third register  if the result in the 3rd register is greater than 1000, then go back to the top  read a number from memory and store it in another register  (if it is not) write the result to memory   ...

Now when the branch is executing, it goes ahead and performs the next instruction. By the time that instruction is read into the processor and starts to decode, the result of the comparison is ready and the processor can now decide which instruction to read next, the read at the top or the write at the bottom. This prevents any wasted time and keeps the pipeline full at all times.

Finding an instruction to fill the slot can be difficult. The compilers generally have a limited "window" to examine and may not find a suitable instruction in that range of code. Moreover, the instruction cannot rely on any of the data within the branch; if an add instruction takes a previous calculation as one of its inputs, that input cannot be part of the code in a branch that might be taken. Deciding if this is true can be very complex in the presence of register renaming, in which the processor may place data in registers other than what the code specifies without the compiler being aware of this.

Another side effect is that special handling is needed when managing breakpoints on instructions as well as stepping while debugging within the branch delay slot. An interrupt is unable to occur during a branch delay slot and is deferred until after the branch delay slot. [5] [6] Placing branch instruction in the branch delay slot is prohibited or deprecated. [7] [8] [9]

The ideal number of branch delay slots in a particular pipeline implementation is dictated by the number of pipeline stages, the presence of register forwarding, what stage of the pipeline the branch conditions are computed, whether or not a branch target buffer (BTB) is used and many other factors. Software compatibility requirements dictate that an architecture may not change the number of delay slots from one generation to the next. This inevitably requires that newer hardware implementations contain extra hardware to ensure that the architectural behaviour is followed despite no longer being relevant.

Implementations

Branch delay slots are found mainly in DSP architectures and older RISC architectures. MIPS, PA-RISC (delayed or non-delayed branch can be specified), [10] ETRAX CRIS, SuperH (unconditional branch instructions have one delay slot), [11] Am29000, [12] Intel i860 (unconditional branch instructions have one delay slot), [13] MC88000 (delayed or non-delayed branch can be specified), [14] and SPARC are RISC architectures that each have a single branch delay slot; PowerPC, ARM, Alpha, V850, and RISC-V do not have any. DSP architectures that each have a single branch delay slot include μPD77230 [15] and the VS DSP. The SHARC DSP and MIPS-X use a double branch delay slot; [16] such a processor will execute a pair of instructions following a branch instruction before the branch takes effect. Both TMS320C3x [17] and TMS320C4x [8] use a triple branch delay slot. The TMS320C4x has both non-delayed and delayed branches. [8]

The following example shows delayed branches in assembly language for the SHARC DSP including a pair after the RTS instruction. Registers R0 through R9 are cleared to zero in order by number (the register cleared after R6 is R7, not R9). No instruction executes more than once.

     R0 = 0;      CALL fn (DB);      /* call a function, below at label "fn" */      R1 = 0;            /* first delay slot */      R2 = 0;            /* second delay slot */      /***** discontinuity here (the CALL takes effect) *****/       R6 = 0;            /* the CALL/RTS comes back here, not at "R1 = 0" */      JUMP end (DB);      R7 = 0;            /* first delay slot */      R8 = 0;            /* second delay slot */      /***** discontinuity here (the JUMP takes effect) *****/       /* next 4 instructions are called from above, as function "fn" */ fn:  R3 = 0;      RTS (DB);          /* return to caller, past the caller's delay slots */      R4 = 0;            /* first delay slot */      R5 = 0;            /* second delay slot */      /***** discontinuity here (the RTS takes effect) *****/  end: R9 = 0; 


Load delay slot

A load delay slot is an instruction which executes immediately after a load (of a register from memory) but does not see, and need not wait for, the result of the load. Load delay slots are very uncommon because load delays are highly unpredictable on modern hardware. A load may be satisfied from RAM or from a cache, and may be slowed by resource contention. Load delays were seen on very early RISC processor designs. The MIPS I ISA (implemented in the R2000 and R3000 microprocessors) suffers from this problem.

The following example is MIPS I assembly code, showing both a load delay slot and a branch delay slot.

lwv0,4(v1)# load word from address v1+4 into v0nop# wasted load delay slotjrv0# jump to the address specified by v0nop# wasted branch delay slot

See also

Related Research Articles

MIPS is a family of reduced instruction set computer (RISC) instruction set architectures (ISA) developed by MIPS Computer Systems, now MIPS Technologies, based in the United States.

<span class="mw-page-title-main">Reduced instruction set computer</span> Processor executing one instruction in minimal clock cycles

In electronics and computer science, a reduced instruction set computer (RISC) is a computer architecture designed to simplify the individual instructions given to the computer to accomplish tasks. Compared to the instructions given to a complex instruction set computer (CISC), a RISC computer might require more instructions in order to accomplish a task because the individual instructions are written in simpler code. The goal is to offset the need to process more instructions by increasing the speed of each instruction, in particular by implementing an instruction pipeline, which may be simpler to achieve given simpler instructions.

In computer science, an instruction set architecture (ISA) is an abstract model that generally defines how software controls the CPU in a computer or a family of computers. A device or program that executes instructions described by that ISA, such as a central processing unit (CPU), is called an implementation of that ISA.

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SWAR Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.

SuperH is a 32-bit reduced instruction set computing (RISC) instruction set architecture (ISA) developed by Hitachi and currently produced by Renesas. It is implemented by microcontrollers and microprocessors for embedded systems.

In computer architecture, predication is a feature that provides an alternative to conditional transfer of control, as implemented by conditional branch machine instructions. Predication works by having conditional (predicated) non-branch instructions associated with a predicate, a Boolean value used by the instruction to control whether the instruction is allowed to modify the architectural state or not. If the predicate specified in the instruction is true, the instruction modifies the architectural state; otherwise, the architectural state is unchanged. For example, a predicated move instruction will only modify the destination if the predicate is true. Thus, instead of using a conditional branch to select an instruction or a sequence of instructions to execute based on the predicate that controls whether the branch occurs, the instructions to be executed are associated with that predicate, so that they will be executed, or not executed, based on whether that predicate is true or false.

In computer engineering, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps performed by different processor units with different parts of instructions processed in parallel.

In the domain of central processing unit (CPU) design, hazards are problems with the instruction pipeline in CPU microarchitectures when the next instruction cannot execute in the following clock cycle, and can potentially lead to incorrect computation results. Three common types of hazards are data hazards, structural hazards, and control hazards.

In the history of computer hardware, some early reduced instruction set computer central processing units used a very similar architectural solution, now called a classic RISC pipeline. Those CPUs were: MIPS, SPARC, Motorola 88000, and later the notional CPU DLX invented for education.

In computer architecture, register renaming is a technique that abstracts logical registers from physical registers. Every logical register has a set of physical registers associated with it. When a machine language instruction refers to a particular logical register, the processor transposes this name to one specific physical register on the fly. The physical registers are opaque and cannot be referenced directly but only via the canonical names.

The DLX is a RISC processor architecture designed by John L. Hennessy and David A. Patterson, the principal designers of the Stanford MIPS and the Berkeley RISC designs (respectively), the two benchmark examples of RISC design.

Addressing modes are an aspect of the instruction set architecture in most central processing unit (CPU) designs. The various addressing modes that are defined in a given instruction set architecture define how the machine language instructions in that architecture identify the operand(s) of each instruction. An addressing mode specifies how to calculate the effective memory address of an operand by using information held in registers and/or constants contained within a machine instruction or elsewhere.

In computer engineering, out-of-order execution is a paradigm used in high-performance central processing units to make use of instruction cycles that would otherwise be wasted. In this paradigm, a processor executes instructions in an order governed by the availability of input data and execution units, rather than by their original order in a program. In doing so, the processor can avoid being idle while waiting for the preceding instruction to complete and can, in the meantime, process the next instructions that are able to run immediately and independently.

A branch is an instruction in a computer program that can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order. Branch may also refer to the act of switching execution to a different instruction sequence as a result of executing a branch instruction. Branch instructions are used to implement control flow in program loops and conditionals.

<span class="mw-page-title-main">AMD Am29000</span> Family of RISC microprocessors and microcontrollers

The AMD Am29000, commonly shortened to 29k, is a family of 32-bit RISC microprocessors and microcontrollers developed and fabricated by Advanced Micro Devices (AMD). Based on the seminal Berkeley RISC, the 29k added a number of significant improvements. They were, for a time, the most popular RISC chips on the market, widely used in laser printers from a variety of manufacturers.

Berkeley RISC is one of two seminal research projects into reduced instruction set computer (RISC) based microprocessor design taking place under the Defense Advanced Research Projects Agency VLSI Project. RISC was led by David Patterson at the University of California, Berkeley between 1980 and 1984. The other project took place a short distance away at Stanford University under their MIPS effort starting in 1981 and running until 1984.

<span class="mw-page-title-main">Microarchitecture</span> Component of computer engineering

In electronics, computer science and computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as μarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

<span class="mw-page-title-main">R10000</span> MIPS microprocessor

The R10000, code-named "T5", is a RISC microprocessor implementation of the MIPS IV instruction set architecture (ISA) developed by MIPS Technologies, Inc. (MTI), then a division of Silicon Graphics, Inc. (SGI). The chief designers are Chris Rowen and Kenneth C. Yeager. The R10000 microarchitecture is known as ANDES, an abbreviation for Architecture with Non-sequential Dynamic Execution Scheduling. The R10000 largely replaces the R8000 in the high-end and the R4400 elsewhere. MTI was a fabless semiconductor company; the R10000 was fabricated by NEC and Toshiba. Previous fabricators of MIPS microprocessors such as Integrated Device Technology (IDT) and three others did not fabricate the R10000 as it was more expensive to do so than the R4000 and R4400.

In the design of pipelined computer processors, a pipeline stall is a delay in execution of an instruction in order to resolve a hazard.

RISC-V is an open standard instruction set architecture (ISA) based on established reduced instruction set computer (RISC) principles. Unlike most other ISA designs, RISC-V is provided under royalty-free open-source licenses. Many companies are offering or have announced RISC-V hardware; open source operating systems with RISC-V support are available, and the instruction set is supported in several popular software toolchains.

References

  1. A.Patterson, David; L.Hennessy, John (1990). Computer Archtecture A Quantitative Approach. Morgan Kaufmann Publishers. p. 275. ISBN   1-55860-069-8.
  2. "MSX Assembly Page".
  3. "CMSC 411 Lecture 19, Pipelining Data Forwarding". University of Maryland Baltimore County Computer Science and Electrical Engineering Department. Retrieved 2020-01-22.
  4. Cox, Russ (3 January 2011). "The MOS 6502 and the Best Layout Guy in the World".
  5. "μPD77230 Advanced Signal Processor" (PDF). pp. 38(3-39), 70(3-41). Retrieved 2023-11-17.
  6. "TMS320C4x User's Guide" (PDF). p. 75(3-15). Retrieved 2023-12-02.
  7. "μPD77230 Advanced Signal Processor" (PDF). p. 191(4-76). Retrieved 2023-10-28.
  8. 1 2 3 "TMS320C4x User's Guide" (PDF). p. 171(7-9). Retrieved 2023-10-29.
  9. "MC88100 RISC Microprocessor User's Manual" (PDF). p. 88(3-33). Retrieved 2023-12-30.
  10. DeRosa, John A.; Levy, Henry M. "An Evaluation of Branch Architectures". p. 1. Retrieved 2024-01-27.
  11. "SH7020 and SH7021 Hardware ManualSuperH™ RISC engine". p. 42,70. Retrieved 2023-12-17.
  12. "Evaluating and Programming the 29K RISC Family Third Edition – DRAFT" (PDF). p. 54. Retrieved 2023-12-20.
  13. "i860™ 64-bit Microprocessor Programmer's Reference Manual" (PDF). p. 70(5-11). Retrieved 2023-12-21.
  14. "MC88100 RISC Microprocessor User's Manual" (PDF). p. 81(3-26). Retrieved 2023-12-21.
  15. "μPD77230 Advanced Signal Processor" (PDF). p. 191(4-76). Retrieved 2023-11-05.
  16. "MIPS-X Instruction Set and Programmer's Manual" (PDF). p. 18. Retrieved 2023-12-03.
  17. "The TMS320C30 Floating-Point Digital Signal Processor" (PDF). ti.com. p. 14. Retrieved 2023-11-04.