Latency oriented processor architecture

Last updated January 30, 2023

Latency oriented processor architecture is the microarchitecture of a microprocessor designed to serve a serial computing thread with low latency. This is typical of most central processing units (CPUs) developed since the 1970s. These architectures generally aim to execute as many instructions as possible from a single serial thread in a given window of time; however, the time to execute a single instruction completely, from the fetch stage to the retire stage, may vary from a few cycles to a few hundred cycles in some cases.^[1] Latency oriented processor architectures are the opposite of throughput-oriented processors, which concern themselves more with the total throughput of the system than with the service latencies of the individual threads they work on.^[2]^[3]

Flynn's taxonomy

Typically, latency oriented processor architectures execute a single task operating on a single data stream, and so they are SISD under Flynn's taxonomy. Latency oriented processor architectures might also include SIMD instruction set extensions such as Intel MMX and SSE; even though these extensions operate on large data sets, their primary goal is to reduce overall latency.^[2]

Implementation techniques

There are many architectural techniques for reducing the overall latency of a single computing task. These typically involve adding hardware to the pipeline so that instructions can be served as soon as they are fetched from memory or the instruction cache. A notable characteristic of these architectures is that a significant area of the chip is devoted to parts other than the execution units themselves, because the intent is to bring down the time required to complete a 'typical' task in a computing environment. A typical computing task is a serial set of instructions with a high dependency on results produced by earlier instructions of the same task. Hence, the microprocessor spends much of its time on work other than the calculations required by the individual instructions themselves. If the hazards encountered during computation are not resolved quickly, latency for the thread increases: hazards stall the execution of subsequent instructions and, depending on the pipeline implementation, may either stall progress completely until the dependency is resolved or lead to an avalanche of further hazards in later instructions, further increasing execution time for the thread.^[4]^[5]

The design space of micro-architectural techniques is very large. Below are some of the most commonly employed techniques to reduce the overall latency for a thread.

Instruction set architecture (ISA)

Most architectures today use shorter and simpler instructions, as in a load/store architecture, which helps in optimizing the instruction pipeline for faster execution. Instructions are usually all the same size, which also helps in optimizing the instruction fetch logic. Such an ISA is called a RISC architecture.^[6]

Instruction pipelining

Pipelining overlaps the execution of multiple instructions from the same thread in order to increase clock frequency or the number of instructions that complete per unit time, thereby reducing the overall execution time for the thread. Instead of waiting for a single instruction to complete all its execution stages, the pipeline processes multiple instructions simultaneously, each at its own stage.^{[lower-alpha 1]}
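The effect of overlapping instructions can be illustrated with a small cycle count. The sketch below assumes an idealized pipeline with no stalls, in which every stage takes one cycle; the function names and the figures of 100 instructions and 5 stages are illustrative and do not come from any particular processor.

```python
def unpipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Each instruction passes through every stage before the next one starts."""
    return num_instructions * num_stages


def pipelined_cycles(num_instructions: int, num_stages: int) -> int:
    """Idealized pipeline (assumption: no stalls, one cycle per stage).

    The first instruction needs num_stages cycles to travel through the
    pipeline; after that, one instruction completes every cycle.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)


if __name__ == "__main__":
    n, k = 100, 5  # illustrative values
    print(unpipelined_cycles(n, k))  # 500
    print(pipelined_cycles(n, k))    # 104
```

Under these assumptions each individual instruction still takes 5 cycles from fetch to completion; what shrinks is the total time for the thread.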

Register-renaming

This technique effectively makes the register file larger than the one the ISA exposes to programmers, and eliminates false dependencies. Suppose two consecutive instructions reference the same register: the first reads the register while the second writes to it. To maintain correctness of the program, the second instruction must not write to the register before the first has read its original value. This is an example of a write-after-read (WAR) dependency. To eliminate it, the pipeline 'renames' the destination register of the second instruction by assigning it an internal register. The instruction can then execute, and its result is immediately available to all subsequent instructions, even though the actual destination register intended by the program is written later. Similarly, if both instructions simply write to the same register, a write-after-write (WAW) dependency, the pipeline renames them and ensures that their results are available to later instructions without serializing their execution.^{[lower-alpha 2]}
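A minimal sketch of the renaming step follows. It assumes a simplified model in which every destination receives a fresh internal ("physical") register and registers are never reclaimed; the `rename` function, the `p0`, `p1`, ... naming, and the four-instruction program are hypothetical and chosen only to show a WAR and a WAW dependency disappearing.

```python
from itertools import count


def rename(instructions, arch_regs):
    """Rename destinations so that only true (read-after-write) dependencies remain.

    instructions: list of (dest, [sources]) using architectural register names.
    Returns the same program expressed with physical register names.
    Simplification: physical registers are never freed or reused.
    """
    fresh = count()
    # Initially every architectural register maps to its own physical register.
    mapping = {reg: f"p{next(fresh)}" for reg in arch_regs}
    renamed = []
    for dest, sources in instructions:
        # Sources read whichever physical register currently holds the value.
        phys_sources = [mapping[src] for src in sources]
        # The destination always gets a fresh physical register.
        mapping[dest] = f"p{next(fresh)}"
        renamed.append((mapping[dest], phys_sources))
    return renamed


program = [
    ("r3", ["r1", "r2"]),  # reads r1
    ("r1", ["r4", "r5"]),  # writes r1: WAR with the instruction above
    ("r1", ["r6", "r6"]),  # writes r1 again: WAW with the instruction above
    ("r7", ["r1", "r3"]),  # true dependency on the latest r1 and on r3
]

if __name__ == "__main__":
    regs = [f"r{i}" for i in range(1, 8)]
    for dest, sources in rename(program, regs):
        print(dest, sources)
```

After renaming, the two writes to `r1` target different physical registers (`p8` and `p9` in this run), so neither has to wait for the earlier read or write, while the last instruction still reads the most recent value.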

Memory organization

The different levels of memory, which include caches, main memory and non-volatile storage such as hard disks (where the program instructions and data reside), are designed to exploit spatial locality and temporal locality to reduce the total memory access time. The less time the processor spends waiting for data to be fetched from memory, the fewer instructions occupy pipeline resources while sitting idle and doing no useful work. The instruction pipeline stalls completely if all its internal buffers (for example, reservation stations) are filled to capacity. Hence, if instructions spend fewer idle cycles inside the pipeline, there is a greater chance of exploiting instruction-level parallelism (ILP), because the fetch logic can pull in more instructions from the cache or memory per unit time.^{[lower-alpha 3]}
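The benefit of a cache hierarchy is commonly summarized with the average memory access time, computed level by level as hit time plus miss rate times the cost of going to the next level. The sketch below applies that formula; the hit times, local miss rates and memory latency are made-up illustrative numbers, and `average_access_time` is a hypothetical helper name.

```python
def average_access_time(levels, memory_latency):
    """Average access time in cycles for a cache hierarchy.

    levels: list of (hit_time, local_miss_rate), closest cache first.
    memory_latency: cycles to reach main memory after the last cache misses.
    """
    amat = memory_latency
    # Work from the level nearest to memory back towards the processor.
    for hit_time, miss_rate in reversed(levels):
        amat = hit_time + miss_rate * amat
    return amat


if __name__ == "__main__":
    # Illustrative numbers only: L1 = 1 cycle, 5% misses; L2 = 10 cycles, 20% misses.
    hierarchy = [(1, 0.05), (10, 0.20)]
    print(average_access_time(hierarchy, memory_latency=200))  # about 3.5
    print(average_access_time([], memory_latency=200))         # 200 with no caches
```

With these assumed figures, the two cache levels cut the average access from 200 cycles to roughly 3.5 cycles, which is the kind of reduction that keeps pipeline buffers from filling with waiting instructions.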

Speculative execution

A major cause of pipeline stalls is control flow dependencies, i.e. cases where the outcome of a branch instruction is not known in advance (which is usually the case). Many architectures today use branch predictor components to guess the outcome of a branch. Execution continues along the predicted path of the program, but the instructions are tagged as speculative. If the guess turns out to be correct, the instructions are allowed to complete and to write their results back to the register file or memory. If the guess was incorrect, all speculative instructions are flushed from the pipeline and execution restarts along the correct path of the program. By maintaining a high prediction accuracy, the pipeline can significantly increase throughput for the executing thread.^{[lower-alpha 4]}
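One common textbook form of branch predictor is a two-bit saturating counter. The sketch below is an illustration of that scheme, not a description of any specific processor; the class name, the starting state and the loop-shaped branch history are assumptions made for the example.

```python
class TwoBitPredictor:
    """Two-bit saturating counter: states 0-1 predict not taken, 2-3 predict taken."""

    def __init__(self, state: int = 2):
        self.state = state  # assumed starting state: weakly taken

    def predict(self) -> bool:
        return self.state >= 2

    def update(self, taken: bool) -> None:
        # Move one step towards the observed outcome, saturating at 0 and 3.
        if taken:
            self.state = min(3, self.state + 1)
        else:
            self.state = max(0, self.state - 1)


def accuracy(outcomes) -> float:
    """Fraction of branch outcomes that the predictor guesses correctly."""
    predictor = TwoBitPredictor()
    correct = 0
    for taken in outcomes:
        if predictor.predict() == taken:
            correct += 1
        predictor.update(taken)
    return correct / len(outcomes)


if __name__ == "__main__":
    # A loop branch: taken 9 times, then not taken on exit, repeated 10 times.
    history = ([True] * 9 + [False]) * 10
    print(accuracy(history))  # 0.9
```

In this history only the loop-exit branch is mispredicted, so one in ten predictions causes the speculative instructions to be flushed.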

Out-of-order execution

Not all instructions in a thread take the same amount of time to execute. Superscalar pipelines usually have multiple possible paths for instructions depending upon current state and the instruction type itself. Hence, to increase instructions per cycle (IPC) the pipeline allows execution of instructions out-of-order so that instructions later in the program are not stalled due to an instruction which will take longer to complete. All instructions are registered in a re-order buffer when they are fetched by the pipeline and allowed to retire (i.e. write back their results) in the order of the original program so as to maintain correctness. ^{[lower-alpha 5]}
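The difference between in-order and out-of-order issue, with in-order retirement in both cases, can be sketched with a toy timing model. The model assumes one instruction is fetched per cycle, execution units are unlimited, and any number of instructions may retire in the same cycle; the `schedule` function and the four-instruction program with its latencies are hypothetical.

```python
def schedule(instructions, in_order):
    """Toy timing model returning (finish_cycles, retire_cycles).

    instructions: list of (latency, deps), where deps are indices of earlier
    instructions whose results are needed.
    in_order: if True, an instruction cannot start before its predecessor starts.
    """
    start, finish, retire = [], [], []
    for i, (latency, deps) in enumerate(instructions):
        # Operands are ready once every producer has finished.
        ready = max((finish[d] for d in deps), default=0)
        earliest = i  # instruction i is fetched in cycle i
        if in_order and i > 0:
            earliest = max(earliest, start[i - 1] + 1)
        begin = max(earliest, ready)
        start.append(begin)
        finish.append(begin + latency)
        # Retirement follows program order, as with a re-order buffer.
        previous_retire = retire[-1] if retire else 0
        retire.append(max(finish[i], previous_retire))
    return finish, retire


program = [
    (10, []),   # 0: long-latency load
    (1, [0]),   # 1: uses the load result
    (3, []),    # 2: independent multiply
    (1, [2]),   # 3: uses the multiply result
]

if __name__ == "__main__":
    print(schedule(program, in_order=True))   # ([10, 11, 14, 15], [10, 11, 14, 15])
    print(schedule(program, in_order=False))  # ([10, 11, 5, 6], [10, 11, 11, 11])
```

In the out-of-order run the independent multiply and its consumer finish while the load is still outstanding, yet they retire only after the older instructions, so the thread completes in 11 cycles instead of 15 under these assumptions.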

Superscalar execution

A superscalar instruction pipeline pulls in multiple instructions in every clock cycle, as opposed to a simple scalar pipeline, which pulls in one. This multiplies the available instruction-level parallelism (ILP) by up to the number of instructions fetched in each cycle, except when the pipeline is stalled by data or control flow dependencies. Even though the retire rate of a superscalar pipeline is usually lower than its fetch rate, the number of instructions executed per cycle (> 1) is generally greater than that of a scalar pipeline.^{[lower-alpha 6]}

Contrast with throughput oriented processor architectures

In contrast, a throughput oriented processor architecture is designed to maximize the amount of 'useful work' done in a significant window of time. Useful work refers to large calculations on a significant amount of data. Such processors parallelize the workload so that many calculations can be performed simultaneously. The calculations may belong to a single task or to a limited number of tasks. The total time required to complete a single calculation is significantly larger than on a latency oriented processor architecture, but the total time to complete a large set of calculations is significantly reduced. Latency is often sacrificed in order to achieve a higher throughput per cycle.^[3] As a result, a latency oriented processor may complete a single calculation significantly faster than a throughput-oriented processor; however, the throughput-oriented processor could be partway through hundreds of such computations by the time the latency oriented processor completes one.^[2]
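The trade-off can be made concrete with simple arithmetic. The sketch below assumes that a processor keeps a fixed number of calculations in flight and that each takes a fixed time; the helper `time_to_finish` and all numbers (1 time unit versus 10, 1 calculation in flight versus 1000) are invented for illustration.

```python
import math


def time_to_finish(num_tasks: int, latency_per_task: int, tasks_in_flight: int) -> int:
    """Time to finish num_tasks when tasks_in_flight run concurrently.

    Assumption: tasks are processed in full batches, each batch taking
    latency_per_task time units.
    """
    batches = math.ceil(num_tasks / tasks_in_flight)
    return batches * latency_per_task


if __name__ == "__main__":
    # Illustrative figures: a fast serial core versus a slow but wide processor.
    latency_core = dict(latency_per_task=1, tasks_in_flight=1)
    throughput_proc = dict(latency_per_task=10, tasks_in_flight=1000)

    print(time_to_finish(1, **latency_core))          # 1
    print(time_to_finish(1, **throughput_proc))       # 10
    print(time_to_finish(10_000, **latency_core))     # 10000
    print(time_to_finish(10_000, **throughput_proc))  # 100
```

With these assumed figures the latency oriented core wins by a factor of ten on a single calculation, while the throughput oriented processor wins by a factor of one hundred on ten thousand of them.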

Latency oriented processors devote substantial chip area to sophisticated control structures such as branch prediction, data forwarding, re-order buffers, large register files and caches in each processor. These structures help reduce operational latency and memory-access time per instruction, and make results available as soon as possible. Throughput oriented architectures, on the other hand, usually have a multitude of processors with much smaller caches and simpler control logic. This helps to use the memory bandwidth efficiently and to increase the total number of execution units on the same chip area.^[3]

GPUs are a typical example of throughput oriented processor architectures.

Notes

[lower-alpha 1] Computer Organization and Design: The Hardware/software Interface, Chapter 4^[5]
[lower-alpha 2] Computer Architecture: A Quantitative Approach, Section 3.1^[4]
[lower-alpha 3] Computer Organization and Design: The Hardware/software Interface, Chapter 5^[5]
[lower-alpha 4] Computer Architecture: A Quantitative Approach, Section 3.3^[4]
[lower-alpha 5] Computer Architecture: A Quantitative Approach, Sections 3.4, 3.5^[4]
[lower-alpha 6] Computer Architecture: A Quantitative Approach, Sections 3.6-3.8^[4]


References

[1] John Paul Shen; Mikko H. Lipasti (2013). Modern Processor Design. McGraw-Hill Professional. ISBN 978-1478607830.
[2] Yan Solihin (2016). Fundamentals of Parallel Multicore Architecture. Chapman & Hall/CRC Computational Science. ISBN 978-1482211184.
[3] Michael Garland; David B. Kirk (2010). "Understanding Throughput-Oriented Architectures". Communications of the ACM. 53 (11): 58–66. doi:10.1145/1839676.1839694.
[4] John L. Hennessy; David A. Patterson (2013). Computer Architecture: A Quantitative Approach (Fifth ed.). Morgan Kaufmann Publishers. ISBN 978-0123838728.
[5] David A. Patterson; John L. Hennessy (2013). Computer Organization and Design: The Hardware/software Interface (Fifth ed.). Morgan Kaufmann Publishers. ISBN 978-0124078864.
[6] Bhandarkar, Dileep; Clark, Douglas W. (1 January 1991). Performance from Architecture: Comparing a RISC and a CISC with Similar Hardware Organization. Proceedings of the Fourth International Conference on Architectural Support for Programming Languages and Operating Systems. ACM. pp. 310–319. doi:10.1145/106972.107003. ISBN 0897913809.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
