Predication (computer architecture)

Last updated March 20, 2023

In computer architecture, predication is a feature that provides an alternative to conditional transfer of control, as implemented by conditional branch machine instructions. Predication works by having conditional (predicated) non-branch instructions associated with a predicate, a Boolean value used by the instruction to control whether the instruction is allowed to modify the architectural state or not. If the predicate specified in the instruction is true, the instruction modifies the architectural state; otherwise, the architectural state is unchanged. For example, a predicated move instruction (a conditional move) will only modify the destination if the predicate is true. Thus, instead of using a conditional branch to select an instruction or a sequence of instructions to execute based on the predicate that controls whether the branch occurs, the instructions to be executed are associated with that predicate, so that they will be executed, or not executed, based on whether that predicate is true or false.^[1]

Vector processors, some SIMD ISAs (such as AVX2 and AVX-512) and GPUs in general make heavy use of predication, applying one bit of a conditional mask vector to the corresponding elements in the vector registers being processed, whereas scalar predication in scalar instruction sets only need the one predicate bit. Where predicate masks become particularly powerful in vector processing is if an array of condition codes, one per vector element, may feed back into predicate masks that are then applied to subsequent vector instructions.

Overview

Most computer programs contain conditional code, which will be executed only under specific conditions depending on factors that cannot be determined beforehand, for example depending on user input. As the majority of processors simply execute the next instruction in a sequence, the traditional solution is to insert branch instructions that allow a program to conditionally branch to a different section of code, thus changing the next step in the sequence. This was sufficient until designers began improving performance by implementing instruction pipelining, a method which is slowed down by branches. For a more thorough description of the problems which arose, and a popular solution, see branch predictor.

Luckily, one of the more common patterns of code that normally relies on branching has a more elegant solution. Consider the following pseudocode:^[1]

ifcondition{dosomething}else{dosomethingelse}

On a system that uses conditional branching, this might translate to machine instructions looking similar to:^[1]

branch-if-conditiontolabel1dosomethingelsebranch-tolabel2label1:dosomethinglabel2:...

With predication, all possible branch paths are coded inline, but some instructions execute while others do not. The basic idea is that each instruction is associated with a predicate (the word here used similarly to its usage in predicate logic) and that the instruction will only be executed if the predicate is true. The machine code for the above example using predication might look something like this:^[1]

(condition)dosomething(notcondition)dosomethingelse

Besides eliminating branches, less code is needed in total, provided the architecture provides predicated instructions. While this does not guarantee faster execution in general, it will if the dosomething and dosomethingelse blocks of code are short enough.

Predication's simplest form is partial predication, where the architecture has conditional move or conditional select instructions. Conditional move instructions write the contents of one register over another only if the predicate's value is true, whereas conditional select instructions choose which of two registers has its contents written to a third based on the predicate's value. A more generalized and capable form is full predication. Full predication has a set of predicate registers for storing predicates (which allows multiple nested or sequential branches to be simultaneously eliminated) and most instructions in the architecture have a register specifier field to specify which predicate register supplies the predicate.^[2]

Advantages

The main purpose of predication is to avoid jumps over very small sections of program code, increasing the effectiveness of pipelined execution and avoiding problems with the cache. It also has a number of more subtle benefits:

Functions that are traditionally computed using simple arithmetic and bitwise operations may be quicker to compute using predicated instructions.
Predicated instructions with different predicates can be mixed with each other and with unconditional code, allowing better instruction scheduling and so even better performance.
Elimination of unnecessary branch instructions can make the execution of necessary branches, such as those that make up loops, faster by lessening the load on branch prediction mechanisms.
Elimination of the cost of a branch misprediction which can be high on deeply pipelined architectures.
Instruction sets that have comprehensive Condition Codes generated by instructions may reduce code size further by directly using the Condition Registers in or as predication.

Disadvantages

Predication's primary drawback is in increased encoding space. In typical implementations, every instruction reserves a bitfield for the predicate specifying under what conditions that instruction should have an effect. When available memory is limited, as on embedded devices, this space cost can be prohibitive. However, some architectures such as Thumb-2 are able to avoid this issue (see below). Other detriments are the following:^[3]

Predication complicates the hardware by adding levels of logic to critical paths and potentially degrades clock speed.
A predicated block includes cycles for all operations, so shorter paths may take longer and be penalized.
Predication is not usually speculated and causes a longer dependency chain. For ordered data this translates to a performance loss compared to a predictable branch.^[4]

Predication is most effective when paths are balanced or when the longest path is the most frequently executed,^[3] but determining such a path is very difficult at compile time, even in the presence of profiling information.

History

Predicated instructions were popular in European computer designs of the 1950s, including the Mailüfterl (1955), the Zuse Z22 (1955), the ZEBRA (1958), and the Electrologica X1 (1958). The IBM ACS-1 design of 1967 allocated a "skip" bit in its instruction formats, and the CDC Flexible Processor in 1976 allocated three conditional execution bits in its microinstruction formats.

Hewlett-Packard's PA-RISC architecture (1986) had a feature called nullification, which allowed most instructions to be predicated by the previous instruction. IBM's POWER architecture (1990) featured conditional move instructions. POWER's successor, PowerPC (1993), dropped these instructions. Digital Equipment Corporation's Alpha architecture (1992) also featured conditional move instructions. MIPS gained conditional move instructions in 1994 with the MIPS IV version; and SPARC was extended in Version 9 (1994) with conditional move instructions for both integer and floating-point registers.

In the Hewlett-Packard/Intel IA-64 architecture, most instructions are predicated. The predicates are stored in 64 special-purpose predicate registers; and one of the predicate registers is always true so that unpredicated instructions are simply instructions predicated with the value true. The use of predication is essential in IA-64's implementation of software pipelining because it avoids the need for writing separated code for prologs and epilogs.^{[ clarification needed ]}

In the x86 architecture, a family of conditional move instructions (CMOV and FCMOV) were added to the architecture by the Intel Pentium Pro (1995) processor. The CMOV instructions copied the contents of the source register to the destination register depending on a predicate supplied by the value of the flag register.

In the ARM architecture, the original 32-bit instruction set provides a feature called conditional execution that allows most instructions to be predicated by one of 13 predicates that are based on some combination of the four condition codes set by the previous instruction. ARM's Thumb instruction set (1994) dropped conditional execution to reduce the size of instructions so they could fit in 16 bits, but its successor, Thumb-2 (2003) overcame this problem by using a special instruction which has no effect other than to supply predicates for the following four instructions. The 64-bit instruction set introduced in ARMv8-A (2011) replaced conditional execution with conditional selection instructions.

SIMD, SIMT and vector predication

Some SIMD instruction sets, like AVX2, have the ability to use a logical mask to conditionally load/store values to memory, a parallel form of the conditional move, and may also apply individual mask bits to individual arithmetic units executing a parallel operation. The technique is known in Flynn's taxonomy as "associative processing".

This form of predication is also used in vector processors and single instruction, multiple threads GPU computing. All the techniques, advantages and disadvantages of single scalar predication apply just as well to the parallel processing case.

Related Research Articles

A central processing unit (CPU), also called a central processor or main processor, is the most important processor in a given computer. Its electronic circuitry executes instructions of a computer program, such as arithmetic, logic, controlling, and input/output (I/O) operations. This role contrasts with that of external components, such as main memory and I/O circuitry, and specialized coprocessors such as graphics processing units (GPUs).

MIPS is a family of reduced instruction set computer (RISC) instruction set architectures (ISA) developed by MIPS Computer Systems, now MIPS Technologies, based in the United States.

In computer science, an instruction set architecture (ISA), also called computer architecture, is an abstract model of a computer. A device that executes instructions described by that ISA, such as a central processing unit (CPU), is called an implementation.

Very long instruction word (VLIW) refers to instruction set architectures designed to exploit instruction level parallelism (ILP). Whereas conventional central processing units mostly allow programs to specify instructions to execute in sequence only, a VLIW processor allows programs to explicitly specify instructions to execute in parallel. This design is intended to allow higher performance without the complexity inherent in some other designs.

<span class="mw-page-title-main">Single instruction, multiple data</span> Type of parallel processing

Single instruction, multiple data (SIMD) is a type of parallel processing in Flynn's taxonomy. SIMD can be internal and it can be directly accessible through an instruction set architecture (ISA), but it should not be confused with an ISA. SIMD describes computers with multiple processing elements that perform the same operation on multiple data points simultaneously.

In computing, Streaming SIMD Extensions (SSE) is a single instruction, multiple data (SIMD) instruction set extension to the x86 architecture, designed by Intel and introduced in 1999 in their Pentium III series of Central processing units (CPUs) shortly after the appearance of Advanced Micro Devices (AMD's) 3DNow!. SSE contains 70 new instructions, most of which work on single precision floating-point data. SIMD instructions can greatly increase performance when exactly the same operations are to be performed on multiple data objects. Typical applications are digital signal processing and graphics processing.

In computing, a vector processor or array processor is a central processing unit (CPU) that implements an instruction set where its instructions are designed to operate efficiently and effectively on large one-dimensional arrays of data called vectors. This is in contrast to scalar processors, whose instructions operate on single data items only, and in contrast to some of those same scalar processors having additional single instruction, multiple data (SIMD) or SWAR Arithmetic Units. Vector processors can greatly improve performance on certain workloads, notably numerical simulation and similar tasks. Vector processing techniques also operate in video-game console hardware and in graphics accelerators.

Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously. Large problems can often be divided into smaller ones, which can then be solved at the same time. There are several different forms of parallel computing: bit-level, instruction-level, data, and task parallelism. Parallelism has long been employed in high-performance computing, but has gained broader interest due to the physical constraints preventing frequency scaling. As power consumption by computers has become a concern in recent years, parallel computing has become the dominant paradigm in computer architecture, mainly in the form of multi-core processors.

In computer engineering, instruction pipelining is a technique for implementing instruction-level parallelism within a single processor. Pipelining attempts to keep every part of the processor busy with some instruction by dividing incoming instructions into a series of sequential steps performed by different processor units with different parts of instructions processed in parallel.

The CDC Cyber range of mainframe-class supercomputers were the primary products of Control Data Corporation (CDC) during the 1970s and 1980s. In their day, they were the computer architecture of choice for scientific and mathematically intensive computing. They were used for modeling fluid flow, material science stress analysis, electrochemical machining analysis, probabilistic analysis, energy and academic computing, radiation shielding modeling, and other applications. The lineup also included the Cyber 18 and Cyber 1000 minicomputers. Like their predecessor, the CDC 6600, they were unusual in using the ones' complement binary representation.

Addressing modes are an aspect of the instruction set architecture in most central processing unit (CPU) designs. The various addressing modes that are defined in a given instruction set architecture define how the machine language instructions in that architecture identify the operand(s) of each instruction. An addressing mode specifies how to calculate the effective memory address of an operand by using information held in registers and/or constants contained within a machine instruction or elsewhere.

A branch is an instruction in a computer program that can cause a computer to begin executing a different instruction sequence and thus deviate from its default behavior of executing instructions in order. Branch may also refer to the act of switching execution to a different instruction sequence as a result of executing a branch instruction. Branch instructions are used to implement control flow in program loops and conditionals.

In computer engineering, microarchitecture, also called computer organization and sometimes abbreviated as µarch or uarch, is the way a given instruction set architecture (ISA) is implemented in a particular processor. A given ISA may be implemented with different microarchitectures; implementations may vary due to different goals of a given design or due to shifts in technology.

In computer architecture, a transport triggered architecture (TTA) is a kind of processor design in which programs directly control the internal transport buses of a processor. Computation happens as a side effect of data transports: writing data into a triggering port of a functional unit triggers the functional unit to start a computation. This is similar to what happens in a systolic array. Due to its modular structure, TTA is an ideal processor template for application-specific instruction set processors (ASIP) with customized datapath but without the inflexibility and design cost of fixed function hardware accelerators.

Automatic vectorization, in parallel computing, is a special case of automatic parallelization, where a computer program is converted from a scalar implementation, which processes a single pair of operands at a time, to a vector implementation, which processes one operation on multiple pairs of operands at once. For example, modern conventional computers, including specialized supercomputers, typically have vector operations that simultaneously perform operations such as the following four additions :

Advanced Vector Extensions (AVX) are extensions to the x86 instruction set architecture for microprocessors from Intel and Advanced Micro Devices (AMD). They were proposed by Intel in March 2008 and first supported by Intel with the Sandy Bridge processor shipping in Q1 2011 and later by AMD with the Bulldozer processor shipping in Q3 2011. AVX provides new features, new instructions and a new coding scheme.

The XOP instruction set, announced by AMD on May 1, 2009, is an extension to the 128-bit SSE core instructions in the x86 and AMD64 instruction set for the Bulldozer processor core, which was released on October 12, 2011. However AMD removed support for XOP from Zen (microarchitecture) onward.

AVX-512 are 512-bit extensions to the 256-bit Advanced Vector Extensions SIMD instructions for x86 instruction set architecture (ISA) proposed by Intel in July 2013, and implemented in Intel's Xeon Phi x200 and Skylake-X CPUs; this includes the Core-X series, as well as the new Xeon Scalable Processor Family and Xeon D-2100 Embedded Series. AVX-512 consists of multiple extensions that may be implemented independently. This policy is a departure from the historical requirement of implementing the entire instruction block. Only the core extension AVX-512F is required by all AVX-512 implementations.

<span class="mw-page-title-main">AArch64</span> 64-bit extension of the ARM architecture

AArch64 or ARM64 is the 64-bit extension of the ARM architecture family.

RISC-V is an open standard instruction set architecture (ISA) based on established RISC principles. Unlike most other ISA designs, RISC-V is provided under open source licenses that do not require fees to use. A number of companies are offering or have announced RISC-V hardware, open source operating systems with RISC-V support are available, and the instruction set is supported in several popular software toolchains.

References

1 2 3 4 Rick Vinyard (2000-04-26). "Predication". cs.nmsu.edu. Retrieved 2014-04-22.
↑ Mahlke, Scott A.; Hank, Richard E.; McCormick, James E.; August, David I.; Hwn, Wen-mei W. (1995). A Comparison of Full and Partial Predicated Execution Support for ILP Processors. The 22nd International Symposium on Computer Architecture, 22–24 June 1995. CiteSeerX 10.1.1.19.3187 . doi:10.1145/223982.225965. ISBN 0-89791-698-0.
1 2 Fisher, Joseph A.; Faraboschi, Paolo; Young, Cliff (2004). "4.5.2 Predication § Predication in the Embedded Domain". Embedded Computing — A VLIW Approach to Architecture, Compilers, and Tools. Elsevier. p. 172. ISBN 9780080477541.
↑ Cordes, Peter. "assembly - How does Out of Order execution work with conditional instructions, Ex: CMOVcc in Intel or ADDNE (Add not equal) in ARM". Stack Overflow. Unlike with control dependencies (branches), they don't predict or speculate what the flags will be, so a cmovcc instead of a jcc can create a loop-carried dependency chain and end up being worse than a predictable branch. gcc optimization flag -O3 makes code slower than -O2 is an example of that.{{cite web}}: External link in |quote= (help)