Peephole optimization

Last updated May 21, 2024

Peephole optimization is an optimization technique performed on a small set of compiler-generated instructions, known as a peephole or window,^[1] that involves replacing the instructions with a logically equivalent set that has better performance.

Replacements

Peephole optimization replacements include but are not limited to:^[3]

Null sequences – Delete useless operations
Combine operations – Replace several operations with one equivalent
Algebraic laws – Use algebraic laws to simplify or reorder instructions
Special case instructions – Use instructions designed for special operand cases
Address mode operations – Use address modes to simplify code

Implementation

Modern compilers often implement peephole optimizations with a pattern matching algorithm.^[4]

Examples

Replacing slow instructions with faster ones

The following Java bytecode:

aload 1 aload 1 mul

can be replaced with the following which executes faster:

aload 1 dup mul

As for most peephole optimizations, this is based on the relative efficiency of different instructions. In this case, dup (which duplicates and pushes the top of the stack) is known/assumed to be more efficient than aload (which loads a local variable and pushes it onto the stack).

Removing redundant code

The following source code:

 a = b + c;  d = a + e;

is straightforwardly compiled to:

MOVb,R0; Copy b to the registerADDc,R0; Add  c to the register, the register is now b+cMOVR0,a; Copy the register to aMOVa,R0; Copy a to the registerADDe,R0; Add  e to the register, the register is now a+e [(b+c)+e]MOVR0,d; Copy the register to d

but can be optimized to:

MOVb,R0; Copy b to the registerADDc,R0; Add c to the register, which is now b+c (a)MOVR0,a; Copy the register to aADDe,R0; Add e to the register, which is now b+c+e [(a)+e]MOVR0,d; Copy the register to d

Removing redundant stack instructions

If the compiler saves registers on the stack before calling a subroutine and restores them when returning, consecutive calls to subroutines may have redundant stack instructions.

Suppose the compiler generates the following Z80 instructions for each procedure call:

PUSHAFPUSHBCPUSHDEPUSHHLCALL_ADDRPOPHLPOPDEPOPBCPOPAF

If there were two consecutive subroutine calls, they would look like this:

PUSHAFPUSHBCPUSHDEPUSHHLCALL_ADDR1POPHLPOPDEPOPBCPOPAFPUSHAFPUSHBCPUSHDEPUSHHLCALL_ADDR2POPHLPOPDEPOPBCPOPAF

The sequence POP regs followed by PUSH for the same registers is generally redundant. In cases where it is redundant, a peephole optimization would remove these instructions. In the example, this would cause another redundant POP/PUSH pair to appear in the peephole, and these would be removed in turn. Assuming that subroutine _ADDR2 does not depend on previous register values, removing all of the redundant code in the example above would eventually leave the following code:

PUSHAFPUSHBCPUSHDEPUSHHLCALL_ADDR1CALL_ADDR2POPHLPOPDEPOPBCPOPAF

Related Research Articles

In computer programming, assembly language, often referred to simply as assembly and commonly abbreviated as ASM or asm, is any low-level programming language with a very strong correspondence between the instructions in the language and the architecture's machine code instructions. Assembly language usually has one statement per machine instruction (1:1), but constants, comments, assembler directives, symbolic labels of, e.g., memory locations, registers, and macros are generally also supported.

The Intel 8080 ("eighty-eighty") is the second 8-bit microprocessor designed and manufactured by Intel. It first appeared in April 1974 and is an extended and enhanced variant of the earlier 8008 design, although without binary compatibility. The initial specified clock rate or frequency limit was 2 MHz, with common instructions using 4, 5, 7, 10, or 11 cycles. As a result, the processor is able to execute several hundred thousand instructions per second. Two faster variants, the 8080A-1 and 8080A-2, became available later with clock frequency limits of 3.125 MHz and 2.63 MHz respectively. The 8080 needs two support chips to function in most applications: the i8224 clock generator/driver and the i8228 bus controller. It is implemented in N-type metal–oxide–semiconductor logic (NMOS) using non-saturated enhancement mode transistors as loads thus demanding a +12 V and a −5 V voltage in addition to the main transistor–transistor logic (TTL) compatible +5 V.

An optimizing compiler is a compiler designed to generate code that is optimized in aspects such as minimizing program execution time, memory use, storage size, and power consumption.

In computer science, threaded code is a programming technique where the code has a form that essentially consists entirely of calls to subroutines. It is often used in compilers, which may generate code in that form or be implemented in that form themselves. The code may be processed by an interpreter or it may simply be a sequence of machine code call instructions.

In computer science, an instruction set architecture (ISA) is an abstract model that generally defines how software controls the CPU in a computer or a family of computers. A device or program that executes instructions described by that ISA, such as a central processing unit (CPU), is called an implementation of that ISA.

The Intel MCS-51 is a single chip microcontroller (MCU) series developed by Intel in 1980 for use in embedded systems. The architect of the Intel MCS-51 instruction set was John H. Wharton. Intel's original versions were popular in the 1980s and early 1990s, and enhanced binary compatible derivatives remain popular today. It is a complex instruction set computer, but also has some of the features of RISC architectures, such as a large register set and register windows, and has separate memory spaces for program instructions and data.

The Intel 8085 ("eighty-eighty-five") is an 8-bit microprocessor produced by Intel and introduced in March 1976. It is the last 8-bit microprocessor developed by Intel.

In computing, code generation is part of the process chain of a compiler and converts intermediate representation of source code into a form that can be readily executed by the target system.

x86 assembly language is the name for the family of assembly languages which provide some level of backward compatibility with CPUs back to the Intel 8008 microprocessor, which was launched in April 1972. It is used to produce object code for the x86 class of processors.

In computer science, computer engineering and programming language implementations, a stack machine is a computer processor or a virtual machine in which the primary interaction is moving short-lived temporary values to and from a push down stack. In the case of a hardware processor, a hardware stack is used. The use of a stack significantly reduces the required number of processor registers. Stack machines extend push-down automata with additional load/store operations or multiple stacks and hence are Turing-complete.

Loop unrolling, also known as loop unwinding, is a loop transformation technique that attempts to optimize a program's execution speed at the expense of its binary size, which is an approach known as space–time tradeoff. The transformation can be undertaken manually by the programmer or by an optimizing compiler. On modern processors, loop unrolling is often counterproductive, as the increased code size can cause more cache misses; cf. Duff's device.

In computer science, a tail call is a subroutine call performed as the final action of a procedure. If the target of a tail is the same subroutine, the subroutine is said to be tail recursive, which is a special case of direct recursion. Tail recursion is particularly useful, and is often easy to optimize in implementations.

In computer science, a call stack is a stack data structure that stores information about the active subroutines of a computer program. This type of stack is also known as an execution stack, program stack, control stack, run-time stack, or machine stack, and is often shortened to simply "the stack". Although maintenance of the call stack is important for the proper functioning of most software, the details are normally hidden and automatic in high-level programming languages. Many computer instruction sets provide special instructions for manipulating stacks.

In computer science, a calling convention is an implementation-level (low-level) scheme for how subroutines or functions receive parameters from their caller and how they return a result. When some code calls a function, design choices have been taken for where and how parameters are passed to that function, and where and how results are returned from that function, with these transfers typically done via certain registers or within a stack frame on the call stack. There are design choices for how the tasks of preparing for a function call and restoring the environment after the function has completed are divided between the caller and the callee. Some calling convention specifies the way every function should get called. The correct calling convention should be used for every function call, to allow the correct and reliable execution of the whole program using these functions.

This article describes the calling conventions used when programming x86 architecture microprocessors.

<span class="mw-page-title-main">TI-990</span> Series of 16-bit computers by Texas Instruments.

The TI-990 was a series of 16-bit minicomputers sold by Texas Instruments (TI) in the 1970s and 1980s. The TI-990 was a replacement for TI's earlier minicomputer systems, the TI-960 and the TI-980. It had several unique features, and was easier to program than its predecessors.

A stack register is a computer central processor register whose purpose is to keep track of a call stack. On an accumulator-based architecture machine, this may be a dedicated register. On a machine with multiple general-purpose registers, it may be a register that is reserved by convention, such as on the IBM System/360 through z/Architecture architecture and RISC architectures, or it may be a register that procedure call and return instructions are hardwired to use, such as on the PDP-11, VAX, and Intel x86 architectures. Some designs such as the Data General Eclipse had no dedicated register, but used a reserved hardware memory address for this function.

In computing, a compiler is a computer program that transforms source code written in a programming language or computer language, into another computer language. The most common reason for transforming source code is to create an executable program.

The PDP-11 architecture is a 16-bit CISC instruction set architecture (ISA) developed by Digital Equipment Corporation (DEC). It is implemented by central processing units (CPUs) and microprocessors used in PDP-11 minicomputers. It was in wide use during the 1970s, but was eventually overshadowed by the more powerful VAX architecture in the 1980s.

In computer programming, a function, subprogram, procedure, method, routine or subroutine is a callable unit of software logic that has a well-defined interface and behavior and can be invoked multiple times.

References

↑ Muchnick, Steven Stanley (1997-08-15). Advanced Compiler Design and Implementation. Academic Press / Morgan Kaufmann. ISBN 978-1-55860-320-2.
↑ McKeeman, William Marshall (July 1965). "Peephole optimization". Communications of the ACM . 8 (7): 443–444. doi: 10.1145/364995.365000 . S2CID 9529633.
↑ Fischer, Charles N.; Cytron, Ron K.; LeBlanc, Jr., Richard J. (2010). Crafting a Compiler (PDF). Addison-Wesley. ISBN 978-0-13-606705-4. Archived from the original (PDF) on 2018-07-03. Retrieved 2018-07-02.
↑ Aho, Alfred Vaino; Lam, Monica Sin-Ling; Sethi, Ravi; Ullman, Jeffrey David (2007). "Chapter 8.9.2 Code Generation by Tiling an Input Tree". Compilers – Principles, Techniques, & Tools (PDF) (2 ed.). Pearson Education. p. 540. Archived (PDF) from the original on 2018-06-10. Retrieved 2018-07-02.

External links

The dictionary definition of peephole optimization at Wiktionary

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[Muchnick_1997-1] Muchnick, Steven Stanley (1997-08-15). Advanced Compiler Design and Implementation. Academic Press / Morgan Kaufmann. ISBN 978-1-55860-320-2.

[McKeeman_1965-2] McKeeman, William Marshall (July 1965). "Peephole optimization". Communications of the ACM . 8 (7): 443–444. doi: 10.1145/364995.365000 . S2CID 9529633.

[Fischer_2010-3] Fischer, Charles N.; Cytron, Ron K.; LeBlanc, Jr., Richard J. (2010). Crafting a Compiler (PDF). Addison-Wesley. ISBN 978-0-13-606705-4. Archived from the original (PDF) on 2018-07-03. Retrieved 2018-07-02.

[Aho_2007-4] Aho, Alfred Vaino; Lam, Monica Sin-Ling; Sethi, Ravi; Ullman, Jeffrey David (2007). "Chapter 8.9.2 Code Generation by Tiling an Input Tree". Compilers – Principles, Techniques, & Tools (PDF) (2 ed.). Pearson Education. p. 540. Archived (PDF) from the original on 2018-06-10. Retrieved 2018-07-02.

[1]

[2]

[3]

[4]

v t e Compiler optimizations
Basic block	Peephole optimization Local value numbering
Loop	Automatic parallelization Automatic vectorization Induction variable Loop fusion Loop-invariant code motion Loop inversion Loop interchange Loop nest optimization Loop splitting Loop unrolling Loop unswitching Software pipelining Strength reduction
Data-flow analysis	Available expression Common subexpression elimination Constant folding Dead store elimination Induction variable recognition and elimination Live-variable analysis Use-define chain
SSA-based	Global value numbering Sparse conditional constant propagation
Code generation	Instruction scheduling Instruction selection Register allocation Rematerialization
Functional	Deforestation Tail-call elimination
Global	Interprocedural optimization
Other	Bounds-checking elimination Compile-time function execution Dead-code elimination Expression templates Inline expansion Jump threading Partial evaluation Profile-guided optimization
Static analysis	Alias analysis Array-access analysis Control-flow analysis Data-flow analysis Dependence analysis Escape analysis Pointer analysis Shape analysis Value range analysis