Multi-pass compiler

Last updated

A multi-pass compiler is a type of compiler that processes the source code or abstract syntax tree of a program several times. This is in contrast to a one-pass compiler, which traverses the program only once. Each pass takes the result of the previous pass as the input, and creates an intermediate output. In this way, the (intermediate) code is improved pass by pass, until the final pass produces the final code.

Contents

Multi-pass compilers are sometimes called wide compilers, [1] referring to the greater scope of the passes: they can "see" the entire program being compiled, instead of just a small portion of it. The wider scope thus available to these compilers allows better code generation (e.g. smaller code size, faster code) compared to the output of one-pass compilers, at the cost of higher compiler time and memory consumption. In addition, some languages cannot be compiled in a single pass, as a result of their design.

Typical multi-pass compiler

Multi-passcompiler.png

Lexical analysis

This stage of a multi-pass compiler is to remove irrelevant information from the source program that syntax analysis will not be able to use or interpret. Irrelevant information could include things like comments and white space. In addition to removing the irrelevant information, the lexical analysis determines the lexical tokens of the language. This step means that forward declaration is generally not necessary if a multi-pass compiler is used. This phase is focused on breaking a sequence of characters into tokens with attributes such as kind, type, value, and potentially others, as well.

Syntax analysis

Syntax analysis is responsible for looking at syntax rules of the language (often as a context-free grammar), and building some intermediate representation of the language. An example of this intermediate representation could be something like an abstract syntax tree or a directed acyclic graph.

Semantic analysis

Semantic analysis takes the representation made from syntax analysis and applies semantic rules to the representation to make sure that the program meets the semantic rules requirements of the language. For example, in the example below at the stage of semantic analysis if the language required that conditions on if statements were boolean expressions the cond would be type-checked to make sure it would be a valid boolean expression.

if(cond){...}else{...}

In addition to performing semantic analysis at this stage of compilation, often symbol tables are created in order to assist in code generation.

Code generation

This final stage of a typical compiler converts the intermediate representation of program into an executable set of instructions (often assembly). This last stage is the only stage in compilation that is machine dependent. There can also be optimization done at this stage of compilation that make the program more efficient.

Other passes of compiler include intermediate code generation phase which takes place before code generation and code optimization phase which can take place when the source program is written, or after intermediate code generation phase, or after code generation phase.

Advantages of multi-pass compilers

Machine Independent: Since the multiple passes include a modular structure, and the code generation decoupled from the other steps of the compiler, the passes can be reused for different hardware/machines.

More Expressive Languages: Multiple passes obviate the need for forward declarations, allowing mutual recursion to be implemented elegantly. The prime examples of languages requiring forward declarations due to the requirement of being compilable in a single pass include C and Pascal, whereas Java does not have forward declarations.

Related Research Articles

In computing, a compiler is a computer program that translates computer code written in one programming language into another language. The name "compiler" is primarily used for programs that translate source code from a high-level programming language to a low-level programming language to create an executable program.

<span class="mw-page-title-main">Interpreter (computing)</span> Program that executes source code without a separate compilation step

In computer science, an interpreter is a computer program that directly executes instructions written in a programming or scripting language, without requiring them previously to have been compiled into a machine language program. An interpreter generally uses one of the following strategies for program execution:

  1. Parse the source code and perform its behavior directly;
  2. Translate source code into some efficient intermediate representation or object code and immediately execute that;
  3. Explicitly execute stored precompiled bytecode made by a compiler and matched with the interpreter Virtual Machine.

In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine.

<span class="mw-page-title-main">Abstract syntax tree</span> Tree representation of the abstract syntactic structure of source code

An abstract syntax tree (AST) is a data structure used in computer science to represent the structure of a program or code snippet. It is a tree representation of the abstract syntactic structure of text written in a formal language. Each node of the tree denotes a construct occurring in the text. It is sometimes called just a syntax tree.

Lexical tokenization is conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is not the same process as the probabilistic tokenization, used for large language model's data preprocessing, that encode text into numerical tokens, using byte pair encoding.

In computer science, a preprocessor is a program that processes its input data to produce output that is used as input in another program. The output is said to be a preprocessed form of the input data, which is often used by some subsequent programs like compilers. The amount and kind of processing done depends on the nature of the preprocessor; some preprocessors are only capable of performing relatively simple textual substitutions and macro expansions, while others have the power of full-fledged programming languages.

In computing, code generation is part of the process chain of a compiler and converts intermediate representation of source code into a form that can be readily executed by the target system.

In computer science, a high-level programming language is a programming language with strong abstraction from the details of the computer. In contrast to low-level programming languages, it may use natural language elements, be easier to use, or may automate significant areas of computing systems, making the process of developing a program simpler and more understandable than when using a lower-level language. The amount of abstraction provided defines how "high-level" a programming language is.

In computer science, compile time describes the time window during which a language's statements are converted into binary instructions for the processor to execute. The term is used as an adjective to describe concepts related to the context of program compilation, as opposed to concepts related to the context of program execution (runtime). For example, compile-time requirements are programming language requirements that must be met by source code before compilation and compile-time properties are properties of the program that can be reasoned about during compilation. The actual length of time it takes to compile a program is usually referred to as compilation time.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

In computer programming, operators are constructs defined within programming languages which behave generally like functions, but which differ syntactically or semantically.

In computer science, an abstract semantic graph (ASG) or term graph is a form of abstract syntax in which an expression of a formal or programming language is represented by a graph whose vertices are the expression's subterms. An ASG is at a higher level of abstraction than an abstract syntax tree, which is used to express the syntactic structure of an expression or program.

An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation. A "good" IR must be accurate – capable of representing the source code without loss of information – and independent of any particular source or target language. An IR may take one of several forms: an in-memory data structure, or a special tuple- or stack-based code readable by the program. In the latter case it is also called an intermediate language.

The computer programming languages C and Pascal have similar times of origin, influences, and purposes. Both were used to design their own compilers early in their lifetimes. The original Pascal definition appeared in 1969 and a first compiler in 1970. The first version of C appeared in 1972.

In computer programming, a one-pass compiler is a compiler that passes through the parts of each compilation unit only once, immediately translating each part into its final machine code. This is in contrast to a multi-pass compiler which converts the program into one or more intermediate representations in steps between source code and machine code, and which reprocesses the entire compilation unit in each sequential pass.

<span class="mw-page-title-main">Syntax (programming languages)</span> Set of rules defining correctly structured programs

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

In computer programming, the lexer hack is a common solution to the problems in parsing ANSI C, due to the reference grammar being context-sensitive. In C, classifying a sequence of characters as a variable name or a type name requires contextual information of the phrase structure, which prevents one from having a context-free lexer.

In computer programming, a precompiled header (PCH) is a header file that is compiled into an intermediate form that is faster to process for the compiler. Usage of precompiled headers may significantly reduce compilation time, especially when applied to large header files, header files that include many other header files, or header files that are included in many translation units.

<span class="mw-page-title-main">Transfer-based machine translation</span>

Transfer-based machine translation is a type of machine translation (MT). It is currently one of the most widely used methods of machine translation. In contrast to the simpler direct model of MT, transfer MT breaks translation into three steps: analysis of the source language text to determine its grammatical structure, transfer of the resulting structure to a structure suitable for generating text in the target language, and finally generation of this text. Transfer-based MT systems are thus capable of using knowledge of the source and target languages.

<span class="mw-page-title-main">History of compiler construction</span>

In computing, a compiler is a computer program that transforms source code written in a programming language or computer language, into another computer language. The most common reason for transforming source code is to create an executable program.

References

  1. Grune, Dick; van Reeuwijk, Kees; Bal, Henri; Jacobs, Ceriel; Langendoen, Koen (2012). Modern Compiler Design (Second ed.). Amsterdam, the Netherlands: Springer. p. 27. ISBN   978-1-4939-4472-9.