Definite assignment analysis

Last updated

In computer science, definite assignment analysis is a data-flow analysis used by compilers to conservatively ensure that a variable or location is always assigned before it is used.

Contents

Motivation

In C and C++ programs, a source of particularly difficult-to-diagnose errors is the nondeterministic behavior that results from reading uninitialized variables; this behavior can vary between platforms, builds, and even from run to run.

There are two common ways to solve this problem. One is to ensure that all locations are written before they are read. Rice's theorem establishes that this problem cannot be solved in general for all programs; however, it is possible to create a conservative (imprecise) analysis that will accept only programs that satisfy this constraint, while rejecting some correct programs, and definite assignment analysis is such an analysis. The Java [1] and C# [2] programming language specifications require that the compiler report a compile-time error if the analysis fails. Both languages require a specific form of the analysis that is spelled out in meticulous detail. In Java, this analysis was formalized by Stärk et al., [3] and some correct programs are rejected and must be altered to introduce explicit unnecessary assignments. In C#, this analysis was formalized by Fruja, and is precise as well as sound, in the sense that all variables assigned along all control flow paths will be considered definitely assigned. [4] The Cyclone language also requires programs to pass a definite assignment analysis, but only on variables with pointer types, to ease porting of C programs. [5]

The second way to solve the problem is to automatically initialize all locations to some fixed, predictable value at the point at which they are defined, but this introduces new assignments that may impede performance. In this case, definite assignment analysis enables a compiler optimization where redundant assignments assignments followed only by other assignments with no possible intervening reads can be eliminated. In this case, no programs are rejected, but programs for which the analysis fails to recognize definite assignment may contain redundant initialization. The Common Language Infrastructure relies on this approach. [6]

Terminology

A variable or location can be said to be in one of three states at any given point in the program:

The analysis

The following is based on Fruja's formalization of the C# intraprocedural (single method) definite assignment analysis, which is responsible for ensuring that all local variables are assigned before they are used. [4] It simultaneously does definite assignment analysis and constant propagation of boolean values. We define five static functions:

NameDomainDescription
beforeAll statements and expressionsVariables definitely assigned before the evaluation of the given statement or expression.
afterAll statements and expressionsVariables definitely assigned after the evaluation of the given statement or expression, assuming it completes normally.
varsAll statements and expressionsAll variables available in the scope of the given statement or expression.
trueAll boolean expressionsVariables definitely assigned after the evaluation of the given expression, assuming the expression evaluates to true.
falseAll boolean expressionsVariables definitely assigned after the evaluation of the given expression, assuming the expression evaluates to false.

We supply data-flow equations that define the values of these functions on various expressions and statements, in terms of the values of the functions on their syntactic subexpressions. Assume for the moment that there are no goto, break, continue, return, or exception handling statements. Following are a few examples of these equations:

At the beginning of the method, no local variables are definitely assigned. The verifier repeatedly iterates over the abstract syntax tree and uses the data-flow equations to migrate information between the sets until a fixed point can be reached. Then, the verifier examines the before set of every expression that uses a local variable to ensure that it contains that variable.

The algorithm is complicated by the introduction of control-flow jumps like goto, break, continue, return, and exception handling. Any statement that can be the target of one of these jumps must intersect its before set with the set of definitely assigned variables at the jump source. When these are introduced, the resulting data flow may have multiple fixed points, as in this example:

1 inti=1;2 L:3 gotoL;

Since the label L can be reached from two locations, the control-flow equation for goto dictates that before(2) = after(1) intersect before(3). But before(3) = before(2), so before(2) = after(1) intersect before(2). This has two fixed-points for before(2), {i} and the empty set. However, it can be shown that because of the monotonic form of the data-flow equations, there is a unique maximal fixed point (fixed point of largest size) that provides the most possible information about the definitely assigned variables. Such a maximal (or maximum) fixed point may be computed by standard techniques; see data-flow analysis.

An additional issue is that a control-flow jump may render certain control flows infeasible; for example, in this code fragment the variable i is definitely assigned before it is used:

1 inti;2 if(j<0)return;elsei=j;3 print(i);

The data-flow equation for if says that after(2) = after(return) intersect after(i = j). To make this work out correctly, we define after(e) = vars(e) for all control-flow jumps; this is vacuously valid in the same sense that the equation false(true) = vars(e) is valid, because it is not possible for control to reach a point immediately after a control-flow jump.

Related Research Articles

In computer science, control flow is the order in which individual statements, instructions or function calls of an imperative program are executed or evaluated. The emphasis on explicit control flow distinguishes an imperative programming language from a declarative programming language.

ABAP is a high-level programming language created by the German software company SAP SE. It is extracted from the base computing languages Java, C, C++ and Python. It is currently positioned, alongside Java, as the language for programming the SAP NetWeaver Application Server, which is part of the SAP NetWeaver platform for building business applications.

In computer programming, an assignment statement sets and/or re-sets the value stored in the storage location(s) denoted by a variable name; in other words, it copies a value into the variable. In most imperative programming languages, the assignment statement is a fundamental construct.

In compiler design, static single assignment form is a property of an intermediate representation (IR), which requires that each variable is assigned exactly once, and every variable is defined before it is used. Existing variables in the original IR are split into versions, new variables typically indicated by the original name with a subscript in textbooks, so that every definition gets its own version. In SSA form, use-def chains are explicit and each contains a single element.

In computer programming, a block or code block is a lexical structure of source code which is grouped together. Blocks consist of one or more declarations and statements. A programming language that permits the creation of blocks, including blocks nested within other blocks, is called a block-structured programming language. Blocks are fundamental to structured programming, where control structures are formed from blocks.

Conditional (computer programming) programming language construct that performs actions according to boolean conditions

In computer science, conditional statements, conditional expressions and conditional constructs are features of a programming language, which perform different computations or actions depending on whether a programmer-specified boolean condition evaluates to true or false. Apart from the case of branch predication, this is always achieved by selectively altering the control flow based on some condition.

ActionScript object-oriented programming language

ActionScript is an object-oriented programming language originally developed by Macromedia Inc.. It is influenced by HyperTalk, the scripting language for HyperCard. It is now an implementation of ECMAScript, though it originally arose as a sibling, both being influenced by HyperTalk.

In computer programming, ?: is a ternary operator that is part of the syntax for basic conditional expressions in several programming languages. It is commonly referred to as the conditional operator, inline if (iif), or ternary if. An expression a? b : c evaluates to b if the value of a is true, and otherwise to c. One can read it aloud as "if a then b otherwise c".

Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program. A program's control flow graph (CFG) is used to determine those parts of a program to which a particular value assigned to a variable might propagate. The information gathered is often used by compilers when optimizing a program. A canonical example of a data-flow analysis is reaching definitions.

In most computer programming languages, a do while loop is a control flow statement that executes a block of code at least once, and then repeatedly executes the block, or not, depending on a given boolean condition at the end of the block.

In computer programming, a statement is a syntactic unit of an imperative programming language that expresses some action to be carried out. A program written in such a language is formed by a sequence of one or more statements. A statement may have internal components.

In computer science, a relational operator is a programming language construct or operator that tests or defines some kind of relation between two entities. These include numerical equality and inequalities.

In computer programming, unreachable code is part of the source code of a program which can never be executed because there exists no control flow path to the code from the rest of the program.

In computer programming languages, a switch statement is a type of selection control mechanism used to allow the value of a variable or expression to change the control flow of program execution via search and map.

HP Time-Shared BASIC is a BASIC programming language interpreter for Hewlett-Packard's HP 2000 line of minicomputer-based time-sharing computer systems. TSB is historically notable as the platform that released the first public versions of the game Star Trek.

C Sharp (programming language) Multi-paradigm (object-oriented) programming language

C# is a general-purpose, multi-paradigm programming language encompassing strong typing, lexically scoped, imperative, declarative, functional, generic, object-oriented (class-based), and component-oriented programming disciplines. It was developed around 2000 by Microsoft as part of its .NET initiative, and later approved as an international standard by Ecma (ECMA-334) and ISO. Mono is the name of the free and open-source project to develop a compiler and runtime for the language. C# is one of the programming languages designed for the Common Language Infrastructure (CLI).

In the Java programming language, the final keyword is used in several contexts to define an entity that can only be assigned once.

The syntax of JavaScript is the set of rules that define a correctly structured JavaScript program.

In compilers, live variable analysis is a classic data-flow analysis to calculate the variables that are live at each point in the program. A variable is live at some point if it holds a value that may be needed in the future, or equivalently if its value may be read before the next time the variable is written to.

GoTo is a statement found in many computer programming languages. It performs a one-way transfer of control to another line of code; in contrast a function call normally returns control. The jumped-to locations are usually identified using labels, though some languages use line numbers. At the machine code level, a goto is a form of branch or jump statement, in some cases combined with a stack adjustment. Many languages support the goto statement, and many do not.

References

  1. J. Gosling; B. Joy; G. Steele; G. Bracha. "The Java Language Specification, 3rd Edition". pp. Chapter 16 (pp.527–552). Retrieved December 2, 2008.
  2. "Standard ECMA-334, C# Language Specification". ECMA International. pp. Section 12.3 (pp.122–133). Retrieved December 2, 2008.
  3. Stärk, Robert F.; E. Borger; Joachim Schmid (2001). Java and the Java Virtual Machine: Definition, Verification, Validation. Secaucus, NJ, USA: Springer-Verlag New York, Inc. pp. Section 8.3. ISBN   3-540-42088-6.
  4. 1 2 Fruja, Nicu G. (October 2004). "The Correctness of the Definite Assignment Analysis in C#". Journal of Object Technology. 3 (9): 29–52. CiteSeerX   10.1.1.165.6696 . doi:10.5381/jot.2004.3.9.a2 . Retrieved 2008-12-02. We actually prove more than correctness: we show that the solution of the analysis is a perfect solution (and not only a safe approximation).
  5. "Cyclone: Definite Assignment". Cyclone User's Manual. Retrieved December 16, 2008.
  6. "Standard ECMA-335, Common Language Infrastructure (CLI)". ECMA International. pp. Section 1.8.1.1 (Partition III, pg. 19). Retrieved December 2, 2008.