In computer science, abstract interpretation is a theory of sound approximation of the semantics of computer programs, based on monotonic functions over ordered sets, especially lattices. It can be viewed as a partial execution of a computer program which gains information about its semantics (e.g., control-flow, data-flow) without performing all the calculations.
Its main concrete application is formal static analysis, the automatic extraction of information about the possible executions of computer programs; such analyses have two main usages:
Abstract interpretation was formalized by the French computer scientist working couple Patrick Cousot and Radhia Cousot in the late 1970s. [1] [2]
This section illustrates abstract interpretation by means of real-world, non-computing examples.
Consider the people in a conference room. Assume a unique identifier for each person in the room, like a social security number in the United States. To prove that someone is not present, all one needs to do is see if their social security number is not on the list. Since two different people cannot have the same number, it is possible to prove or disprove the presence of a participant simply by looking up his or her number.
However it is possible that only the names of attendees were registered. If the name of a person is not found in the list, we may safely conclude that that person was not present; but if it is, we cannot conclude definitely without further inquiries, due to the possibility of homonyms (for example, two people named John Smith). Note that this imprecise information will still be adequate for most purposes, because homonyms are rare in practice. However, in all rigor, we cannot say for sure that somebody was present in the room; all we can say is that he or she was possibly here. If the person we are looking up is a criminal, we will issue an alarm; but there is of course the possibility of issuing a false alarm. Similar phenomena will occur in the analysis of programs.
If we are only interested in some specific information, say, "was there a person of age n in the room?", keeping a list of all names and dates of births is unnecessary. We may safely and without loss of precision restrict ourselves to keeping a list of the participants' ages. If this is already too much to handle, we might keep only the age of the youngest, m and oldest person, M. If the question is about an age strictly lower than m or strictly higher than M, then we may safely respond that no such participant was present. Otherwise, we may only be able to say that we do not know.
In the case of computing, concrete, precise information is in general not computable within finite time and memory (see Rice's theorem and the halting problem). Abstraction is used to allow for generalized answers to questions (for example, answering "maybe" to a yes/no question, meaning "yes or no", when we (an algorithm of abstract interpretation) cannot compute the precise answer with certainty); this simplifies the problems, making them amenable to automatic solutions. One crucial requirement is to add enough vagueness so as to make problems manageable while still retaining enough precision for answering the important questions (such as "might the program crash?").
Given a programming or specification language, abstract interpretation consists of giving several semantics linked by relations of abstraction. A semantics is a mathematical characterization of a possible behavior of the program. The most precise semantics, describing very closely the actual execution of the program, are called the concrete semantics. For instance, the concrete semantics of an imperative programming language may associate to each program the set of execution traces it may produce – an execution trace being a sequence of possible consecutive states of the execution of the program; a state typically consists of the value of the program counter and the memory locations (globals, stack and heap). More abstract semantics are then derived; for instance, one may consider only the set of reachable states in the executions (which amounts to considering the last states in finite traces).
The goal of static analysis is to derive a computable semantic interpretation at some point. For instance, one may choose to represent the state of a program manipulating integer variables by forgetting the actual values of the variables and only keeping their signs (+, − or 0). For some elementary operations, such as multiplication, such an abstraction does not lose any precision: to get the sign of a product, it is sufficient to know the sign of the operands. For some other operations, the abstraction may lose precision: for instance, it is impossible to know the sign of a sum whose operands are respectively positive and negative.
Sometimes a loss of precision is necessary to make the semantics decidable (see Rice's theorem and the halting problem). In general, there is a compromise to be made between the precision of the analysis and its decidability (computability), or tractability (computational cost).
In practice the abstractions that are defined are tailored to both the program properties one desires to analyze, and to the set of target programs. The first large scale automated analysis of computer programs with abstract interpretation was motivated by the accident that resulted in the destruction of the first flight of the Ariane 5 rocket in 1996. [3]
Let L be an ordered set, called a concrete set and let L′ be another ordered set, called an abstract set. These two sets are related to each other by defining total functions that map elements from one to the other.
A function α is called an abstraction function if it maps an element x in the concrete set L to an element α(x) in the abstract set L′. That is, element α(x) in L′ is the abstraction of x in L.
A function γ is called a concretization function if it maps an element x′ in the abstract set L′ to an element γ(x′) in the concrete set L. That is, element γ(x′) in L is a concretization of x′ in L′.
Let L1, L2, L′1 and L′2 be ordered sets. The concrete semantics f is a monotonic function from L1 to L2. A function f′ from L′1 to L′2 is said to be a valid abstraction of f if for all x′ in L′1, (f ∘ γ)(x′) ≤ (γ ∘ f′)(x′).
Program semantics are generally described using fixed points in the presence of loops or recursive procedures. Let us suppose that L is a complete lattice and let f be a monotonic function from L into L. Then, any x′ such that f(x′) ≤ x′ is an abstraction of the least fixed-point of f, which exists, according to the Knaster–Tarski theorem.
The difficulty is now to obtain such an x′. If L′ is of finite height, or at least verifies the ascending chain condition (all ascending sequences are ultimately stationary), then such an x′ may be obtained as the stationary limit of the ascending sequence x′n defined by induction as follows: x′0=⊥ (the least element of L′) and x′n+1=f′(x′n).
In other cases, it is still possible to obtain such an x′ through a widening operator [4] ∇: for all x and y, x ∇ y should be greater or equal than both x and y, and for any sequence y′n, the sequence defined by x′0=⊥ and x′n+1=x′n ∇ y′n is ultimately stationary. We can then take y′n=f′(x′n).
In some cases, it is possible to define abstractions using Galois connections (α, γ) where α is from L to L′ and γ is from L′ to L. This supposes the existence of best abstractions, which is not necessarily the case. For instance, if we abstract sets of couples (x, y) of real numbers by enclosing convex polyhedra, there is no optimal abstraction to the disc defined by x2+y2 ≤ 1.
One can assign to each variable x available at a given program point an interval [Lx, Hx]. A state assigning the value v(x) to variable x will be a concretization of these intervals if for all x, v(x) is in [Lx, Hx]. From the intervals [Lx, Hx] and [Ly, Hy] for variables x and y, one can easily obtain intervals for x+y ([Lx+Ly, Hx+Hy]) and for x−y ([Lx−Hy, Hx−Ly]); note that these are exact abstractions, since the set of possible outcomes for, say, x+y, is precisely the interval ([Lx+Ly, Hx+Hy]). More complex formulas can be derived for multiplication, division, etc., yielding so-called interval arithmetics. [5]
Let us now consider the following very simple program:
y = x; z = x - y;
With reasonable arithmetic types, the result for z should be zero. But if we do interval arithmetic starting from x in [0, 1], one gets z in [−1, +1]. While each of the operations taken individually was exactly abstracted, their composition isn't.
The problem is evident: we did not keep track of the equality relationship between x and y; actually, this domain of intervals does not take into account any relationships between variables, and is thus a non-relational domain. Non-relational domains tend to be fast and simple to implement, but imprecise.
Some examples of relational numerical abstract domains are:
and combinations thereof (such as the reduced product, [2] cf. right picture).
When one chooses an abstract domain, one typically has to strike a balance between keeping fine-grained relationships, and high computational costs.
While high-level languages such as Python or Haskell use unbounded integers by default, lower-level programming languages such as C or assembly language typically operate on finitely-sized machine words, which are more suitably modeled using the integers modulo (where n is the bit width of a machine word). There are several abstract domains suitable for various analyses of such variables.
The bitfield domain treats each bit in a machine word separately, i.e., a word of width n is treated as an array of n abstract values. The abstract values are taken from the set , and the abstraction and concretization functions are given by: [14] [15] , , , , , , . Bitwise operations on these abstract values are identical with the corresponding logical operations in some three-valued logics: [16]
|
|
|
Further domains include the signed interval domain and the unsigned interval domain. All three of these domains support forwards and backwards abstract operators for common operations such as addition, shifts, xor, and multiplication. These domains can be combined using the reduced product. [17]
Lambda calculus is a formal system in mathematical logic for expressing computation based on function abstraction and application using variable binding and substitution. It is a universal model of computation that can be used to simulate any Turing machine. It was introduced by the mathematician Alonzo Church in the 1930s as part of his research into the foundations of mathematics.
In computer science, static program analysis is the analysis of computer programs performed without executing them, in contrast with dynamic program analysis, which is performed on programs during their execution.
In computer science, denotational semantics is an approach of formalizing the meanings of programming languages by constructing mathematical objects that describe the meanings of expressions from the languages. Other approaches providing formal semantics of programming languages include axiomatic semantics and operational semantics.
In software engineering and computer science, abstraction is the process of generalizing concrete details, such as attributes, away from the study of objects and systems to focus attention on details of greater importance. Abstraction is a fundamental concept in computer science and software engineering, especially within the object-oriented programming paradigm. Examples of this include:
In mathematics, especially in order theory, a Galois connection is a particular correspondence (typically) between two partially ordered sets (posets). Galois connections find applications in various mathematical theories. They generalize the fundamental theorem of Galois theory about the correspondence between subgroups and subfields, discovered by the French mathematician Évariste Galois.
In the mathematical areas of order and lattice theory, the Knaster–Tarski theorem, named after Bronisław Knaster and Alfred Tarski, states the following:
In computer programming, a type system is a logical system comprising a set of rules that assigns a property called a type to every term. Usually the terms are various language constructs of a computer program, such as variables, expressions, functions, or modules. A type system dictates the operations that can be performed on a term. For variables, the type system determines the allowed values of that term. Type systems formalize and enforce the otherwise implicit categories the programmer uses for algebraic data types, data structures, or other components.
In programming language theory and proof theory, the Curry–Howard correspondence is the direct relationship between computer programs and mathematical proofs.
In mathematics, a fixed point, also known as an invariant point, is a value that does not change under a given transformation. Specifically, for functions, a fixed point is an element that is mapped to itself by the function.
In computer programming, program slicing is the computation of the set of program statements, the program slice, that may affect the values at some point of interest, referred to as a slicing criterion. Program slicing can be used in debugging to locate source of errors more easily. Other applications of slicing include software maintenance, optimization, program analysis, and information flow control.
Data-flow analysis is a technique for gathering information about the possible set of values calculated at various points in a computer program. A program's control-flow graph (CFG) is used to determine those parts of a program to which a particular value assigned to a variable might propagate. The information gathered is often used by compilers when optimizing a program. A canonical example of a data-flow analysis is reaching definitions.
Samson Abramsky is Professor of Computer Science at University College London. He was previously the Christopher Strachey Professor of Computing at Wolfson College, Oxford, from 2000 to 2021.
In computer science, Programming Computable Functions (PCF) is a typed functional language introduced by Gordon Plotkin in 1977, based on previous unpublished material by Dana Scott. It can be considered to be an extended version of the typed lambda calculus or a simplified version of modern typed functional languages such as ML or Haskell.
Patrick Cousot is a French computer scientist, currently Silver Professor of Computer Science at the Courant Institute of Mathematical Sciences, New York University, USA. Before he was Professor at the École Normale Supérieure (ENS), Paris, France, the École Polytechnique, Palaiseau, France and the University of Metz, France and a Research Scientist at the French National Center for Scientific Research (CNRS) at the Joseph Fourier University, Grenoble, France.
In computer science and in mathematics, abstraction model checking is for systems where an actual representation is too complex in developing the model alone. So, the design undergoes a kind of translation to scaled down "abstract" version.
Astrée is a static analyzer based on abstract interpretation. It analyzes programs written in the C and C++ programming languages and outputs an exhaustive list of possible runtime errors and assertion violations. The defect classes covered include divisions by zero, buffer overflows, dereferences of null or dangling pointers, data races, deadlocks, etc. Astrée includes a static taint checker and helps finding cybersecurity vulnerabilities, such as Spectre.
Computable topology is a discipline in mathematics that studies the topological and algebraic structure of computation. Computable topology is not to be confused with algorithmic or computational topology, which studies the application of computation to topology.
AbsInt is a software-development tools vendor based in Saarbrücken, Germany. The company was founded in 1998 as a technology spin-off from the Department of Programming Languages and Compiler Construction of Prof. Reinhard Wilhelm at Saarland University. AbsInt specializes in software-verification tools based on abstract interpretation. Its tools are used worldwide by Fortune 500 companies, educational institutions, government agencies and startups.
Radhia Cousot was a French computer scientist known for inventing abstract interpretation.
In computer science, especially model checking and abstract interpretation, widening refers to at least two different techniques in the analysis of abstract transition systems where infinite progressions of abstract states are replaced by a least fixed point. The use of the term in model checking is closely related to acceleration techniques, some authors reserving acceleration for exact computations.