Abstract interpretation


In computer science, abstract interpretation is a theory of sound approximation of the semantics of computer programs, based on monotonic functions over ordered sets, especially lattices. It can be viewed as a partial execution of a computer program which gains information about its semantics (e.g., control-flow, data-flow) without performing all the calculations.


Its main concrete application is formal static analysis, the automatic extraction of information about the possible executions of computer programs; such analyses have two main usages: inside compilers, to decide whether certain optimizations or program transformations are applicable, and for debugging or the certification of programs against classes of bugs.

Abstract interpretation was formalized by the French computer scientists Patrick Cousot and Radhia Cousot, a husband-and-wife research team, in the late 1970s. [1] [2]

Intuition

This section illustrates abstract interpretation by means of real-world, non-computing examples.

Consider the people in a conference room. Assume a unique identifier for each person in the room, like a social security number in the United States. To prove that someone is not present, all one needs to do is see if their social security number is not on the list. Since two different people cannot have the same number, it is possible to prove or disprove the presence of a participant simply by looking up their number.

However it is possible that only the names of attendees were registered. If the name of a person is not found in the list, we may safely conclude that that person was not present; but if it is, we cannot conclude definitely without further inquiries, due to the possibility of homonyms (for example, two people named John Smith). Note that this imprecise information will still be adequate for most purposes, because homonyms are rare in practice. However, in all rigor, we cannot say for sure that somebody was present in the room; all we can say is that they were possibly here. If the person we are looking up is a criminal, we will issue an alarm; but there is of course the possibility of issuing a false alarm. Similar phenomena will occur in the analysis of programs.

If we are only interested in some specific information, say, "was there a person of age n in the room?", keeping a list of all names and dates of birth is unnecessary. We may safely and without loss of precision restrict ourselves to keeping a list of the participants' ages. If this is already too much to handle, we might keep only the age of the youngest person, m, and of the oldest person, M. If the question is about an age strictly lower than m or strictly higher than M, then we may safely respond that no such participant was present. Otherwise, we may only be able to say that we do not know.

In the case of computing, concrete, precise information is in general not computable within finite time and memory (see Rice's theorem and the halting problem). Abstraction is used to allow for generalized answers to questions (for example, answering "maybe" to a yes/no question, meaning "yes or no", when we (an algorithm of abstract interpretation) cannot compute the precise answer with certainty); this simplifies the problems, making them amenable to automatic solutions. One crucial requirement is to add enough vagueness so as to make problems manageable while still retaining enough precision for answering the important questions (such as "might the program crash?").

Abstract interpretation of computer programs

Given a programming or specification language, abstract interpretation consists of giving several semantics linked by relations of abstraction. A semantics is a mathematical characterization of a possible behavior of the program. The most precise semantics, describing very closely the actual execution of the program, is called the concrete semantics. For instance, the concrete semantics of an imperative programming language may associate to each program the set of execution traces it may produce, an execution trace being a sequence of possible consecutive states of the execution of the program; a state typically consists of the value of the program counter and the memory locations (globals, stack and heap). More abstract semantics are then derived; for instance, one may consider only the set of reachable states in the executions (which amounts to considering the last states in finite traces).
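As a small illustrative sketch of these notions, the following Python fragment runs a tiny two-assignment program and records its execution trace as a sequence of (program counter, memory) states; the helper name run_trace and the state encoding are invented for this example only.

# Illustrative sketch: a concrete state is a pair (program counter, memory),
# and the concrete semantics of a tiny straight-line program is the execution
# trace it produces (a sequence of consecutive states).

def run_trace(program, memory):
    """Execute a list of (variable, expression) assignments and return the
    trace of states; each state pairs the program counter with the memory."""
    trace = [(0, dict(memory))]
    for pc, (var, expr) in enumerate(program, start=1):
        memory = dict(memory)
        memory[var] = expr(memory)
        trace.append((pc, memory))
    return trace

program = [("y", lambda m: m["x"]),            # y = x
           ("z", lambda m: m["x"] - m["y"])]   # z = x - y
for state in run_trace(program, {"x": 1}):
    print(state)
# The set of reachable states is obtained by forgetting the order of the
# trace and keeping only the states that occur in it.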

The goal of static analysis is to derive a computable semantic interpretation at some point. For instance, one may choose to represent the state of a program manipulating integer variables by forgetting the actual values of the variables and only keeping their signs (+, − or 0). For some elementary operations, such as multiplication, such an abstraction does not lose any precision: to get the sign of a product, it is sufficient to know the sign of the operands. For some other operations, the abstraction may lose precision: for instance, it is impossible to know the sign of a sum whose operands are respectively positive and negative.
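As an illustrative sketch of such a sign domain (the names sign_mul, sign_add and the value TOP for "unknown sign" are invented for this example), multiplication can be abstracted exactly while addition cannot:

# Minimal sketch of the sign abstract domain described above.
NEG, ZERO, POS, TOP = "-", "0", "+", "T"

def sign(n):
    """Abstraction of a single integer to its sign."""
    return ZERO if n == 0 else (POS if n > 0 else NEG)

def sign_mul(a, b):
    """Multiplication is exact on signs: the sign of a product is
    determined by the signs of its operands."""
    if ZERO in (a, b):
        return ZERO
    if TOP in (a, b):
        return TOP
    return POS if a == b else NEG

def sign_add(a, b):
    """Addition loses precision: (+) + (-) can have any sign."""
    if a == ZERO:
        return b
    if b == ZERO:
        return a
    if a == b:
        return a          # (+)+(+) is (+), (-)+(-) is (-)
    return TOP            # (+)+(-), or anything involving TOP

print(sign_mul(sign(-3), sign(7)))   # '-'  (exact)
print(sign_add(sign(-3), sign(7)))   # 'T'  (unknown: could be -, 0 or +)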

Sometimes a loss of precision is necessary to make the semantics decidable (see Rice's theorem and the halting problem). In general, there is a compromise to be made between the precision of the analysis and its decidability (computability), or tractability (computational cost).

In practice the abstractions that are defined are tailored to both the program properties one desires to analyze, and to the set of target programs. The first large scale automated analysis of computer programs with abstract interpretation was motivated by the accident that resulted in the destruction of the first flight of the Ariane 5 rocket in 1996. [3]

Formalization

Example: abstraction of integer sets (red) to sign sets (green)

Let L be an ordered set, called the concrete set, and let L′ be another ordered set, called the abstract set. These two sets are related to each other by defining total functions that map elements from one to the other.

A function α is called an abstraction function if it maps an element x in the concrete set L to an element α(x) in the abstract set L′. That is, element α(x) in L′ is the abstraction of x in L.

A function γ is called a concretization function if it maps an element x′ in the abstract set L′ to an element γ(x′) in the concrete set L. That is, element γ(x′) in L is a concretization of x′ in L′.

Let L₁, L₂, L′₁, and L′₂ be ordered sets. The concrete semantics f is a monotonic function from L₁ to L₂. A function f′ from L′₁ to L′₂ is said to be a valid abstraction of f if, for all x′ in L′₁, we have (f ∘ γ)(x′) ≤ (γ ∘ f′)(x′).

Program semantics are generally described using fixed points in the presence of loops or recursive procedures. Suppose that L is a complete lattice and let f be a monotonic function from L into L. Then, any x such that f(x) ≤ x is an abstraction of the least fixed-point of f, which exists, according to the Knaster–Tarski theorem.

The difficulty is now to obtain such an x. If L is of finite height, or at least verifies the ascending chain condition (all ascending sequences are ultimately stationary), then such an x may be obtained as the stationary limit of the ascending sequence (xₙ) defined by induction as follows: x₀ = ⊥ (the least element of L) and xₙ₊₁ = f(xₙ).
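A minimal sketch of this ascending iteration, assuming a finite-height lattice (here, the powerset of a small state space ordered by inclusion) and an invented helper lfp:

# Sketch of the iteration x0 = bottom, x_{n+1} = f(x_n); it stabilizes on the
# least fixpoint because the lattice satisfies the ascending chain condition.

def lfp(f, bottom):
    """Least fixpoint of a monotone f by ascending (Kleene) iteration."""
    x = bottom
    while True:
        nxt = f(x)
        if nxt == x:
            return x
        x = nxt

# Example: reachable states of a tiny transition system, computed as
# lfp(S -> init ∪ post(S)) in the powerset lattice.
init = frozenset({0})
edges = {0: {1}, 1: {2}, 2: {0}, 3: {3}}   # state 3 is unreachable
f = lambda S: frozenset(init | {t for s in S for t in edges.get(s, set())})
print(sorted(lfp(f, frozenset())))          # [0, 1, 2]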

In other cases, it is still possible to obtain such an x through a (pair-)widening operator ∇, [4] defined as a binary operator which satisfies the following conditions:

  1. For all x and y, we have x ≤ x ∇ y and y ≤ x ∇ y, and
  2. For any ascending sequence (xₙ), the sequence (yₙ) defined by y₀ = x₀ and yₙ₊₁ = yₙ ∇ xₙ₊₁ is ultimately stationary. We can then take x to be the stationary limit of the sequence defined by y₀ = ⊥ and yₙ₊₁ = yₙ ∇ f(yₙ) (a typical interval widening is sketched below).
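The following sketch shows the classic widening on intervals; it is illustrative only (intervals are modeled as plain (lo, hi) pairs) and mirrors the two conditions above: the result covers both arguments, and any bound that keeps growing is pushed to infinity so that the iteration stabilizes.

# Sketch of the classic widening on intervals.
INF = float("inf")

def widen(old, new):
    lo = old[0] if new[0] >= old[0] else -INF   # lower bound stable? keep it
    hi = old[1] if new[1] <= old[1] else INF    # upper bound grew? jump to +inf
    return (lo, hi)

# Analysing "i = 0; while ...: i = i + 1" with plain iteration would produce
# [0,0], [0,1], [0,2], ... forever; widening jumps to [0, +inf] immediately.
x = (0, 0)
x = widen(x, (0, 1))
print(x)   # (0, inf)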

In some cases, it is possible to define abstractions using Galois connections (α, γ), where α is from L to L′ and γ is from L′ to L. This supposes the existence of best abstractions, which is not necessarily the case. For instance, if we abstract sets of couples (x, y) of real numbers by enclosing convex polyhedra, there is no optimal abstraction to the disc defined by x² + y² ≤ 1.
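For a domain where best abstractions do exist, here is a small sketch of the Galois connection between sets of integers and the sign domain (the universe is truncated so that γ stays finite; the functions alpha and gamma and the value names are illustrative only):

# Sketch of a Galois connection: alpha maps a set to its most precise sign
# description, gamma maps a sign back to the set of integers it describes.
UNIVERSE = range(-3, 4)
BOT, NEG, ZERO, POS, TOP = "bot", "-", "0", "+", "T"

def alpha(s):
    """Best (most precise) sign describing every element of s."""
    if not s:
        return BOT
    if all(n < 0 for n in s):
        return NEG
    if all(n == 0 for n in s):
        return ZERO
    if all(n > 0 for n in s):
        return POS
    return TOP

def gamma(a):
    """All integers of the (truncated) universe described by the value a."""
    return {BOT: set(),
            NEG: {n for n in UNIVERSE if n < 0},
            ZERO: {0},
            POS: {n for n in UNIVERSE if n > 0},
            TOP: set(UNIVERSE)}[a]

s = {1, 3}
print(alpha(s))                  # '+'  : the best abstraction of {1, 3}
print(s <= gamma(alpha(s)))      # True : soundness of the abstraction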

Examples of abstract domains

Numerical abstract domains

One can assign to each variable x available at a given program point an interval [Lx, Hx]. A state assigning the value v(x) to variable x will be a concretization of these intervals if, for all x, we have Lx ≤ v(x) ≤ Hx. From the intervals [Lx, Hx] and [Ly, Hy] for variables x and y, respectively, one can easily obtain intervals for x + y (namely, [Lx + Ly, Hx + Hy]) and for x − y (namely, [Lx − Hy, Hx − Ly]); note that these are exact abstractions, since the set of possible outcomes for, say, x + y, is precisely the interval [Lx + Ly, Hx + Hy]. More complex formulas can be derived for multiplication, division, etc., yielding so-called interval arithmetics. [5]
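A small sketch of these interval operations in Python (illustrative helper names; real interval domains must also handle infinite bounds, overflow and rounding):

# Sketch of the interval operations just described, on pairs (lo, hi).
def add(a, b):
    return (a[0] + b[0], a[1] + b[1])        # [Lx+Ly, Hx+Hy]

def sub(a, b):
    return (a[0] - b[1], a[1] - b[0])        # [Lx-Hy, Hx-Ly]

def mul(a, b):
    # Multiplication needs all four corner products because signs may flip.
    corners = [a[0]*b[0], a[0]*b[1], a[1]*b[0], a[1]*b[1]]
    return (min(corners), max(corners))

print(add((0, 1), (2, 5)))   # (2, 6)
print(sub((0, 1), (0, 1)))   # (-1, 1)
print(mul((-2, 3), (4, 5)))  # (-10, 15)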

Let us now consider the following very simple program:

y = x; z = x - y; 
Combination of interval arithmetic (green) and congruence mod 2 on integers (cyan) as abstract domains to analyze a simple piece of C code (red: concrete sets of possible values at runtime). Using the congruence information (0 = even, 1 = odd), a zero division can be excluded. (Since only one variable is involved, relational vs. non-relational domains is not an issue here.)

A 3-dimensional convex example polyhedron describing the possible values of 3 variables at some program point. Each of the variables may be zero, but all three cannot be zero simultaneously. The latter property cannot be described in the interval arithmetics domain.

With reasonable arithmetic types, the result for z should be zero. But if we do interval arithmetic starting from x in [0, 1], one gets z in [−1, +1]. While each of the operations taken individually was exactly abstracted, their composition is not.

The problem is evident: we did not keep track of the equality relationship between x and y; actually, this domain of intervals does not take into account any relationships between variables, and is thus a non-relational domain. Non-relational domains tend to be fast and simple to implement, but imprecise.
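The following sketch replays the example with and without the equality information, to make the relational/non-relational difference concrete (the tiny equalities set is an invented stand-in for a real relational domain):

# Sketch contrasting the non-relational interval analysis of "y = x; z = x - y"
# with a toy relational refinement that also records which variables are known
# to be equal.
def sub(a, b):                      # interval subtraction [La-Hb, Ha-Lb]
    return (a[0] - b[1], a[1] - b[0])

env = {"x": (0, 1)}                 # abstract environment: variable -> interval
equalities = set()                  # extra relational facts: pairs of equal vars

# y = x
env["y"] = env["x"]
equalities.add(("y", "x"))

# z = x - y
if ("y", "x") in equalities:        # a relational domain can use y == x ...
    env["z"] = (0, 0)               # ... so x - y is exactly 0
else:
    env["z"] = sub(env["x"], env["y"])   # non-relational result: (-1, 1)

print(env["z"])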

Some examples of relational numerical abstract domains are:

  - congruence relations on integers, [6] [7]
  - convex polyhedra, [8]
  - difference-bound matrices, [9]
  - "octagons", [10] [11]
  - octahedra, [12]
  - linear equalities, [13]

and combinations thereof (such as the reduced product, [2] cf. right picture).

When one chooses an abstract domain, one typically has to strike a balance between keeping fine-grained relationships and limiting computational cost.

Machine word abstract domains

While high-level languages such as Python or Haskell use unbounded integers by default, lower-level programming languages such as C or assembly language typically operate on finitely-sized machine words, which are more suitably modeled using the integers modulo 2ⁿ (where n is the bit width of a machine word). There are several abstract domains suitable for various analyses of such variables.

The bitfield domain treats each bit in a machine word separately, i.e., a word of width n is treated as an array of n abstract values. The abstract values are taken from the set {0, 1, ⊤}, and the abstraction and concretization functions are given by: [14] [15] γ(0) = {0}, γ(1) = {1}, γ(⊤) = {0, 1}, α({0}) = 0, α({1}) = 1, α({0, 1}) = ⊤. Bitwise operations on these abstract values are identical with the corresponding logical operations in some three-valued logics: [16]

NOT(A):

  A    ¬A
  0     1
  1     0
  ⊤     ⊤

AND(A, B):

  A ∧ B   B = 0   B = 1   B = ⊤
  A = 0     0       0       0
  A = 1     0       1       ⊤
  A = ⊤     0       ⊤       ⊤

OR(A, B):

  A ∨ B   B = 0   B = 1   B = ⊤
  A = 0     0       1       ⊤
  A = 1     1       1       1
  A = ⊤     ⊤       1       ⊤
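A small sketch of these bitwise abstract operations, with Python's None playing the role of the unknown value ⊤ in the tables above (the names and the word encoding are illustrative only):

# Sketch of the bitfield domain: each bit of a machine word is abstracted to
# 0, 1, or None (unknown); a word of width n is a list of n such values.
def bit_and(a, b):
    if a == 0 or b == 0:
        return 0
    if a == 1 and b == 1:
        return 1
    return None                     # unknown

def bit_or(a, b):
    if a == 1 or b == 1:
        return 1
    if a == 0 and b == 0:
        return 0
    return None

def bit_not(a):
    return None if a is None else 1 - a

def word_and(u, v):
    return [bit_and(a, b) for a, b in zip(u, v)]

# 4-bit example: x has unknown low bits; masking with 0b1100 fixes them to 0.
x = [1, None, None, None]           # most significant bit first
mask = [1, 1, 0, 0]
print(word_and(x, mask))            # [1, None, 0, 0]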

Further domains include the signed interval domain and the unsigned interval domain. All three of these domains support forwards and backwards abstract operators for common operations such as addition, shifts, xor, and multiplication. These domains can be combined using the reduced product. [17]
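As a sketch of such a combination, the following toy reduced product lets an interval and a parity (congruence mod 2) refine each other, which is the kind of reasoning that excludes the zero division in the figure above (the reduce helper and the encodings are invented for this example):

# Sketch of a reduced product of two abstract domains for one variable:
# an (lo, hi) interval and a parity value. The reduction step lets each
# component tighten the other.
EVEN, ODD, UNKNOWN = 0, 1, None

def reduce(interval, parity):
    """Mutually refine an (lo, hi) interval and a parity value."""
    lo, hi = interval
    if parity is not UNKNOWN:
        # shrink the bounds to the nearest values with the right parity
        if lo % 2 != parity:
            lo += 1
        if hi % 2 != parity:
            hi -= 1
    if lo == hi:
        parity = lo % 2
    return (lo, hi), parity

# If the analysis knows x is in [-1, 1] and even, the reduced product
# concludes x = 0; dually, "x in [1, 2] and odd" gives x = 1, so a division
# by x cannot be a division by zero.
print(reduce((-1, 1), EVEN))   # ((0, 0), 0)
print(reduce((1, 2), ODD))     # ((1, 1), 1)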


References

  1. Cousot, Patrick; Cousot, Radhia (1977). "Abstract Interpretation: A Unified Lattice Model for Static Analysis of Programs by Construction or Approximation of Fixpoints" (PDF). Conference Record of the Fourth ACM Symposium on Principles of Programming Languages, Los Angeles, California, USA, January 1977. ACM Press. pp. 238–252. doi:10.1145/512950.512973. S2CID   207614632.
  2. Cousot, Patrick; Cousot, Radhia (1979). "Systematic Design of Program Analysis Frameworks" (PDF). Conference Record of the Sixth Annual ACM Symposium on Principles of Programming Languages, San Antonio, Texas, USA, January 1979. ACM Press. pp. 269–282. doi:10.1145/567752.567778. S2CID 1547466.
  3. Faure, Christèle. "PolySpace Technologies History" . Retrieved 3 October 2010.
  4. Cousot, P.; Cousot, R. (August 1992). "Comparing the Galois Connection and Widening / Narrowing Approaches to Abstract Interpretation" (PDF). In Bruynooghe, Maurice; Wirsing, Martin (eds.). Proc. 4th Int. Symp. on Programming Language Implementation and Logic Programming (PLILP). Lecture Notes in Computer Science. Vol. 631. Springer. pp. 269–296. ISBN   978-0-387-55844-8.
  5. Cousot, Patrick; Cousot, Radhia (1976). "Static determination of dynamic properties of programs" (PDF). Proceedings of the Second International Symposium on Programming. Dunod, Paris, France. pp. 106–130.
  6. Granger, Philippe (1989). "Static Analysis of Arithmetical Congruences". International Journal of Computer Mathematics. 30 (3–4): 165–190. doi:10.1080/00207168908803778.
  7. Philippe Granger (1991). "Static Analysis of Linear Congruence Equalities Among Variables of a Program". In Abramsky, S.; Maibaum, T.S.E. (eds.). Proc. Int. J. Conf. on Theory and Practice of Software Development (TAPSOFT). Lecture Notes in Computer Science. Vol. 493. Springer. pp. 169–192.
  8. Cousot, Patrick; Halbwachs, Nicolas (January 1978). "Automatic Discovery of Linear Restraints Among Variables of a Program" (PDF). Conf. Rec. 5th ACM Symp. on Principles of Programming Languages (POPL). pp. 84–97.
  9. Miné, Antoine (2001). "A New Numerical Abstract Domain Based on Difference-Bound Matrices". In Danvy, Olivier; Filinski, Andrzej (eds.). Programs as Data Objects, Second Symposium, (PADO). Lecture Notes in Computer Science. Vol. 2053. Springer. pp. 155–172. arXiv: cs/0703073 .
  10. Miné, Antoine (Dec 2004). Weakly Relational Numerical Abstract Domains (PDF) (Ph.D. thesis). Laboratoire d'Informatique de l'École Normale Supérieure.
  11. Antoine Miné (2006). "The Octagon Abstract Domain". Higher Order Symbol. Comput. 19 (1): 31–100. arXiv: cs/0703084 . doi:10.1007/s10990-006-8609-1.
  12. Clarisó, Robert; Cortadella, Jordi (2007). "The Octahedron Abstract Domain". Science of Computer Programming. 64: 115–139. doi:10.1016/j.scico.2006.03.009. hdl: 10609/109823 .
  13. Michael Karr (1976). "Affine Relationships Among Variables of a Program". Acta Informatica. 6 (2): 133–151. doi:10.1007/BF00268497. S2CID   376574.
  14. Miné, Antoine (Jun 2012). "Abstract domains for bit-level machine integer and floating-point operations". WING'12 - 4th International Workshop on Invariant Generation. Manchester, United Kingdom: 16.
  15. Regehr, John; Duongsaa, Usit (Jun 2006). "Deriving abstract transfer functions for analyzing embedded software". Proceedings of the 2006 ACM SIGPLAN/SIGBED conference on Language, compilers, and tool support for embedded systems. LCTES '06. New York, NY, USA: Association for Computing Machinery. pp. 34–43. doi:10.1145/1134650.1134657. ISBN   978-1-59593-362-1. S2CID   13221224.
  16. Reps, T.; Loginov, A.; Sagiv, M. (Jul 2002). "Semantic minimization of 3-valued propositional formulae". Proceedings 17th Annual IEEE Symposium on Logic in Computer Science. pp. 40–51. doi:10.1109/LICS.2002.1029816. ISBN   0-7695-1483-9. S2CID   8451238.
  17. Yoon, Yongho; Lee, Woosuk; Yi, Kwangkeun (2023-06-06). "Inductive Program Synthesis via Iterative Forward-Backward Abstract Interpretation". Proceedings of the ACM on Programming Languages. 7 (PLDI): 174:1657–174:1681. arXiv: 2304.10768 . doi: 10.1145/3591288 .