Typestate analysis

Typestate analysis, sometimes called protocol analysis, is a form of program analysis employed in programming languages. It is most commonly applied to object-oriented languages. Typestates define valid sequences of operations that can be performed upon an instance of a given type. Typestates, as the name suggests, associate state information with variables of that type. This state information is used to determine at compile time which operations are valid to be invoked upon an instance of the type. Checks on an object that would ordinarily be performed only at run time are instead carried out on the typestate information, which is updated to reflect the object's new state.

Typestates are capable of representing behavioral type refinements such as "method A must be invoked before method B is invoked, and method C may not be invoked in between". Typestates are well-suited to representing resources that use open/close semantics, enforcing semantically valid sequences such as "open then close" and ruling out invalid usage such as leaving a file open indefinitely. Such resources include filesystem elements, transactions, connections and protocols. For instance, developers may want to specify that files or sockets must be opened before they are read or written, and that they can no longer be read or written once they have been closed. The name "typestate" stems from the fact that this kind of analysis often models each type of object as a finite state machine. In this state machine, each state has a well-defined set of permitted methods/messages, and method invocations may cause state transitions. Petri nets have also been proposed as a possible behavioral model for use with refinement types. [1]
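
To make the open/close example concrete, the following minimal C sketch annotates an ordinary sequence of stdio calls with the typestates a checker would track through the corresponding state machine. The state names are illustrative; C itself does not enforce any of this.

#include <stdio.h>

int main(void) {
    FILE *fd = fopen("foo.txt", "r");   /* "unallocated" -> "file opened" on success */
    if (fd == NULL)
        return 1;                       /* open failed: fd never reaches "file opened" */

    char buf[64];
    if (fgets(buf, sizeof buf, fd))     /* reading requires typestate "file opened" */
        printf("%s", buf);

    fclose(fd);                         /* "file opened" -> closed: reading is no longer permitted */
    /* fgets(buf, sizeof buf, fd); */   /* a typestate checker would reject this call */
    return 0;
}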

Typestate analysis was introduced by Rob Strom in 1983 [2] in the Network Implementation Language (NIL) developed at IBM's Watson Lab. It was formalized by Strom and Yemini in a 1986 article [3] that described how to use typestate to track the degree of initialization of variables, guaranteeing that operations would never be applied to improperly initialized data, and further generalized in the Hermes programming language.

In recent years, various studies have developed ways of applying the typestate concept to object-oriented languages. [4] [5]

Approach

Nonlinear typestate lattice arising from initialization of a variable of type struct { int x; int y; int z; }.
The least element ⊥ coincides with the state ∅ in which no struct component is initialized.

Strom and Yemini (1986) required the set of typestates for a given type to be partially ordered such that a lower typestate can be obtained from a higher one by discarding some information. For example, an int variable in C typically has the typestates "uninitialized" < "initialized", and a FILE* pointer may have the typestates "unallocated" < "allocated, but uninitialized" < "initialized, but file not opened" < "file opened". Moreover, Strom and Yemini required that any two typestates have a greatest lower bound, i.e. that the partial order is a meet-semilattice, and that each such order has a least element, always called "⊥".
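
For illustration, these four FILE* typestates form a chain, so their greatest lower bound is simply the smaller element. The following minimal C sketch encodes the chain as an enum whose numeric order mirrors the typestate order; this shortcut would not work for a non-linear lattice such as the struct-initialization example in the figure above.

enum fd_typestate {        /* numeric order mirrors the typestate order */
    TS_UNALLOCATED = 0,    /* the least element, ⊥ */
    TS_ALLOCATED   = 1,    /* allocated, but uninitialized */
    TS_INITIALIZED = 2,    /* initialized, but file not opened */
    TS_OPENED      = 3     /* file opened */
};

/* Greatest lower bound of two typestates; valid only because this
   particular typestate set is totally ordered. */
enum fd_typestate meet(enum fd_typestate a, enum fd_typestate b) {
    return a < b ? a : b;
}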

Their analysis is based on the simplification that each variable v is assigned only one typestate for each point in the program text; if a point p is reached by two different execution paths and v inherits different typestates via each path, then the typestate of v at p is taken to be the greatest lower bound of the inherited typestates. For example, in the following C snippet, the variable n inherits the typestates "initialized" and "uninitialized" from the then and the (empty) else part, respectively, hence it has typestate "uninitialized" after the whole conditional statement.

int n;               // here, n has typestate "uninitialized"
if (...) {
    n = 5;           // here, n has typestate "initialized"
} else {
    /* do nothing */ // here, n has typestate "uninitialized"
}
// here, n has typestate "uninitialized" = greatest_lower_bound("initialized", "uninitialized")

Every basic operation [note 1] has to be equipped with a typestate transition, i.e. for each parameter the required and ensured typestate before and after the operation, respectively. For example, an operation fwrite(..., fd) requires fd to have typestate "file opened". More precisely, an operation may have several outcomes, each of which needs its own typestate transition. For example, the C code FILE *fd = fopen("foo", "r") sets fd's typestate to "file opened" and "unallocated" if opening succeeds and fails, respectively.
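
Reusing the enum from the sketch above, a checker might record such per-outcome transitions in a small table. The struct layout and its entries below are a hypothetical illustration, not taken from any real tool.

struct transition {
    const char *operation;         /* name of the basic operation */
    enum fd_typestate required;    /* typestate required before the call */
    enum fd_typestate on_success;  /* typestate ensured if the call succeeds */
    enum fd_typestate on_failure;  /* typestate ensured if the call fails */
};

static const struct transition fd_transitions[] = {
    { "fopen",  TS_ALLOCATED, TS_OPENED,      TS_UNALLOCATED },
    { "fwrite", TS_OPENED,    TS_OPENED,      TS_OPENED      },
    { "fclose", TS_OPENED,    TS_INITIALIZED, TS_INITIALIZED },
};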

For each two typestates t1 ≤ t2, a unique typestate coercion operation needs to be provided which, when applied to an object of typestate t2, reduces its typestate to t1, possibly by releasing some resources. For example, fclose(fd) coerces fd's typestate from "file opened" to "initialized, but file not opened".
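
As a minimal sketch, again reusing the FILE* typestate enum from above, a coercion down the chain amounts to releasing whatever the higher typestate holds. The helper's name and shape are hypothetical.

#include <stdio.h>

/* Hypothetical coercion from "file opened" down to "initialized, but file
   not opened": release the open file handle. Coercions further down the
   chain would similarly discard information or resources. */
void coerce_to_initialized(FILE *fd, enum fd_typestate current) {
    if (current == TS_OPENED)
        fclose(fd);
}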

A program execution is called typestate-correct if the typestate of each variable, before each operation involving it, satisfies the typestate required by that operation, and if, at program end, all variables are in the bottom typestate ⊥. [note 2]

A program text is called typestate-consistent if it can be transformed, by adding appropriate typestate coercions, to a program whose points can be statically labelled with typestates such that any path allowed by the control flow is typestate-correct. Strom and Yemini give a linear-time algorithm that checks a given program text for typestate-consistency, and computes where to insert which coercion operation, if any.
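
The merge step at the heart of such an algorithm can be sketched as follows, reusing meet() and the enum from the sketches above: at each control-flow join the analysis takes the greatest lower bound of the incoming typestates, and any predecessor arriving in a higher typestate gets a coercion inserted on its edge. The function and its names are hypothetical.

#include <stdio.h>

/* Hypothetical join step of a typestate checker: compute the meet of the
   typestates flowing in from two predecessors, and report on which incoming
   edge a down-coercion (e.g. an fclose) would have to be inserted. */
enum fd_typestate join_typestates(enum fd_typestate in1, enum fd_typestate in2) {
    enum fd_typestate out = meet(in1, in2);
    if (in1 > out)
        printf("insert coercion %d -> %d on edge 1\n", in1, out);
    if (in2 > out)
        printf("insert coercion %d -> %d on edge 2\n", in2, out);
    return out;    /* the typestate statically assigned to the join point */
}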

Challenges

In order to achieve a precise and effective typestate analysis, it is necessary to address the problem of aliasing. Aliasing occurs when an object has more than one reference or pointer that points to it. For the analysis to be correct, state changes to a given object must be reflected in all references that point to that object, but in general it is a difficult problem to track all such references. This becomes especially hard if the analysis needs to be modular, that is, applicable to each part of a large program separately without taking the rest of the program into account.
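
A short C fragment illustrates the problem: after the assignment below, both names refer to the same stream, so an analysis that tracks typestates per variable rather than per object keeps a stale typestate for one of the two names.

FILE *a = fopen("foo", "r");   /* a: typestate "file opened" (assuming success) */
FILE *b = a;                   /* b aliases a: one stream object, two names */
fclose(b);                     /* the shared object is now closed */
char buf[16];
fgets(buf, sizeof buf, a);     /* invalid, yet a per-variable analysis still
                                  records a as "file opened" */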

As another issue, for some programs, the method of taking the greatest lower bound at converging execution paths and adding corresponding down-coercion operations appears to be inadequate. For example, before the return 1 in the following program, [note 3] all components x, y, and z of coord are initialized, but Strom and Yemini's approach fails to recognize this, since each initialization of a struct component in the loop body has to be down-coerced at loop re-entry to meet the typestate of the very first loop entry, viz. ⊥. A related problem is that this example would require overloading of typestate transitions; for example, parse_int_attr("x", &coord->x) changes the typestate "no component initialized" to "x component initialized", but also "y component initialized" to "x and y components initialized".

int parse_coord(struct { int x; int y; int z; } *coord)
{
    int seen = 0;                                /* remember which attributes have been parsed */
    while (1)
        if      (parse_int_attr("x", &coord->x)) seen |= 1;
        else if (parse_int_attr("y", &coord->y)) seen |= 2;
        else if (parse_int_attr("z", &coord->z)) seen |= 4;
        else break;
    if (seen != 7)                               /* some attribute missing, fail */
        return 0;
    ...                                          /* all attributes present, do some computations and succeed */
    return 1;
}

Typestate inference

There are several approaches that seek to infer typestates from programs (or even from other artifacts such as contracts). Many of them can infer typestates at compile time [6] [7] [8] [9] while others mine the models dynamically. [10] [11] [12] [13] [14] [15]

Languages supporting typestate

Typestate is an experimental concept that has not yet crossed over into mainstream programming languages. However, many academic projects actively investigate how to make it more useful as an everyday programming technique. Two examples are the Plaid and Obsidian languages, which are being developed by Jonathan Aldrich's group at Carnegie Mellon University. [16] [17] Other examples include the Clara [18] language research framework, earlier versions of the Rust language, and the >> keyword in ATS. [19]

Notes

  1. These include language constructs, e.g. += in C, and standard library routines, e.g. fopen().
  2. This aims at ensuring that e.g. all files have been closed, and all malloced memory has been freed. In most programming languages, a variable's lifetime may end before program termination; the notion of typestate-correctness has then to be sharpened accordingly.
  3. assuming that int parse_int_attr(const char *name, int *val) initializes *val whenever it succeeds

References

  1. Jorge Luis Guevara Díaz (2010). "Typestate oriented design - A coloured petri net approach" (PDF).
  2. Strom, Robert E. (1983). "Mechanisms for compile-time enforcement of security". Proceedings of the 10th ACM SIGACT-SIGPLAN symposium on Principles of programming languages - POPL '83. pp. 276–284. doi:10.1145/567067.567093. ISBN 0897910907. S2CID 6630704.
  3. Strom, Robert E.; Yemini, Shaula (1986). "Typestate: A programming language concept for enhancing software reliability" (PDF). IEEE Transactions on Software Engineering. 12. IEEE: 157–171. doi:10.1109/tse.1986.6312929. S2CID 15575346.
  4. DeLine, Robert; Fähndrich, Manuel (2004). "Typestates for Objects". ECOOP 2004 – Object-Oriented Programming. Lecture Notes in Computer Science. Vol. 3086. Springer. pp. 465–490. doi:10.1007/978-3-540-24851-4_21. ISBN 978-3-540-22159-3.
  5. Bierhoff, Kevin; Aldrich, Jonathan (2007). "Modular typestate checking of aliased objects". Proceedings of the 22nd annual ACM SIGPLAN conference on Object-oriented programming systems, languages and applications. Vol. 42. pp. 301–320. doi:10.1145/1297027.1297050. ISBN 9781595937865. S2CID 9793770.
  6. Guido de Caso, Victor Braberman, Diego Garbervetsky, and Sebastian Uchitel. 2013. Enabledness-based program abstractions for behavior validation. ACM Trans. Softw. Eng. Methodol. 22, 3, Article 25 (July 2013), 46 pages.
  7. R. Alur, P. Cerny, P. Madhusudan, and W. Nam. Synthesis of interface specifications for Java classes, 32nd ACM Symposium on Principles of Programming Languages, 2005
  8. Giannakopoulou, D., and Pasareanu, C. S., "Interface Generation and Compositional Verification in JavaPathfinder", FASE 2009.
  9. Thomas A. Henzinger, Ranjit Jhala, and Rupak Majumdar. Permissive interfaces. Proceedings of the 13th Annual Symposium on Foundations of Software Engineering (FSE), ACM Press, 2005, pp. 31-40.
  10. Valentin Dallmeier, Christian Lindig, Andrzej Wasylkowski, and Andreas Zeller. 2006. Mining object behavior with ADABU. In Proceedings of the 2006 international workshop on Dynamic systems analysis (WODA '06). ACM, New York, NY, USA, 17-24
  11. Carlo Ghezzi, Andrea Mocci, and Mattia Monga. 2009. Synthesizing intensional behavior models by graph transformation. In Proceedings of the 31st International Conference on Software Engineering (ICSE '09). IEEE Computer Society, Washington, DC, USA, 430-440
  12. Mark Gabel and Zhendong Su. 2008. Symbolic mining of temporal specifications. In Proceedings of the 30th international conference on Software engineering (ICSE '08). ACM, New York, NY, USA, 51-60.
  13. Davide Lorenzoli, Leonardo Mariani, and Mauro Pezzè. 2008. Automatic generation of software behavioral models. In Proceedings of the 30th international conference on Software engineering (ICSE '08). ACM, New York, NY, USA, 501-510
  14. Ivan Beschastnikh, Yuriy Brun, Sigurd Schneider, Michael Sloan, and Michael D. Ernst. 2011. Leveraging existing instrumentation to automatically infer invariant-constrained models. In Proceedings of the 19th ACM SIGSOFT symposium and the 13th European conference on Foundations of software engineering (ESEC/FSE '11). ACM, New York, NY, USA, 267-277
  15. Pradel, M.; Gross, T. R. (2009). "Automatic Generation of Object Usage Specifications from Large Method Traces". ASE '09, 24th IEEE/ACM International Conference on Automated Software Engineering, pp. 371–382, 16-20 Nov. 2009.
  16. Aldrich, Jonathan. "The Plaid Programming Language". Retrieved 22 July 2012.
  17. Coblenz, Michael. "The Obsidian Programming Language". Retrieved 16 February 2018.
  18. Bodden, Eric. "Clara". Retrieved 23 July 2012.
  19. Xi, Hongwei. "Introduction to Programming in ATS". Retrieved 20 April 2018.