Soot (software)

Last updated

In static program analysis, Soot is a bytecode manipulation and optimization framework consisting of intermediate languages for Java. It has been developed by the Sable Research Group at McGill University. Soot is currently maintained by the Secure Software Engineering Group at Paderborn University. [1] Soot provides four intermediate representations for use through its API for other analysis programs to access and build upon: [2]

Contents

The current Soot software release also contains detailed program analyses that can be used out-of-the-box, such as context-sensitive flow-insensitive points-to analysis, [3] call graph analysis and domination analysis (answering the question "must event a follow event b?"). It also has a decompiler called dava.

Soot is free software available under the GNU Lesser General Public License (LGPL). In 2010, two research papers on Soot (Vallée-Rai et al. 1999 and Pominville et al. 2000) were selected as IBM CASCON First Decade High Impact Papers among 12 other papers from the 425 entries. [4]

Jimple

Jimple is an intermediate representation of a Java program designed to be easier to optimize than Java bytecode. It is typed, has a concrete syntax and is based on three-address code.

Jimple includes only 15 different operations, thus simplifying flow analysis. By contrast, java bytecode includes over 200 different operations. [5] [6]

Unlike java bytecode, in Jimple local and stack variables are typed and Jimple is inherently type safe.

Converting to Jimple, or "Jimplifying" (after "simplifying"), is conversion of bytecode to three-address code. The idea behind the conversion, first investigated by Clark Verbrugge, is to associate a variable to each position in the stack. Hence stack operations become assignments involving the stack variables.

Example

Consider the following bytecode, which is from the [7]

iload 1  // load variable x1, and push it on the stack iload 2  // load variable x2, and push it on the stack iadd     // pop two values, and push their sum on the stack istore 1 // pop a value from the stack, and store it in variable x1 

The above translates to the following three-address code:

stack1 = x1 // iload 1 stack2 = x2 // iload 2 stack1 = stack1 + stack2 // iadd x1 = stack1 // istore 1 

In general the resulting code does not have static single assignment form.

SootUp

Soot is now succeeded by the SootUp framework developed by the Secure Software Engineering Group at Paderborn University [8] . SootUp is a complete reimplementation of Soot with a novel design, that focuses more on static program analysis, rather than bytecode optimization.

Related Research Articles

<span class="mw-page-title-main">GNU Compiler Collection</span> Free and open-source compiler for various programming languages

The GNU Compiler Collection (GCC) is an optimizing compiler produced by the GNU Project supporting various programming languages, hardware architectures and operating systems. The Free Software Foundation (FSF) distributes GCC as free software under the GNU General Public License. GCC is a key component of the GNU toolchain and the standard compiler for most projects related to GNU and the Linux kernel. With roughly 15 million lines of code in 2019, GCC is one of the biggest free programs in existence. It has played an important role in the growth of free software, as both a tool and an example.

<span class="mw-page-title-main">Java virtual machine</span> Virtual machine that runs Java programs

A Java virtual machine (JVM) is a virtual machine that enables a computer to run Java programs as well as programs written in other languages that are also compiled to Java bytecode. The JVM is detailed by a specification that formally describes what is required in a JVM implementation. Having a specification ensures interoperability of Java programs across different implementations so that program authors using the Java Development Kit (JDK) need not worry about idiosyncrasies of the underlying hardware platform.

OCaml is a general-purpose, high-level, multi-paradigm programming language which extends the Caml dialect of ML with object-oriented features. OCaml was created in 1996 by Xavier Leroy, Jérôme Vouillon, Damien Doligez, Didier Rémy, Ascánder Suárez, and others.

<span class="mw-page-title-main">Interpreter (computing)</span> Program that executes source code without a separate compilation step

In computer science, an interpreter is a computer program that directly executes instructions written in a programming or scripting language, without requiring them previously to have been compiled into a machine language program. An interpreter generally uses one of the following strategies for program execution:

  1. Parse the source code and perform its behavior directly;
  2. Translate source code into some efficient intermediate representation or object code and immediately execute that;
  3. Explicitly execute stored precompiled bytecode made by a compiler and matched with the interpreter Virtual Machine.

The GNU Compiler for Java (GCJ) is a discontinued free compiler for the Java programming language. It was part of the GNU Compiler Collection.

In computing, just-in-time (JIT) compilation is compilation during execution of a program rather than before execution. This may consist of source code translation but is more commonly bytecode translation to machine code, which is then executed directly. A system implementing a JIT compiler typically continuously analyses the code being executed and identifies parts of the code where the speedup gained from compilation or recompilation would outweigh the overhead of compiling that code.

AspectJ is an aspect-oriented programming (AOP) extension created at PARC for the Java programming language. It is available in Eclipse Foundation open-source projects, both stand-alone and integrated into Eclipse. AspectJ has become a widely used de facto standard for AOP by emphasizing simplicity and usability for end users. It uses Java-like syntax, and included IDE integrations for displaying crosscutting structure since its initial public release in 2001.

In compiler design, static single assignment form is a property of an intermediate representation (IR) that requires each variable to be assigned exactly once and defined before it is used. Existing variables in the original IR are split into versions, new variables typically indicated by the original name with a subscript in textbooks, so that every definition gets its own version. In SSA form, use-def chains are explicit and each contains a single element.

SableVM was a clean room implementation of Java bytecode interpreter implementing the Java virtual machine (VM) specification, second edition. SableVM was designed to be a robust, extremely portable, efficient, and fully specifications-compliant Java Virtual Machine that would be easy to maintain and to extend. It is now no longer being maintained.

An intermediate representation (IR) is the data structure or code used internally by a compiler or virtual machine to represent source code. An IR is designed to be conducive to further processing, such as optimization and translation. A "good" IR must be accurate – capable of representing the source code without loss of information – and independent of any particular source or target language. An IR may take one of several forms: an in-memory data structure, or a special tuple- or stack-based code readable by the program. In the latter case it is also called an intermediate language.

In compiler optimization, escape analysis is a method for determining the dynamic scope of pointers – where in the program a pointer can be accessed. It is related to pointer analysis and shape analysis.

Haxe is a high-level cross-platform programming language and compiler that can produce applications and source code for many different computing platforms from one code-base. It is free and open-source software, released under the MIT License. The compiler, written in OCaml, is released under the GNU General Public License (GPL) version 2.

In computer science, ahead-of-time compilation is the act of compiling an (often) higher-level programming language into an (often) lower-level language before execution of a program, usually at build-time, to reduce the amount of work needed to be performed at run time.

Dalvik is a discontinued process virtual machine (VM) in the Android operating system that executes applications written for Android. Dalvik was an integral part of the Android software stack in the Android versions 4.4 "KitKat" and earlier, which were commonly used on mobile devices such as mobile phones and tablet computers, and more in some devices such as smart TVs and wearables. Dalvik is open-source software, originally written by Dan Bornstein, who named it after the fishing village of Dalvík in Eyjafjörður, Iceland.

A decompiler is a computer program that translates an executable file to high-level source code. It does therefore the opposite of a typical compiler, which translates a high-level language to a low-level language. While disassemblers translate an executable into assembly language, decompilers go a step further and translate the code into a higher level language such as C or Java, requiring more sophisticated techniques. Decompilers are usually unable to perfectly reconstruct the original source code, thus will frequently produce obfuscated code. Nonetheless, they remain an important tool in the reverse engineering of computer software.

The Maxine virtual machine is an open source virtual machine that is developed at the University of Manchester. It was formerly developed by Sun Microsystems Laboratories, since renamed Oracle Labs. The emphasis in Maxine's software architecture is on modular design and code reuse for flexibility, configurability, and productivity for industrial and academic virtual machine researchers. It is one of a growing number of Java virtual machines written entirely in Java in a meta-circular style. Examples include Squawk and Jikes RVM.

This article compares the application programming interfaces (APIs) and virtual machines (VMs) of the programming language Java and operating system Android.

Java bytecode is the instruction set of the Java virtual machine (JVM), crucial for executing programs written in the Java language and other JVM-compatible languages. Each bytecode operation in the JVM is represented by a single byte, hence the name "bytecode", making it a compact form of instruction. This intermediate form enables Java programs to be platform-independent, as they are compiled not to native machine code but to a universally executable format across different JVM implementations.

<span class="mw-page-title-main">Laurie Hendren</span> Canadian computer scientist (1958–2019)

Laurie Hendren was a Canadian computer scientist noted for her research in programming languages and compilers.

<span class="mw-page-title-main">ProGuard</span>

ProGuard is an open source command-line tool which shrinks, optimizes and obfuscates Java code. It is able to optimize bytecode as well as detect and remove unused instructions. ProGuard is free software and is distributed under the GNU General Public License, version 2.

References

  1. "Soot - A Java optimization framework". github.com. Retrieved 16 January 2024.
  2. "A framework for analyzing and transforming Java and Android Applications". Sable.mcgill.ca. Archived from the original on 2008-12-28. Retrieved 2016-08-10.
  3. "Tutorials · Sable/soot Wiki · GitHub". Sable.mcgill.ca. 2016-01-12. Retrieved 2016-08-10.
  4. "CASCON First Decade High Impact Papers". Dl.acm.org. Retrieved 2016-08-10.
  5. Vallee-Rai, Raja (1998). "The Jimple Framework". Sable.mcgill.ca.
  6. Vallee-Rai, Raja; Hendren, Laurie J. (1998). "Jimple: Simplifying Java Bytecode for Analyses and Transformations". Sable.mcgill.ca.
  7. Vallee-Rai 1998.
  8. "A new version of Soot with a completely overhauled architecture". github.com. Retrieved 16 January 2024.

Further reading