Decompiler

A decompiler is a computer program that takes an executable file as input, and attempts to create a high level source file which can be recompiled successfully. It is therefore the opposite of a compiler, which takes a source file and makes an executable. Decompilers are usually unable to perfectly reconstruct the original source code, and as such, will frequently produce obfuscated code. Nonetheless, decompilers remain an important tool in the reverse engineering of computer software.

Introduction

The term decompiler is most commonly applied to a program which translates executable programs (the output from a compiler) into source code in a (relatively) high level language which, when compiled, will produce an executable whose behavior is the same as the original executable program. By comparison, a disassembler translates an executable program into assembly language (and an assembler could be used to assemble it back into an executable program).

Decompilation is the act of using a decompiler, although the term can also refer to the output of a decompiler. It can be used for the recovery of lost source code, and is also useful in some cases for computer security, interoperability and error correction. [1] The success of decompilation depends on the amount of information present in the code being decompiled and the sophistication of the analysis performed on it. The bytecode formats used by many virtual machines (such as the Java Virtual Machine or the .NET Framework Common Language Runtime) often include extensive metadata and high-level features that make decompilation quite feasible. The presence of debug data can make it possible to reproduce the original variable and structure names and even the line numbers. Machine language without such metadata or debug data is much harder to decompile. [2]

Some compilers and post-compilation tools produce obfuscated code (that is, they attempt to produce output that is very difficult to decompile). This is done to make it more difficult to reverse engineer the executable.

Design

Decompilers can be thought of as composed of a series of phases, each of which contributes specific aspects of the overall decompilation process.

Loader

The first decompilation phase loads and parses the input machine code or intermediate language program's binary file format. It should be able to discover basic facts about the input program, such as the architecture (Pentium, PowerPC, etc.) and the entry point. In many cases, it should be able to find the equivalent of the main function of a C program, which is the start of the user-written code. This excludes the runtime initialization code, which should not be decompiled if possible. If available, the symbol tables and debug data are also loaded. The front end may be able to identify the libraries used even if they are linked with the code; this will provide library interfaces. If it can determine the compiler or compilers used, it may provide useful information in identifying code idioms. [3]
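As a rough illustration of this phase, the following Python sketch reads only the fixed-size header of an ELF executable to recover the architecture and entry point. It is a minimal sketch, assuming a well-formed 32-bit or 64-bit ELF input; the field offsets follow the published ELF layout, and the file name in the usage comment is a placeholder.

    import struct

    def read_elf_header(path):
        """Minimal loader sketch: extract architecture and entry point from an ELF file."""
        with open(path, "rb") as f:
            ident = f.read(16)                      # e_ident: magic, class, endianness, ...
            if ident[:4] != b"\x7fELF":
                raise ValueError("not an ELF file")
            is_64bit = ident[4] == 2                # EI_CLASS: 1 = 32-bit, 2 = 64-bit
            endian = "<" if ident[5] == 1 else ">"  # EI_DATA: 1 = little-endian, 2 = big-endian
            # e_type, e_machine, e_version, e_entry follow e_ident
            fmt = endian + ("HHIQ" if is_64bit else "HHII")
            e_type, e_machine, e_version, e_entry = struct.unpack(
                fmt, f.read(struct.calcsize(fmt)))
            machines = {0x03: "x86", 0x14: "PowerPC", 0x3E: "x86-64", 0xB7: "AArch64"}
            return machines.get(e_machine, hex(e_machine)), e_entry

    # arch, entry = read_elf_header("a.out")   # "a.out" is a hypothetical input binary
    # print(arch, hex(entry))

A real loader would go on to map the sections, load any symbol tables and debug data, and match statically linked library code against signatures.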

Disassembly

The next logical phase is the disassembly of machine code instructions into a machine independent intermediate representation (IR). For example, the Pentium machine instruction

mov eax, [ebx+0x04]

might be translated to the IR

eax  := m[ebx+4];
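As a sketch of how a front end might lift such an instruction, the Python fragment below decodes the three bytes 8B 43 04 (the MOV r32, r/m32 opcode followed by a ModRM byte selecting [ebx+disp8]) into the textual IR shown above. Only this single encoding is handled; it is an illustration, not a real disassembler, which must cover the whole instruction set.

    REGS = ["eax", "ecx", "edx", "ebx", "esp", "ebp", "esi", "edi"]

    def decode_mov_load(code):
        """Decode 'MOV r32, r/m32' (opcode 0x8B) with a [reg+disp8] operand into IR text."""
        assert code[0] == 0x8B, "not a MOV r32, r/m32 instruction"
        modrm = code[1]
        mod, reg, rm = modrm >> 6, (modrm >> 3) & 7, modrm & 7
        assert mod == 0b01 and rm != 0b100, "only [reg+disp8] addressing handled here"
        disp8 = code[2]
        return f"{REGS[reg]} := m[{REGS[rm]}+{disp8}];"

    print(decode_mov_load(bytes([0x8B, 0x43, 0x04])))   # -> eax := m[ebx+4];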

Idioms

Idiomatic machine code sequences are sequences of code whose combined semantics is not immediately apparent from the instructions' individual semantics. Either as part of the disassembly phase, or as part of later analyses, these idiomatic sequences need to be translated into known equivalent IR. For example, the x86 assembly code:

cdq eax             ; edx is set to the sign-extension of eax
xor eax, edx
sub eax, edx

could be translated to

eax  := abs(eax);

Some idiomatic sequences are machine independent; some involve only one instruction. For example, xor eax, eax clears the eax register (sets it to zero). This can be implemented with a machine independent simplification rule, such as a xor a = 0.
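A naive way to implement idiom recognition is direct pattern matching over the lifted instruction list, as in the Python sketch below; the instruction tuples and the two patterns are illustrative rather than drawn from any particular decompiler, and the limitations of matching fixed sequences this early are discussed next.

    def rewrite_idioms(insns):
        """Replace known idiomatic instruction sequences with equivalent IR statements.
        'insns' is a list of (mnemonic, operands) tuples produced by the disassembler."""
        out, i = [], 0
        while i < len(insns):
            # cdq / xor eax,edx / sub eax,edx  ==>  eax := abs(eax)
            if insns[i:i + 3] == [("cdq", ()),
                                  ("xor", ("eax", "edx")),
                                  ("sub", ("eax", "edx"))]:
                out.append("eax := abs(eax);")
                i += 3
            # xor r,r  ==>  r := 0  (machine independent: a xor a = 0)
            elif insns[i][0] == "xor" and insns[i][1][0] == insns[i][1][1]:
                out.append(f"{insns[i][1][0]} := 0;")
                i += 1
            else:
                out.append(insns[i])
                i += 1
        return out

    print(rewrite_idioms([("cdq", ()), ("xor", ("eax", "edx")), ("sub", ("eax", "edx")),
                          ("xor", ("ebx", "ebx"))]))
    # -> ['eax := abs(eax);', 'ebx := 0;']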

In general, it is best to delay detection of idiomatic sequences if possible, to later stages that are less affected by instruction ordering. For example, the instruction scheduling phase of a compiler may insert other instructions into an idiomatic sequence, or change the ordering of instructions in the sequence. A pattern matching process in the disassembly phase would probably not recognize the altered pattern. Later phases group instruction expressions into more complex expressions, and modify them into a canonical (standardized) form, making it more likely that even the altered idiom will match a higher level pattern later in the decompilation.

It is particularly important to recognize the compiler idioms for subroutine calls, exception handling, and switch statements. Some languages also have extensive support for strings or long integers.

Program analysis

Various program analyses can be applied to the IR. In particular, expression propagation combines the semantics of several instructions into more complex expressions. For example,

mov eax, [ebx+0x04]
add eax, [ebx+0x08]
sub [ebx+0x0C], eax

could result in the following IR after expression propagation:

m[ebx+12]  := m[ebx+12] - (m[ebx+4] + m[ebx+8]);

The resulting expression is more like high level language, and has also eliminated the use of the machine register eax. Later analyses may eliminate the ebx register.
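A hedged Python sketch of this propagation step is shown below. It represents each IR statement as a (destination, expression) pair and forward-substitutes register definitions into later uses; a real decompiler must first prove, via data flow analysis, that each substitution is safe.

    def propagate(stmts):
        """Forward-substitute register definitions into later expressions.
        Each statement is (dest, expr) with expressions kept as plain strings.
        This toy version ignores aliasing and multiple uses, which a real
        decompiler must check before substituting."""
        out = []
        defs = {}                              # register -> defining expression
        for dest, expr in stmts:
            for reg, val in defs.items():      # substitute known register values
                expr = expr.replace(reg, f"({val})")
            if dest.startswith("m["):          # memory store: emit it, registers unchanged
                out.append((dest, expr))
            else:
                defs[dest] = expr              # register def: remember it, do not emit yet
        return out

    ir = [("eax", "m[ebx+4]"),
          ("eax", "eax + m[ebx+8]"),
          ("m[ebx+12]", "m[ebx+12] - eax")]
    for dest, expr in propagate(ir):
        print(f"{dest} := {expr};")
    # -> m[ebx+12] := m[ebx+12] - ((m[ebx+4]) + m[ebx+8]);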

Data flow analysis

The places where register contents are defined and used must be traced using data flow analysis. The same analysis can be applied to locations that are used for temporaries and local data. A different name can then be formed for each such connected set of value definitions and uses. It is possible that the same local variable location was used for more than one variable in different parts of the original program. Even worse, it is possible for the data flow analysis to identify a path along which a value may flow between two such uses even though it would never actually happen or matter in reality. This may in bad cases lead to needing to define a location as a union of types. The decompiler may allow the user to explicitly break such unnatural dependencies, which will lead to clearer code. This of course means a variable is potentially used without being initialized, and so indicates a problem in the original program.
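The Python sketch below illustrates the idea on straight-line code only: every fresh definition of a register starts a new variable, and uses are renamed to the most recent definition. The statement format is purely illustrative; a real decompiler performs this with reaching definitions and def-use chains over the whole control-flow graph.

    def split_register_into_variables(stmts, reg):
        """Give each live range of 'reg' its own variable name in straight-line code.
        Statements are (dest, sources) tuples."""
        version = 0
        renamed = []
        for dest, srcs in stmts:
            srcs = tuple(f"{reg}_{version}" if s == reg else s for s in srcs)
            if dest == reg:
                version += 1                   # a new definition starts a new variable
                dest = f"{reg}_{version}"
            renamed.append((dest, srcs))
        return renamed

    code = [("eax", ("m[ebx+4]",)),   # eax holds one value here ...
            ("m[esi]", ("eax",)),
            ("eax", ("m[edi]",)),     # ... and is reused for an unrelated value here
            ("m[esi+4]", ("eax",))]
    print(split_register_into_variables(code, "eax"))
    # eax_1 and eax_2 become two distinct local variables in the output code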

Type analysis

A good machine code decompiler will perform type analysis. Here, the way registers or memory locations are used results in constraints on the possible type of the location. For example, an and instruction implies that the operand is an integer; programs do not use such an operation on floating point values (except in special library code) or on pointers. An add instruction results in three constraints, since the operands may be both integer, or one integer and one pointer (with integer and pointer results respectively; the third constraint comes from the ordering of the two operands when the types are different). [4]
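Here is a minimal sketch of such constraint collection, assuming a tiny illustrative IR: 'and' forces integer operands, a dereference forces its base to be a pointer, and 'add' propagates a pointer type to its result when one operand is already known to be a pointer. Real type analysis solves a full constraint system rather than making this single forward pass.

    def infer_types(ir):
        """Collect simple type constraints from IR operations and propagate them.
        Each operation is (op, dst, a, b); types are just 'int' or 'ptr' here."""
        types = {}                                    # location -> "int" | "ptr"
        for op, dst, a, b in ir:
            if op == "and":
                types[a] = types[b] = types[dst] = "int"
            elif op == "add":
                if types.get(a) == "ptr" or types.get(b) == "ptr":
                    types[dst] = "ptr"                # ptr + int -> ptr
                elif types.get(a) == "int" and types.get(b) == "int":
                    types[dst] = "int"                # int + int -> int
            elif op == "load":                        # dst := m[a + b]
                types[a] = "ptr"                      # the base of a dereference is a pointer
                types.setdefault(b, "int")
        return types

    print(infer_types([("load", "eax", "ebx", "4"),   # eax := m[ebx+4]
                       ("and",  "eax", "eax", "0xff"),
                       ("add",  "ecx", "ebx", "eax")]))
    # -> ebx is inferred as a pointer, eax as an integer, ecx as a pointer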

Various high level expressions can be recognized which trigger recognition of structures or arrays. However, it is difficult to distinguish many of the possibilities, because of the freedom that machine code or even some high level languages such as C allow with casts and pointer arithmetic.

The example from the previous section could result in the following high level code:

struct T1 *ebx;
struct T1 {
    int v0004;
    int v0008;
    int v000C;
};
ebx->v000C -= ebx->v0004 + ebx->v0008;

Structuring

The penultimate decompilation phase involves structuring of the IR into higher level constructs such as while loops and if/then/else conditional statements. For example, the machine code

    xor eax, eax
l0002:
    or  ebx, ebx
    jge l0003
    add eax, [ebx]
    mov ebx, [ebx+0x4]
    jmp l0002
l0003:
    mov [0x10040000], eax

could be translated into:

eax = 0;
while (ebx < 0) {
    eax += ebx->v0000;
    ebx = ebx->v0004;
}
v10040000 = eax;

Unstructured code is more difficult to translate into structured code than already structured code. Solutions include replicating some code, or adding boolean variables. [5]
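The Python sketch below structures exactly the simple pattern of the example above: a loop header whose condition exits the loop, a body with a back edge, and a single follow block. The block dictionary format is invented for illustration; practical structuring algorithms handle arbitrary, even irreducible, control-flow graphs.

    def structure_while(blocks, header):
        """Emit a while loop for a header block with a conditional exit, a body that
        jumps back to the header, and one follow block."""
        node = blocks[header]
        lines = [f"while ({node['cond']}) {{"]
        lines += [f"    {stmt}" for stmt in blocks[node['body']]["stmts"]]
        lines.append("}")
        lines += blocks[node['follow']]["stmts"]
        return "\n".join(lines)

    cfg = {
        "l0002": {"cond": "ebx < 0", "body": "body", "follow": "exit"},
        "body":  {"stmts": ["eax += ebx->v0000;", "ebx = ebx->v0004;"]},
        "exit":  {"stmts": ["v10040000 = eax;"]},
    }
    print("eax = 0;")
    print(structure_while(cfg, "l0002"))   # reproduces the structured code shown above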

Code generation

The final phase is the generation of the high level code in the back end of the decompiler. Just as a compiler may have several back ends for generating machine code for different architectures, a decompiler may have several back ends for generating high level code in different high level languages.

Just before code generation, it may be desirable to allow an interactive editing of the IR, perhaps using some form of graphical user interface. This would allow the user to enter comments, and non-generic variable and function names. However, these are almost as easily entered in a post decompilation edit. The user may want to change structural aspects, such as converting a while loop to a for loop. These are less readily modified with a simple text editor, although source code refactoring tools may assist with this process. The user may need to enter information that failed to be identified during the type analysis phase, e.g. modifying a memory expression to an array or structure expression. Finally, incorrect IR may need to be corrected, or changes made to cause the output code to be more readable.
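To make the back end concrete, here is a small hedged sketch that walks a structured IR tree and emits C text; the node kinds and field names are invented for illustration. A production back end would also declare variables, use the results of type analysis, and format expressions properly, and it could swap in a different emitter per target language.

    def emit_c(node, indent=0):
        """Walk a small structured IR tree and return C source text for it."""
        pad = "    " * indent
        kind = node["kind"]
        if kind == "seq":
            return "\n".join(emit_c(n, indent) for n in node["body"])
        if kind == "assign":
            return f"{pad}{node['dest']} = {node['src']};"
        if kind == "while":
            body = emit_c({"kind": "seq", "body": node["body"]}, indent + 1)
            return f"{pad}while ({node['cond']}) {{\n{body}\n{pad}}}"
        raise ValueError(f"unknown node kind: {kind}")

    tree = {"kind": "seq", "body": [
        {"kind": "assign", "dest": "eax", "src": "0"},
        {"kind": "while", "cond": "ebx < 0", "body": [
            {"kind": "assign", "dest": "eax", "src": "eax + ebx->v0000"},
            {"kind": "assign", "dest": "ebx", "src": "ebx->v0004"}]},
        {"kind": "assign", "dest": "v10040000", "src": "eax"}]}
    print(emit_c(tree))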

Legality

The majority of computer programs are covered by copyright laws. Although the precise scope of what is covered by copyright differs from region to region, copyright law generally provides the author (the programmer(s) or employer) with a collection of exclusive rights to the program. [6] These rights include the right to make copies, including copies made into the computer’s RAM (unless creating such a copy is essential for using the program). [7] Since the decompilation process involves making multiple such copies, it is generally prohibited without the authorization of the copyright holder. However, because decompilation is often a necessary step in achieving software interoperability, copyright laws in both the United States and Europe permit decompilation to a limited extent.

In the United States, the copyright fair use defence has been successfully invoked in decompilation cases. For example, in Sega v. Accolade, the court held that Accolade could lawfully engage in decompilation in order to circumvent the software locking mechanism used by Sega's game consoles. [8] Additionally, the Digital Millennium Copyright Act (PUBLIC LAW 105–304 [9]) contains explicit exemptions for security testing and evaluation (§1201(j)) and for reverse engineering (§1201(f)).

In Europe, the 1991 Software Directive explicitly provides for a right to decompile in order to achieve interoperability. The result of a heated debate between, on the one side, software protectionists and, on the other, academics as well as independent software developers, Article 6 permits decompilation only if a number of conditions are met: [10]

  1. the decompilation is performed by a licensee, by another person having a right to use a copy of the program, or on their behalf by a person authorized to do so;
  2. the information necessary to achieve interoperability has not previously been readily available to those persons;
  3. the decompilation is confined to the parts of the original program that are necessary to achieve interoperability.

In addition, Article 6 prescribes that the information obtained through decompilation may not be used for other purposes and that it may not be given to others.

Overall, the decompilation right provided by Article 6 codifies what is claimed to be common practice in the software industry. Few European lawsuits are known to have emerged from the decompilation right. This could be interpreted as meaning one of three things: 1) the decompilation right is not used frequently and the decompilation right may therefore have been unnecessary, 2) the decompilation right functions well and provides sufficient legal certainty not to give rise to legal disputes or 3) illegal decompilation goes largely undetected. In a recent report regarding implementation of the Software Directive by the European member states, the European Commission seems to support the second interpretation. [11]


References

  1. Mike Van Emmerik (2005-04-29). "Why Decompilation". Program-transformation.org. Retrieved 2010-09-15.
  2. Miecznikowski, Jerome; Hendren, Laurie (2002). "Decompiling Java Bytecode: Problems, Traps and Pitfalls". In R. Nigel Horspool (ed.). Compiler Construction: 11th International Conference, proceedings / CC 2002. Springer-Verlag. pp. 111–127. ISBN 3-540-43369-4.
  3. Cifuentes, Cristina; Gough, K. John (July 1995). "Decompilation of Binary Programs". Software Practice and Experience. 25 (7): 811–829. CiteSeerX 10.1.1.14.8073. doi:10.1002/spe.4380250706.
  4. Mycroft, Alan (1999). "Type-Based Decompilation". In S. Doaitse Swierstra (ed.). Programming Languages and Systems: 8th European Symposium on Programming Languages and Systems. Springer-Verlag. pp. 208–223. ISBN 3-540-65699-5.
  5. Cifuentes, Cristina (1994). Reverse Compilation Techniques. PhD thesis, Queensland University of Technology. (Available as PDF, Chapter 6, Archived 2016-11-22 at the Wayback Machine.)
  6. Rowland, Diane (2005). Information Technology Law (3rd ed.). Cavendish. ISBN 1-85941-756-6.
  7. "U.S. Copyright Office - Copyright Law: Chapter 1".
  8. "The Legality of Decompilation". Program-transformation.org. 2004-12-03. Retrieved 2010-09-15.
  9. "Digital Millennium Copyright Act" (PDF). US Congress. 1998-10-28. Retrieved 2013-11-15.
  10. Czarnota, B.; Hart, R.J. (1991). Legal Protection of Computer Programs in Europe: A Guide to the EC Directive. London: Butterworths.
  11. "EUR-Lex - 52000DC0199 - EN".