Common Intermediate Language

Common Intermediate Language (CIL), formerly called Microsoft Intermediate Language (MSIL) or Intermediate Language (IL), [1] is the intermediate language binary instruction set defined within the Common Language Infrastructure (CLI) specification. [2] CIL instructions are executed by a CIL-compatible runtime environment such as the Common Language Runtime. Languages which target the CLI compile to CIL. CIL is object-oriented, stack-based bytecode. Runtimes typically just-in-time compile CIL instructions into native code.

CIL was originally known as Microsoft Intermediate Language (MSIL) during the beta releases of the .NET languages. Due to standardization of C# and the CLI, the bytecode is now officially known as CIL. [3] Windows Defender virus definitions continue to refer to binaries compiled with it as MSIL. [4]

General information

During compilation of CLI programming languages, the source code is translated into CIL code rather than into platform- or processor-specific object code. CIL is a CPU- and platform-independent instruction set that can be executed in any environment supporting the Common Language Infrastructure, such as the .NET runtime on Windows, or the cross-platform Mono runtime. In theory, this eliminates the need to distribute different executable files for different platforms and CPU types. CIL code is verified for safety during runtime, providing better security and reliability than natively compiled executable files. [5] [6]

The execution process looks like this:

  1. Source code is converted to CIL bytecode and a CLI assembly is created.
  2. Upon execution of a CIL assembly, its code is passed through the runtime's JIT compiler to generate native code. Ahead-of-time compilation may also be used, which eliminates this step, but at the cost of executable-file portability.
  3. The computer's processor executes the native code.

Instructions

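CIL bytecode has instructions for the following groups of tasks: load and store, arithmetic, type conversion, object creation and manipulation, operand stack management (push and pop), control transfer (branching), method invocation and return, throwing exceptions, and data and function pointer manipulation.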

Computational model

The Common Intermediate Language is object-oriented and stack-based, which means that instruction operands and results are kept on a single stack rather than in several registers or other memory locations, as they are in register-based instruction sets.

Code that adds two numbers in x86 assembly language, where eax and edx specify two different general-purpose registers:

    add eax, edx

The equivalent code in CIL, where local variable 0 stands in for eax and local variable 1 for edx:

    ldloc.0    // push local variable 0 onto stack
    ldloc.1    // push local variable 1 onto stack
    add        // pop and add the top two stack items then push the result onto the stack
    stloc.0    // pop and store the top stack item to local variable 0

In the latter example, the values of local variables 0 and 1, which stand in for the registers eax and edx, are first pushed onto the stack. When the add instruction executes, the operands are "popped", or retrieved, and the result is "pushed", or stored, on the stack. The resulting value is then popped from the stack and stored in local variable 0, the counterpart of eax.
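
The stack discipline described above can be imitated in a few lines of Python. This is only an illustrative sketch, not how a real runtime is implemented: the function name run and the representation of instructions as plain strings are assumptions made for the example.

```python
def run(instructions, local_vars):
    """Execute a tiny, hypothetical subset of CIL on a single evaluation stack."""
    stack = []
    for op in instructions:
        name, _, index = op.partition(".")
        if name == "ldloc":
            # Push the value of a local variable onto the stack.
            stack.append(local_vars[int(index)])
        elif name == "stloc":
            # Pop the top of the stack into a local variable.
            local_vars[int(index)] = stack.pop()
        elif name == "add":
            # Pop two operands, push their sum.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return local_vars


# Local 0 plays the role of eax, local 1 the role of edx.
print(run(["ldloc.0", "ldloc.1", "add", "stloc.0"], [2, 3]))  # [5, 3]
```

No instruction names a source or destination register; every operand is implied by its position on the stack.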

Object-oriented concepts

CIL is designed to be object-oriented. You may create objects, call methods, and use other types of members, such as fields.

Every method needs (with some exceptions) to reside in a class. So does this static method:

    .class public Foo {
        .method public static int32 Add(int32, int32) cil managed {
            .maxstack 2
            ldarg.0    // load the first argument;
            ldarg.1    // load the second argument;
            add        // add them;
            ret        // return the result;
        }
    }

The method Add does not require any instance of Foo to be declared because it is declared as static, and it may then be used like this in C#:

    int r = Foo.Add(2, 3);    // 5

In CIL it would look like this:

    ldc.i4.2
    ldc.i4.3
    call int32 Foo::Add(int32, int32)
    stloc.0
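
How arguments and the return value travel over the stack during such a call can be sketched in Python. The METHODS table, the execute function and the string form of the instructions are hypothetical simplifications introduced for this example; a real runtime resolves methods through metadata rather than a dictionary.

```python
# Hypothetical method table: name -> (argument count, body).
METHODS = {
    "Foo::Add": (2, ["ldarg.0", "ldarg.1", "add", "ret"]),
}


def execute(body, args=(), local_count=1):
    """Run a tiny CIL-like subset; returns (return value, local variables)."""
    stack = []
    local_vars = [0] * local_count
    for op in body:
        if op.startswith("ldc.i4."):
            stack.append(int(op[len("ldc.i4."):]))        # push a small constant
        elif op.startswith("ldarg."):
            stack.append(args[int(op[len("ldarg."):])])   # push an argument
        elif op.startswith("stloc."):
            local_vars[int(op[len("stloc."):])] = stack.pop()
        elif op == "add":
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op.startswith("call "):
            arg_count, callee_body = METHODS[op[len("call "):]]
            # Arguments were pushed left to right, so popping yields them reversed.
            call_args = [stack.pop() for _ in range(arg_count)][::-1]
            result, _ = execute(callee_body, call_args)
            stack.append(result)                          # result lands on the caller's stack
        elif op == "ret":
            return (stack.pop() if stack else None), local_vars
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return None, local_vars


_, caller_locals = execute(["ldc.i4.2", "ldc.i4.3", "call Foo::Add", "stloc.0"])
print(caller_locals[0])  # 5
```

The callee gets its own evaluation stack; only the arguments and the result cross the boundary between caller and callee.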

Instance classes

An instance class contains at least one constructor and some instance members. The following class has a set of methods representing the actions of a Car object.

    .class public Car {
        .method public specialname rtspecialname instance void .ctor(int32, int32) cil managed {
            /* Constructor */
        }

        .method public void Move(int32) cil managed { /* Omitting implementation */ }
        .method public void TurnRight() cil managed { /* Omitting implementation */ }
        .method public void TurnLeft() cil managed { /* Omitting implementation */ }
        .method public void Brake() cil managed { /* Omitting implementation */ }
    }

Creating objects

In C#, class instances are created like this:

    Car myCar = new Car(1, 4);
    Car yourCar = new Car(1, 3);

Those statements are roughly equivalent to these CIL instructions:

    ldc.i4.1
    ldc.i4.4
    newobj instance void Car::.ctor(int32, int32)
    stloc.0    // myCar = new Car(1, 4);
    ldc.i4.1
    ldc.i4.3
    newobj instance void Car::.ctor(int32, int32)
    stloc.1    // yourCar = new Car(1, 3);

Invoking instance methods

In C#, an instance method is invoked like this:

    myCar.Move(3);

In CIL, the same call looks like this:

    ldloc.0    // Load the object "myCar" on the stack
    ldc.i4.3
    call instance void Car::Move(int32)
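
The order in which the object reference and the arguments are placed on the stack can be illustrated with a small Python sketch. Everything here is an assumption made for the example: the Python Car class, its invented fields ctor_args and distance, and the tuple form of the instructions.

```python
class Car:
    """Hypothetical stand-in for the Car class; the field names are invented."""

    def __init__(self, first, second):
        self.ctor_args = (first, second)
        self.distance = 0

    def move(self, amount):
        self.distance += amount


def run(program):
    stack, local_vars = [], {}
    for op, operand in program:
        if op == "ldc.i4":
            stack.append(operand)
        elif op == "newobj":
            # Pop the constructor arguments, push a reference to the new object.
            cls, arg_count = operand
            args = [stack.pop() for _ in range(arg_count)][::-1]
            stack.append(cls(*args))
        elif op == "stloc":
            local_vars[operand] = stack.pop()
        elif op == "ldloc":
            stack.append(local_vars[operand])
        elif op == "call instance":
            # The arguments sit above the object reference ("this") on the stack.
            method, arg_count = operand
            args = [stack.pop() for _ in range(arg_count)][::-1]
            this = stack.pop()
            method(this, *args)
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return local_vars


program = [
    ("ldc.i4", 1), ("ldc.i4", 4), ("newobj", (Car, 2)), ("stloc", 0),  # myCar = new Car(1, 4)
    ("ldloc", 0), ("ldc.i4", 3), ("call instance", (Car.move, 1)),     # myCar.Move(3)
]
result = run(program)
print(result[0].distance)  # 3
```

The object reference is pushed before the arguments, so the callee receives it as an implicit first parameter.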

Metadata

The Common Language Infrastructure (CLI) records information about compiled classes as metadata. Like the type library in the Component Object Model, this enables applications to support and discover the interfaces, classes, types, methods, and fields in the assembly. The process of reading such metadata is called "reflection".

Metadata can also take the form of "attributes". Custom attributes can be defined by extending the Attribute class. This lets the creator of a class adorn it with extra information that consumers of the class can use in various meaningful ways, depending on the application domain.

Example

Below is a basic "Hello, World!" program written in CIL assembler. It will display the string "Hello, world!".

    .assembly Hello {}
    .assembly extern mscorlib {}
    .method static void Main()
    {
        .entrypoint
        .maxstack 1
        ldstr "Hello, world!"
        call void [mscorlib]System.Console::WriteLine(string)
        ret
    }

The following C# code is more complex in terms of the number of opcodes it compiles to. It can also be compared with the corresponding code in the article about Java bytecode.

    static void Main(string[] args)
    {
        for (int i = 2; i < 1000; i++)
        {
            for (int j = 2; j < i; j++)
            {
                if (i % j == 0)
                    goto outer;
            }
            Console.WriteLine(i);
            outer:;
        }
    }

In CIL assembler syntax it looks like this:

    .method private hidebysig static void Main(string[] args) cil managed
    {
        .entrypoint
        .maxstack  2
        .locals init (int32 V_0,
                      int32 V_1)

                  ldc.i4.2
                  stloc.0
                  br.s       IL_001f
        IL_0004:  ldc.i4.2
                  stloc.1
                  br.s       IL_0011
        IL_0008:  ldloc.0
                  ldloc.1
                  rem
                  brfalse.s  IL_001b
                  ldloc.1
                  ldc.i4.1
                  add
                  stloc.1
        IL_0011:  ldloc.1
                  ldloc.0
                  blt.s      IL_0008
                  ldloc.0
                  call       void [mscorlib]System.Console::WriteLine(int32)
        IL_001b:  ldloc.0
                  ldc.i4.1
                  add
                  stloc.0
        IL_001f:  ldloc.0
                  ldc.i4     0x3e8
                  blt.s      IL_0004
                  ret
    }
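
To see how the branch instructions in this listing drive the two loops, the same instruction sequence can be stepped through by a small Python interpreter. This is a sketch under simplifying assumptions: the short-form suffix (.s) is dropped because it affects only the byte encoding, the call to Console::WriteLine is replaced by appending to a list, and the names PROGRAM and run are invented for the example.

```python
# (label, opcode, operand) triples transcribed from the listing above.
PROGRAM = [
    (None,      "ldc.i4",  2),
    (None,      "stloc",   0),
    (None,      "br",      "IL_001f"),
    ("IL_0004", "ldc.i4",  2),
    (None,      "stloc",   1),
    (None,      "br",      "IL_0011"),
    ("IL_0008", "ldloc",   0),
    (None,      "ldloc",   1),
    (None,      "rem",     None),
    (None,      "brfalse", "IL_001b"),
    (None,      "ldloc",   1),
    (None,      "ldc.i4",  1),
    (None,      "add",     None),
    (None,      "stloc",   1),
    ("IL_0011", "ldloc",   1),
    (None,      "ldloc",   0),
    (None,      "blt",     "IL_0008"),
    (None,      "ldloc",   0),
    (None,      "call",    "WriteLine"),
    ("IL_001b", "ldloc",   0),
    (None,      "ldc.i4",  1),
    (None,      "add",     None),
    (None,      "stloc",   0),
    ("IL_001f", "ldloc",   0),
    (None,      "ldc.i4",  0x3E8),
    (None,      "blt",     "IL_0004"),
    (None,      "ret",     None),
]


def run(program):
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    stack, local_vars, output = [], [0, 0], []
    pc = 0
    while True:
        _, op, arg = program[pc]
        pc += 1
        if op == "ldc.i4":
            stack.append(arg)
        elif op == "ldloc":
            stack.append(local_vars[arg])
        elif op == "stloc":
            local_vars[arg] = stack.pop()
        elif op in ("add", "rem"):
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right if op == "add" else left % right)
        elif op == "br":
            pc = labels[arg]                  # unconditional jump
        elif op == "brfalse":
            if stack.pop() == 0:              # jump when the popped value is zero
                pc = labels[arg]
        elif op == "blt":
            right = stack.pop()
            left = stack.pop()
            if left < right:                  # jump when first operand < second
                pc = labels[arg]
        elif op == "call":
            output.append(stack.pop())        # stand-in for Console.WriteLine
        elif op == "ret":
            return output
        else:
            raise ValueError(f"unsupported instruction: {op}")


print(run(PROGRAM)[:5])  # [2, 3, 5, 7, 11]
```

Both loop conditions are tested at the bottom of their loops (IL_0011 and IL_001f), which is why each loop is entered through an initial unconditional branch.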

This is just a textual representation of how CIL looks near the virtual machine (VM) level. When compiled, the methods are stored in tables and the instructions are stored as bytes inside the assembly, which is a Portable Executable (PE) file.
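
As a rough illustration of instructions being stored as bytes, the four-instruction addition sequence from the computational model section can be encoded by hand. The sketch covers only four single-byte opcodes, with values given as they appear in the ECMA-335 instruction set tables (worth checking against the specification); the function names assemble and disassemble are invented for the example, and real method bodies also carry a header and may contain multi-byte instructions.

```python
# Single-byte opcode values for a handful of CIL instructions.
OPCODES = {
    "ldloc.0": 0x06,
    "ldloc.1": 0x07,
    "stloc.0": 0x0A,
    "add": 0x58,
}
MNEMONICS = {value: name for name, value in OPCODES.items()}


def assemble(mnemonics):
    """Turn a list of mnemonics into the raw instruction bytes."""
    return bytes(OPCODES[m] for m in mnemonics)


def disassemble(code):
    """Turn raw instruction bytes back into mnemonics."""
    return [MNEMONICS[b] for b in code]


code = assemble(["ldloc.0", "ldloc.1", "add", "stloc.0"])
print(code.hex())         # 0607580a
print(disassemble(code))  # ['ldloc.0', 'ldloc.1', 'add', 'stloc.0']
```

Because each of these instructions occupies exactly one byte, the mapping between text and bytes is a simple table lookup in both directions.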

Generation

A CIL assembly and instructions are generated by either a compiler or a utility called the IL Assembler (ILAsm) that is shipped with the execution environment.

Assembled CIL can also be disassembled back into textual code using the IL Disassembler (ILDASM). Other tools, such as .NET Reflector, can decompile CIL into a high-level language (e.g. C# or Visual Basic). This makes CIL a very easy target for reverse engineering, a trait it shares with Java bytecode. However, there are tools that can obfuscate the code so that it is no longer easily readable but still runs.

Execution

Just-in-time compilation

Just-in-time compilation (JIT) involves turning the bytecode into code immediately executable by the CPU. The conversion is performed gradually during the program's execution. JIT compilation provides environment-specific optimization, runtime type safety, and assembly verification. To accomplish this, the JIT compiler examines the assembly metadata for any illegal accesses and handles violations appropriately.
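
The idea that conversion happens gradually, method by method, can be sketched in Python. This is a toy model and not how any real JIT compiler works: a Python lambda stands in for native code, only straight-line bodies using ldarg, add, mul and ret are handled, and the names translate and LazyJit are hypothetical.

```python
def translate(body, arg_count):
    """Translate straight-line stack code into a Python function (stand-in for native code)."""
    symbolic = []  # stack of expression strings instead of runtime values
    for op in body:
        if op.startswith("ldarg."):
            index = int(op[len("ldarg."):])
            symbolic.append(f"a{index}")
        elif op in ("add", "mul"):
            right = symbolic.pop()
            left = symbolic.pop()
            sign = "+" if op == "add" else "*"
            symbolic.append(f"({left} {sign} {right})")
        elif op == "ret":
            params = ", ".join(f"a{i}" for i in range(arg_count))
            return eval(f"lambda {params}: {symbolic.pop()}")
        else:
            raise ValueError(f"unsupported instruction: {op}")
    raise ValueError("method body has no ret")


class LazyJit:
    def __init__(self, methods):
        self.methods = methods   # name -> (argument count, body)
        self.compiled = {}       # name -> translated function, filled on demand
        self.compile_count = 0

    def call(self, name, *args):
        if name not in self.compiled:
            # First call: translate now and cache the result.
            arg_count, body = self.methods[name]
            self.compiled[name] = translate(body, arg_count)
            self.compile_count += 1
        return self.compiled[name](*args)


jit = LazyJit({
    "Add": (2, ["ldarg.0", "ldarg.1", "add", "ret"]),
    "Square": (1, ["ldarg.0", "ldarg.0", "mul", "ret"]),
})
print(jit.call("Add", 2, 3))    # 5
print(jit.call("Add", 10, 20))  # 30, reuses the cached translation
print(jit.compile_count)        # 1, Square has not been translated yet
```

Methods that are never called are never translated, which is the trade-off against ahead-of-time compilation: less work up front, at the cost of a delay on each method's first call.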

Ahead-of-time compilation

CLI-compatible execution environments also offer the option of ahead-of-time (AOT) compilation of an assembly, which makes it execute faster by removing the JIT step at runtime.

In the .NET Framework, a special tool called the Native Image Generator (NGEN) performs the AOT compilation. A different approach to AOT is CoreRT, which allows .NET Core code to be compiled to a single executable with no dependency on a runtime. Mono also has an option for AOT compilation.

Pointer instructions - C++/CLI

A notable difference from Java bytecode is that CIL comes with ldind, stind, ldloca, and many call instructions, which are enough for the data and function pointer manipulation needed to compile C/C++ code into CIL. Consider the following C++ code:

    class A {
    public:
        virtual void __stdcall meth() {}
    };

    void test_pointer_operations(int param) {
        int k = 0;
        int *ptr = &k;
        *ptr = 1;
        ptr = &param;
        *ptr = 2;
        A a;
        A *ptra = &a;
        ptra->meth();
    }

The corresponding code in CIL can be rendered as this:

    .method assembly static void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl)
            test_pointer_operations(int32 param) cil managed
    {
      .vtentry 1 : 1
      // Code size       44 (0x2c)
      .maxstack  2
      .locals ([0] int32* ptr,
               [1] valuetype A* V_1,
               [2] valuetype A* a,
               [3] int32 k)
    // k = 0;
      IL_0000:  ldc.i4.0
      IL_0001:  stloc.3
    // ptr = &k;
      IL_0002:  ldloca.s   k      // load local's address instruction
      IL_0004:  stloc.0
    // *ptr = 1;
      IL_0005:  ldloc.0
      IL_0006:  ldc.i4.1
      IL_0007:  stind.i4          // indirection instruction
    // ptr = &param
      IL_0008:  ldarga.s   param  // load parameter's address instruction
      IL_000a:  stloc.0
    // *ptr = 2
      IL_000b:  ldloc.0
      IL_000c:  ldc.i4.2
      IL_000d:  stind.i4
    // a = new A;
      IL_000e:  ldloca.s   a
      IL_0010:  call       valuetype A* modopt([mscorlib]System.Runtime.CompilerServices.CallConvThiscall) 'A.{ctor}'(valuetype A* modopt([mscorlib]System.Runtime.CompilerServices.IsConst) modopt([mscorlib]System.Runtime.CompilerServices.IsConst))
      IL_0015:  pop
    // ptra = &a;
      IL_0016:  ldloca.s   a
      IL_0018:  stloc.1
    // ptra->meth();
      IL_0019:  ldloc.1
      IL_001a:  dup
      IL_001b:  ldind.i4          // reading the VMT for virtual call
      IL_001c:  ldind.i4
      IL_001d:  calli      unmanaged stdcall void modopt([mscorlib]System.Runtime.CompilerServices.CallConvStdcall)(native int)
      IL_0022:  ret
    } // end of method 'Global Functions'::test_pointer_operations

References

  1. "Intermediate Language & execution".
  2. "ECMA-335 Common Language Infrastructure (CLI)".
  3. "What is Intermediate Language(IL)/MSIL/CIL in .NET". Retrieved 2011-02-17. CIL: ... When we compile [a] .NET project, it [is] not directly converted to binary code but to the intermediate language. When a project is run, every language of .NET programming is converted into binary code into CIL. Only some part of CIL that is required at run time is converted into binary code. DLL and EXE of .NET are also in CIL form.
  4. "HackTool:MSIL/SkypeCracker". Microsoft. Retrieved 26 November 2019.
  5. Troelsen, Andrew (2009-05-02). Benefits of CIL. ISBN 9781590598849. Retrieved 2011-02-17.
  6. "Unmanaged, Managed Extensions for C++, Managed and .Net Framework". www.visualcplusdotnet.com. Retrieved 2020-07-07.
