Common Intermediate Language (CIL), formerly called Microsoft Intermediate Language (MSIL) or Intermediate Language (IL), [1] is the intermediate language binary instruction set defined within the Common Language Infrastructure (CLI) specification. [2] CIL instructions are executed by a CIL-compatible runtime environment such as the Common Language Runtime. Languages which target the CLI compile to CIL. CIL is object-oriented, stack-based bytecode. Runtimes typically just-in-time compile CIL instructions into native code.
CIL was originally known as Microsoft Intermediate Language (MSIL) during the beta releases of the .NET languages. Due to standardization of C# and the CLI, the bytecode is now officially known as CIL. [3] Windows Defender virus definitions continue to refer to binaries compiled with it as MSIL. [4]
During compilation of CLI programming languages, the source code is translated into CIL code rather than into platform- or processor-specific object code. CIL is a CPU- and platform-independent instruction set that can be executed in any environment supporting the Common Language Infrastructure, such as the .NET runtime on Windows, or the cross-platform Mono runtime. In theory, this eliminates the need to distribute different executable files for different platforms and CPU types. CIL code is verified for safety during runtime, providing better security and reliability than natively compiled executable files. [5] [6]
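The execution process looks like this: source code is first compiled to CIL bytecode and packaged in a CLI assembly; when the assembly is executed, the runtime's just-in-time compiler translates the CIL into native code; the processor then executes that native code.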
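CIL bytecode includes instructions for groups of tasks such as loading and storing values, arithmetic, object creation, control transfer (branching), method invocation and return, and the manipulation of data and function pointers.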
The Common Intermediate Language is object-oriented and stack-based, which means that instruction parameters and results are kept on a single stack instead of in several registers or other memory locations, as in most programming languages.
Code that adds two numbers in x86 assembly language, where eax and edx specify two different general-purpose registers:
    add eax, edx
Code in an intermediate language (IL), where 0 is eax and 1 is edx:
    ldloc.0    // push local variable 0 onto stack
    ldloc.1    // push local variable 1 onto stack
    add        // pop and add the top two stack items then push the result onto the stack
    stloc.0    // pop and store the top stack item to local variable 0
In the latter example, the values of the two local variables, which stand in for the registers eax and edx, are first pushed onto the stack. When the add instruction executes, the operands are "popped", or retrieved, and the result is "pushed", or stored, on the stack. The resulting value is then popped from the stack and stored in local variable 0, the counterpart of eax.
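The push and pop behaviour described above can be traced with a small simulation. The following Python sketch is only an illustration, not part of any CLI runtime: the function name `run` and the list-based representation of instructions and local variables are assumptions made for this example, and only the three instructions used above are supported.

```python
def run(instructions, local_vars):
    """Execute a tiny subset of CIL-like instructions on a single evaluation stack."""
    stack = []
    for op in instructions:
        if op.startswith("ldloc."):
            # push the value of local variable N onto the stack
            stack.append(local_vars[int(op.split(".")[1])])
        elif op.startswith("stloc."):
            # pop the top of the stack into local variable N
            local_vars[int(op.split(".")[1])] = stack.pop()
        elif op == "add":
            # pop two operands, push their sum
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return local_vars, stack


# Local 0 plays the role of eax, local 1 the role of edx.
final_locals, final_stack = run(["ldloc.0", "ldloc.1", "add", "stloc.0"], [2, 3])
print(final_locals)  # [5, 3]
```

After the four instructions have run, the evaluation stack is empty again and local variable 0 holds the sum, mirroring the effect of the single x86 instruction.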
CIL is designed to be object-oriented. One may create objects, call methods, and use other types of members, such as fields.
Every method needs (with some exceptions) to reside in a class. So does this static method:
    .class public Foo
    {
        .method public static int32 Add(int32, int32) cil managed
        {
            .maxstack 2
            ldarg.0    // load the first argument;
            ldarg.1    // load the second argument;
            add        // add them;
            ret        // return the result;
        }
    }
The method Add does not require any instance of Foo to be declared because it is declared as static, and it may then be used like this in C#:
    int r = Foo.Add(2, 3);    // 5
In CIL it would look like this:
    ldc.i4.2
    ldc.i4.3
    call int32 Foo::Add(int32, int32)
    stloc.0
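How the call passes values between caller and callee can be sketched as follows. This Python model is a simplification under stated assumptions: the `METHODS` table, the `execute` and `call_site` functions, and the string form of the instructions are all invented for illustration. It shows only that the arguments are taken from the caller's stack and that the return value is pushed back onto it.

```python
# Hypothetical method table: name -> (argument count, method body).
METHODS = {"Foo::Add": (2, ["ldarg.0", "ldarg.1", "add", "ret"])}


def execute(body, args):
    """Run a method body on its own fresh evaluation stack."""
    stack = []
    for op in body:
        if op.startswith("ldarg."):
            stack.append(args[int(op.split(".")[1])])
        elif op == "add":
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif op == "ret":
            return stack.pop()  # the return value is the top of the callee's stack
        else:
            raise ValueError(f"unsupported instruction: {op}")
    raise RuntimeError("method body ended without ret")


def call_site():
    """Mirror the four caller-side instructions shown above."""
    stack, local_vars = [], [None]
    stack.append(2)                                   # ldc.i4.2
    stack.append(3)                                   # ldc.i4.3
    argc, body = METHODS["Foo::Add"]                  # call int32 Foo::Add(int32, int32)
    args = [stack.pop() for _ in range(argc)][::-1]   # pop arguments, restore order
    stack.append(execute(body, args))                 # push the return value
    local_vars[0] = stack.pop()                       # stloc.0
    return local_vars[0]


print(call_site())  # 5
```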
An instance class contains at least one constructor and some instance members. The following class has a set of methods representing actions of a Car object.
    .class public Car
    {
        .method public specialname rtspecialname instance void .ctor(int32, int32) cil managed
        {
            /* Constructor */
        }

        .method public void Move(int32) cil managed { /* Omitting implementation */ }
        .method public void TurnRight() cil managed { /* Omitting implementation */ }
        .method public void TurnLeft() cil managed { /* Omitting implementation */ }
        .method public void Brake() cil managed { /* Omitting implementation */ }
    }
In C# class instances are created like this:
    Car myCar = new Car(1, 4);
    Car yourCar = new Car(1, 3);
And those statements are roughly the same as these instructions in CIL:
    ldc.i4.1
    ldc.i4.4
    newobj instance void Car::.ctor(int32, int32)
    stloc.0    // myCar = new Car(1, 4);
    ldc.i4.1
    ldc.i4.3
    newobj instance void Car::.ctor(int32, int32)
    stloc.1    // yourCar = new Car(1, 3);
Instance methods are invoked in C# as follows:
myCar.Move(3);
As invoked in CIL:
    ldloc.0    // Load the object "myCar" on the stack
    ldc.i4.3
    call instance void Car::Move(int32)
The Common Language Infrastructure (CLI) records information about compiled classes as metadata. Like the type library in the Component Object Model, this enables applications to support and discover the interfaces, classes, types, methods, and fields in the assembly. The process of reading such metadata is called "reflection".
Metadata can be data in the form of "attributes". Attributes can be customized by extending the Attribute class. This is a powerful feature: it lets the creator of a class adorn it with extra information that consumers of the class can use in various meaningful ways, depending on the application domain.
Below is a basic "Hello, world!" program written in CIL assembler. It displays the string "Hello, world!".
    .assembly Hello {}
    .assembly extern mscorlib {}
    .method static void Main()
    {
        .entrypoint
        .maxstack 1
        ldstr "Hello, world!"
        call void [mscorlib]System.Console::WriteLine(string)
        ret
    }
The following code uses a larger number of opcodes.
This code can also be compared with the corresponding code in the article about Java bytecode.
    static void Main(string[] args)
    {
        for (int i = 2; i < 1000; i++)
        {
            for (int j = 2; j < i; j++)
            {
                if (i % j == 0)
                    goto outer;
            }
            Console.WriteLine(i);
            outer:;
        }
    }
In CIL assembler syntax it looks like this:
    .method private hidebysig static void Main(string[] args) cil managed
    {
        .entrypoint
        .maxstack  2
        .locals init (int32 V_0,
                      int32 V_1)

                  ldc.i4.2
                  stloc.0
                  br.s       IL_001f
        IL_0004:  ldc.i4.2
                  stloc.1
                  br.s       IL_0011
        IL_0008:  ldloc.0
                  ldloc.1
                  rem
                  brfalse.s  IL_001b
                  ldloc.1
                  ldc.i4.1
                  add
                  stloc.1
        IL_0011:  ldloc.1
                  ldloc.0
                  blt.s      IL_0008
                  ldloc.0
                  call       void [mscorlib]System.Console::WriteLine(int32)
        IL_001b:  ldloc.0
                  ldc.i4.1
                  add
                  stloc.0
        IL_001f:  ldloc.0
                  ldc.i4     0x3e8
                  blt.s      IL_0004
                  ret
    }
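To follow the control flow of this listing, it can help to run it on a toy interpreter. The Python sketch below is an illustration only: the tuple encoding `(label, opcode, operand)`, the `run` function, and the treatment of the `WriteLine` call as appending to a list are assumptions of this example. Short-form opcodes such as br.s and ldloc.0 are written in their general form for brevity.

```python
PROGRAM = [
    (None,      "ldc.i4",  2),
    (None,      "stloc",   0),
    (None,      "br",      "IL_001f"),
    ("IL_0004", "ldc.i4",  2),
    (None,      "stloc",   1),
    (None,      "br",      "IL_0011"),
    ("IL_0008", "ldloc",   0),
    (None,      "ldloc",   1),
    (None,      "rem",     None),
    (None,      "brfalse", "IL_001b"),
    (None,      "ldloc",   1),
    (None,      "ldc.i4",  1),
    (None,      "add",     None),
    (None,      "stloc",   1),
    ("IL_0011", "ldloc",   1),
    (None,      "ldloc",   0),
    (None,      "blt",     "IL_0008"),
    (None,      "ldloc",   0),
    (None,      "call",    "WriteLine"),
    ("IL_001b", "ldloc",   0),
    (None,      "ldc.i4",  1),
    (None,      "add",     None),
    (None,      "stloc",   0),
    ("IL_001f", "ldloc",   0),
    (None,      "ldc.i4",  0x3E8),
    (None,      "blt",     "IL_0004"),
    (None,      "ret",     None),
]


def run(program):
    """Interpret the program; return the values passed to WriteLine."""
    labels = {label: i for i, (label, _, _) in enumerate(program) if label}
    stack, local_vars, output = [], [0, 0], []
    pc = 0
    while True:
        _, op, arg = program[pc]
        pc += 1
        if op == "ldc.i4":
            stack.append(arg)
        elif op == "ldloc":
            stack.append(local_vars[arg])
        elif op == "stloc":
            local_vars[arg] = stack.pop()
        elif op in ("add", "rem"):
            right, left = stack.pop(), stack.pop()
            # Python's % matches CIL rem here because both operands are positive
            stack.append(left + right if op == "add" else left % right)
        elif op == "br":
            pc = labels[arg]
        elif op == "brfalse":
            if stack.pop() == 0:
                pc = labels[arg]
        elif op == "blt":
            right, left = stack.pop(), stack.pop()
            if left < right:
                pc = labels[arg]
        elif op == "call":
            output.append(stack.pop())  # stands in for Console.WriteLine(int32)
        elif op == "ret":
            return output
        else:
            raise ValueError(f"unsupported instruction: {op}")


primes = run(PROGRAM)
print(primes[:5])  # [2, 3, 5, 7, 11]
```

The interpreter keeps a program counter and resolves labels to instruction indices, whereas real CIL branch instructions carry byte offsets.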
This is just a representation of how CIL looks near the virtual machine (VM) level. When compiled, the methods are stored in tables and the instructions are stored as bytes inside the assembly, which is a Portable Executable (PE) file.
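As a rough illustration of instructions being stored as bytes, the Python sketch below maps a handful of mnemonics to their one-byte encodings. The opcode values are quoted from the ECMA-335 instruction set as recalled here and should be checked against the specification; the `OPCODES` table and the `assemble` function are hypothetical helpers. The sketch handles only single-byte instructions without operands, and ignores method headers and metadata tables.

```python
# One-byte opcode values (per ECMA-335, Partition III; verify before relying on them).
OPCODES = {
    "ldarg.0": 0x02,
    "ldarg.1": 0x03,
    "ldloc.0": 0x06,
    "ldloc.1": 0x07,
    "stloc.0": 0x0A,
    "ret":     0x2A,
    "add":     0x58,
}


def assemble(mnemonics):
    """Encode a sequence of operand-free instructions as raw bytes."""
    return bytes(OPCODES[m] for m in mnemonics)


# Body of the Foo::Add method shown earlier: ldarg.0, ldarg.1, add, ret.
add_body = assemble(["ldarg.0", "ldarg.1", "add", "ret"])
print(add_body.hex(" "))  # 02 03 58 2a
```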
A CIL assembly and instructions are generated by either a compiler or a utility called the IL Assembler (ILAsm) that is shipped with the execution environment.
Assembled CIL can also be disassembled into code again using the IL Disassembler (ILDASM). There are other tools such as .NET Reflector that can decompile CIL into a high-level language (e.g. C# or Visual Basic). This makes CIL a very easy target for reverse engineering, a trait it shares with Java bytecode. However, there are tools that can obfuscate the code so that it is hard to read but still runs.
Just-in-time compilation (JIT) involves turning the bytecode into code immediately executable by the CPU. The conversion is performed gradually during the program's execution. JIT compilation provides environment-specific optimization, runtime type safety, and assembly verification. To accomplish this, the JIT compiler examines the assembly metadata for any illegal accesses and handles violations appropriately.
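The idea of gradual conversion can be sketched in a few lines of Python. This is a toy model, not how any real JIT compiler is implemented: the class `LazyJit`, its method table, and the use of a generated Python lambda as a stand-in for native code are assumptions of the example. It shows only that a method is translated the first time it is called and that later calls reuse the cached result. Verification and optimization are left out.

```python
class LazyJit:
    """Translates each method on its first call and caches the result."""

    def __init__(self, method_bodies):
        self.method_bodies = method_bodies  # name -> (argument count, instructions)
        self.native_code = {}               # name -> compiled Python callable
        self.compile_log = []               # names in the order they were translated

    def _translate(self, name):
        argc, body = self.method_bodies[name]
        exprs = []  # symbolic stack holding Python expression strings
        for op in body:
            if op.startswith("ldarg."):
                exprs.append("a" + op.split(".")[1])
            elif op in ("add", "mul"):
                right, left = exprs.pop(), exprs.pop()
                sign = "+" if op == "add" else "*"
                exprs.append(f"({left} {sign} {right})")
            elif op == "ret":
                params = ", ".join(f"a{i}" for i in range(argc))
                self.compile_log.append(name)
                return eval(f"lambda {params}: {exprs.pop()}")
            else:
                raise ValueError(f"unsupported instruction: {op}")
        raise ValueError("method body has no ret")

    def invoke(self, name, *args):
        if name not in self.native_code:        # first call: translate now
            self.native_code[name] = self._translate(name)
        return self.native_code[name](*args)    # later calls reuse the cache


jit = LazyJit({
    "Add": (2, ["ldarg.0", "ldarg.1", "add", "ret"]),
    "Mul": (2, ["ldarg.0", "ldarg.1", "mul", "ret"]),
})
first = jit.invoke("Add", 2, 3)     # translates Add, then runs it
second = jit.invoke("Add", 10, 20)  # runs the cached translation
print(first, second, jit.compile_log)  # 5 30 ['Add']
```

In this model Mul is never translated because it is never called, which is the saving that per-method just-in-time translation aims for.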
CLI-compatible execution environments also come with the option to do ahead-of-time compilation (AOT) of an assembly, which makes it execute faster by removing the JIT process at runtime.
In the .NET Framework there is a special tool called the Native Image Generator (NGEN) that performs AOT compilation. A different approach to AOT is CoreRT, which allows the compilation of .NET Core code to a single executable with no dependency on a runtime. Mono also has an option to do AOT compilation.
A notable difference from Java's bytecode is that CIL comes with ldind, stind, ldloca, and many call instructions, which are enough for the data and function pointer manipulation needed to compile C/C++ code into CIL.
    class A {
    public:
        virtual void __stdcall meth() {}
    };

    void test_pointer_operations(int param) {
        int k = 0;
        int *ptr = &k;
        *ptr = 1;
        ptr = &param;
        *ptr = 2;
        A a;
        A *ptra = &a;
        ptra->meth();
    }
The corresponding code in CIL can be rendered as this:
    .method assembly static void modopt([mscorlib]System.Runtime.CompilerServices.CallConvCdecl)
            test_pointer_operations(int32 param) cil managed
    {
      .vtentry 1 : 1
      // Code size       44 (0x2c)
      .maxstack  2
      .locals ([0] int32* ptr,
               [1] valuetype A* V_1,
               [2] valuetype A* a,
               [3] int32 k)
    // k = 0;
      IL_0000:  ldc.i4.0
      IL_0001:  stloc.3
    // ptr = &k;
      IL_0002:  ldloca.s   k        // load local's address instruction
      IL_0004:  stloc.0
    // *ptr = 1;
      IL_0005:  ldloc.0
      IL_0006:  ldc.i4.1
      IL_0007:  stind.i4            // indirection instruction
    // ptr = &param
      IL_0008:  ldarga.s   param    // load parameter's address instruction
      IL_000a:  stloc.0
    // *ptr = 2
      IL_000b:  ldloc.0
      IL_000c:  ldc.i4.2
      IL_000d:  stind.i4
    // a = new A;
      IL_000e:  ldloca.s   a
      IL_0010:  call       valuetype A* modopt([mscorlib]System.Runtime.CompilerServices.CallConvThiscall) 'A.{ctor}'(valuetype A* modopt([mscorlib]System.Runtime.CompilerServices.IsConst) modopt([mscorlib]System.Runtime.CompilerServices.IsConst))
      IL_0015:  pop
    // ptra = &a;
      IL_0016:  ldloca.s   a
      IL_0018:  stloc.1
    // ptra->meth();
      IL_0019:  ldloc.1
      IL_001a:  dup
      IL_001b:  ldind.i4            // reading the VMT for virtual call
      IL_001c:  ldind.i4
      IL_001d:  calli      unmanaged stdcall void modopt([mscorlib]System.Runtime.CompilerServices.CallConvStdcall)(native int)
      IL_0022:  ret
    } // end of method 'Global Functions'::test_pointer_operations
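The address-taking and indirection instructions in the first part of this listing can be modelled with a flat array of memory cells. The Python sketch below is an illustration under simplifying assumptions: addresses are list indices, every cell holds one integer, and the names `run`, `slots`, and `memory` are invented for the example. It covers only the integer pointer operations, not the constructor call or the virtual call through calli.

```python
def run(instructions, memory, slots):
    """memory: flat list of int cells; slots: name -> address of a local or parameter."""
    stack = []
    for op, arg in instructions:
        if op == "ldc.i4":
            stack.append(arg)
        elif op in ("ldloca", "ldarga"):
            stack.append(slots[arg])           # push the address, not the value
        elif op == "ldloc":
            stack.append(memory[slots[arg]])   # push the value stored in the slot
        elif op == "stloc":
            memory[slots[arg]] = stack.pop()
        elif op == "stind.i4":
            value, address = stack.pop(), stack.pop()
            memory[address] = value            # store through the pointer
        elif op == "ldind.i4":
            stack.append(memory[stack.pop()])  # load through the pointer
        else:
            raise ValueError(f"unsupported instruction: {op}")
    return memory


slots = {"param": 0, "k": 1, "ptr": 2}
memory = [7, 99, 0]  # param = 7 on entry; k and ptr are uninitialised placeholders

program = [
    ("ldc.i4", 0), ("stloc", "k"),                          # k = 0;
    ("ldloca", "k"), ("stloc", "ptr"),                      # ptr = &k;
    ("ldloc", "ptr"), ("ldc.i4", 1), ("stind.i4", None),    # *ptr = 1;
    ("ldarga", "param"), ("stloc", "ptr"),                  # ptr = &param;
    ("ldloc", "ptr"), ("ldc.i4", 2), ("stind.i4", None),    # *ptr = 2;
]

result = run(program, memory, slots)
print(result)  # [2, 1, 0]
```

At the end, param holds 2, k holds 1, and ptr holds the address of param, which matches the effect of the C++ source.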
CIL: ... When we compile [a] .NET project, it [is] not directly converted to binary code but to the intermediate language. When a project is run, every language of .NET programming is converted into binary code into CIL. Only some part of CIL that is required at run time is converted into binary code. DLL and EXE of .NET are also in CIL form.