Bytecode

Last updated

Bytecode (also called portable code or p-code) is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable [1] source code, bytecodes are compact numeric codes, constants, and references (normally numeric addresses) that encode the result of compiler parsing and performing semantic analysis of things like type, scope, and nesting depths of program objects.

Contents

The name bytecode stems from instruction sets that have one-byte opcodes followed by optional parameters. Intermediate representations such as bytecode may be output by programming language implementations to ease interpretation, or it may be used to reduce hardware and operating system dependence by allowing the same code to run cross-platform, on different devices. Bytecode may often be either directly executed on a virtual machine (a p-code machine, i.e., interpreter), or it may be further compiled into machine code for better performance.

Since bytecode instructions are processed by software, they may be arbitrarily complex, but are nonetheless often akin to traditional hardware instructions: virtual stack machines are the most common, but virtual register machines have been built also. [2] [3] Different parts may often be stored in separate files, similar to object modules, but dynamically loaded during execution.

Execution

A bytecode program may be executed by parsing and directly executing the instructions, one at a time. This kind of bytecode interpreter is very portable. Some systems, called dynamic translators, or just-in-time (JIT) compilers, translate bytecode into machine code as necessary at runtime. This makes the virtual machine hardware-specific but does not lose the portability of the bytecode. For example, Java and Smalltalk code is typically stored in bytecode format, which is typically then JIT compiled to translate the bytecode to machine code before execution. This introduces a delay before a program is run, when the bytecode is compiled to native machine code, but improves execution speed considerably compared to interpreting source code directly, normally by around an order of magnitude (10x). [4]

Because of its performance advantage, today many language implementations execute a program in two phases, first compiling the source code into bytecode, and then passing the bytecode to the virtual machine. There are bytecode based virtual machines of this sort for Java, Raku, Python, PHP, [a] Tcl, mawk and Forth (however, Forth is seldom compiled via bytecodes in this way, and its virtual machine is more generic instead). The implementation of Perl and Ruby 1.8 instead work by walking an abstract syntax tree representation derived from the source code.

More recently, the authors of V8 [1] and Dart [7] have challenged the notion that intermediate bytecode is needed for fast and efficient VM implementation. Both of these language implementations currently do direct JIT compiling from source code to machine code with no bytecode intermediary. [8]

Examples

(disassemble'(lambda(x)(printx))); disassembly for (LAMBDA (X)); 2436F6DF:       850500000F22     TEST EAX, [#x220F0000]     ; no-arg-parsing entry point;       E5:       8BD6             MOV EDX, ESI;       E7:       8B05A8F63624     MOV EAX, [#x2436F6A8]      ; #<FDEFINITION object for PRINT>;       ED:       B904000000       MOV ECX, 4;       F2:       FF7504           PUSH DWORD PTR [EBP+4];       F5:       FF6005           JMP DWORD PTR [EAX+5];       F8:       CC0A             BREAK 10                   ; error trap;       FA:       02               BYTE #X02;       FB:       18               BYTE #X18                  ; INVALID-ARG-COUNT-ERROR;       FC:       4F               BYTE #X4F                  ; ECX
Compiled code can be analysed and investigated using a built-in tool for debugging the low-level bytecode. The tool can be initialized from the shell, for example:
>>> importdis# "dis" - Disassembler of Python byte code into mnemonics.>>> dis.dis('print("Hello, World!")')  1           0 LOAD_NAME                0 (print)              2 LOAD_CONST               0 ('Hello, World!')              4 CALL_FUNCTION            1              6 RETURN_VALUE

See also

Notes

  1. PHP has just-in-time compilation in PHP 8, [5] [6] and before while not on in the default version, had options like HHVM. For older versions of PHP: Although PHP opcodes are generated each time the program is launched, and are always interpreted and not just-in-time compiled.

Related Research Articles

<span class="mw-page-title-main">Java virtual machine</span> Virtual machine that runs Java programs

A Java virtual machine (JVM) is a virtual machine that enables a computer to run Java programs as well as programs written in other languages that are also compiled to Java bytecode. The JVM is detailed by a specification that formally describes what is required in a JVM implementation. Having a specification ensures interoperability of Java programs across different implementations so that program authors using the Java Development Kit (JDK) need not worry about idiosyncrasies of the underlying hardware platform.

In computer programming, a P-code machine is a virtual machine designed to execute P-code, the assembly language or machine code of a hypothetical central processing unit (CPU). The term "P-code machine" is applied generically to all such machines, as well as specific implementations using those machines. One of the most notable uses of P-Code machines is the P-Machine of the Pascal-P system. The developers of the UCSD Pascal implementation within this system construed the P in P-code to mean pseudo more often than portable; they adopted a unique label for pseudo-code meaning instructions for a pseudo-machine.

Common Intermediate Language (CIL), formerly called Microsoft Intermediate Language (MSIL) or Intermediate Language (IL), is the intermediate language binary instruction set defined within the Common Language Infrastructure (CLI) specification. CIL instructions are executed by a CIL-compatible runtime environment such as the Common Language Runtime. Languages which target the CLI compile to CIL. CIL is object-oriented, stack-based bytecode. Runtimes typically just-in-time compile CIL instructions into native code.

<span class="mw-page-title-main">Interpreter (computing)</span> Program that executes source code without a separate compilation step

In computer science, an interpreter is a computer program that directly executes instructions written in a programming or scripting language, without requiring them previously to have been compiled into a machine language program. An interpreter generally uses one of the following strategies for program execution:

  1. Parse the source code and perform its behavior directly;
  2. Translate source code into some efficient intermediate representation or object code and immediately execute that;
  3. Explicitly execute stored precompiled bytecode made by a compiler and matched with the interpreter's virtual machine.

In computer science, dynamic recompilation is a feature of some emulators and virtual machines, where the system may recompile some part of a program during execution. By compiling during execution, the system can tailor the generated code to reflect the program's run-time environment, and potentially produce more efficient code by exploiting information that is not available to a traditional static compiler.

In computing, just-in-time (JIT) compilation is compilation during execution of a program rather than before execution. This may consist of source code translation but is more commonly bytecode translation to machine code, which is then executed directly. A system implementing a JIT compiler typically continuously analyses the code being executed and identifies parts of the code where the speedup gained from compilation or recompilation would outweigh the overhead of compiling that code.

In compiler theory, dead-code elimination is a compiler optimization to remove dead code. Removing such code has several benefits: it shrinks program size, an important consideration in some contexts, it reduces resource usage such as the number of bytes to be transferred and it allows the running program to avoid executing irrelevant operations, which reduces its running time. It can also enable further optimizations by simplifying program structure. Dead code includes code that can never be executed, and code that only affects dead variables, that is, irrelevant to the program.

A foreign function interface (FFI) is a mechanism by which a program written in one programming language can call routines or make use of services written or compiled in another one. An FFI is often used in contexts where calls are made into a binary dynamic-link library.

Application virtualization software refers to both application virtual machines and software responsible for implementing them. Application virtual machines are typically used to allow application bytecode to run portably on many different computer architectures and operating systems. The application is usually run on the computer using an interpreter or just-in-time compilation (JIT). There are often several implementations of a given virtual machine, each covering a different set of functions.

In computer programming, a programming language implementation is a system for executing computer programs. There are two general approaches to programming language implementation:

In computer science, ahead-of-time compilation is the act of compiling an (often) higher-level programming language into an (often) lower-level language before execution of a program, usually at build-time, to reduce the amount of work needed to be performed at run time.

Dalvik is a discontinued process virtual machine (VM) in the Android operating system that executes applications written for Android. Dalvik was an integral part of the Android software stack in the Android versions 4.4 "KitKat" and earlier, which were commonly used on mobile devices such as mobile phones and tablet computers, and more in some devices such as smart TVs and wearables. Dalvik is open-source software, originally written by Dan Bornstein, who named it after the fishing village of Dalvík in Eyjafjörður, Iceland.

Tracing just-in-time compilation is a technique used by virtual machines to optimize the execution of a program at runtime. This is done by recording a linear sequence of frequently executed operations, compiling them to native machine code and executing them. This is opposed to traditional just-in-time (JIT) compilers that work on a per-method basis.

Java bytecode is the instruction set of the Java virtual machine (JVM), the language to which Java and other JVM-compatible source code is compiled. Each instruction is represented by a single byte, hence the name bytecode, making it a compact form of data.

<span class="mw-page-title-main">GraalVM</span> Virtual machine software

GraalVM is a Java Development Kit (JDK) written in Java. The open-source distribution of GraalVM is based on OpenJDK, and the enterprise distribution is based on Oracle JDK. As well as just-in-time (JIT) compilation, GraalVM can compile a Java application ahead of time. This allows for faster initialization, greater runtime performance, and decreased resource consumption, but the resulting executable can only run on the platform it was compiled for.

HipHop Virtual Machine (HHVM) is an open-source virtual machine based on just-in-time (JIT) compilation that serves as an execution engine for the Hack programming language. By using the principle of JIT compilation, Hack code is first transformed into intermediate HipHop bytecode (HHBC), which is then dynamically translated into x86-64 machine code, optimized, and natively executed. This contrasts with PHP's usual interpreted execution, in which the Zend Engine transforms PHP source code into opcodes that serve as a form of bytecode, and executes the opcodes directly on the Zend Engine's virtual CPU.

A high-level language computer architecture (HLLCA) is a computer architecture designed to be targeted by a specific high-level programming language (HLL), rather than the architecture being dictated by hardware considerations. It is accordingly also termed language-directed computer design, coined in McKeeman (1967) and primarily used in the 1960s and 1970s. HLLCAs were popular in the 1960s and 1970s, but largely disappeared in the 1980s. This followed the dramatic failure of the Intel 432 (1981) and the emergence of optimizing compilers and reduced instruction set computer (RISC) architectures and RISC-like complex instruction set computer (CISC) architectures, and the later development of just-in-time compilation (JIT) for HLLs. A detailed survey and critique can be found in Ditzel & Patterson (1980).

References

  1. 1 2 "Dynamic Machine Code Generation". Google Inc. Archived from the original on 2017-03-05. Retrieved 2024-12-01.
  2. "The Implementation of Lua 5.0". (NB. This involves a register-based virtual machine.)
  3. "Dalvik VM". Archived from the original on 2013-05-18. Retrieved 2012-10-29. (NB. This VM is register based.)
  4. "Byte Code Vs Machine Code". www.allaboutcomputing.net. Retrieved 2017-10-23.
  5. O’Phinney, Matthew Weier. "Exploring the New PHP JIT Compiler". Zend by Perforce. Retrieved 2021-02-19.
  6. "PHP 8: The JIT - stitcher.io". stitcher.io. Retrieved 2021-02-19.
  7. Loitsch, Florian. "Why Not a Bytecode VM?". Google. Archived from the original on 2013-05-12.
  8. "JavaScript myth: JavaScript needs a standard bytecode". 2ality.com.
  9. G., Adam Y. (2022-07-11). "Berkeley Pascal". GitHub . Retrieved 2022-01-08.
  10. "CLHS: Function DISASSEMBLE". www.lispworks.com.
  11. Collective (2023-12-13). "The Common Lisp Cookbook – Performance Tuning and Tips". lispcookbook.github.io.
  12. "The Implementation of the Icon Programming Language" (PDF). Archived from the original (PDF) on 2016-03-05. Retrieved 2011-09-09.
  13. "The Implementation of Icon and Unicon a Compendium" (PDF). Archived (PDF) from the original on 2022-10-09.
  14. Paul, Matthias R. (2001-12-30). "KEYBOARD.SYS internal structure". Newsgroup:  comp.os.msdos.programmer. Archived from the original on 2017-09-09. Retrieved 2016-09-17. […] In fact, the format is basically the same in MS-DOS 3.3 - 8.0, PC DOS 3.3 - 2000, including Russian, Lithuanian, Chinese and Japanese issues, as well as in Windows NT, 2000, and XP […]. There are minor differences and incompatibilities, but the general format has not changed over the years. […] Some of the data entries contain normal tables […] However, most entries contain executable code interpreted by some kind of p-code interpreter at *runtime*, including conditional branches and the like. This is why the KEYB driver has such a huge memory footprint compared to table-driven keyboard drivers which can be done in 3 - 4 Kb getting the same level of function except for the interpreter. […]
  15. Mendelson, Edward (2001-07-20). "How to Display the Euro in MS-DOS and Windows DOS". Display the euro symbol in full-screen MS-DOS (including Windows 95 or Windows 98 full-screen DOS). Archived from the original on 2016-09-17. Retrieved 2016-09-17. […] Matthias [R.] Paul […] warns that the IBM PC DOS version of the keyboard driver uses some internal procedures that are not recognized by the Microsoft driver, so, if possible, you should use the IBM versions of both KEYB.COM and KEYBOARD.SYS instead of mixing Microsoft and IBM versions […] (NB. What is meant by "procedures" here are some additional bytecodes in the IBM KEYBOARD.SYS file not supported by the Microsoft version of the KEYB driver.)
  16. "United States Patent 6,973,644". Archived from the original on 2017-03-05. Retrieved 2009-05-21.
  17. Microsoft C Pcode Specifications. p. 13. Multiplan wasn't compiled to machine code, but to a kind of byte-code which was run by an interpreter, in order to make Multiplan portable across the widely varying hardware of the time. This byte-code distinguished between the machine-specific floating point format to calculate on, and an external (standard) format, which was binary coded decimal (BCD). The PACK and UNPACK instructions converted between the two.
  18. "R Installation and Administration". cran.r-project.org.
  19. "The SQLite Bytecode Engine". Archived from the original on 2017-04-14. Retrieved 2016-08-29.