Perl virtual machine

Last updated

The Perl virtual machine is a stack-based process virtual machine implemented as an opcodes interpreter which runs previously compiled programs written in the Perl language. The opcodes interpreter is a part of the Perl interpreter, which also contains a compiler (lexer, parser and optimizer) in one executable file, commonly /usr/bin/perl on various Unix-like systems or perl.exe on Microsoft Windows systems.

Contents

Implementation

Opcodes

The Perl compiler outputs a compiled program into memory as an internal structure which can be represented as a tree graph in which each node represents an opcode. Opcodes are represented internally by typedefs. Each opcode has next / other and first / sibling pointers, so the opcode tree can be drawn as a basic OP tree starting from root node or as flat OP list in the order they would normally execute from start node. Opcodes tree can be mapped to the source code, so it is possible to decompile to high-level source code. [1]

Perl's opcodes interpreter is implemented as a tree walker which travels the opcode tree in execution order from the start node, following the next or other pointers. Each opcode has a function pointer to a pp_opname function, i.e. the say opcode calls the pp_say function of internal Perl API.

The phase of compiling a Perl program is hidden from the end user, but it can be exposed with the B Perl module [2] or other specialized modules, as the B::Concise Perl module. [3]

An example of a simple compiled Hello world program dumped in execute order (with the B::Concise Perl module):

$ perl -MO=Concise,-exec -E 'say "Hello, world!"'1  <0> enter2  <;> nextstate(main 46 -e:1) v:%,{3  <0> pushmark s4  <$> const[PV "Hello, world!"] s5  <@> say vK6  <@> leave[1 ref] vKP/REFC

Some opcodes (entereval, dofile, require) call Perl compiler functions which in turn generate other opcodes in the same Perl virtual machine.

Variables

Perl variables can be global, dynamic (local keyword), or lexical (my and our keywords).

Global variables are accessible via the stash and the corresponding typeglob.

Local variables are the same as global variables but a special opcode is generated to save its value on the savestack and restore it later.

Lexical variables are stored in the padlist.

Data structures

Perl VM data structures are represented internally by typedefs.

The internal data structures can be examined with the B Perl module [2] or other specialized tools like the Devel::Peek Perl module. [4]

data types

Perl has three typedefs that handle Perl's three main data types: Scalar Value (SV), Array Value (AV), Hash Value (HV). Perl uses a special typedef for the simple signed integer type (IV), unsigned integers (UV), floating point numbers (NV) and strings (PV).

Perl uses a reference count-driven garbage collection mechanism. SVs, AVs, or HVs start their life with a reference count of 1. If the reference count of a data value drops to 0, then it will be destroyed and its memory is made available for reuse.

Other typedefs are Glob Value (GV) which contain named references to various objects, Code Value (CV) which contain a reference to a Perl subroutine, I/O Handler (IO), a reference to regular expression (REGEXP; RV in Perl before 5.11), reference to compiled format for output record (FM) and simple reference which is a special type of scalar that point to other data types (RV).

stash

Special Hash Value is stash, a hash that contains all variables that are defined within a package. Each value in this hash table is a Glob Value (GV).

padlist

Special Array Value is padlist which is an array of array. Its 0th element to an AV containing all lexical variable names (with prefix symbols) used within that subroutine. The padlist's first element points to a scratchpad AV, whose elements contain the values corresponding to the lexical variables named in the 0th row. Another elements of padlist are created when the subroutine recurses or new thread is created.

Stacks

Perl has a number of stacks to store things it is working on.

Argument stack

Arguments are passed to opcode and returned from opcode using the argument stack. The typical way to handle arguments is to pop them off the stack, and then push the result back onto the stack.

Mark stack

This stack saves bookmarks to locations in the argument stack usable by each function so the functions doesn't necessarily get the whole argument stack to itself.

Save stack

This stack is used for saving and restoring values of dynamically scoped local variables.

Scope stack

This stack stores information about the actual scope and it is used only for debugging purposes.

Other implementations

There is no standardization for the Perl language and Perl virtual machine. The internal API is considered non-stable and changes from version to version. The Perl virtual machine is tied closely to the compiler.

The most known and most stable implementation is the B::C Perl module [5] which translates opcodes tree to a representation in the C programming language and adds its own tree walker.

Another implementation is an Acme::Perl::VM Perl module [6] which is an implementation coded in Perl language only but it is still tied with original Perl virtual machine via B:: modules.

See also

Related Research Articles

Common Lisp (CL) is a dialect of the Lisp programming language, published in ANSI standard document ANSI INCITS 226-1994 (S20018). The Common Lisp HyperSpec, a hyperlinked HTML version, has been derived from the ANSI Common Lisp standard.

Perl Interpreted programming language first released in 1987

Perl is a family of two high-level, general-purpose, interpreted, dynamic programming languages. "Perl" refers to Perl 5, but from 2000 to 2019 it also referred to its redesigned "sister language", Perl 6, before the latter's name was officially changed to Raku in October 2019.

Ruby is an interpreted, high-level, general-purpose programming language which supports multiple programming paradigms. It was designed with an emphasis on programming productivity and simplicity. In Ruby, everything is an object, including primitive data types. It was developed in the mid-1990s by Yukihiro "Matz" Matsumoto in Japan.

Lua (programming language) Lightweight programming language

Lua is a lightweight, high-level, multi-paradigm programming language designed primarily for embedded use in applications. Lua is cross-platform, since the interpreter of compiled bytecode is written in ANSI C, and Lua has a relatively simple C API to embed it into applications.

In computer science, threaded code is a programming technique where the code has a form that essentially consists entirely of calls to subroutines. It is often used in compilers, which may generate code in that form or be implemented in that form themselves. The code may be processed by an interpreter or it may simply be a sequence of machine code call instructions.

Interpreter (computing) Program that executes source code without a separate compilation step

In computer science, an interpreter is a computer program that directly executes instructions written in a programming or scripting language, without requiring them previously to have been compiled into a machine language program. An interpreter generally uses one of the following strategies for program execution:

  1. Parse the source code and perform its behavior directly;
  2. Translate source code into some efficient intermediate representation or object code and immediately execute that;
  3. Explicitly execute stored precompiled bytecode made by a compiler and matched with the interpreter Virtual Machine.

In computer programming, the scope of a name binding—an association of a name to an entity, such as a variable—is the part of a program where the name binding is valid, that is where the name can be used to refer to the entity. In other parts of the program the name may refer to a different entity, or to nothing at all. The scope of a name binding is also known as the visibility of an entity, particularly in older or more technical literature—this is from the perspective of the referenced entity, not the referencing name.

In programming languages, a closure, also lexical closure or function closure, is a technique for implementing lexically scoped name binding in a language with first-class functions. Operationally, a closure is a record storing a function together with an environment. The environment is a mapping associating each free variable of the function with the value or reference to which the name was bound when the closure was created. Unlike a plain function, a closure allows the function to access those captured variables through the closure's copies of their values or references, even when the function is invoked outside their scope.

Bytecode, also termed p-code, is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable source code, bytecodes are compact numeric codes, constants, and references that encode the result of compiler parsing and performing semantic analysis of things like type, scope, and nesting depths of program objects.

The Burroughs Large Systems Group produced a family of large 48-bit mainframes using stack machine instruction sets with dense syllables. The first machine in the family was the B5000 in 1961. It was optimized for compiling ALGOL 60 programs extremely well, using single-pass compilers. It evolved into the B5500. Subsequent major redesigns include the B6500/B6700 line and its successors, as well as the separate B8500 line.

In computer science, computer engineering and programming language implementations, a stack machine is a computer processor or a virtual machine in which the primary interaction is moving short-lived temporary values to and from a push down stack. In the case of a hardware processor, a hardware stack is used. The use of a stack significantly reduces the required number of processor registers. Stack machines extend push-down automaton with additional load/store operations or multiple stacks and hence are Turing-complete.

In computer science, a tail call is a subroutine call performed as the final action of a procedure. If the target of a tail is the same subroutine, the subroutine is said to be tail recursive, which is a special case of direct recursion. Tail recursion is particularly useful, and often easy to handle optimization in implementations.

Raku (programming language) Programming language derived from Perl

Raku is a member of the Perl family of programming languages. Formerly known as Perl 6, it was renamed in October 2019. Raku introduces elements of many modern and historical languages. Compatibility with Perl was not a goal, though a compatibility mode is part of the specification. The design process for Raku began in 2000.

In computer science, a call stack is a stack data structure that stores information about the active subroutines of a computer program. This kind of stack is also known as an execution stack, program stack, control stack, run-time stack, or machine stack, and is often shortened to just "the stack". Although maintenance of the call stack is important for the proper functioning of most software, the details are normally hidden and automatic in high-level programming languages. Many computer instruction sets provide special instructions for manipulating stacks.

In computer programming, an automatic variable is a local variable which is allocated and deallocated automatically when program flow enters and leaves the variable's scope. The scope is the lexical context, particularly the function or block in which a variable is defined. Local data is typically invisible outside the function or lexical context where it is defined. Local data is also invisible and inaccessible to a called function, but is not deallocated, coming back in scope as the execution thread returns to the caller.

Haxe is an open source high-level cross-platform programming language and compiler that can produce applications and source code, for many different computing platforms from one code-base. It is free and open-source software, released under the MIT License. The compiler, written in OCaml, is released under the GNU General Public License (GPL) version 2.

This article compares a large number of programming languages by tabulating their data types, their expression, statement, and declaration syntax, and some common operating-system interfaces.

The structure of the Perl programming language encompasses both the syntactical rules of the language and the general ways in which programs are organized. Perl's design philosophy is expressed in the commonly cited motto "there's more than one way to do it". As a multi-paradigm, dynamically typed language, Perl allows a great degree of flexibility in program design. Perl also encourages modularization; this has been attributed to the component-based design structure of its Unix roots, and is responsible for the size of the CPAN archive, a community-maintained repository of more than 100,000 modules.

In computer programming, a subroutine is a sequence of program instructions that performs a specific task, packaged as a unit. This unit can then be used in programs wherever that particular task should be performed.

Toi is an imperative, type-sensitive language that provides the basic functionality of a programming language. The language was designed and developed from the ground-up by Paul Longtine. Written in C, Toi was created with the intent to be an educational experience and serves as a learning tool for those looking to familiarize themselves with the inner-workings of a programming language.

References

  1. "B::Deparse - Perl compiler backend to produce perl code".
  2. 1 2 "B - The Perl Compiler Backend".
  3. "B::Concise - Walk Perl syntax tree, printing concise info about ops".
  4. "Devel::Peek - A data debugging tool for the XS programmer".
  5. "B::C - Perl compiler's C backend".
  6. "Acme::Perl::VM - A Perl5 Virtual Machine in Pure Perl (APVM)".