Parser Grammar Engine

Last updated

The Parser Grammar Engine (PGE, originally the Parrot Grammar Engine) is a compiler and runtime for Raku rules for the Parrot virtual machine. [1] PGE uses these rules to convert a parsing expression grammar into Parrot bytecode. It is therefore compiling rules into a program, unlike most virtual machines and runtimes, which store regular expressions in a secondary internal format that is then interpreted at runtime by a regular expression engine. The rules format used by PGE can express any regular expression and most formal grammars, and as such it forms the first link in the compiler chain for all of Parrot's front-end languages.

Contents

When executed, the bytecode generated by PGE will parse text as described in the input rules, generating a parse tree. The parse tree can be manipulated directly, or fed into the next stage of the Parrot compiler toolchain in order to generate an AST from which code generation can occur (if the grammar describes a programming language).

History

Originally named P6GE and written in C, PGE was translated to native Parrot and renamed not long after its initial release in November 2004. Its author is Patrick R. Michaud. [2] PGE was written in order to reduce the amount of work required to implement a compiler on top of Parrot. It was also written to allow Perl 6 to easily self-host, though current Pugs development no longer uses PGE as its primary rules back-end in favor of a native engine called PCR. [3]

Internals

PGE combines three styles of parsing:

The primary form is Raku rules, so a PGE rule might look like this for an addition-only grammar:

rule term   { <number> | \( <expr> \) }  rule number { \d+ }  rule expr   { <term> ( '+' <term> )* } 

The operator precedence parser allows an operator table to be built and used directly in a Perl 6 rule style parser like so:

rule exprisoptable { ... }  rule term   { <number> | \( <expr> \) }  rule number { \d+ }  prototerm:isprecedence('=')              isparsed(&term) {...}  protoinfix:+ islooser('term:') {...} 

This accomplishes the same goal of defining a simple, addition-only grammar, but does so using a combination of a Raku style regex/rules for term and number and a shift-reduce optable for everything else.

Code generation

Though PGE outputs code which will parse the grammar described by a rule, and can be used at run time to handle simple grammars and regular expressions found in code, its primary purpose is for the parsing of high level languages.

The Parrot compiler toolchain is broken into several parts, of which PGE is the first. PGE converts source code to parse trees. The Tree Grammar Engine (TGE) then converts these into Parrot Abstract Syntax Trees (PAST). A second TGE pass then converts a PAST into Parrot Opcode Syntax Trees (POST) which can be directly transformed into executable bytecode.

Pge-overview.svg

Related Research Articles

<span class="mw-page-title-main">Perl</span> Interpreted programming language first released in 1987

Perl is a high-level, general-purpose, interpreted, dynamic programming language. Though Perl is not officially an acronym, there are various backronyms in use, including "Practical Extraction and Reporting Language".

<span class="mw-page-title-main">Regular expression</span> Sequence of characters that forms a search pattern

A regular expression, sometimes referred to as rational expression, is a sequence of characters that specifies a match pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

Rebol is a cross-platform data exchange language and a multi-paradigm dynamic programming language designed by Carl Sassenrath for network communications and distributed computing. It introduces the concept of dialecting: small, optimized, domain-specific languages for code and data, which is also the most notable property of the language according to its designer Carl Sassenrath:

Although it can be used for programming, writing functions, and performing processes, its greatest strength is the ability to easily create domain-specific languages or dialects

GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in Bison syntax, warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.

Parrot is a discontinued register-based process virtual machine designed to run dynamic languages efficiently. It is possible to compile Parrot assembly language and Parrot intermediate representation to Parrot bytecode and execute it. Parrot is free and open-source software.

In computer science, Backus–Naur form is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus–Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.

In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine.

Lexical tokenization is conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is not the same process as the probabilistic tokenization, used for a large language model's data preprocessing, that encodes text into numerical tokens, using byte pair encoding.

Bytecode is a form of instruction set designed for efficient execution by a software interpreter. Unlike human-readable source code, bytecodes are compact numeric codes, constants, and references that encode the result of compiler parsing and performing semantic analysis of things like type, scope, and nesting depths of program objects.

<span class="mw-page-title-main">Oberon-2</span> Programming language

Oberon-2 is an extension of the original Oberon programming language that adds limited reflection and object-oriented programming facilities, open arrays as pointer base types, read-only field export, and reintroduces the FOR loop from Modula-2.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

<span class="mw-page-title-main">Raku (programming language)</span> Programming language derived from Perl

Raku is a member of the Perl family of programming languages. Formerly known as Perl 6, it was renamed in October 2019. Raku introduces elements of many modern and historical languages. Compatibility with Perl was not a goal, though a compatibility mode is part of the specification. The design process for Raku began in 2000.

In computer science, an operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar. For example, most calculators use operator precedence parsers to convert from the human-readable infix notation relying on order of operations to a format that is optimized for evaluation such as Reverse Polish notation (RPN).

Pugs is a compiler and interpreter for the Raku programming language, started on February 1, 2005, by Audrey Tang.

<span class="mw-page-title-main">Syntax (programming languages)</span> Set of rules defining correctly structured programs

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

This is a list of notable lexer generators and parser generators for various language classes.

META II is a domain-specific programming language for writing compilers. It was created in 1963–1964 by Dewey Val Schorre at UCLA. META II uses what Schorre called syntax equations. Its operation is simply explained as:

Each syntax equation is translated into a recursive subroutine which tests the input string for a particular phrase structure, and deletes it if found.

Raku rules are the regular expression, string matching and general-purpose parsing facility of the Raku programming language, and are a core part of the language. Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Raku documentation refers to them exclusively as regexes, distancing the term from the formal definition.

Mirah has been a programming language based on Ruby language syntax, local type inference, hybrid static–dynamic type system, and a pluggable compiler toolchain. Mirah was created by Charles Oliver Nutter to be "a 'Ruby-like' language, probably a subset of Ruby syntax, that [could] compile to solid, fast, idiomatic JVM bytecode." The word mirah refers to the gemstone ruby in the Javanese language, a play on the concept of Ruby in Java.

The structure of the Perl programming language encompasses both the syntactical rules of the language and the general ways in which programs are organized. Perl's design philosophy is expressed in the commonly cited motto "there's more than one way to do it". As a multi-paradigm, dynamically typed language, Perl allows a great degree of flexibility in program design. Perl also encourages modularization; this has been attributed to the component-based design structure of its Unix roots, and is responsible for the size of the CPAN archive, a community-maintained repository of more than 100,000 modules.

References

  1. Michaud, Patrick R. (2004-11-22). "Parrot Grammar Engine (PGE)". Archived from the original on 2005-12-20.
  2. Michaud, Patrick R. (2004-11-08). "First public release of grammar engine".
  3. Agent Zhang (2006-09-17). "PCR replaces PGE in Pugs".