GOLD (parser)

Last updated
GOLD Parsing System
GOLD logo.gif
GOLD Builder v3.4.gif
Developer(s) Devin Cook and 54 other contributors [1]
Final release
5.2.0 / August 18, 2012;9 years ago (2012-08-18)
Operating system Windows
Type LALR parser
License zlib License (free software)
Website goldparser.org   OOjs UI icon edit-ltr-progressive.svg

GOLD is a free parsing system that is designed to support multiple programming languages.

Contents

Design

The system uses a DFA for lexical analysis and the LALR algorithm for parsing. Both of these algorithms are state machines that use tables to determine actions. GOLD is designed around the principle of logically separating the process of generating the LALR and DFA parse tables from the actual implementation of the parsing algorithms themselves. This allows parsers to be implemented in different programming languages while maintaining the same grammars and development process.

The GOLD system consists of three logical components, the "Builder", the "Engine", and a "Compiled Grammar Table" file definition which functions as an intermediary between the Builder and the Engine.

Builder

GOLD Components GOLD flow.png
GOLD Components

The Builder is the primary component and main application of the system. The Builder is used to analyze the syntax of a language (specified as a grammar) and construct LALR and DFA tables. During this process, any ambiguities in the grammar will be reported. This is essentially the same task that is performed by compiler-compilers such as YACC and ANTLR.

Once the LALR and DFA parse tables are successfully constructed, the Builder can save this data into a Compiled Grammar Table file. This allows the information to be reopened later by the Builder or used in one of the Engines. Currently, the Builder component is only available for Windows 32-bit operating systems.

Some of the features of the Builder are:

Compiled Grammar Table file

The Compiled Grammar Table file is used to store table information generated by the Builder.

Engines

Unlike the Builder, which only runs on a single platform, the Engine component is written for a specific programming language and/or development platform. The Engine implements the LALR and DFA algorithms. Since different programming languages use different approaches to designing programs, each implementation of the Engine will vary. As a result, an implementation of the Engine written for Visual Basic 6 will differ greatly from one written for ANSI C.

Currently, Engines for GOLD have been implemented for the following programming languages / platforms. New Engines can be implemented using the source code for the existing Engines as the starting point.

Grammars

GOLD grammars are based directly on Backus–Naur form, regular expressions, and set notation.

The following grammar defines the syntax for a minimal general-purpose programming language called "Simple".

"Name"    = 'Simple' "Author"  = 'Devin Cook' "Version" = '2.1'  "About"   = 'This is a very simple grammar designed for use in examples'  "Case Sensitive" = False  "Start Symbol"   = <Statements>  {String Ch 1} = {Printable} - [''] {String Ch 2} = {Printable} - ["]  Identifier    = {Letter}{AlphaNumeric}*      ! String allows either single or double quotes  StringLiteral = ''  {String Ch 1}* ''               | '"' {String Ch 2}* '"'  NumberLiteral = {Number}+('.'{Number}+)?  Comment Start = '/*' Comment End   = '*/' Comment Line  = '//'  
<Statements>::=<Statements><Statement>                |  <Statement><Statement>::= display <Expression>                |  display <Expression> read ID                |  assign ID '=' <Expression>                |  while <Expression> do <Statements> end                |  if <Expression> then <Statements> end                |  if <Expression> then <Statements> else <Statements> end                 <Expression>::=<Expression> '>'  <Add Exp>                |  <Expression> '<'  <Add Exp>                |  <Expression> '<=' <Add Exp>                |  <Expression> '>=' <Add Exp>                |  <Expression> '==' <Add Exp>                |  <Expression> '<>' <Add Exp>                |  <Add Exp><Add Exp>::=<Add Exp> '+' <Mult Exp>                |  <Add Exp> '-' <Mult Exp>                |  <Add Exp> '&' <Mult Exp>                |  <Mult Exp><Mult Exp>::=<Mult Exp> '*' <Negate Exp>                |  <Mult Exp> '/' <Negate Exp>                |  <Negate Exp><Negate Exp>::= '-' <Value>                |  <Value><Value>::= Identifier                |  StringLiteral                |  NumberLiteral                |  '(' <Expression> ')' 

Development overview

GOLD Builder Application GOLD screenshot.gif
GOLD Builder Application

The first step consists of writing and testing a grammar for the language being parsed. The grammar can be written using any text editor - such as Notepad or the editor that is built into the Builder. At this stage, no coding is required.

Once the grammar is complete, it is analyzed by the Builder, the LALR and DFA parse tables are constructed, and any ambiguities or problems with the grammar are reported. Afterwards, the tables are saved to a Compiled Grammar Table file to be used later by a parsing engine. At this point, the GOLD Parser Builder is no longer needed.

In the final stage, the tables are read by an Engine. At this point, the development process is dependent on the selected implementation language.

Related Research Articles

In computer science, an LALR parser or Look-Ahead LR parser is a simplified version of a canonical LR parser, to parse a text according to a set of production rules specified by a formal grammar for a computer language.

In computer science, LR parsers are a type of bottom-up parser that analyse deterministic context-free languages in linear time. There are several variants of LR parsers: SLR parsers, LALR parsers, Canonical LR(1) parsers, Minimal LR(1) parsers, and GLR parsers. LR parsers can be generated by a parser generator from a formal grammar defining the syntax of the language to be parsed. They are widely used for the processing of computer languages.

Regular expression Sequence of characters that forms a search pattern

A regular expression is a sequence of characters that specifies a search pattern in text. Usually such patterns are used by string-searching algorithms for "find" or "find and replace" operations on strings, or for input validation. Regular expression techniques are developed in theoretical computer science and formal language theory.

Yacc is a computer program for the Unix operating system developed by Stephen C. Johnson. It is a Look Ahead Left-to-Right Rightmost Derivation (LALR) parser generator, generating a LALR parser based on a formal grammar, written in a notation similar to Backus–Naur Form (BNF). Yacc is supplied as a standard utility on BSD and AT&T Unix. GNU-based Linux distributions include Bison, a forward-compatible Yacc replacement.

GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in the BNF notation, warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.

In computer science, Backus–Naur form or Backus normal form (BNF) is a metasyntax notation for context-free grammars, often used to describe the syntax of languages used in computing, such as computer programming languages, document formats, instruction sets and communication protocols. They are applied wherever exact descriptions of languages are needed: for instance, in official language specifications, in manuals, and in textbooks on programming language theory.

In computer science, a recursive descent parser is a kind of top-down parser built from a set of mutually recursive procedures where each such procedure implements one of the nonterminals of the grammar. Thus the structure of the resulting program closely mirrors that of the grammar it recognizes.

In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine.

In computer science, a Simple LR or SLR parser is a type of LR parser with small parse tables and a relatively simple parser generator algorithm. As with other types of LR(1) parser, an SLR parser is quite efficient at finding the single correct bottom-up parse in a single left-to-right scan over the input stream, without guesswork or backtracking. The parser is mechanically generated from a formal grammar for the language.

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters into a sequence of lexical tokens. A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, although scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.

Alfred Vaino Aho is a Canadian computer scientist best known for his work on programming languages, compilers, and related algorithms, and his textbooks on the art and science of computer programming.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

XPL is a programming language based on PL/I, a portable one-pass compiler written in its own language, and a parser generator tool for easily implementing similar compilers for other languages. XPL was designed in 1967 as a way to teach compiler design principles and as starting point for students to build compilers for their own languages.

Berkeley Yacc (byacc) is a Unix parser generator designed to be compatible with Yacc. It was originally written by Robert Corbett and released in 1989. Due to its liberal license and because it was faster than the AT&T Yacc, it quickly became the most popular version of Yacc. It has the advantages of being written in ANSI C89 and being public domain software.

Syntax (programming languages) Set of rules defining correctly structured programs

In computer science, the syntax of a computer language is the set of rules that defines the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

This is a list of notable lexer generators and parser generators for various language classes.

In formal grammar theory, the deterministic context-free grammars (DCFGs) are a proper subset of the context-free grammars. They are the subset of context-free grammars that can be derived from deterministic pushdown automata, and they generate the deterministic context-free languages. DCFGs are always unambiguous, and are an important subclass of unambiguous CFGs; there are non-deterministic unambiguous CFGs, however.

In computer science, SYNTAX is a system used to generate lexical and syntactic analyzers (parsers) for all kinds of context-free grammars (CFGs) as well as some classes of contextual grammars. It has been developed at INRIA (France) for several decades, mostly by Pierre Boullier, but has become free software since 2007 only. SYNTAX is distributed under the CeCILL license.

History of compiler construction Wikimedia history article

In computing, a compiler is a computer program that transforms source code written in a programming language or computer language, into another computer language. The most common reason for transforming source code is to create an executable program.

A lookahead LR parser (LALR) generator is a software tool that reads a BNF grammar and creates an LALR parser which is capable of parsing files written in the computer language defined by the BNF grammar. LALR parsers are desirable because they are very fast and small in comparison to other types of parsers.

References

  1. "Contributors". goldparser.org. Retrieved 28 August 2017.