Scannerless parsing

In computer science, scannerless parsing (also called lexerless parsing) performs tokenization (breaking a stream of characters into words) and parsing (arranging the words into phrases) in a single step, rather than breaking it up into a pipeline of a lexer followed by a parser, with the two stages executing concurrently. A language grammar is scannerless if it uses a single formalism to express both the lexical (word-level) and phrase-level structure of the language.
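
A minimal sketch may make the distinction concrete. The Python fragment below is an illustration written for this article, not code from any particular parser generator: it parses sums of integers directly from the character stream, so the lexical rule for numbers and the phrase rule for sums are both ordinary character-level parsing functions, with no separate tokenizer in between.

    # Illustrative scannerless recursive-descent parser: lexical structure
    # (digits, spaces) and phrase structure (sums) share one formalism,
    # plain character-level parsing functions, with no separate lexer.

    def skip_ws(text, pos):
        while pos < len(text) and text[pos] == ' ':
            pos += 1
        return pos

    def parse_number(text, pos):
        # "Lexical" rule written as ordinary parsing code: one or more digits.
        start = pos
        while pos < len(text) and text[pos].isdigit():
            pos += 1
        if pos == start:
            raise SyntaxError("digit expected at position %d" % start)
        return int(text[start:pos]), pos

    def parse_expr(text, pos=0):
        # "Phrase" rule: expr <- number (ws '+' ws number)*
        value, pos = parse_number(text, pos)
        while True:
            after_ws = skip_ws(text, pos)
            if after_ws < len(text) and text[after_ws] == '+':
                rhs, pos = parse_number(text, skip_ws(text, after_ws + 1))
                value += rhs
            else:
                return value, pos

    print(parse_expr("12 + 3 + 40"))   # (55, 11)

A conventional pipeline would instead first tokenize the characters into NUMBER and '+' tokens and then parse the resulting token stream.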

Dividing processing into a lexer followed by a parser is more modular; scannerless parsing is primarily used when a clear lexer–parser distinction is unneeded or unwanted. Examples of when this is appropriate include TeX, most wiki grammars, makefiles, simple application-specific scripting languages, and Raku.

Related Research Articles

In a computer language, a reserved word is a word that cannot be used as an identifier, such as the name of a variable, function, or label – it is "reserved from use". This is a syntactic definition, and a reserved word may have no user-defined meaning.
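
For example (using Python purely for illustration), "class" is reserved and is therefore rejected when used as a variable name, while an ordinary identifier is accepted:

    # 'class' is a reserved word in Python; 'klass' is an ordinary identifier.
    import keyword

    print(keyword.iskeyword("class"))   # True
    print(keyword.iskeyword("klass"))   # False

    try:
        compile("class = 5", "<example>", "exec")   # reserved word used as a variable name
    except SyntaxError as error:
        print("rejected:", error.msg)               # e.g. "invalid syntax"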

GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in the BNF notation, warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.

In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine.

In computer science, lexical analysis, lexing or tokenization is the process of converting a sequence of characters into a sequence of lexical tokens. A program that performs lexical analysis may be termed a lexer, tokenizer, or scanner, although scanner is also a term for the first stage of a lexer. A lexer is generally combined with a parser, which together analyze the syntax of programming languages, web pages, and so forth.
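
The sketch below, an illustration rather than any particular tool, shows this character-to-token step: regular expressions classify spans of the input and emit a stream of (kind, text) tokens that a parser would then consume. The token names are assumptions made for the example.

    # Illustrative regular-expression lexer producing (kind, text) tokens.
    import re

    TOKEN_SPEC = [
        ("NUMBER", r"\d+"),
        ("IDENT",  r"[A-Za-z_]\w*"),
        ("OP",     r"[+\-*/=]"),
        ("SKIP",   r"\s+"),
    ]
    MASTER_RE = re.compile("|".join(f"(?P<{name}>{pat})" for name, pat in TOKEN_SPEC))

    def tokenize(source):
        for match in MASTER_RE.finditer(source):
            if match.lastgroup != "SKIP":          # drop whitespace
                yield match.lastgroup, match.group()

    print(list(tokenize("x = 10 + y2")))
    # [('IDENT', 'x'), ('OP', '='), ('NUMBER', '10'), ('OP', '+'), ('IDENT', 'y2')]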

Lex is a computer program that generates lexical analyzers.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, whether in natural language, computer languages, or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part (of speech).
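
Continuing the illustration, the sketch below checks an already tokenized input against the grammar expr -> NUMBER ('+' NUMBER)* and builds a parse tree; the token names and tree representation are assumptions made for the example.

    # Illustrative parser over a token stream, building a left-nested parse tree.
    def parse(tokens):
        pos = 0

        def expect(kind):
            nonlocal pos
            if pos >= len(tokens) or tokens[pos][0] != kind:
                raise SyntaxError("%s expected at token %d" % (kind, pos))
            pos += 1
            return tokens[pos - 1]

        tree = expect("NUMBER")
        while pos < len(tokens) and tokens[pos] == ("OP", "+"):
            expect("OP")
            tree = ("+", tree, expect("NUMBER"))
        if pos != len(tokens):
            raise SyntaxError("unexpected trailing tokens")
        return tree

    print(parse([("NUMBER", "1"), ("OP", "+"), ("NUMBER", "2"), ("OP", "+"), ("NUMBER", "3")]))
    # ('+', ('+', ('NUMBER', '1'), ('NUMBER', '2')), ('NUMBER', '3'))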

The dangling else is a problem in programming language grammars and parser generators in which an optional else clause in an if–then(–else) statement makes nested conditionals ambiguous. Formally, the reference context-free grammar of the language is ambiguous, meaning there is more than one correct parse tree.
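
As an illustration, the two readings of "if a then if b then s1 else s2" can be written as parse trees, here represented as nested Python tuples of the form (if, condition, then-branch, else-branch); most language definitions pick the first reading, attaching the else to the nearest unmatched if.

    # Both trees satisfy an ambiguous if-then(-else) grammar for the same input.
    else_binds_to_inner_if = ("if", "a", ("if", "b", "s1", "s2"), None)   # usual resolution
    else_binds_to_outer_if = ("if", "a", ("if", "b", "s1", None), "s2")   # alternative reading

    print(else_binds_to_inner_if)
    print(else_binds_to_outer_if)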

In computer-based language recognition, ANTLR, or ANother Tool for Language Recognition, is a parser generator that uses an LL(*) algorithm for parsing. ANTLR is the successor to the Purdue Compiler Construction Tool Set (PCCTS), first developed in 1989, and is under active development. Its maintainer is Professor Terence Parr of the University of San Francisco.

Syntax (programming languages): the set of rules defining correctly structured programs.

In computer science, the syntax of a computer language is the rules that define the combinations of symbols that are considered to be correctly structured statements or expressions in that language. This applies both to programming languages, where the document represents source code, and to markup languages, where the document represents data.

In computer programming, the lexer hack is a common solution to the problems in parsing ANSI C, due to the reference grammar being context-sensitive. In C, classifying a sequence of characters as a variable name or a type name requires contextual information of the phrase structure, which prevents one from having a context-free lexer.
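
A minimal sketch of the idea follows; the names and the feedback mechanism are illustrative, not the actual C grammar. The parser records typedef names in a table that the lexer consults when classifying identifiers.

    # Illustrative "lexer hack": the lexer asks a symbol table, maintained by
    # the parser, whether an identifier currently names a type.
    typedef_names = set()          # updated by the parser as typedef declarations are parsed

    def classify(word):
        return ("TYPE_NAME" if word in typedef_names else "IDENTIFIER", word)

    # "T * x;" declares x if T is a type, but multiplies T by x if T is a
    # variable; without this feedback the lexer alone cannot tell.
    print(classify("T"))           # ('IDENTIFIER', 'T')
    typedef_names.add("T")         # parser has processed: typedef int T;
    print(classify("T"))           # ('TYPE_NAME', 'T')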

In computer programming and computer science, "maximal munch" or "longest match" is the principle that when creating some construct, as much of the available input as possible should be consumed.
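
For example, a lexer applying this principle reads "x >>= 1" as the single operator ">>=" rather than ">>" followed by "="; the sketch below, with an operator list chosen only for illustration, tries the longest candidates first.

    # Illustrative maximal-munch operator matching: longest candidate wins.
    OPERATORS = [">>=", ">>", ">=", ">", "="]   # listed longest-first

    def next_operator(text, pos):
        for op in OPERATORS:                    # longer operators are tried first
            if text.startswith(op, pos):
                return op
        return None

    print(next_operator("x >>= 1", 2))   # '>>=' (not '>>' or '>')
    print(next_operator("x >= 1", 2))    # '>='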

Raku rules are the regular expression, string matching and general-purpose parsing facility of the Raku programming language, and are a core part of the language. Since Perl's pattern-matching constructs have exceeded the capabilities of formal regular expressions for some time, Raku documentation refers to them exclusively as regexes, distancing the term from the formal definition.

The Syntax Definition Formalism (SDF) is a metasyntax used to define context-free grammars: that is, a formal way to describe formal languages. It can express the entire range of context-free grammars. Its current version is SDF3. A parser and parser generator for SDF specifications are provided as part of the free ASF+SDF Meta Environment. These operate using the SGLR (scannerless generalized LR) parsing algorithm. An SDF parser outputs parse trees or, in the case of ambiguities, parse forests.

In computer science, SYNTAX is a system used to generate lexical and syntactic analyzers (parsers) for all kinds of context-free grammars (CFGs) as well as some classes of contextual grammars. It has been developed at INRIA in France for several decades, mostly by Pierre Boullier, but has only been free software since 2007. SYNTAX is distributed under the CeCILL license.

The Scannerless Boolean Parser is an open-source scannerless GLR parser generator for boolean grammars. It was implemented in the Java programming language and generates Java source code. SBP also integrates with Haskell via LambdaVM.

History of compiler construction

In computing, a compiler is a computer program that transforms source code written in a programming language or computer language into another computer language. The most common reason for transforming source code is to create an executable program.

In computer science, an integer literal is a kind of literal for an integer whose value is directly represented in source code. For example, in the assignment statement x = 1, the string 1 is an integer literal indicating the value 1, while in the statement x = 0x10 the string 0x10 is an integer literal indicating the value 16, which is represented by 10 in hexadecimal.
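
The same values in runnable form, in plain Python only because the paragraph names no particular language:

    x = 1       # decimal integer literal: value 1
    y = 0x10    # hexadecimal integer literal: value 16
    print(y == 16)          # True
    print(int("10", 16))    # 16: the digits "10" read in base 16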

OMeta is a specialized object-oriented programming language for pattern matching, developed by Alessandro Warth and Ian Piumarta in 2007 at the Viewpoints Research Institute. The language is based on parsing expression grammars (PEGs) rather than context-free grammars, with the intent of providing "a natural and convenient way for programmers to implement tokenizers, parsers, visitors, and tree-transformers".

A language workbench is a tool or set of tools that enables software development in the language-oriented programming software development paradigm. A language workbench will typically include tools to support the definition, reuse and composition of domain-specific languages together with their integrated development environment. Language workbenches were introduced and popularized by Martin Fowler in 2005.

In computer programming languages, an identifier is a lexical token that names the language's entities. Some of the kinds of entities an identifier might denote include variables, data types, labels, subroutines, and modules.
