Ragel

Developer(s) Adrian Thurston [1]
Stable release 7.0.4 [2] / 15 February 2021
Preview release 7.0.4 / 16 February 2021
Written in C++
Operating system Unix-like, Windows
Type State machine compiler
License "Ragel 6 remains under GPL v2 [generated code] covered by the MIT (or GPL v2)". [3]
Ragel 7: MIT License
Website www.colm.net/open-source/ragel/

Ragel is a finite-state machine compiler and a parser generator. Initially, Ragel supported output in C, C++, and assembly source code; [4] it was later expanded to support several other languages, including Objective-C, D, Go, Ruby, and Java. [5] Additional language support is in development. [6] It supports the generation of table- or control-flow-driven state machines from regular expressions [7] and/or state charts, and can also build lexical analysers via the longest-match method. Ragel specifically targets text parsing and input validation. [8]


Overview

Ragel supports the generation of table- or control-flow-driven state machines from regular expressions and/or state charts, and can also build lexical analysers via the longest-match method. A unique feature of Ragel is that user actions can be associated with arbitrary state machine transitions using operators that are integrated into the regular expressions. Ragel also supports visualization of the generated machine via Graphviz.
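For illustration, the fragment below is a minimal sketch (the machine name, input string, and action body are invented here, not taken from the article). It uses the @ operator to embed a C action on the transition that completes a pattern; the $ operator would instead run the action on every transition of the pattern:

%%{
    machine greeting;

    # '@' runs the action on the transition that completes 'hello';
    # '$' would run it on every transition (i.e. on each character).
    main := 'hello' @{ printf("matched hello\n"); } '\n';
}%%

A fragment like this is not a standalone program; it is embedded in a host-language source file from which Ragel generates code, as sketched in the Syntax section below.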

[Figure: Graphviz visualization of the Ragel-generated state machine for floating-point number literals]

The above graph represents a state machine that takes user input as a series of bytes representing ASCII characters and control codes. 48..57 is equivalent to the regular expression [0-9] (i.e. any digit), so only sequences beginning with a digit can be recognized. If 10 (line feed) is encountered, the program is done. 46 is the decimal point ('.'), 43 and 45 are the positive and negative signs ('+', '-'), and 69/101 is uppercase/lowercase 'e' (indicating a number in scientific notation). As such, it will properly recognize the following:

2 45 055 78.1 2e5 78.3e12 69.0e-3 3e+3 

but not:

.3 46. -5 3.e2 2e5.1 

For example, .3 is rejected because a number must begin with a digit, 46. and 3.e2 because at least one digit must follow the decimal point, -5 because a sign is only permitted in the exponent, and 2e5.1 because the exponent must be an integer.

Syntax

Ragel's input is a regular expression only in the sense that it describes a regular language; it is usually not written as a single concise regular expression, but is instead broken into multiple named parts, much as in Extended Backus–Naur form. For example, instead of supporting POSIX character classes inside its regex syntax, Ragel provides them as built-in named machines, so that [0-9]+ can equivalently be written digit+. As with conventional parser generators, Ragel allows handling code for productions to be written inline with the syntax. [9] The code yielding the above example from the official website is:

action dgt      { printf("DGT: %c\n", fc); }
action dec      { printf("DEC: .\n"); }
action exp      { printf("EXP: %c\n", fc); }
action exp_sign { printf("SGN: %c\n", fc); }
action number   { /* NUMBER */ }

# A floating-point number literal.
number = (
    [0-9]+ $dgt ( '.' @dec [0-9]+ $dgt )?
    ( [eE] ( [+\-] $exp_sign )? [0-9]+ $exp )?
) %number;

main := ( number '\n' )*;
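This specification is not by itself a complete program: it is embedded in a host-language file, and Ragel expands the write statements into generated code. The following is a minimal sketch of a C host file for the machine above, following the p/pe/cs interface described in the Ragel user guide; the machine name number_parser, the input buffer, and the exit-status check are illustrative choices here, not part of the official example:

#include <stdio.h>
#include <string.h>

%%{
    machine number_parser;

    action dgt      { printf("DGT: %c\n", fc); }
    action dec      { printf("DEC: .\n"); }
    action exp      { printf("EXP: %c\n", fc); }
    action exp_sign { printf("SGN: %c\n", fc); }
    action number   { /* NUMBER */ }

    # A floating-point number literal, as above.
    number = (
        [0-9]+ $dgt ( '.' @dec [0-9]+ $dgt )?
        ( [eE] ( [+\-] $exp_sign )? [0-9]+ $exp )?
    ) %number;

    main := ( number '\n' )*;
}%%

%% write data;

int main(void)
{
    char buf[] = "78.3e12\n";      /* illustrative input */
    char *p = buf;                 /* data pointer, expected by the generated code */
    char *pe = buf + strlen(buf);  /* data end pointer */
    int cs;                        /* current state */

    %% write init;
    %% write exec;

    /* 'write data' emits number_parser_first_final; ending in a state
       at or beyond it means the input was accepted. */
    return cs >= number_parser_first_final ? 0 : 1;
}

Translating and compiling might look like ragel example.rl -o example.c && cc example.c -o example. On the input 78.3e12, the embedded actions print DGT: 7, DGT: 8, DEC: ., DGT: 3, EXP: 1 and EXP: 2.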

See also

Compiler
LR parser
Yacc
GNU Bison
Lexical analysis
Lex
Code generation (compiler)
String literal
Parsing
Flex (lexical analyser generator)
ANTLR
Parsing expression grammar
Syntax (programming languages)
Comparison of parser generators
Lemon (parser generator)
GOLD (parser)
History of compiler construction
OMeta
re2c
RE/flex

References

  1. "Dr. Adrian D. Thurston" at complang.org. Last changed 14 July 2013.
  2. "Release 7.0.4". 15 February 2021. Retrieved 3 March 2021.
  3. "Ragel State Machine Compiler". www.colm.net. Retrieved 2019-11-19.
  4. Adrian D. Thurston. "Parsing Computer Languages with an Automaton Compiled from a Single Regular Expression". In: 11th International Conference on Implementation and Application of Automata (CIAA 2006), Lecture Notes in Computer Science, vol. 4094, pp. 285–286, Taipei, Taiwan, August 2006. Archived 2012-09-07 at the Wayback Machine.
  5. "Ragel User Guide" (PDF). March 2017.
  6. "Additional Target Languages Return to Ragel 7". 18 May 2018.
  7. Liqun Chen, Chris J. Mitchell, Andrew Martin (2009). Trusted Computing: Second International Conference, Trust 2009, Oxford, UK, April 6–8, 2009, Proceedings. p. 111.
  8. Omar Badreddin (2010). "Umple: a model-oriented programming language". In: 32nd ACM/IEEE International Conference on Software Engineering, vol. 2. IEEE, 2010.
  9. "Ragel User Guide" (PDF). March 2017.