Re2c

re2c
Original author(s)	Peter Bumbulis
Initial release	around 1994;30 years ago
Stable release	3.1 / 19 July 2023;14 months ago
Repository	github.com/skvadrik/re2c
Type	Lexical analyzer generator
License	Public domain
Website	re2c.org

Last updated September 23, 2024

re2c is a free and open-source lexer generator for C, C++, Go, and Rust. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper,^[1] re2c was put in public domain and has been since maintained by volunteers.^[3] It is the lexer generator adopted by projects such as PHP,^[4] SpamAssassin,^[5] Ninja build system ^[6] and others. Together with the Lemon parser generator, re2c is used in BRL-CAD.^[7] This combination is also used with STEPcode, an implementation of ISO 10303 standard.^[8]

Philosophy

The main goal of re2c is generating fast lexers:^[1] at least as fast as reasonably optimized C lexers coded by hand. Instead of using traditional table-driven approach, re2c encodes the generated finite state machine directly in the form of conditional jumps and comparisons. The resulting program is faster than its table-driven counterpart^[1] and much easier to debug and understand. Moreover, this approach often results in smaller lexers,^[1] as re2c applies a number of optimizations such as DFA minimization and the construction of tunnel automaton.^[9] Another distinctive feature of re2c is its flexible interface: instead of assuming a fixed program template, re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment. The main idea is that re2c should be a zero-cost abstraction for the programmer: using it should never result in a slower program than the corresponding hand-coded implementation.

Features

Submatch extraction:^[10] re2c supports both POSIX-compliant capturing groups and standalone tags^[11] (with leftmost greedy disambiguation and optional handling of repeated submatch). The implementation is based on the lookahead-TDFA algorithm.^[12]^[13]^[14]

Encoding support:^[15] re2c supports ASCII, UTF-8, UTF-16, UTF-32, UCS-2 and EBCDIC.
Flexible user interface:^[16] the generated code uses a few primitive operations in order to interface with the environment (read input characters, advance to the next input position, etc.); users can redefine these primitives to whatever they need.
Storable state:^[17] re2c supports both pull-model lexers (when lexer runs without interrupts and pulls more input as necessary) and push-model lexers (when lexer is periodically stopped and resumed to parse new chunks of input).
Start conditions:^[18] re2c can generate multiple interrelated lexers, where each lexer is triggered by a certain condition in program.
Self-validation:^[19] re2c has a special mode in which it ignores all used-defined interface code and generates a self-contained skeleton program. Additionally, re2c generates two files: one with the input strings derived from the regular grammar, and one with compressed match results that are used to verify lexer behavior on all inputs. Input strings are generated so that they extensively cover DFA transitions and paths. Data generation happens right after DFA construction and prior to any optimizations, but the lexer itself is fully optimized, so skeleton programs are capable of revealing any errors in optimizations and code generation.
Warnings:^[20] re2c performs static analysis of the program and warns its users about possible deficiencies or bugs, such as undefined control flow, unreachable code, ill-formed escape symbols and potential misuse of the interface primitives.
Debugging. Besides generating human-readable lexers, re2c has a number of options that dump various intermediate representations of the generated lexer, such as NFA, multiple stages of DFA and the resulting program graph in DOT format.^[21]

Syntax

re2c program can contain any number of /*!re2c ... */ blocks. Each block consists of a sequence of rules, definitions and configurations (they can be intermixed, but it is generally better to put configurations first, then definitions and then rules). Rules have the form REGEXP { CODE } or REGEXP := CODE; where REGEXP is a regular expression and CODE is a block of C code. When REGEXP matches the input string, control flow is transferred to the associated CODE. There is one special rule: the default rule with * instead of REGEXP; it is triggered if no other rules matches. re2c has greedy matching semantics: if multiple rules match, the rule that matches longer prefix is preferred; if the conflicting rules match the same prefix, the earlier rule has priority. Definitions have the form NAME = REGEXP; (and also NAME { REGEXP } in Flex compatibility mode). Configurations have the form re2c:CONFIG = VALUE; where CONFIG is the name of the particular configuration and VALUE is a number or a string. For more advanced usage see the official re2c manual.^[22]

Regular expressions

re2c uses the following syntax for regular expressions:

"foo" case-sensitive string literal
'foo' case-insensitive string literal
[a-xyz], [^a-xyz] character class (possibly negated)
. any character except newline
R \ S difference of character classes
R* zero or more occurrences of R
R+ one or more occurrences of R
R? zero or one occurrence of R
R{n} repetition of R exactly n times
R{n,} repetition of R at least n times
R{n,m} repetition of R from n to m times
(R) just R; parentheses are used to override precedence or for POSIX-style submatch
R S concatenation: R followed by S
R | S alternative: R or S
R / S lookahead: R followed by S, but S is not consumed
name the regular expression defined as name (except in Flex compatibility mode)
@stag an s-tag: saves the last input position at which @stag matches in a variable named stag
#mtag an m-tag: saves all input positions at which #mtag matches in a variable named mtag

Character classes and string literals may contain the following escape sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexadecimal escapes \xhh, \uhhhh and \Uhhhhhhhh.

Example

Here is a very simple program in re2c (example.re). It checks that all input arguments are hexadecimal numbers. The code for re2c is enclosed in comments /*!re2c ... */, all the rest is plain C code. See the official re2c website for more complex examples.^[23]

#include<stdio.h>staticintlex(constchar*YYCURSOR){constchar*YYMARKER;/*!re2c        re2c:define:YYCTYPE = char;        re2c:yyfill:enable = 0;        end = "\x00";        hex = "0x" [0-9a-fA-F]+;        *       { printf("err\n"); return 1; }        hex end { printf("hex\n"); return 0; }    */}intmain(intargc,char**argv){for(inti=1;i<argc;++i){lex(argv[i]);}return0;}

Given that, re2c -is -o example.c example.re generates the code below (example.c). The contents of the comment /*!re2c ... */ are substituted with a deterministic finite automaton encoded in the form of conditional jumps and comparisons; the rest of the program is copied verbatim into the output file. There are several code generation options; normally re2c uses switch statements, but it can use nested if statements (as in this example with -s option), or generate bitmaps and jump tables. Which option is better depends on the C compiler; re2c users are encouraged to experiment.

/* Generated by re2c 1.2.1 on Fri Aug 23 21:59:00 2019 */#include<stdio.h>staticintlex(constchar*YYCURSOR){constchar*YYMARKER;{charyych;yych=*YYCURSOR;if(yych=='0')gotoyy4;++YYCURSOR;yy3:{printf("err\n");return1;}yy4:yych=*(YYMARKER=++YYCURSOR);if(yych!='x')gotoyy3;yych=*++YYCURSOR;if(yych>=0x01)gotoyy8;yy6:YYCURSOR=YYMARKER;gotoyy3;yy7:yych=*++YYCURSOR;yy8:if(yych<='@'){if(yych<=0x00)gotoyy9;if(yych<='/')gotoyy6;if(yych<='9')gotoyy7;gotoyy6;}else{if(yych<='F')gotoyy7;if(yych<='`')gotoyy6;if(yych<='f')gotoyy7;gotoyy6;}yy9:++YYCURSOR;{printf("hex\n");return0;}}}intmain(intargc,char**argv){for(inti=1;i<argc;++i){lex(argv[i]);}return0;}

Related Research Articles

The GNU Debugger (GDB) is a portable debugger that runs on many Unix-like systems and works for many programming languages, including Ada, Assembly, C, C++, D, Fortran, Haskell, Go, Objective-C, OpenCL C, Modula-2, Pascal, Rust, and partially others.

GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in Bison syntax, warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

Lexical tokenization is conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is related to the type of tokenization used in large language models (LLMs) but with two differences. First, lexical tokenization is usually based on a lexical grammar, whereas LLM tokenizers are usually probability-based. Second, LLM tokenizers perform a second step that converts the tokens into numerical values.

Lex is a computer program that generates lexical analyzers. It is commonly used with the yacc parser generator and is the standard lexical analyzer generator on many Unix and Unix-like systems. An equivalent tool is specified as part of the POSIX standard.

Flex is a free and open-source software alternative to lex. It is a computer program that generates lexical analyzers . It is frequently used as the lex implementation together with Berkeley Yacc parser generator on BSD-derived operating systems, or together with GNU bison in *BSD ports and in Linux distributions. Unlike Bison, flex is not part of the GNU Project and is not released under the GNU General Public License, although a manual for Flex was produced and published by the Free Software Foundation.

The syntax of the C programming language is the set of rules governing writing of software in C. It is designed to allow for programs that are extremely terse, have a close relationship with the resulting object code, and yet provide relatively high-level data abstraction. C was the first widely successful high-level language for portable operating-system development.

In computer science, a generator is a routine that can be used to control the iteration behaviour of a loop. All generators are also iterators. A generator is very similar to a function that returns an array, in that a generator has parameters, can be called, and generates a sequence of values. However, instead of building an array containing all the values and returning them all at once, a generator yields the values one at a time, which requires less memory and allows the caller to get started processing the first few values immediately. In short, a generator looks like a function but behaves like an iterator.

In computer science, an operator precedence parser is a bottom-up parser that interprets an operator-precedence grammar. For example, most calculators use operator precedence parsers to convert from the human-readable infix notation relying on order of operations to a format that is optimized for evaluation such as Reverse Polish notation (RPN).

In computer programming, an entry point is the place in a program where the execution of a program begins, and where the program has access to command line arguments.

sizeof is a unary operator in the programming languages C and C++. It generates the storage size of an expression or a data type, measured in the number of char-sized units. Consequently, the construct sizeof (char) is guaranteed to be 1. The actual number of bits of type char is specified by the preprocessor macro CHAR_BIT, defined in the standard include file limits.h. On most modern computing platforms this is eight bits. The result of sizeof has an unsigned integer type that is usually denoted by size_t.

In computing, exec is a functionality of an operating system that runs an executable file in the context of an already existing process, replacing the previous executable. This act is also referred to as an overlay. It is especially important in Unix-like systems, although it also exists elsewhere. As no new process is created, the process identifier (PID) does not change, but the machine code, data, heap, and stack of the process are replaced by those of the new program.

In computer programming, a branch table or jump table is a method of transferring program control (branching) to another part of a program using a table of branch or jump instructions. It is a form of multiway branch. The branch table construction is commonly used when programming in assembly language but may also be generated by compilers, especially when implementing optimized switch statements whose values are densely packed together.

In software engineering, a fluent interface is an object-oriented API whose design relies extensively on method chaining. Its goal is to increase code legibility by creating a domain-specific language (DSL). The term was coined in 2005 by Eric Evans and Martin Fowler.

select is a system call and application programming interface (API) in Unix-like and POSIX-compliant operating systems for examining the status of file descriptors of open input/output channels. The select system call is similar to the poll facility introduced in UNIX System V and later operating systems. However, with the c10k problem, both select and poll have been superseded by the likes of kqueue, epoll, /dev/poll and I/O completion ports.

Wt is an open-source widget-centric web framework for the C++ programming language. It has an API resembling that of Qt framework, also using a widget-tree and an event-driven signal/slot system.

Getopt is a C library function used to parse command-line options of the Unix/POSIX style. It is a part of the POSIX specification, and is universal to Unix-like systems. It is also the name of a Unix program for parsing command line arguments in shell scripts.

A code sanitizer is a programming tool that detects bugs in the form of undefined or suspicious behavior by a compiler inserting instrumentation code at runtime. The class of tools was first introduced by Google's AddressSanitizer of 2012, which uses directly mapped shadow memory to detect memory corruption such as buffer overflows or accesses to a dangling pointer (use-after-free).

RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.

In the automata theory, a tagged deterministic finite automaton (TDFA) is an extension of deterministic finite automaton (DFA). In addition to solving the recognition problem for regular languages, TDFA is also capable of submatch extraction and parsing. While canonical DFA can find out if a string belongs to the language defined by a regular expression, TDFA can also extract substrings that match specific subexpressions. More generally, TDFA can identify positions in the input string that match tagged positions in a regular expression.

References

1 2 3 4 5 Bumbulis, Peter; Donald D., Cowan (March–December 1993). "RE2C: a more versatile scanner generator". ACM Letters on Programming Languages and Systems. 2 (1–4): 70–84. doi: 10.1145/176454.176487 . S2CID 14814637.
↑ "Release 3.1". 19 July 2023. Retrieved 3 August 2023.
↑ "Authors, re2c documentation".
↑ "Building PHP". PHP Internals Book. Retrieved 2020-07-20.
↑ "SpamAssassin (sa-compile)".
↑ "Ninja: build.ninja". Ninja. Retrieved 2020-07-20.
↑ "BRL-CAD (tools: re2c)".
↑ "Build Process".
↑ Joseph, Grosch (1989). "Efficient Generation of Table-Driven Scanners". Software: Practice and Experience 19: 1089–1103.
↑ "Submatch extraction, re2c documentation".
↑ Ville, Laurikari (2000). "NFAs with tagged transitions, their conversion to deterministic automata and application to regular expressions" (PDF). Seventh International Symposium on String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings.
↑ Ulya, Trofimovich (2017). "Tagged Deterministic Finite Automata with Lookahead". arXiv: 1907.08837 [cs.FL].
↑ Ulya, Trofimovich (2020). "RE2C -- a lexer generator based on lookahead TDFA". Software Impacts. 6: 100027. doi: 10.1016/j.simpa.2020.100027 .
↑ Ulya, Trofimovich (2021). "Lookahead TDFA in pictures (slides)" (PDF).
↑ "Encodings, re2c documentation".
↑ "Program interface, re2c documentation".
↑ "Storable state, re2c documentation".
↑ "Start conditions, re2c documentation".
↑ "Skeleton, re2c documentation".
↑ "Warnings, re2c documentation".
↑ "Visualization, re2c documentation".
↑ "User manual (C), re2c documentation".
↑ "Official website".

External links

Official website

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[re2cpaper-1] 1 2 3 4 5 Bumbulis, Peter; Donald D., Cowan (March–December 1993). "RE2C: a more versatile scanner generator". ACM Letters on Programming Languages and Systems. 2 (1–4): 70–84. doi: 10.1145/176454.176487 . S2CID 14814637.

[wikidata-1bed033dd225c0e4d9f7aa5ffd9189b681eded95-v18-2] "Release 3.1". 19 July 2023. Retrieved 3 August 2023.

[3] "Authors, re2c documentation".

[4] "Building PHP". PHP Internals Book. Retrieved 2020-07-20.

[5] "SpamAssassin (sa-compile)".

[6] "Ninja: build.ninja". Ninja. Retrieved 2020-07-20.

[7] "BRL-CAD (tools: re2c)".

[8] "Build Process".

[gro89paper-9] Joseph, Grosch (1989). "Efficient Generation of Table-Driven Scanners". Software: Practice and Experience 19: 1089–1103.

[10] "Submatch extraction, re2c documentation".

[lau00paper-11] Ville, Laurikari (2000). "NFAs with tagged transitions, their conversion to deterministic automata and application to regular expressions" (PDF). Seventh International Symposium on String Processing and Information Retrieval, 2000. SPIRE 2000. Proceedings.

[tro17paper-12] Ulya, Trofimovich (2017). "Tagged Deterministic Finite Automata with Lookahead". arXiv: 1907.08837 [cs.FL].

[tro20paper-13] Ulya, Trofimovich (2020). "RE2C -- a lexer generator based on lookahead TDFA". Software Impacts. 6: 100027. doi: 10.1016/j.simpa.2020.100027 .

[tro21slides-14] Ulya, Trofimovich (2021). "Lookahead TDFA in pictures (slides)" (PDF).

[15] "Encodings, re2c documentation".

[16] "Program interface, re2c documentation".

[17] "Storable state, re2c documentation".

[18] "Start conditions, re2c documentation".

[19] "Skeleton, re2c documentation".

[20] "Warnings, re2c documentation".

[21] "Visualization, re2c documentation".

[22] "User manual (C), re2c documentation".

[23] "Official website".

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]

[12]

[13]

[14]

[15]

[16]

[17]

[18]

[19]

[20]

[21]

[22]

[23]