Original author(s) | Peter Bumbulis |
---|---|
Initial release | around 1994[1] |
Stable release | 4.0 [2] / 19 November 2024 |
Repository | github |
Type | Lexical analyzer generator |
License | Public domain |
Website | re2c |
re2c is a free and open-source lexer generator for C, C++, D, Go, Haskell, Java, JavaScript, OCaml, Python, Rust, V and Zig. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper, [1] re2c was put in public domain and has been since maintained by volunteers. [3] It is the lexer generator adopted by projects such as PHP, [4] SpamAssassin, [5] Ninja build system [6] and others. Together with the Lemon parser generator, re2c is used in BRL-CAD. [7] This combination is also used with STEPcode, an implementation of ISO 10303 standard. [8]
The main goal of re2c is generating fast lexers: [1] at least as fast as reasonably optimized C lexers coded by hand. Instead of using traditional table-driven approach, re2c encodes the generated finite-state machine directly in the form of conditional jumps and comparisons. The resulting program is faster than its table-driven counterpart [1] and much easier to debug and understand. Moreover, this approach often results in smaller lexers, [1] as re2c applies a number of optimizations such as DFA minimization and the construction of tunnel automaton. [9] Another distinctive feature of re2c is its flexible interface: instead of assuming a fixed program template, re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment. The main idea is that re2c should be a zero-cost abstraction for the programmer: using it should never result in a slower program than the corresponding hand-coded implementation.
re2c program can contain any number of /*!re2c ... */
blocks. Each block consists of a sequence of rules, definitions and configurations (they can be intermixed, but it is generally better to put configurations first, then definitions and then rules). Rules have the form REGEXP { CODE }
or REGEXP := CODE;
where REGEXP
is a regular expression and CODE
is a block of C code. When REGEXP
matches the input string, control flow is transferred to the associated CODE
. There is one special rule: the default rule with *
instead of REGEXP
; it is triggered if no other rules matches. re2c has greedy matching semantics: if multiple rules match, the rule that matches longer prefix is preferred; if the conflicting rules match the same prefix, the earlier rule has priority. Definitions have the form NAME = REGEXP;
(and also NAME { REGEXP }
in Flex compatibility mode). Configurations have the form re2c:CONFIG = VALUE;
where CONFIG
is the name of the particular configuration and VALUE
is a number or a string. For more advanced usage see the official re2c manual. [22]
re2c uses the following syntax for regular expressions:
"foo"
case-sensitive string literal'foo'
case-insensitive string literal[a-xyz]
, [^a-xyz]
character class (possibly negated).
any character except newlineR \ S
difference of character classesR*
zero or more occurrences of R
R+
one or more occurrences of R
R?
zero or one occurrence of R
R{n}
repetition of R
exactly n
timesR{n,}
repetition of R
at least n
timesR{n,m}
repetition of R
from n
to m
times(R)
just R
; parentheses are used to override precedence or for POSIX-style submatchR S
concatenation: R
followed by S
R | S
alternative: R
or S
R / S
lookahead: R
followed by S
, but S
is not consumedname
the regular expression defined as name
(except in Flex compatibility mode)@stag
an s-tag: saves the last input position at which @stag
matches in a variable named stag
#mtag
an m-tag: saves all input positions at which #mtag
matches in a variable named mtag
Character classes and string literals may contain the following escape sequences: \a
, \b
, \f
, \n
, \r
, \t
, \v
, \\
, octal escapes \ooo
and hexadecimal escapes \xhh
, \uhhhh
and \Uhhhhhhhh
.
Here is a very simple program in re2c (example.re). It checks that all input arguments are hexadecimal numbers. The code for re2c is enclosed in comments /*!re2c ... */
, all the rest is plain C code. See the official re2c website for more complex examples. [23]
#include<stdio.h>staticintlex(constchar*YYCURSOR){constchar*YYMARKER;/*!re2c re2c:define:YYCTYPE = char; re2c:yyfill:enable = 0; end = "\x00"; hex = "0x" [0-9a-fA-F]+; * { printf("err\n"); return 1; } hex end { printf("hex\n"); return 0; } */}intmain(intargc,char**argv){for(inti=1;i<argc;++i){lex(argv[i]);}return0;}
Given that, re2c -is -o example.c example.re
generates the code below (example.c). The contents of the comment /*!re2c ... */
are substituted with a deterministic finite automaton encoded in the form of conditional jumps and comparisons; the rest of the program is copied verbatim into the output file. There are several code generation options; normally re2c uses switch
statements, but it can use nested if
statements (as in this example with -s
option), or generate bitmaps and jump tables. Which option is better depends on the C compiler; re2c users are encouraged to experiment.
/* Generated by re2c 1.2.1 on Fri Aug 23 21:59:00 2019 */#include<stdio.h>staticintlex(constchar*YYCURSOR){constchar*YYMARKER;{charyych;yych=*YYCURSOR;if(yych=='0')gotoyy4;++YYCURSOR;yy3:{printf("err\n");return1;}yy4:yych=*(YYMARKER=++YYCURSOR);if(yych!='x')gotoyy3;yych=*++YYCURSOR;if(yych>=0x01)gotoyy8;yy6:YYCURSOR=YYMARKER;gotoyy3;yy7:yych=*++YYCURSOR;yy8:if(yych<='@'){if(yych<=0x00)gotoyy9;if(yych<='/')gotoyy6;if(yych<='9')gotoyy7;gotoyy6;}else{if(yych<='F')gotoyy7;if(yych<='`')gotoyy6;if(yych<='f')gotoyy7;gotoyy6;}yy9:++YYCURSOR;{printf("hex\n");return0;}}}intmain(intargc,char**argv){for(inti=1;i<argc;++i){lex(argv[i]);}return0;}