| re2c | |
|---|---|
| Original author(s) | Peter Bumbulis | 
| Initial release | around 1994 [1] | 
| Stable release | 4.3 [2]     / 30 June 2025  | 
| Repository |  github | 
| Type | Lexical analyzer generator | 
| License | Public domain | 
| Website |  re2c | 
re2c is a free and open-source lexer generator for C, C++, D, Go, Haskell, Java, JavaScript, OCaml, Python, Rust, V and Zig. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper, [1] re2c was put in public domain and has been since maintained by volunteers. [3] It is the lexer generator adopted by projects such as PHP, [4] SpamAssassin, [5] Ninja build system [6] and others. Together with the Lemon parser generator, re2c is used in BRL-CAD. [7] This combination is also used with STEPcode, an implementation of ISO 10303 standard. [8]
The main goal of re2c is generating fast lexers: [1] at least as fast as reasonably optimized C lexers coded by hand. Instead of using traditional table-driven approach, re2c encodes the generated finite-state machine directly in the form of conditional jumps and comparisons. The resulting program is faster than its table-driven counterpart [1] and much easier to debug and understand. Moreover, this approach often results in smaller lexers, [1] as re2c applies a number of optimizations such as DFA minimization and the construction of tunnel automaton. [9] Another distinctive feature of re2c is its flexible interface: instead of assuming a fixed program template, re2c lets the programmer write most of the interface code and adapt the generated lexer to any particular environment. The main idea is that re2c should be a zero-cost abstraction for the programmer: using it should never result in a slower program than the corresponding hand-coded implementation.
re2c program can contain any number of /*!re2c ... */ blocks. Each block consists of a sequence of rules, definitions and configurations (they can be intermixed, but it is generally better to put configurations first, then definitions and then rules). Rules have the form REGEXP { CODE } or REGEXP := CODE; where REGEXP is a regular expression and CODE is a block of C code. When REGEXP matches the input string, control flow is transferred to the associated CODE. There is one special rule: the default rule with * instead of REGEXP; it is triggered if no other rules matches. re2c has greedy matching semantics: if multiple rules match, the rule that matches longer prefix is preferred; if the conflicting rules match the same prefix, the earlier rule has priority. Definitions have the form NAME = REGEXP; (and also NAME { REGEXP } in Flex compatibility mode). Configurations have the form re2c:CONFIG = VALUE; where CONFIG is the name of the particular configuration and VALUE is a number or a string. For more advanced usage see the official re2c manual. [22] 
re2c uses the following syntax for regular expressions:
"foo" case-sensitive string literal'foo' case-insensitive string literal[a-xyz], [^a-xyz] character class (possibly negated). any character except newlineR \ S difference of character classesR* zero or more occurrences of RR+ one or more occurrences of RR? zero or one occurrence of RR{n} repetition of R exactly n timesR{n,} repetition of R at least n timesR{n,m} repetition of R from n to m times(R) just R; parentheses are used to override precedence or for POSIX-style submatchR S concatenation: R followed by SR | S alternative: R or SR / S lookahead: R followed by S, but S is not consumedname the regular expression defined as name (except in Flex compatibility mode)@stag an s-tag: saves the last input position at which @stag matches in a variable named stag#mtag an m-tag: saves all input positions at which #mtag matches in a variable named mtagCharacter classes and string literals may contain the following escape sequences: \a, \b, \f, \n, \r, \t, \v, \\, octal escapes \ooo and hexadecimal escapes \xhh, \uhhhh and \Uhhhhhhhh.
Here is a very simple program in re2c (example.re). It checks that all input arguments are hexadecimal numbers. The code for re2c is enclosed in comments /*!re2c ... */, all the rest is plain C code. See the official re2c website for more complex examples. [23] 
#include<stdio.h>staticintlex(constchar*YYCURSOR){constchar*YYMARKER;/*!re2c        re2c:define:YYCTYPE = char;        re2c:yyfill:enable = 0;        end = "\x00";        hex = "0x" [0-9a-fA-F]+;        *       { printf("err\n"); return 1; }        hex end { printf("hex\n"); return 0; }    */}intmain(intargc,char**argv){for(inti=1;i<argc;++i){lex(argv[i]);}return0;}Given that, re2c -is -o example.c example.re generates the code below (example.c). The contents of the comment /*!re2c ... */ are substituted with a deterministic finite automaton encoded in the form of conditional jumps and comparisons; the rest of the program is copied verbatim into the output file. There are several code generation options; normally re2c uses switch statements, but it can use nested if statements (as in this example with -s option), or generate bitmaps and jump tables. Which option is better depends on the C compiler; re2c users are encouraged to experiment.
/* Generated by re2c 1.2.1 on Fri Aug 23 21:59:00 2019 */#include<stdio.h>staticintlex(constchar*YYCURSOR){constchar*YYMARKER;{charyych;yych=*YYCURSOR;if(yych=='0')gotoyy4;++YYCURSOR;yy3:{printf("err\n");return1;}yy4:yych=*(YYMARKER=++YYCURSOR);if(yych!='x')gotoyy3;yych=*++YYCURSOR;if(yych>=0x01)gotoyy8;yy6:YYCURSOR=YYMARKER;gotoyy3;yy7:yych=*++YYCURSOR;yy8:if(yych<='@'){if(yych<=0x00)gotoyy9;if(yych<='/')gotoyy6;if(yych<='9')gotoyy7;gotoyy6;}else{if(yych<='F')gotoyy7;if(yych<='`')gotoyy6;if(yych<='f')gotoyy7;gotoyy6;}yy9:++YYCURSOR;{printf("hex\n");return0;}}}intmain(intargc,char**argv){for(inti=1;i<argc;++i){lex(argv[i]);}return0;}