Developer(s) | Dr. Robert van Engelen
---|---
Initial release | December 5, 2016
Stable release | 4.0 / February 18, 2024
Written in | C++
Operating system | Cross-platform
Type | Lexical analysis
License | BSD license
Website | https://github.com/Genivia/RE-flex
RE/flex (regex-centric, fast lexical analyzer)[1][2] is a free and open-source computer program written in C++ that generates fast lexical analyzers (also known as "scanners" or "lexers")[3][4] in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers (non-greedy, lazy repeats), and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.
The RE/flex project was designed and implemented by professor Robert van Engelen in 2016 and released as free and open-source software. The software has since evolved with contributions from others. The RE/flex tool generates lexical analyzers based on regular expression ("regex") libraries, instead of the fixed DFA tables generated by traditional lexical analyzer generators.[5]
The RE/flex lexical analyzer generator accepts an extended syntax of Flex lexer specifications as input. The RE/flex specification syntax is more expressive than the traditional Flex specification syntax and may include indentation anchors, word boundaries, lazy quantifiers (non-greedy, lazy repeats), and new actions such as wstr() to retrieve Unicode wide-string matches.
A lexer specification is of the form:

```
Definitions
%%
Rules
%%
User Code
```
The Definitions section includes declarations and customization options, followed by name-pattern pairs that define names for regular expression patterns. Named patterns may be referenced in other patterns by enclosing them in { and }. The following example defines two names for two patterns, where the second pattern number uses the previously named pattern digit:
```
%top{
  #include <inttypes.h>  // strtol()
}
%class{
 public:
  int value;  // yyFlexLexer class public member
}
%init{
  value = 0;  // yyFlexLexer initializations
}
%option flex

digit   [0-9]
number  {digit}+
```
The Rules section defines pattern-action pairs. The following example defines a rule to translate a number to the lexer class integer value member:
```
{number}   value = strtol(yytext, NULL, 10);
\s         // skip white space
.          throw *yytext;
```
The User Code section typically defines C/C++ functions, for example a main program:
```cpp
int main()
{
  yyFlexLexer lexer;
  try
  {
    while (lexer.yylex() != 0)
      std::cout << "number=" << lexer.value << std::endl;
  }
  catch (int ch)
  {
    std::cerr << "Error: unknown character code " << ch << std::endl;
  }
}
```
The yyFlexLexer class is generated by RE/flex as instructed by the %option flex directive in the lexer specification. The generated lex.yy.cpp source code contains the algorithm for lexical analysis, which is linked with the libreflex library.
The generated algorithm for lexical analysis is based on the concept that any regular expression engine can in principle be used to tokenize input into tokens: given a set of n regular expression patterns p1, p2, ..., pn, a regular expression of the form "(p1)|(p2)|...|(pn)" with n alternations may be specified to match and tokenize the input. In this way, the group capture index i of a matching pattern that is returned by the regular expression matcher identifies the pattern pi that matched the input text partially and continuously after the previous match.
This approach makes it possible for any regex library that supports group captures to be used as a matcher. However, all groupings of the form (X) in patterns must be converted to non-capturing groups of the form (?:X) to avoid unwanted group capturing within sub-expressions.
The following RE/flex-generated yyFlexLexer class yylex method repeatedly invokes the matcher's scan (continuous partial matching) operation to tokenize input:
```cpp
int yyFlexLexer::yylex()
{
  if (!has_matcher())
    matcher("(p1)|(p2)|...|(pn)"); // new matcher engine for regex pattern (p1)|(p2)|...|(pn)
  while (true)
  {
    switch (matcher().scan()) // scan and match next token, get capture index
    {
      case 0: // no match
        if (... EOF reached ...)
          return 0;
        output(matcher().input()); // echo the current input character
        break;
      case 1: // pattern p1 matched
        ... // Action for pattern p1
        break;
      case 2: // pattern p2 matched
        ... // Action for pattern p2
        break;
      ... // and so on for patterns up to pn
    }
  }
}
```
If none of the n patterns match and the end-of-file (EOF) is not reached, the so-called "default rule" is invoked. The default rule echoes the current input character and advances the scanner to the next character in the input.
The regular expression pattern "(p1)|(p2)|...|(pn)" is produced by RE/flex from a lexer specification with n rules of pattern-action pairs:
```
%%
p1   Action for pattern p1
p2   Action for pattern p2
...
pn   Action for pattern pn
%%
```
From this specification, RE/flex generates the aforementioned yyFlexLexer class with the yylex() method that executes actions corresponding to the patterns matched in the input. The generated yyFlexLexer class is used in a C++ application, such as a parser, to tokenize the input into the integer-valued tokens returned by the actions in the lexer specification. For example:
```cpp
std::ifstream ifs(filename, std::ios::in);
yyFlexLexer lexer(ifs);
int token;
while ((token = lexer.yylex()) != 0)
  std::cout << "token = " << token << std::endl;
ifs.close();
```
Note that yylex() returns an integer value when an action executes return token_value;. Otherwise, yylex() does not return a value and continues scanning the input, which is often used by rules that ignore input such as comments.
This example tokenizes a file. A lexical analyzer often serves as a tokenizer for a parser generated by a parser generator such as Bison.
RE/flex is compatible with Flex specifications when %option flex is used, which generates a yyFlexLexer class with a yylex() method. RE/flex is also compatible with Bison through a range of RE/flex options that cover Bison's options and features.
By contrast to Flex, RE/flex scanners are thread-safe by default and work with reentrant Bison parsers.
RE/flex supports Unicode regular expression patterns in lexer specifications and automatically tokenizes UTF-8, UTF-16, and UTF-32 input files. Code pages may be specified to tokenize input files encoded in ISO/IEC 8859-1 through 8859-16, Windows-1250 through Windows-1258, CP-437, CP-850, CP-858, MacRoman, KOI-8, EBCDIC, and so on. Normalization to UTF-8 is performed automatically by internal incremental buffering for (partial) pattern matching with Unicode regular expression patterns.
RE/flex integrates indent and dedent matching directly in the regular expression syntax with the new \i and \j anchors. These indentation anchors detect changes in line indentation in the input, which covers many practical scenarios for tokenizing programming languages with indented block structures. For example, the following lexer specification detects and reports indentation changes:
```
%%
^[ \t]+     std::cout << "| ";  // nodent: text is aligned to current indent margin
^[ \t]*\i   std::cout << "> ";  // indent: matched with \i
^[ \t]*\j   std::cout << "< ";  // dedent: matched with \j
\j          std::cout << "< ";  // dedent: for each extra level dedented
%%
```
Lazy quantifiers may be associated with repeats in RE/flex regular expression patterns to simplify expressions using non-greedy repeats, when applicable. Normally matching is "greedy", meaning that the longest possible match is found. For example, the pattern a.*b with the greedy * repeat matches aab, but also matches abab, because .* matches any characters except newline and abab is longer than ab. Using the lazy quantifier ? for the lazy repeat *?, the pattern a.*?b matches ab but not abab.
As a practical application of lazy quantifiers, consider matching C/C++ multiline comments of the form /*...*/. The lexer specification pattern "/*"(.|\n)*?"*/" with the lazy repeat *? matches multiline comments. Without lazy repeats, the more complex pattern "/*"([^*]|(\*+[^*/]))*\*+"/" must be used. (Note that quotation of the form "..." is allowed in lexer specifications only; this construct is comparable to the \Q...\E quotations supported by most regex libraries.)
Besides the built-in RE/flex POSIX regex pattern matcher, RE/flex also supports PCRE2, Boost.Regex and std::regex pattern matching libraries. PCRE2 and Boost.Regex offer a richer regular expression pattern syntax with Perl pattern matching semantics, but are slower due to their intrinsic NFA-based matching algorithm.
Lex, Flex, and RE/flex translate regular expressions to DFAs, which are implemented in tables for run-time scanning. RE/flex differs from Lex and Flex in that the generated tables contain a list of opcode words executed by a virtual machine to perform pattern matching. In addition, a DFA implemented directly in code instead of opcode tables is generated with the --fast option.
For example, the following direct-coded DFA for the pattern \w+ is generated with option --fast:
```cpp
void reflex_code_INITIAL(reflex::Matcher& m)
{
  int c0 = 0, c1 = 0;
  m.FSM_INIT(c1);

S0:
  m.FSM_FIND();
  c1 = m.FSM_CHAR();
  if ('a' <= c1 && c1 <= 'z') goto S5;
  if (c1 == '_') goto S5;
  if ('A' <= c1 && c1 <= 'Z') goto S5;
  if ('0' <= c1 && c1 <= '9') goto S5;
  return m.FSM_HALT(c1);

S5:
  m.FSM_TAKE(1);
  c1 = m.FSM_CHAR();
  if ('a' <= c1 && c1 <= 'z') goto S5;
  if (c1 == '_') goto S5;
  if ('A' <= c1 && c1 <= 'Z') goto S5;
  if ('0' <= c1 && c1 <= '9') goto S5;
  return m.FSM_HALT(c1);
}
```
A list of virtual machine opcode words for the pattern \w+ is generated with option --full:
```cpp
REFLEX_CODE_DECL reflex_code_INITIAL[11] =
{
  0x617A0005, //  0: GOTO 5 ON 'a'-'z'
  0x5F5F0005, //  1: GOTO 5 ON '_'
  0x415A0005, //  2: GOTO 5 ON 'A'-'Z'
  0x30390005, //  3: GOTO 5 ON '0'-'9'
  0x00FFFFFF, //  4: HALT
  0xFE000001, //  5: TAKE 1
  0x617A0005, //  6: GOTO 5 ON 'a'-'z'
  0x5F5F0005, //  7: GOTO 5 ON '_'
  0x415A0005, //  8: GOTO 5 ON 'A'-'Z'
  0x30390005, //  9: GOTO 5 ON '0'-'9'
  0x00FFFFFF, // 10: HALT
};
```
The RE/flex built-in profiler can be used to measure the performance of the generated scanner automatically. The profiler instruments the scanner source code to collect run-time metrics. At run-time when the instrumented scanner terminates, the profiler reports the number of times a rule is matched and the cumulative time consumed by the matching rule. Profiling includes the time spent in the parser when the rule returns control to the parser. This allows for fine-tuning the performance of the generated scanners and parsers. Lexer rules that are hot spots, i.e. computationally expensive, are detected and can be optimized by the user in the lexer source code.
Debugging of the generated scanner is also supported with Flex-compatible options; debug mode outputs annotated lexer rules during scanning.