Developer(s) | Vern Paxson |
---|---|
Initial release | around 1987[1] |
Stable release | 2.6.4 / May 6, 2017 |
Repository | |
Operating system | Unix-like |
Type | Lexical analyzer generator |
License | BSD license |
Website | github |
Flex (fast lexical analyzer generator) is a free and open-source software alternative to lex. [2] It is a computer program that generates lexical analyzers (also known as "scanners" or "lexers"). [3] [4] It is frequently used as the lex implementation together with Berkeley Yacc parser generator on BSD-derived operating systems (as both lex
and yacc
are part of POSIX), [5] [6] [7] or together with GNU bison (a version of yacc) in *BSD ports [8] and in Linux distributions. Unlike Bison, flex is not part of the GNU Project and is not released under the GNU General Public License, [9] although a manual for Flex was produced and published by the Free Software Foundation. [10]
Flex was written in C around 1987 [1] by Vern Paxson, with the help of many ideas and much inspiration from Van Jacobson. Original version by Jef Poskanzer. The fast table representation is a partial implementation of a design done by Van Jacobson. The implementation was done by Kevin Gong and Vern Paxson. [11]
This is an example of a Flex scanner for the instructional programming language PL/0.
The tokens recognized are: '+
', '-
', '*
', '/
', '=
', '(
', ')
', ',
', ';
', '.
', ':=
', '<
', '<=
', '<>
', '>
', '>=
'; numbers: 0-9 {0-9}
; identifiers: a-zA-Z {a-zA-Z0-9}
and keywords: begin
, call
, const
, do
, end
, if
, odd
, procedure
, then
, var
, while
.
%{#include"y.tab.h"%}digit[0-9]letter[a-zA-Z]%%"+"{returnPLUS;}"-"{returnMINUS;}"*"{returnTIMES;}"/"{returnSLASH;}"("{returnLPAREN;}")"{returnRPAREN;}";"{returnSEMICOLON;}","{returnCOMMA;}"."{returnPERIOD;}":="{returnBECOMES;}"="{returnEQL;}"<>"{returnNEQ;}"<"{returnLSS;}">"{returnGTR;}"<="{returnLEQ;}">="{returnGEQ;}"begin"{returnBEGINSYM;}"call"{returnCALLSYM;}"const"{returnCONSTSYM;}"do"{returnDOSYM;}"end"{returnENDSYM;}"if"{returnIFSYM;}"odd"{returnODDSYM;}"procedure"{returnPROCSYM;}"then"{returnTHENSYM;}"var"{returnVARSYM;}"while"{returnWHILESYM;}{letter}({letter}|{digit})*{yylval.id=strdup(yytext);returnIDENT;}{digit}+{yylval.num=atoi(yytext);returnNUMBER;}[\t\n\r]/* skip whitespace */.{printf("Unknown character [%c]\n",yytext[0]);returnUNKNOWN;}%%intyywrap(void){return1;}
These programs perform character parsing and tokenizing via the use of a deterministic finite automaton (DFA). A DFA is a theoretical machine accepting regular languages. These machines are a subset of the collection of Turing machines. DFAs are equivalent to read-only right moving Turing machines. The syntax is based on the use of regular expressions. See also nondeterministic finite automaton.
A Flex lexical analyzer usually has time complexity in the length of the input. That is, it performs a constant number of operations for each input symbol. This constant is quite low: GCC generates 12 instructions for the DFA match loop.[ citation needed ] Note that the constant is independent of the length of the token, the length of the regular expression and the size of the DFA.
However, using the REJECT macro in a scanner with the potential to match extremely long tokens can cause Flex to generate a scanner with non-linear performance. This feature is optional. In this case, the programmer has explicitly told Flex to "go back and try again" after it has already matched some input. This will cause the DFA to backtrack to find other accept states. The REJECT feature is not enabled by default, and because of its performance implications its use is discouraged in the Flex manual. [12]
By default the scanner generated by Flex is not reentrant. This can cause serious problems for programs that use the generated scanner from different threads. To overcome this issue there are options that Flex provides in order to achieve reentrancy. A detailed description of these options can be found in the Flex manual. [13]
Normally the generated scanner contains references to the unistd.h header file, which is Unix specific. To avoid generating code that includes unistd.h , %option nounistd should be used. Another issue is the call to isatty (a Unix library function), which can be found in the generated code. The %option never-interactive forces flex to generate code that does not use isatty. [14]
Flex can only generate code for C and C++. To use the scanner code generated by flex from other languages a language binding tool such as SWIG can be used.
Flex is limited to matching 1-byte (8-bit) binary values and therefore does not support Unicode. [15] RE/flex and other alternatives do support Unicode matching.
flex++ is a similar lexical scanner for C++ which is included as part of the flex package. The generated code does not depend on any runtime or external library except for a memory allocator (malloc or a user-supplied alternative) unless the input also depends on it. This can be useful in embedded and similar situations where traditional operating system or C runtime facilities may not be available.
The flex++ generated C++ scanner includes the header file FlexLexer.h
, which defines the interfaces of the two C++ generated classes.
In computer science, an LALR parser is part of the compiling process where human readable text is converted into a structured representation to be read by computers. An LALR parser is a software tool to process (parse) text into a very specific internal representation that other programs, such as compilers, can work with. This process happens according to a set of production rules specified by a formal grammar for a computer language.
Yacc is a computer program for the Unix operating system developed by Stephen C. Johnson. It is a lookahead left-to-right rightmost derivation (LALR) parser generator, generating a LALR parser based on a formal grammar, written in a notation similar to Backus–Naur form (BNF). Yacc is supplied as a standard utility on BSD and AT&T Unix. GNU-based Linux distributions include Bison, a forward-compatible Yacc replacement.
GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in Bison syntax, warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.
In computer science, a compiler-compiler or compiler generator is a programming tool that creates a parser, interpreter, or compiler from some form of formal description of a programming language and machine.
Lexical tokenization is conversion of a text into meaningful lexical tokens belonging to categories defined by a "lexer" program. In case of a natural language, those categories include nouns, verbs, adjectives, punctuations etc. In case of a programming language, the categories include identifiers, operators, grouping symbols and data types. Lexical tokenization is related to the type of tokenization used in large language models (LLMs) but with two differences. First, lexical tokenization is usually based on a lexical grammar, whereas LLM tokenizers are usually probability-based. Second, LLM tokenizers perform a second step that converts the tokens into numerical values.
Lex is a computer program that generates lexical analyzers. It is commonly used with the yacc parser generator and is the standard lexical analyzer generator on many Unix and Unix-like systems. An equivalent tool is specified as part of the POSIX standard.
A compiled language is a programming language for which source code is typically compiled; not interpreted.
JavaCC is an open-source parser generator and lexical analyzer generator written in the Java programming language.
Doxygen is a documentation generator and static analysis tool for software source trees. When used as a documentation generator, Doxygen extracts information from specially-formatted comments within the code. When used for analysis, Doxygen uses its parse tree to generate diagrams and charts of the code structure. Doxygen can cross reference documentation and code, so that the reader of a document can easily refer to the actual code.
Berkeley Yacc (byacc) is a Unix parser generator designed to be compatible with Yacc. It was originally written by Robert Corbett and released in 1989. Due to its liberal license and because it was faster than the AT&T Yacc, it quickly became the most popular version of Yacc. It has the advantages of being written in ANSI C89 and being public domain software.
In computer science, scannerless parsing performs tokenization and parsing in a single step, rather than breaking it up into a pipeline of a lexer followed by a parser, executing concurrently. A language grammar is scannerless if it uses a single formalism to express both the lexical and phrase level structure of the language.
In computer programming, the lexer hack is a solution to parsing context-sensitive grammars such as C, where classifying a sequence of characters as a variable name or a type name requires contextual information, by feeding contextual information backwards from the parser to the lexer.
This is a list of notable lexer generators and parser generators for various language classes.
Lemon is a parser generator, maintained as part of the SQLite project, that generates a look-ahead LR parser in the programming language C from an input context-free grammar. The generator is quite simple, implemented in one C source file with another file used as a template for output. Lexical analysis is performed externally.
JetPAG is an open-source LL(k) parser and lexical analyzer generator, licensed under the GNU General Public License. It is a personal work of Tareq H. Sharafy, and is currently at final beta stages of development.
In computer science, SYNTAX is a system used to generate lexical and syntactic analyzers (parsers) for all kinds of context-free grammars (CFGs) as well as some classes of contextual grammars. It has been developed at INRIA in France for several decades, mostly by Pierre Boullier, but has become free software since 2007 only. SYNTAX is distributed under the CeCILL license.
In computing, a compiler is a computer program that transforms source code written in a programming language or computer language, into another computer language. The most common reason for transforming source code is to create an executable program.
A lookahead LR parser (LALR) generator is a software tool that reads a context-free grammar (CFG) and creates an LALR parser which is capable of parsing files written in the context-free language defined by the CFG. LALR parsers are desirable because they are very fast and small in comparison to other types of parsers.
re2c is a free and open-source lexer generator for C, C++, Go, and Rust. It compiles declarative regular expression specifications to deterministic finite automata. Originally written by Peter Bumbulis and described in his paper, re2c was put in public domain and has been since maintained by volunteers. It is the lexer generator adopted by projects such as PHP, SpamAssassin, Ninja build system and others. Together with the Lemon parser generator, re2c is used in BRL-CAD. This combination is also used with STEPcode, an implementation of ISO 10303 standard.
RE/flex is a free and open source computer program written in C++ that generates fast lexical analyzers in C++. RE/flex offers full Unicode support, indentation anchors, word boundaries, lazy quantifiers, and performance tuning options. RE/flex accepts Flex lexer specifications and offers options to generate scanners for Bison parsers. RE/flex includes a fast C++ regular expression library.
In about 1987, Vern Paxson of the Lawrence Berkeley Lab took a version of lex written in ratfor (an extended Fortran popular at the time) and translated it into C, calling it flex, for 'Fast Lexical Analyzer Generator.'
A freely available version of lex is flex.
This is flex, the fast lexical analyzer generator.
flex(1)
". *BSD man pages .yacc(1)
". *BSD man pages .