GNU Bison

Original author(s)	Robert Corbett
Developer(s)	The GNU Project
Initial release	June 1985; 39 years ago
Stable release	3.8.2 / 25 September 2021
Repository	git.savannah.gnu.org/cgit/bison.git
Written in	C and m4
Operating system	Unix-like
Type	Parser generator
License	GPL
Website	www.gnu.org/software/bison/

Last updated December 14, 2024

GNU Bison, commonly known as Bison, is a parser generator that is part of the GNU Project. Bison reads a specification in Bison syntax (described as "machine-readable BNF"^[3]), warns about any parsing ambiguities, and generates a parser that reads sequences of tokens and decides whether the sequence conforms to the syntax specified by the grammar.

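Bison by default generates LALR(1) parsers, but it can also generate canonical LR, IELR(1) and GLR parsers.^[4] In POSIX mode, Bison is compatible with Yacc, but it also has several extensions over this earlier program, including:

Generation of counterexamples for conflicts
Location tracking (e.g., file, line, column)
Rich and internationalizable syntax error messages in the generated parsers
Customizable syntax error generation
Reentrant parsers
Push parsers, with autocompletion
Support for named references
Several types of reports (graphical, XML) on the generated parser
Support for several programming languages (C, C++, D, or Java)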

Flex, an automatic lexical analyser, is often used with Bison, to tokenise input data and provide Bison with tokens.^[5]

Bison was originally written by Robert Corbett in 1985.^[1] Later, in 1989, Robert Corbett released another parser generator named Berkeley Yacc. Bison was made Yacc-compatible by Richard Stallman.^[6]

Bison is free software and is available under the GNU General Public License, with an exception (discussed below) allowing its generated code to be used without triggering the copyleft requirements of the license.

Features

Counterexample generation

One delicate issue with LR parser generators is the resolution of conflicts (shift/reduce and reduce/reduce conflicts). With many LR parser generators, resolving conflicts requires the analysis of the parser automaton, which demands some expertise from the user.

To help the user understand conflicts more intuitively, Bison can instead automatically generate counterexamples. For ambiguous grammars, Bison can often even produce counterexamples that show the grammar is ambiguous.

For instance, on a grammar suffering from the infamous dangling else problem, Bison reports:

    doc/if-then-else.y: warning: shift/reduce conflict on token "else" [-Wcounterexamples]
      Example: "if" expr "then" "if" expr "then" stmt • "else" stmt
      Shift derivation
        if_stmt
        ↳ "if" expr "then" stmt
                             ↳ if_stmt
                               ↳ "if" expr "then" stmt • "else" stmt
      Example: "if" expr "then" "if" expr "then" stmt • "else" stmt
      Reduce derivation
        if_stmt
        ↳ "if" expr "then" stmt                        "else" stmt
                             ↳ if_stmt
                               ↳ "if" expr "then" stmt •
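
The C language has the same ambiguity in its own grammar and settles it the way Bison's default conflict resolution (prefer shift) does: an "else" attaches to the nearest preceding "if". The following is a minimal sketch of that behaviour in plain C, not Bison output; the function name dangling_else_demo is a hypothetical helper introduced only for illustration. Compilers may warn about the missing braces, which is the point of the example.

```c
/* Hypothetical helper that reports which assignment ran.
 * Returns 0 if neither ran, 1 for the "then" branch, 2 for the "else" branch. */
int dangling_else_demo(int outer, int inner)
{
    int taken = 0;

    /* No braces on purpose: the grammar alone decides where "else" attaches. */
    if (outer)
        if (inner)
            taken = 1;
        else
            taken = 2;   /* pairs with "if (inner)", not with "if (outer)" */

    return taken;
}
```

If the "else" were attached to the outer "if" (the reduce derivation), calling the function with outer equal to 0 would return 2; under C's rule it returns 0.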

Reentrancy

Reentrancy is a feature which has been added to Bison and does not exist in Yacc.

Normally, Bison generates a parser which is not reentrant. In order to achieve reentrancy the declaration %define api.pure must be used. More details on Bison reentrancy can be found in the Bison manual.^[7]
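
In C terms, reentrancy means that all state lives in an object supplied by the caller rather than in global or static variables, so that several parses can be in progress at once. The sketch below illustrates only that idea with a tiny hand-written number scanner; it is not code generated by Bison, and the names NumScanner, num_scanner_init and num_scanner_next are hypothetical.

```c
#include <ctype.h>

/* Hypothetical scanner state: everything the scanner needs is stored here,
 * not in global or static variables. */
typedef struct {
    const char *cursor;   /* next unread character */
} NumScanner;

/* Point a scanner at the start of a NUL-terminated input string. */
void num_scanner_init(NumScanner *s, const char *input)
{
    s->cursor = input;
}

/* Read the next unsigned decimal number into *out.
 * Returns 1 on success, 0 when no number is left. */
int num_scanner_next(NumScanner *s, int *out)
{
    int value = 0;

    /* Skip anything that is not a digit. */
    while (*s->cursor != '\0' && !isdigit((unsigned char)*s->cursor))
        s->cursor++;
    if (*s->cursor == '\0')
        return 0;

    while (isdigit((unsigned char)*s->cursor)) {
        value = value * 10 + (*s->cursor - '0');
        s->cursor++;
    }
    *out = value;
    return 1;
}
```

Because each NumScanner carries its own cursor, calls on two different scanners can be interleaved freely without disturbing each other. A version that kept the cursor in a static variable could serve only one input at a time.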

Output languages

Bison can generate code for C, C++, D and Java.^[8]

To use a Bison-generated parser from other languages, a language binding tool such as SWIG can be used.

License and distribution of generated code

Because Bison generates source code that in turn gets added to the source code of other software projects, it raises some simple but interesting copyright questions.

A GPL-compatible license is not required

The code generated by Bison includes significant amounts of code from the Bison project itself. The Bison package is distributed under the terms of the GNU General Public License (GPL) but an exception has been added so that the GPL does not apply to output.^[9]^[10]

Earlier releases of Bison stipulated that parts of its output were also licensed under the GPL, due to the inclusion of the yyparse() function from the original source code in the output.

Distribution of packages using Bison

Free software projects that use Bison can choose whether to distribute the source code that their project feeds into Bison, or the C code that Bison generates from it. Either is sufficient for a recipient to compile the project. However, distributing only the input carries the minor inconvenience that recipients must have a compatible copy of Bison installed so that they can generate the necessary C code when compiling the project. Distributing only the generated C code makes it very difficult for recipients to modify the parser, since this code was written neither by a human nor for humans: its purpose is to be fed directly into a C compiler.

These problems can be avoided by distributing both the input files and the generated code. Most people will compile using the generated code, no different from any other software package, but anyone who wants to modify the parser component can modify the input files first and re-generate the generated files before compiling. Projects distributing both usually do not have the generated files in their version control systems. The files are only generated when making a release.

Some licenses, such as the GPL, require that the source code be in "the preferred form of the work for making modifications to it". GPL'd projects using Bison must thus distribute the files which are the input for Bison. Of course, they can also include the generated files.

Use

Because Bison was written as a replacement for Yacc, and is largely compatible, the code from a lot of projects using Bison could equally be fed into Yacc. This makes it difficult to determine if a project "uses" Bison-specific source code or not. In many cases, the "use" of Bison could be trivially replaced by the equivalent use of Yacc or one of its other derivatives.

Bison has features not found in Yacc, so some projects can be truly said to "use" Bison, since Yacc would not suffice.

The following list is of projects which are known to "use" Bison in the looser sense, that they use free software development tools and distribute code which is intended to be fed into Bison or a Bison-compatible package.

Bash shell uses a yacc grammar for parsing the command input.
Bison's own grammar parser is generated by Bison.^[11]
CMake uses several Bison grammars.^[12]
GCC started out using Bison, but switched to a hand-written recursive-descent parser for C++ in 2004 (version 3.4),^[13] and for C and Objective-C in 2006 (version 4.1).^[14]
The Go programming language (GC) used Bison, but switched to a hand-written scanner and parser in version 1.5.^[15]
LilyPond requires Bison to generate its parser.^[16]
MySQL^[17]
GNU Octave uses a Bison-generated parser.^[18]
Perl 5 uses a Bison-generated parser starting in 5.10.^[19]
The PHP programming language (Zend Parser).
PostgreSQL^[20]
Ruby MRI, the reference implementation of the Ruby programming language, relies on a Bison grammar.^[21]
syslog-ng uses several Bison grammars assembled together.^[22]

A complete reentrant parser example

The following example shows how to use Bison and flex to write a simple calculator program (only addition and multiplication) and a program for creating an abstract syntax tree. The next two files provide definition and implementation of the syntax tree functions.

    /*
     * Expression.h
     * Definition of the structure used to build the syntax tree.
     */
    #ifndef __EXPRESSION_H__
    #define __EXPRESSION_H__

    /**
     * @brief The operation type
     */
    typedef enum tagEOperationType
    {
        eVALUE,
        eMULTIPLY,
        eADD
    } EOperationType;

    /**
     * @brief The expression structure
     */
    typedef struct tagSExpression
    {
        EOperationType type;          /**< type of operation */

        int value;                    /**< valid only when type is eVALUE */
        struct tagSExpression *left;  /**< left side of the tree */
        struct tagSExpression *right; /**< right side of the tree */
    } SExpression;

    /**
     * @brief It creates a number
     * @param value The number value
     * @return The expression or NULL in case of no memory
     */
    SExpression *createNumber(int value);

    /**
     * @brief It creates an operation
     * @param type The operation type
     * @param left The left operand
     * @param right The right operand
     * @return The expression or NULL in case of no memory
     */
    SExpression *createOperation(EOperationType type, SExpression *left, SExpression *right);

    /**
     * @brief Deletes an expression
     * @param b The expression
     */
    void deleteExpression(SExpression *b);

    #endif /* __EXPRESSION_H__ */

    /*
     * Expression.c
     * Implementation of functions used to build the syntax tree.
     */

    #include "Expression.h"

    #include <stdlib.h>

    /**
     * @brief Allocates space for expression
     * @return The expression or NULL if not enough memory
     */
    static SExpression *allocateExpression(void)
    {
        SExpression *b = (SExpression *)malloc(sizeof(SExpression));

        if (b == NULL)
            return NULL;

        b->type = eVALUE;
        b->value = 0;

        b->left = NULL;
        b->right = NULL;

        return b;
    }

    SExpression *createNumber(int value)
    {
        SExpression *b = allocateExpression();

        if (b == NULL)
            return NULL;

        b->type = eVALUE;
        b->value = value;

        return b;
    }

    SExpression *createOperation(EOperationType type, SExpression *left, SExpression *right)
    {
        SExpression *b = allocateExpression();

        if (b == NULL)
            return NULL;

        b->type = type;
        b->left = left;
        b->right = right;

        return b;
    }

    void deleteExpression(SExpression *b)
    {
        if (b == NULL)
            return;

        deleteExpression(b->left);
        deleteExpression(b->right);

        free(b);
    }

The tokens needed by the Bison parser will be generated using flex.

    %{

    /*
     * Lexer.l file
     * To generate the lexical analyzer run: "flex Lexer.l"
     */

    #include "Expression.h"
    #include "Parser.h"

    #include <stdio.h>

    %}

    %option outfile="Lexer.c" header-file="Lexer.h"
    %option warn nodefault

    %option reentrant noyywrap never-interactive nounistd
    %option bison-bridge

    %%

    [ \r\n\t]+   { continue; /* Skip blanks. */ }
    [0-9]+       { sscanf(yytext, "%d", &yylval->value); return TOKEN_NUMBER; }

    "*"          { return TOKEN_STAR; }
    "+"          { return TOKEN_PLUS; }
    "("          { return TOKEN_LPAREN; }
    ")"          { return TOKEN_RPAREN; }

    .            { continue; /* Ignore unexpected characters. */ }

    %%

    int yyerror(SExpression **expression, yyscan_t scanner, const char *msg)
    {
        fprintf(stderr, "Error: %s\n", msg);
        return 0;
    }

The names of the tokens are typically neutral: "TOKEN_PLUS" and "TOKEN_STAR", not "TOKEN_ADD" and "TOKEN_MULTIPLY". For instance if we were to support the unary "+" (as in "+1"), it would be wrong to name this "+" "TOKEN_ADD". In a language such as C, "int *ptr" denotes the definition of a pointer, not a product: it would be wrong to name this "*" "TOKEN_MULTIPLY".

Since the tokens are provided by flex, we must provide the means to communicate between the parser and the lexer.^[23] The data type used for communication, YYSTYPE, is set using Bison's %union declaration.

Since this sample uses the reentrant versions of both flex and Bison, we must provide parameters for the yylex function when it is called from yyparse.^[23] This is done through Bison's %lex-param and %parse-param declarations.^[24]
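
With a pure parser, the lexer receives a pointer to the semantic value it should fill in, followed by any extra arguments declared with %lex-param. The hand-written stand-in below sketches that calling convention for the calculator tokens. It is not flex output: the names MockSemanticValue, MockScanner, mock_yylex and the MOCK_ token codes are hypothetical, and the token code values are arbitrary.

```c
#include <ctype.h>

/* Hypothetical stand-in for the type Bison derives from %union. */
typedef union {
    int value;          /* used by number tokens */
    void *expression;   /* placeholder for SExpression* in the real grammar */
} MockSemanticValue;

/* Hypothetical token codes; in the real project Bison generates these. */
enum {
    MOCK_EOF = 0,
    MOCK_NUMBER,
    MOCK_PLUS,
    MOCK_STAR,
    MOCK_LPAREN,
    MOCK_RPAREN
};

/* Hypothetical scanner state, passed explicitly like yyscan_t. */
typedef struct {
    const char *cursor;   /* next unread character */
} MockScanner;

/* Return the next token code; for numbers, store the value through lval. */
int mock_yylex(MockSemanticValue *lval, MockScanner *scanner)
{
    for (;;) {
        unsigned char c = (unsigned char)*scanner->cursor;

        if (c == '\0')
            return MOCK_EOF;          /* stay at the end on repeated calls */

        if (isdigit(c)) {
            int v = 0;
            while (isdigit((unsigned char)*scanner->cursor)) {
                v = v * 10 + (*scanner->cursor - '0');
                scanner->cursor++;
            }
            lval->value = v;          /* semantic value travels via the pointer */
            return MOCK_NUMBER;       /* token kind travels via the return value */
        }

        scanner->cursor++;
        switch (c) {
        case '+': return MOCK_PLUS;
        case '*': return MOCK_STAR;
        case '(': return MOCK_LPAREN;
        case ')': return MOCK_RPAREN;
        default:  break;              /* blanks and unexpected characters are skipped */
        }
    }
}
```

The split mirrors the one used in the example grammar: the return value tells the parser which token was seen, while the union carries the data attached to it.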

    %{

    /*
     * Parser.y file
     * To generate the parser run: "bison Parser.y"
     */

    #include "Expression.h"
    #include "Parser.h"
    #include "Lexer.h"

    // reference the implementation provided in Lexer.l
    int yyerror(SExpression **expression, yyscan_t scanner, const char *msg);

    %}

    %code requires {
      typedef void* yyscan_t;
    }

    %output  "Parser.c"
    %defines "Parser.h"

    %define api.pure
    %lex-param   { yyscan_t scanner }
    %parse-param { SExpression **expression }
    %parse-param { yyscan_t scanner }

    %union {
        int value;
        SExpression *expression;
    }

    %token TOKEN_LPAREN   "("
    %token TOKEN_RPAREN   ")"
    %token TOKEN_PLUS     "+"
    %token TOKEN_STAR     "*"
    %token <value> TOKEN_NUMBER "number"

    %type <expression> expr

    /* Precedence (increasing) and associativity:
       a+b+c is (a+b)+c: left associativity
       a+b*c is a+(b*c): the precedence of "*" is higher than that of "+". */
    %left "+"
    %left "*"

    %%

    input
        : expr { *expression = $1; }
        ;

    expr
        : expr[L] "+" expr[R] { $$ = createOperation( eADD, $L, $R ); }
        | expr[L] "*" expr[R] { $$ = createOperation( eMULTIPLY, $L, $R ); }
        | "(" expr[E] ")"     { $$ = $E; }
        | "number"            { $$ = createNumber($1); }
        ;

    %%

The code needed to obtain the syntax tree using the parser generated by Bison and the scanner generated by flex is the following.

    /*
     * main.c file
     */

    #include "Expression.h"
    #include "Parser.h"
    #include "Lexer.h"

    #include <stdio.h>

    int yyparse(SExpression **expression, yyscan_t scanner);

    SExpression *getAST(const char *expr)
    {
        SExpression *expression = NULL;
        yyscan_t scanner;
        YY_BUFFER_STATE state;

        if (yylex_init(&scanner)) {
            /* could not initialize */
            return NULL;
        }

        state = yy_scan_string(expr, scanner);

        if (yyparse(&expression, scanner)) {
            /* error parsing: report failure to the caller */
            expression = NULL;
        }

        /* release the scanner on both the success and the error path */
        yy_delete_buffer(state, scanner);
        yylex_destroy(scanner);

        return expression;
    }

    int evaluate(SExpression *e)
    {
        switch (e->type) {
            case eVALUE:
                return e->value;
            case eMULTIPLY:
                return evaluate(e->left) * evaluate(e->right);
            case eADD:
                return evaluate(e->left) + evaluate(e->right);
            default:
                /* should not be here */
                return 0;
        }
    }

    int main(void)
    {
        char test[] = " 4 + 2*10 + 3*( 5 + 1 )";
        SExpression *e = getAST(test);
        int result;

        if (e == NULL) {
            /* getAST failed; do not dereference a NULL tree */
            fprintf(stderr, "Could not parse '%s'\n", test);
            return 1;
        }

        result = evaluate(e);
        printf("Result of '%s' is %d\n", test, result);
        deleteExpression(e);
        return 0;
    }
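
As a cross-check on the value that main.c should print for its test string, the same expressions can be evaluated by a small hand-written recursive-descent routine that uses neither Bison nor flex. This is a different technique from the table-driven parser Bison generates, and it is only a sketch: it does no error reporting, and the names RdInput, rd_eval, rd_sum, rd_product, rd_factor and rd_skip_blanks are hypothetical.

```c
#include <ctype.h>

/* Hypothetical cursor over the input text. */
typedef struct {
    const char *p;   /* next unread character */
} RdInput;

static int rd_sum(RdInput *in);

static void rd_skip_blanks(RdInput *in)
{
    while (isspace((unsigned char)*in->p))
        in->p++;
}

/* factor: number | "(" sum ")" */
static int rd_factor(RdInput *in)
{
    int v = 0;

    rd_skip_blanks(in);
    if (*in->p == '(') {
        in->p++;
        v = rd_sum(in);
        rd_skip_blanks(in);
        if (*in->p == ')')
            in->p++;
        return v;
    }
    while (isdigit((unsigned char)*in->p)) {
        v = v * 10 + (*in->p - '0');
        in->p++;
    }
    return v;
}

/* product: factor ("*" factor)*   -- binds tighter than "+" */
static int rd_product(RdInput *in)
{
    int v = rd_factor(in);

    for (;;) {
        rd_skip_blanks(in);
        if (*in->p != '*')
            return v;
        in->p++;
        v *= rd_factor(in);
    }
}

/* sum: product ("+" product)*   -- left associative */
static int rd_sum(RdInput *in)
{
    int v = rd_product(in);

    for (;;) {
        rd_skip_blanks(in);
        if (*in->p != '+')
            return v;
        in->p++;
        v += rd_product(in);
    }
}

/* Evaluate an expression made of numbers, "+", "*" and parentheses. */
int rd_eval(const char *text)
{
    RdInput in = { text };
    return rd_sum(&in);
}
```

Here precedence is encoded by the nesting of the functions (rd_sum calls rd_product, which calls rd_factor), whereas the Bison grammar expresses it with the two %left declarations. For the test string " 4 + 2*10 + 3*( 5 + 1 )", rd_eval returns 42.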

A simple makefile to build the project is the following.

    # Makefile
    # Recipe lines must start with a tab character.

    FILES = Lexer.c Parser.c Expression.c main.c
    CC = g++
    CFLAGS = -g -ansi

    test: $(FILES)
    	$(CC) $(CFLAGS) $(FILES) -o test

    Lexer.c: Lexer.l
    	flex Lexer.l

    Parser.c: Parser.y Lexer.c
    	bison Parser.y

    clean:
    	rm -f *.o *~ Lexer.c Lexer.h Parser.c Parser.h test

References

[1] Corbett, Robert Paul (June 1985). Static Semantics and Compiler Error Recovery (Ph.D.). University of California, Berkeley. DTIC ADA611756.
[2] Akim Demaille (25 September 2021). "Bison 3.8.2".
[3] "Language and Grammar (Bison 3.8.1)". www.gnu.org. Retrieved 2021-12-26.
[4] Bison Manual: Introduction.
[5] Levine, John (August 2009). flex & bison. O'Reilly Media. ISBN 978-0-596-15597-1.
[6] "AUTHORS". bison.git. GNU Savannah. Retrieved 2017-08-26.
[7] Bison Manual: A Pure (Reentrant) Parser.
[8] Bison Manual: Bison Declaration Summary.
[9] Bison Manual: Conditions for Using Bison.
[10] A source code file, parse-gram.c, which includes the exception.
[11] "parse-gram.y". bison.git. GNU Savannah. Retrieved 2020-07-29.
[12] "LexerParser in CMake". github.com.
[13] GCC 3.4 Release Series Changes, New Features, and Fixes.
[14] GCC 4.1 Release Series Changes, New Features, and Fixes.
[15] Golang grammar definition.
[16] "Parser.yy - GNU LilyPond Git Repository". git.savannah.gnu.org.
[17] "4. Parsing SQL - flex & bison [Book]".
[18] "GNU Octave: Libinterp/Parse-tree/Oct-parse.cc Source File".
[19] "What is new for perl 5.10.0?". perl.org.
[20] "The Parser Stage". postgresql.org. 30 September 2021.
[21] "Ruby MRI Parser". github.com.
[22] "syslog-ng's XML Parser". github.com. 14 October 2021.
[23] Flex Manual: C Scanners with Bison Parsers. Archived 2010-12-17 at the Wayback Machine.
[24] Bison Manual: Calling Conventions for Pure Parsers.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
