Shunting yard algorithm

Class	Parsing
Data structure	Stack
Worst-case performance	O(n)
Worst-case space complexity	O(n)

Last updated November 05, 2024

In computer science, the shunting yard algorithm is a method for parsing arithmetical or logical expressions, or a combination of both, specified in infix notation. It can produce either a postfix notation string, also known as reverse Polish notation (RPN), or an abstract syntax tree (AST).^[1] The algorithm was invented by Edsger Dijkstra, first published in November 1961,^[2] and named the "shunting yard" algorithm because its operation resembles that of a railroad shunting yard.

Like the evaluation of RPN, the shunting yard algorithm is stack-based. Infix expressions are the form of mathematical notation most people are used to, for instance "3 + 4" or "3 + 4 × (2 − 1)". The conversion works with an input string, an output queue, and a stack that holds operators not yet added to the output queue. The program reads each symbol in order and, depending on the kind of symbol, either appends it to the output or pushes it onto the stack, possibly after popping operators from the stack to the output first. The result for the above examples would be (in reverse Polish notation) "3 4 +" and "3 4 2 1 − × +", respectively.
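The stack-based evaluation of RPN mentioned above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the names `eval_rpn` and `BINARY_OPS` are hypothetical, tokens are assumed to be separated by spaces, and the ASCII characters `*` and `-` stand in for × and −.

```python
import operator

# Binary operators supported by this sketch (an assumed, illustrative set).
BINARY_OPS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}

def eval_rpn(tokens):
    """Evaluate a list of RPN tokens such as ["3", "4", "+"]."""
    stack = []
    for tok in tokens:
        if tok in BINARY_OPS:
            right = stack.pop()  # the second operand is on top of the stack
            left = stack.pop()
            stack.append(BINARY_OPS[tok](left, right))
        else:
            stack.append(float(tok))  # every other token is treated as a number
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]

print(eval_rpn("3 4 +".split()))          # 7.0
print(eval_rpn("3 4 2 1 - * +".split()))  # 7.0
```

Both RPN strings from the paragraph above evaluate to 7, which matches the infix expressions 3 + 4 and 3 + 4 × (2 − 1).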

The shunting yard algorithm will correctly parse all valid infix expressions, but it does not reject all invalid expressions. For example, "1 2 +" is not a valid infix expression, but the algorithm accepts it and produces the same output as for "1 + 2". The algorithm can, however, reject expressions with mismatched parentheses.

The shunting yard algorithm was later generalized into operator-precedence parsing.

A simple conversion

Input: 3 + 4
Push 3 to the output queue (whenever a number is read it is pushed to the output)
Push + (or its ID) onto the operator stack
Push 4 to the output queue
After reading the expression, pop the operators off the stack and add them to the output.
In this case there is only one, "+".
Output: 3 4 +

This already shows a couple of rules:

All numbers are pushed to the output when they are read.
At the end of reading the expression, pop all operators off the stack and onto the output.

Graphical illustration

Graphical illustration of the algorithm, using a three-way railroad junction. The input is processed one symbol at a time. If a variable or number is found, it is copied directly to the output (steps a, c, e, h). If the symbol is an operator, it is pushed onto the operator stack (steps b, d, f). If the incoming operator's precedence is lower than that of the operator at the top of the stack, or the precedences are equal and the incoming operator is left-associative, then the operator at the top of the stack is popped and added to the output (step g) before the incoming operator is pushed. Finally, any remaining operators are popped off the stack and added to the output (step i).

The algorithm in detail

/* The functions referred to in this algorithm are simple single argument functions such as sine, inverse or factorial. */
/* This implementation does not implement composite functions, functions with a variable number of arguments, or unary operators. */

while there are tokens to be read:
    read a token
    if the token is:
    - a number:
        put it into the output queue
    - a function:
        push it onto the operator stack
    - an operator o₁:
        while (
            there is an operator o₂ at the top of the operator stack which is not a left parenthesis,
            and (o₂ has greater precedence than o₁ or (o₁ and o₂ have the same precedence and o₁ is left-associative))
        ):
            pop o₂ from the operator stack into the output queue
        push o₁ onto the operator stack
    - a ",":
        while the operator at the top of the operator stack is not a left parenthesis:
            pop the operator from the operator stack into the output queue
    - a left parenthesis (i.e. "("):
        push it onto the operator stack
    - a right parenthesis (i.e. ")"):
        while the operator at the top of the operator stack is not a left parenthesis:
            {assert the operator stack is not empty}
            /* If the stack runs out without finding a left parenthesis, then there are mismatched parentheses. */
            pop the operator from the operator stack into the output queue
        {assert there is a left parenthesis at the top of the operator stack}
        pop the left parenthesis from the operator stack and discard it
        if there is a function token at the top of the operator stack, then:
            pop the function from the operator stack into the output queue

/* After the while loop, pop the remaining items from the operator stack into the output queue. */
while there are tokens on the operator stack:
    /* If the operator token on the top of the stack is a parenthesis, then there are mismatched parentheses. */
    {assert the operator on top of the stack is not a (left) parenthesis}
    pop the operator from the operator stack onto the output queue
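The pseudocode above could be translated into Python roughly as follows. This is a sketch under stated assumptions, not a definitive implementation: the input is assumed to be already split into tokens, `*`, `/` and `-` stand in for ×, ÷ and −, and the names `OPERATORS`, `FUNCTIONS` and `shunting_yard` are hypothetical. Any token that is not an operator, function name, comma or parenthesis is treated as an operand.

```python
# Operator table: symbol -> (precedence, associativity).
OPERATORS = {
    "^": (4, "right"),
    "*": (3, "left"),
    "/": (3, "left"),
    "+": (2, "left"),
    "-": (2, "left"),
}
FUNCTIONS = {"sin", "cos", "max", "min"}  # illustrative set of function names

def shunting_yard(tokens):
    """Convert a list of infix tokens into a list of RPN tokens."""
    output = []  # the output queue
    stack = []   # the operator stack; the top is the last element
    for tok in tokens:
        if tok in FUNCTIONS:
            stack.append(tok)
        elif tok in OPERATORS:
            prec, assoc = OPERATORS[tok]
            # Pop operators that must be applied before the incoming one.
            while stack and stack[-1] in OPERATORS:
                top_prec = OPERATORS[stack[-1]][0]
                if top_prec > prec or (top_prec == prec and assoc == "left"):
                    output.append(stack.pop())
                else:
                    break
            stack.append(tok)
        elif tok == ",":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("mismatched parentheses")
            stack.pop()  # discard the matching "("
            if stack and stack[-1] in FUNCTIONS:
                output.append(stack.pop())
        else:
            output.append(tok)  # number or other operand
    while stack:
        top = stack.pop()
        if top == "(":
            raise ValueError("mismatched parentheses")
        output.append(top)
    return output

print(" ".join(shunting_yard("3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3".split())))
# 3 4 2 * 1 5 - 2 3 ^ ^ / +
```

Like the pseudocode, this sketch detects mismatched parentheses but does not otherwise validate the input.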

To analyze the running time of this algorithm, note that each token is read once, each number, function, or operator is written to the output once, and each function, operator, or parenthesis is pushed onto the stack and popped off it once. There are therefore at most a constant number of operations per token, and the running time is O(n), linear in the size of the input.
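The counting argument can be written out as a rough bound. This is a sketch in which n is the number of input tokens and T(n) the total number of elementary operations; the comparison term is an assumption added here, charging each precedence test either to the pop it triggers or to the token whose loop it ends.

```latex
T(n) \le \underbrace{n}_{\text{reads}}
      + \underbrace{n}_{\text{outputs}}
      + \underbrace{n}_{\text{pushes}}
      + \underbrace{n}_{\text{pops}}
      + \underbrace{2n}_{\text{comparisons}}
      = 6n = O(n)
```

The exact constant is unimportant; the point is that no token is ever handled more than a fixed number of times.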

The shunting yard algorithm can also be applied to produce prefix notation (also known as Polish notation). To do this, one starts from the end of the string of tokens and works backwards, reverses the output queue (making it an output stack), and flips the left and right parenthesis behavior (the now-left parenthesis behavior should pop until it finds a now-right parenthesis). The associativity condition must also change: on equal precedence, the operator on the stack is popped when the incoming operator is right-associative rather than left-associative.
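A possible sketch of the prefix variant follows. It assumes pre-split tokens, uses ASCII operator symbols, omits function tokens for brevity, and implements the flipped parenthesis behavior by swapping the two parenthesis tokens while scanning backwards. The names `OPERATORS` and `to_prefix` are hypothetical.

```python
# Operator table: symbol -> (precedence, associativity).
OPERATORS = {
    "^": (4, "right"),
    "*": (3, "left"),
    "/": (3, "left"),
    "+": (2, "left"),
    "-": (2, "left"),
}

def to_prefix(tokens):
    """Convert a list of infix tokens into a list of prefix tokens."""
    swap = {"(": ")", ")": "("}
    output, stack = [], []
    # Scan right to left, exchanging the roles of the parentheses.
    for tok in (swap.get(t, t) for t in reversed(tokens)):
        if tok in OPERATORS:
            prec, assoc = OPERATORS[tok]
            while stack and stack[-1] in OPERATORS:
                top_prec = OPERATORS[stack[-1]][0]
                # The associativity test is flipped relative to the RPN version.
                if top_prec > prec or (top_prec == prec and assoc == "right"):
                    output.append(stack.pop())
                else:
                    break
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                output.append(stack.pop())
            if not stack:
                raise ValueError("mismatched parentheses")
            stack.pop()  # discard the matching parenthesis
        else:
            output.append(tok)  # operand
    while stack:
        top = stack.pop()
        if top == "(":
            raise ValueError("mismatched parentheses")
        output.append(top)
    output.reverse()  # the output was built back to front
    return output

print(" ".join(to_prefix("a - b - c".split())))  # - - a b c
print(" ".join(to_prefix("a ^ b ^ c".split())))  # ^ a ^ b c
```

The two printed results show why the associativity condition changes: the left-associative subtraction groups as (a − b) − c, while the right-associative power groups as a ^ (b ^ c).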

Detailed examples

Input: 3 + 4 × 2 ÷ ( 1 − 5 ) ^ 2 ^ 3

Operator	Precedence	Associativity
^	4	Right
×	3	Left
÷	3	Left
+	2	Left
−	2	Left

The symbol ^ represents the power operator.

Token	Action	Output (in RPN)	Operator stack	Notes
3	Add token to output	3
+	Push token to stack	3	+
4	Add token to output	3 4	+
×	Push token to stack	3 4	× +	× has higher precedence than +
2	Add token to output	3 4 2	× +
÷	Pop stack to output	3 4 2 ×	+	÷ and × have same precedence
÷	Push token to stack	3 4 2 ×	÷ +	÷ has higher precedence than +
(	Push token to stack	3 4 2 ×	( ÷ +
1	Add token to output	3 4 2 × 1	( ÷ +
−	Push token to stack	3 4 2 × 1	− ( ÷ +
5	Add token to output	3 4 2 × 1 5	− ( ÷ +
)	Pop stack to output	3 4 2 × 1 5 −	( ÷ +	Repeated until "(" found
)	Pop stack	3 4 2 × 1 5 −	÷ +	Discard matching parenthesis
^	Push token to stack	3 4 2 × 1 5 −	^ ÷ +	^ has higher precedence than ÷
2	Add token to output	3 4 2 × 1 5 − 2	^ ÷ +
^	Push token to stack	3 4 2 × 1 5 − 2	^ ^ ÷ +	^ is evaluated right-to-left
3	Add token to output	3 4 2 × 1 5 − 2 3	^ ^ ÷ +
end	Pop entire stack to output	3 4 2 × 1 5 − 2 3 ^ ^ ÷ +
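As the introduction notes, the algorithm can produce an abstract syntax tree instead of RPN. One way to sketch this is to replace the output queue with a stack of tree nodes: whenever an operator would be moved to the output, it is combined with the two most recent nodes instead. The code below is an illustration under assumptions, not the source's method: trees are nested tuples of the form (operator, left, right), function tokens are omitted, ASCII operator symbols are used, and the names `OPERATORS` and `to_ast` are hypothetical.

```python
# Operator table: symbol -> (precedence, associativity).
OPERATORS = {
    "^": (4, "right"),
    "*": (3, "left"),
    "/": (3, "left"),
    "+": (2, "left"),
    "-": (2, "left"),
}

def to_ast(tokens):
    """Build a nested-tuple syntax tree from a list of infix tokens."""
    nodes = []  # finished subtrees, used in place of the output queue
    stack = []  # the operator stack

    def reduce_top():
        # Combine the top operator with the two most recent subtrees.
        op = stack.pop()
        right = nodes.pop()
        left = nodes.pop()
        nodes.append((op, left, right))

    for tok in tokens:
        if tok in OPERATORS:
            prec, assoc = OPERATORS[tok]
            while stack and stack[-1] in OPERATORS:
                top_prec = OPERATORS[stack[-1]][0]
                if top_prec > prec or (top_prec == prec and assoc == "left"):
                    reduce_top()
                else:
                    break
            stack.append(tok)
        elif tok == "(":
            stack.append(tok)
        elif tok == ")":
            while stack and stack[-1] != "(":
                reduce_top()
            if not stack:
                raise ValueError("mismatched parentheses")
            stack.pop()  # discard the matching "("
        else:
            nodes.append(tok)  # operand leaf
    while stack:
        if stack[-1] == "(":
            raise ValueError("mismatched parentheses")
        reduce_top()
    if len(nodes) != 1:
        raise ValueError("malformed expression")
    return nodes[0]

print(to_ast("3 + 4 * 2 / ( 1 - 5 ) ^ 2 ^ 3".split()))
# ('+', '3', ('/', ('*', '4', '2'), ('^', ('-', '1', '5'), ('^', '2', '3'))))
```

The tree for the example input mirrors the RPN output in the table: reading the tree in post-order yields 3 4 2 × 1 5 − 2 3 ^ ^ ÷ +.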

Input: sin ( max ( 2, 3 ) ÷ 3 × π )

Token	Action	Output (in RPN)	Operator stack	Notes
sin	Push token to stack		sin
(	Push token to stack		( sin
max	Push token to stack		max ( sin
(	Push token to stack		( max ( sin
2	Add token to output	2	( max ( sin
,	Ignore	2	( max ( sin	The operator at the top of the stack is a left parenthesis
3	Add token to output	2 3	( max ( sin
)	Pop stack to output	2 3	( max ( sin	Repeated until "(" is at the top of the stack
)	Pop stack	2 3	max ( sin	Discard matching parenthesis
)	Pop stack to output	2 3 max	( sin	Function at top of the stack
÷	Push token to stack	2 3 max	÷ ( sin
3	Add token to output	2 3 max 3	÷ ( sin
×	Pop stack to output	2 3 max 3 ÷	( sin	× and ÷ have same precedence
×	Push token to stack	2 3 max 3 ÷	× ( sin
π	Add token to output	2 3 max 3 ÷ π	× ( sin
)	Pop stack to output	2 3 max 3 ÷ π ×	( sin	Repeated until "(" is at the top of the stack
)	Pop stack	2 3 max 3 ÷ π ×	sin	Discard matching parenthesis
)	Pop stack to output	2 3 max 3 ÷ π × sin		Function at top of the stack
end	Pop entire stack to output	2 3 max 3 ÷ π × sin
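To check the result of this conversion, the RPN output can be evaluated with a stack, where each function or operator consumes as many operands as it has arguments. The sketch below is illustrative only: the arity table `FUNCTIONS`, the `CONSTANTS` table, the name `eval_rpn`, and the ASCII spellings `/`, `*` and `pi` are all assumptions made for this example.

```python
import math

# name -> (number of arguments, implementation); an assumed table for this sketch
FUNCTIONS = {
    "sin": (1, math.sin),
    "max": (2, max),
    "/": (2, lambda a, b: a / b),
    "*": (2, lambda a, b: a * b),
}
CONSTANTS = {"pi": math.pi}

def eval_rpn(tokens):
    """Evaluate RPN tokens that may contain functions of known arity."""
    stack = []
    for tok in tokens:
        if tok in FUNCTIONS:
            arity, fn = FUNCTIONS[tok]
            if len(stack) < arity:
                raise ValueError(f"not enough operands for {tok!r}")
            args = stack[-arity:]  # operands in their original order
            del stack[-arity:]
            stack.append(fn(*args))
        elif tok in CONSTANTS:
            stack.append(CONSTANTS[tok])
        else:
            stack.append(float(tok))
    if len(stack) != 1:
        raise ValueError("malformed RPN expression")
    return stack[0]

result = eval_rpn("2 3 max 3 / pi * sin".split())
print(abs(result) < 1e-12)  # True: sin(π) is zero up to floating-point rounding
```

Here max(2, 3) ÷ 3 × π equals π, so the expression evaluates to sin(π), which is zero apart from rounding error.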

References

[1] Norvell, Theodore (1999). "Parsing Expressions by Recursive Descent". www.engr.mun.ca. Retrieved 2020-12-28.
[2] Dijkstra, Edsger (1961-11-01). "Algol 60 translation: An Algol 60 translator for the X1 and making a translator for Algol 60". Stichting Mathematisch Centrum.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.
