Chomsky normal form

Last updated September 13, 2023

In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form (first described by Noam Chomsky)^[1] if all of its production rules are of the form:^[2]^[3]

A → BC, or

A → a, or

S → ε,

where A, B, and C are nonterminal symbols, the letter a is a terminal symbol (a symbol that represents a constant value), S is the start symbol, and ε denotes the empty string. Also, neither B nor C may be the start symbol, and the third production rule can only appear if ε is in L(G), the language produced by the context-free grammar G.^[4]^{: 92–93, 106}

Every grammar in Chomsky normal form is context-free, and conversely, every context-free grammar can be transformed into an equivalent one^{[note 1]} which is in Chomsky normal form and has a size no larger than the square of the original grammar's size.
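The definition above can be checked mechanically. The following is a minimal sketch, not a reference implementation. It assumes a grammar is stored as a Python dict that maps each nonterminal to a set of right-hand sides (tuples of symbols), that every nonterminal appears as a key, and that the empty tuple stands for ε. The name `is_cnf` is chosen here for illustration.

```python
def is_cnf(rules, start):
    """Return True if every rule has the form A -> BC, A -> a, or start -> ε."""
    nonterminals = set(rules)  # assumption: every nonterminal is a key
    for lhs, bodies in rules.items():
        for body in bodies:
            if len(body) == 0:
                # Only the start symbol may have an ε-rule.
                if lhs != start:
                    return False
            elif len(body) == 1:
                # A single symbol must be a terminal (A -> a).
                if body[0] in nonterminals:
                    return False
            elif len(body) == 2:
                # Both symbols must be nonterminals other than the start symbol.
                if any(s not in nonterminals or s == start for s in body):
                    return False
            else:
                return False
    return True


grammar = {"S": {("A", "B"), ()}, "A": {("a",)}, "B": {("b",)}}
print(is_cnf(grammar, "S"))
```

A start symbol that has an ε-rule derives ε directly, so the side condition that ε is in L(G) needs no separate check in this sketch.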

Converting a grammar to Chomsky normal form

To convert a grammar to Chomsky normal form, a sequence of simple transformations is applied in a certain order; this is described in most textbooks on automata theory.^[4]^: 87–94^[5]^[6]^[7] The presentation here follows Hopcroft, Ullman (1979), but is adapted to use the transformation names from Lange, Leiß (2009).^[8]^{[note 2]} Each of the following transformations establishes one of the properties required for Chomsky normal form.

START: Eliminate the start symbol from right-hand sides

Introduce a new start symbol S₀, and a new rule

S₀ → S,

where S is the previous start symbol. This does not change the grammar's produced language, and S₀ will not occur on any rule's right-hand side.
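As an illustration, START can be sketched in a few lines. The dict-of-sets grammar representation (nonterminal mapped to a set of tuples) and the names `start_step` and `"S0"` are assumptions made for this example only.

```python
def start_step(rules, start, new_start="S0"):
    """Return (new_rules, new_start) with the extra rule new_start -> start."""
    if new_start in rules:
        # The new start symbol has to be fresh (assumed not to be a terminal either).
        raise ValueError(f"{new_start!r} is already a nonterminal")
    new_rules = {lhs: set(bodies) for lhs, bodies in rules.items()}  # copy, do not mutate
    new_rules[new_start] = {(start,)}
    return new_rules, new_start


grammar = {"S": {("a", "S", "b"), ()}}
new_grammar, s0 = start_step(grammar, "S")
print(new_grammar[s0])
```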

TERM: Eliminate rules with nonsolitary terminals

To eliminate each rule

A → X₁ ... a ... X_n

with a terminal symbol a being not the only symbol on the right-hand side, introduce, for every such terminal, a new nonterminal symbol N_a, and a new rule

N_a → a.

Change every rule

A → X₁ ... a ... X_n

to

A → X₁ ... N_a ... X_n.

If several terminal symbols occur on the right-hand side, simultaneously replace each of them by its associated nonterminal symbol. This does not change the grammar's produced language.^[4]^: 92
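A possible sketch of TERM follows, under the assumption that a grammar is a dict from nonterminals to sets of right-hand-side tuples and that any symbol which is not a key is a terminal. The helper names of the form `N_a` are assumed not to clash with existing symbols, and `term_step` is a name invented for this example.

```python
def term_step(rules):
    """Replace terminals in right-hand sides of length >= 2 by helper nonterminals."""
    nonterminals = set(rules)
    helpers = {}  # terminal a -> helper nonterminal N_a
    new_rules = {}
    for lhs, bodies in rules.items():
        new_bodies = set()
        for body in bodies:
            if len(body) < 2:
                # ε, a unit rule, or a solitary terminal: nothing to do.
                new_bodies.add(body)
                continue
            new_body = []
            for sym in body:
                if sym in nonterminals:
                    new_body.append(sym)
                else:
                    new_body.append(helpers.setdefault(sym, f"N_{sym}"))
            new_bodies.add(tuple(new_body))
        new_rules[lhs] = new_bodies
    # Add one rule N_a -> a per replaced terminal.
    for terminal, helper in helpers.items():
        new_rules[helper] = {(terminal,)}
    return new_rules


grammar = {"S": {("a", "S", "b"), ("a",), ()}}
print(term_step(grammar))
```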

BIN: Eliminate right-hand sides with more than 2 nonterminals

Replace each rule

A → X₁X₂ ... X_n

with more than 2 nonterminals X₁,...,X_n by rules

A → X₁A₁,

A₁ → X₂A₂,

... ,

A_{n-2} → X_{n-1}X_n,

where A_i are new nonterminal symbols. Again, this does not change the grammar's produced language.^[4]^: 93
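The chain construction can be sketched as below. This is an illustrative version only: the grammar is assumed to be a dict from nonterminals to sets of right-hand-side tuples, and the fresh names (`A_1`, `A_2`, ... built from the left-hand side and a counter) are assumed not to collide with existing symbols.

```python
def bin_step(rules):
    """Split every right-hand side longer than 2 into a chain of binary rules."""
    new_rules = {lhs: set() for lhs in rules}
    counter = 0
    for lhs, bodies in rules.items():
        for body in sorted(bodies):  # sorted only to make the fresh names deterministic
            head, rest = lhs, body
            while len(rest) > 2:
                counter += 1
                fresh = f"{lhs}_{counter}"  # hypothetical naming scheme for A_i
                new_rules.setdefault(head, set()).add((rest[0], fresh))
                head, rest = fresh, rest[1:]
            new_rules.setdefault(head, set()).add(rest)
    return new_rules


grammar = {
    "A": {("B", "C", "D", "E")},
    "B": {("b",)}, "C": {("c",)}, "D": {("d",)}, "E": {("e",)},
}
print(bin_step(grammar))
```

For a right-hand side of length n the loop creates n − 2 fresh nonterminals, matching the chain A₁, ..., A_{n-2} described above.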

DEL: Eliminate ε-rules

An ε-rule is a rule of the form

A → ε,

where A is not S₀, the grammar's start symbol.

To eliminate all rules of this form, first determine the set of all nonterminals that derive ε. Hopcroft and Ullman (1979) call such nonterminals nullable, and compute them as follows:

If a rule A → ε exists, then A is nullable.
If a rule A → X₁ ... X_n exists, and every single X_i is nullable, then A is nullable, too.
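The two rules above amount to a fixed-point iteration, which might be written as follows. The grammar representation (dict from nonterminals to sets of tuples, with the empty tuple for ε) and the function name `nullable_nonterminals` are assumptions of this sketch.

```python
def nullable_nonterminals(rules):
    """Return the set of nonterminals that derive the empty string."""
    nullable = set()
    changed = True
    while changed:  # repeat until no new nullable nonterminal is found
        changed = False
        for lhs, bodies in rules.items():
            if lhs in nullable:
                continue
            # all(...) is True for the empty body, which covers the rule A -> ε.
            if any(all(sym in nullable for sym in body) for body in bodies):
                nullable.add(lhs)
                changed = True
    return nullable


grammar = {
    "S0": {("A", "b", "B"), ("C",)},
    "B": {("A", "A"), ("A", "C")},
    "C": {("b",), ("c",)},
    "A": {("a",), ()},
}
print(nullable_nonterminals(grammar))
```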

Obtain an intermediate grammar by replacing each rule

A → X₁ ... X_n

by all versions with some nullable X_i omitted. By deleting in this grammar each ε-rule, unless its left-hand side is the start symbol, the transformed grammar is obtained.^[4]^: 90

For example, in the following grammar, with start symbol S₀,

S₀ → AbB | C

B → AA | AC

C → b | c

A → a | ε

the nonterminal A, and hence also B, is nullable, while neither C nor S₀ is. Hence the following intermediate grammar is obtained, in which omitted nonterminals are struck through:^{[note 3]}

S₀ → AbB | A̶bB | AbB̶ | A̶bB̶ | C

B → AA | A̶A | AA̶ | A̶εA̶ | AC | A̶C

C → b | c

A → a | ε

In this grammar, all ε-rules have been "inlined at the call site".^{[note 4]} In the next step, they can hence be deleted, yielding the grammar:

S₀ → AbB | Ab | bB | b | C

B → AA | A | AC | C

C → b | c

A → a

This grammar produces the same language as the original example grammar, viz. {ab,aba,abaa,abab,abac,abb,abc,b,ba,baa,bab,bac,bb,bc,c}, but has no ε-rules.
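The DEL step as a whole can be sketched as follows and applied to the example grammar. The sketch assumes grammars are dicts from nonterminals to sets of right-hand-side tuples, with the empty tuple for ε. The name `del_step` is illustrative.

```python
from itertools import product


def del_step(rules, start):
    """Remove ε-rules (except for the start symbol) by omitting nullable symbols."""
    # 1. Compute the nullable nonterminals by fixed-point iteration.
    nullable = set()
    changed = True
    while changed:
        changed = False
        for lhs, bodies in rules.items():
            if lhs not in nullable and any(
                all(sym in nullable for sym in body) for body in bodies
            ):
                nullable.add(lhs)
                changed = True

    # 2. For every rule, emit all versions with some nullable symbols omitted.
    new_rules = {}
    for lhs, bodies in rules.items():
        new_bodies = set()
        for body in bodies:
            # Each nullable symbol can be kept, (sym,), or omitted, ().
            options = [((s,), ()) if s in nullable else ((s,),) for s in body]
            for choice in product(*options):
                variant = tuple(s for part in choice for s in part)
                # 3. Drop ε-rules unless the left-hand side is the start symbol.
                if variant or lhs == start:
                    new_bodies.add(variant)
        new_rules[lhs] = new_bodies
    return new_rules


example = {
    "S0": {("A", "b", "B"), ("C",)},
    "B": {("A", "A"), ("A", "C")},
    "C": {("b",), ("c",)},
    "A": {("a",), ()},
}
result = del_step(example, "S0")
print(result)
```

For the example grammar, `result` contains the same rules as the ε-free grammar shown above: S₀ → AbB | Ab | bB | b | C, B → AA | A | AC | C, C → b | c and A → a.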

UNIT: Eliminate unit rules

A unit rule is a rule of the form

A → B,

where A, B are nonterminal symbols. To remove it, for each rule

B → X₁ ... X_n,

where X₁ ... X_n is a string of nonterminals and terminals, add rule

A → X₁ ... X_n

unless this is a unit rule which has already been (or is being) removed. The skipping of nonterminal symbol B in the resulting grammar is possible due to B being a member of the unit closure of nonterminal symbol A.^[9]
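One way to implement UNIT is through the unit closure mentioned above, rather than by removing unit rules one at a time: for each nonterminal A, collect every nonterminal reachable from A using unit rules only, then give A all non-unit right-hand sides of those nonterminals. The sketch below takes this closure-based approach. The dict-of-sets grammar representation and the name `unit_step` are assumptions.

```python
def unit_step(rules):
    """Eliminate unit rules A -> B using the unit closure of each nonterminal."""
    nonterminals = set(rules)

    def is_unit(body):
        return len(body) == 1 and body[0] in nonterminals

    new_rules = {}
    for lhs in rules:
        # Unit closure: nonterminals reachable from lhs through unit rules only.
        closure = {lhs}
        stack = [lhs]
        while stack:
            current = stack.pop()
            for body in rules[current]:
                if is_unit(body) and body[0] not in closure:
                    closure.add(body[0])
                    stack.append(body[0])
        # lhs inherits every non-unit right-hand side found in its closure.
        new_rules[lhs] = {
            body for member in closure for body in rules[member] if not is_unit(body)
        }
    return new_rules


grammar = {
    "S": {("A",), ("A", "B")},
    "A": {("B",), ("a",)},
    "B": {("b",), ("A", "A")},
}
print(unit_step(grammar))
```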

Order of transformations

Mutual preservation of transformation results
Transformation X always preserves (Y) resp. may destroy (N) the result of Y:
X \ Y	START	TERM	BIN	DEL	UNIT
START
TERM
BIN
DEL
UNIT				(Y)*
* UNIT preserves the result of DEL if START had been called before.

When choosing the order in which the above transformations are to be applied, it has to be considered that some transformations may destroy the result achieved by other ones. For example, START will re-introduce a unit rule if it is applied after UNIT. The table shows which orderings are admitted.

Moreover, the worst-case bloat in grammar size^{[note 5]} depends on the transformation order. Using |G| to denote the size of the original grammar G, the size blow-up in the worst case may range from |G|² to 2^{2 |G|}, depending on the transformation algorithm used.^[8]^: 7 The blow-up in grammar size depends on the order between DEL and BIN. It may be exponential when DEL is done first, but is linear otherwise. UNIT can incur a quadratic blow-up in the size of the grammar.^[8]^: 5 The orderings START,TERM,BIN,DEL,UNIT and START,BIN,DEL,UNIT,TERM lead to the least (i.e. quadratic) blow-up.
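The effect of the order between DEL and BIN can be made concrete with a single rule A → X₁ ... X_n in which every X_i is nullable. The following sketch counts only the rules that result for A and for the helper nonterminals introduced by BIN. The function names and the symbol names `X1`, `A1`, ... are chosen for illustration.

```python
from itertools import product


def del_variants(body, nullable):
    """All non-empty versions of body with some nullable symbols omitted."""
    options = [((s,), ()) if s in nullable else ((s,),) for s in body]
    variants = {tuple(s for part in choice for s in part) for choice in product(*options)}
    variants.discard(())
    return variants


def rules_after_del_first(n):
    """Number of rules for A when DEL is applied to A -> X1 ... Xn directly."""
    body = tuple(f"X{i}" for i in range(1, n + 1))
    return len(del_variants(body, set(body)))


def rules_after_bin_first(n):
    """Number of rules when BIN is applied first and DEL afterwards (n >= 2)."""
    symbols = [f"X{i}" for i in range(1, n + 1)]
    helpers = [f"A{i}" for i in range(1, n - 1)]  # A1 ... A_{n-2}
    # The helpers are nullable too, because all X_i are.
    nullable = set(symbols) | set(helpers)
    chain = [(symbols[i], helpers[i]) for i in range(n - 2)]
    chain.append((symbols[-2], symbols[-1]))
    return sum(len(del_variants(body, nullable)) for body in chain)


for n in (5, 10, 20):
    print(n, rules_after_del_first(n), rules_after_bin_first(n))
```

With DEL first, the single rule expands to 2ⁿ − 1 non-empty variants. With BIN first, each of the n − 1 binary rules yields at most three variants, so the count grows linearly.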

Example

The following grammar, with start symbol Expr, describes a simplified version of the set of all syntactically valid arithmetic expressions in programming languages like C or Algol60. Both number and variable are considered terminal symbols here for simplicity, since in a compiler front end their internal structure is usually not considered by the parser. The terminal symbol "^" denoted exponentiation in Algol60.

Expr → Term | Expr AddOp Term | AddOp Term
Term → Factor | Term MulOp Factor
Factor → Primary | Factor ^ Primary
Primary → number | variable | ( Expr )
AddOp → + | −
MulOp → * | /

In step "START" of the above conversion algorithm, just a rule S₀→Expr is added to the grammar. After step "TERM", the grammar looks like this:

S₀ → Expr
Expr → Term | Expr AddOp Term | AddOp Term
Term → Factor | Term MulOp Factor
Factor → Primary | Factor PowOp Primary
Primary → number | variable | Open Expr Close
AddOp → + | −
MulOp → * | /
PowOp → ^
Open → (
Close → )

After step "BIN", the following grammar is obtained:

S₀ → Expr
Expr → Term | Expr AddOp_Term | AddOp Term
Term → Factor | Term MulOp_Factor
Factor → Primary | Factor PowOp_Primary
Primary → number | variable | Open Expr_Close
AddOp → + | −
MulOp → * | /
PowOp → ^
Open → (
Close → )
AddOp_Term → AddOp Term
MulOp_Factor → MulOp Factor
PowOp_Primary → PowOp Primary
Expr_Close → Expr Close

Since there are no ε-rules, step "DEL" does not change the grammar. After step "UNIT", the following grammar is obtained, which is in Chomsky normal form:

S₀ → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
Expr → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor | Expr AddOp_Term | AddOp Term
Term → number | variable | Open Expr_Close | Factor PowOp_Primary | Term MulOp_Factor
Factor → number | variable | Open Expr_Close | Factor PowOp_Primary
Primary → number | variable | Open Expr_Close
AddOp → + | −
MulOp → * | /
PowOp → ^
Open → (
Close → )
AddOp_Term → AddOp Term
MulOp_Factor → MulOp Factor
PowOp_Primary → PowOp Primary
Expr_Close → Expr Close

The N_a introduced in step "TERM" are PowOp, Open, and Close. The A_i introduced in step "BIN" are AddOp_Term, MulOp_Factor, PowOp_Primary, and Expr_Close.

Alternative definition

Chomsky reduced form

Another way^[4]^: 92^[10] to define the Chomsky normal form is:

A formal grammar is in Chomsky reduced form if all of its production rules are of the form:

A → BC, or

A → a,

where A, B and C are nonterminal symbols, and a is a terminal symbol. When using this definition, B or C may be the start symbol. Only those context-free grammars which do not generate the empty string can be transformed into Chomsky reduced form.

Floyd normal form

In a letter in which he proposed the term Backus–Naur form (BNF), Donald E. Knuth implied that a BNF "syntax in which all definitions have such a form may be said to be in 'Floyd Normal Form'",

⟨A⟩ ::= ⟨B⟩ | ⟨C⟩, or

⟨A⟩ ::= ⟨B⟩⟨C⟩, or

⟨A⟩ ::= a,

where ⟨A⟩, ⟨B⟩ and ⟨C⟩ are nonterminal symbols, and a is a terminal symbol, because Robert W. Floyd had found in 1961 that any BNF syntax can be converted to this form.^[11] However, Knuth withdrew the term, "since doubtless many people have independently used this simple fact in their own work, and the point is only incidental to the main considerations of Floyd's note."^[12] While Floyd's note cites Chomsky's original 1959 article, Knuth's letter does not.

Application

Besides its theoretical significance, CNF conversion is used as a preprocessing step in some algorithms, e.g., the CYK algorithm, a bottom-up parsing algorithm for context-free grammars, and its probabilistic variant.^[13]
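To illustrate why the restricted rule shapes are convenient, here is a minimal sketch of a CYK-style recognizer for a grammar that is already in Chomsky normal form. It is a textbook-style illustration rather than an optimized parser. The grammar representation (dict from nonterminals to sets of tuples, empty tuple for ε) and the name `cyk_accepts` are assumptions of this example.

```python
def cyk_accepts(rules, start, word):
    """Decide whether word is derivable from start, for a grammar in CNF."""
    n = len(word)
    if n == 0:
        # The empty word is accepted only through the rule start -> ε.
        return () in rules.get(start, set())
    # table[i][l - 1] holds the nonterminals deriving word[i : i + l].
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, symbol in enumerate(word):
        for lhs, bodies in rules.items():
            if (symbol,) in bodies:  # rules of the form A -> a
                table[i][0].add(lhs)
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            for split in range(1, length):
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, bodies in rules.items():
                    # Rules of the form A -> BC with B in left and C in right.
                    if any(len(b) == 2 and b[0] in left and b[1] in right for b in bodies):
                        table[i][length - 1].add(lhs)
    return start in table[0][n - 1]


# A CNF grammar for the language { a^n b^n : n >= 0 }.
anbn = {
    "S0": {(), ("A", "B"), ("A", "T")},
    "S": {("A", "B"), ("A", "T")},
    "T": {("S", "B")},
    "A": {("a",)},
    "B": {("b",)},
}
print(cyk_accepts(anbn, "S0", "aabb"))
```

Because every right-hand side has at most two symbols, each table cell only needs to consider a single split point per rule, which gives the cubic running time in the length of the word.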

Notes

1. that is, one that produces the same language
2. For example, Hopcroft, Ullman (1979) merged TERM and BIN into a single transformation.
3. indicating a kept and an omitted nonterminal N by N and N̶, respectively
4. If the grammar had a rule S₀ → ε, it could not be "inlined", since it had no "call sites". Therefore it could not be deleted in the next step.
5. i.e. written length, measured in symbols

References

1. Chomsky, Noam (1959). "On Certain Formal Properties of Grammars". Information and Control. 2 (2): 137–167. doi:10.1016/S0019-9958(59)90362-6. Here: Sect.6, p.152ff.
2. D'Antoni, Loris. "Page 7, Lecture 9: Bottom-up Parsing Algorithms" (PDF). CS536-S21 Intro to Programming Languages and Compilers. University of Wisconsin-Madison. Archived (PDF) from the original on 2021-07-19.
3. Sipser, Michael (2006). Introduction to the theory of computation (2nd ed.). Boston: Thomson Course Technology. Definition 2.8. ISBN 0-534-95097-3. OCLC 58544333.
4. Hopcroft, John E.; Ullman, Jeffrey D. (1979). Introduction to Automata Theory, Languages and Computation. Reading, Massachusetts: Addison-Wesley Publishing. ISBN 978-0-201-02988-8.
5. Hopcroft, John E.; Motwani, Rajeev; Ullman, Jeffrey D. (2006). Introduction to Automata Theory, Languages, and Computation (3rd ed.). Addison-Wesley. ISBN 978-0-321-45536-9. Section 7.1.5, p.272
6. Rich, Elaine (2007). "11.8 Normal Forms". Automata, Computability, and Complexity: Theory and Applications (PDF) (1st ed.). Prentice-Hall. p. 169. ISBN 978-0132288064. Archived from the original (PDF) on 2023-01-17.
7. Wegener, Ingo (1993). Theoretische Informatik - Eine algorithmenorientierte Einführung. Leitfäden und Monographien der Informatik (in German). Stuttgart: B. G. Teubner. ISBN 978-3-519-02123-0. Section 6.2 "Die Chomsky-Normalform für kontextfreie Grammatiken", p. 149–152
8. Lange, Martin; Leiß, Hans (2009). "To CNF or not to CNF? An Efficient Yet Presentable Version of the CYK Algorithm" (PDF). Informatica Didactica. 8. Archived (PDF) from the original on 2011-07-19.
9. Allison, Charles D. (2022). Foundations of Computing: An Accessible Introduction to Automata and Formal Languages. Fresh Sources, Inc. p. 176. ISBN 9780578944173.
10. Hopcroft et al. (2006)
11. Floyd, Robert W. (1961). "Note on mathematical induction in phrase structure grammars" (PDF). Information and Control. 4 (4): 353–358. doi:10.1016/S0019-9958(61)80052-1. Archived (PDF) from the original on 2021-03-05. Here: p.354
12. Knuth, Donald E. (December 1964). "Backus Normal Form vs. Backus Naur Form". Communications of the ACM. 7 (12): 735–736. doi:10.1145/355588.365140. S2CID 47537431.
13. Jurafsky, Daniel; Martin, James H. (2008). Speech and Language Processing (2nd ed.). Pearson Prentice Hall. p. 465. ISBN 978-0-13-187321-6.
