Terminal and nonterminal symbols

Last updated July 18, 2024

In formal languages, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. Terminal symbols are the elementary symbols of the language defined as part of a formal grammar. Nonterminal symbols (or syntactic variables) are replaced by groups of terminal symbols according to the production rules.

Terminal symbols

Terminal symbols are symbols that may appear in the outputs of the production rules of a formal grammar and which cannot be changed using the rules of the grammar. Applying the rules recursively to a source string of symbols will usually terminate in a final output string consisting only of terminal symbols.

Consider a grammar defined by two rules. In this grammar, the symbol Б is a terminal symbol and Ψ is both a nonterminal symbol and the start symbol. The production rules for creating strings are as follows:

The symbol Ψ can become БΨ
The symbol Ψ can become Б

Here Б is a terminal symbol because no rule exists which would change it into something else. On the other hand, Ψ has two rules that can change it, thus it is nonterminal. A formal language defined or generated by a particular grammar is the set of strings that can be produced by the grammar and that consist only of terminal symbols. Diagram 1 illustrates a string that can be produced with this grammar.
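
The two rules above can be simulated directly. The following is a minimal Python sketch, not a standard implementation; the names `RULES`, `START` and `derive` are illustrative assumptions. It applies the rule Ψ → БΨ a chosen number of times and then finishes with Ψ → Б, recording every intermediate string.

```python
# Production rules of the example grammar: the nonterminal Ψ has two bodies.
RULES = {"Ψ": ["БΨ", "Б"]}
START = "Ψ"

def derive(n_recursive_steps):
    """Apply Ψ → БΨ n times, then Ψ → Б once, recording each intermediate string."""
    current = START
    forms = [current]
    for _ in range(n_recursive_steps):
        # Rewrite the single occurrence of Ψ using the recursive rule.
        current = current.replace("Ψ", RULES["Ψ"][0], 1)
        forms.append(current)
    # Finish with the non-recursive rule, leaving only terminal symbols.
    current = current.replace("Ψ", RULES["Ψ"][1], 1)
    forms.append(current)
    return forms

print(" ⇒ ".join(derive(2)))  # Ψ ⇒ БΨ ⇒ ББΨ ⇒ БББ
```

Every string in the language of this grammar is a nonempty run of Б, and only the last string of each derivation is free of the nonterminal Ψ.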

Nonterminal symbols

Nonterminal symbols are those symbols that can be replaced. They may also be called simply syntactic variables. A formal grammar includes a start symbol, a designated member of the set of nonterminals from which all the strings in the language may be derived by successive applications of the production rules. In fact, the language defined by a grammar is precisely the set of terminal strings that can be so derived.

Context-free grammars are those grammars in which the left-hand side of each production rule consists of only a single nonterminal symbol. This restriction is non-trivial; not all languages can be generated by context-free grammars. Those that can are called context-free languages. These are exactly the languages that can be recognized by a non-deterministic pushdown automaton. Context-free languages are the theoretical basis for the syntax of most programming languages.
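
The context-free restriction is easy to check mechanically. The sketch below assumes a hypothetical representation in which each production is a `(head, body)` pair of symbol tuples; the function name `is_context_free` is likewise an illustration rather than an established API.

```python
def is_context_free(productions, nonterminals):
    """Return True if every head consists of exactly one nonterminal symbol."""
    return all(
        len(head) == 1 and head[0] in nonterminals
        for head, _body in productions
    )

nonterminals = {"S"}

# S → aSb and S → ε (empty body): every head is the single nonterminal S.
balanced = [(("S",), ("a", "S", "b")), (("S",), ())]

# aS → aaS: the head has two symbols, so the rule depends on context.
contextual = [(("a", "S"), ("a", "a", "S"))]

print(is_context_free(balanced, nonterminals))    # True
print(is_context_free(contextual, nonterminals))  # False
```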

Production rules

A grammar is defined by production rules (or just 'productions') that specify which symbols may replace which other symbols; these rules may be used to generate strings, or to parse them. Each such rule has a head, or left-hand side, which consists of the string that may be replaced, and a body, or right-hand side, which consists of a string that may replace it. Rules are often written in the form head → body; e.g., the rule a → b specifies that a can be replaced by b.
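
As a rough illustration of what applying a rule head → body means, the following Python sketch rewrites one occurrence of the head inside a string. The function `apply_rule` and its `position` parameter are assumptions made for this example, and symbols are modelled as single characters for simplicity.

```python
def apply_rule(string, head, body, position=0):
    """Replace the first occurrence of head at or after position with body.

    Returns the rewritten string, or None if head does not occur there.
    """
    index = string.find(head, position)
    if index == -1:
        return None
    return string[:index] + body + string[index + len(head):]

print(apply_rule("xay", "a", "b"))   # xby
print(apply_rule("xyz", "a", "b"))   # None
```

Generating a string amounts to repeating such steps from the start symbol until no nonterminal remains; parsing runs the search in the opposite direction.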

In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s,^[2]^[3] a grammar G consists of the following components:

A finite set $N$ of nonterminal symbols.
A finite set $Σ$ of terminal symbols that is disjoint from $N$ .
A finite set $P$ of production rules, each rule of the form

$(\Sigma \cup N)^{*}N(\Sigma \cup N)^{*}\rightarrow (\Sigma \cup N)^{*}$

where ${}^{*}$ is the Kleene star operator and $\cup$ denotes set union, so $(\Sigma \cup N)^{*}$ represents zero or more symbols, and $N$ means one nonterminal symbol. That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. In the case that the body consists solely of the empty string ^{[note 1]}, it may be denoted with a special notation (often $Λ$, $e$ or $ε$) in order to avoid confusion.

A distinguished symbol $S\in N$ that is the start symbol.

A grammar is formally defined as the ordered quadruple $\langle N,\Sigma ,P,S\rangle$ . Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature.^[4]^[5]
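
The quadruple can be modelled as a small data structure. The Python sketch below is one possible encoding under stated assumptions: the class name `Grammar`, its field names, and the `validate` method are hypothetical, and productions are stored as `(head, body)` pairs of symbol tuples. The checks mirror the conditions listed above: the two symbol sets are disjoint, the start symbol belongs to the nonterminals, and every head contains at least one nonterminal.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Grammar:
    nonterminals: frozenset  # N
    terminals: frozenset     # Σ
    productions: tuple       # P, as (head, body) pairs of symbol tuples
    start: str               # S

    def validate(self):
        """Return a list of violated conditions (empty if well-formed)."""
        problems = []
        if self.nonterminals & self.terminals:
            problems.append("N and Σ are not disjoint")
        if self.start not in self.nonterminals:
            problems.append("start symbol is not in N")
        symbols = self.nonterminals | self.terminals
        for head, body in self.productions:
            if not all(s in symbols for s in head + body):
                problems.append(f"unknown symbol in {head} -> {body}")
            if not any(s in self.nonterminals for s in head):
                problems.append(f"head {head} contains no nonterminal")
        return problems

# The two-rule grammar from the terminal-symbols section: Ψ → БΨ and Ψ → Б.
g = Grammar(
    nonterminals=frozenset({"Ψ"}),
    terminals=frozenset({"Б"}),
    productions=((("Ψ",), ("Б", "Ψ")), (("Ψ",), ("Б",))),
    start="Ψ",
)
print(g.validate())  # []
```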

Example

Backus–Naur form is a notation for expressing certain grammars. For instance, the following production rules in Backus–Naur form are used to represent an integer (which may be signed):

<digit> ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
<integer> ::= ['-'] <digit> {<digit>}

In this example, the symbols (-,0,1,2,3,4,5,6,7,8,9) are terminal symbols and <digit> and <integer> are nonterminal symbols. ^{[note 2]}
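
A hand-written recognizer shows how the two rules map onto code. The Python sketch below is illustrative only; the names `DIGITS` and `is_integer` are assumptions, and the square brackets and braces are read as "optional" and "zero or more repetitions" respectively.

```python
DIGITS = set("0123456789")  # the terminals that <digit> can produce

def is_integer(text):
    """Recognize strings generated by <integer> ::= ['-'] <digit> {<digit>}."""
    i = 0
    if i < len(text) and text[i] == "-":         # optional sign
        i += 1
    if i >= len(text) or text[i] not in DIGITS:  # one required <digit>
        return False
    i += 1
    while i < len(text) and text[i] in DIGITS:   # zero or more further digits
        i += 1
    return i == len(text)                        # no trailing symbols allowed

print(is_integer("0056"))  # True
print(is_integer("-0"))    # True
print(is_integer("-"))     # False
```

The recognizer accepts leading zeroes and negative zero because the grammar itself permits them.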

Another example is:

$S \rightarrow cAd$

$A \rightarrow a \mid ab$

In this example, the symbols $a,b,c,d$ are terminal symbols and $S,A$ are nonterminal symbols.
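
Because this grammar contains no recursion, its whole language can be enumerated. The following Python sketch does so by breadth-first search over leftmost derivations; the names `RULES` and `language` are illustrative assumptions, and the loop would not terminate for a grammar with recursive rules.

```python
from collections import deque

# S → cAd and A → a | ab, with each symbol written as a single character.
RULES = {"S": ["cAd"], "A": ["a", "ab"]}

def language(start="S", rules=RULES):
    """Enumerate every terminal string derivable from start.

    Terminates only because this particular grammar has no recursion.
    """
    results = set()
    queue = deque([start])
    while queue:
        form = queue.popleft()
        # Locate the leftmost nonterminal, if any remains.
        pos = next((i for i, ch in enumerate(form) if ch in rules), None)
        if pos is None:
            results.add(form)  # only terminal symbols are left
            continue
        for body in rules[form[pos]]:
            queue.append(form[:pos] + body + form[pos + 1:])
    return results

print(sorted(language()))  # ['cabd', 'cad']
```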

Notes

↑ It contains no symbols at all.
↑ This example supports strings with leading zeroes like "0056" or "0000", as well as negative zero strings like "-0" and "-00000".

References

↑ Rosen, K. H. (2012). Discrete Mathematics and Its Applications. McGraw-Hill. pp. 847–851.
↑ Chomsky, Noam (1956). "Three Models for the Description of Language". IRE Transactions on Information Theory. 2 (3): 113–123. doi:10.1109/TIT.1956.1056813. S2CID 19519474.
↑ Chomsky, Noam (1957). Syntactic Structures. The Hague: Mouton.
↑ Ginsburg, Seymour (1975). Algebraic and automata theoretic properties of formal languages. North-Holland. pp. 8–9. ISBN 0-7204-2506-9.
↑ Harrison, Michael A. (1978). Introduction to Formal Language Theory. Reading, Mass.: Addison-Wesley Publishing Company. p. 13. ISBN 0-201-02955-3.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.