Terminal and nonterminal symbols

Last updated
The string "the dog ate the bone" was created using production rules that replaced non-terminal with terminal symbols. Terminal and non-terminal symbols example.png
The string "the dog ate the bone" was created using production rules that replaced non-terminal with terminal symbols.

In formal languages, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. Terminal symbols are the elementary symbols of the language defined as part of a formal grammar. Nonterminal symbols (or syntactic variables) are replaced by groups of terminal symbols according to the production rules.

Contents

The terminals and nonterminals of a particular grammar are in two completely separate sets.

Terminal symbols

Terminal symbols are symbols that may appear in the outputs of the production rules of a formal grammar and which cannot be changed using the rules of the grammar. Applying the rules recursively to a source string of symbols will usually terminate in a final output string consisting only of terminal symbols.

Consider a grammar defined by two rules. In this grammar, the symbol Б is a terminal symbol and Ψ is both a non-terminal symbol and the start symbol. The production rules for creating strings are as follows:

  1. The symbol Ψ can become БΨ
  2. The symbol Ψ can become Б

Here Б is a terminal symbol because no rule exists which would change it into something else. On the other hand, Ψ has two rules that can change it, thus it is nonterminal. A formal language defined or generated by a particular grammar is the set of strings that can be produced by the grammar and that consist only of terminal symbols. Diagram 1 illustrates a string that can be produced with this grammar.

Diagram 1. The string B B B B was formed by the grammar defined by the given production rules. This grammar can create strings with any number of the symbol B Terminal and non-terminal symbols example number two.png
Diagram 1. The string Б Б Б Б was formed by the grammar defined by the given production rules. This grammar can create strings with any number of the symbol Б

Nonterminal symbols

Nonterminal symbols are those symbols that can be replaced. They may also be called simply syntactic variables. A formal grammar includes a start symbol, a designated member of the set of nonterminals from which all the strings in the language may be derived by successive applications of the production rules. In fact, the language defined by a grammar is precisely the set of terminal strings that can be so derived.

Context-free grammars are those grammars in which the left-hand side of each production rule consists of only a single nonterminal symbol. This restriction is non-trivial; not all languages can be generated by context-free grammars. Those that can are called context-free languages. These are exactly the languages that can be recognized by a non-deterministic push down automaton. Context-free languages are the theoretical basis for the syntax of most programming languages.

Production rules

A grammar is defined by production rules (or just 'productions') that specify which symbols may replace which other symbols; these rules may be used to generate strings, or to parse them. Each such rule has a head, or left-hand side, which consists of the string that may be replaced, and a body, or right-hand side, which consists of a string that may replace it. Rules are often written in the form headbody; e.g., the rule ab specifies that a can be replaced by b.

In the classic formalization of generative grammars first proposed by Noam Chomsky in the 1950s, [2] [3] a grammar G consists of the following components:

where is the Kleene star operator and denotes set union, so represents zero or more symbols, and N means one nonterminal symbol. That is, each production rule maps from one string of symbols to another, where the first string contains at least one nonterminal symbol. In the case that the body consists solely of the empty string [note 1] , it may be denoted with a special notation (often Λ, e or ε) in order to avoid confusion.

A grammar is formally defined as the ordered quadruple . Such a formal grammar is often called a rewriting system or a phrase structure grammar in the literature. [4] [5]

Example

Backus–Naur form is a notation for expressing certain grammars. For instance, the following production rules in Backus-Naur form are used to represent an integer (which may be signed):

<digit>::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9' <integer>::= ['-'] <digit> {<digit>} 

In this example, the symbols (-,0,1,2,3,4,5,6,7,8,9) are terminal symbols and <digit> and <integer> are nonterminal symbols. [note 2]

Another example is:

In this example, the symbols a,b,c,d are terminal symbols and S,A are nonterminal symbols.

See also

Notes

  1. It contains no symbols at all.
  2. This example supports strings with leading zeroes like "0056" or "0000", as well as negative zero strings like "-0" and "-00000".


Related Research Articles

<span class="mw-page-title-main">Chomsky hierarchy</span> Hierarchy of classes of formal grammars

The Chomsky hierarchy in the fields of formal language theory, computer science, and linguistics, is a containment hierarchy of classes of formal grammars. A formal grammar describes how to form strings from a language's vocabulary that are valid according to the language's syntax. Linguist Noam Chomsky theorized that four different classes of formal grammars existed that could generate increasingly complex languages. Each class can also completely generate the language of all inferior classes.

A context-sensitive grammar (CSG) is a formal grammar in which the left-hand sides and right-hand sides of any production rules may be surrounded by a context of terminal and nonterminal symbols. Context-sensitive grammars are more general than context-free grammars, in the sense that there are languages that can be described by a CSG but not by a context-free grammar. Context-sensitive grammars are less general than unrestricted grammars. Thus, CSGs are positioned between context-free and unrestricted grammars in the Chomsky hierarchy.

<span class="mw-page-title-main">Context-free grammar</span> Type of formal grammar

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In formal language theory, a context-free grammar, G, is said to be in Chomsky normal form if all of its production rules are of the form:

<span class="mw-page-title-main">Formal language</span> Sequence of words formed by specific rules

In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.

In theoretical computer science and formal language theory, a regular grammar is a grammar that is right-regular or left-regular. While their exact definition varies from textbook to textbook, they all require that

In computer science, an LL parser is a top-down parser for a restricted context-free language. It parses the input from Left to right, performing Leftmost derivation of the sentence.

In computer science, Backus-Naur form is a notation used to describe the syntax of programming languages or other formal languages. It was developed by John Backus and Peter Naur. BNF can be described as a metasyntax notation for context-free grammars. Backus-Naur form is applied wherever exact descriptions of languages are needed, such as in official language specifications, in manuals, and in textbooks on programming language theory. BNF can be used to describe document formats, instruction sets, and communication protocols.

In computer science, extended Backus–Naur form (EBNF) is a family of metasyntax notations, any of which can be used to express a context-free grammar. EBNF is used to make a formal description of a formal language such as a computer programming language. They are extensions of the basic Backus–Naur form (BNF) metasyntax notation.

Categorial grammar is a family of formalisms in natural language syntax that share the central assumption that syntactic constituents combine as functions and arguments. Categorial grammar posits a close relationship between the syntax and semantic composition, since it typically treats syntactic categories as corresponding to semantic types. Categorial grammars were developed in the 1930s by Kazimierz Ajdukiewicz and in the 1950s by Yehoshua Bar-Hillel and Joachim Lambek. It saw a surge of interest in the 1970s following the work of Richard Montague, whose Montague grammar assumed a similar view of syntax. It continues to be a major paradigm, particularly within formal semantics.

Top-Down Parsing Language (TDPL) is a type of analytic formal grammar developed by Alexander Birman in the early 1970s in order to study formally the behavior of a common class of practical top-down parsers that support a limited form of backtracking. Birman originally named his formalism the TMG Schema (TS), after TMG, an early parser generator, but it was later given the name TDPL by Aho and Ullman in their classic anthology The Theory of Parsing, Translation and Compiling.

In theoretical computer science and mathematical logic a string rewriting system (SRS), historically called a semi-Thue system, is a rewriting system over strings from a alphabet. Given a binary relation between fixed strings over the alphabet, called rewrite rules, denoted by , an SRS extends the rewriting relation to all strings in which the left- and right-hand side of the rules appear as substrings, that is , where , , , and are strings.

Conjunctive grammars are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with a conjunction operation. Besides explicit conjunction, conjunctive grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction can be used, in particular, to specify intersection of languages. A further extension of conjunctive grammars known as Boolean grammars additionally allows explicit negation.

Boolean grammars, introduced by Okhotin, are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with conjunction and negation operations. Besides these explicit operations, Boolean grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction and negation can be used, in particular, to specify intersection and complement of languages. An intermediate class of grammars known as conjunctive grammars allows conjunction and disjunction, but not negation.

In formal language theory, an alphabet, sometimes called a vocabulary, is a non-empty set of indivisible symbols/glyphs, typically thought of as representing letters, characters, digits, phonemes, or even words. Alphabets in this technical sense of a set are used in a diverse range of fields including logic, mathematics, computer science, and linguistics. An alphabet may have any cardinality ("size") and, depending on its purpose, may be finite, countable, or even uncountable.

In formal language theory, a grammar is noncontracting if for all of its production rules, α → β, it holds that |α| ≤ |β|, that is β has at least as many symbols as α. A grammar is essentially noncontracting if there may be one exception, namely, a rule S → ε where S is the start symbol and ε the empty string, and furthermore, S never occurs in the right-hand side of any rule.

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

A production or production rule in computer science is a rewrite rule specifying a symbol substitution that can be recursively performed to generate new symbol sequences. A finite set of productions is the main component in the specification of a formal grammar. The other components are a finite set of nonterminal symbols, a finite set of terminal symbols that is disjoint from and a distinguished symbol that is the start symbol.

<span class="mw-page-title-main">Formal grammar</span> Structure of a formal language

A formal grammar describes how to form strings from an alphabet of a formal language that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

<span class="mw-page-title-main">LL grammar</span> Type of a context-free grammar

In formal language theory, an LL grammar is a context-free grammar that can be parsed by an LL parser, which parses the input from Left to right, and constructs a Leftmost derivation of the sentence. A language that has an LL grammar is known as an LL language. These form subsets of deterministic context-free grammars (DCFGs) and deterministic context-free languages (DCFLs), respectively. One says that a given grammar or language "is an LL grammar/language" or simply "is LL" to indicate that it is in this class.

References

  1. Rosen, K. H. (2012). Discrete mathematics and its applications. McGraw-Hill. pages 847-851
  2. Chomsky, Noam (1956). "Three Models for the Description of Language". IRE Transactions on Information Theory . 2 (3): 113–123. doi:10.1109/TIT.1956.1056813. S2CID   19519474.
  3. Chomsky, Noam (1957). Syntactic Structures . The Hague: Mouton.
  4. Ginsburg, Seymour (1975). Algebraic and automata theoretic properties of formal languages. North-Holland. pp. 8–9. ISBN   0-7204-2506-9.
  5. Harrison, Michael A. (1978). Introduction to Formal Language Theory . Reading, Mass.: Addison-Wesley Publishing Company. pp.  13. ISBN   0-201-02955-3.