Regular tree grammar

Last updated February 19, 2024

In theoretical computer science and formal language theory, a regular tree grammar is a formal grammar that describes a set of directed trees, or terms.^[1] A regular word grammar can be seen as a special kind of regular tree grammar, describing a set of single-path trees.

Definition

A regular tree grammar G is defined by the tuple

G = (N, Σ, Z, P),

where

N is a finite set of nonterminals,
Σ is a ranked alphabet (i.e., an alphabet whose symbols have an associated arity) disjoint from N,
Z is the starting nonterminal, with $Z \in N$ , and
P is a finite set of productions of the form A → t, with $A \in N$ , and $t \in T Σ (N)$ , where T_Σ(N) is the associated term algebra, i.e. the set of all trees composed from symbols in $Σ \cup N$ according to their arities, where nonterminals are considered nullary.

Derivation of trees

The grammar G implicitly defines a set of trees: any tree that can be derived from Z using the rule set P is said to be described by G. This set of trees is known as the language of G. More formally, the relation ⇒_G on the set T_Σ(N) is defined as follows:

A tree $t 1 \in T Σ (N)$ can be derived in a single step into a tree $t 2 \in T Σ (N)$ (in short: t₁ ⇒_Gt₂), if there is a context S and a production $(A \to t) \in P$ such that:

t₁ = S[A], and
t₂ = S[t].

Here, a context means a tree with exactly one hole in it; if S is such a context, S[t] denotes the result of filling the tree t into the hole of S.

The tree language generated by G is the language $L (G) = {t \in T Σ | Z \Rightarrow G * t}$ .

Here, T_Σ denotes the set of all trees composed from symbols of Σ, while ⇒_G* denotes successive applications of ⇒_G.

A language generated by some regular tree grammar is called a regular tree language .

Examples

Let G₁ = (N₁,Σ₁,Z₁,P₁), where

N₁ = {Bool, BList } is our set of nonterminals,
Σ₁ = { true, false, nil, cons(.,.) } is our ranked alphabet, arities indicated by dummy arguments (i.e. the symbol cons has arity 2),
Z₁ = BList is our starting nonterminal, and
the set P₁ consists of the following productions:
- Bool → false
- Bool → true
- BList → nil
- BList → cons(Bool,BList)

An example derivation from the grammar G₁ is

BList ⇒ cons(Bool,BList) ⇒ cons(false,cons(Bool,BList)) ⇒ cons(false,cons(true,nil)).

The image shows the corresponding derivation tree; it is a tree of trees (main picture), whereas a derivation tree in word grammars is a tree of strings (upper left table).

The tree language generated by G₁ is the set of all finite lists of boolean values, that is, L(G₁) happens to equal T_Σ1. The grammar G₁ corresponds to the algebraic data type declarations (in the Standard ML programming language):

datatypeBool=false|truedatatypeBList=nil|consofBool*BList

Every member of L(G₁) corresponds to a Standard-ML value of type BList.

For another example, let $G 2 = (N 1, Σ 1, BList 1, P 1 \cup P 2)$ , using the nonterminal set and the alphabet from above, but extending the production set by P₂, consisting of the following productions:

BList₁ → cons(true,BList)
BList₁ → cons(false,BList₁)

The language L(G₂) is the set of all finite lists of boolean values that contain true at least once. The set L(G₂) has no datatype counterpart in Standard ML, nor in any other functional language. It is a proper subset of L(G₁). The above example term happens to be in L(G₂), too, as the following derivation shows:

BList₁ ⇒ cons(false,BList₁) ⇒ cons(false,cons(true,BList)) ⇒ cons(false,cons(true,nil)).

Language properties

If L₁, L₂ both are regular tree languages, then the tree sets $L 1 \cap L 2, L 1 \cup L 2$ , and L₁ \ L₂ are also regular tree languages, and it is decidable whether $L 1 \subseteq L 2$ , and whether L₁ = L₂.

Alternative characterizations and relation to other formal languages

Regular tree grammars are a generalization of regular word grammars.
The regular tree languages are also the languages recognized by bottom-up tree automata and nondeterministic top-down tree automata.^[2]
Rajeev Alur and Parthasarathy Madhusudan related a subclass of regular binary tree languages to nested words and visibly pushdown languages.^[3]^[4]

Applications

Applications of regular tree grammars include:

Instruction selection in compiler code generation^[5]
A decision procedure for the first-order logic theory of formulas over equality (=) and set membership (∈) as the only predicates ^[6]
Solving constraints about mathematical sets^[7]
The set of all truths expressible in first-order logic about a finite algebra (which is always a regular tree language)^[8]
Graph-search ^[9]

Related Research Articles

<span class="mw-page-title-main">Chomsky hierarchy</span> Hierarchy of classes of formal grammars

The Chomsky hierarchy in the fields of formal language theory, computer science, and linguistics, is a containment hierarchy of classes of formal grammars. A formal grammar describes how to form strings from a language's vocabulary that are valid according to the language's syntax. Linguist Noam Chomsky theorized that four different classes of formal grammars existed that could generate increasingly complex languages. Each class can also completely generate the language of all inferior classes.

In formal language theory, a context-free grammar (CFG) is a formal grammar whose production rules can be applied to a nonterminal symbol regardless of its context. In particular, in a context-free grammar, each production rule is of the form

In formal language theory, a context-free language (CFL) is a language generated by a context-free grammar (CFG).

A finite-state machine (FSM) or finite-state automaton, finite automaton, or simply a state machine, is a mathematical model of computation. It is an abstract machine that can be in exactly one of a finite number of states at any given time. The FSM can change from one state to another in response to some inputs; the change from one state to another is called a transition. An FSM is defined by a list of its states, its initial state, and the inputs that trigger each transition. Finite-state machines are of two types—deterministic finite-state machines and non-deterministic finite-state machines. For any non-deterministic finite-state machine, an equivalent deterministic one can be constructed.

In the theory of computation, a branch of theoretical computer science, a pushdown automaton (PDA) is a type of automaton that employs a stack.

In theoretical computer science and formal language theory, a regular language is a formal language that can be defined by a regular expression, in the strict sense in theoretical computer science.

In theoretical computer science and formal language theory, a regular grammar is a grammar that is right-regular or left-regular. While their exact definition varies from textbook to textbook, they all require that

In logic and computer science, unification is an algorithmic process of solving equations between symbolic expressions, each of the form Left-hand side = Right-hand side. For example, using x,y,z as variables, the singleton equation set { cons(x,cons(x,nil)) = cons(2,y) } is a syntactic first-order unification problem that has the substitution { x ↦ 2, y ↦ cons(2,nil) } as its only solution.

A tree automaton is a type of state machine. Tree automata deal with tree structures, rather than the strings of more conventional state machines.

Automata theory is the study of abstract machines and automata, as well as the computational problems that can be solved using them. It is a theory in theoretical computer science with close connections to mathematical logic. The word automata comes from the Greek word αὐτόματος, which means "self-acting, self-willed, self-moving". An automaton is an abstract self-propelled computing device which follows a predetermined sequence of operations automatically. An automaton with a finite number of states is called a Finite Automaton (FA) or Finite-State Machine (FSM). The figure on the right illustrates a finite-state machine, which is a well-known type of automaton. This automaton consists of states and transitions. As the automaton sees a symbol of input, it makes a transition to another state, according to its transition function, which takes the previous state and current input symbol as its arguments.

In computer science, a list or sequence is an abstract data type that represents a finite number of ordered values, where the same value may occur more than once. An instance of a list is a computer representation of the mathematical concept of a tuple or finite sequence; the (potentially) infinite analog of a list is a stream. Lists are a basic example of containers, as they contain other values. If the same value occurs multiple times, each occurrence is considered a distinct item.

In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.

In computer programming languages, a recursive data type is a data type for values that may contain other values of the same type. Data of recursive types are usually viewed as directed graphs.

Conjunctive grammars are a class of formal grammars studied in formal language theory. They extend the basic type of grammars, the context-free grammars, with a conjunction operation. Besides explicit conjunction, conjunctive grammars allow implicit disjunction represented by multiple rules for a single nonterminal symbol, which is the only logical connective expressible in context-free grammars. Conjunction can be used, in particular, to specify intersection of languages. A further extension of conjunctive grammars known as Boolean grammars additionally allows explicit negation.

In formal languages, terminal and nonterminal symbols are the lexical elements used in specifying the production rules constituting a formal grammar. Terminal symbols are the elementary symbols of the language defined as part of a formal grammar. Nonterminal symbols are replaced by groups of terminal symbols according to the production rules.

In formal language theory, a grammar is noncontracting if for all of its production rules, α → β, it holds that |α| ≤ |β|, that is β has at least as many symbols as α. A grammar is essentially noncontracting if there may be one exception, namely, a rule S → ε where S is the start symbol and ε the empty string, and furthermore, S never occurs in the right-hand side of any rule.

Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.

A formal grammar describes which strings from an alphabet of a formal language are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.

In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.

In theoretical computer science, in particular in formal language theory, Greibach's theorem states that certain properties of formal language classes are undecidable. It is named after the computer scientist Sheila Greibach, who first proved it in 1963.

References

↑ "Regular tree grammars as a formalism for scope underspecification". CiteSeerX 10.1.1.164.5484 .
↑ Comon, Hubert; Dauchet, Max; Gilleron, Remi; Löding, Christof; Jacquemard, Florent; Lugiez, Denis; Tison, Sophie; Tommasi, Marc (12 October 2007). "Tree Automata Techniques and Applications" . Retrieved 25 January 2016.
↑ Alur, R.; Madhusudan, P. (2004). "Visibly pushdown languages" (PDF). Proceedings of the thirty-sixth annual ACM symposium on Theory of computing - STOC '04. pp. 202–211. doi:10.1145/1007352.1007390. ISBN 978-1581138528. S2CID 7473479. Sect.4, Theorem 5,
↑ Alur, R.; Madhusudan, P. (2009). "Adding nesting structure to words" (PDF). Journal of the ACM. 56 (3): 1–43. CiteSeerX 10.1.1.145.9971 . doi:10.1145/1516512.1516518. S2CID 768006. Sect.7
↑ Emmelmann, Helmut (1991). "Code Selection by Regularly Controlled Term Rewriting". Code Generation - Concepts, Tools, Techniques. Workshops in Computing. Springer. pp. 3–29.
↑ Comon, Hubert (1990). "Equational Formulas in Order-Sorted Algebras". Proc. ICALP.
↑ Gilleron, R.; Tison, S.; Tommasi, M. (1993). "Solving Systems of Set Constraints using Tree Automata". 10th Annual Symposium on Theoretical Aspects of Computer Science. LNCS. Vol. 665. Springer. pp. 505–514.
↑ Burghardt, Jochen (2002). "Axiomatization of Finite Algebras". Advances in Artificial Intelligence. LNAI. Vol. 2479. Springer. pp. 222–234. arXiv: 1403.7347 . Bibcode:2014arXiv1403.7347B. ISBN 3-540-44185-9.
↑ Ziv-Ukelson, Smoly (2016). Algorithms for Regular Tree Grammar Network Search and Their Application to Mining Human–viral Infection Patterns. J. of Comp. Bio.