Tree-adjoining grammar

Tree-adjoining grammar (TAG) is a grammar formalism defined by Aravind Joshi. Tree-adjoining grammars are somewhat similar to context-free grammars, but the elementary unit of rewriting is the tree rather than the symbol. Whereas context-free grammars have rules for rewriting symbols as strings of other symbols, tree-adjoining grammars have rules for rewriting the nodes of trees as other trees (see tree (graph theory) and tree (data structure)).

History

TAG originated in investigations by Joshi and his students into the family of adjunction grammars (AG), [1] the "string grammar" of Zellig Harris. [2] AGs handle exocentric properties of language in a natural and effective way, but do not have a good characterization of endocentric constructions; the converse is true of rewrite grammars, or phrase-structure grammar (PSG). In 1969, Joshi introduced a family of grammars that exploits this complementarity by mixing the two types of rules. A few very simple rewrite rules suffice to generate the vocabulary of strings for adjunction rules. This family is distinct from the Chomsky-Schützenberger hierarchy but intersects it in interesting and linguistically relevant ways. [3] The center strings and adjunct strings can also be generated by a dependency grammar, avoiding the limitations of rewrite systems entirely. [4] [5]

Description

The rules in a TAG are trees, known as elementary trees. There are two types of elementary trees in TAG: initial trees (often represented as 'α') and auxiliary trees ('β'). Initial trees represent basic valency relations, while auxiliary trees allow for recursion. [6] An auxiliary tree has a special leaf node known as the foot node, which is labeled with the same symbol as the tree's root (top) node. A derivation starts with an initial tree and combines it with further trees via either substitution or adjunction. Substitution replaces a frontier node with another tree whose top node has the same label. Adjunction inserts an auxiliary tree at an internal node: the root/foot label of the auxiliary tree must match the label of the node at which it adjoins, and the subtree formerly rooted at that node is reattached at the foot node. Adjunction can thus have the effect of inserting an auxiliary tree into the center of another tree. [4]
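As a rough illustration of substitution and adjunction, the following Python sketch builds a derived tree from three small elementary trees. The `Node` class, its `subst` and `foot` flags, the helper functions, and the example trees are all assumptions made for this illustration; they do not come from any particular TAG implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    subst: bool = False  # frontier node waiting for substitution
    foot: bool = False   # foot node of an auxiliary tree

    def copy(self) -> "Node":
        return Node(self.label, [c.copy() for c in self.children],
                    self.subst, self.foot)

    def leaves(self) -> List[str]:
        """Labels on the frontier, read left to right."""
        if not self.children:
            return [self.label]
        return [w for c in self.children for w in c.leaves()]


def find(node: Node, pred: Callable[[Node], bool]) -> Optional[Node]:
    """Return the first node in pre-order that satisfies pred, or None."""
    if pred(node):
        return node
    for child in node.children:
        hit = find(child, pred)
        if hit is not None:
            return hit
    return None


def substitute(site: Node, initial: Node) -> None:
    """Fill a substitution site, in place, with a copy of an initial tree."""
    if not site.subst or site.label != initial.label:
        raise ValueError("not a substitution site, or labels differ")
    site.children, site.subst = initial.copy().children, False


def adjoin(target: Node, aux: Node) -> None:
    """Splice a copy of an auxiliary tree in at target, in place."""
    new = aux.copy()
    foot = find(new, lambda n: n.foot)
    if foot is None or not (new.label == foot.label == target.label):
        raise ValueError("root, foot, and target labels must match")
    # The subtree that hung below target is reattached at the foot node.
    foot.children, foot.foot = target.children, False
    target.children = new.children


# Hypothetical elementary trees: two initial trees and one auxiliary tree.
alpha_sleeps = Node("S", [Node("NP", subst=True),
                          Node("VP", [Node("V", [Node("sleeps")])])])
alpha_john = Node("NP", [Node("John")])
beta_often = Node("VP", [Node("Adv", [Node("often")]),
                         Node("VP", foot=True)])

tree = alpha_sleeps.copy()
substitute(find(tree, lambda n: n.subst), alpha_john)
print(" ".join(tree.leaves()))  # John sleeps
adjoin(find(tree, lambda n: n.label == "VP"), beta_often)
print(" ".join(tree.leaves()))  # John often sleeps
```

The sketch mutates the derived tree in place and copies each elementary tree before use, so the same elementary tree can take part in a derivation more than once.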

Other variants of TAG allow multi-component trees, trees with multiple foot nodes, and other extensions.

Complexity and application

Tree-adjoining grammars are more powerful (in terms of weak generative capacity) than context-free grammars, but less powerful than linear context-free rewriting systems, [7] indexed [note 1] or context-sensitive grammars.

A TAG can describe the language of squares (in which some arbitrary string is repeated), and the language {a^n b^n c^n d^n | n ≥ 1}. This type of processing can be represented by an embedded pushdown automaton. Languages with cubes (i.e. triplicated strings) or with more than four blocks of distinct characters of equal length (such as a^n b^n c^n d^n e^n) cannot be generated by tree-adjoining grammars.
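The four-block counting language can be generated with a single auxiliary tree, as the following Python sketch suggests. It assumes the common extension of TAG with null-adjunction constraints (nodes at which adjunction is forbidden), which the description above does not cover. The `Node` class, the `no_adjoin` flag, and all function names are likewise assumptions for illustration only. The auxiliary tree has the shape S(a, S(b, S*, c), d), where S* is the foot and adjunction is allowed only at the inner S.

```python
from dataclasses import dataclass, field
from typing import Iterator, List, Optional


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)
    foot: bool = False
    no_adjoin: bool = False  # null-adjunction constraint (assumed extension)


def beta() -> Node:
    """Fresh copy of the auxiliary tree S(a, S(b, S*, c), d)."""
    foot = Node("S", foot=True, no_adjoin=True)
    inner = Node("S", [Node("b"), foot, Node("c")])
    return Node("S", [Node("a"), inner, Node("d")], no_adjoin=True)


def walk(node: Node) -> Iterator[Node]:
    """Visit every node in pre-order."""
    yield node
    for child in node.children:
        yield from walk(child)


def adjoinable(tree: Node) -> Optional[Node]:
    """First S node in pre-order at which adjunction is still allowed."""
    return next((n for n in walk(tree)
                 if n.label == "S" and not n.no_adjoin), None)


def adjoin(target: Node, aux: Node) -> None:
    """Splice aux in at target; the old subtree moves under the foot."""
    foot = next(n for n in walk(aux) if n.foot)
    foot.children, foot.foot = target.children, False
    # target now plays the role of aux's root, including its constraint.
    target.children, target.no_adjoin = aux.children, aux.no_adjoin


def yield_of(node: Node) -> str:
    """Terminal string of the tree; a childless S stands for the empty string."""
    if not node.children:
        return "" if node.label == "S" else node.label
    return "".join(yield_of(c) for c in node.children)


def derive(n: int) -> str:
    """Adjoin the auxiliary tree n times into the initial tree S(empty)."""
    tree = Node("S")
    for _ in range(n):
        adjoin(adjoinable(tree), beta())
    return yield_of(tree)


print(derive(1))  # abcd
print(derive(3))  # aaabbbcccddd
```

Each adjunction adds one a and one d around the outside and one b and one c around the inside, which keeps the four counts equal. Without the constraints, adjoining at the root of the tree for abcd would yield abaabcdcd... style strings outside the language, for example ababcdcd. With zero adjunctions this particular sketch also yields the empty string.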

For these reasons, tree-adjoining grammars are often described as mildly context-sensitive. These grammar classes are conjectured to be powerful enough to model natural languages while remaining efficiently parsable in the general case. [8]

Equivalences

Vijay-Shanker and Weir (1994) [9] demonstrate that linear indexed grammars, combinatory categorial grammar, tree-adjoining grammars, and head grammars are weakly equivalent formalisms, in that they all define the same string languages.

Lexicalized

Lexicalized tree-adjoining grammars (LTAG) are a variant of TAG in which each elementary tree (initial or auxiliary) is associated with a lexical item. A lexicalized grammar for English has been developed by the XTAG Research Group of the Institute for Research in Cognitive Science at the University of Pennsylvania. [5]
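The lexicalization condition can be stated as a simple check over a set of elementary trees. The following Python sketch uses a toy tuple encoding of trees and two made-up grammars; the encoding and the names `anchors` and `is_lexicalized` are assumptions for illustration and are unrelated to the XTAG grammar's actual data format.

```python
from typing import List, Tuple, Union

# A tree is either a word (a plain string) or a pair (category, subtrees).
# A category with an empty subtree list stands for a substitution or foot node.
Tree = Union[str, Tuple[str, List["Tree"]]]


def anchors(tree: Tree) -> List[str]:
    """Collect the lexical items (string leaves) on the tree's frontier."""
    if isinstance(tree, str):
        return [tree]
    _, children = tree
    return [word for child in children for word in anchors(child)]


def is_lexicalized(grammar: List[Tree]) -> bool:
    """True if every elementary tree carries at least one lexical item."""
    return all(anchors(tree) for tree in grammar)


lexicalized = [
    ("S", [("NP", []), ("VP", [("V", ["sleeps"])])]),  # anchored by "sleeps"
    ("NP", ["John"]),                                   # anchored by "John"
]
# Adding a tree with no word on its frontier breaks the condition.
unlexicalized = lexicalized + [("S", [("NP", []), ("VP", [])])]

print(is_lexicalized(lexicalized))    # True
print(is_lexicalized(unlexicalized))  # False
```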

Notes

  1. For each tree-adjoining grammar, a linear indexed grammar producing the same language can be found (see Equivalences), and for the latter, a weakly equivalent (proper) indexed grammar can be found in turn; see Indexed grammar#Computational Power.

References

  1. Joshi, Aravind; S. R. Kosaraju; H. Yamada (1969). "String Adjunct Grammars". Proceedings Tenth Annual Symposium on Automata Theory, Waterloo, Canada. Joshi, Aravind K.; Kosaraju, S. Rao; Yamada, H. M. (1972). "String Adjunct Grammars: I. Local and Distributed Adjunction". Information and Control, 21 (2): 93–116. doi: 10.1016/S0019-9958(72)90051-4. Joshi, Aravind K.; Kosaraju, S. Rao; Yamada, H. M. (1972). "String Adjunct Grammars: II. Equational Representation, Null Symbols, and Linguistic Relevance". Information and Control, 21 (3): 235–260. doi: 10.1016/S0019-9958(72)80005-6.
  2. Harris, Zellig S. (1962). String analysis of sentence structure. Papers on Formal Linguistics. Vol. 1. The Hague: Mouton & Co.
  3. Joshi, Aravind (1969). "Properties of Formal Grammars with Mixed Types of Rules and Their Linguistic Relevance". Proceedings Third International Symposium on Computational Linguistics, Stockholm, Sweden.
  4. Joshi, Aravind; Owen Rambow (2003). "A Formalism for Dependency Grammar Based on Tree Adjoining Grammar" (PDF). Proceedings of the Conference on Meaning-Text Theory.
  5. "A Lexicalized Tree Adjoining Grammar for English".
  6. Jurafsky, Daniel; James H. Martin (2000). Speech and Language Processing. Upper Saddle River, NJ: Prentice Hall. p. 354.
  7. Kallmeyer, Laura (2010). Parsing Beyond Context-Free Grammars. Springer. pp. 215–216.
  8. Joshi, Aravind (1985). "How much context-sensitivity is necessary for characterizing structural descriptions". In D. Dowty; L. Karttunen; A. Zwicky (eds.). Natural Language Processing: Theoretical, Computational, and Psychological Perspectives. New York, NY: Cambridge University Press. pp. 206–250. ISBN 9780521262033.
  9. Vijay-Shanker, K. and Weir, David J. 1994. The Equivalence of Four Extensions of Context-Free Grammars. Mathematical Systems Theory 27(6): 511–546.