Categorial grammar is a family of formalisms in natural language syntax that share the central assumption that syntactic constituents combine as functions and arguments. Categorial grammar posits a close relationship between the syntax and semantic composition, since it typically treats syntactic categories as corresponding to semantic types. Categorial grammars were developed in the 1930s by Kazimierz Ajdukiewicz and in the 1950s by Yehoshua Bar-Hillel and Joachim Lambek. It saw a surge of interest in the 1970s following the work of Richard Montague, whose Montague grammar assumed a similar view of syntax. It continues to be a major paradigm, particularly within formal semantics.
A categorial grammar consists of two parts: a lexicon, which assigns a set of types (also called categories) to each basic symbol, and some type inference rules, which determine how the type of a string of symbols follows from the types of the constituent symbols. It has the advantage that the type inference rules can be fixed once and for all, so that the specification of a particular language grammar is entirely determined by the lexicon.
A categorial grammar shares some features with the simply typed lambda calculus. Whereas the lambda calculus has only one function type , a categorial grammar typically has two function types, one type that is applied on the left, and one on the right. For example, a simple categorial grammar might have two function types and . The first, , is the type of a phrase that results in a phrase of type when followed (on the right) by a phrase of type . The second, , is the type of a phrase that results in a phrase of type when preceded (on the left) by a phrase of type .
The notation is based upon algebra. A fraction when multiplied by (i.e. concatenated with) its denominator yields its numerator. As concatenation is not commutative, it makes a difference whether the denominator occurs to the left or right. The concatenation must be on the same side as the denominator for it to cancel out.
The first and simplest kind of categorial grammar is called a basic categorial grammar, or sometimes an AB-grammar (after Ajdukiewicz and Bar-Hillel). Given a set of primitive types , let be the set of types constructed from primitive types. In the basic case, this is the least set such that and if then . Think of these as purely formal expressions freely generated from the primitive types; any semantics will be added later. Some authors assume a fixed infinite set of primitive types used by all grammars, but by making the primitive types part of the grammar, the whole construction is kept finite.
A basic categorial grammar is a tuple where is a finite set of symbols, is a finite set of primitive types, and .
The relation is the lexicon, which relates types to symbols . Since the lexicon is finite, it can be specified by listing a set of pairs like .
Such a grammar for English might have three basic types , assigning count nouns the type , complete noun phrases the type , and sentences the type . Then an adjective could have the type , because if it is followed by a noun then the whole phrase is a noun. Similarly, a determiner has the type , because it forms a complete noun phrase when followed by a noun. Intransitive verbs have the type , and transitive verbs the type . Then a string of words is a sentence if it has overall type .
For example, take the string "the bad boy made that mess". Now "the" and "that" are determiners, "boy" and "mess" are nouns, "bad" is an adjective, and "made" is a transitive verb, so the lexicon is {, , , , , }.
and the sequence of types in the string is
now find functions and appropriate arguments and reduce them according to the two inference rules and :
The fact that the result is means that the string is a sentence, while the sequence of reductions shows that it can be parsed as ((the (bad boy)) (made (that mess))).
Categorial grammars of this form (having only function application rules) are equivalent in generative capacity to context-free grammars and are thus often considered inadequate for theories of natural language syntax. Unlike CFGs, categorial grammars are lexicalized, meaning that only a small number of (mostly language-independent) rules are employed, and all other syntactic phenomena derive from the lexical entries of specific words.
Another appealing aspect of categorial grammars is that it is often easy to assign them a compositional semantics, by first assigning interpretation types to all the basic categories, and then associating all the derived categories with appropriate function types. The interpretation of any constituent is then simply the value of a function at an argument. With some modifications to handle intensionality and quantification, this approach can be used to cover a wide variety of semantic phenomena.
A Lambek grammar is an elaboration of this idea that has a concatenation operator for types, and several other inference rules. Mati Pentus has shown that these still have the generative capacity of context-free grammars.
For the Lambek calculus, there is a type concatenation operator , so that and if then .
The Lambek calculus consists of several deduction rules, which specify how type inclusion assertions can be derived. In the following rules, upper case roman letters stand for types, upper case Greek letters stand for sequences of types. A sequent of the form can be read: a string is of type X if it consists of the concatenation of strings of each of the types in Γ. If a type is interpreted as a set of strings, then the ← may be interpreted as ⊇, that is, "includes as a subset". A horizontal line means that the inclusion above the line implies the one below the line.
The process is begun by the Axiom rule, which has no antecedents and just says that any type includes itself.
The Cut rule says that inclusions can be composed.
The other rules come in pairs, one pair for each type construction operator, each pair consisting of one rule for the operator in the target, one in the source, of the arrow. The name of a rule consists of the operator and an arrow, with the operator on the side of the arrow on which it occurs in the conclusion.
Target | Source |
---|---|
For an example, here is a derivation of "type raising", which says that . The names of rules and the substitutions used are to the right.
Recall that a context-free grammar is a 4-tuple where
From the point of view of categorial grammars, a context-free grammar can be seen as a calculus with a set of special purpose axioms for each language, but with no type construction operators and no inference rules except Cut.
Specifically, given a context-free grammar as above, define a categorial grammar where , and . Let there be an axiom for every symbol , an axiom for every production rule , a lexicon entry for every terminal symbol , and Cut for the only rule. This categorial grammar generates the same language as the given CFG.
Of course, this is not a basic categorial grammar, since it has special axioms that depend upon the language; i.e. it is not lexicalized. Also, it makes no use at all of non-primitive types.
To show that any context-free language can be generated by a basic categorial grammar, recall that any context-free language can be generated by a context-free grammar in Greibach normal form.
The grammar is in Greibach normal form if every production rule is of the form , where capital letters are variables, , and , that is, the right side of the production is a single terminal symbol followed by zero or more (non-terminal) variables.
Now given a CFG in Greibach normal form, define a basic categorial grammar with a primitive type for each non-terminal variable , and with an entry in the lexicon , for each production rule . It is fairly easy to see that this basic categorial grammar generates the same language as the original CFG. Note that the lexicon of this grammar will generally assign multiple types to each symbol.
The same construction works for Lambek grammars, since they are an extension of basic categorial grammars. It is necessary to verify that the extra inference rules do not change the generated language. This can be done and shows that every context-free language is generated by some Lambek grammar.
To show the converse, that every language generated by a Lambek grammar is context-free, is much more difficult. It was an open problem for nearly thirty years, from the early 1960s until about 1991 when it was proven by Pentus.
The basic idea is, given a Lambek grammar, construct a context-free grammar with the same set of terminal symbols, the same start symbol, with variables some (not all) types , and with a production rule for each entry in the lexicon, and production rules for certain sequents that are derivable in the Lambek calculus.
Of course, there are infinitely many types and infinitely many derivable sequents, so in order to make a finite grammar it is necessary put a bound on the size of the types and sequents that are needed. The heart of Pentus's proof is to show that there is such a finite bound.
The notation in this field is not standardized. The notations used in formal language theory, logic, category theory, and linguistics, conflict with each other. In logic, arrows point to the more general from the more particular, that is, to the conclusion from the hypotheses. In this article, this convention is followed, i.e. the target of the arrow is the more general (inclusive) type.
In logic, arrows usually point left to right. In this article this convention is reversed for consistency with the notation of context-free grammars, where the single non-terminal symbol is always on the left. We use the symbol in a production rule as in Backus–Naur form. Some authors use an arrow, which unfortunately may point in either direction, depending on whether the grammar is thought of as generating or recognizing the language.
Some authors on categorial grammars write instead of . The convention used here follows Lambek and algebra.
The basic ideas of categorial grammar date from work by Kazimierz Ajdukiewicz (in 1935) and other scholars from the Polish tradition of mathematical logic including Stanisław Leśniewski, Emil Post and Alfred Tarski. Ajdukiewicz's formal approach to syntax was influenced by Edmund Husserl's pure logical grammar, which was formalized by Rudolph Carnap. It represents a development in the historical idea of universal logical grammar as an underlying structure of all languages. A core concept of the approach is the substitutability of syntactic categories—hence the name categorial grammar. The membership of an element (e.g., word or phrase) in a syntactic category (word class, phrase type) is established by the commutation test, and the formal grammar is constructed through series of such tests. [1]
The term categorial grammar was coined by Yehoshua Bar-Hillel (in 1953). In 1958, Joachim Lambek introduced a syntactic calculus that formalized the function type constructors along with various rules for the combination of functions. This calculus is a forerunner of linear logic in that it is a substructural logic.
Montague grammar is based on the same principles as categorial grammar. [2] Montague's work helped to bolster interest in categorial grammar by associating it with his highly successful formal treatment of natural language semantics. Later work in categorial grammar has focused on the improvement of syntactic coverage. One formalism that has received considerable attention in recent years is Steedman and Szabolcsi's combinatory categorial grammar, which builds on combinatory logic invented by Moses Schönfinkel and Haskell Curry.
There are a number of related formalisms of this kind in linguistics, such as type logical grammar and abstract categorial grammar. [3]
A variety of changes to categorial grammar have been proposed to improve syntactic coverage. Some of the most common are listed below.
Most systems of categorial grammar subdivide categories. The most common way to do this is by tagging them with features, such as person, gender, number, and tense. Sometimes only atomic categories are tagged in this way. In Montague grammar, it is traditional to subdivide function categories using a multiple slash convention, so A/B and A//B would be two distinct categories of left-applying functions, that took the same arguments but could be distinguished between by other functions taking them as arguments.
Rules of function composition are included in many categorial grammars. An example of such a rule would be one that allowed the concatenation of a constituent of type A/B with one of type B/C to produce a new constituent of type A/C. The semantics of such a rule would simply involve the composition of the functions involved. Function composition is important in categorial accounts of conjunction and extraction, especially as they relate to phenomena like right node raising. The introduction of function composition into a categorial grammar leads to many kinds of derivational ambiguity that are vacuous in the sense that they do not correspond to semantic ambiguities.
Many categorial grammars include a typical conjunction rule, of the general form X CONJ X → X, where X is a category. Conjunction can generally be applied to nonstandard constituents resulting from type raising or function composition..
The grammar is extended to handle linguistic phenomena such as discontinuous idioms, gapping and extraction. [4]
In logic, mathematics, computer science, and linguistics, a formal language consists of words whose letters are taken from an alphabet and are well-formed according to a specific set of rules called a formal grammar.
In mathematics, especially in order theory, a preorder or quasiorder is a binary relation that is reflexive and transitive. The name preorder is meant to suggest that preorders are almost partial orders, but not quite, as they are not necessarily antisymmetric.
In mathematics and theoretical computer science, a type theory is the formal presentation of a specific type system. Type theory is the academic study of type systems.
In logic and computer science, specifically automated reasoning, unification is an algorithmic process of solving equations between symbolic expressions, each of the form Left-hand side = Right-hand side. For example, using x,y,z as variables, and taking f to be an uninterpreted function, the singleton equation set { f(1,y) = f(x,2) } is a syntactic first-order unification problem that has the substitution { x ↦ 1, y ↦ 2 } as its only solution.
In mathematics, computer science, and logic, rewriting covers a wide range of methods of replacing subterms of a formula with other terms. Such methods may be achieved by rewriting systems. In their most basic form, they consist of a set of objects, plus relations on how to transform those objects.
Montague grammar is an approach to natural language semantics, named after American logician Richard Montague. The Montague grammar is based on mathematical logic, especially higher-order predicate logic and lambda calculus, and makes use of the notions of intensional logic, via Kripke models. Montague pioneered this approach in the 1960s and early 1970s.
In computer science, a parsing expression grammar (PEG) is a type of analytic formal grammar, i.e. it describes a formal language in terms of a set of rules for recognizing strings in the language. The formalism was introduced by Bryan Ford in 2004 and is closely related to the family of top-down parsing languages introduced in the early 1970s. Syntactically, PEGs also look similar to context-free grammars (CFGs), but they have a different interpretation: the choice operator selects the first match in PEG, while it is ambiguous in CFG. This is closer to how string recognition tends to be done in practice, e.g. by a recursive descent parser.
The simply typed lambda calculus, a form of type theory, is a typed interpretation of the lambda calculus with only one type constructor that builds function types. It is the canonical and simplest example of a typed lambda calculus. The simply typed lambda calculus was originally introduced by Alonzo Church in 1940 as an attempt to avoid paradoxical use of the untyped lambda calculus.
String diagrams are a formal graphical language for representing morphisms in monoidal categories, or more generally 2-cells in 2-categories. They are a prominent tool in applied category theory. When interpreted in the monoidal category of vector spaces and linear maps with the tensor product, string diagrams are called tensor networks or Penrose graphical notation. This has led to the development of categorical quantum mechanics where the axioms of quantum theory are expressed in the language of monoidal categories.
In universal algebra and in model theory, a structure consists of a set along with a collection of finitary operations and relations that are defined on it.
The concept of a stable model, or answer set, is used to define a declarative semantics for logic programs with negation as failure. This is one of several standard approaches to the meaning of negation in logic programming, along with program completion and the well-founded semantics. The stable model semantics is the basis of answer set programming.
Indexed grammars are a generalization of context-free grammars in that nonterminals are equipped with lists of flags, or index symbols. The language produced by an indexed grammar is called an indexed language.
Combinatory categorial grammar (CCG) is an efficiently parsable, yet linguistically expressive grammar formalism. It has a transparent interface between surface syntax and underlying semantic representation, including predicate–argument structure, quantification and information structure. The formalism generates constituency-based structures and is therefore a type of phrase structure grammar.
A formal grammar describes which strings from an alphabet of a formal language are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form. A formal grammar is defined as a set of production rules for such strings in a formal language.
In computer science, more specifically in automata and formal language theory, nested words are a concept proposed by Alur and Madhusudan as a joint generalization of words, as traditionally used for modelling linearly ordered structures, and of ordered unranked trees, as traditionally used for modelling hierarchical structures. Finite-state acceptors for nested words, so-called nested word automata, then give a more expressive generalization of finite automata on words. The linear encodings of languages accepted by finite nested word automata gives the class of visibly pushdown languages. The latter language class lies properly between the regular languages and the deterministic context-free languages. Since their introduction in 2004, these concepts have triggered much research in that area.
Minimalist grammars are a class of formal grammars that aim to provide a more rigorous, usually proof-theoretic, formalization of Chomskyan Minimalist program than is normally provided in the mainstream Minimalist literature. A variety of particular formalizations exist, most of them developed by Edward Stabler, Alain Lecomte, Christian Retoré, or combinations thereof.
A Hindley–Milner (HM) type system is a classical type system for the lambda calculus with parametric polymorphism. It is also known as Damas–Milner or Damas–Hindley–Milner. It was first described by J. Roger Hindley and later rediscovered by Robin Milner. Luis Damas contributed a close formal analysis and proof of the method in his PhD thesis.
Pregroup grammar (PG) is a grammar formalism intimately related to categorial grammars. Much like categorial grammar (CG), PG is a kind of type logical grammar. Unlike CG, however, PG does not have a distinguished function type. Rather, PG uses inverse types combined with its monoidal operation.
Anti-unification is the process of constructing a generalization common to two given symbolic expressions. As in unification, several frameworks are distinguished depending on which expressions are allowed, and which expressions are considered equal. If variables representing functions are allowed in an expression, the process is called "higher-order anti-unification", otherwise "first-order anti-unification". If the generalization is required to have an instance literally equal to each input expression, the process is called "syntactical anti-unification", otherwise "E-anti-unification", or "anti-unification modulo theory".
Bounded arithmetic is a collective name for a family of weak subtheories of Peano arithmetic. Such theories are typically obtained by requiring that quantifiers be bounded in the induction axiom or equivalent postulates. The main purpose is to characterize one or another class of computational complexity in the sense that a function is provably total if and only if it belongs to a given complexity class. Further, theories of bounded arithmetic present uniform counterparts to standard propositional proof systems such as Frege system and are, in particular, useful for constructing polynomial-size proofs in these systems. The characterization of standard complexity classes and correspondence to propositional proof systems allows to interpret theories of bounded arithmetic as formal systems capturing various levels of feasible reasoning.
{{cite book}}
: |journal=
ignored (help)