Constraint grammar

Constraint grammar (CG) is a methodological paradigm for natural language processing (NLP). Linguist-written, context-dependent rules are compiled into a grammar that assigns grammatical tags ("readings") to words or other tokens in running text. Typical tags address lemmatisation (lexeme or base form), inflexion, derivation, syntactic function, dependency, valency, case roles, semantic type, etc. Each rule either adds, removes, selects or replaces a tag or a set of grammatical tags in a given sentence context. Context conditions can be linked to any tag or tag set of any word anywhere in the sentence, either locally (defined distances) or globally (undefined distances). Context conditions in the same rule may be linked, i.e. conditioned upon each other, negated, or blocked by interfering words or tags. Typical CGs consist of thousands of rules that are applied set-wise in progressive steps, covering ever more advanced levels of analysis. Within each level, safe rules are used before heuristic rules, and no rule is allowed to remove the last reading of a given kind, thus providing a high degree of robustness.
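As an illustration of these context mechanisms, a single disambiguation rule might look as follows (a minimal sketch in VISL CG-3 syntax; the tag names det, verb and pr are chosen for illustration only):

SELECT (noun) IF
    (*-1 (det) BARRIER (verb) LINK -1 (pr)) ;

This rule selects the noun reading of a word if there is a determiner somewhere to its left with no verb in between (BARRIER blocks the match), and that determiner is in turn immediately preceded by a preposition (LINK chains a further condition onto the matched context).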

The CG concept was launched by Fred Karlsson in 1990 (Karlsson 1990; Karlsson et al., eds, 1995), and CG taggers and parsers have since been written for a large variety of languages, routinely achieving F-scores of over 99% for part of speech (word class).[1] A number of syntactic CG systems have reported F-scores of around 95% for syntactic function labels. CG systems can be used to create full syntactic trees in other formalisms by adding small, non-terminal-based phrase structure grammars or dependency grammars, and a number of treebank projects have used CG for automatic annotation. CG methodology has also been used in a number of language technology applications, such as spell checkers and machine translation systems.

Rule syntax and format

A Constraint Grammar parser expects as its input a stream of morphologically analysed tokens, typically produced by a finite-state transducer-based analyser (common ones are the Xerox tools twolc/lexc/xfst, HFST or Apertium's lttoolbox). Each token may be ambiguous and have many readings; the surface form together with all its readings is called a cohort. Below is a possible example analysis of ", and X was like “" in the input format expected by VISL CG-3:

"<,>"  "," cm "<and>"  "and" conj "<X>"  "X" num pl  "X" noun prop "<was>"  "be" verb past p1 sg  "be" verb past p3 sg "<like>"  "like" adj  "like" subj  "like" pr  "like" verb inf  "like" verb pres  "like" verb imp "<“>"  "“" lquot 

This snippet shows 6 cohorts, each with one or more readings. The surface wordforms are in "<anglequotes>", while the lemmas/baseforms are in regular "quotes", each followed by an unquoted set of tags. Some cohorts have several readings, i.e. are ambiguous ("<like>" has 6 readings). The job of the CG parser is now to 1) remove as many wrong readings as it is safe to do given the context, 2) optionally apply one or more syntactic function labels to each cohort (or even dependency relations) and 3) disambiguate the applied labels/relations.

Below is an example rule (again in VISL CG-3 format) for picking the third person reading of "was" (by removing the first person reading), given that there is no first person pronoun to the left:

REMOVE (verb p1) IF
    (0C (verb))
    (NEGATE *-1 (prn p1)) ;

Here (verb p1) is a set of tags (order doesn't matter) that must match the reading we're removing. After IF follows a list of zero or more constraints. The first one says that in this cohort (position 0), all readings (the qualifier C, for Careful) have the tag verb. The second says that if there is a cohort at least one word to the left (position *-1, the * meaning we may go further than one word and the - meaning left) and that cohort is a first person pronoun, then the constraint does *not* match (NEGATE inverts the condition).

In CG-3, rules may also be given names, e.g. SELECT:somename (…) IF, which show up in the trace output.

A rule can also pick a single reading if we're sure that all the other readings must be wrong given the constraints:

SELECT:quoting ("like" subj) IF    (-1 ("<was>"))   (1 (lquot) OR (":"))  ; 

In this rule, we see that we can refer to both wordforms and baseforms in tagsets (they are treated just like any other tag, and a reading will always match its wordform). Here the second constraint uses OR to combine two tagsets. If this set is commonly used, we can give it a name and use the name – without the parentheses – like this:

LIST prequote = lquot ":" ;

SELECT:quoting ("like" subj) IF
    (-1 ("<was>"))
    (1 prequote) ;

An equivalent definition would be SET prequote = (lquot) OR (":") ; .
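Besides OR, set definitions can combine tagsets with other operators; for example, assuming CG-3's set difference operator -, a hypothetical set of common nouns could be defined by subtracting proper nouns:

SET commonnoun = (noun) - (prop) ;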

After running the above rules, we should end up with this:

"<,>"  "," cm "<and>"  "and" conj "<X>"  "X" num pl  "X" noun prop "<was>"  "be" verb past p3 sg "<like>"  "like" subj "<“>"  "“" lquot 

If we used --trace, we would see the removed readings with an initial ;, and the name and line number of the rule wherever it applied to a reading.
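For instance, after the REMOVE rule above has applied, the traced cohort for "<was>" might look roughly like this (a sketch; the exact trace format varies between CG-3 versions, and 42 stands in for the rule's line number in the grammar file):

"<was>"
    "be" verb past p3 sg
;    "be" verb past p1 sg REMOVE:42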


The rule syntax for adding syntactic function labels follows a similar scheme of "do this if x, y and z":

LIST nominal = noun prn ;

ADD (@SUBJ) IF
    (NEGATE *-1 nominal)
    (0C (prop))
    (1C finiteverb) ;

This is called a "mapping rule", and we may end up with multiple such mapping tags per cohort, in which case we can disambiguate using the same SELECT/REMOVE rules.
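If, say, both @SUBJ and @OBJ were mapped onto the same cohort, a later rule could resolve the conflict (a minimal sketch with a hypothetical context; >>> is CG-3's magic tag matching the start of the window):

REMOVE (@OBJ) IF
    (0 (@SUBJ))
    (-1 (>>>)) ;

This removes the object label from a sentence-initial cohort that also carries the subject label.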

Implementations

CG-1

The first CG implementation was CGP by Fred Karlsson in the early 1990s. It was purely Lisp-based, and its rule syntax was built on Lisp s-expressions (Karlsson 1990).

CG-2

Pasi Tapanainen's CG-2 implementation mdis[2] removed some of the parentheses in the grammar format and was implemented in C++, interpreting the grammar as a finite-state transducer for speed.

CG-2 was later reimplemented (with a non-FST method) by the VISL group at Syddansk Universitet as the open-source VISL CG, keeping the same format as Tapanainen's closed-source mdis.


CG-3


The VISL project later turned into VISL CG-3, which brought further changes and additions to the grammar format, e.g. full Unicode support, named rules and rules for creating dependency relations.

There is also a simple IDE for CG-3 developed by VISL, which provides syntax highlighting and lets you see the input, output and possible errors as you write your grammar. There is also an Emacs mode, cg.el, with similar features and simple code navigation.

Unlike the Tapanainen implementation, the VISL implementations do not use finite state transducers. Rules are ordered within sections, which gives more predictability when writing grammars, but at the cost of slower parsing and the possibility of endless loops.
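In grammar files, this ordering is expressed with SECTION headers; rules in earlier sections are tried before those in later sections (a sketch reusing rules from above; the comments are for exposition):

SECTION # safe rules
REMOVE (verb p1) IF (0C (verb)) (NEGATE *-1 (prn p1)) ;

SECTION # heuristic rules
SELECT ("like" subj) IF (-1 ("<was>")) (1 (lquot) OR (":")) ;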

There have been experimental open-source FST-based reimplementations of CG-2 that for small grammars reach the speed of VISL CG-3, if not mdis.[3]

List of systems

Free software
Non-free software


References

  1. For English, see for example Tapanainen and Voutilainen 1994.
  2. Tapanainen, Pasi (1996). The Constraint Grammar Parser CG-2. University of Helsinki Publications No. 27.
  3. Nemeskey, D. M., Tyers, F. M. and Hulden, M. (2014). "Why Implementation Matters: Evaluation of an Open-source Constraint Grammar Parser". Proceedings of the 25th International Conference on Computational Linguistics (COLING 2014).