Fluid construction grammar

Fluid construction grammar (FCG) is an open-source computational construction grammar formalism that allows computational linguists to write down inventories of lexical and grammatical constructions in a formal way and to run experiments in language learning and language evolution. [1] FCG is an open instrument for construction grammarians who want to formulate their intuitions and data precisely and to test the implications of their grammar designs for language parsing, production and learning. The formalism can be tried out through an interactive web interface on the FCG website.

FCG integrates many notions from contemporary computational linguistics, such as feature structures and unification-based language processing, but uses them in a novel way to operationalize insights from construction grammar theory. Constructions are considered bi-directional and hence usable for both parsing and production. Processing is flexible in the sense that FCG provides meta-layer processing for coping with novelty and with partially ungrammatical or incomplete sentences. FCG is called 'fluid' because it acknowledges the premise that language users constantly change and update their grammars. The research on FCG is primarily carried out by Luc Steels and his teams at the VUB AI Lab in Brussels, the Language Evolution Lab in Barcelona, and the Sony Computer Science Laboratories in Paris. Besides Steels, current and former contributors to the FCG formalism include Katrien Beuls, Paul Van Eecke, Remi van Trijp, Joris Bleys, Joachim De Beule, Martin Loetzsch, Nicolas Neubauer, Michael Spranger, Wouter Van den Broeck, Pieter Wellens, and others.

Transient structure

FCG treats parsing and production as a search problem, in which the FCG engine searches for the best utterance to verbalize a meaning (language production) or the best semantic network (or meaning representation) to analyze an utterance (parsing). Each state representation in the search process is called a transient structure. A transient structure can be considered an extended feature structure: it consists of a flat list of "units", each of which pairs a unit name (a unique constant symbol) with a unit body (a set of feature-value pairs). Older versions of FCG (before 2011) split the transient structure into two separate poles for semantics and syntax, but the current version implements a single representation for all linguistic information.
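As a rough illustration, the following sketch models a transient structure as a list of named units with feature-value pairs. FCG itself is implemented in Common Lisp and uses its own notation; the Python rendering and all unit and feature names below are invented for illustration only.

    # A toy transient structure: a flat list of units, each pairing a
    # unique unit name with a unit body of feature-value pairs.
    transient_structure = [
        ("the-unit", {
            "form": {"string": "the"},
            "syn-cat": {"lex-class": "article"},
        }),
        ("mouse-unit", {
            "form": {"string": "mouse"},
            "syn-cat": {"lex-class": "noun", "number": "singular"},
            "meaning": {"predicate": "mouse"},
        }),
    ]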

Constructions

FCG constructions (or, technically speaking, construction schemas) are treated as the operators of the search process: applying a construction to a transient structure may create a new transient structure (or state representation) in the search space. Just like transient structures, constructions mostly consist of units of feature-value pairs. Constructions are, however, more structured, because they contain two distinct parts: a conditional part, which specifies the constraints that a transient structure must satisfy for the construction to apply, and a contributing part, which specifies the information that the construction adds to the transient structure. The conditional part is further divided into a formulation lock and a comprehension lock (see below).
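The following sketch shows what such a two-part construction could look like in the same toy notation as above. The variable-style unit name ?mouse-unit and all feature names are invented for illustration and do not reproduce actual FCG syntax.

    # A toy construction schema with a conditional part (split into a
    # formulation lock and a comprehension lock) and a contributing part.
    mouse_cxn = {
        "name": "mouse-cxn",
        "conditional": {
            "?mouse-unit": {
                "formulation-lock": {"meaning": {"predicate": "mouse"}},
                "comprehension-lock": {"form": {"string": "mouse"}},
            },
        },
        "contributing": {
            "?mouse-unit": {
                "syn-cat": {"lex-class": "noun", "number": "singular"},
            },
        },
    }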

Linguistic processing

To decide whether a construction can apply, the conditional part is "matched" against the current transient structure using a unification-based algorithm. In production, only features that are part of the formulation locks of the construction must be matched against the transient structure, whereas in parsing, only features that are part of the comprehension locks are considered. If the match is successful, the FCG engine "merges" the construction's remaining units and feature-value pairs into the transient structure in a similar unification-based process.
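A highly simplified match-and-merge over nested feature dictionaries might look as follows. Real FCG unification also handles variables and bindings; this sketch, which only checks constant values, is meant purely to convey the idea.

    def matches(lock, unit):
        # Toy "match": every feature-value pair in the lock must
        # already be present in the unit.
        for feature, value in lock.items():
            if isinstance(value, dict):
                if feature not in unit or not matches(value, unit[feature]):
                    return False
            elif unit.get(feature) != value:
                return False
        return True

    def merge(contribution, unit):
        # Toy "merge": add the contributed feature-value pairs to the unit.
        result = dict(unit)
        for feature, value in contribution.items():
            if isinstance(value, dict):
                result[feature] = merge(value, result.get(feature, {}))
            else:
                result[feature] = value
        return result

    # Comprehension direction: the comprehension lock matches the
    # observed form, so the contribution can be merged in.
    unit = {"form": {"string": "mouse"}}
    if matches({"form": {"string": "mouse"}}, unit):
        unit = merge({"syn-cat": {"lex-class": "noun"}}, unit)
    # unit now also carries {"syn-cat": {"lex-class": "noun"}}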

Flexibility

FCG features a meta-layer of diagnostics, repairs and consolidation strategies that allows the grammar designer to implement ways of handling novelty, errors and unexpected input during processing. These diagnostics and repairs can also be used for exploring the (automated) acquisition of new constructions.
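The schematic loop below illustrates the idea: a diagnostic detects a problem (here, an unknown word), a repair proposes a fix (here, a stub lexical construction), and the fix is consolidated into the grammar. The functions and the construction format are invented placeholders, not actual FCG components.

    def diagnose_unknown_words(utterance, grammar):
        # Diagnostic: report words that no construction covers.
        known = {cxn["string"] for cxn in grammar}
        return [word for word in utterance.split() if word not in known]

    def repair_new_lexical_cxn(word):
        # Repair: hypothesize a minimal construction for the unknown word.
        return {"name": word + "-cxn", "string": word, "meaning": None}

    def comprehend(utterance, grammar):
        for word in diagnose_unknown_words(utterance, grammar):
            grammar.append(repair_new_lexical_cxn(word))  # consolidation
        return [cxn for cxn in grammar if cxn["string"] in utterance.split()]

    grammar = [{"name": "mouse-cxn", "string": "mouse", "meaning": "mouse"}]
    analysis = comprehend("the mouse", grammar)  # "the" triggers a repair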

Related Research Articles

Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions.

In formal language theory, a context-free grammar (CFG) is a certain type of formal grammar: a set of production rules that describe all possible strings in a given formal language. Production rules are simple replacements: for example, the rule S → aSb rewrites the nonterminal S as the string aSb, regardless of the context in which S occurs.
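A toy grammar with the two rules S → aSb and S → ab generates the language of strings aⁿbⁿ (n ≥ 1); the sketch below (a standard textbook example, not tied to any particular parsing library) plays out one derivation.

    # Rule alternatives for the single nonterminal S.
    rules = {"S": ["aSb", "ab"]}

    def derive(choices):
        # Apply a sequence of rule choices, starting from the start symbol S.
        string = "S"
        for choice in choices:
            string = string.replace("S", rules["S"][choice], 1)
        return string

    print(derive([0, 0, 1]))  # S -> aSb -> aaSbb -> aaabbb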

Lexical functional grammar (LFG) is a constraint-based grammar framework in theoretical linguistics. It posits two separate levels of syntactic structure: a phrase structure grammar representation of word order and constituency, and a representation of grammatical functions such as subject and object, similar to dependency grammar. The development of the theory was initiated by Joan Bresnan and Ronald Kaplan in the 1970s, in reaction to the then-current theory of transformational grammar. LFG mainly focuses on syntax, including its relation with morphology and semantics; there has been little LFG work on phonology.

Head-driven phrase structure grammar (HPSG) is a highly lexicalized, constraint-based grammar developed by Carl Pollard and Ivan Sag. It is a type of phrase structure grammar, as opposed to a dependency grammar, and it is the immediate successor to generalized phrase structure grammar. HPSG draws from other fields such as computer science and uses Ferdinand de Saussure's notion of the sign. It uses a uniform formalism and is organized in a modular way which makes it attractive for natural language processing.

Grammar theory to model symbol strings originated from work in computational linguistics aiming to understand the structure of natural languages. Probabilistic context-free grammars (PCFGs) have been applied in probabilistic modeling of RNA structures almost 40 years after they were introduced in computational linguistics.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, whether in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning "part (of speech)".

Top-down parsing in computer science is a parsing strategy where one first looks at the highest level of the parse tree and works down the parse tree by using the rewriting rules of a formal grammar. LL parsers are a type of parser that uses a top-down parsing strategy.
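A minimal recursive-descent recognizer, sketched below for the toy grammar S → aSb | ab from the context-free grammar example above, shows the top-down strategy at work: parsing starts from the start symbol and expands rules downwards.

    def parse_S(s, i=0):
        # Top-down recognizer for S -> 'a' S 'b' | 'a' 'b'.
        # Returns the index just past a well-formed S, or None.
        if i < len(s) and s[i] == "a":          # try S -> 'a' S 'b'
            j = parse_S(s, i + 1)
            if j is not None and j < len(s) and s[j] == "b":
                return j + 1
        if s.startswith("ab", i):               # fall back to S -> 'a' 'b'
            return i + 2
        return None

    def recognize(s):
        return parse_S(s) == len(s)

    print(recognize("aaabbb"))  # True
    print(recognize("aab"))     # False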

Tree-adjoining grammar (TAG) is a grammar formalism defined by Aravind Joshi. Tree-adjoining grammars are somewhat similar to context-free grammars, but the elementary unit of rewriting is the tree rather than the symbol. Whereas context-free grammars have rules for rewriting symbols as strings of other symbols, tree-adjoining grammars have rules for rewriting the nodes of trees as other trees.

In linguistics, construction grammar is a family of theories which posit that human language consists of constructions, or learned pairings of linguistic forms with functions or meanings. Constructions can be individual words, morphemes, fixed expressions and idioms, and abstract grammatical rules such as the passive voice or ditransitive. Any linguistic pattern is considered to be a construction as long as some aspect of its form or its meaning cannot be predicted from its component parts, or from other constructions that are recognized to exist. In construction grammar, every utterance is understood to be a combination of multiple different constructions, which together specify its precise meaning and form.

In computer science, scannerless parsing performs tokenization and parsing in a single step, rather than breaking the process into a pipeline of a lexer followed by a parser. A language grammar is scannerless if it uses a single formalism to express both the lexical and the phrase-level structure of the language.

Treebank

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data. The exploitation of treebank data has been important ever since the first large-scale treebank, The Penn Treebank, was published. However, although originating in computational linguistics, the value of treebanks is becoming more widely appreciated in linguistics research as a whole. For example, annotated treebank data has been crucial in syntactic research to test linguistic theories of sentence structure against large quantities of naturally occurring examples.

Grammar induction

Grammar induction is the process in machine learning of learning a formal grammar from a set of observations, thus constructing a model which accounts for the characteristics of the observed objects. More generally, grammatical inference is that branch of machine learning where the instance space consists of discrete combinatorial objects such as strings, trees and graphs.

Martin Kay (born 1935) is a computer scientist known especially for his work in computational linguistics.

Statistical parsing is a group of parsing methods within natural language processing. The methods have in common that they associate grammar rules with a probability. Grammar rules are traditionally viewed in computational linguistics as defining the valid sentences in a language. Within this mindset, the idea of associating each rule with a probability then provides the relative frequency of any given grammar rule and, by deduction, the probability of a complete parse for a sentence. Using this concept, statistical parsers make use of a procedure to search over a space of all candidate parses, and the computation of each candidate's probability, to derive the most probable parse of a sentence. The Viterbi algorithm is one popular method of searching for the most probable parse.
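In a toy PCFG, each rule carries a probability, and the probability of a derivation is simply the product of the probabilities of the rules it uses. The grammar and the numbers below are invented for illustration; a real statistical parser would search over all candidate derivations (for instance with the Viterbi algorithm) rather than score a single one.

    import math

    # Each (nonterminal, expansion) rule is paired with a probability.
    rules = {
        ("S",  ("NP", "VP")):      1.0,
        ("NP", ("the", "mouse")):  0.4,
        ("NP", ("the", "cat")):    0.6,
        ("VP", ("sleeps",)):       0.7,
        ("VP", ("runs",)):         0.3,
    }

    def derivation_probability(derivation):
        # Multiply the probabilities of all rules used in the derivation.
        return math.prod(rules[rule] for rule in derivation)

    parse = [("S", ("NP", "VP")),
             ("NP", ("the", "mouse")),
             ("VP", ("sleeps",))]
    print(derivation_probability(parse))  # 1.0 * 0.4 * 0.7 = 0.28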

In formal language theory, a grammar is a set of production rules for strings in a formal language. The rules describe how to form strings from the language's alphabet that are valid according to the language's syntax. A grammar does not describe the meaning of the strings or what can be done with them in whatever context—only their form.

Deep Linguistic Processing with HPSG Initiative (DELPH-IN) is a collaboration in which computational linguists worldwide develop natural language processing tools for deep linguistic processing of human language. The goal of DELPH-IN is to combine linguistic and statistical processing methods in order to computationally understand the meaning of texts and utterances.

Deep linguistic processing is a natural language processing framework which draws on theoretical and descriptive linguistics. It models language predominantly by way of theoretical syntactic/semantic theory. Deep linguistic processing approaches differ from "shallower" methods in that they yield more expressive and structural representations which directly capture long-distance dependencies and underlying predicate-argument structures.
The knowledge-intensive approach of deep linguistic processing requires considerable computational power, and has in the past sometimes been judged as being intractable. However, research in the early 2000s had made considerable advancement in efficiency of deep processing. Today, efficiency is no longer a major problem for applications using deep linguistic processing.

Minimal recursion semantics (MRS) is a framework for computational semantics. It can be implemented in typed feature structure formalisms such as head-driven phrase structure grammar and lexical functional grammar. It is suitable for computational language parsing and natural language generation. MRS enables a simple formulation of the grammatical constraints on lexical and phrasal semantics, including the principles of semantic composition. This technique is used in machine translation.

In computational linguistics, the term mildly context-sensitive grammar formalisms refers to several grammar formalisms that have been developed with the ambition to provide adequate descriptions of the syntactic structure of natural language.

In everyday conversation, we speak and understand language incrementally, word by word, in a time-linear fashion; we interrupt each other mid-sentence; we complete each other's sentences; we pause, stall, backtrack, self-correct and hesitate. Dynamic Syntax (DS) is a grammar formalism and linguistic theory whose overall aim is to characterise and capture these real-time twin processes of language understanding and production. Under the DS approach, linguistic knowledge is considered to be the ability to analyse the structure and content of spoken and written language in context and in real time. Natural-language syntax, on this view, is the constraint-based way in which representations of meaning can be built up from the words encountered in a string.

References

  1. Steels, Luc, ed. (2011). Design Patterns in Fluid Construction Grammar. Amsterdam: John Benjamins.