Grammatical Framework (programming language)


Grammatical Framework (GF) is a programming language for writing grammars of natural languages. GF can parse and generate text in several languages simultaneously while working from a language-independent representation of meaning. Grammars written in GF can be compiled into a platform-independent format and then used from other programming languages, including C, Java, C#, Python, and Haskell. A companion to GF is the GF Resource Grammar Library, a reusable library for dealing with the morphology and syntax of a growing number of natural languages.


Both GF itself and the GF Resource Grammar Library are open-source. Typologically, GF is a functional programming language. Mathematically, it is a type-theoretic formal system (a logical framework, to be precise) based on Martin-Löf's intuitionistic type theory, with additional judgments tailored specifically to the domain of linguistics.

Language features

Tutorial

Goal: write a multilingual grammar for expressing statements about John and Mary loving each other. [2]

Abstract and concrete modules

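In GF, grammars are divided into two module types: an abstract module, which declares categories (cat) and functions (fun) and so defines the language-independent trees, and one or more concrete modules, which give linearization types (lincat) and linearization rules (lin) stating how those trees are rendered in a particular language.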

Consider the following:

Abstract syntax

abstract Zero = {
  cat
    S ; NP ; VP ; V2 ;
  fun
    Pred : NP -> VP -> S ;
    Compl : V2 -> NP -> VP ;
    John, Mary : NP ;
    Love : V2 ;
}

Concrete syntax: English

concrete ZeroEng of Zero = {
  lincat
    S, NP, VP, V2 = Str ;
  lin
    Pred np vp = np ++ vp ;
    Compl v2 np = v2 ++ np ;
    John = "John" ;
    Mary = "Mary" ;
    Love = "loves" ;
}

Note that Str (a token list, or "string") is the only linearization type used here.
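The relationship between abstract trees and their English linearization can be modelled outside GF. The following is a minimal Python sketch, not GF code: it assumes trees are written as nested tuples and that each category linearizes to a list of tokens, standing in for Str. The name linearize_eng is hypothetical and chosen for illustration only.

```python
# Abstract syntax trees as nested tuples: (function_name, *arguments).
# This is a hypothetical Python model of the Zero grammar, not GF itself.

def linearize_eng(tree):
    """Linearize a Zero tree into English tokens, mirroring the lin rules of ZeroEng."""
    fun, *args = tree
    if fun == "Pred":       # Pred np vp = np ++ vp
        np, vp = args
        return linearize_eng(np) + linearize_eng(vp)
    if fun == "Compl":      # Compl v2 np = v2 ++ np
        v2, np = args
        return linearize_eng(v2) + linearize_eng(np)
    # Lexical functions: every category linearizes to a token list (Str).
    lexicon = {"John": ["John"], "Mary": ["Mary"], "Love": ["loves"]}
    return lexicon[fun]

tree = ("Pred", ("John",), ("Compl", ("Love",), ("Mary",)))
print(" ".join(linearize_eng(tree)))  # prints "John loves Mary"
```

The tree itself carries no English words; only the linearization function does, which is what allows other languages to reuse the same trees.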

Making a grammar multilingual

A single abstract syntax may be paired with many concrete syntaxes, in our case one for each new natural language we wish to add. The same system of trees can be given different words, different word orders, and different linearization types in each language.

Concrete syntax: French

concrete ZeroFre of Zero = {
  lincat
    S, NP, VP, V2 = Str ;
  lin
    Pred np vp = np ++ vp ;
    Compl v2 np = v2 ++ np ;
    John = "Jean" ;
    Mary = "Marie" ;
    Love = "aime" ;
}

Translation and multilingual generation

We can now use our grammar to translate phrases between French and English. The following commands can be executed in the GF interactive shell.

Import many grammars with the same abstract syntax

> import ZeroEng.gf ZeroFre.gf
Languages: ZeroEng ZeroFre

Translation: pipe parsing to linearization

> parse -lang=Eng "John loves Mary" | linearize -lang=Fre
Jean aime Marie
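The idea that translation is parsing followed by linearization can be sketched in Python. This is an illustrative model under stated assumptions, not how GF parses: because the Zero grammar has only four sentence trees, the sketch parses by enumerating all of them and keeping those whose linearization matches the input. The names LEXICON, all_trees, parse, linearize and translate are hypothetical.

```python
from itertools import product

# Hypothetical per-language lexicons; English and French share the same word order.
LEXICON = {
    "Eng": {"John": "John", "Mary": "Mary", "Love": "loves"},
    "Fre": {"John": "Jean", "Mary": "Marie", "Love": "aime"},
}

def all_trees():
    """Enumerate every tree of category S in the Zero grammar (there are four)."""
    for subj, obj in product(["John", "Mary"], repeat=2):
        yield ("Pred", subj, ("Compl", "Love", obj))

def linearize(tree, lang):
    """Apply Pred np vp = np ++ vp and Compl v2 np = v2 ++ np."""
    _, subj, (_, verb, obj) = tree
    words = LEXICON[lang]
    return " ".join([words[subj], words[verb], words[obj]])

def parse(sentence, lang):
    """Brute-force parsing: keep the trees whose linearization equals the input."""
    return [t for t in all_trees() if linearize(t, lang) == sentence]

def translate(sentence, source, target):
    """Parse in the source language, then linearize each tree in the target language."""
    return [linearize(t, target) for t in parse(sentence, source)]

print(translate("John loves Mary", "Eng", "Fre"))  # prints ['Jean aime Marie']
```

Returning a list reflects that a sentence may in general have zero, one, or several parse trees, each with its own translation.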

Multilingual generation: linearize into all languages

> generate_random | linearize -treebank
Zero: Pred Mary (Compl Love Mary)
ZeroEng: Mary loves Mary
ZeroFre: Marie aime Marie

Parameters, tables

Latin has cases: nominative for subject, accusative for object.

We use a parameter type for case (just 2 of Latin's 6 cases). The linearization type of NP is a table type: from Case to Str. The linearization of John is an inflection table. When using an NP, we select (!) the appropriate case from the table.

Concrete syntax: Latin

concrete ZeroLat of Zero = {
  lincat
    S, VP, V2 = Str ;
    NP = Case => Str ;
  lin
    Pred np vp = np ! Nom ++ vp ;
    Compl v2 np = np ! Acc ++ v2 ;
    John = table {Nom => "Ioannes" ; Acc => "Ioannem"} ;
    Mary = table {Nom => "Maria" ; Acc => "Mariam"} ;
    Love = "amat" ;
  param
    Case = Nom | Acc ;
}
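For readers more familiar with general-purpose languages, a GF table behaves much like a finite map keyed by a parameter type. The following Python sketch models the Latin rules under that assumption: an Enum stands in for the parameter type Case and dictionaries stand in for inflection tables. It is an analogy rather than GF code, and the Python names are hypothetical.

```python
from enum import Enum

class Case(Enum):          # param Case = Nom | Acc
    NOM = "Nom"
    ACC = "Acc"

# lincat NP = Case => Str: each noun phrase is an inflection table.
JOHN = {Case.NOM: "Ioannes", Case.ACC: "Ioannem"}
MARY = {Case.NOM: "Maria", Case.ACC: "Mariam"}
LOVE = "amat"              # V2 = Str

def compl(v2, np):
    """Compl v2 np = np ! Acc ++ v2: the object is accusative and precedes the verb."""
    return f"{np[Case.ACC]} {v2}"

def pred(np, vp):
    """Pred np vp = np ! Nom ++ vp: the subject is selected in the nominative."""
    return f"{np[Case.NOM]} {vp}"

print(pred(JOHN, compl(LOVE, MARY)))  # prints "Ioannes Mariam amat"
```

Dictionary lookup plays the role of the selection operator (!), and the object-before-verb order comes entirely from the rule for Compl.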

Discontinuous constituents, records

In Dutch, the verb heeft lief is a discontinuous constituent. The linearization type of V2 is a record type with two fields. The linearization of Love is a record. The values of fields are picked by projection (.).

Concrete syntax: Dutch

concrete ZeroDut of Zero = {
  lincat
    S, NP, VP = Str ;
    V2 = {v : Str ; p : Str} ;
  lin
    Pred np vp = np ++ vp ;
    Compl v2 np = v2.v ++ np ++ v2.p ;
    John = "Jan" ;
    Mary = "Marie" ;
    Love = {v = "heeft" ; p = "lief"} ;
}
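A record with named fields is enough to represent a verb whose parts are separated by the object. As a rough Python analogy (not GF code, with hypothetical names), a NamedTuple can stand in for the record type of V2, and attribute access for projection:

```python
from typing import NamedTuple

class V2(NamedTuple):      # lincat V2 = {v : Str ; p : Str}
    v: str                 # first part of the verb
    p: str                 # second part, placed after the object

LOVE = V2(v="heeft", p="lief")

def compl(v2, np):
    """Compl v2 np = v2.v ++ np ++ v2.p: the object sits between the two verb parts."""
    return f"{v2.v} {np} {v2.p}"

def pred(np, vp):
    """Pred np vp = np ++ vp."""
    return f"{np} {vp}"

print(pred("Jan", compl(LOVE, "Marie")))  # prints "Jan heeft Marie lief"
```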

Variable and inherent features, agreement, Unicode support

For Hebrew, NP has gender as its inherent feature: a field in the record. VP has gender as its variable feature: an argument of a table. In predication, the VP receives the gender of the NP.

Concrete syntax: Hebrew

concrete ZeroHeb of Zero = {
  flags coding = utf8 ;
  lincat
    S = Str ;
    NP = {s : Str ; g : Gender} ;
    VP, V2 = Gender => Str ;
  lin
    Pred np vp = np.s ++ vp ! np.g ;
    Compl v2 np = table {g => v2 ! g ++ "את" ++ np.s} ;
    John = {s = "ג׳ון" ; g = Masc} ;
    Mary = {s = "מרי" ; g = Fem} ;
    Love = table {Masc => "אוהב" ; Fem => "אוהבת"} ;
  param
    Gender = Masc | Fem ;
}
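Agreement combines the two mechanisms: the noun phrase carries its gender as a record field, and the verb phrase is a table waiting for a gender. The Python sketch below models this with plain dictionaries. It is an analogy under those assumptions, not GF code, and the Python names are hypothetical.

```python
# Hypothetical Python model of ZeroHeb: dictionaries stand in for GF records and tables.
JOHN = {"s": "ג׳ון", "g": "Masc"}          # NP = {s : Str ; g : Gender}
MARY = {"s": "מרי", "g": "Fem"}
LOVE = {"Masc": "אוהב", "Fem": "אוהבת"}    # V2 = Gender => Str

def compl(v2, np):
    """Build a VP table: for each gender, the verb form, the object marker, the object."""
    return {g: f"{form} את {np['s']}" for g, form in v2.items()}

def pred(np, vp):
    """The subject's inherent gender (np.g) selects the matching row of the VP table."""
    return f"{np['s']} {vp[np['g']]}"

print(pred(MARY, compl(LOVE, JOHN)))  # feminine verb form, agreeing with Mary
print(pred(JOHN, compl(LOVE, MARY)))  # masculine verb form, agreeing with John
```

Note that the object's gender is never consulted: only the subject's gender reaches the verb, which is the behaviour the Pred rule encodes.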

Visualizing parse trees

GF has built-in functions for visualizing parse trees and word alignments.

The following commands will generate parse trees for the given phrases and open the produced PNG image using the system's eog command.

> parse -lang=Eng "John loves Mary" | visualize_parse -view="eog"
> parse -lang=Dut "Jan heeft Marie lief" | visualize_parse -view="eog"

Generating word alignment

  1. In languages L1 and L2: link every word with its smallest spanning subtree.
  2. Delete the intervening tree, combining links directly from L1 to L2.

In general, this gives phrase alignment. Links can be crossing, phrases can be discontinuous. The align_words command follows a similar syntax:

> parse -lang=Fre "Marie aime Jean" | align_words -lang=Fre,Dut,Lat -view="eog"
Word alignment for "Marie aime Jean" in French, Dutch and Latin
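The two alignment steps can be illustrated with a simplified Python sketch. It is not GF's implementation. It assumes a fixed tree shape, Pred subj (Compl Love obj), and tags each output word with the tree node that produced it ("subj", "verb" or "obj"). In this tiny grammar every word comes from a lexical leaf, so that leaf is the word's smallest spanning subtree. The function names and node labels are hypothetical.

```python
def linearize(lang, subj, obj):
    """Return (word, node) pairs for the tree Pred subj (Compl Love obj)."""
    if lang == "Fre":
        names = {"John": "Jean", "Mary": "Marie"}
        return [(names[subj], "subj"), ("aime", "verb"), (names[obj], "obj")]
    if lang == "Dut":
        names = {"John": "Jan", "Mary": "Marie"}
        # The verb is discontinuous: both parts come from the same node.
        return [(names[subj], "subj"), ("heeft", "verb"),
                (names[obj], "obj"), ("lief", "verb")]
    if lang == "Lat":
        nom = {"John": "Ioannes", "Mary": "Maria"}
        acc = {"John": "Ioannem", "Mary": "Mariam"}
        return [(nom[subj], "subj"), (acc[obj], "obj"), ("amat", "verb")]
    raise ValueError(f"unknown language: {lang}")

def align(lang1, lang2, subj, obj):
    """Step 1 links words to tree nodes; step 2 joins links that share a node."""
    left = linearize(lang1, subj, obj)
    right = linearize(lang2, subj, obj)
    return [(w1, w2) for w1, n1 in left for w2, n2 in right if n1 == n2]

print(align("Fre", "Dut", "Mary", "John"))
# prints [('Marie', 'Marie'), ('aime', 'heeft'), ('aime', 'lief'), ('Jean', 'Jan')]
```

The French verb links to both parts of the Dutch verb, which shows a discontinuous phrase, and aligning Dutch with Latin yields crossing links because the object and verb appear in different orders.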

Resource Grammar Library

In natural language applications, libraries are a way to cope with thousands of details involved in syntax, lexicon, and inflection. The GF Resource Grammar Library is the standard library for Grammatical Framework. It covers the morphology and basic syntax for an increasing number of languages, currently including Afrikaans, Amharic (partial), Arabic (partial), Basque (partial), Bulgarian, Catalan, Chinese, Czech (partial), Danish, Dutch, English, Estonian, Finnish, French, German, Greek ancient (partial), Greek modern, Hebrew (fragments), Hindi, Hungarian (partial), Interlingua, Italian, Japanese, Korean (partial), Latin (partial), Latvian, Maltese, Mongolian, Nepali, Norwegian bokmål, Norwegian nynorsk, Persian, Polish, Punjabi, Romanian, Russian, Sindhi, Slovak (partial), Slovene (partial), Somali (partial), Spanish, Swahili (fragments), Swedish, Thai, Turkish (fragments), and Urdu. In addition, 14 languages have WordNet lexicon and large-scale parsing extensions. [3]

Full API documentation for the library can be found at the RGL Synopsis page. The RGL status document lists the languages currently available in the GF Resource Grammar Library, including their maturity.

Uses of GF

GF was first created in 1998 at Xerox Research Centre Europe, Grenoble, in the project Multilingual Document Authoring. At Xerox, it was used for prototypes including a restaurant phrase book, a database query system, a formalization of alarm system instructions with translations into five languages, and an authoring system for medical drug descriptions.

Later projects using GF and involving third parties include:

Academically, GF has been used in many PhD theses and has resulted in many scientific publications (see the GF publication list for some of them).

Commercially, GF has been used by a number of companies, in domains such as e-commerce, health care and translating formal specifications to natural language. [4]

Community

Developer mailing list

There is an active group for developers and users of GF alike, located at https://groups.google.com/group/gf-dev

Summer schools

2020 – GF as a resource for Computational Law (Singapore)

The seventh GF summer school, postponed due to COVID-19, is to be held in Singapore. Co-organised with the Singapore Management University's Centre for Computational Law, the summer school will have a special focus on computational law.

2018 – Sixth GF Summer School (Stellenbosch, South Africa)

The sixth GF summer school was the first one held outside Europe. The major themes of the summer school were African language resources, and the growing usage of GF in commercial applications.

2017 – GF in a Full Stack of Language Technology (Riga, Latvia)

The fifth GF summer school was held in Riga, Latvia. This summer school had a number of participants from startups, presenting industrial use cases of GF.

2016 – Summer School in Rule-Based Machine Translation (Alicante, Spain)

GF was one of the four platforms featured at the Summer School in Rule-Based Machine Translation, along with Apertium, Matxin and TectoMT.

2015 – Fourth GF Summer School (Gozo, Malta)

The fourth GF summer school was held on Gozo island in Malta. Like the previous edition in 2013, this summer school featured collaborations with other resources, such as Apertium and FrameNet.

2013 – Scaling Up Grammatical Resources (Lake Chiemsee, Germany)

The third GF summer school was held on Frauenchiemsee island in Bavaria, Germany, with the special theme "Scaling up Grammar Resources". This summer school focused on extending the existing resource grammars with the ultimate goal of dealing with any text in the supported languages. Lexicon extension is an obvious part of this work, but new grammatical constructions were also of interest. There was a special interest in porting resources from other open-source approaches, such as WordNets and Apertium, and reciprocally making GF resources easily reusable in other approaches.

2011 – Frontiers of Multilingual Technologies (Barcelona, Spain)

The second GF summer school, subtitled Frontiers of Multilingual Technologies, was held in 2011 in Barcelona, Spain. It was sponsored by CLT, the Centre for Language Technology of the University of Gothenburg, and by UPC, Universitat Politècnica de Catalunya. The school addressed new languages and also promoted ongoing work on languages that were already under construction. Missing EU languages were especially encouraged.

The school began with a 2-day GF tutorial, serving those interested in getting an introduction to GF or an overview of on-going work.

All results of the summer school are available as open-source software released under the LGPL license.

2009 – GF Summer School (Gothenburg, Sweden)

Group photo from the 2009 GF Summer School in Gothenburg, Sweden

The first GF summer school was held in 2009 in Gothenburg, Sweden. It was a collaborative effort to create grammars of new languages in Grammatical Framework. These grammars were added to the Resource Grammar Library, which previously had 12 languages. Around 10 new languages were already under construction, and the school aimed to address 23 new languages. All results of the summer school were made available as open-source software released under the LGPL license.

The summer school was organized by the Language Technology Group at the Department of Computer Science and Engineering. The group is a part of the Centre of Language Technology, a focus research area of the University of Gothenburg.

The code created by the school participants is made accessible in the GF darcs repository, subdirectory contrib/summerschool.

References

  1. Ranta, Aarne (2011). Grammatical Framework: Programming with Multilingual Grammars. CSLI Publications, Center for the Study of Language and Information. pp. 8–9. ISBN 978-1-57586-627-7.
  2. LREC 2010 tutorial
  3. "A WordNet in GF". GitHub. 16 October 2021.
  4. "Customers".