Definite clause grammar

Last updated

A definite clause grammar (DCG) is a way of expressing grammar, either for natural or formal languages, in a logic programming language such as Prolog. It is closely related to the concept of attribute grammars / affix grammars . DCGs are usually associated with Prolog, but similar languages such as Mercury also include DCGs. They are called definite clause grammars because they represent a grammar as a set of definite clauses in first-order logic.

Contents

The term DCG refers to the specific type of expression in Prolog and other similar languages; not all ways of expressing grammars using definite clauses are considered DCGs. However, all of the capabilities or properties of DCGs will be the same for any grammar that is represented with definite clauses in essentially the same way as in Prolog.

The definite clauses of a DCG can be considered a set of axioms where the validity of a sentence, and the fact that it has a certain parse tree can be considered theorems that follow from these axioms. [1] This has the advantage of making it so that recognition and parsing of expressions in a language becomes a general matter of proving statements, such as statements in a logic programming language.

History

The history of DCGs is closely tied to the history of Prolog, and the history of Prolog revolves around several researchers in both Marseille, France, and Edinburgh, Scotland. According to Robert Kowalski, an early developer of Prolog, the first Prolog system was developed in 1972 by Alain Colmerauer and Phillipe Roussel. [2] The first program written in the language was a large natural-language processing system. Fernando Pereira and David Warren at the University of Edinburgh were also involved in the early development of Prolog.

Colmerauer had previously worked on a language processing system called Q-systems that was used to translate between English and French. [3] In 1978, Colmerauer wrote a paper about a way of representing grammars called metamorphosis grammars which were part of the early version of Prolog called Marseille Prolog. In this paper, he gave a formal description of metamorphosis grammars and some examples of programs that use them.

Fernando Pereira and David Warren, two other early architects of Prolog, coined the term "definite clause grammar" and created the notation for DCGs that is used in Prolog today. They gave credit for the idea to Colmerauer and Kowalski, and they note that DCGs are a special case of Colmerauer's metamorphosis grammars. They introduced the idea in an article called "Definite Clause Grammars for Language Analysis", where they describe DCGs as a "formalism ... in which grammars are expressed clauses of first-order predicate logic" that "constitute effective programs of the programming language Prolog". [4]

Pereira, Warren, and other pioneers of Prolog later wrote about several other aspects of DCGs. Pereira and Warren wrote an article called "Parsing as Deduction", describing things such as how the Earley Deduction proof procedure is used for parsing. [5] Pereira also collaborated with Stuart M. Shieber on a book called "Prolog and Natural Language Analysis", that was intended as a general introduction to computational linguistics using logic programming. [6]

Example

A basic example of DCGs helps to illustrate what they are and what they look like.

sentence-->noun_phrase,verb_phrase.noun_phrase-->det,noun.verb_phrase-->verb,noun_phrase.det-->[the].det-->[a].noun-->[cat].noun-->[bat].verb-->[eats].

This generates sentences such as "the cat eats the bat", "a bat eats the cat". One can generate all of the valid expressions in the language generated by this grammar at a Prolog interpreter by typing sentence(X,[]). Similarly, one can test whether a sentence is valid in the language by typing something like sentence([the,bat,eats,the,bat],[]).

Translation into definite clauses

DCG notation is just syntactic sugar for normal definite clauses in Prolog. For example, the previous example could be translated into the following:

sentence(A,Z):-noun_phrase(A,B),verb_phrase(B,Z).noun_phrase(A,Z):-det(A,B),noun(B,Z).verb_phrase(A,Z):-verb(A,B),noun_phrase(B,Z).det([the|X],X).det([a|X],X).noun([cat|X],X).noun([bat|X],X).verb([eats|X],X).

Difference lists

The arguments to each functor, such as (A,B) and (B,Z) are difference lists; difference lists are a way of representing a prefix of a list as the difference between its two suffixes (the bigger including the smaller one). Using Prolog's notation for lists, a singleton list prefix P = [H] can be seen as the difference between [H|X] and X, and thus represented with the pair ([H|X],X), for instance.

Saying that P is the difference between A and B is the same as saying that append(P,B,A) holds. Or in the case of the previous example, append([H],X,[H|X]).

Difference lists are used to represent lists with DCGs for reasons of efficiency. It is much more efficient to concatenate list differences (prefixes), in the circumstances that they can be used, because the concatenation of (A,B) and (B,Z) is just (A,Z). [7]

Indeed, append(P,B,A), append(Q,Z,B) entails append(P,Q,S), append(S,Z,A). This is the same as saying that list concatenation is associative:

A = P + B = P + (Q + Z) = (P + Q) + Z = S + Z = A

Non-context-free grammars

In pure Prolog, normal DCG rules with no extra arguments on the functors, such as the previous example, can only express context-free grammars; there is only one argument on the left side of the production. However, context-sensitive grammars can also be expressed with DCGs, by providing extra arguments, such as in the following example:

s-->a(N),b(N),c(N).a(0)-->[].a(M)-->[a],a(N),{MisN+1}.b(0)-->[].b(M)-->[b],b(N),{MisN+1}.c(0)-->[].c(M)-->[c],c(N),{MisN+1}.

This set of DCG rules describes the grammar which generates the language that consists of strings of the form . [8]

s-->symbols(Sem,a),symbols(Sem,b),symbols(Sem,c).symbols(end,_)-->[].symbols(s(Sem),S)-->[S],symbols(Sem,S).

This set of DCG rules describes the grammar which generates the language that consists of strings of the form , by structurally representing n[ citation needed ]

Representing features

Various linguistic features can also be represented fairly concisely with DCGs by providing extra arguments to the functors. [9] For example, consider the following set of DCG rules:

sentence-->pronoun(subject),verb_phrase.verb_phrase-->verb,pronoun(object).pronoun(subject)-->[he].pronoun(subject)-->[she].pronoun(object)-->[him].pronoun(object)-->[her].verb-->[likes].

This grammar allows sentences like "he likes her" and "he likes him", but not "her likes he" and "him likes him".

Parsing with DCGs

An example parse tree for this grammar. The bat eats a cat tree.png
An example parse tree for this grammar.

The main practical use of a DCG is to parse sentences of the given grammar, i.e. to construct a parse tree. This can be done by providing "extra arguments" to the functors in the DCG, like in the following rules:

sentence(s(NP,VP))-->noun_phrase(NP),verb_phrase(VP).noun_phrase(np(D,N))-->det(D),noun(N).verb_phrase(vp(V,NP))-->verb(V),noun_phrase(NP).det(d(the))-->[the].det(d(a))-->[a].noun(n(bat))-->[bat].noun(n(cat))-->[cat].verb(v(eats))-->[eats].

One can now query the interpreter to yield a parse tree of any given sentence:

|?-sentence(Parse_tree,[the,bat,eats,a,cat],[]).Parse_tree=s(np(d(the),n(bat)),vp(v(eats),np(d(a),n(cat))))?;

Other uses

DCGs can serve as a convenient syntactic sugar to hide certain parameters in code in other places besides parsing applications. In the declaratively pure programming language Mercury I/O must be represented by a pair of io.state arguments. DCG notation can be used to make using I/O more convenient, [10] although state variable notation is usually preferred. [11] DCG notation is also used for parsing and similar things in Mercury as it is in Prolog.

Extensions

Since DCGs were introduced by Pereira and Warren, several extensions have been proposed. Pereira himself proposed an extension called extraposition grammars (XGs). [12] This formalism was intended in part to make it easier to express certain grammatical phenomena, such as left-extraposition. Pereira states, "The difference between XG rules and DCG rules is then that the left-hand side of an XG rule may contain several symbols." This makes it easier to express rules for context-sensitive grammars.

Peter Van Roy extended DCGs to allow multiple accumulators. [13] [14]

Another, more recent, extension was made by researchers at NEC Corporation called Multi-Modal Definite Clause Grammars (MM-DCGs) in 1995. Their extensions were intended to allow the recognizing and parsing expressions that include non-textual parts such as pictures. [15]

Another extension, called definite clause translation grammars (DCTGs) was described by Harvey Abramson in 1984. [16] DCTG notation looks very similar to DCG notation; the major difference is that one uses ::= instead of --> in the rules. It was devised to handle grammatical attributes conveniently. [17] The translation of DCTGs into normal Prolog clauses is like that of DCGs, but 3 arguments are added instead of 2.

See also

Notes

  1. Johnson, M. (1994). "Two ways of formalizing grammars". Linguistics and Philosophy. 17 (3): 221–240. doi:10.1007/BF00985036. S2CID   62165766.
  2. Kowalski, Robert A. (January 1988). "The Early Years of Logic Programming" (PDF). Communications of the ACM. 31 (1): 38–43. doi:10.1145/35043.35046. S2CID   12259230.
  3. Colmerauer, A. (1978). "Metamorphosis grammars". Natural Language Communication with Computers. Lecture Notes in Computer Science. Vol. 63. pp. 133–189. doi:10.1007/BFb0031371. ISBN   3-540-08911-X.
  4. Pereira, Fernando C. N.; Warren, David H. D. (1980). "Definite Clause Grammars for Language Analysis—A Survey of the Formalism and a Comparison with Augmented Transition Networks" (PDF). Artificial Intelligence. 13 (3): 231–278. doi:10.1016/0004-3702(80)90003-X.
  5. Pereira, F. C. N.; D. H. D. Warren (1983). "Parsing as deduction" (PDF). Proceedings of the 21st annual meeting on Association for Computational Linguistics. Association for Computational Linguistics Morristown, NJ, USA. pp. 137–144.
  6. Pereira, F. C. N.; S. M. Shieber (2002). Prolog and natural-language analysis. Microtome Publishing. ISBN   9780971977709.
  7. Fleck, Arthur. "Definite Clause Grammar Translation" . Retrieved 2009-04-16.
  8. Fisher, J. R. "Prolog Tutorial -- 7.1" . Retrieved 2016-05-17.
  9. "DCGs give us a Natural Notation for Features" . Retrieved 2009-04-21.
  10. "Prolog to Mercury Transition Guide: Input/Output" . Retrieved 2015-03-26.
  11. "The Mercury Language Reference Manual: DCG-rules". www.mercurylang.org. Retrieved 2023-04-07.
  12. Pereira, F. (1981). "Extraposition grammars" (PDF). Computational Linguistics. 7 (4): 243–256.
  13. Van Roy, Peter (1990). "Extended DCG Notation: A Tool for Applicative Programming in Prolog". UCB Technical Report. 90 (583).
  14. Source code is available at .
  15. Shimazu, H.; Y. Takashima (1995). "Multimodal definite clause grammar" (PDF). Systems and Computers in Japan. 26 (3): 93–102. doi:10.1002/scj.4690260309. S2CID   8199913.
  16. Abramson, H. (April 1984). Definite clause translation grammars (PDF) (Technical report). 84-3.
  17. Sperberg-McQueen, C. M. "A brief introduction to definite clause grammars and definite clause translation grammars" . Retrieved 2009-04-21.

Related Research Articles

Prolog is a logic programming language associated with artificial intelligence and computational linguistics.

A noun phrase, or nominal (phrase), is a phrase that has a noun or pronoun as its head or performs the same grammatical function as a noun. Noun phrases are very common cross-linguistically, and they may be the most frequently occurring phrase type.

In language, a clause is a constituent that comprises a semantic predicand and a semantic predicate. A typical clause consists of a subject and a syntactic predicate, the latter typically a verb phrase composed of a verb with any objects and other modifiers. However, the subject is sometimes unvoiced if it is retrievable from context, especially in null-subject language but also in other languages, including English instances of the imperative mood.

<span class="mw-page-title-main">Parse tree</span> Tree in formal language theory

A parse tree or parsing tree or derivation tree or concrete syntax tree is an ordered, rooted tree that represents the syntactic structure of a string according to some context-free grammar. The term parse tree itself is used primarily in computational linguistics; in theoretical syntax, the term syntax tree is more common.

In linguistics, a determiner phrase (DP) is a type of phrase headed by a determiner such as many. Controversially, many approaches, take a phrase like not very many apples to be a DP, headed, in this case, by the determiner many. This is called the DP analysis or the DP hypothesis. Others reject this analysis in favor of the more traditional NP analysis where apples would be the head of the phrase in which the DP not very many is merely a dependent. Thus, there are competing analyses concerning heads and dependents in nominal groups. The DP analysis developed in the late 1970s and early 1980s, and it is the majority view in generative grammar today.

<span class="mw-page-title-main">Crow language</span> Missouri Valley Siouan language of Montana, US

Crow is a Missouri Valley Siouan language spoken primarily by the Crow Nation in present-day southeastern Montana. The word, Apsáalooke, translates to "children of the raven." It is one of the larger populations of American Indian languages with 2,480 speakers according to the 1990 US Census.

Categorial grammar is a family of formalisms in natural language syntax that share the central assumption that syntactic constituents combine as functions and arguments. Categorial grammar posits a close relationship between the syntax and semantic composition, since it typically treats syntactic categories as corresponding to semantic types. Categorial grammars were developed in the 1930s by Kazimierz Ajdukiewicz and in the 1950s by Yehoshua Bar-Hillel and Joachim Lambek. It saw a surge of interest in the 1970s following the work of Richard Montague, whose Montague grammar assumed a similar view of syntax. It continues to be a major paradigm, particularly within formal semantics.

In linguistics, branching refers to the shape of the parse trees that represent the structure of sentences. Assuming that the language is being written or transcribed from left to right, parse trees that grow down and to the right are right-branching, and parse trees that grow down and to the left are left-branching. The direction of branching reflects the position of heads in phrases, and in this regard, right-branching structures are head-initial, whereas left-branching structures are head-final. English has both right-branching (head-initial) and left-branching (head-final) structures, although it is more right-branching than left-branching. Some languages such as Japanese and Turkish are almost fully left-branching (head-final). Some languages are mostly right-branching (head-initial).

In computer science, a Van Wijngaarden grammar is a formalism for defining formal languages. The name derives from the formalism invented by Adriaan van Wijngaarden for the purpose of defining the ALGOL 68 programming language. The resulting specification remains its most notable application.

An adpositional phrase is a syntactic category that includes prepositional phrases, postpositional phrases, and circumpositional phrases. Adpositional phrases contain an adposition as head and usually a complement such as a noun phrase. Language syntax treats adpositional phrases as units that act as arguments or adjuncts. Prepositional and postpositional phrases differ by the order of the words used. Languages that are primarily head-initial such as English predominantly use prepositional phrases whereas head-final languages predominantly employ postpositional phrases. Many languages have both types, as well as circumpositional phrases.

In linguistics, head directionality is a proposed parameter that classifies languages according to whether they are head-initial or head-final. The head is the element that determines the category of a phrase: for example, in a verb phrase, the head is a verb. Therefore, head initial would be "VO" languages and head final would be "OV" languages.

Tsez, also known as Dido, is a Northeast Caucasian language with about 15,000 speakers spoken by the Tsez, a Muslim people in the mountainous Tsunta District of southwestern Dagestan in Russia. The name is said to derive from the Tsez word for "eagle", but this is most likely a folk etymology. The name Dido is derived from the Georgian word დიდი, meaning "big".

In linguistics, a small clause consists of a subject and its predicate, but lacks an overt expression of tense. Small clauses have the semantic subject-predicate characteristics of a clause, and have some, but not all, properties of a constituent. Structural analyses of small clauses vary according to whether a flat or layered analysis is pursued. The small clause is related to the phenomena of raising-to-object, exceptional case-marking, accusativus cum infinitivo, and object control.

In linguistics, inverse copular constructions, named after Moro (1997), are a type of inversion in English where canonical SCP word order is reversed in a sense, so that one appears to have the order PCS instead. The verb in these constructions is always the copula be. Inverse copular constructions are intriguing because they render the distinction between subject and predicative expression difficult to maintain. The confusion has led to focused study of these constructions, and their impact on the theory of grammar may be great since they appear to challenge the initial binary division of the sentence (S) into a subject noun phrase (NP) and a predicate verb phrase (VP), this division being at the core of all phrase structure grammars.

Araki is a nearly extinct language spoken in the small island of Araki, south of Espiritu Santo Island in Vanuatu. Araki is gradually being replaced by Tangoa, a language from a neighbouring island.

A reciprocal pronoun is a pronoun that indicates a reciprocal relationship. A reciprocal pronoun can be used for one of the participants of a reciprocal construction, i.e. a clause in which two participants are in a mutual relationship. The reciprocal pronouns of English are one another and each other, and they form the category of anaphors along with reflexive pronouns.

Combinatory categorial grammar (CCG) is an efficiently parsable, yet linguistically expressive grammar formalism. It has a transparent interface between surface syntax and underlying semantic representation, including predicate–argument structure, quantification and information structure. The formalism generates constituency-based structures and is therefore a type of phrase structure grammar.

The syntax and semantics of Prolog, a programming language, are the sets of rules that define how a Prolog program is written and how it is interpreted, respectively. The rules are laid out in ISO standard ISO/IEC 13211 although there are differences in the Prolog implementations.

Subject–verb inversion in English is a type of inversion marked by a predicate verb that precedes a corresponding subject, e.g., "Beside the bed stood a lamp". Subject–verb inversion is distinct from subject–auxiliary inversion because the verb involved is not an auxiliary verb.

<span class="mw-page-title-main">Daakaka language</span> Austronesian language spoken in Vanuatu

Daakaka is a native language of Ambrym, Vanuatu. It is spoken by about one thousand speakers in the south-western corner of the island.