Universal Dependencies

Universal Dependencies, frequently abbreviated as UD, is an international cooperative project to create treebanks of the world's languages. [1] These treebanks are openly accessible. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology. The project's primary aim is to achieve cross-linguistic consistency of annotation, while still permitting language-specific extensions when necessary. The annotation scheme has its roots in three related projects: Stanford Dependencies, [2] Google universal part-of-speech tags, [3] and the Interset interlingua [4] for morphosyntactic tagsets. The UD annotation scheme represents sentences as dependency trees rather than phrase structure trees. As of January 2022, the UD inventory contains just over 200 treebanks covering more than 100 languages.

Dependency structures

The UD annotation scheme produces syntactic analyses of sentences in terms of the dependencies of dependency grammar. Each dependency is characterized in terms of a syntactic function, which is shown using a label on the dependency edge. For example: [5]

[Figure: UD analysis of an English double object construction, which qualifies as an extended transitive clause]

This analysis shows that she, him, and a note are dependents of the verb left. The pronoun she is identified as a nominal subject (nsubj), the pronoun him as an indirect object (iobj), and the noun phrase a note as a direct object (obj) -- there is a further dependency that connects a to note, although it is not shown.
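
UD treebanks are distributed in the plain-text CoNLL-U format, in which each token occupies one line and the head and relation occupy dedicated columns. The following minimal Python sketch (illustrative only, not part of the UD tooling) encodes the analysis above in a simplified four-column variant of that format and prints each dependent, its relation, and its head. Real CoNLL-U files have ten tab-separated columns per token, including lemma, POS tag, and morphological feature columns that are omitted here.

    # Simplified token table for "She left him a note":
    # ID, FORM, HEAD (0 = root), DEPREL.
    data = """
    1 She  2 nsubj
    2 left 0 root
    3 him  2 iobj
    4 a    5 det
    5 note 2 obj
    """

    rows = [line.split() for line in data.split("\n") if line.strip()]
    forms = {row[0]: row[1] for row in rows}
    forms["0"] = "ROOT"

    # Print each dependency as: dependent --relation--> head.
    for tok_id, form, head, rel in rows:
        print(f"{form} --{rel}--> {forms[head]}")

A second example: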

Analysis of the sentence "it is for her".png

This analysis identifies it as the subject (nsubj), is as the copula (cop), and for as a case marker (case), all of which are shown as dependents of the root word her, which is a pronoun. The next example includes an expletive and an oblique object:

Analysis of the sentence "there is food in the kitchen".png

This analysis identifies there as an expletive (expl), food as a nominal subject (nsubj), kitchen as an oblique object (obl), and in as a case marker (case) -- there is also a dependency connecting the to kitchen, but it is not shown. In this case the copula is is positioned as the root of the sentence, contrary to how the copula is analyzed in the second example above, where it appears as a dependent of the root.
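
In the same simplified notation used in the sketch above (ID, FORM, HEAD, DEPREL), the contrast between the two copula analyses comes down to which token carries HEAD 0. These fragments are illustrative reconstructions of the analyses just described:

    # "it is for her": the pronoun her is the root; the copula depends on it.
    1 it   4 nsubj
    2 is   4 cop
    3 for  4 case
    4 her  0 root

    # "there is food in the kitchen": here the copula is itself is the root.
    1 there   2 expl
    2 is      0 root
    3 food    2 nsubj
    4 in      6 case
    5 the     6 det
    6 kitchen 2 obl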

The examples of UD annotation just provided can of course give only an impression of the nature of the UD project and its annotation scheme. The emphasis for UD is on producing cross-linguistically consistent dependency analyses in order to facilitate structural parallelism across diverse languages. To this end, UD uses a universal POS tagset for all languages, although a given language need not make use of every tag. More specific information can be added to each word by means of a free morphosyntactic feature set. The universal dependency labels can further be refined with language-specific subtypes, which are indicated after a colon, e.g. nsubj:pass, following the "universal:extension" format.
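
As a concrete sketch (again illustrative, not UD tooling): the universal POS tagset is a closed set of 17 tags shared by all treebanks, and a subtyped relation label splits at its first colon into the universal relation and the language-specific extension.

    # The 17 universal POS tags, identical across all UD treebanks.
    UPOS = {"ADJ", "ADP", "ADV", "AUX", "CCONJ", "DET", "INTJ", "NOUN", "NUM",
            "PART", "PRON", "PROPN", "PUNCT", "SCONJ", "SYM", "VERB", "X"}

    def split_deprel(label):
        """Split a label such as 'nsubj:pass' into (universal, extension)."""
        universal, _, extension = label.partition(":")
        return universal, extension or None

    print(split_deprel("nsubj:pass"))  # ('nsubj', 'pass')
    print(split_deprel("obl"))         # ('obl', None)
    print("AUX" in UPOS)               # True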

Function words

Within the dependency grammar community, the UD annotation scheme is controversial. The main bone of contention concerns the analysis of function words. UD chooses to subordinate function words to content words, [6] a practice that runs contrary to most work in the tradition of dependency grammar. [7] To briefly illustrate the controversy, UD produces the following structural analysis of an example sentence:

[Figure: hierarchical analysis of function words as advocated by UD, after Osborne et al.]

This example is taken from Osborne & Gerdes (2019). [8] Note that the diagram uses a different convention for showing dependencies than the examples above. Since the syntactic functions are not important for the point at hand, they are omitted from this structural analysis. What is important is the manner in which this UD analysis subordinates the auxiliary verb will to the content verb say, the preposition to to the pronoun you, the subordinator that to the content verb likes, and the particle to to the content verb swim.

A more traditional dependency grammar analysis of this sentence, one that is motivated more by syntactic considerations than by semantic ones, looks like this: [9]

[Figure: traditional dependency grammar analysis of an English sentence]

This traditional analysis subordinates the content verb say to the auxiliary verb will, the pronoun you to the preposition to, the content verb likes to the subordinator that, and the content verb swim to the particle to.
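
The difference between the two analyses can be expressed as a mechanical tree transformation. The Python sketch below is a simplification under the assumption of one function word per content-word head, with a made-up relation label "comp" for the inverted edge: it promotes a function word over its content-word head, turning the UD-style edge into the traditional one. A full conversion would also re-attach the demoted verb's other dependents, such as the subject, as the traditional tree does.

    def promote(tokens, func_id):
        """Swap a function word with its content-word head.
        tokens maps ID -> {"form", "head", "deprel"}; head 0 marks the root."""
        func = tokens[func_id]
        head = tokens[func["head"]]
        # The function word takes over its head's position in the tree ...
        func["head"], func["deprel"] = head["head"], head["deprel"]
        # ... and the former head becomes its dependent ("comp" is an
        # illustrative label, not an official UD relation).
        head["head"], head["deprel"] = func_id, "comp"

    # UD-style analysis of "she will swim": the auxiliary depends on the verb.
    toks = {
        1: {"form": "she",  "head": 3, "deprel": "nsubj"},
        2: {"form": "will", "head": 3, "deprel": "aux"},
        3: {"form": "swim", "head": 0, "deprel": "root"},
    }
    promote(toks, 2)  # now 'will' is the root and 'swim' its dependent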

Notes

  1. de Marneffe, Marie-Catherine; Manning, Christopher D.; Nivre, Joakim; Zeman, Daniel (13 July 2021). "Universal Dependencies". Computational Linguistics. 47 (2): 255–308. doi:10.1162/coli_a_00402. S2CID 219304854.
  2. "Stanford Dependencies". nlp.stanford.edu. The Stanford Natural Language Processing Group. Retrieved 8 May 2020.
  3. Petrov, Slav (11 Apr 2011). "A Universal Part-of-Speech Tagset". arXiv:1104.2086 [cs.CL].
  4. "Interset". cuni.cz. Institute of Formal and Applied Linguistics (Czech Republic). Retrieved 8 May 2020.
  5. The three example analyses that appear in this section are taken from the UD website, examples 3, 21, and 23.
  6. This choice follows Nivre (2015).
  7. The controversy surrounding UD and the status of function words in dependency grammar in general are discussed at length in Osborne & Gerdes (2019).
  8. This structure is (1b) in the Osborne & Gerdes (2019) article.
  9. This structure is (1c) in the Osborne & Gerdes (2019) article.
