Treebank

Most syntactic treebanks annotate variants of either phrase structure (left) or dependency structure (right).

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data. [1]

Etymology

The term treebank was coined by linguist Geoffrey Leech in the 1980s, by analogy to other repositories such as a seedbank or bloodbank. [2] This is because both syntactic and semantic structure are commonly represented compositionally as a tree structure. The term parsed corpus is often used interchangeably with the term treebank, with the emphasis on the primacy of sentences rather than trees.

Construction

Treebanks are often created on top of a corpus that has already been annotated with part-of-speech tags. In turn, treebanks are sometimes enhanced with semantic or other linguistic information. Treebanks can be created completely manually, where linguists annotate each sentence with syntactic structure, or semi-automatically, where a parser assigns some syntactic structure which linguists then check and, if necessary, correct. In practice, fully checking and completing the parsing of natural language corpora is a labour-intensive project that can take teams of graduate linguists several years. The level of annotation detail and the breadth of the linguistic sample determine the difficulty of the task and the length of time required to build a treebank.

Example phrase structure tree for John loves Mary
Hybrid constituency/dependency tree from the Quranic Arabic Corpus

Some treebanks follow a specific linguistic theory in their syntactic annotation (e.g. the BulTreeBank follows HPSG) but most try to be less theory-specific. However, two main groups can be distinguished: treebanks that annotate phrase structure (for example the Penn Treebank or ICE-GB) and those that annotate dependency structure (for example the Prague Dependency Treebank or the Quranic Arabic Dependency Treebank).
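The difference between the two groups can be illustrated with a small data-structure sketch in Python. It encodes the sentence John loves Mary once as a phrase structure tree and once as a dependency structure. The tuple layout, the helper names `leaves` and `dependents`, and the relation labels are assumptions made for illustration and do not reproduce the format of any particular treebank.

```python
# Phrase structure: nested (label, child, ...) tuples; leaves are plain strings.
phrase_structure = (
    "S",
    ("NP", ("NNP", "John")),
    ("VP", ("VBZ", "loves"), ("NP", ("NNP", "Mary"))),
)

# Dependency structure: one (word, head_index, relation) entry per word.
# Indices are 1-based and head 0 marks the root; relation names are illustrative.
dependency_structure = [
    ("John", 2, "nsubj"),
    ("loves", 0, "root"),
    ("Mary", 2, "obj"),
]


def leaves(tree):
    """Return the words of a phrase structure tree in left-to-right order."""
    if isinstance(tree, str):
        return [tree]
    words = []
    for child in tree[1:]:
        words.extend(leaves(child))
    return words


def dependents(deps, head_index):
    """Return the words directly governed by the word at head_index."""
    return [word for word, head, _ in deps if head == head_index]
```

In the phrase structure encoding the grouping of words into constituents is explicit, whereas in the dependency encoding every word points directly at its governing word and there are no phrasal nodes.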

It is important to clarify the distinction between the formal representation and the file format used to store the annotated data. Treebanks are necessarily constructed according to a particular grammar. The same grammar may be implemented by different file formats. For example, the syntactic analysis for John loves Mary, shown in the figure on the right, may be represented by simple labelled brackets in a text file, like this (following the Penn Treebank notation):

(S (NP (NNP John))
   (VP (VBZ loves)
       (NP (NNP Mary)))
   (. .))

This type of representation is popular because it is light on resources, and the tree structure is relatively easy to read without software tools. However, as corpora become increasingly complex, other file formats may be preferred. Alternatives include treebank-specific XML schemes, numbered indentation and various types of standoff notation.
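As a minimal sketch of how labelled brackets can be read by software, the following Python code parses a bracketed string into nested lists. The function names `parse_brackets` and `terminals` are illustrative and not taken from any treebank toolkit, and the sketch assumes well-formed input in which every opening bracket is followed by a label.

```python
import re


def parse_brackets(text):
    """Parse a labelled-bracket string into nested lists: [label, child, ...]."""
    # Tokens are "(", ")" or any run of characters without spaces or brackets.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def parse_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError(f"expected '(' at token {pos}")
        pos += 1                       # consume "("
        node = [tokens[pos]]           # the token after "(" is the node label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.append(parse_node())    # nested constituent
            else:
                node.append(tokens[pos])     # terminal word
                pos += 1
        pos += 1                       # consume ")"
        return node

    tree = parse_node()
    if pos != len(tokens):
        raise ValueError("unexpected material after the tree")
    return tree


def terminals(tree):
    """Return the words of a parsed tree in left-to-right order."""
    words = []
    for child in tree[1:]:
        if isinstance(child, list):
            words.extend(terminals(child))
        else:
            words.append(child)
    return words


example = "(S (NP (NNP John)) (VP (VBZ loves) (NP (NNP Mary))) (. .))"
tree = parse_brackets(example)
```

Here `tree[0]` is the root label `S` and `terminals(tree)` recovers the original word sequence, including the final full stop. Unbalanced input is not handled gracefully in this sketch.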

Applications

From a computational linguistics [3] perspective, treebanks have been used to engineer state-of-the-art natural language processing systems such as part-of-speech taggers, parsers, semantic analyzers and machine translation systems. [4] Most computational systems use gold-standard treebank data. However, an automatically parsed corpus that has not been corrected by human linguists can still be useful: applying a parser to large amounts of text and gathering rule frequencies provides evidence that can be used to improve the parser. Only by correcting and completing a corpus by hand, however, is it possible to identify rules that are absent from the parser's knowledge base. Frequencies drawn from a hand-corrected corpus are also likely to be more accurate.
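The idea of gathering rule frequencies from a parsed corpus can be sketched in a few lines of Python. The example assumes trees stored as nested lists of the form [label, child, ...] and uses a toy corpus of two sentences, the second of which is invented for illustration. The function names are hypothetical, and the relative-frequency estimate shown is one simple choice rather than the method of any particular parser.

```python
from collections import Counter


def productions(tree):
    """Yield (parent, (child labels or words...)) for every node of a tree."""
    label, children = tree[0], tree[1:]
    rhs = tuple(c[0] if isinstance(c, list) else c for c in children)
    yield (label, rhs)
    for child in children:
        if isinstance(child, list):
            yield from productions(child)


def rule_counts(corpus):
    """Count how often each rule occurs across all trees in the corpus."""
    counts = Counter()
    for tree in corpus:
        counts.update(productions(tree))
    return counts


def relative_frequencies(counts):
    """Divide each rule count by the total count of rules with the same parent."""
    lhs_totals = Counter()
    for (lhs, _), n in counts.items():
        lhs_totals[lhs] += n
    return {rule: n / lhs_totals[rule[0]] for rule, n in counts.items()}


# Toy corpus (illustrative only): "John loves Mary" and "Mary sleeps".
toy_corpus = [
    ["S", ["NP", ["NNP", "John"]],
          ["VP", ["VBZ", "loves"], ["NP", ["NNP", "Mary"]]]],
    ["S", ["NP", ["NNP", "Mary"]],
          ["VP", ["VBZ", "sleeps"]]],
]

counts = rule_counts(toy_corpus)
probs = relative_frequencies(counts)
```

In this toy corpus the rule VP → VBZ NP and the rule VP → VBZ each account for half of the VP expansions. A rule that never occurs in the parsed data receives no count at all, which is why uncorrected parser output cannot reveal rules missing from the parser's own grammar.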

In corpus linguistics, treebanks are used to study syntactic phenomena (for example, diachronic corpora can be used to study the time course of syntactic change). Once parsed, a corpus will contain frequency evidence showing how common different grammatical structures are in use. Treebanks also provide evidence of coverage and support the discovery of new, unanticipated, grammatical phenomena.

Another use of treebanks in theoretical linguistics and psycholinguistics is interaction evidence. A completed treebank can help linguists carry out experiments as to how the decision to use one grammatical construction tends to influence the decision to form others, and to try to understand how speakers and writers make decisions as they form sentences. Interaction research is particularly fruitful as further layers of annotation, e.g. semantic, pragmatic, are added to a corpus. It is then possible to evaluate the impact of non-syntactic phenomena on grammatical choices.

Annotated treebank data has also been used in syntactic research to test linguistic theories of sentence structure against large quantities of naturally occurring examples. [citation needed]

Semantic treebanks

A semantic treebank is a collection of natural language sentences annotated with a meaning representation. These resources use a formal representation of each sentence's semantic structure. Semantic treebanks vary in the depth of their semantic representation. A notable example of deep semantic annotation is the Groningen Meaning Bank, developed at the University of Groningen and annotated using Discourse Representation Theory. An example of a shallow semantic treebank is PropBank, which provides annotation of verbal propositions and their arguments, without attempting to represent every word in the corpus in logical form.

| Language | Treebank | Semantic formalism | Distribution / License |
| --- | --- | --- | --- |
| Chinese | Chinese Universal Propositions | PropBank semantics | CC BY-NC-SA 3.0 US |
| English | Abstract Meaning Representation (AMR) Bank | Deep semantics | ? |
| English | FrameNet | Shallow semantics | ? |
| English | Universal Conceptual Cognitive Annotation (UCCA) | Deep semantics | ? |
| English | Robot Commands Treebank [5] | Deep semantics | ? |
| English | Groningen Meaning Bank | Deep semantics | different licenses |
| English | Parallel Meaning Bank | Deep semantics | different licenses |
| Dutch | Parallel Meaning Bank | Deep semantics | different licenses |
| German | Parallel Meaning Bank | Deep semantics | different licenses |
| Italian | Parallel Meaning Bank | Deep semantics | different licenses |
| English | DeepBank project | Deep semantics | ? |
| English | Treebank Semantics Parsed Corpus | Deep semantics | ? |
| English | RoboCup Corpus | Deep semantics | ? |
| English | Geoquery | Deep semantics | ? |
| English | PropBank | PropBank semantics | different licenses |
| Finnish | Finnish Universal Propositions | PropBank semantics | CC BY-NC-SA 3.0 US |
| Finnish | Finnish PropBank | PropBank semantics | CC BY-SA 4.0 |
| French | French Universal Propositions | PropBank semantics | CC BY-NC-SA 3.0 US |
| German | German Universal Propositions | PropBank semantics | CC BY-NC-SA 3.0 US |
| Italian | Italian Universal Propositions | PropBank semantics | CC BY-NC-SA 3.0 US |
| Portuguese | Portuguese PortLex | PropBank semantics | ? |
| Portuguese | Portuguese Universal Propositions | PropBank semantics | CC BY-NC-SA 3.0 US |
| Spanish | Spanish Universal Propositions | PropBank semantics | CC BY-NC-SA 3.0 US |
| Turkish | Turkish PropBank | PropBank semantics | CC BY-NC-SA 4.0 |

Syntactic treebanks

Many syntactic treebanks have been developed for a wide variety of languages:

| Language | Treebank | Syntactic formalism | Distribution / License |
| --- | --- | --- | --- |
| Abaza | Universal Dependencies, ATB | Dependency | CC BY-SA |
| Afrikaans | Universal Dependencies, AfriBooms | Dependency | CC BY-SA |
| Akkadian | Universal Dependencies, PISANDUB | Dependency | CC BY-SA |
| Albanian | Universal Dependencies, TSA | Dependency | CC BY-SA |
| Amharic | Universal Dependencies, ATT | Dependency | CC BY-SA |
| Ancient Greek | Universal Dependencies, Perseus | Dependency | CC BY-NC-SA |
| Ancient Greek | Universal Dependencies, PROIEL | Dependency | CC BY-NC-SA |
| Greek (ancient) | Ancient Greek Dependency Treebank [6] [7] | Dependency | Open source (Creative Commons license) |
| Greek (ancient) | PROIEL Treebank [8] | Dependency | Open source (Creative Commons license) |
| Arabic | Columbia Arabic Treebank (CATiB) | Dependency | Linguistic Data Consortium |
| Arabic | Prague Arabic Dependency Treebank (PADT) | Dependency | Linguistic Data Consortium |
| Arabic | Universal Dependencies, NYUAD | Dependency | CC BY-SA |
| Arabic | Universal Dependencies, PADT | Dependency | CC BY-NC-SA |
| Arabic | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Arabic | Penn Arabic Treebank | Phrase structure | Linguistic Data Consortium |
| Armenian | Universal Dependencies, ArmTDP | Dependency | CC BY-SA |
| Assyrian (Neo-Aramaic) | Universal Dependencies, AS | Dependency | CC BY-SA |
| Bambara | Universal Dependencies, CRB | Dependency | CC BY-SA |
| Basque | Universal Dependencies, BDT | Dependency | CC BY-NC-SA |
| Belarusian | Universal Dependencies, HSE | Dependency | CC BY-SA |
| Bhojpuri | Universal Dependencies, BhEn | Dependency | CC BY-SA |
| Bhojpuri | Universal Dependencies, BHTB | Dependency | CC BY-SA |
| Breton | Universal Dependencies, KEB | Dependency | CC BY-SA |
| Bulgarian | Universal Dependencies, BTB | Dependency | CC BY-NC-SA |
| Bulgarian | BulTreeBank | HPSG | Freely available for research |
| Buryat | Universal Dependencies, BDT | Dependency | CC BY-SA |
| Cantonese | Universal Dependencies, HK | Dependency | CC BY-SA |
| Catalan | Cat3LB | Phrase structure | Freely available for research |
| Catalan | Universal Dependencies, AnCora | Dependency | GPL |
| Chinese | Sinica Treebank | Case grammar | Not freely available |
| Chinese | Universal Dependencies, CFL | Dependency | CC BY-SA |
| Chinese | Universal Dependencies, GSD | Dependency | CC BY-SA |
| Chinese | Universal Dependencies, GSDSimp | Dependency | CC BY-SA |
| Chinese | Universal Dependencies, HK | Dependency | CC BY-SA |
| Chinese | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Chinese | Penn Chinese Treebank | Phrase structure | Linguistic Data Consortium |
| Chinese | Chinese Dependency Treebank | Dependency | Linguistic Data Consortium |
| Arabic (classical) | Quranic Arabic Dependency Treebank (QADT) (Quranic Arabic Corpus) | Dependency | Open source (GNU general public license) |
| Classical Armenian | PROIEL Treebank [8] | Dependency | Open source (Creative Commons license) |
| Coptic | Universal Dependencies, Coptic Scriptorium | Dependency | CC BY |
| Croatian | Croatian Dependency Treebank | Dependency | Open source (Creative Commons license) |
| Croatian | Universal Dependencies, SET | Dependency | CC BY-SA |
| Czech | Prague Dependency Treebank | Dependency | Open source (Creative Commons license) |
| Czech | Universal Dependencies, CAC | Dependency | CC BY-SA |
| Czech | Universal Dependencies, CLTT | Dependency | CC BY-SA |
| Czech | Universal Dependencies, FicTree | Dependency | CC BY-NC-SA |
| Czech | Universal Dependencies, PDT | Dependency | CC BY-NC-SA |
| Czech | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Danish | Danish Dependency Treebank | Dependency | Open source (GNU general public license) |
| Danish | Arboretum: A syntactic tree corpus of Danish | Phrase structure | License fee |
| Danish | Universal Dependencies, DDT | Dependency | CC BY-SA |
| Danish | Universal Dependencies, DTB | Dependency | CC BY-SA |
| Dutch | Spoken Dutch Corpus (CGN) | Phrase structure | License fee |
| Dutch | Universal Dependencies, Alpino | Dependency | CC BY-SA |
| Dutch | Universal Dependencies, LassySmall | Dependency | CC BY-SA |
| Dutch | LASSY Small and Large | Dependency | License fee |
| Dutch | Alpino Treebank | Dependency | Open source (GNU general public license) |
| English | CCGbank | Combinatory categorial grammar | Linguistic Data Consortium |
| English | LinGO Redwoods | HPSG | ? |
| English | Lancaster Parsed Corpus | Phrase structure | ? |
| English | Prague English Dependency Treebank | Dependency | Linguistic Data Consortium |
| English | Universal Dependencies, BhEn | Dependency | CC BY-SA |
| English | Universal Dependencies, ESL | Dependency | CC BY-SA |
| English | Universal Dependencies, EWT | Dependency | CC BY-SA |
| English | Universal Dependencies, GUM | Dependency | CC BY-NC-SA |
| English | Universal Dependencies, GUMReddit | Dependency | CC BY |
| English | Universal Dependencies, LinES | Dependency | CC BY-NC-SA |
| English | Universal Dependencies, ParTUT | Dependency | CC BY-NC-SA |
| English | Universal Dependencies, Pronouns | Dependency | CC BY-SA |
| English | Universal Dependencies, PUD | Dependency | CC BY-SA |
| English | Treebank Semantics Parsed Corpus | Phrase structure | Open source (Creative Commons license) |
| English | Christine Corpus | Phrase structure | Freely available for research |
| English | Lucy Corpus | Phrase structure | Freely available for research |
| English | Susanne Corpus | Phrase structure | Freely available for research |
| English | BLLIP WSJ corpus | Phrase structure | Linguistic Data Consortium |
| English | Tübingen Treebank of English / Spontaneous Speech (TüBa-E/S) | HPSG | Freely available for research |
| English | Diachronic Corpus of Present-Day Spoken English (DCPSE) | Phrase structure | License fee |
| English | British Component of the International Corpus of English (ICE-GB) | Phrase structure | License fee |
| English | The PARC 700 Dependency Bank | Dependency | ? |
| English | Yahoo Query Treebank | Dependency | Freely available for research |
| English | Penn Treebank | Phrase structure | Linguistic Data Consortium |
| English | Multi-Treebank | Phrase structure | Available online for comparison purposes |
| English | CHILDES Brown Eve corpus with dependency annotation | Dependency | Open source (Creative Commons license) |
| English | SMULTRON - Parallel Treebank EN-DE-SV | Phrase structure | Freely available for research |
| Erzya | Universal Dependencies, JR | Dependency | CC BY-SA |
| Estonian | Arborest | Phrase structure | ? |
| Estonian | Syntactically analyzed and disambiguated text corpus | Dependency | Freely available for research |
| Estonian | Universal Dependencies, EDT | Dependency | CC BY-NC-SA |
| Estonian | Universal Dependencies, EWT | Dependency | CC BY-NC-SA |
| Faroese | Universal Dependencies, FarPaHC | Dependency | CC BY-SA |
| Faroese | Universal Dependencies, OFT | Dependency | CC BY-SA |
| Finnish | Turku Dependency Treebank (TDT) | Dependency | Open source (Creative Commons license) |
| Finnish | Universal Dependencies, FTB | Dependency | CC BY |
| Finnish | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Finnish | Universal Dependencies, TDT | Dependency | CC BY-SA |
| French (spoken) | Rhapsodie | Dependency and macrosyntactic annotation | Open source (Creative Commons license) |
| French | L'Arboratoire | Phrase structure | ? |
| French | Universal Dependencies, CrapBank | Dependency | CC BY-SA |
| French | Universal Dependencies, FQB | Dependency | GPL |
| French | Universal Dependencies, FTB | Dependency | GPL |
| French | Universal Dependencies, GSD | Dependency | CC BY-SA |
| French | Universal Dependencies, ParTUT | Dependency | CC BY-NC-SA |
| French | Universal Dependencies, PUD | Dependency | CC BY-SA |
| French | Universal Dependencies, Sequoia | Dependency | GPL |
| French | Universal Dependencies, Spoken | Dependency | CC BY-SA |
| French | French Treebank | Phrase structure | Freely available for research |
| French | Free French Treebank | Phrase structure | Open Source license LGPL-LR |
| French | Sequoia Treebank | Phrase structure & Dependency | Open Source license LGPL-LR |
| Galician | Universal Dependencies, CTG | Dependency | CC BY-NC-SA |
| Galician | Universal Dependencies, TreeGal | Dependency | GPL |
| German | Hamburg Dependency Treebank (HDT) | Dependency | Freely available for research |
| German | Universal Dependencies, GSD | Dependency | CC BY-SA |
| German | Universal Dependencies, LIT | Dependency | CC BY-NC-SA |
| German | Universal Dependencies, PUD | Dependency | CC BY-SA |
| German | SMULTRON - Parallel Treebank EN-DE-SV | Phrase structure | Freely available for research |
| German | NEGRA | Phrase structure | Freely available for research |
| German | TIGER | Phrase structure | Freely available for research |
| German | Tübingen Treebank of German / Spontaneous Speech (TüBa-D/S) | Phrase structure | Freely available for research |
| German | Tübingen Treebank of Written German (TüBa-D/Z) | Phrase structure | Freely available for research |
| German | Tübingen Partially Parsed Corpus of Written German (TüPP-D/Z) | Phrase structure | License fee |
| Gothic | PROIEL Treebank [8] | Dependency | Open source (Creative Commons license) |
| Gothic | Universal Dependencies, PROIEL | Dependency | CC BY-NC-SA |
| Greek | Greek Dependency Treebank | Dependency | Not freely available |
| Greek | Universal Dependencies, GDT | Dependency | CC BY-NC-SA |
| Hebrew | Universal Dependencies, HTB | Dependency | CC BY-NC-SA |
| Hebrew | Hebrew Dependency Treebank | Dependency | Open source (GNU general public license) |
| Hindi English | Universal Dependencies, HIENCS | Dependency | CC BY-SA |
| Hindi | Universal Dependencies, HDTB | Dependency | CC BY-NC-SA |
| Hindi | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Hindi | AnnCorra | Dependency | ? |
| English (historical) | Penn Parsed Corpora of Historical English | Phrase structure | Linguistic Data Consortium (as of April 2020) |
| English (historical) | York-Toronto-Helsinki Parsed Corpus of Old English Prose (YCOE) | Phrase structure | Freely available for research |
| French (historical) | Corpus MCVF | Phrase structure | Freely available for research |
| Portuguese (historical) | Tycho Brahe corpus | Phrase structure | ? |
| Hungarian | Universal Dependencies, Szeged | Dependency | CC BY-NC-SA |
| Hungarian | Hungarian Treebank | Phrase structure | ? |
| Icelandic | IcePaHC - Icelandic Parsed Historical Corpus | Phrase structure | Open source (GNU Lesser General Public License) |
| Icelandic | Universal Dependencies, IcePaHC | Dependency | CC BY-SA |
| Icelandic | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Indonesian | Universal Dependencies, GSD | Dependency | CC BY-SA |
| Indonesian | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Indonesian | ICON | Phrase structure | ? |
| Irish | Universal Dependencies, IDT | Dependency | CC BY-SA |
| Italian | ISST - Italian Syntactic-Semantic Treebank | Phrase structure and dependency | License fee |
| Italian | MIDT (Merged Italian Dependency Treebank) resulting from the merging and harmonization of the TUT and ISST-CoNLL/TANL treebanks | Dependency | Freely available for research |
| Italian | VIT - Venice Italian Treebank | Phrase structure and dependency | License fee |
| Italian | Universal Dependencies, ISDT | Dependency | CC BY-NC-SA |
| Italian | Universal Dependencies, ParTUT | Dependency | CC BY-NC-SA |
| Italian | Universal Dependencies, PoSTWITA | Dependency | CC BY-NC-SA |
| Italian | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Italian | Universal Dependencies, TWITTIRO | Dependency | CC BY-SA |
| Italian | Universal Dependencies, VIT | Dependency | CC BY-NC-SA |
| Italian | Italian Syntactic-Semantic Treebank for the CoNLL-2007 Shared Task (ISST-CoNLL) | Dependency | Freely available for research |
| Italian | SUT - Siena University Treebank | ? | ? |
| Italian | TUT - Turin University Treebank | Dependency | Open source (Creative Commons license) |
| Italian | ISDT (Italian Stanford Dependency Treebank) | Dependency | Freely available for research |
| Japanese | Kyoto Text Corpus | ? | ? |
| Japanese | Universal Dependencies, BCCWJ | Dependency | CC BY-NC-SA |
| Japanese | Universal Dependencies, GSD | Dependency | CC BY-SA |
| Japanese | Universal Dependencies, KTC | Dependency | CC BY-SA |
| Japanese | Universal Dependencies, Modern | Dependency | CC BY-NC-ND |
| Japanese | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Japanese | Keyaki Treebank | Phrase structure | Open source (Creative Commons license) |
| Japanese | Tübingen Treebank of Japanese / Spontaneous Speech (TüBa-J/S) | Phrase structure | Freely available for research |
| Japanese | ATR Dependency corpus | Dependency | ? |
| Karelian | Universal Dependencies, KKPP | Dependency | CC BY-SA |
| Kazakh | Universal Dependencies, KTB | Dependency | CC BY-SA |
| Komi Permyak | Universal Dependencies, UH | Dependency | CC BY-SA |
| Komi Zyrian | Universal Dependencies, IKDP | Dependency | CC BY-SA |
| Komi Zyrian | Universal Dependencies, Lattice | Dependency | CC BY-SA |
| Korean | Universal Dependencies, GSD | Dependency | CC BY-SA |
| Korean | Universal Dependencies, Kaist | Dependency | CC BY-SA |
| Korean | Universal Dependencies, Penn | Dependency | CC BY-SA |
| Korean | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Korean | Universal Dependencies, Sejong | Dependency | CC BY-SA |
| Korean | Korean Treebank | Phrase structure | Linguistic Data Consortium |
| Kurmanji | Universal Dependencies, MG | Dependency | CC BY-SA |
| Latin | Universal Dependencies, ITTB | Dependency | CC BY-NC-SA |
| Latin | Universal Dependencies, LLCT | Dependency | CC BY-SA |
| Latin | Universal Dependencies, Perseus | Dependency | CC BY-NC-SA |
| Latin | Universal Dependencies, PROIEL | Dependency | CC BY-NC-SA |
| Latin | Index Thomisticus Treebank | Dependency | Open source (Creative Commons license) |
| Latin | PROIEL Treebank [8] | Dependency | Open source (Creative Commons license) |
| Latin | Latin Dependency Treebank [9] | Dependency | Open source (Creative Commons license) |
| Latvian | Universal Dependencies, LVTB | Dependency | CC BY-SA |
| Lithuanian | Universal Dependencies, ALKSNIS | Dependency | CC BY-SA |
| Lithuanian | Universal Dependencies, HSE | Dependency | CC BY-SA |
| Livvi | Universal Dependencies, KKPP | Dependency | CC BY-SA |
| Magahi | Universal Dependencies, MGTB | Dependency | CC BY-SA |
| Maltese | Universal Dependencies, MUDT | Dependency | CC BY-SA |
| Marathi | Universal Dependencies, UFAL | Dependency | CC BY-SA |
| Mbya Guarani | Universal Dependencies, Dooley | Dependency | CC BY-NC-SA |
| Mbya Guarani | Universal Dependencies, Thomas | Dependency | CC BY-NC-SA |
| Middle Irish | Universal Dependencies, CritMITB | Dependency | CC BY-SA |
| Middle Irish | Universal Dependencies, DipMITB | Dependency | CC BY-SA |
| Moksha | Universal Dependencies, JR | Dependency | CC BY-SA |
| Naija | Universal Dependencies, NSC | Dependency | CC BY-SA |
| North Sami | Universal Dependencies, Giella | Dependency | CC BY-SA |
| Norwegian | INESS treebanking infrastructure | LFG | ? |
| Norwegian | Universal Dependencies, Bokmaal | Dependency | CC BY-SA |
| Norwegian | Universal Dependencies, Nynorsk | Dependency | CC BY-SA |
| Norwegian | Universal Dependencies, NynorskLIA | Dependency | CC BY-SA |
| Old Church Slavonic | Universal Dependencies, PROIEL | Dependency | CC BY-NC-SA |
| Old Church Slavonic | TOROT Treebank [8] | Dependency | Open source (Creative Commons license) |
| Old French | Universal Dependencies, SRCMF | Dependency | CC BY-NC-SA |
| Old Russian | Universal Dependencies, RNC | Dependency | CC BY-SA |
| Old Russian | Universal Dependencies, TOROT | Dependency | CC BY-NC-SA |
| Old Russian | TOROT Treebank [8] | Dependency | Open source (Creative Commons license) |
| Persian | Persian Dependency Treebank (PerDT) | Dependency | Freely available for research |
| Persian | PerTreeBank | HPSG | Freely available for research |
| Persian | Universal Dependencies, Seraji | Dependency | CC BY-SA |
| Polish | A Treebank / Test Suite for Polish | HPSG | ? |
| Polish | Universal Dependencies, LFG | Dependency | GPL |
| Polish | Universal Dependencies, PDB | Dependency | CC BY-NC-SA |
| Polish | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Polish | Składnica | Phrase structure and Dependency | Open source (GNU general public license) |
| Portuguese | Universal Dependencies, Bosque | Dependency | CC BY-SA |
| Portuguese | Universal Dependencies, GSD | Dependency | CC BY-SA |
| Portuguese | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Portuguese | Projecto Floresta Sintá(c)tica | Dependency, Phrase structure | Open source (GNU general public license) |
| Romanian | Romanian Dependency Treebank | Dependency | ? |
| Romanian | Universal Dependencies, Nonstandard | Dependency | CC BY-SA |
| Romanian | Universal Dependencies, RRT | Dependency | CC BY-SA |
| Romanian | Universal Dependencies, SiMoNERo | Dependency | CC BY-SA |
| Russian | Universal Dependencies, GSD | Dependency | CC BY-SA |
| Russian | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Russian | Universal Dependencies, SynTagRus | Dependency | CC BY-NC-SA |
| Russian | Universal Dependencies, Taiga | Dependency | CC BY-SA |
| Russian | SynTagRus Dependency Treebank (Russian National Corpus) | Dependency | Freely available for research |
| Sanskrit | Universal Dependencies, UFAL | Dependency | CC BY-SA |
| Sanskrit | Universal Dependencies, Vedic | Dependency | CC BY-SA |
| Scottish Gaelic | Universal Dependencies, ARCOSG | Dependency | CC BY-SA |
| Serbian | Universal Dependencies, SET | Dependency | CC BY-SA |
| Sindhi | Universal Dependencies, MazharDootio | Dependency | CC BY-SA |
| Skolt Sami | Universal Dependencies, Giellagas | Dependency | CC BY-SA |
| Slovak | Universal Dependencies, SNK | Dependency | CC BY-SA |
| Slovene | Slovene Dependency Treebank | Dependency | Freely available for research |
| Slovenian | Universal Dependencies, SSJ | Dependency | CC BY-NC-SA |
| Slovenian | Universal Dependencies, SST | Dependency | CC BY-NC-SA |
| Spanish | Cast3LB | Phrase structure and dependency | Freely available for research |
| Spanish | Universal Dependencies, AnCora | Dependency | GPL |
| Spanish | Universal Dependencies, GSD | Dependency | CC BY-SA |
| Spanish | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Spanish | UAM Treebank of Spanish | Phrase structure | Freely available for research |
| Swedish | Talbanken05 | Phrase structure and dependency | Freely available for research |
| Swedish | Swedish Treebank | Phrase structure | Freely available for research |
| Swedish | Universal Dependencies, LinES | Dependency | CC BY-NC-SA |
| Swedish | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Swedish | Universal Dependencies, Talbanken | Dependency | CC BY-SA |
| Swedish | SMULTRON - Parallel Treebank EN-DE-SV | Phrase structure | Freely available for research |
| Swedish Sign Language | Universal Dependencies, SSLC | Dependency | CC BY-SA |
| Swiss German | Universal Dependencies, UZH | Dependency | CC BY-SA |
| Tagalog | Universal Dependencies, TRG | Dependency | CC BY-SA |
| Tagalog | Universal Dependencies, Ugnayan | Dependency | CC BY-NC-SA |
| Tamil | Universal Dependencies, TTB | Dependency | CC BY-NC-SA |
| Telugu | Universal Dependencies, MTG | Dependency | CC BY-SA |
| Thai | NAiST Thai Treebank | Dependency | Open source (GNU general public license) |
| Thai | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Thai | THTB | Phrase structure | CC BY 4.0 |
| Turkish | METU-Sabanci Turkish Treebank | Dependency | Freely available for research |
| Turkish | Universal Dependencies, BOUN | Dependency | CC BY-SA |
| Turkish | Universal Dependencies, GB | Dependency | CC BY-SA |
| Turkish | Universal Dependencies, IMST | Dependency | CC BY-NC-SA |
| Turkish | Universal Dependencies, PUD | Dependency | CC BY-SA |
| Ukrainian | Institute for Ukrainian, NGO Gold Standard | Dependency | Open source (Creative Commons license) |
| Ukrainian | Universal Dependencies, IU | Dependency | CC BY-NC-SA |
| Upper Sorbian | Universal Dependencies, UFAL | Dependency | CC BY-SA |
| Urdu | NU-FAST Treebank | Phrase structure | Contact at Computational Learning Strategies & Practices |
| Urdu | The URDU.KON-TB Treebank | Phrase and Hyper Dependency Structure | Contact at Computational Learning Strategies & Practices |
| Urdu | Universal Dependencies, UDTB | Dependency | CC BY-NC-SA |
| Uyghur | Universal Dependencies, UDT | Dependency | CC BY-SA |
| Vietnamese | Universal Dependencies, VTB | Dependency | CC BY-SA |
| Vietnamese | Vietnamese Treebank | Phrase structure | Freely available for research |
| Vietnamese | Vietnamese Dependency Treebank | Dependency | Freely available for research |
| Warlpiri | Universal Dependencies, UFAL | Dependency | CC BY-SA |
| Welsh | Universal Dependencies, CCG | Dependency | CC BY-SA |
| Wolof | Universal Dependencies, WTB | Dependency | CC BY-SA |
| Yoruba | Universal Dependencies, YTB | Dependency | CC BY-SA |

To support research on multilingual tasks, some researchers have proposed universal annotation schemes that apply across languages, so that the advantages of different treebanks can be combined. Examples include a universal annotation approach for dependency treebanks [10] and a universal annotation approach for phrase structure treebanks. [11]

Search tools

One of the key ways to extract evidence from a treebank is through search tools. Search tools for parsed corpora typically depend on the annotation scheme that was applied to the corpus. User interfaces range in sophistication from expression-based query systems aimed at computer programmers to full exploration environments aimed at general linguists. Wallis (2008) discusses the principles of searching treebanks in detail and reviews the state of the art around that time. [12]
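A very simple expression-style query can be sketched in Python as a search over tree nodes. The example assumes trees stored as nested lists of the form [label, child, ...]. The function names `subtrees` and `find_dominating` are hypothetical and do not correspond to the query language of any existing treebank search tool.

```python
def subtrees(tree):
    """Yield the tree itself and every constituent nested inside it."""
    yield tree
    for child in tree[1:]:
        if isinstance(child, list):
            yield from subtrees(child)


def find_dominating(tree, parent_label, child_label):
    """Return nodes labelled parent_label with an immediate child labelled child_label."""
    return [
        node
        for node in subtrees(tree)
        if node[0] == parent_label
        and any(isinstance(c, list) and c[0] == child_label for c in node[1:])
    ]


# The example sentence "John loves Mary" as a nested-list tree.
sentence = ["S",
            ["NP", ["NNP", "John"]],
            ["VP", ["VBZ", "loves"], ["NP", ["NNP", "Mary"]]]]

verb_phrases_with_object = find_dominating(sentence, "VP", "NP")
```

The query returns the single VP node that directly contains an NP, which is the kind of structural pattern a corpus linguist might count across a whole treebank. Real search tools support far richer relations, such as precedence and indirect dominance.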

See also

Related Research Articles

Computational linguistics is an interdisciplinary field concerned with the computational modelling of natural language, as well as the study of appropriate computational approaches to linguistic questions. In general, computational linguistics draws upon linguistics, computer science, artificial intelligence, mathematics, logic, philosophy, cognitive science, cognitive psychology, psycholinguistics, anthropology and neuroscience, among others.

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference. The large collections of text allow linguistics to run quantitative analyses on linguistic concepts, otherwise harder to quantify.

In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

Lexical functional grammar (LFG) is a constraint-based grammar framework in theoretical linguistics. It posits two separate levels of syntactic structure, a phrase structure grammar representation of word order and constituency, and a representation of grammatical functions such as subject and object, similar to dependency grammar. The development of the theory was initiated by Joan Bresnan and Ronald Kaplan in the 1970s, in reaction to the theory of transformational grammar which was current in the late 1970s. It mainly focuses on syntax, including its relation with morphology and semantics. There has been little LFG work on phonology.

Head-driven phrase structure grammar (HPSG) is a highly lexicalized, constraint-based grammar developed by Carl Pollard and Ivan Sag. It is a type of phrase structure grammar, as opposed to a dependency grammar, and it is the immediate successor to generalized phrase structure grammar. HPSG draws from other fields such as computer science and uses Ferdinand de Saussure's notion of the sign. It uses a uniform formalism and is organized in a modular way which makes it attractive for natural language processing.

Parsing, syntax analysis, or syntactic analysis is the process of analyzing a string of symbols, either in natural language, computer languages or data structures, conforming to the rules of a formal grammar. The term parsing comes from Latin pars (orationis), meaning part.

Generalized phrase structure grammar (GPSG) is a framework for describing the syntax and semantics of natural languages. It is a type of constraint-based phrase structure grammar. Constraint based grammars are based around defining certain syntactic processes as ungrammatical for a given language and assuming everything not thus dismissed is grammatical within that language. Phrase structure grammars base their framework on constituency relationships, seeing the words in a sentence as ranked, with some words dominating the others. For example, in the sentence "The dog runs", "runs" is seen as dominating "dog" since it is the main focus of the sentence. This view stands in contrast to dependency grammars, which base their assumed structure on the relationship between a single word in a sentence and its dependents.

<span class="mw-page-title-main">Charles J. Fillmore</span> American linguist

Charles J. Fillmore was an American linguist and Professor of Linguistics at the University of California, Berkeley. He received his Ph.D. in Linguistics from the University of Michigan in 1961. Fillmore spent ten years at Ohio State University and a year as a Fellow at the Center for Advanced Study in the Behavioral Sciences at Stanford University before joining Berkeley's Department of Linguistics in 1971. Fillmore was extremely influential in the areas of syntax and lexical semantics.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

Linguistic categories include

The International Corpus of English (ICE) is a set of text corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.

Eckhard Bick is a German-born Esperantist who studied medicine in Bonn but now works as a researcher in computational linguistics. He was active in an Esperanto youth group in Bonn and in the Germana Esperanto-Junularo, a nationwide Esperanto youth federation. Since his marriage to a Danish woman, he and his family have lived in Denmark.

The Quranic Arabic Corpus is an annotated linguistic resource consisting of 77,430 words of Quranic Arabic. The project aims to provide morphological and syntactic annotations for researchers wanting to study the language of the Quran.

Deep Linguistic Processing with HPSG - INitiative (DELPH-IN) is a collaboration where computational linguists worldwide develop natural language processing tools for deep linguistic processing of human language. The goal of DELPH-IN is to combine linguistic and statistical processing methods in order to computationally understand the meaning of texts and utterances.

Deep linguistic processing is a natural language processing framework which draws on theoretical and descriptive linguistics. It models language predominantly by way of theoretical syntactic/semantic theory. Deep linguistic processing approaches differ from "shallower" methods in that they yield more expressive and structural representations which directly capture long-distance dependencies and underlying predicate-argument structures.
The knowledge-intensive approach of deep linguistic processing requires considerable computational power, and has in the past sometimes been judged intractable. However, research in the early 2000s made considerable advances in the efficiency of deep processing. Today, efficiency is no longer a major problem for applications using deep linguistic processing.

The following outline is provided as an overview of and topical guide to natural-language processing:

Manning's Law describes the combination of principles that need to be balanced in the design and growth of universal linguistic dependencies. These dependencies are used to describe and model syntactic relations for all languages. They support natural language processing and are a major research topic: they have their own event, thousands of linguistics and AI researchers work with and on them, and they are widely adopted. The law was put forward by Christopher D. Manning.

Universal Dependencies, frequently abbreviated as UD, is an international cooperative project to create treebanks of the world's languages. These treebanks are openly accessible. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology. The project's primary aim is to achieve cross-linguistic consistency of annotation, while still permitting language-specific extensions when necessary. The annotation scheme has its roots in three related projects: Stanford Dependencies, Google universal part-of-speech tags, and the Interset interlingua for morphosyntactic tagsets. The UD annotation scheme represents sentences as dependency trees rather than phrase structure trees. At the present time, there are just over 200 treebanks of more than 100 languages available in the UD inventory.
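
A dependency tree can be stored compactly as one head index per word. The sketch below assumes the common convention of 1-based word positions with 0 marking the root, and uses the relation labels `det`, `nsubj`, and `root` for the sentence "The dog runs"; the helper names `is_valid_tree` and `dependents` are hypothetical and chosen only for this example.

```python
def is_valid_tree(heads):
    """Check that a head list encodes a single-rooted dependency tree.

    heads[i] is the 1-based position of the head of word i+1;
    0 marks the root (an assumed convention for this sketch).
    """
    n = len(heads)
    if any(h < 0 or h > n for h in heads):
        return False  # a head points outside the sentence
    if sum(1 for h in heads if h == 0) != 1:
        return False  # exactly one word must be the root
    for start in range(1, n + 1):
        seen = set()
        node = start
        # Follow head links upward; revisiting a word means a cycle.
        while node != 0:
            if node in seen:
                return False
            seen.add(node)
            node = heads[node - 1]
    return True


def dependents(heads, head):
    """Return the 1-based positions of the words attached to `head`."""
    return [i + 1 for i, h in enumerate(heads) if h == head]


words = ["The", "dog", "runs"]
heads = [2, 3, 0]                  # The -> dog, dog -> runs, runs -> root
labels = ["det", "nsubj", "root"]  # one relation label per word

print(is_valid_tree(heads))
for word, head, label in zip(words, heads, labels):
    governor = "ROOT" if head == 0 else words[head - 1]
    print(f"{word} --{label}--> {governor}")
```

Because every word carries exactly one head, the structure has no phrase-level nodes at all, which is the main contrast with a phrase structure tree over the same sentence.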

Syntactic parsing is the automatic analysis of the syntactic structure of natural language, especially the identification of syntactic relations and the labelling of constituent spans. It is motivated by the problem of structural ambiguity in natural language: a sentence can be assigned multiple grammatical parses, so some kind of knowledge beyond computational grammar rules is needed to tell which parse is intended. Syntactic parsing is one of the important tasks in computational linguistics and natural language processing, and has been a subject of research since the mid-20th century with the advent of computers.
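
Structural ambiguity can be made concrete by counting how many parses a grammar licenses for one sentence. The sketch below uses a CKY-style chart over a toy grammar in Chomsky normal form; the grammar, the lexicon, and the function name `count_parses` are all invented for illustration and do not come from any treebank.

```python
from collections import defaultdict

# Toy binary rules: (left child, right child) -> possible parent labels.
BINARY_RULES = {
    ("NP", "VP"): ["S"],
    ("V", "NP"): ["VP"],
    ("VP", "PP"): ["VP"],   # prepositional phrase attaches to the verb phrase
    ("NP", "PP"): ["NP"],   # prepositional phrase attaches to the noun phrase
    ("Det", "N"): ["NP"],
    ("P", "NP"): ["PP"],
}

# Toy lexicon: word -> possible categories.
LEXICON = {
    "I": ["NP"],
    "saw": ["V"],
    "the": ["Det"],
    "man": ["N"],
    "telescope": ["N"],
    "with": ["P"],
}


def count_parses(words, start="S"):
    """Count the distinct parse trees for `words` rooted in `start`."""
    n = len(words)
    # chart[(i, j)][label] = number of trees with that label over words[i:j]
    chart = defaultdict(lambda: defaultdict(int))
    for i, word in enumerate(words):
        for tag in LEXICON.get(word, []):
            chart[(i, i + 1)][tag] += 1
    for length in range(2, n + 1):
        for i in range(0, n - length + 1):
            j = i + length
            for k in range(i + 1, j):  # split point between the two children
                for left, left_count in list(chart[(i, k)].items()):
                    for right, right_count in list(chart[(k, j)].items()):
                        for parent in BINARY_RULES.get((left, right), []):
                            chart[(i, j)][parent] += left_count * right_count
    return chart[(0, n)][start]


sentence = "I saw the man with the telescope".split()
print(count_parses(sentence))  # 2: "with the telescope" modifies "saw" or "man"
```

The grammar alone cannot say which of the two analyses is intended, which is why parsers are typically trained on treebanks, where annotators have already chosen one.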

References

  1. Alexander Clark, Chris Fox and Shalom Lappin (2010). The handbook of computational linguistics and natural language processing. Wiley.
  2. Sampson, G. (2003) ‘Reflections of a dendrographer.’ In A. Wilson, P. Rayson and T. McEnery (eds.) Corpus Linguistics by the Lune: A Festschrift for Geoffrey Leech, Frankfurt am Main: Peter Lang, pp. 157-184
  3. Haitao Liu, Wei Huang — A Chinese Dependency Syntax for Treebanking, published by Communication University of China, published (online) by the Association for Computational Linguistics - accessed 2020-2-4
  4. Kübler, Sandra; McDonald, Ryan; Nivre, Joakim (2008-12-18). "Dependency Parsing". Synthesis Lectures on Human Language Technologies. 2 (1): 1–127. doi:10.2200/s00169ed1v01y200901hlt002.
  5. Kais Dukes (2013) Semantic Annotation of Robotic Spatial Commands. Language and Technology Conference (LTC). Poznan, Poland.
  6. Celano, Giuseppe G. A. 2014. Guidelines for the annotation of the Ancient Greek Dependency Treebank 2.0. https://github.com/PerseusDL/treebank_data/edit/master/AGDT2/guidelines
  7. Mambrini, F. 2016. The Ancient Greek Dependency Treebank: Linguistic Annotation in a Teaching Environment. In: Bodard, G & Romanello, M (eds.) Digital Classics Outside the Echo-Chamber: Teaching, Knowledge Exchange & Public Engagement, pp. 83–99. London: Ubiquity Press. doi:10.5334/bat.f
  8. Dag Haug. 2015. Treebanks in historical linguistic research. In Carlotta Viti (ed.), Perspectives on Historical Syntax, Benjamins, 188-202. A preprint is available at http://folk.uio.no/daghaug/historical-treebanks.pdf.
  9. Bamman David & al. 2008. Guidelines for the Syntactic Annotation of Latin Treebanks (v. 1.3). http://nlp.perseus.tufts.edu/syntax/treebank/1.3/docs/guidelines.pdf
  10. McDonald, R.; Nivre, J.; Quirmbach-Brundage, Y.; et al. "Universal Dependency Annotation for Multilingual Parsing". Proceedings of the ACL 2013.
  11. Han, A.L.-F; Wong, D.F.; Chao, L.S.; Lu, Y.; He, L. & Tian, L. (2014). "A Universal Phrase Tagset for Multilingual Treebanks" (PDF). Proceedings of the CCL and NLP-NABD 2014, LNAI 8801, pp. 247–258. Springer International Publishing Switzerland. doi:10.1007/978-3-319-12277-9_22.
  12. Wallis, Sean (2008). Searching treebanks and other structured corpora. Chapter 34 in Lüdeling, A. & Kytö, M. (ed.) Corpus Linguistics: An International Handbook. Handbücher zur Sprache und Kommunikationswissenschaft series. Berlin: Mouton de Gruyter.