| Quranic Arabic Corpus | |
|---|---|
| Research center | University of Leeds |
| Initial release | November 2009 |
| Language | Quranic Arabic, English |
| Annotation | Syntax, morphology |
| Framework | Dependency grammar |
| License | GNU General Public License |
| Website | http://corpus.quran.com/ |
The Quranic Arabic Corpus is an annotated linguistic resource consisting of 77,430 words of Quranic Arabic. The project aims to provide morphological and syntactic annotations for researchers wanting to study the language of the Quran. [1] [2] [3] [4] [5]
The grammatical analysis helps readers uncover the detailed intended meaning of each verse and sentence. Each word of the Quran is tagged with its part of speech as well as multiple morphological features. Unlike other annotated Arabic corpora, the grammar framework adopted by the Quranic Corpus is i'rab (إعراب), the traditional grammar of Arabic. The research project is led by Kais Dukes at the University of Leeds, [4] and is part of the Arabic language computing research group within the School of Computing, supervised by Eric Atwell. [6]
The annotated corpus includes morphological annotation, a syntactic treebank, and a semantic ontology. [1] [7]
Corpus annotation assigns a part-of-speech tag and morphological features to each word. For example, annotation involves deciding whether a word is a noun or a verb, and whether it is inflected as masculine or feminine. The first stage of the project applied automatic part-of-speech tagging using Arabic language computing technology. The annotation of each of the 77,430 words in the Quran was then reviewed in stages by two annotators, and refinements to improve accuracy are ongoing.
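As an illustration of what such a per-word annotation records, the sketch below models one word's part-of-speech tag and morphological features as a small Python structure. The field names, the simplified tagset, and the feature values are assumptions made for this example; they do not reproduce the corpus's actual annotation scheme.

```python
from dataclasses import dataclass

# A minimal sketch of per-word morphological annotation. The field
# names and values are illustrative, not the corpus's actual tagset.
@dataclass
class WordAnnotation:
    location: str   # chapter:verse:word index, e.g. "1:1:2"
    form: str       # the surface form of the word
    pos: str        # part-of-speech tag, e.g. "N" for noun
    features: dict  # morphological features such as gender or case

# Example: annotating one word as a masculine genitive noun.
word = WordAnnotation(
    location="1:1:2",
    form="ٱللَّهِ",
    pos="N",
    features={"gender": "masculine", "case": "genitive"},
)
print(word.pos, word.features["case"])
```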
Linguistic research on the Quran using the annotated corpus includes training hidden Markov model part-of-speech taggers for Arabic, [8] automatic categorization of Quranic chapters, [9] and prosodic analysis of the text. [10]
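As a sketch of the HMM tagging idea mentioned above, the example below trains a supervised hidden Markov model tagger with NLTK's HiddenMarkovModelTrainer on a two-sentence toy training set. The transliterated words and the three-tag tagset are invented for illustration; they are not the data or tagset used in the cited research.

```python
from nltk.tag.hmm import HiddenMarkovModelTrainer

# Toy training data: sequences of (word, tag) pairs. The transliterated
# forms and the simplified tagset are invented for this illustration.
train = [
    [("bismi", "N"), ("allahi", "PN"), ("alrrahmani", "ADJ")],
    [("alhamdu", "N"), ("lillahi", "PN"), ("rabbi", "N")],
]

# Supervised training estimates transition probabilities between tags
# and emission probabilities from tags to words.
tagger = HiddenMarkovModelTrainer().train_supervised(train)

# Tag a new sequence of words seen during training.
print(tagger.tag(["bismi", "rabbi"]))
```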
In addition, the project provides a word-by-word Quranic translation based on accepted English sources, rather than producing a new translation of the Quran. [4]
Natural language processing (NLP) is a subfield of linguistics, computer science, and artificial intelligence concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.
Corpus linguistics is the study of a language as that language is expressed in its text corpus, its body of "real world" text. Corpus linguistics proposes that a reliable analysis of a language is more feasible with corpora collected in the field—the natural context ("realia") of that language—with minimal experimental interference.
In linguistics, a corpus or text corpus is a language resource consisting of a large and structured set of texts. In corpus linguistics, corpora are used for statistical analysis and hypothesis testing, for checking occurrences, or for validating linguistic rules within a specific language territory.
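To make the idea of occurrence checking concrete, here is a minimal Python sketch that counts word frequencies in a tiny invented corpus; real corpus queries run over far larger collections but follow the same principle.

```python
from collections import Counter

# A toy two-sentence "corpus", invented for illustration.
corpus = [
    "the cat sat on the mat",
    "the dog sat on the log",
]

# Tokenise naively on whitespace and count occurrences.
tokens = [word for sentence in corpus for word in sentence.split()]
frequencies = Counter(tokens)

# How often does "the" occur, and what are the most common words?
print(frequencies["the"])
print(frequencies.most_common(3))
```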
Lemmatisation in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.
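A minimal sketch of lemmatisation as grouping is shown below, assuming a hand-written lookup table from inflected forms to lemmas; real lemmatisers derive this mapping from morphological analysis rather than a fixed table.

```python
# Map inflected forms to their dictionary form (lemma) so they can be
# analysed as a single item. The table is hand-written for illustration.
LEMMAS = {
    "walk": "walk", "walks": "walk", "walked": "walk", "walking": "walk",
    "mouse": "mouse", "mice": "mouse",
}

def lemmatise(token: str) -> str:
    # Fall back to the token itself for unknown words.
    return LEMMAS.get(token, token)

words = ["walked", "walking", "mice", "mouse"]
print([lemmatise(w) for w in words])  # ['walk', 'walk', 'mouse', 'mouse']
```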
Link grammar (LG) is a theory of syntax by Davy Temperley and Daniel Sleator which builds relations between pairs of words, rather than constructing constituents in a phrase structure hierarchy. Link grammar is similar to dependency grammar, but whereas dependency grammar includes a head-dependent relationship, link grammar makes the head-dependent relationship optional. Colored Multiplanar Link Grammar (CMLG) is an extension of LG allowing crossing relations between pairs of words. The relationship between words is indicated with link types, which makes link grammar closely related to certain categorial grammars.
In corpus linguistics, part-of-speech tagging, also called grammatical tagging, is the process of marking up a word in a text (corpus) as corresponding to a particular part of speech, based on both its definition and its context. A simplified form of this is commonly taught to school-age children, in the identification of words as nouns, verbs, adjectives, adverbs, etc.
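The toy tagger below illustrates why both definition and context matter: the ambiguous word "book" is tagged as a verb at the start of an imperative but as a noun after a determiner. The two-rule lexicon and the disambiguation heuristic are assumptions invented for this sketch, not how production taggers decide; real taggers learn such decisions statistically.

```python
# A hand-written lexicon of possible tags per word, for illustration only.
LEXICON = {
    "book": {"NOUN", "VERB"}, "that": {"DET"}, "flight": {"NOUN"},
    "a": {"DET"}, "read": {"VERB"}, "i": {"PRON"},
}

def tag(tokens):
    tags = []
    for tok in tokens:
        options = LEXICON.get(tok.lower(), {"NOUN"})
        if len(options) == 1:
            tags.append(next(iter(options)))
        else:
            # Ambiguous word: use context. After a determiner, prefer
            # NOUN; otherwise prefer VERB.
            prev = tags[-1] if tags else None
            tags.append("NOUN" if prev == "DET" else "VERB")
    return list(zip(tokens, tags))

print(tag(["Book", "that", "flight"]))  # "Book" tagged as VERB
print(tag(["a", "book"]))               # "book" tagged as NOUN
```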
Dependency grammar (DG) is a class of modern grammatical theories that are all based on the dependency relation and that can be traced back primarily to the work of Lucien Tesnière. Dependency is the notion that linguistic units, e.g. words, are connected to each other by directed links. The (finite) verb is taken to be the structural center of clause structure. All other syntactic units (words) are either directly or indirectly connected to the verb in terms of the directed links, which are called dependencies. Dependency grammar differs from phrase structure grammar in that, while it can identify phrases, it tends to overlook phrasal nodes. A dependency structure is determined by the relation between a word and its dependents. Dependency structures are flatter than phrase structures in part because they lack a finite verb phrase constituent, and they are thus well suited for the analysis of languages with free word order, such as Czech or Warlpiri.
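A minimal sketch of a dependency structure, assuming an invented relation inventory: the sentence "The dog chased the cat" is encoded as (dependent, relation, head) triples, with the finite verb as the structural center of the clause.

```python
sentence = ["The", "dog", "chased", "the", "cat"]

# Each entry: (dependent index, relation, head index). The root verb
# points at a virtual head, index -1. Relation names are illustrative.
dependencies = [
    (0, "det", 1),    # The  <- dog
    (1, "subj", 2),   # dog  <- chased
    (2, "root", -1),  # chased is the structural center of the clause
    (3, "det", 4),    # the  <- cat
    (4, "obj", 2),    # cat  <- chased
]

for dep, rel, head in dependencies:
    head_word = "ROOT" if head == -1 else sentence[head]
    print(f"{sentence[dep]} --{rel}--> {head_word}")
```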
The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.
In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.
Linguistic categories include lexical categories (parts of speech such as noun and verb), syntactic categories, and grammatical categories (inflectional features such as tense, number, and gender).
The International Corpus of English (ICE) is a set of corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.
The Russian National Corpus is a corpus of the Russian language that has been partially accessible through a query interface online since April 29, 2004. It is being created by the Institute of Russian language, Russian Academy of Sciences.
Eckhard Bick is a German-born Esperantist who studied medicine in Bonn but now works as a researcher in computational linguistics. He was active in an Esperanto youth group in Bonn and in the Germana Esperanto-Junularo, a nationwide Esperanto youth federation. Since his marriage to a Danish woman he and his family live in Denmark.
The following outline is provided as an overview of and topical guide to natural-language processing:
Kais Dukes is a British computer scientist and software developer, known for the development of the Quranic Arabic Corpus.
The Bulgarian Sense-annotated Corpus (BulSemCor) is a structured corpus of Bulgarian texts in which each lexical item is assigned a sense tag. BulSemCor was created by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.
The Bulgarian Part of Speech-annotated Corpus (BulPosCor) is a morphologically annotated general monolingual corpus of written language in which each item in a text is assigned a grammatical tag. BulPosCor was created by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences and consists of 174,697 lexical items. It was compiled from the Structured "Brown" Corpus of Bulgarian (BCB) by sampling 300+ word excerpts from the original BCB files in such a way as to preserve the BCB's overall structure. The annotation process consists of a primary stage of automatically assigning tags from the Bulgarian Grammar Dictionary and a stage of manually resolving morphological ambiguities.
Menotec was an infrastructure project funded by the Norwegian Research Council (2010–2012) with the aim of transcribing and annotating a text corpus of Old Norwegian texts.
Manually Annotated Sub-Corpus (MASC) is a balanced subset of 500K words of written texts and transcribed speech drawn primarily from the Open American National Corpus (OANC). The OANC is a 15 million word corpus of American English produced since 1990, all of which is in the public domain or otherwise free of usage and redistribution restrictions.
Universal Dependencies, frequently abbreviated as UD, is an international cooperative project to create treebanks of the world's languages. These treebanks are openly accessible and available. Core applications are automated text processing in the field of natural language processing (NLP) and research into natural language syntax and grammar, especially within linguistic typology. The project's primary aim is to achieve cross-linguistic consistency of annotation, while still permitting language-specific extensions when necessary. The annotation scheme has its roots in three related projects: Stanford Dependencies, Google universal part-of-speech tags, and the Interset interlingua for morphosyntactic tagsets. The UD annotation scheme represents sentences as dependency trees rather than phrase structure trees. At present, just over 200 treebanks covering more than 100 languages are available in the UD inventory.
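UD treebanks are distributed in the CoNLL-U format: one token per line with ten tab-separated columns (ID, FORM, LEMMA, UPOS, XPOS, FEATS, HEAD, DEPREL, DEPS, MISC). The sketch below parses a hand-written two-token example and prints each token's head relation; the sentence is invented for illustration.

```python
# A minimal two-token CoNLL-U fragment, written by hand for this example.
conllu = (
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\t_\n"
)

for line in conllu.strip().splitlines():
    # Unpack the ten tab-separated CoNLL-U columns.
    idx, form, lemma, upos, xpos, feats, head, deprel, deps, misc = line.split("\t")
    print(f"{form} ({upos}, lemma={lemma}) --{deprel}--> token {head}")
```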