Bulgarian National Corpus

Last updated

The Bulgarian National Corpus (BulNC) is a large representative corpus of Bulgarian comprising about 200,000 texts and amounting to over 1 billion words. [1]

Contents

History

The Bulgarian National corpus is created at the Institute for Bulgarian Language „Prof. L. Andreychin” by research associates from the Department of Computational Linguistics and the Department of Bulgarian Lexicology and Lexicography. BulNC incorporates several individual electronic corpora, developed in the period 2001-2009 for the purposes of the two departments. The corpus is constantly enlarged with new texts. [2] [3]

Contents

The Bulgarian National corpus consists of a monolingual (Bulgarian) part and 47 parallel corpora. The Bulgarian part includes about 1.2 billion words in over 240 000 text samples. The materials in the Corpus reflect the state of the Bulgarian language (mainly in its written form) from the middle of 20th century (1945) until present. [4]

It also includes parallel corpora of various size for 47 foreign languages. [5]

BulNC is annotated at various linguistic levels. [6]

Applications

The Bulgarian National Corpus enables a number of applications in various linguistic areas: in computational linguistics; in lexicography; within theoretical studies of specific linguistic phenomena; for observations of the characteristics of individual language domains; for extracting exemplary sentences for the education in Bulgarian language, etc.

Some of the more specific applications of the Corpus are listed below:

Access

Access to BulNC is free of charge for public use[ clarification needed ] and includes:

See also

Related Research Articles

Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.

In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

Word-sense disambiguation (WSD) is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious/automatic, but can often come to conscious attention when ambiguity impairs clarity of communication, given the pervasive polysemy in natural language. In computational linguistics, it is an open problem that affects other computer-related writing, such as discourse, improving relevance of search engines, anaphora resolution, coherence, and inference.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

<span class="mw-page-title-main">Treebank</span> Text corpus with tree annotations

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

Terminology extraction is a subtask of information extraction. The goal of terminology extraction is to automatically extract relevant terms from a given corpus.

Linguistic categories include

Contrastive linguistics is a practice-oriented linguistic approach that seeks to describe the differences and similarities between a pair of languages.

The German Reference Corpus is an electronic archive of text corpora of contemporary written German. It was first created in 1964 and is hosted at the Institute for the German Language in Mannheim, Germany. The corpus archive is continuously updated and expanded. It currently comprises more than 4.0 billion word tokens and constitutes the largest linguistically motivated collection of contemporary German texts. Today, it is one of the major resources worldwide for the study of written German.

Deep Linguistic Processing with HPSG - INitiative (DELPH-IN) is a collaboration where computational linguists worldwide develop natural language processing tools for deep linguistic processing of human language. The goal of DELPH-IN is to combine linguistic and statistical processing methods in order to computationally understand the meaning of texts and utterances.

<span class="mw-page-title-main">Sketch Engine</span> Corpus manager and text analysis software

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.

NooJ is a linguistic development environment software as well as a corpus processor constructed by Max Silberztein. NooJ allows linguists to construct the four classes of the Chomsky-Schützenberger hierarchy of generative grammars: Finite-State Grammars, Context-Free Grammars, Context-Sensitive Grammars as well as Unrestricted Grammars, using either a text editor, or a Graph editor.

The Bulgarian Sense-annotated Corpus (BulSemCor) is a structured corpus of Bulgarian texts in which each lexical item is assigned a sense tag. BulSemCor was created by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences.

The Bulgarian Part of Speech-annotated Corpus (BulPosCor) is a morphologically annotated general monolingual corpus of written language where each item in a text is assigned a grammatical tag. BulPosCor is created by the Department of Computational Linguistics at the Institute for Bulgarian Language of the Bulgarian Academy of Sciences and consists of 174 697 lexical items. BulPosCor has been compiled from the Structured "Brown" Corpus of Bulgarian by sampling 300+ word-excerpts from the original BCB files in such a way as to preserve the BCB overall structure. The annotation process consists of a primary stage of automatically assigning tags from the Bulgarian Grammar Dictionary and a stage of manual resolving of morphological ambiguities. The disambiguated corpus consists of 174,697 lexical units.

The Bulgarian WordNet (BulNet) is an electronic multilingual dictionary of synonym sets along with their explanatory definitions and sets of semantic relations with other words in the language.

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (1010) words per language, which gave rise to the corpus family's name.

References

  1. Koeva, Svetla, Ivelina Stoyanova, Svetlozara Leseva, Tsvetana Dimitrova, Rositsa Dekova, and Ekaterina Tarpomanova (2012) “The Bulgarian National Corpus: Theory and Practice in Corpus Design” – Journal of Language Modelling, 2012, Vol. 0, No. 1, pp. 65-110. ISSN   2299-8470. [ permanent dead link ]
  2. Svetla Koeva, Sv. Leseva, I. Stoyanova, E. Tarpomanova, M. Todorova (2006) “Bulgarian Tagged Corpora”. In: Proceedings of the Fifth International Conference Formal Approaches to South Slavic and Balkan Languages, 18–20 October 2006, Sofia, Bulgaria, pp. 78-86.
  3. Koeva Sv., Blagoeva, D., Kolkovska, S. (2010) “Bulgarian National Corpus Project”. In: Proceedings of LREC-2010, Valletta, ELRA, pp. 3678-3684.
  4. Koeva, Svetla, Ivelina Stoyanova, Svetlozara Leseva, Tsvetana Dimitrova, Rositsa Dekova, and Ekaterina Tarpomanova (2012) “The Bulgarian National Corpus: Theory and Practice in Corpus Design” – Journal of Language Modelling, 2012, Vol. 0, No. 1, pp. 65-110. ISSN   2299-8470. [ permanent dead link ]
  5. Koeva, S., Dekova, R., Stoyanova, I., Rizov, B., Genov, A. (2012) “Bulgarian X-language Parallel Corpus”. In: Proceedings of the Eight International Conference on Language Resources and Evaluation (LREC’12)
  6. Koeva, Sv., Genov, A. (2011) “Bulgarian Language Processing Chain”. In: Proceedings of the Workshop Integration of multilingual resources and tools in Web applications, Hamburg.