Croatian National Corpus

The Croatian National Corpus (Croatian: Hrvatski nacionalni korpus, HNK) is the largest and most important corpus of the Croatian language. Its compilation started in 1998 at the Institute of Linguistics [1] of the Faculty of Humanities and Social Sciences, University of Zagreb, following the ideas of Marko Tadić. The theoretical foundations, and arguments for the need for a general-purpose, representative, multi-million-token corpus of the Croatian language, had begun to appear even earlier. [2] The Croatian National Corpus is compiled from selected texts written in Croatian covering all fields, topics, genres and styles: from literary and scientific texts to textbooks, newspapers, user groups and chat rooms.

The initial composition was divided into two constituents:

  1. The 30-million corpus of contemporary Croatian language (30m), which included samples of texts from 1990 onwards. The criteria for including text samples were that they be written by native speakers and cover different fields, genres and topics. Translated texts and poetry were excluded.
  2. The Croatian Electronic Text Archive (HETA), which included complete texts, particularly serial publications (volumes, series, editions etc.) that would have unbalanced the 30m had they been inserted there.

Since 2004, with the adoption of the concept of the third-generation corpus, the two-constituent structure has been abandoned in favor of several subcorpora and a larger size. Since 2005 HNK has comprised 105 million tokens and is composed of a number of different subcorpora, which can be searched individually or all together as a whole corpus. In 2004 HNK also migrated to a new server platform, the Manatee/Bonito server-client architecture. Searching the HNK (today still with free test access) requires the free client program Bonito. [3] The author of this corpus manager is Pavel Rychlý [4] from the Natural Language Processing Laboratory [5] of the Faculty of Informatics, [6] Masaryk University in Brno, Czech Republic. Its interface supports complex, elaborate queries over the corpus, different types of statistical results, total or partial word lists with frequencies according to different query criteria, frequency distributions of types, automatic collocation detection and so on.

The latest version of this corpus (version 3) [7] has 216.8 million tokens. Online search is available through the Bonito 2 web interface, which is part of NoSketch Engine, [8] a limited version of the Sketch Engine software.

Related Research Articles

Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions.

In linguistics, a corpus or text corpus is a large and structured set of texts. In corpus linguistics, they are used to do statistical analysis and hypothesis testing, checking occurrences or validating linguistic rules within a specific language territory.

Henry Kučera, born Jindřich Kučera, was a Czech-American linguist who pioneered corpus linguistics and linguistic software, was a major contributor to the American Heritage Dictionary, and was a pioneer in the development of spell-checking computer software.

Eur-Lex service providing legal texts of the European Union

Eur-Lex is an official website of European Union law and other public documents of the European Union (EU), published in 24 official languages of the EU. The Official Journal (OJ) of the European Union is also published on Eur-Lex. Users can access Eur-Lex free of charge and also register for a free account, which offers extra features.

Collocation Frequent occurrence of words next to each other

In corpus linguistics, a collocation is a series of words or terms that co-occur more often than would be expected by chance. In phraseology, collocation is a sub-type of phraseme. An example of a phraseological collocation, as propounded by Michael Halliday, is the expression strong tea. While the same meaning could be conveyed by the roughly equivalent powerful tea, this expression is considered excessive and awkward by English speakers. Conversely, the corresponding expression in technology, powerful computer is preferred over strong computer. Phraseological collocations should not be confused with idioms, where an idiom's meaning is derived from its convention as a stand-in for something else while collocation is a mere popular composition. The ability to use English effectively involves an awareness of a distinctive feature of the language known as collocation. Collocation is that behaviour of the language by which two or more words go together, in speech or writing.

The American National Corpus (ANC) is a text corpus of American English containing 22 million words of written and spoken data produced since 1990. Currently, the ANC includes a range of genres, including emerging genres such as email, tweets, and web data that are not included in earlier corpora such as the British National Corpus. It is annotated for part of speech and lemma, shallow parse, and named entities.

Patrick Hanks is an English lexicographer, corpus linguist, and onomastician. He has edited dictionaries of general language, as well as dictionaries of personal names.

Search engine optimisation indexing collects, parses, and stores data to facilitate fast and accurate information retrieval. Index design incorporates interdisciplinary concepts from linguistics, cognitive psychology, mathematics, informatics, and computer science. An alternate name for the process in the context of search engines designed to find web pages on the Internet is web indexing.

The Slovenian National Corpus FidaPLUS is a 621-million-word (token) corpus of the Slovenian language, gathered from selected texts written in Slovenian in different genres and styles, mainly from books and newspapers.

The Croatian Language Corpus is a corpus of Croatian compiled at the Institute of Croatian Language and Linguistics (IHJJ).

The Europarl Corpus is a corpus that consists of the proceedings of the European Parliament from 1996 to the present. In its first release in 2001, it covered eleven official languages of the European Union. With the political expansion of the EU the official languages of the ten new member states have been added to the corpus data. The latest release (2012) comprised up to 60 million words per language with the newly added languages being slightly underrepresented as data for them is only available from 2007 onwards. This latest version includes 21 European languages: Romanic, Germanic, Slavic, Finno-Ugric, Baltic, and Greek.

The following outline is provided as an overview of and topical guide to natural language processing:

Sketch Engine corpus manager and text analysis software

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing Limited since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in 90+ languages.

A corpus manager is a tool for multilingual corpus analysis, which allows effective searching in corpora.

The Bulgarian National Corpus (BulNC) is a large representative corpus of Bulgarian comprising about 200,000 texts and amounting to over 1 billion words.

Word sketch

A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. Word sketches were first introduced by the British corpus linguist Adam Kilgarriff and exploited within the Sketch Engine corpus management system. They are an extension of the general collocation concept used in corpus linguistics in that they group collocations according to particular grammatical relations. The collocation candidates in a word sketch are sorted either by their frequency or using a lexicographic association score like Dice, T-score or MI-score.
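The association scores named above can be illustrated with a minimal Python sketch. It assumes the commonly used textbook definitions of MI-score, T-score and Dice computed from raw frequency counts; the function name, parameter names and example counts are hypothetical, and this is not the Sketch Engine implementation, which may use variants of these formulas.

```python
import math


def association_scores(f_xy, f_x, f_y, n):
    """Score a candidate collocation (x, y) from raw corpus counts.

    f_xy: co-occurrence frequency of x and y (must be > 0)
    f_x, f_y: individual frequencies of x and y
    n: corpus size in tokens
    All names are illustrative assumptions, not taken from any tool's API.
    """
    if f_xy <= 0 or f_x <= 0 or f_y <= 0 or n <= 0:
        raise ValueError("all counts must be positive")

    # Co-occurrence frequency expected if x and y were independent.
    expected = f_x * f_y / n

    return {
        # MI-score: log ratio of observed to expected co-occurrence.
        "mi": math.log2(f_xy / expected),
        # T-score: observed minus expected, scaled by sqrt(observed).
        "t_score": (f_xy - expected) / math.sqrt(f_xy),
        # Dice: harmonic-mean style overlap of the two frequencies.
        "dice": 2 * f_xy / (f_x + f_y),
    }


# Hypothetical counts: the pair occurs 8 times, x 16 times, y 32 times,
# in a corpus of 1024 tokens.
scores = association_scores(f_xy=8, f_x=16, f_y=32, n=1024)
```

Sorting candidates by one of these values instead of by raw frequency changes the ranking: MI-score tends to favor rare but exclusive pairs, while T-score tends to favor frequent ones.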

SkELL

SkELL is a free corpus-based web tool that allows language learners and teachers to find authentic sentences for specific target words. For any word or phrase, SkELL displays a concordance that lists example sentences drawn from a special text corpus crawled from the World Wide Web, which has been cleaned of spam and includes only high-quality texts covering everyday, standard, formal, and professional language. There are versions of SkELL for English, Russian, German, Italian, Czech and Estonian.

The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (10^10) words per language, which gave rise to the corpus family's name.

References

  1. Institute of Linguistics
  2. Tadić 1990, 1996 (archived 2006-02-10 at the Wayback Machine), 1998 (archived 2006-02-10 at the Wayback Machine)
  3. Bonito
  4. Rychlý, Pavel (2007). "Manatee/bonito–a modular corpus manager" (PDF). 1st Workshop on Recent Advances in Slavonic Natural Language Processing. Masaryk University: 65–70.
  5. Natural Language Processing Laboratory Archived 2005-10-28 at the Wayback Machine
  6. Faculty of Informatics
  7. Tadić, Marko (2009). "New version of the Croatian National Corpus". After Half a Century of Slavonic Natural Language Processing. Masaryk University: 199–205.
  8. NoSketch Engine