General Internet Corpus of Russian

Type of site: educational/scientific project
Available in: Russian
Created by: Vladimir Selegey, Vladimir Belikov, Serge Sharoff
URL: www.webcorpora.ru/en
Commercial: no
Registration: needed; given by request
Launched: 2012
Current status: Beta-testing

The General Internet Corpus of Russian (GICR) is a corpus of Russian internet texts that has been accessible on request through an online query interface since 2013. The corpus includes a rich body of texts from the blogosphere, social networks, major news sources and literary magazines.

Goals of the project

GICR is an educational and scientific project, and independent researchers and research groups use its materials to work on many tasks in computational linguistics. While other Russian corpus projects focus on fiction and edited texts, the General Internet Corpus gives linguists an up-to-date opportunity to study the language as it is actually used, with all its slang and regional peculiarities.

Corpus gives the opportunity to carry out research in

At various times, student papers and independent research projects based on the project's material have been carried out by students, graduates and employees of MSU, MIPT, Russian State Humanitarian University, Novosibirsk State University, Higher School of Economics, Russian Academy of Sciences, SFU, CSU, SGMP and IAAS of MSU.

Scientific project leaders:

The organizations involved in support of GICR:

Size and content of the corpus

As of summer 2016, the corpus contained 19.8 billion tokens, of which 49% came from VKontakte, 40% from LiveJournal, 4% each from Mail.ru Blogs and from News, and 2% from Russian Magazine Hall. [3] The sources collected in the news segment are RIA Novosti, Regnum, Lenta.ru and Rosbalt. Texts carry metadata markup (date of creation of the text; sex, place and year of birth of the author; Internet genre, etc.), and all texts are provided with automatic morphological tagging and lemmatization. [4] Most of the collected texts date from 2013–2014, although some segments, such as Russian Magazine Hall, include texts going back to 1994. [5]

| Corpus segment | Words, millions | Documents |
|---|---|---|
| Mail.Ru Blogs | 707 | 9882120 |
| VKontakte | 9820 | 193770717 |
| Live Journal | 8110 | 73229158 |
| Russian Magazine Hall | 313 | 56547 |
| News (ria, regnum, lentaru, rosbalt) | 851 | 2964897 |
| All corpora | 19801 | 279903439 |
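
The segment figures above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the GICR tooling: the names `SEGMENTS` and `segment_stats` are hypothetical, and the numbers are copied from the segment table. It computes each segment's share of the total word count and its mean document length. Because the table counts words while the percentages quoted in the text refer to tokens, the computed shares may differ from the quoted ones by about a percentage point.

```python
# Word counts (in millions) and document counts, copied from the segment table.
# SEGMENTS and segment_stats are illustrative names, not part of any GICR API.
SEGMENTS = {
    "Mail.Ru Blogs": (707, 9_882_120),
    "VKontakte": (9_820, 193_770_717),
    "Live Journal": (8_110, 73_229_158),
    "Russian Magazine Hall": (313, 56_547),
    "News": (851, 2_964_897),
}


def segment_stats(segments):
    """Return the total word count (millions) and, per segment,
    a tuple (share of all words in percent, mean words per document)."""
    total_words = sum(words for words, _ in segments.values())
    stats = {}
    for name, (words_m, docs) in segments.items():
        share = 100 * words_m / total_words
        mean_len = words_m * 1_000_000 / docs
        stats[name] = (share, mean_len)
    return total_words, stats


total_words, stats = segment_stats(SEGMENTS)
total_docs = sum(docs for _, docs in SEGMENTS.values())

print(f"All corpora: {total_words} million words, {total_docs} documents")
for name, (share, mean_len) in stats.items():
    print(f"{name}: {share:.1f}% of words, {mean_len:.0f} words per document")
```

The per-segment sums reproduce the "All corpora" row, and the mean document length makes the difference between segments visible: social network posts are short, while literary magazine texts are long.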

GICR is one of the few mega-corpus projects, that is, corpora whose available size reaches several billion words.

| Corpus | Languages | Access | Size | Facilities |
|---|---|---|---|---|
| COW: Free, Large Web Corpora in European Languages | English, French, German, Spanish, Swedish, Dutch | Free, after registration; trial access is possible without registration | 30 billion words | KWIC format, morphological tagging, CQP search, markup and search by date, URL, country, city, etc. |
| Sketch Engine | English, French, German, Italian, Arabic, Russian, Spanish, Portuguese, Korean, Japanese, Chinese + more languages available at extra charge | Paid access; trial access is possible after registration | 86 billion words | concordances, sketch grammar, thesaurus, KWIC, morphological tagging, CQP search |
| Aranea Corpora | English, Russian, Finnish, French, German, Hungarian, Spanish, Italian, Dutch, Polish, Slovak | Free, after registration; trial access is possible without registration | 14 billion words | noSketch Engine, concordances, sketch grammar, thesaurus, KWIC, morphological tagging, CQP search, comparable query results in different languages |
| GICR (General Internet Corpus of Russian) | Russian | Free, registration on request | 20 billion words | concordances, thesaurus, KWIC, morphological tagging, CQP search, markup and search by date, country, city, internet-segment, sex, year and place of birth of the author, “query mail” for users |
| GloWbE (Corpus of Global Web-Based English) | English, specification for 20 countries | No registration | 1.9 billion words | KWIC, concordances, collocates, results comparable by dialects, CQP search, corpus can be downloaded |

Access

The GICR interface is currently in beta. Corpus search is free of charge, but access is granted to researchers on request. [6]

References

  1. Automatic Classification of Web Texts Using Functional Text Dimensions
  2. "Collective | GICR".
  3. http://www.webcorpora.ru/%D0%BE-%D0%BA%D0%BE%D1%80%D0%BF%D1%83%D1%81%D0%B5
  4. http://www.webcorpora.ru/%D0%BE-%D0%BA%D0%BE%D1%80%D0%BF%D1%83%D1%81%D0%B5
  5. Post in the blog: https://vk.com/wall-89094852_220
  6. "Контакты | ГИКРЯ" (Contacts | GICR).

Further reading

  1. Belikov V., Kopylov N., Piperski A., Selegey V., Sharoff S., (2013), Big and diverse is beautiful: A large corpus of Russian to study linguistic variation. In Web as Corpus Workshop (WAC-8).
  2. Lagutin M. B., Katinskaya A. Y., Selegey V. P., Sharoff S., Sorokin A. A. (2015) Automatic Classification of Web Texts Using Functional Text Dimensions. In Dialogue, Russian International Conference on Computational Linguistics, Bekasovo
  3. Katinskaya A., Sharoff S. (2015) Applying Multi-dimensional Analysis to a Russian Webcorpus: Searching for Evidence of Genres, in Proc. of the Workshop on Balto-Slavic Natural Language Processing associated with the International Conference RANLP, Hissar, Bulgaria.

Official site of GICR: www.webcorpora.ru/en