Type of site | educational/scientific project |
---|---|
Available in | Russian language |
Created by | Vladimir Selegey, Vladimir Belikov, Serge Sharoff |
URL | www |
Commercial | no |
Registration | needed; given by request |
Launched | 2012 |
Current status | Beta-testing |
The General Internet Corpus of Russian (GICR) is a corpus of Russian internet texts that has been accessible on request through an online query interface since 2013. The corpus includes extensive text material from the blogosphere, social networks, major news sources and literary magazines.
The project has educational and scientific status, and independent researchers and research groups use material obtained from GICR to address many tasks in computational linguistics. While other Russian corpus projects focus on fiction and edited texts, the General Internet Corpus gives linguists a timely opportunity to study the language as it is actually used, with all its slang and regional peculiarities.
Corpus gives the opportunity to carry out research in
At various times, student papers and independent research projects based on the project material have been carried out by students, graduates and employees of MSU, MIPT, Russian State Humanitarian University, Novosibirsk State University, Higher School of Economics, Russian Academy of Sciences, SFU, CSU, SGMP, IAAS of MSU.
Scientific project leaders:
The organizations involved in support of GICR:
As of summer 2016, the corpus contained 19.8 billion tokens, of which 49% came from VKontakte, 40% from LiveJournal, another 4% from Mail.ru Blogs and News, and 2% from Russian Magazine Hall. [3] The news segment is collected from RIA Novosti, Regnum, Lenta.ru and Rosbalt. Texts carry metadata markup (date of creation of the text; sex, place and year of birth of the author; Internet genre; etc.), and all texts are provided with automatic morphological tagging and lemmatization. [4] Most of the collected texts were written in 2013–2014, although some segments, such as Russian Magazine Hall, include texts dating back to 1994. [5]
Corpus segment | Words, millions | Documents |
---|---|---|
Mail.Ru Blogs | 707 | 9882120 |
VKontakte | 9820 | 193770717 |
LiveJournal | 8110 | 73229158 |
Russian Magazine Hall | 313 | 56547 |
News (RIA Novosti, Regnum, Lenta.ru, Rosbalt) | 851 | 2964897 |
All corpora | 19801 | 279903439 |
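The segment shares can be recomputed directly from the word counts in the table. The following is a minimal Python sketch: the figures are copied from the table, while the dictionary and function names are illustrative assumptions and not part of the GICR project. The results may differ slightly from the percentages quoted in the text, depending on how those were rounded.

```python
# Segment sizes in millions of words, copied from the table above.
# The names `segments` and `segment_shares` are illustrative only.
segments = {
    "Mail.Ru Blogs": 707,
    "VKontakte": 9820,
    "LiveJournal": 8110,
    "Russian Magazine Hall": 313,
    "News": 851,
}


def segment_shares(sizes):
    """Return each segment's share of the total size, in percent."""
    total = sum(sizes.values())
    return {name: 100 * words / total for name, words in sizes.items()}


total = sum(segments.values())
shares = segment_shares(segments)

print(f"Total: {total} million words")
# List segments from largest to smallest share.
for name, share in sorted(shares.items(), key=lambda item: -item[1]):
    print(f"{name:<22} {share:5.1f}%")
```

The segment sizes add up to the "All corpora" row, so the table can be checked for internal consistency in the same way.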
GICR is one of the few mega-corpus projects, that is, corpora whose available size reaches several billion words.
Corpus | Languages | Access | Size | Facilities |
---|---|---|---|---|
COW: Free, Large Web Corpora in European Languages | English, French, German, Spanish, Swedish, Dutch | Free, after registration; trial access is possible without registration | 30 billion words | KWIC format, morphological tagging, CQP search, markup and search by date, URL, country, city, etc. |
Sketch Engine | English, French, German, Italian, Arabic, Russian, Spanish, Portuguese, Korean, Japanese, Chinese + more languages available at extra charge | Paid access; trial access is possible after registration | 86 billion words | concordances, sketch grammar, thesaurus, KWIC, morphological tagging, CQP search |
Aranea Corpora | English, Russian, Finnish, French, German, Hungarian, Spanish, Italian, Dutch, Polish, Slovak | Free, after registration; trial access is possible without registration | 14 billion words | noSketch Engine, concordances, sketch grammar, thesaurus, KWIC, morphological tagging, CQP search, comparable query results in different languages |
GICR (General Internet Corpus of Russian) | Russian | Free, registration on request | 20 billion words | concordances, thesaurus, KWIC, morphological tagging, CQP search, markup and search by date, country, city, internet segment, sex, year and place of birth of the author, “query mail” for users |
GloWbE (Corpus of Global Web-Based English) | English, specification for 20 countries | No registration | 1.9 billion words | KWIC, concordances, collocates, results comparable by dialects, CQP search, corpus can be downloaded |
The GICR interface is currently in beta, so corpus search is free of charge but is available to researchers only on request. [6]