Corpus of Written Tatar

Type of site	research/educational project
Available in	English/Russian/Tatar
Founded	2011
Headquarters	Kazan, Russia
Founder(s)	Saykhunov M.R., Ibragimov T.I., Khusainov R.R.
URL	www.corpus.tatar/en
Launched	March 15, 2012
Current status	The project is being actively developed.

Last updated September 07, 2023

The Corpus of Written Tatar (Tatar Corpus) is an electronic corpus of the Tatar language that is available online. This collection of Tatar texts in electronic form is intended for those interested in the structure, present condition and prospects of the Tatar language, and it is a key resource for anyone who wants to study Tatar with the methods of corpus linguistics. The website was opened on March 15, 2012, and is available in Tatar, Russian and English.

Size of the Corpus

At the end of 2014, the Corpus contained more than 116 million words, 10 million sentences and about 1.5 million distinct word forms.
To prevent copying, texts are stored in the Corpus as sentences in mixed order.
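
The three figures above (words, sentences, distinct word forms) and the mixed-order storage can be illustrated with a small sketch. It is not the Corpus's own code: the regular-expression tokenizer, the lowercasing of word forms, and the function names `corpus_stats` and `shuffle_sentences` are all assumptions made for illustration.

```python
import random
import re


def corpus_stats(sentences):
    """Return (word count, sentence count, distinct word form count).

    Assumption: a "word" is any run of letters or digits, and word forms
    are compared case-insensitively.
    """
    tokens = [tok.lower() for s in sentences for tok in re.findall(r"\w+", s)]
    return len(tokens), len(sentences), len(set(tokens))


def shuffle_sentences(sentences, seed=None):
    """Return a copy of the sentences in random order.

    Storing sentences this way keeps every sentence searchable while making
    it hard to reassemble the original texts.
    """
    shuffled = list(sentences)
    random.Random(seed).shuffle(shuffled)
    return shuffled


sample = ["Мин китап укыйм", "Ул китап яза"]
print(corpus_stats(sample))
print(shuffle_sentences(sample, seed=1))
```

Shuffling changes only the order of sentences, so the three counts are the same before and after.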

Access

Access to the Tatar Corpus for research purposes is free of charge.

Creation of the Corpus

The creation of the Corpus of Written Tatar was initiated in 2010 by a group of enthusiasts. The task was considered urgent because the Corpus would provide the database of texts needed for work on machine translation systems for the Tatar language, and it was also needed for solving problems in Tatar speech synthesis and recognition.

Practical value and areas of use

The basic purpose of the Corpus of Written Tatar is to assist research into the Tatar lexicon. The Corpus can also be used in language learning and as a source of models for various types of documents.
It allows the user to search for words by specific features, to see the words in their contexts, and to obtain frequency data.

Contextual (statistical) search

This type of search makes it possible to see the right, left and semantic contexts of a specific word, sorted by frequency.
Right context - words placed directly after the current word.
Left context - words placed directly before the current word.
Semantic context - words located in the same sentence as the current word, i.e. words that are assumed to have some semantic connection with it.
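
The three context types defined above can be sketched as frequency counts over a list of sentences. This is a minimal illustration, not the Corpus's implementation: the tokenizer, the function name `contexts`, and the choice to count each semantic-context word once per sentence are assumptions.

```python
import re
from collections import Counter


def tokenize(sentence):
    # Assumption: words are runs of letters or digits, compared in lowercase.
    return re.findall(r"\w+", sentence.lower())


def contexts(sentences, word):
    """Return (left, right, semantic) contexts of `word`, most frequent first."""
    word = word.lower()
    left, right, semantic = Counter(), Counter(), Counter()
    for sentence in sentences:
        tokens = tokenize(sentence)
        if word not in tokens:
            continue
        for i, tok in enumerate(tokens):
            if tok != word:
                continue
            if i > 0:
                left[tokens[i - 1]] += 1   # word directly before
            if i + 1 < len(tokens):
                right[tokens[i + 1]] += 1  # word directly after
        # Every other word of the sentence, counted once per sentence.
        semantic.update(set(tokens) - {word})
    return left.most_common(), right.most_common(), semantic.most_common()


sample = ["мин китап укыйм", "ул китап укый", "мин китап яратам"]
left, right, semantic = contexts(sample, "китап")
print(left)
```

Each returned list holds `(word, frequency)` pairs in descending order of frequency, matching the "sorted by frequency" presentation described above.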

Complex morphological search

In 2014, the Tatar Corpus was morphologically annotated. The meta-language of grammatical labels is based on the system of tags for Turkic languages developed by the international Apertium project, which aims to develop machine translation systems for a wide variety of languages. The main arguments in favor of choosing Apertium's morphological tagger for annotating the Corpus are:
- the high quality of its morphological annotation;
- its being an open-source project: all the source code and data are publicly available free of charge.
The Complex Morphological Search system, developed by the creators of the Corpus in 2015-2016, allows the user to search the Corpus by different combinations of parameters: word form, lemma, set of morphological (grammatical) tags, beginning of the word, middle part of the word, end of the word, and the distance between the searched words. A search query can contain at most five tokens and, accordingly, four distances between them.
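
A query of this kind can be modelled as a list of per-token constraints plus the distances between them. The sketch below is hypothetical and does not reproduce the Corpus's search engine: the class and function names, the reading of "distance" as the maximum number of positions between consecutive matched tokens (1 means adjacent), the treatment of "middle part" as a substring found anywhere in the word form, and the tag names in the example are all assumptions.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Optional

MAX_TOKENS = 5  # limit stated in the text: five tokens, four distances


@dataclass(frozen=True)
class Token:
    form: str             # word form as it appears in the text
    lemma: str            # dictionary form
    tags: FrozenSet[str]  # morphological tags (illustrative names)


@dataclass(frozen=True)
class TokenQuery:
    form: Optional[str] = None
    lemma: Optional[str] = None
    tags: FrozenSet[str] = frozenset()
    starts: str = ""      # beginning of the word
    contains: str = ""    # middle part (assumed: substring anywhere)
    ends: str = ""        # end of the word

    def matches(self, tok: Token) -> bool:
        return (
            (self.form is None or tok.form == self.form)
            and (self.lemma is None or tok.lemma == self.lemma)
            and self.tags <= tok.tags
            and tok.form.startswith(self.starts)
            and self.contains in tok.form
            and tok.form.endswith(self.ends)
        )


def search(sentence: List[Token], queries: List[TokenQuery],
           distances: List[int]) -> List[int]:
    """Return start positions in `sentence` where the whole query matches."""
    if not 1 <= len(queries) <= MAX_TOKENS:
        raise ValueError("a query must contain between 1 and 5 tokens")
    if len(distances) != len(queries) - 1:
        raise ValueError("need exactly one distance between each token pair")

    def extend(q_idx: int, pos: int) -> bool:
        # Match queries[q_idx:] after the previous query matched at `pos`.
        if q_idx == len(queries):
            return True
        last = min(pos + distances[q_idx - 1], len(sentence) - 1)
        return any(
            queries[q_idx].matches(sentence[j]) and extend(q_idx + 1, j)
            for j in range(pos + 1, last + 1)
        )

    return [i for i, tok in enumerate(sentence)
            if queries[0].matches(tok) and extend(1, i)]


sentence = [
    Token("мин", "мин", frozenset({"prn"})),
    Token("китаплар", "китап", frozenset({"n", "pl"})),
    Token("укыйм", "укы", frozenset({"v"})),
]
query = [TokenQuery(lemma="китап", tags=frozenset({"pl"})),
         TokenQuery(tags=frozenset({"v"}))]
print(search(sentence, query, distances=[1]))
```

Leaving a field at its default means "no constraint", so a single `TokenQuery` can combine any subset of the parameters listed above.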

Tatar Speech synthesis

The Corpus of Written Tatar offers the user a unique opportunity to listen to the sentences found in a search, and also to listen to any other text that the user enters; see http://search.corpus.tatar/search/sintez_en.html.

Statistical data

The creators of the Corpus of Tatar language upload various additional statistical data as soon as they become available as a result of processing the Corpus, see http://corpus.tatar/stat_en.htm.

Shortcomings and prospects

Absence of offline corpus version.
Automatic disambiguation.

Authors

Creators of the Corpus:

Saykhunov M.R. (Candidate of Philology, research fellow at the Institute of Informatics)
Ibragimov T.I. (Candidate of Philology, associate professor at the Applied Linguistics Department of Kazan Federal University)
Khusainov R.R. (Engineer, "GDC")

With the assistance of:

The Republican Center for Development of Traditional Culture
The Research Unit for Volgaic Languages at the University of Turku (Finland)
«RX5» company
The editorial office of the popular scientific journal "Фән һәм Тел"

Literature

[1]

References

[1] "Письменный Корпус Татарского Языка" (Corpus of Written Tatar).

External links

Corpus of Written Tatar (Corpus of Tatar language) - Official site

