Corpus language

A corpus language is a language that has no living speakers but for which numerous records produced by its native speakers survive.[1] Examples of corpus languages are Ancient Greek, Latin, the Egyptian language, Old English and Elamite.

Some corpus languages, such as Ancient Greek and Latin, left very large corpora and can therefore be fully reconstructed, even though some details of pronunciation may remain unclear. Such languages can still be used today, as is the case with Sanskrit and Latin. Others have such limited corpora that some important words (e.g., some pronouns) are unattested; examples are Ugaritic and Gothic. Languages attested only by a few words, often names, and a few phrases (called Trümmersprachen in German linguistics, literally "rubble languages") can be reconstructed only in a very limited way, and their genetic relationship to other languages often remains unclear. Examples are the Lombardic language and Dadanitic, a Semitic language that may be close to Classical Arabic.

Corpus languages are studied using the methods of corpus linguistics, which are also commonly applied to the writings and other records of living languages.

Not all extinct languages are corpus languages, since there are many extinct languages for which few or no writings or other records survive.

Related Research Articles

Greek language: Indo-European language

Greek is an independent branch of the Indo-European family of languages, native to Greece, Cyprus, Italy, southern Albania, and other regions of the Balkans, the Black Sea coast, Asia Minor, and the Eastern Mediterranean. It has the longest documented history of any Indo-European language, spanning at least 3,400 years of written records. Its writing system is the Greek alphabet, which has been used for approximately 2,800 years; previously, Greek was recorded in writing systems such as Linear B and the Cypriot syllabary. The alphabet arose from the Phoenician script and was in turn the basis of the Latin, Cyrillic, Coptic, Gothic, and many other writing systems.

Indo-European languages: Language family native to Eurasia

The Indo-European languages are a language family native to the overwhelming majority of Europe, the Iranian plateau, and the northern Indian subcontinent. Some European languages of this family—English, French, Portuguese, Russian, Dutch, and Spanish—have expanded through colonialism in the modern period and are now spoken across several continents. The Indo-European family is divided into several branches or sub-families, of which there are eight groups with languages still alive today: Albanian, Armenian, Balto-Slavic, Celtic, Germanic, Hellenic, Indo-Iranian, and Italic; another nine subdivisions are now extinct.

Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.

In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.

Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.

The Anatolian languages are an extinct branch of Indo-European languages that were spoken in Anatolia, part of present-day Turkey. The best known Anatolian language is Hittite, which is considered the earliest-attested Indo-European language.

Extinct language: Language that no longer has any first-language or second-language speakers

An extinct language is a language with no living descendants that no longer has any first-language or second-language speakers. In contrast, a dead language is a language that no longer has any first-language speakers, but does have second-language speakers or is used fluently in written form, such as Latin. A dormant language is a dead language that still serves as a symbol of ethnic identity to an ethnic group; these languages are often undergoing a process of revitalisation. Languages that have first-language speakers are known as modern or living languages to contrast them with dead languages, especially in educational contexts.

Lydian is an extinct Indo-European Anatolian language spoken in the region of Lydia, in western Anatolia. The language is attested in graffiti and in coin legends from the late 8th century or the early 7th century to the 3rd century BCE, but well-preserved inscriptions of significant length are so far limited to the 5th century and the 4th century BCE, during the period of Persian domination. Thus, Lydian texts are effectively contemporaneous with those in Lycian.

That is an English-language word used for several grammatical purposes. These include use as an adjective, conjunction, pronoun, adverb and intensifier; it indicates distance from the speaker, in contrast to words like this.

Proto-Indo-European (PIE) is the reconstructed common ancestor of the Indo-European language family. No direct record of Proto-Indo-European exists; its proposed features have been derived by linguistic reconstruction from documented Indo-European languages.

Language contact occurs when speakers of two or more languages or varieties interact with and influence each other. The study of language contact is called contact linguistics. Language contact can occur at language borders, between adstratum languages, or as the result of migration, with an intrusive language acting as either a superstratum or a substratum.

Judaeo-Romance languages are Jewish languages derived from Romance languages, spoken by various Jewish communities originating in regions where Romance languages predominate, and altered to such an extent that they have gained recognition as languages in their own right. The status of many Judaeo-Romance languages is controversial because, despite manuscripts preserving transcriptions of Romance languages in the Hebrew alphabet, there is often little to no evidence that these "dialects" were actually spoken by Jews living in the various European nations.

Kamassian language: Extinct Samoyed language

Kamassian is an extinct Samoyedic language, formerly spoken by the Kamasins in Russia, north of the Sayan Mountains. It is included by convention in the Southern group together with Mator and Selkup. The last native speaker of Kamassian, Klavdiya Plotnikova, died in 1989. The last speakers lived mainly in the village of Abalakovo. Prior to its extinction, the language was strongly influenced by Turkic and Yeniseian languages.

Treebank: Text corpus with tree annotations

In linguistics, a treebank is a parsed text corpus that annotates syntactic or semantic sentence structure. The construction of parsed corpora in the early 1990s revolutionized computational linguistics, which benefitted from large-scale empirical data.

The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.

Languages of Scotland

The languages of Scotland belong predominantly to the Germanic and Celtic language families. The main language now spoken in Scotland is English, while Scots and Scottish Gaelic are minority languages. The dialect of English spoken in Scotland is referred to as Scottish English.

The phonology of the Proto-Indo-European language (PIE) has been reconstructed by linguists, based on the similarities and differences among current and extinct Indo-European languages. Because PIE was not written, linguists must rely on the evidence of its earliest attested descendants, such as Hittite, Sanskrit, Ancient Greek, and Latin, to reconstruct its phonology.

Languages of the Roman Empire

Latin and Greek were the dominant languages of the Roman Empire, but other languages were regionally important. Latin was the original language of the Romans and remained the language of imperial administration, legislation, and the military throughout the classical period. In the West, it became the lingua franca and came to be used for even local administration of the cities including the law courts. After all freeborn inhabitants of the Empire were granted universal citizenship in 212 AD, a great number of Roman citizens would have lacked Latin, though they were expected to acquire at least a token knowledge, and Latin remained a marker of "Romanness".

Sketch Engine: Corpus manager and text analysis software

Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.

Ancient text corpora: All known writing up to 300 CE

Ancient text corpora are the entire collection of texts from the period of ancient history, defined in this article as the period from the beginning of writing up to 300 AD. These corpora are important for the study of literature, history, linguistics, and other fields, and are a fundamental component of the world's cultural heritage.

References

  1. Langslow, D. R. (2002). "Approaching bilingualism in corpus languages". In James Noel Adams, Mark Janse, Simon Swain (eds.), Bilingualism in Ancient Society: Language Contact and the Written Text. Oxford: OUP.
