Text corpora (singular: text corpus ) are large and structured sets of texts, which have been systematically collected. Text corpora are used by corpus linguists and within other branches of linguistics for statistical analysis, hypothesis testing, finding patterns of language use, investigating language change and variation, and teaching language proficiency. [1]
Corpus linguistics is an empirical method for the study of language by way of a text corpus. Corpora are balanced, often stratified collections of authentic, "real world", text of speech or writing that aim to represent a given linguistic variety. Today, corpora are generally machine-readable data collections.
In linguistics and natural language processing, a corpus or text corpus is a dataset, consisting of natively digital and older, digitalized, language resources, either annotated or unannotated.
Word-sense disambiguation is the process of identifying which sense of a word is meant in a sentence or other segment of context. In human language processing and cognition, it is usually subconscious.
A parallel text is a text placed alongside its translation or translations. Parallel text alignment is the identification of the corresponding sentences in both halves of the parallel text. The Loeb Classical Library and the Clay Sanskrit Library are two examples of dual-language series of texts. Reference Bibles may contain the original languages and a translation, or several translations by themselves, for ease of comparison and study; Origen's Hexapla placed six versions of the Old Testament side by side. A famous example is the Rosetta Stone, whose discovery allowed the Ancient Egyptian language to begin being deciphered.
EUR-Lex is the official online database of European Union law and other public documents of the European Union (EU), published in 24 official languages of the EU. The Official Journal (OJ) of the European Union is also published on EUR-Lex. Users can access EUR-Lex free of charge and also register for a free account, which offers extra features.
The British National Corpus (BNC) is a 100-million-word text corpus of samples of written and spoken English from a wide range of sources. The corpus covers British English of the late 20th century from a wide variety of genres, with the intention that it be a representative sample of spoken and written British English of that time. It is used in corpus linguistics for analysis of corpora.
A speech corpus is a database of speech audio files and text transcriptions. In speech technology, speech corpora are used, among other things, to create acoustic models. In linguistics, spoken corpora are used to do research into phonetic, conversation analysis, dialectology and other fields.
Internet linguistics is a domain of linguistics advocated by the English linguist David Crystal. It studies new language styles and forms that have arisen under the influence of the Internet and of other new media, such as Short Message Service (SMS) text messaging. Since the beginning of human–computer interaction (HCI) leading to computer-mediated communication (CMC) and Internet-mediated communication (IMC), experts, such as Gretchen McCulloch have acknowledged that linguistics has a contributing role in it, in terms of web interface and usability. Studying the emerging language on the Internet can help improve conceptual organization, translation and web usability. Such study aims to benefit both linguists and web users combined.
The International Corpus of English (ICE) is a set of text corpora representing varieties of English from around the world. Over twenty countries or groups of countries where English is the first language or an official second language are included.
The Corpus of Contemporary American English (COCA) is a one-billion-word corpus of contemporary American English. It was created by Mark Davies, retired professor of corpus linguistics at Brigham Young University (BYU).
Macmillan English Dictionary for Advanced Learners, also known as MEDAL, is an advanced learner's dictionary first published in 2002 by Macmillan Education. It shares most of the features of this type of dictionary: it provides definitions in simple language, using a controlled defining vocabulary; most words have example sentences to illustrate how they are typically used; and information is given about how words combine grammatically or in collocations. MEDAL also introduced a number of innovations. These include:
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase tatoeba (例えば), meaning 'for example'. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as "Tatoebans". It is run by Association Tatoeba, a French non-profit organization funded through donations.
The Cambridge International Corpus (CIC) is a collection of over 800 million words of real spoken and written English. The texts are stored in a database that can be searched to see how English is used. The CIC also contains the Cambridge Learner Corpus, a unique collection of over 60,000 exam papers from Cambridge ESOL. It shows real mistakes students make and highlights the parts of English which cause problems for students.
The Europarl Corpus is a corpus that consists of the proceedings of the European Parliament from 1996 to 2012. In its first release in 2001, it covered eleven official languages of the European Union. With the political expansion of the EU the official languages of the ten new member states have been added to the corpus data. The latest release (2012) comprised up to 60 million words per language with the newly added languages being slightly underrepresented as data for them is only available from 2007 onwards. This latest version includes 21 European languages: Romanic, Germanic, Slavic, Finno-Ugric, Baltic, and Greek.
Sketch Engine is a corpus manager and text analysis software developed by Lexical Computing since 2003. Its purpose is to enable people studying language behaviour to search large text collections according to complex and linguistically motivated queries. Sketch Engine gained its name after one of the key features, word sketches: one-page, automatic, corpus-derived summaries of a word's grammatical and collocational behaviour. Currently, it supports and provides corpora in over 90 languages.
The Bulgarian WordNet (BulNet) is an electronic multilingual dictionary of synonym sets along with their explanatory definitions and sets of semantic relations with other words in the language.
The Bulgarian National Corpus (BulNC) is a large representative corpus of Bulgarian comprising about 200,000 texts and amounting to over 1 billion words.
A word sketch is a one-page, automatic, corpus-derived summary of a word’s grammatical and collocational behaviour. Word sketches were first introduced by the British corpus linguist Adam Kilgarriff and exploited within the Sketch Engine corpus management system. They are an extension of the general collocation concept used in corpus linguistics in that they group collocations according to particular grammatical relations. The collocation candidates in a word sketch are sorted either by their frequency or using a lexicographic association score like Dice, T-score or MI-score.
The TenTen Corpus Family (also called TenTen corpora) is a set of comparable web text corpora, i.e. collections of texts that have been crawled from the World Wide Web and processed to match the same standards. These corpora are made available through the Sketch Engine corpus manager. There are TenTen corpora for more than 35 languages. Their target size is 10 billion (1010) words per language, which gave rise to the corpus family's name.
The Czech National Corpus (CNC) is a large electronic corpus of written and spoken Czech language, developed by the Institute of the Czech National Corpus (ICNC) in the Faculty of Arts at Charles University in Prague. The collection is used for teaching and research in corpus linguistics. The ICNC collaborates with over 200 researchers and students, 270 publishers, and other similar research projects.
{{cite journal}}
: External link in |journal=
(help)