Scottish Corpus of Texts and Speech

Last updated January 27, 2025

The Scottish Corpus of Texts & Speech (SCOTS) is an ongoing project to build a corpus of modern-day (post-1940) written and spoken texts in Scottish English and varieties of Scots. SCOTS has been available online since November 2004, and can be freely searched and browsed. It reached 4.7 million words by 2015. [1]

Language variety

SCOTS contains texts in Scottish English and varieties of broad Scots, including Doric, Lallans, Insular Scots, and urban varieties such as Glaswegian. The texts are spread both geographically and demographically. Each text is accompanied by extensive metadata, including the author's decade of birth, gender, occupation, birthplace and place of residence, as well as details about the text itself, such as publication information, audience, date and genre.

Genre and mode

SCOTS is a multimedia corpus containing both written and spoken texts. The spoken texts are available as orthographic transcriptions, accompanied by the source audio or video files. SCOTS covers a large number of genres and text types, including prose fiction, poetry, business and personal correspondence, religious texts, parliamentary and administrative documents, emails, conversations and interviews.

Search and analysis

SCOTS can be investigated in various ways, depending on the user's interest. The corpus can be browsed, for example by the author's name or date of the text, and all texts can be downloaded in plain text format.

Transcriptions are synchronised with the corresponding audio or video files, which are streamed and may also be downloaded.

An Advanced Search facility allows the user to build up more complex queries, choosing from all the fields available in the metadata. Geographical results are plotted on an interactive map, so regional variation may be investigated.
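
As a rough illustration of this kind of metadata query, the sketch below filters a handful of records on several fields at once. It is not the SCOTS site's actual implementation: the `TextRecord` class, its field names, the `advanced_search` function and the sample records are all hypothetical, chosen only to echo the metadata categories described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TextRecord:
    """Hypothetical metadata record; the field names are illustrative only."""
    title: str
    mode: str          # "written" or "spoken"
    genre: str
    birth_decade: int  # author's decade of birth, e.g. 1950
    birthplace: str


def advanced_search(records, **criteria):
    """Return the records whose fields equal every given criterion."""
    return [
        r for r in records
        if all(getattr(r, field) == value for field, value in criteria.items())
    ]


# Invented sample data, not drawn from the corpus.
records = [
    TextRecord("Text A", "spoken", "conversation", 1950, "Aberdeen"),
    TextRecord("Text B", "written", "poetry", 1930, "Glasgow"),
    TextRecord("Text C", "spoken", "interview", 1950, "Glasgow"),
]

hits = advanced_search(records, mode="spoken", birth_decade=1950)
print([r.title for r in hits])
```

Each keyword argument narrows the result set further, which mirrors the idea of building up a complex query one metadata field at a time.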

Advanced Search results can also be viewed as a keyword-in-context (KWIC) concordance, which can be re-sorted to highlight collocational patterns.
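
A KWIC concordance lists every occurrence of a search word together with a few words of context on either side; sorting the lines by the word that follows (or precedes) the search word groups recurring combinations together. The following is a minimal sketch of that general technique, assuming simple regex tokenisation. The functions `kwic` and `sort_by_right` and the sample sentence are invented for illustration and do not reflect how the SCOTS site is built.

```python
import re


def kwic(text, keyword, width=3):
    """Return (left, node, right) tuples for each occurrence of keyword.

    Context is taken from the flat token stream, so it may cross
    sentence boundaries.
    """
    tokens = re.findall(r"\w+(?:'\w+)?", text.lower())
    rows = []
    for i, tok in enumerate(tokens):
        if tok == keyword.lower():
            left = tokens[max(0, i - width):i]
            right = tokens[i + 1:i + 1 + width]
            rows.append((left, tok, right))
    return rows


def sort_by_right(rows):
    """Reorder concordance lines by the words to the right of the node."""
    return sorted(rows, key=lambda row: row[2])


# Invented sample text, not drawn from the corpus.
sample = "The wee dog ran. A wee lass sang. The wee bairn slept."
for left, node, right in sort_by_right(kwic(sample, "wee", width=2)):
    print(f"{' '.join(left):>12} [{node}] {' '.join(right)}")
```

Sorting on the left context instead (for example with `key=lambda row: row[0][::-1]`) would group lines by the word immediately before the search word.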


References

[1] Kopaczyk, Joanna (29 April 2016). "Wendy Anderson (ed.), Language in Scotland. Corpus-based studies". Northern Scotland. 7 (1): 112–117. doi:10.3366/nor.2016.0117. ISSN 0306-5278.

External links

Official website

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

