Pangloss Collection

The Pangloss Collection is a digital library whose objective is to store and facilitate access to audio recordings in endangered languages of the world. Developed by the LACITO centre of CNRS in Paris, the collection provides free online access to documents of connected, spontaneous speech, in otherwise little-documented languages of all continents. [1]

Principles

A sound archive with synchronized transcriptions

For the science of linguistics, language is first and foremost spoken language. The medium of spoken language is sound. The Pangloss Collection gives access to original recordings simultaneously with transcriptions and translations, as a resource for further research. After being recorded in their cultural context, the texts have been transcribed in collaboration with native speakers.

A structured, open architecture

The archived data is structured in accordance with current data-processing standards, using an open architecture and an open format, and may be downloaded under a Creative Commons license. The software used to prepare and disseminate it is open-source. The Pangloss Collection is a member of the OLAC network of archival repositories and of the Digital Endangered Languages and Music Archive Network (DELAMAN).

History

The collection was initially called the LACITO Archive. [2] [3] The project originated in 1996 from the collaboration of Boyd Michailovsky, linguist at LACITO, with John B. Lowe, engineer; [4] :15 they were later joined by Michel Jacobson, engineer, who developed some tools for the project, and brought it online. [1] :124 [4]

The purpose of the archive was “to conserve, and to make available for research, recorded and transcribed oral traditions and other linguistic materials in (mainly) unwritten languages, giving simultaneous access to sound recordings and text annotation.” [4] The earliest archived corpora in the collection documented languages from Nepal, New Caledonia, eastern Africa and French Guiana. [5]

The archive has grown steadily since the early 2000s, [6] incorporating corpora from various linguists, whether members of LACITO or not. In 2009, the archive had 200 recordings in 45 languages. [7] In 2014, the (newly renamed) Pangloss Collection had 1,400 recordings in 70 languages. [1] :121

As of April 2021, the Pangloss archive contains 5,038 recordings [8] in 196 languages, [9] totalling 780 hours of audio and video recordings. [6]

Languages in the Pangloss Collection include Mwotlap (Austronesian; Vanuatu), [10] Japhug (Sino-Tibetan; Southwest China), [11] Ersu (Sino-Tibetan; Southwest China), [12] Naxi (or Yongning Na: Sino-Tibetan; Southwest China), [13] and Cèmuhî (Austronesian; New Caledonia). [14]

References

  1. Michailovsky, Boyd; Mazaudon, Martine; Michaud, Alexis; Guillaume, Séverine; François, Alexandre; Adamou, Evangelia (2014). "Documenting and researching endangered languages: the Pangloss Collection". Language Documentation & Conservation. 8: 119–135.
  2. Jacobson, Michel; Michailovsky, Boyd (2002). The LACITO Archive : its purpose and implementation. Int'l Workshop on Resources and Tools in Field Linguistics. Las Palmas, Canary Is., Spain.
  3. Screen capture of LACITO's archive homepage — 27 February 2001.
  4. Jacobson, Michel; Michailovsky, Boyd; Lowe, John B. (2001). "Linguistic documents synchronizing sound and text". Speech Communication. Special issue: “Speech Annotation and Corpus Tools”. 33 (1–2): 79–96. doi:10.1016/S0167-6393(00)00070-4.
  5. Screen capture of LACITO's archive contents — 22 April 2002.
  6. “About us” section of the Pangloss Collection (retrieved 24 April 2021).
  7. Screen capture of LACITO's archive contents — 26 November 2009.
  8. Source: list of all Pangloss resources on the Cocoon homepage (retrieved 10 January 2022).
  9. Source: number of language entries in its list of corpora (retrieved 24 April 2021).
  10. Mwotlap corpus: 564 resources.
  11. Japhug corpus: 551 resources.
  12. Ersu corpus: 363 resources.
  13. Yongning Na corpus: 301 resources.
  14. Cèmuhî corpus: 230 resources.