Common Voice

Developer(s) Mozilla Foundation
Initial release June 19, 2017
Repository github.com/common-voice/common-voice
Available in Multilingual
License Creative Commons CC0
Website commonvoice.mozilla.org

Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review other users' recordings. The transcribed recordings are collected in a voice database released under CC0, a public domain dedication. This allows developers to use the database for voice-to-text applications without restrictions or costs.

Aims

Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects relied on datasets taken from public radio, or on other datasets that underrepresented both women and people with pronounced accents. [1]

History

In early 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project, which aims to enable machines to understand the Bangla language. 2,000 hours of voice recordings were collected, with a goal of more than 10,000 hours. [2]

Voice database

The first dataset was released in November 2017. More than 20,000 users worldwide had recorded 500 hours of English sentences. [3]

In February 2019, the first batch of languages was released for use. It covered 18 languages, including English, French, German and Mandarin Chinese, as well as less widely spoken languages such as Welsh and Kabyle. In total, the release contained almost 1,400 hours of recorded voice data from more than 42,000 contributors. [4]

As of July 2020, the database had amassed 7,226 hours of voice recordings in 54 languages, of which 5,591 hours had been verified by volunteers. [5]

In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili. [6]

In September 2022, it was announced that the Twi language of Ghana was the 100th language to be added to the Mozilla Common Voice database. [7]

As of October 2022, the languages for which Mozilla Common Voice officially collects voice data are listed on the project's website. [8]

References

  1. "Why do we gender AI? Voice tech firms move to be more inclusive". The Guardian. 11 January 2020. Archived from the original on 19 December 2022. Retrieved 19 April 2020.
  2. "Bengali.AI: Democratising AI research in Bangla". The Business Standard. 23 December 2022. Archived from the original on 24 December 2022. Retrieved 25 December 2022.
  3. "Announcing the Initial Release of Mozilla's Open Source Speech Recognition Model and Voice Dataset". blog.mozilla.org. 29 November 2017. Archived from the original on 29 November 2017. Retrieved 19 November 2019.
  4. "Mozilla updates Common Voice dataset with 1,400 hours of speech across 18 languages". VentureBeat. 28 February 2019. Archived from the original on 4 March 2019. Retrieved 19 November 2019.
  5. "Mozilla Common Voice updates will help train the 'Hey Firefox' wakeword for voice-based web browsing". VentureBeat. 1 July 2020. Archived from the original on 10 March 2021. Retrieved 1 April 2021.
  6. "Mozilla Common Voice Receives $3.4 Million Investment to Democratize and Diversify Voice Tech in East Africa". Mozilla Foundation. 25 May 2021. Archived from the original on 19 December 2022. Retrieved 3 June 2021.
  7. Onukwue, Alexander (23 September 2022). "Ghana's most popular language is now on Mozilla Common Voice". Quartz. Archived from the original on 2 December 2022. Retrieved 3 October 2022.
  8. "Languages". commonvoice.mozilla.org. Archived from the original on 24 December 2022. Retrieved 4 October 2022.