Developer(s) | Mozilla Foundation |
---|---|
Initial release | June 19, 2017 |
Repository | GitHub |
Available in | Multilingual (List of languages) |
License | Creative Commons CC0 |
Website | commonvoice.mozilla.org |
Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences are collected in a voice database available under the public domain license CC0. [1] This license ensures that developers can use the database for voice-to-text applications without restrictions or costs.
Common Voice aims to provide diverse voice samples. According to Mozilla's Katharina Borchert, many existing projects took datasets from public radio or otherwise had datasets that underrepresented both women and people with pronounced accents. [2]
At the beginning of 2022, Bengali.AI partnered with Common Voice to launch the "Bangla Speech Recognition" project, which aims to make machines understand the Bangla language. 2,000 hours of voice recordings were collected, with the aim of exceeding 10,000 hours. [3]
The first dataset was released in November 2017. By then, more than 20,000 users worldwide had recorded 500 hours of English sentences. [4]
In February 2019, the first batch of languages was released for use. The release covered 18 languages, including English, French, German and Mandarin Chinese, as well as less prevalent languages such as Welsh and Kabyle. In total, it contained almost 1,400 hours of recorded voice data from more than 42,000 contributors. [5]
As of July 2020, the database had amassed 7,226 hours of voice recordings in 54 languages, of which 5,591 hours had been verified by volunteers. [6]
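The July 2020 figures imply that roughly three quarters of the recorded audio had passed volunteer review. The short sketch below only reworks the two numbers quoted above; the function name `verified_share` is a hypothetical helper introduced for illustration and is not part of any Common Voice tooling.

```python
# Figures quoted in the text for July 2020.
TOTAL_HOURS = 7226      # all recorded hours
VERIFIED_HOURS = 5591   # hours verified by volunteers


def verified_share(verified: float, total: float) -> float:
    """Return the verified fraction of recorded hours (hypothetical helper)."""
    if total <= 0:
        raise ValueError("total must be positive")
    return verified / total


share = verified_share(VERIFIED_HOURS, TOTAL_HOURS)
unverified_hours = TOTAL_HOURS - VERIFIED_HOURS

print(f"verified: {share:.1%}, still unverified: {unverified_hours} hours")
```

Under these figures, about 77% of the recorded hours were verified, leaving 1,635 hours awaiting review.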
In May 2021, following the work to add Kinyarwanda, the project received a grant to add Kiswahili. [7]
In September 2022, it was announced that the Twi language of Ghana was the 100th language to be added to the Mozilla Common Voice database. [8]
As of October 2022, Mozilla Common Voice officially collects voice data for the following languages: [9]
Standard Chinese is a modern standard form of Mandarin Chinese that was first codified during the republican era (1912–1949). It is designated as the official language of mainland China and a major language in the United Nations, Singapore, and Taiwan. It is largely based on the Beijing dialect. Standard Chinese is a pluricentric language with local standards in mainland China, Taiwan and Singapore that mainly differ in their lexicon. Hong Kong written Chinese, used for formal written communication in Hong Kong and Macau, is a form of Standard Chinese that is read aloud with the Cantonese reading of characters.
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.
There are hundreds of local Chinese language varieties forming a branch of the Sino-Tibetan language family, many of which are not mutually intelligible. Variation is particularly strong in the more mountainous southeast part of mainland China. The varieties are typically classified into several groups: Mandarin, Wu, Min, Xiang, Gan, Jin, Hakka and Yue, though some varieties remain unclassified. These groups are neither clades nor individual languages defined by mutual intelligibility, but reflect common phonological developments from Middle Chinese.
Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another. It offers a website interface, a mobile app for Android and iOS, as well as an API that helps developers build browser extensions and software applications. As of November 2024, Google Translate supports 249 languages and language varieties at various levels. It served over 200 million people daily in May 2013, and over 500 million total users as of April 2016, with more than 100 billion words translated daily.
The phonology of Bengali, like that of its neighbouring Eastern Indo-Aryan languages, is characterised by a wide variety of diphthongs and inherent back vowels.
Natural-language user interface is a type of computer human interface where linguistic phenomena such as verbs, phrases and clauses act as UI controls for creating, selecting and modifying data in software applications.
Bengali, also known by its endonym Bangla, is a classical Indo-Aryan language from the Indo-European language family native to the Bengal region of South Asia. With over 237 million native speakers and another 41 million as second language speakers as of 2024, Bengali is the fifth most spoken native language and the seventh most spoken language by the total number of speakers in the world. It is the fifth most spoken Indo-European language.
Bengali input methods refer to different systems developed to type the characters of the Bengali script for Bengali language and others, using a typewriter or a computer keyboard.
Tatoeba is a free collection of example sentences with translations geared towards foreign language learners. It is available in more than 400 languages. Its name comes from the Japanese phrase tatoeba (例えば), meaning 'for example'. It is written and maintained by a community of volunteers through a model of open collaboration. Individual contributors are known as "Tatoebans". It is run by Association Tatoeba, a French non-profit organization funded through donations.
Mozilla is a free software community founded in 1998 by members of Netscape. The Mozilla community uses, develops, publishes and supports Mozilla products, thereby promoting exclusively free software and open standards, with only minor exceptions. The community is supported institutionally by the non-profit Mozilla Foundation and its tax-paying subsidiary, the Mozilla Corporation.
Speech Recognition & Synthesis, formerly known as Speech Services, is a screen reader application developed by Google for its Android operating system. It powers applications to read aloud (speak) the text on the screen, with support for many languages. Text-to-Speech may be used by apps such as Google Play Books for reading books aloud, Google Translate for reading aloud translations for the pronunciation of words, Google TalkBack, and other spoken feedback accessibility-based applications, as well as by third-party apps. Users must install voice data for each language.
WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.
Voice computing is the discipline that develops hardware or software to process voice inputs.
Patricia Scanlon is an Irish technologist and businesswoman. She founded SoapBox Labs in 2013, a company that applies artificial intelligence to develop speech recognition applications that are specifically tuned to children's voices. Scanlon was CEO of SoapBox Labs from its founding until May 2021, when she became executive chair. In 2022, Scanlon was appointed by the Irish Government as Ireland’s first Artificial Intelligence Ambassador. In this role, she will "lead a national conversation" about the role of AI in people's lives, including its benefits and risks.
15.ai was a freeware artificial intelligence web application that generated text-to-speech voices from fictional characters from various media sources. Created by a pseudonymous developer under the alias 15, the project used a combination of audio synthesis algorithms, speech synthesis deep neural networks, and sentiment analysis models to generate emotive character voices faster than real-time.
Lingua Libre is an online collaborative project and tool by the Wikimédia France association, which aims to build a collaborative, multilingual, audiovisual speech corpus under a free license. It mostly consists of a rapid recording online service which allows the user to chain hundreds of recordings. Contributors have produced content in 250+ languages.
Whisper is a machine learning model for speech recognition and transcription, created by OpenAI and first released as open-source software in September 2022.