Speech Recognition & Synthesis

Developer(s): Google
Initial release: 13 November 2013 (2013-11-13)
Stable release: 20241028.00/p1 (Build 694553964) / 12 November 2024 (2024-11-12) [1] [2]
Platform: Android
Type: Screen reader

Speech Recognition & Synthesis, formerly known as Speech Services, [3] is a screen reader application developed by Google for its Android operating system. It enables applications to read aloud the text on the screen, with support for many languages. Text-to-Speech may be used by apps such as Google Play Books for reading books aloud, by Google Translate for speaking translations so users can hear a word's pronunciation, by Google TalkBack and other spoken-feedback accessibility applications, and by third-party apps. Users must install voice data for each language.


History

Some app developers have started adapting and tweaking their Android Auto apps to include Text-to-Speech, such as Hyundai in 2015. [4] Apps such as textPlus and WhatsApp use Text-to-Speech to read notifications aloud and provide voice-reply functionality.

Google Cloud Text-to-Speech is powered by WaveNet, [5] software created by Google's UK-based AI subsidiary DeepMind, which Google acquired in 2014. [6] Google positions the service against competing cloud speech offerings from Amazon and Microsoft. [7]

Most voice synthesizers (including Apple's Siri) use concatenative synthesis, [5] in which a program stores individual phonemes and then pieces them together to form words and sentences. WaveNet synthesizes speech with human-like emphasis and inflection on syllables, phonemes, and words. Unlike most other text-to-speech systems, a WaveNet model creates raw audio waveforms from scratch. The model uses a neural network that has been trained using a large volume of speech samples. During training, the network extracts the underlying structure of the speech, such as which tones follow each other and what a realistic speech waveform looks like. When given a text input, the trained WaveNet model can generate the corresponding speech waveforms from scratch, one sample at a time, with up to 24,000 samples per second and smooth transitions between the individual sounds. [5]
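The autoregressive, sample-by-sample generation described above can be illustrated with a toy sketch. This is not DeepMind's actual WaveNet architecture; the `predict_next` function here is a hypothetical stand-in for the trained neural network, implemented as a simple two-tap recurrence that happens to extend a sine wave:

```python
import numpy as np

SAMPLE_RATE = 24_000  # WaveNet emits up to 24,000 samples per second
FREQ_HZ = 220.0       # pitch of the toy output tone

def predict_next(context: np.ndarray) -> float:
    """Hypothetical stand-in for the trained network. A real WaveNet
    predicts a distribution over quantized sample values conditioned on
    previous samples; this toy 'model' uses the linear recurrence
    x[t] = 2*cos(w)*x[t-1] - x[t-2], which continues a sine wave."""
    w = 2 * np.pi * FREQ_HZ / SAMPLE_RATE
    if len(context) < 2:                 # seed the recurrence
        return float(np.sin(w * len(context)))
    return 2 * np.cos(w) * context[-1] - context[-2]

def generate(n_samples: int) -> np.ndarray:
    """Autoregressive loop: every new sample is conditioned on the
    samples generated so far, exactly one sample at a time."""
    audio = np.zeros(n_samples, dtype=np.float64)
    for i in range(n_samples):
        audio[i] = predict_next(audio[:i])
    return audio

waveform = generate(SAMPLE_RATE // 100)  # 10 ms of audio = 240 samples
```

A real model replaces the recurrence with a deep convolutional network and samples each value from a learned probability distribution, which is what produces the natural-sounding transitions described above.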

The service was renamed Speech Recognition & Synthesis in 2023.[citation needed]

See also

Related Research Articles

Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

A voice-user interface (VUI) enables spoken human interaction with computers, using speech recognition to understand spoken commands and answer questions, and typically text to speech to play a reply. A voice command device is a device controlled with a voice user interface.

Google Translate: Multilingual neural machine translation service

Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another. It offers a website interface, mobile apps for Android and iOS, and an API that helps developers build browser extensions and software applications. As of November 2024, Google Translate supports 249 languages and language varieties at various levels. It served over 200 million people daily in May 2013 and over 500 million total users as of April 2016, with more than 100 billion words translated daily.

The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.

Google Voice Search or Search by Voice is a Google product that allows users to perform a Google Search by speaking a query on a mobile phone or computer, rather than typing it.

Chinese speech synthesis is the application of speech synthesis to the Chinese language. It poses additional difficulties because Chinese characters frequently have different pronunciations in different contexts, because the complex prosody is essential to conveying the meaning of words, and because native speakers sometimes disagree on the correct pronunciation of certain phonemes.

The Java Speech API (JSAPI) is an application programming interface for cross-platform support of command and control recognizers, dictation systems, and speech synthesizers. Although JSAPI defines an interface only, there are several implementations created by third parties, for example FreeTTS.

eSpeak: Compact, open-source software speech synthesizer

eSpeak is a free and open-source, cross-platform, compact, software speech synthesizer. It uses a formant synthesis method, providing many languages in a relatively small file size. eSpeakNG is a continuation of the original developer's project with more feedback from native speakers.

Microsoft Translator: Machine translation cloud service by Microsoft

Microsoft Translator or Bing Translator is a multilingual machine translation cloud service provided by Microsoft. Microsoft Translator is a part of Microsoft Cognitive Services and integrated across multiple consumer, developer, and enterprise products, including Bing, Microsoft Office, SharePoint, Microsoft Edge, Microsoft Lync, Yammer, Skype Translator, Visual Studio, and Microsoft Translator apps for Windows, Windows Phone, iPhone and Apple Watch, and Android phone and Android Wear.

Word Lens: Augmented reality translation application

Word Lens was an augmented reality translation application from Quest Visual. Word Lens used the built-in cameras on smartphones and similar devices to quickly scan and identify foreign text, and then translated and displayed the words in another language on the device's display. The words were displayed in the original context on the original background, and the translation was performed in real-time without a connection to the internet. For example, using the viewfinder of a camera to show a shop sign on a smartphone's display would result in a real-time image of the shop sign being displayed, but the words shown on the sign would be the translated words instead of the original foreign words.

CereProc: Speech synthesis company

CereProc is a speech synthesis company based in Edinburgh, Scotland, founded in 2005. The company specialises in creating natural and expressive-sounding text to speech voices, synthesis voices with regional accents, and in voice cloning.

Sensory, Inc.

Sensory, Inc. is an American company which develops software AI technologies for speech, sound and vision. It is based in Santa Clara, California.

Android Auto: Mobile app providing a vehicle-optimized user interface

Android Auto is a mobile app developed by Google to mirror features of a smartphone on a car's dashboard information and entertainment head unit.

Yandex Translate: Translation web service by Yandex

Yandex Translate is a web service provided by Yandex, intended for the translation of web pages into another language.

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.

Common Voice: Voice dataset by Mozilla

Common Voice is a crowdsourcing project started by Mozilla to create a free database for speech recognition software. The project is supported by volunteers who record sample sentences with a microphone and review recordings of other users. The transcribed sentences are collected in a voice database available under the public domain license CC0. This license ensures that developers can use the database for voice-to-text applications without restrictions or costs.

Read Along, formerly known as Bolo, is an Android language-learning app for children developed by Google for the Android operating system. The application was released on the Play Store on March 7, 2019. It features a character named Dia who helps children learn to read through illustrated stories. It supports English and major Indian languages (Hindi, Bengali, Tamil, Telugu, Marathi and Urdu), as well as Spanish, Portuguese and Arabic.

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks are trained using large amounts of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

References

  1. "Speech Recognition & Synthesis". Google Play. Retrieved 2024-11-15.
  2. "Speech Recognition & Synthesis googletts.google-speech-apk_20241028.00_p1.694553964". APKMirror. 2024-11-12. Retrieved 2024-11-15.
  3. Wang, Jules (November 8, 2021). "You'll never guess the latest Google app to cross 10 billion installs (seriously)". Android Police. Archived from the original on November 8, 2021. Retrieved November 18, 2021.
  4. "Google, Hyundai show off new third-party Android Auto apps". CNET. CBS Interactive. Retrieved 17 January 2015.
  5. "WaveNet". www.deepmind.com. Retrieved 2023-06-22.
  6. Gibbs, Samuel (2014-01-27). "Google buys UK artificial intelligence startup Deepmind for £400m". The Guardian. ISSN 0261-3077. Retrieved 2023-06-22.
  7. "Text-to-Speech AI: Lifelike Speech Synthesis". Google Cloud. Retrieved 2023-06-22.