Speech Recognition & Synthesis

Developer(s): Google
Initial release: November 13, 2013
Stable release: googletts.google-speech-apk_20240319.00_p1.620342359 (Android 8–14) / March 19, 2024 [1]
Operating system: Android
Type: Screen reader

Speech Recognition & Synthesis, formerly known as Speech Services, [2] is a screen reader application developed by Google for its Android operating system. It enables other applications to read the text on the screen aloud, with support for many languages. Its text-to-speech engine is used by apps such as Google Play Books to read books aloud, by Google Translate to read translations aloud so users can hear the pronunciation of words, by Google TalkBack and other spoken-feedback accessibility applications, and by third-party apps. Users must install voice data for each language.
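As an illustration, here is a minimal Kotlin sketch of how a third-party app can route text through the device's installed text-to-speech engine. The TextToSpeech class, its constants, and the speak() call are part of the standard Android SDK; the Speaker wrapper class itself is hypothetical.

```kotlin
import android.content.Context
import android.speech.tts.TextToSpeech
import java.util.Locale

// Minimal sketch of a third-party app speaking through the device's
// installed TTS engine (Speech Recognition & Synthesis on most Android
// devices). The Speaker wrapper is hypothetical; TextToSpeech is the
// standard Android SDK entry point.
class Speaker(context: Context) : TextToSpeech.OnInitListener {
    private val tts = TextToSpeech(context, this)

    override fun onInit(status: Int) {
        if (status == TextToSpeech.SUCCESS) {
            // Requires the user to have installed voice data for this language
            tts.setLanguage(Locale.US)
            tts.speak(
                "Hello from the platform text-to-speech engine.",
                TextToSpeech.QUEUE_FLUSH, null, "demo-utterance"
            )
        }
    }

    // Release the engine when the app is done with it
    fun shutdown() = tts.shutdown()
}
```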


History

Some app developers, such as Hyundai in 2015, have adapted their Android Auto apps to include Text-to-Speech. [3] Apps such as textPlus and WhatsApp use Text-to-Speech to read notifications aloud and provide voice-reply functionality.

Google Cloud Text-to-Speech is powered by WaveNet, [4] software created by DeepMind, the UK-based AI company that Google acquired in 2014. [5] Google positions the service against competing cloud speech offerings from Amazon and Microsoft. [6]
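For comparison with the on-device engine, here is a minimal sketch of calling the Cloud service from Kotlin via Google's Java client library. The voice name "en-US-Wavenet-D" is one of the published WaveNet voices, and authentication is assumed to be configured through Application Default Credentials.

```kotlin
import com.google.cloud.texttospeech.v1.AudioConfig
import com.google.cloud.texttospeech.v1.AudioEncoding
import com.google.cloud.texttospeech.v1.SynthesisInput
import com.google.cloud.texttospeech.v1.TextToSpeechClient
import com.google.cloud.texttospeech.v1.VoiceSelectionParams
import java.nio.file.Files
import java.nio.file.Paths

fun main() {
    // The client picks up Application Default Credentials from the environment
    TextToSpeechClient.create().use { client ->
        val input = SynthesisInput.newBuilder()
            .setText("Hello from a WaveNet voice.")
            .build()
        // Select one of the WaveNet voices by name
        val voice = VoiceSelectionParams.newBuilder()
            .setLanguageCode("en-US")
            .setName("en-US-Wavenet-D")
            .build()
        val audioConfig = AudioConfig.newBuilder()
            .setAudioEncoding(AudioEncoding.MP3)
            .build()
        val response = client.synthesizeSpeech(input, voice, audioConfig)
        // The synthesized audio arrives as bytes in the response
        Files.write(Paths.get("output.mp3"), response.audioContent.toByteArray())
    }
}
```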

Most voice synthesizers (including Apple's Siri) use concatenative synthesis, [4] in which a program stores individual phonemes and then pieces them together to form words and sentences. WaveNet synthesizes speech with human-like emphasis and inflection on syllables, phonemes, and words. Unlike most other text-to-speech systems, a WaveNet model creates raw audio waveforms from scratch. The model uses a neural network that has been trained using a large volume of speech samples. During training, the network extracts the underlying structure of the speech, such as which tones follow each other and what a realistic speech waveform looks like. When given a text input, the trained WaveNet model can generate the corresponding speech waveforms from scratch, one sample at a time, with up to 24,000 samples per second and smooth transitions between the individual sounds. [4]
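To make the sample-by-sample idea concrete, here is a deliberately simplified Kotlin sketch of an autoregressive generation loop. Everything in it is a toy: predictNextSample is a hypothetical stand-in for the trained network, which in a real WaveNet is a stack of dilated causal convolutions predicting a probability distribution over the next audio sample. Only the loop structure and the 24,000-samples-per-second rate reflect the description above.

```kotlin
import kotlin.random.Random

const val SAMPLE_RATE = 24_000      // samples per second, matching WaveNet's cited rate
const val RECEPTIVE_FIELD = 256     // how many past samples the predictor may look at

// Hypothetical stand-in for the trained network. A real WaveNet would run a
// neural network here and sample from its predicted distribution.
fun predictNextSample(history: DoubleArray, rng: Random): Double {
    val lean = history.average() * 0.9
    return (lean + rng.nextDouble(-0.05, 0.05)).coerceIn(-1.0, 1.0)
}

// Generate audio one sample at a time, feeding each new sample back into the
// context window for the next prediction (the autoregressive step).
fun generate(seconds: Double, rng: Random = Random(42)): DoubleArray {
    val total = (seconds * SAMPLE_RATE).toInt()
    val audio = DoubleArray(total)
    val history = DoubleArray(RECEPTIVE_FIELD)   // zero-initialized context
    for (i in 0 until total) {
        val sample = predictNextSample(history, rng)
        audio[i] = sample
        System.arraycopy(history, 1, history, 0, RECEPTIVE_FIELD - 1)
        history[RECEPTIVE_FIELD - 1] = sample    // slide the window forward
    }
    return audio
}

fun main() {
    val clip = generate(0.01)   // 10 ms of audio = 240 samples at 24 kHz
    println("Generated ${clip.size} samples")
}
```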

The service was renamed Speech Recognition & Synthesis in 2023. [citation needed]

Related Research Articles

Speech coding is an application of data compression to digital audio signals containing speech. Speech coding uses speech-specific parameter estimation using audio signal processing techniques to model the speech signal, combined with generic data compression algorithms to represent the resulting modeled parameters in a compact bitstream.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Speech synthesis is the artificial production of human speech. A computer system used for this purpose is called a speech synthesizer, and can be implemented in software or hardware products. A text-to-speech (TTS) system converts normal language text into speech; other systems render symbolic linguistic representations like phonetic transcriptions into speech. The reverse process is speech recognition.

<span class="mw-page-title-main">Screen reader</span> Assistive technology that converts text or images to speech or Braille

A screen reader is a form of assistive technology (AT) that renders text and image content as speech or braille output. Screen readers are essential to people who are blind, and are useful to people who are visually impaired, illiterate, or have a learning disability. Screen readers are software applications that attempt to convey what people with normal eyesight see on a display to their users via non-visual means, like text-to-speech, sound icons, or a braille device. They do this by applying a wide variety of techniques that include, for example, interacting with dedicated accessibility APIs, using various operating system features, and employing hooking techniques.
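As a concrete example of interacting with a dedicated accessibility API: on Android, a screen reader such as TalkBack is built on the AccessibilityService API. The hypothetical MiniReaderService below only logs the text it would hand to a TTS engine; a real service would also have to be declared in the app manifest and enabled by the user.

```kotlin
import android.accessibilityservice.AccessibilityService
import android.util.Log
import android.view.accessibility.AccessibilityEvent

// Hypothetical, minimal screen-reader skeleton. A real screen reader would
// route the recovered text to a TTS engine instead of logging it, and must
// be declared in AndroidManifest.xml and enabled in the system settings.
class MiniReaderService : AccessibilityService() {

    override fun onAccessibilityEvent(event: AccessibilityEvent) {
        // Text the system associates with the event, e.g. a newly focused label
        val spoken = event.text.joinToString(" ")
        if (spoken.isNotBlank()) {
            Log.d("MiniReader", "Would speak: $spoken")
        }
    }

    override fun onInterrupt() {
        // A real screen reader would stop any in-progress speech here
    }
}
```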

A voice-user interface (VUI) enables spoken human interaction with computers, using speech recognition to understand spoken commands and answer questions, and typically text-to-speech to play a reply. A voice command device is a device controlled with a voice user interface.

<span class="mw-page-title-main">Google Translate</span> Multilingual neural machine translation service

Google Translate is a multilingual neural machine translation service developed by Google to translate text, documents and websites from one language into another. It offers a website interface, a mobile app for Android and iOS, and an API that helps developers build browser extensions and software applications. As of 2022, Google Translate supports 133 languages at various levels. As of April 2016, it claimed over 500 million total users, with more than 100 billion words translated daily; in May 2013 the company had stated that it served over 200 million people daily.

The Speech Application Programming Interface or SAPI is an API developed by Microsoft to allow the use of speech recognition and speech synthesis within Windows applications. To date, a number of versions of the API have been released, which have shipped either as part of a Speech SDK or as part of the Windows OS itself. Applications that use SAPI include Microsoft Office, Microsoft Agent and Microsoft Speech Server.

Google Voice Search or Search by Voice is a Google product that allows users to perform a Google Search by speaking a query into a mobile phone or computer instead of typing it.

Chinese speech synthesis is the application of speech synthesis to the Chinese language. It poses additional difficulties: Chinese characters frequently have different pronunciations in different contexts, the prosody that is essential to conveying the meaning of words is complex, and native speakers sometimes disagree about the correct pronunciation of certain phonemes.

The Java Speech API (JSAPI) is an application programming interface for cross-platform support of command and control recognizers, dictation systems, and speech synthesizers. Although JSAPI defines an interface only, there are several implementations created by third parties, for example FreeTTS.
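For instance, the FreeTTS engine mentioned above can be driven from Kotlin. The sketch below uses FreeTTS's own VoiceManager API rather than the JSAPI interfaces themselves, and assumes the FreeTTS jar with its bundled "kevin16" voice is on the classpath.

```kotlin
import com.sun.speech.freetts.VoiceManager

fun main() {
    // Look up one of the voices bundled with FreeTTS
    val voice = VoiceManager.getInstance().getVoice("kevin16")
    voice.allocate()      // load the voice's data into memory
    voice.speak("Hello from a Java-based speech synthesizer.")
    voice.deallocate()    // release the voice's resources
}
```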

eSpeak: Compact, open-source software speech synthesizer

eSpeak is a free and open-source, cross-platform, compact software speech synthesizer. It uses a formant synthesis method, providing many languages in a relatively small file size. eSpeakNG is a continuation of the original developer's project with more feedback from native speakers.

Google Maps Navigation is a mobile application developed by Google for the Android and iOS operating systems that was later integrated into the Google Maps mobile app. It combines an Internet data connection with GPS satellite positioning to provide turn-by-turn voice-guided instructions on how to arrive at a given destination. A user can enter a destination into the application, which then plots a path to it, displays the user's progress along the route, and issues instructions for each turn.

<span class="mw-page-title-main">Word Lens</span> Augmented reality translation application

Word Lens was an augmented reality translation application from Quest Visual. Word Lens used the built-in cameras on smartphones and similar devices to quickly scan and identify foreign text, and then translated and displayed the words in another language on the device's display. The words were displayed in the original context on the original background, and the translation was performed in real-time without a connection to the internet. For example, using the viewfinder of a camera to show a shop sign on a smartphone's display would result in a real-time image of the shop sign being displayed, but the words shown on the sign would be the translated words instead of the original foreign words.

<span class="mw-page-title-main">CereProc</span> Speech synthesis company

CereProc is a speech synthesis company based in Edinburgh, Scotland, founded in 2005. The company specialises in creating natural and expressive-sounding text to speech voices, synthesis voices with regional accents, and in voice cloning.

<span class="mw-page-title-main">Android Auto</span> Mobile app providing a vehicle-optimized user interface

Android Auto is a mobile app developed by Google to mirror features of an Android device, such as a smartphone, on a car's dashboard information and entertainment head unit.

<span class="mw-page-title-main">Yandex Translate</span> Translation web service by Yandex

Yandex Translate is a web service provided by Yandex, intended for the translation of web pages into another language.

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis was still less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.

<span class="mw-page-title-main">15.ai</span> Real-time text-to-speech tool using artificial intelligence

15.ai is a non-commercial freeware artificial intelligence web application that generates natural emotive high-fidelity text-to-speech voices from an assortment of fictional characters from a variety of media sources. Developed by a pseudonymous MIT researcher under the name 15, the project uses a combination of audio synthesis algorithms, speech synthesis deep neural networks, and sentiment analysis models to generate and serve emotive character voices faster than real-time, particularly those with a very small amount of trainable data.

Read Along, formerly known as Bolo, is a language-learning app for children developed by Google for the Android operating system. The application was released on the Play Store on March 7, 2019. It features a character named Dia who helps children learn to read through illustrated stories. It supports learning English and major Indian languages, namely Hindi, Bengali, Tamil, Telugu, Marathi and Urdu, as well as Spanish and Portuguese. It primarily uses text-to-speech technology, through which Dia reads the story, and speech-to-text technology, which automatically matches the text against the user's reading. Stories from Chhota Bheem and Katha Kids were added in September 2019. In April 2020, a new version of the application was released. In September 2020, Arabic was added as a language option. A web version was launched in August 2022.

Deep learning speech synthesis refers to the application of deep learning models to generate natural-sounding human speech from written text (text-to-speech) or spectrum (vocoder). Deep neural networks (DNN) are trained using a large amount of recorded speech and, in the case of a text-to-speech system, the associated labels and/or input text.

References

  1. "Speech Services by Google APKs". APKMirror.
  2. Wang, Jules (November 8, 2021). "You'll never guess the latest Google app to cross 10 billion installs (seriously)". Android Police. Archived from the original on November 8, 2021. Retrieved November 18, 2021.
  3. "Google, Hyundai show off new third-party Android Auto apps". CNET. CBS Interactive. Retrieved 17 January 2015.
  4. 1 2 3 "WaveNet". www.deepmind.com. Retrieved 2023-06-22.
  5. Gibbs, Samuel (2014-01-27). "Google buys UK artificial intelligence startup Deepmind for £400m". The Guardian. ISSN   0261-3077 . Retrieved 2023-06-22.
  6. "Text-to-Speech AI: Lifelike Speech Synthesis". Google Cloud. Retrieved 2023-06-22.