Voice computing

The Amazon Echo, an example of a voice computer

Voice computing is the discipline that develops hardware or software to process voice inputs. [1]

It spans many other fields including human-computer interaction, conversational computing, linguistics, natural language processing, automatic speech recognition, speech synthesis, audio engineering, digital signal processing, cloud computing, data science, ethics, law, and information security.

Voice computing has become increasingly significant, especially with the advent of smart speakers such as the Amazon Echo and Google Home, the shift toward serverless computing, and the improved accuracy of speech recognition and text-to-speech models.

History

Voice computing has a rich history. [2] Scientists such as Wolfgang von Kempelen built early speech machines to produce the first synthetic speech sounds. Thomas Edison later made it possible to record audio with dictation machines and play it back in corporate settings. In the 1950s and 1960s, Bell Labs, IBM, and others made primitive attempts at automated speech recognition systems. It was not until the 1980s, however, when hidden Markov models were used to recognize up to 1,000 words, that speech recognition systems became broadly relevant.

Date  Event
1784  Wolfgang von Kempelen creates the acoustic-mechanical speech machine.
1879  Thomas Edison invents the first dictation machine.
1952  Bell Labs releases Audrey, capable of recognizing spoken digits with 90% accuracy.
1962  IBM Shoebox can recognize up to 16 words.
1971  Harpy is created, which can understand over 1,000 words.
1986  IBM Tangora uses hidden Markov models to predict phonemes in speech.
2006  The National Security Agency begins research on hotword detection during normal conversations.
2008  Google launches a voice search application, bringing speech recognition to mobile devices.
2011  Apple releases Siri on the iPhone.
2014  Amazon releases the Amazon Echo, making voice computing relevant to the public at large.

Around 2011, Siri emerged on Apple iPhones as the first voice assistant widely accessible to consumers. This innovation led to a dramatic shift toward building voice-first computing architectures. Sony released the PlayStation 4 in North America in 2013 (70+ million devices sold), Amazon released the Amazon Echo in 2014 (30+ million devices sold), Microsoft released Cortana in 2015 (400 million Windows 10 users), Google released Google Assistant in 2016 (2 billion monthly active users on Android phones), and Apple released the HomePod in 2018 (500,000 devices sold, with 1 billion devices active on iOS/Siri). These shifts, along with advances in cloud infrastructure (e.g., Amazon Web Services) and audio codecs, have solidified the voice computing field and made it widely relevant to the public at large.

Hardware

A voice computer is an assembly of hardware and software that processes voice inputs.

Note that voice computers do not necessarily need a screen, as in the original Amazon Echo. In other embodiments, laptop computers or mobile phones serve as voice computers. Moreover, interfaces for voice computers have multiplied with the advent of IoT-enabled devices, such as those in cars or televisions.

As of September 2018, more than 20,000 types of devices were compatible with Amazon Alexa. [3]

Software

Voice computing software can read/write, record, clean, encrypt/decrypt, playback, transcode, transcribe, compress, publish, featurize, model, and visualize voice files.
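
For example, transcoding is commonly delegated to the FFmpeg command-line tool. The following minimal Python sketch assumes FFmpeg is installed and on the PATH; the file names voice.wav and voice.mp3 are hypothetical placeholders.

    import subprocess

    def transcode(src: str, dst: str) -> None:
        """Transcode an audio file from one format to another with FFmpeg."""
        # -y overwrites the output if it already exists; FFmpeg infers the
        # target format and codec from the output file's extension.
        subprocess.run(["ffmpeg", "-y", "-i", src, dst], check=True)

    transcode("voice.wav", "voice.mp3")  # e.g. .WAV --> .MP3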

Here are some popular software packages related to voice computing:

Package name  Description
FFmpeg  for transcoding audio files from one format to another (e.g. .WAV --> .MP3). [4]
Audacity  for recording and filtering audio. [5]
SoX  for manipulating audio files and removing environmental noise. [6]
Natural Language Toolkit  for featurizing transcripts with things like parts of speech. [7]
LibROSA  for visualizing audio file spectrograms and featurizing audio files. [8]
OpenSMILE  for featurizing audio files with things like mel-frequency cepstrum coefficients. [9]
CMU Sphinx  for transcribing speech files into text. [10]
Pyttsx3  for offline text-to-speech synthesis and playback. [11]
Pycryptodome  for encrypting and decrypting audio files. [12]
AudioFlux  for audio and music analysis and feature extraction. [13]
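
As an illustration of featurization with the packages listed above, the sketch below uses LibROSA to load a recording and compute mel-frequency cepstrum coefficients (MFCCs); the file name sample.wav is a hypothetical placeholder.

    import librosa

    # Load the recording; sr=None preserves the file's native sampling rate
    # instead of resampling to librosa's 22,050 Hz default.
    y, sr = librosa.load("sample.wav", sr=None)

    # Compute 13 mel-frequency cepstrum coefficients per analysis frame.
    mfccs = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)

    # mfccs has shape (13, number_of_frames); a per-coefficient mean is a
    # common fixed-length summary for downstream modeling.
    print(mfccs.shape)
    print(mfccs.mean(axis=1))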

Applications

Voice computing applications span many industries, including voice assistants, healthcare, e-commerce, finance, supply chain, agriculture, text-to-speech, security, marketing, customer support, recruiting, cloud computing, microphones, speakers, and podcasting. Voice technology is projected to grow at a compound annual growth rate (CAGR) of 19-25% through 2025, making it an attractive industry for startups and investors alike. [14]

Voice computing also raises legal and privacy questions. In the United States, telephone call recording laws vary by state: in some states it is legal to record a conversation with the consent of only one party, while in others the consent of all parties is required.

Moreover, COPPA is a significant law protecting minors who use the Internet. With an increasing number of minors interacting with voice computing devices (e.g., Amazon Alexa), on October 23, 2017 the Federal Trade Commission relaxed the COPPA rule so that children can issue voice searches and commands. [15] [16]

Lastly, the GDPR is a European Union regulation that governs, among many other things, EU citizens' right to be forgotten. The GDPR also makes clear that companies must obtain consent if audio recordings are made and must define the purpose and scope of how these recordings will be used, e.g., for training purposes. The bar for valid consent has been raised under the GDPR: consent must be freely given, specific, informed, and unambiguous; tacit consent is no longer sufficient. [17]

Research conferences

There are many research conferences related to voice computing. Some of these include:

Interspeech [18]
Audio/Visual Emotion Challenge (AVEC) [19]
IEEE International Conference on Automatic Face and Gesture Recognition (FG) [20]
International Conference on Affective Computing and Intelligent Interaction (ACII) [21]

Developer community

As of January 2018, Google Assistant had roughly 2,000 actions. [22]

As of September 2018, there were over 50,000 Alexa skills worldwide. [23]

In June 2017, Google released AudioSet, [24] a large-scale collection of human-labeled 10-second sound clips drawn from YouTube videos. It contains 1,010,480 clips of human speech, or 2,793.5 hours in total. [25] It was released in conjunction with the IEEE ICASSP 2017 conference. [26]

In November 2017, the Mozilla Foundation released the Common Voice Project, a collection of speech files that contributes to the larger open-source machine learning community. [27] [28] The voicebank is currently 12 GB in size, with more than 500 hours of English-language voice data collected from 112 countries since the project's inception in June 2017. [29] This dataset has already enabled projects like DeepSpeech, an open-source transcription model. [30]
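
As a rough sketch of how such a model can be used, the following Python snippet transcribes a recording with the deepspeech package; the model file name is a hypothetical placeholder, and DeepSpeech expects 16 kHz, 16-bit mono PCM audio.

    import wave

    import deepspeech
    import numpy as np

    # Load a pretrained acoustic model (file name is a placeholder; trained
    # models are published on the DeepSpeech releases page).
    model = deepspeech.Model("deepspeech-0.9.3-models.pbmm")

    # Read raw 16-bit samples; the recording must be 16 kHz mono PCM.
    with wave.open("sample.wav", "rb") as w:
        frames = w.readframes(w.getnframes())
    audio = np.frombuffer(frames, dtype=np.int16)

    # Run speech-to-text over the whole buffer and print the transcript.
    print(model.stt(audio))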

Related Research Articles

Speech processing is the study of speech signals and the processing methods of signals. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Different speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, speaker recognition, etc.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Conexant Systems, Inc. was an American-based software developer and fabless semiconductor company that developed technology for voice and audio processing, imaging and modems. The company began as a division of Rockwell International, before being spun off as a public company. Conexant itself then spun off several business units, creating independent public companies which included Skyworks Solutions and Mindspeed Technologies.

A voice-user interface (VUI) enables spoken human interaction with computers, using speech recognition to understand spoken commands and answer questions, and typically text to speech to play a reply. A voice command device is a device controlled with a voice user interface.

A virtual assistant (VA) is a software agent that can perform a range of tasks or services for a user based on user input such as commands or questions, including verbal ones. Such technologies often incorporate chatbot capabilities to simulate human conversation, such as via online chat, to facilitate interaction with their users. The interaction may be via text, graphical interface, or voice, as some virtual assistants are able to interpret human speech and respond via synthesized voices.

HTML audio is a subject of the HTML specification, incorporating audio input, playback, and synthesis, as well as speech to text, all in the browser.

Amazon Fire TV is a line of digital media players and microconsoles developed by Amazon since 2014. The devices are small network appliances that deliver digital audio and video content streamed via the Internet to a connected high-definition television. They also allow users to access local content and to play video games with the included remote control or another game controller, or by using a mobile app remote control on another device.

Julia Hirschberg is an American computer scientist noted for her research on computational linguistics and natural language processing.

Amazon Echo, often shortened to Echo, is a brand of smart speakers developed by Amazon. Echo devices connect to the voice-controlled intelligent personal assistant service Alexa, which will respond when a user says "Alexa". Users may change this wake word to "Amazon", "Echo", "Computer", and other options. The features of the device include voice interaction, music playback, making to-do lists, setting alarms, streaming podcasts, and playing audiobooks, in addition to providing weather, traffic and other real-time information. It can also control several smart devices, acting as a home automation hub.

WebXR Device API is a Web application programming interface (API) that describes support for accessing augmented reality and virtual reality devices, such as the HTC Vive, Oculus Rift, Meta Quest, Google Cardboard, HoloLens, Apple Vision Pro, Android XR-based devices, Magic Leap or Open Source Virtual Reality (OSVR), in a web browser. The WebXR Device API and related APIs are standards defined by W3C groups, the Immersive Web Community Group and Immersive Web Working Group. While the Community Group works on the proposals in the incubation period, the Working Group defines the final web specifications to be implemented by the browsers.

Mycroft was a free and open-source software virtual assistant that uses a natural language user interface. Its code was formerly copyleft, but is now under a permissive license. It was named after a fictional computer from the 1966 science fiction novel The Moon Is a Harsh Mistress.

Amazon Alexa, or Alexa, is a virtual assistant technology largely based on a Polish speech synthesizer named Ivona, bought by Amazon in 2013. It was first used in the Amazon Echo smart speaker and the Amazon Echo Dot, Echo Studio and Amazon Tap speakers developed by Amazon Lab126. It is capable of natural language processing for tasks such as voice interaction, music playback, creating to-do lists, setting alarms, streaming podcasts, playing audiobooks, and providing weather, traffic, sports, and other real-time information and news. Alexa can also control several smart devices as a home automation system. Alexa capabilities may be extended by installing "skills" such as weather programs and audio features. It performs these tasks using automatic speech recognition, natural language processing, and other forms of weak AI.

A smart speaker is a type of loudspeaker and voice command device with an integrated virtual assistant that offers interactive actions and hands-free activation with the help of one "hot word". Some smart speakers can also act as a smart device that utilizes Wi-Fi and other protocol standards to extend usage beyond audio playback, such as to control home automation devices. This can include, but is not limited to, features such as compatibility across a number of services and platforms, peer-to-peer connection through mesh networking, virtual assistants, and others. Each can have its own designated interface and features in-house, usually launched or controlled via application or home automation software. Some smart speakers also include a screen to show the user a visual response.

Audio injection is the exploitation of digital assistants such as Amazon Echo, Google Home or Apple Siri by unwanted instructions from a third party. These services lack authentication when reacting to user commands, making it possible for attackers to issue activation words and commands and trigger the execution of desired actions. Injection results include fraud, burglary, data espionage and takeover of connected systems.

WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.

Witlingo is a B2B Software as a Service (SaaS) company that enables businesses and organizations to engage with members of their communities using the latest innovations in human language technology and conversational AI, such as speech recognition, natural language processing, IVR, virtual assistant apps on smartphone platforms (iOS and Android), chatbots, and digital audio.

Wraith: The Oblivion – The Orpheus Device is an audio-based adventure video game developed by Earplay and published by Paradox Interactive on October 29, 2020 for Android, iOS, and smart speakers, and is played using the virtual assistants Amazon Alexa or Google Assistant, or the Earplay mobile app. It is based on White Wolf Publishing's tabletop role-playing games Wraith: The Oblivion (1994) and Orpheus (2003), and is part of the larger World of Darkness series.

Yang Liu is a Chinese and American computer scientist specializing in speech processing and natural language processing, and a senior principal scientist for Amazon.

Dilek Z. Hakkani-Tür is a Turkish-American computer scientist focusing on speech processing, speech recognition, and dialogue systems. She is a professor of computer science at the University of Illinois Urbana-Champaign.

References

  1. Schwoebel, J. (2018). An Introduction to Voice Computing in Python. Boston; Seattle, Atlanta: NeuroLex Laboratories. https://neurolex.ai/voicebook
  2. Boyd, Clark (2019-08-30). "Speech Recognition Technology: The Past, Present, and Future". The Startup. Retrieved 2025-01-10.
  3. Kinsella, Bret (2018-09-02). "Amazon Alexa Now Has 50,000 Skills Worldwide, works with 20,000 Devices, Used by 3,500 Brands". Voicebot.ai. Retrieved 2025-01-10.
  4. FFmpeg. https://www.ffmpeg.org/
  5. Audacity. https://www.audacityteam.org/
  6. SoX. http://sox.sourceforge.net/
  7. NLTK. https://www.nltk.org/
  8. LibROSA. https://librosa.github.io/librosa/
  9. OpenSMILE. https://www.audeering.com/technology/opensmile/
  10. "PocketSphinx is a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop: Cmusphinx/Pocketsphinx". GitHub . 29 March 2020.
  11. Pyttsx3. https://github.com/nateshmbhat/pyttsx3
  12. Pycryptodome. https://pycryptodome.readthedocs.io/en/latest/
  13. AudioFlux. https://github.com/libAudioFlux/audioFlux/
  14. "Global Speech and Voice Recognition Market 2018 Forecast to 2025 - CAGR Expected to Grow at 25.7% - ResearchAndMarkets.com". Archived from the original on 2024-01-19. Retrieved 2025-01-10.
  15. Coldewey, Devin (2017-10-24). "FTC relaxes COPPA rule so kids can issue voice searches and commands". TechCrunch. Retrieved 2025-01-10.
  16. "Federal Register :: Request Access". 8 December 2017.
  17. IAPP. https://iapp.org/news/a/how-do-the-rules-on-audio-recording-change-under-the-gdpr/
  18. Interspeech 2018. http://interspeech2018.org/
  19. "14th International Symposium on Advanced Vehicle Control - Speakers, Sessions, Agenda". www.eventyco.com. Retrieved 2025-01-10.
  20. 2018 FG. https://fg2018.cse.sc.edu/
  21. ASCII 2019. http://acii-conf.org/2019/
  22. Mutchler, Ava (2018-01-24). "Google Assistant App Total Reaches Nearly 2400. But That's Not the Real Number. It's really 1719". Voicebot.ai. Retrieved 2025-01-10.
  23. Kinsella, Bret (2018-09-02). "Amazon Alexa Now Has 50,000 Skills Worldwide, works with 20,000 Devices, Used by 3,500 Brands". Voicebot.ai. Retrieved 2025-01-10.
  24. Google AudioSet. https://research.google.com/audioset/
  25. "AudioSet". research.google.com. Retrieved 2025-01-10.
  26. Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., & Ritter, M. (2017, March). Audio Set: An ontology and human-labeled dataset for audio events. In Acoustics, Speech and Signal Processing (ICASSP), 2017 IEEE International Conference on (pp. 776-780). IEEE.
  27. Common Voice Project. https://voice.mozilla.org/
  28. "Announcing the Initial Release of Mozilla's Open Source Speech Recognition Model and Voice Dataset | The Mozilla Blog". blog.mozilla.org. Retrieved 2025-01-10.
  29. Mozilla's large repository of voice data will shape the future of machine learning. https://opensource.com/article/18/4/common-voice
  30. DeepSpeech. https://github.com/mozilla/DeepSpeech