This is a timeline of speech and voice recognition, a technology which enables the recognition and translation of spoken language into text.
|Time period||Key developments|
|1877–1971||Speech recognition is at an early stage of development. Specialized devices can recognize few words and accuracy is not very high.|
|1971–1987||Speech recognition rapidly improves, although the technology is still not commercially available.|
|1987–2014||Speech recognition continues to improve, becomes widely available commercially, and can be found in many products.|
|Year||Month and date (if applicable)||Event type||Details|
|1877||Invention||Thomas Edison's phonograph becomes the first device to record and reproduce sound. The recording medium is fragile, however, and prone to damage.|
|1879||Invention||Thomas Edison invents the first dictation machine, a slightly improved version of his phonograph.|
|1936||Invention||A team of engineers at Bell Labs, led by Homer Dudley, begins work on the Voder, the first electronic speech synthesizer.|
|1939||March 21||Invention||Dudley is granted a patent for the Voder, US patent 2151091 A.|
|1939||Demonstration||The Voder is demonstrated at the 1939 Golden Gate International Exposition in San Francisco. An operator uses a keyboard and foot pedals to make the machine emit speech.|
|1939–1940||Demonstration||The Voder is demonstrated at the 1939-1940 World's Fair in New York City.|
|1952||Invention||A team at Bell Labs designs the Audrey, a machine capable of understanding spoken digits.|
|1962||Demonstration||IBM demonstrates the Shoebox, a machine that can understand up to 16 spoken words in English, at the 1962 Seattle World's Fair.|
|1971||Invention||IBM invents the Automatic Call Identification system, enabling engineers to talk to and receive spoken answers from a device.|
|1971–1976||Program||DARPA funds five years of speech recognition research with the goal of producing a machine capable of understanding a minimum of 1,000 words. The program leads to the creation of the Harpy by Carnegie Mellon, a machine capable of understanding 1,011 words.|
|Early 1980s||Technique||The hidden Markov model begins to be used in speech recognition systems, allowing machines to recognize speech more accurately by modelling the probability that a sequence of sounds corresponds to a given word.|
|Mid 1980s||Invention||IBM develops the Tangora, an experimental voice-activated typewriter able to recognize a vocabulary of 20,000 spoken words.|
|1987||Invention||The invention of Worlds of Wonder's Julie doll, a toy children could train to respond to their voice, brings speech recognition technology to the home.|
|1990||Invention||Dragon launches Dragon Dictate, the first speech recognition product for consumers.|
|1993||Invention||Apple introduces Speakable Items, the first built-in speech recognition and voice-enabled control software for Apple computers.|
|1993||Invention||Sphinx-II, the first large-vocabulary continuous speech recognition system, is invented by Xuedong Huang.|
|1996||Invention||IBM launches MedSpeak, the first commercial product capable of recognizing continuous speech.|
|2002||Application||Microsoft integrates speech recognition into their Office products.|
|2006||Application||The National Security Agency begins using speech recognition to isolate keywords when analyzing recorded conversations.|
|2007||January 30||Application||Microsoft releases Windows Vista, the first version of Windows to incorporate speech recognition.|
|2007||Invention||Google introduces GOOG-411, a telephone-based directory service. This will serve as a foundation for the company's future Voice Search product.|
|2008||November 14||Application||Google launches the Voice Search app for the iPhone, bringing speech recognition technology to mobile devices.|
|2011||October 4||Invention||Apple announces Siri, a digital personal assistant. In addition to being able to recognize speech, Siri is able to understand the meaning of what it is told and take appropriate action.|
|2014||April 2||Application||Microsoft announces Cortana, a digital personal assistant similar to Siri.|
|2014||November 6||Invention||Amazon announces the Echo, a voice-controlled speaker. The Echo is powered by Alexa, a digital personal assistant similar to Siri and Cortana. While Siri and Cortana are not the most important features of the devices on which they run, the Echo is dedicated to Alexa.|
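The hidden Markov model technique noted in the early-1980s entry above can be illustrated with a toy Viterbi decoder: hidden states stand for words, observations stand for acoustic sound classes, and the algorithm recovers the most probable word sequence. Every probability and vocabulary item below is invented for illustration; real systems use phone-level models and learned parameters.

```python
# Toy illustration of the hidden Markov model idea used in speech
# recognition: hidden states are words, observations are "sound"
# labels, and Viterbi decoding recovers the most probable word path.
# All probabilities here are made up for illustration.

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most likely hidden-state path for the observations."""
    # best[s] = (probability, path) of the best path ending in state s
    best = {s: (start_p[s] * emit_p[s][observations[0]], [s]) for s in states}
    for obs in observations[1:]:
        best = {
            s: max(
                ((p * trans_p[prev][s] * emit_p[s][obs], path + [s])
                 for prev, (p, path) in best.items()),
                key=lambda x: x[0],
            )
            for s in states
        }
    return max(best.values(), key=lambda x: x[0])[1]

# Hypothetical two-word vocabulary with three observable sound classes.
states = ["yes", "no"]
start_p = {"yes": 0.5, "no": 0.5}
trans_p = {"yes": {"yes": 0.7, "no": 0.3}, "no": {"yes": 0.4, "no": 0.6}}
emit_p = {
    "yes": {"s1": 0.6, "s2": 0.3, "s3": 0.1},
    "no":  {"s1": 0.1, "s2": 0.2, "s3": 0.7},
}

print(viterbi(["s1", "s2", "s3"], states, start_p, trans_p, emit_p))
# → ['yes', 'yes', 'no']
```

Because each step keeps only the best path into each state, decoding is linear in the utterance length rather than exponential, which is what made HMMs practical for continuous speech.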
Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the main benefit of searchability. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.
Nuance Communications, Inc. is an American multinational computer software technology corporation, headquartered in Burlington, Massachusetts, that markets speech recognition and artificial intelligence software.
Speaker recognition is the identification of a person from characteristics of voices. It is used to answer the question "Who is speaking?" The term voice recognition can refer to speaker recognition or speech recognition. Speaker verification (confirming a claimed identity) contrasts with identification (determining who is speaking), and speaker recognition differs from speaker diarisation (determining when each speaker is talking).
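The core comparison step behind speaker identification can be sketched as follows: each enrolled speaker is represented by a fixed-length "voiceprint" vector (real systems derive these from audio as learned embeddings), and an unknown sample is assigned to the speaker whose vector is most similar. The names and vectors below are invented for illustration.

```python
# Toy sketch of speaker identification by nearest voiceprint.
# Real systems extract these vectors from audio; here they are
# hand-written illustrative numbers.
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def identify(sample, enrolled):
    """Return the enrolled speaker whose voiceprint best matches the sample."""
    return max(enrolled, key=lambda name: cosine(sample, enrolled[name]))

enrolled = {
    "alice": [0.9, 0.1, 0.3],  # hypothetical enrolled voiceprints
    "bob":   [0.2, 0.8, 0.5],
}
print(identify([0.85, 0.15, 0.25], enrolled))  # → alice
```

Verification, by contrast, would compare the sample against a single claimed speaker's voiceprint and accept or reject based on a similarity threshold.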
A voice-user interface (VUI) makes spoken human interaction with computers possible, using speech recognition to understand spoken commands and answer questions, and typically text to speech to play a reply. A voice command device (VCD) is a device controlled with a voice user interface.
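The command-dispatch layer of a voice user interface can be sketched as a mapping from recognized utterances to actions. A real VUI fronts this with a speech recognizer and plays the reply via text to speech; here the "recognized" text is passed in directly, and all command phrases are hypothetical.

```python
# Minimal sketch of the command-dispatch layer of a voice user
# interface. The speech recognizer and text-to-speech stages are
# omitted; this only maps recognized text to a spoken-style reply.

def make_vui(commands):
    """Return a handler mapping a recognized utterance to a reply string."""
    def handle(utterance):
        action = commands.get(utterance.strip().lower())
        if action is None:
            return "Sorry, I didn't understand that."  # fallback reply
        return action()  # reply text a real device would speak aloud
    return handle

handle = make_vui({
    "what time is it": lambda: "It is 12 o'clock.",
    "turn on the lights": lambda: "Turning on the lights.",
})

print(handle("Turn on the lights"))      # → Turning on the lights.
print(handle("open the pod bay doors"))  # falls back to the error reply
```

Production VUIs replace the exact-match lookup with intent classification, so that "switch the lights on" and "turn on the lights" resolve to the same action.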
Google Voice Search or Search by Voice is a Google product that allows users to use Google Search by speaking on a mobile phone or computer, i.e., to have the device search for data based on spoken input rather than typed queries.
Voice search, also called voice-enabled search, allows the user to use a voice command to search the Internet, a website, or an app.
SoundHound Inc. is an audio and speech recognition company founded in 2005. It develops speech recognition, natural language understanding, sound recognition and search technologies. Its featured products include Houndify, a Voice AI developer platform, Hound, a voice-enabled digital assistant, and music recognition mobile app SoundHound. The company’s headquarters are in Santa Clara, California.
An intelligent virtual assistant (IVA) or intelligent personal assistant (IPA) is a software agent that can perform tasks or services for an individual based on commands or questions. The term "chatbot" is sometimes used to refer to virtual assistants generally or specifically to those accessed by online chat. In some cases, online chat programs are exclusively for entertainment purposes. Some virtual assistants are able to interpret human speech and respond via synthesized voices. Users can ask their assistants questions, control home automation devices and media playback via voice, and manage other basic tasks such as email, to-do lists, and calendars with verbal commands. A similar concept, with some differences, underlies dialogue systems.
Siri is a virtual assistant that is part of Apple Inc.'s iOS, iPadOS, watchOS, macOS, tvOS, and audioOS operating systems. It uses voice queries, gesture-based control, focus-tracking and a natural-language user interface to answer questions, make recommendations, and perform actions by delegating requests to a set of Internet services. With continued use, it adapts to users' individual language usages, searches and preferences, returning individualized results.
Roberto Pieraccini is an Italian and US electrical engineer working in the field of speech recognition, natural language understanding, and spoken dialog systems. He is currently Director of Engineering at Google in Zurich, Switzerland, within the Google Assistant organization. He has been an active contributor to speech research and technology since 1981.
Yap Speech Cloud was a multimodal speech recognition system developed by American technology company Yap Inc. It offered a fully cloud-based speech-to-text transcription platform that was used by customers such as Microsoft.
Michael Phillips is the CEO and co-founder of Sense Labs and a pioneer in machine learning, including mobile speech recognition and text-to-speech technology.
A smart speaker is a type of loudspeaker and voice command device with an integrated virtual assistant that offers interactive actions and hands-free activation with the help of one "hot word". Some smart speakers can also act as a smart device that utilizes Wi-Fi, Bluetooth and other protocol standards to extend usage beyond audio playback, such as to control home automation devices. This can include, but is not limited to, features such as compatibility across a number of services and platforms, peer-to-peer connection through mesh networking, virtual assistants, and others. Each can have its own designated interface and features in-house, usually launched or controlled via application or home automation software. Some smart speakers also include a screen to show the user a visual response.
Stephen John Young is a British researcher, Professor of Information Engineering at the University of Cambridge and an entrepreneur. He is one of the pioneers of automated speech recognition and statistical spoken dialogue systems. He served as the Senior Pro-Vice-Chancellor of the University of Cambridge from 2009 to 2015, responsible for Planning and Resources. From 2015 to 2019, he held a joint appointment between his professorship at Cambridge and Apple, where he was a senior member of the Siri development team.
WaveNet is a deep neural network for generating raw audio. It was created by researchers at London-based AI firm DeepMind. The technique, outlined in a paper in September 2016, is able to generate relatively realistic-sounding human-like voices by directly modelling waveforms using a neural network method trained with recordings of real speech. Tests with US English and Mandarin reportedly showed that the system outperforms Google's best existing text-to-speech (TTS) systems, although as of 2016 its text-to-speech synthesis still was less convincing than actual human speech. WaveNet's ability to generate raw waveforms means that it can model any kind of audio, including music.
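WaveNet's direct modelling of waveforms rests on stacks of dilated causal convolutions: each output sample depends only on current and past samples, and doubling the dilation at each layer grows the receptive field exponentially. The sketch below shows only this structural idea with arbitrary fixed weights; a real WaveNet learns its weights and adds gated activations and skip connections.

```python
# Toy sketch of the dilated causal convolutions at the heart of
# WaveNet. Each output sample looks only backwards in time, and
# stacking dilations 1, 2, 4 gives a receptive field of
# 1 + (2 - 1) * (1 + 2 + 4) = 8 samples. Weights are arbitrary.

def causal_dilated_conv(signal, weights, dilation):
    """1-D causal convolution with the given dilation (zero-padded past)."""
    out = []
    for t in range(len(signal)):
        acc = 0.0
        for k, w in enumerate(weights):
            idx = t - k * dilation  # only current and past samples
            if idx >= 0:
                acc += w * signal[idx]
        out.append(acc)
    return out

# Feed an impulse through three layers and watch it spread over
# the full 8-sample receptive field.
layer = [1.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0]
for d in (1, 2, 4):
    layer = causal_dilated_conv(layer, [0.5, 0.5], d)
print(layer)  # → [0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125, 0.125]
```

This exponential growth in context is what lets WaveNet condition each generated sample on thousands of preceding raw-audio samples at a manageable computational cost.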
AliGenie is a China-based open-platform intelligent personal assistant launched and developed by Alibaba Group, currently used in the Tmall Genie smart speaker. The platform was introduced in 2017, along with the Tmall Genie X1, at Alibaba's 2017 Computing Conference in Hangzhou.
Joy Adowaa Buolamwini is a Ghanaian-American-Canadian computer scientist and digital activist based at the MIT Media Lab. Buolamwini introduces herself as a poet of code, daughter of art and science. She founded the Algorithmic Justice League, an organization that looks to challenge bias in decision-making software. The organization does this by blending art, advocacy, and research to highlight the social implications and harms of AI.
Patricia Scanlon is an Irish entrepreneur. She founded SoapBox Labs in 2013, a company that applies artificial intelligence to develop voice and speech recognition applications that are specifically tuned to children's voices. It builds language-learning applications for education, such as text reading and speech therapy, and modules for toys, gaming, voice control, augmented reality, virtual reality, robotics, and the Internet of things. As of 2015, she is CEO of SoapBox Labs, headquartered in Dublin, Ireland. The startup raised $3.6 million.
Female gendering of AI technologies is the use of artificial intelligence (AI) technologies gendered as female, such as in digital voice or written assistants. These gender-specific aspects of AI technologies, created by both humans and algorithms, were discussed in a 2019 policy paper and two complementary documents under the title I'd Blush if I Could: Closing Gender Divides in Digital Skills Through Education. Published under an open access licence by EQUALS Global Partnership and UNESCO, it has prompted further discussion on gender-related bias in the global virtual space.