Tony Robinson (speech recognition)

Tony Robinson is a researcher in the application of recurrent neural networks to speech recognition, [1] [2] [3] and was one of the first to demonstrate the practical capabilities of deep neural networks and their application to speech recognition. [4]

Education and early career

Robinson studied natural sciences at Cambridge University between 1981 and 1984, where he specialized in physics. He went on to complete an MPhil in computer speech and language processing in 1985 and continued with a PhD in the same area in 1989, both at Cambridge. He first published on the topic of speech recognition during his PhD [5] and has published over a hundred widely cited research papers on automatic speech recognition (ASR) in the years since. [6] Robinson became an EPSRC-funded research fellow in 1990 and a Lecturer at Cambridge University in 1995.

Entrepreneurial career

In 1995, Robinson formed SoftSound Ltd, a speech technology company which was acquired by Autonomy with a view to using the technology to make unstructured video and voice data easily searchable. Based on recurrent neural networks, Robinson helped build the fastest large-vocabulary speech recognition system available at the time, which operated in more languages than any other system. [7]

From 2008 to 2010, Robinson was the Director of the Advanced Speech Group at SpinVox, a provider of speech-to-text conversion services for carrier markets, including wireless, VoIP and cable. Their automatic speech recognition (ASR) system was, for a time, used more than one million times per day, and SpinVox was subsequently acquired by the global speech technology company Nuance. [8]

Robinson was also the founder of Speechmatics, which launched its cloud-based speech recognition services in 2012. Speechmatics subsequently announced a new technology for accelerated language modelling late in 2017. [9] Robinson continues to publish papers on speech recognition technology, especially in the area of statistical language modelling. [10]

Related Research Articles

Artificial neural network: Computational model used in machine learning, based on connected, hierarchical functions

Artificial neural networks are a class of machine learning models whose design, following the principles of connectionism, is inspired by the organization of neurons in the biological neural networks constituting animal brains.

Natural language processing: Field of linguistics and computer science

Natural language processing (NLP) is an interdisciplinary subfield of computer science and linguistics. It is primarily concerned with giving computers the ability to support and manipulate human language. It involves processing natural language datasets, such as text corpora or speech corpora, using either rule-based or probabilistic machine learning approaches. The goal is a computer capable of "understanding" the contents of documents, including the contextual nuances of the language within them. The technology can then accurately extract information and insights contained in the documents as well as categorize and organize the documents themselves.

Speech processing is the study of speech signals and of the methods used to process them. The signals are usually processed in a digital representation, so speech processing can be regarded as a special case of digital signal processing, applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Different speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement, speaker recognition, etc.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

SpeechFX

SpeechFX, Inc., offers voice technology for mobile phones and wireless devices, interactive video games, toys, home appliances, computer telephony systems and vehicle telematics. SpeechFX speech solutions are based on the firm's proprietary neural network-based automatic speech recognition (ASR) and Fonix DECtalk, a text-to-speech (TTS) synthesis system. Fonix speech technology is user-independent, meaning no voice training is involved.

Recurrent neural network: Computational model used in machine learning

A recurrent neural network (RNN) is one of the two broad types of artificial neural network, characterized by the direction of the flow of information between its layers. In contrast to the uni-directional feedforward neural network, an RNN contains cyclic connections, allowing the output from some nodes to affect subsequent input to the same nodes. Their ability to use internal state (memory) to process arbitrary sequences of inputs makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. The term "recurrent neural network" is used to refer to the class of networks with an infinite impulse response, whereas "convolutional neural network" refers to the class with a finite impulse response. Both classes of networks exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled.
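
The recurrence described above can be sketched in a few lines. This is a minimal illustration, not any specific system from the article; the dimensions, tanh activation and random weights are assumptions chosen purely for demonstration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the article).
n_in, n_hidden = 3, 4
W_in = rng.standard_normal((n_hidden, n_in)) * 0.1    # input-to-hidden weights
W_rec = rng.standard_normal((n_hidden, n_hidden)) * 0.1  # hidden-to-hidden (the cycle)

def rnn_step(h, x):
    """One recurrent update: the new hidden state depends on the old one,
    which is the cyclic connection that distinguishes an RNN from a
    feedforward network."""
    return np.tanh(W_in @ x + W_rec @ h)

# Process a length-5 input sequence, carrying the state across timesteps.
h = np.zeros(n_hidden)
for x in rng.standard_normal((5, n_in)):
    h = rnn_step(h, x)
print(h.shape)  # (4,)
```

Because `h` is threaded through every step, the final state summarizes the whole sequence, which is what makes RNNs suitable for unsegmented speech or handwriting.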

Dragon NaturallySpeaking: Speech recognition software package

Dragon NaturallySpeaking is a speech recognition software package developed by Dragon Systems of Newton, Massachusetts, which was acquired in turn by Lernout & Hauspie Speech Products, Nuance Communications, and Microsoft. It runs on Windows personal computers. Version 15, which supports 32-bit and 64-bit editions of Windows 7, 8 and 10, was released in August 2016.

A language model is a probabilistic model of a natural language that assigns probabilities to sequences of words, based on the text corpora in one or more languages on which it was trained. The first significant statistical language model was proposed in 1980, and during that decade IBM performed 'Shannon-style' experiments, in which potential sources of language-modelling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.
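
The idea of assigning probabilities to word sequences can be shown with the simplest statistical language model, a maximum-likelihood bigram model. The toy corpus below is an assumption for illustration only and has nothing to do with the models discussed in the article.

```python
from collections import Counter

# Toy training corpus (assumed for illustration).
corpus = "the cat sat on the mat the cat ran".split()

# Count bigrams (w_prev, w) and the contexts w_prev they start from.
bigrams = Counter(zip(corpus, corpus[1:]))
contexts = Counter(corpus[:-1])

def p_next(word, prev):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / contexts[prev]

def p_sequence(words):
    """Probability of a word sequence as a product of bigram probabilities."""
    p = 1.0
    for prev, word in zip(words, words[1:]):
        p *= p_next(word, prev)
    return p

# Two of the three occurrences of "the" are followed by "cat".
print(p_next("cat", "the"))
```

Real ASR language models add smoothing for unseen bigrams and longer contexts (or, as in the neural approaches the article discusses, replace counts with a learned network), but the underlying quantity, the probability of a word sequence, is the same.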

Echo state network: Type of reservoir computer

An echo state network (ESN) is a type of reservoir computer that uses a recurrent neural network with a sparsely connected hidden layer. The connectivity and weights of hidden neurons are fixed and randomly assigned. The weights of output neurons can be learned so that the network can produce or reproduce specific temporal patterns. The main interest of this network is that although its behavior is non-linear, the only weights modified during training are those of the synapses connecting the hidden neurons to the output neurons. Thus, the error function is quadratic with respect to the parameter vector and can be minimized by solving a linear system.
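
A minimal sketch of this training scheme, under assumed toy settings (a 50-unit dense reservoir, a next-sample prediction task on a sine wave): the reservoir weights stay fixed, and only the linear readout is fitted by least squares.

```python
import numpy as np

rng = np.random.default_rng(1)
n_res, n_steps = 50, 200  # reservoir size and sequence length (assumptions)

# Fixed, randomly assigned reservoir weights -- never trained.
W_res = rng.standard_normal((n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))  # spectral radius < 1
W_in = rng.standard_normal(n_res)

u = np.sin(np.linspace(0, 8 * np.pi, n_steps))  # input signal
target = np.roll(u, -1)                         # task: predict the next sample

# Drive the reservoir and collect its states.
x = np.zeros(n_res)
states = np.empty((n_steps, n_res))
for t in range(n_steps):
    x = np.tanh(W_in * u[t] + W_res @ x)
    states[t] = x

# Only the readout is learned: a linear least-squares problem.
w_out, *_ = np.linalg.lstsq(states, target, rcond=None)
pred = states @ w_out
```

Because the fit is linear in `w_out`, training reduces to one `lstsq` call; no backpropagation through time is needed.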

Reservoir computing is a framework for computation derived from recurrent neural network theory that maps input signals into higher dimensional computational spaces through the dynamics of a fixed, non-linear system called a reservoir. After the input signal is fed into the reservoir, which is treated as a "black box," a simple readout mechanism is trained to read the state of the reservoir and map it to the desired output. The first key benefit of this framework is that training is performed only at the readout stage, as the reservoir dynamics are fixed. The second is that the computational power of naturally available systems, both classical and quantum mechanical, can be used to reduce the effective computational cost.

Long short-term memory: Artificial recurrent neural network architecture used in deep learning

A long short-term memory (LSTM) network is a recurrent neural network (RNN) designed to deal with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models and other sequence learning methods. It aims to provide a short-term memory for RNNs that can last thousands of timesteps, hence "long short-term memory". It is applicable to classification, processing and prediction of time-series data, such as in handwriting, speech recognition, machine translation, speech activity detection, robot control, video games, and healthcare.
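
The gating mechanism behind this long-lived memory can be sketched as a single LSTM cell step. This follows the standard LSTM formulation rather than any particular system from the article; dimensions and random weights are assumptions for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h, c, W, b):
    """One LSTM cell update. The gates decide what to forget, what to write,
    and what to expose; the cell state c is carried forward largely
    unchanged when f is near 1, which counters the vanishing gradient."""
    n = h.size
    z = W @ np.concatenate([x, h]) + b
    f = sigmoid(z[:n])           # forget gate
    i = sigmoid(z[n:2 * n])      # input gate
    o = sigmoid(z[2 * n:3 * n])  # output gate
    g = np.tanh(z[3 * n:])       # candidate cell update
    c = f * c + i * g            # blend old memory with new content
    h = o * np.tanh(c)           # expose a gated view of the memory
    return h, c

rng = np.random.default_rng(0)
n_in, n_hidden = 3, 4  # illustrative sizes (assumptions)
W = rng.standard_normal((4 * n_hidden, n_in + n_hidden)) * 0.1
b = np.zeros(4 * n_hidden)

h, c = np.zeros(n_hidden), np.zeros(n_hidden)
for x in rng.standard_normal((6, n_in)):  # a length-6 input sequence
    h, c = lstm_step(x, h, c, W, b)
```

The additive update `c = f * c + i * g` is the key design choice: gradients flow through it without repeated multiplication by a weight matrix, unlike in a plain RNN.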

SpinVox was a start-up company that is now a subsidiary of the global speech technology company Nuance Communications, an American multinational computer software technology corporation headquartered in Burlington, Massachusetts, United States, on the outskirts of Boston, that provides speech and imaging applications. Initially, SpinVox provided voice-to-text conversion services for carrier markets, including wireless, fixed, VoIP and cable, as well as for unified communications, enterprise and Web 2.0 environments. This service was ostensibly provided through an automated computer system, with human intervention where needed. However, there were accusations that the system operated almost exclusively through the use of call-center workers in South Africa and the Philippines.

Time delay neural network

Time delay neural network (TDNN) is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.

Loquendo is an Italian multinational computer software technology corporation, headquartered in Torino, Italy, that provides speech recognition, speech synthesis, speaker verification and identification applications. Loquendo, which was founded in 2001 under the Telecom Italia Lab, also had offices in the United Kingdom, Spain, Germany, France, and the United States.

Deep learning: Branch of machine learning

Deep learning is a subset of machine learning methods based on artificial neural networks with representation learning. The adjective "deep" in deep learning refers to the use of multiple layers in the network. The methods used can be supervised, semi-supervised or unsupervised.

Nelson Harold Morgan is an American computer scientist and professor in residence (emeritus) of electrical engineering and computer science at the University of California, Berkeley. Morgan is the co-inventor of the Relative Spectral (RASTA) approach to speech signal processing, first described in a technical report published in 1991.

In signal processing, feature-space maximum likelihood linear regression (fMLLR) is a global feature transform that is typically applied in a speaker-adaptive way: fMLLR transforms acoustic features into speaker-adapted features by multiplication with a transformation matrix. In some literature, fMLLR is also known as constrained maximum likelihood linear regression (cMLLR).
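
Mechanically, applying an fMLLR transform is just an affine map over each acoustic frame. The sketch below shows only that application step; the feature dimension and the random transform are assumptions, since in practice the matrix is estimated by maximum likelihood against the acoustic model for a given speaker.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 13  # e.g. 13 cepstral coefficients per frame (an assumption)

# Speaker-specific transform. Real fMLLR estimates A and b by maximum
# likelihood; near-identity random values are used here for illustration.
A = np.eye(d) + 0.01 * rng.standard_normal((d, d))
b = 0.01 * rng.standard_normal(d)

def fmllr(features):
    """Apply the affine transform x' = A x + b to each acoustic frame."""
    return features @ A.T + b

frames = rng.standard_normal((100, d))  # 100 acoustic frames (dummy data)
adapted = fmllr(frames)
print(adapted.shape)  # (100, 13)
```

Because a single (A, b) pair covers all frames from one speaker, the adaptation is "global": it needs only a small amount of per-speaker data to estimate.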

Glossary of artificial intelligence: List of definitions of terms and concepts commonly used in the study of artificial intelligence

This glossary of artificial intelligence is a list of definitions of terms and concepts relevant to the study of artificial intelligence, its sub-disciplines, and related fields. Related glossaries include Glossary of computer science, Glossary of robotics, and Glossary of machine vision.

Speechmatics

Speechmatics is a technology company based in Cambridge, England, which develops automatic speech recognition software (ASR) based on recurrent neural networks and statistical language modelling. Speechmatics was originally named Cantab Research Ltd when founded in 2006 by speech recognition specialist Dr. Tony Robinson.

References

  1. Robinson, Tony; Fallside, Frank (July 1991). "A recurrent error propagation network speech recognition system". Computer Speech and Language. 5 (3): 259–274. doi:10.1016/0885-2308(91)90010-N.
  2. Robinson, Tony (1996). "The Use of Recurrent Neural Networks in Continuous Speech Recognition". Automatic Speech and Speaker Recognition. The Kluwer International Series in Engineering and Computer Science. Vol. 355. pp. 233–258. CiteSeerX 10.1.1.364.7237. doi:10.1007/978-1-4613-1367-0_10. ISBN 978-1-4612-8590-8.
  3. Wakefield, Jane (2008-03-14). "Speech recognition moves to text". BBC News. Retrieved 2020-08-24.
  4. Robinson, Tony (September 1993). "A neural network based, speaker independent, large vocabulary, continuous speech recognition system: the WERNICKE project". Third European Conference on Speech Communication and Technology. 1: 1941–1944. Retrieved 17 May 2018.
  5. Robinson, Anthony John (June 1989). "Dynamic Error Propagation Networks". PhD Thesis. Retrieved 17 May 2018.
  6. Robinson, Tony. "Tony Robinson - Profile". ResearchGate. Retrieved 17 May 2018.
  7. Robinson, Tony; Hochberg, Mike; Renals, Steve (1996). "The Use of Recurrent Neural Networks in Continuous Speech Recognition". Automatic Speech and Speaker Recognition. The Kluwer International Series in Engineering and Computer Science. Vol. 355. pp. 233–258. CiteSeerX 10.1.1.364.7237. doi:10.1007/978-1-4613-1367-0_10. ISBN 978-1-4612-8590-8.
  8. "Nuance Acquires SpinVox". Healthcare Innovation. 2011-06-24. Retrieved 2023-09-09.
  9. Orlowski, Andrew. "Brit neural net pioneer just revolutionised speech recognition all over again". The Register. Situation Publishing. Retrieved 17 May 2018.
  10. Chelba, Ciprian; Mikolov, Tomas; Schuster, Mike (2013). One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling (Report). Cornell University Library. arXiv:1312.3005.