Tony Robinson (speech recognition)

Tony Robinson is a researcher in the application of recurrent neural networks to speech recognition, [1] [2] [3] and was one of the first to demonstrate the practical capabilities of deep neural networks and their application to speech recognition. [4]

Education and Early Career

Robinson studied natural sciences at Cambridge University between 1981 and 1984, specializing in physics. He went on to complete an MPhil in computer speech and language processing in 1985 and a PhD in the same area in 1989, both at Cambridge. He first published on the topic of speech recognition during his PhD [5] and has since published over a hundred widely cited research papers on automatic speech recognition (ASR). [6]

Entrepreneurial Career

In 1995, Robinson founded SoftSound Ltd, a speech technology company that was acquired by Autonomy with a view to using the technology to make unstructured video and voice data easily searchable. Robinson helped build what was then the fastest large-vocabulary speech recognition system available, based on recurrent neural networks and operating in more languages than any comparable system. [7]

From 2008 to 2010, Robinson was the Director of the Advanced Speech Group at SpinVox, a provider of speech-to-text conversion services for carrier markets, including wireless, VoIP and cable. Its automatic speech recognition (ASR) system was, for a time, used more than one million times per day, and SpinVox was subsequently acquired by the global speech technology company Nuance. [8]

Robinson was also the founder of Speechmatics, which launched its cloud-based speech recognition services in 2012. Late in 2017, Speechmatics announced a technology that greatly accelerates the building of models for new languages. [9] Robinson continues to publish papers on speech recognition technology, especially in the area of statistical language modelling. [10]

Related Research Articles

Neural network (machine learning) – Computational model used in machine learning, based on connected, hierarchical functions

In machine learning, a neural network is a model inspired by the structure and function of biological neural networks in animal brains.

Natural language processing (NLP) is an interdisciplinary subfield of computer science and artificial intelligence. It is primarily concerned with providing computers with the ability to process data encoded in natural language and is thus closely related to information retrieval, knowledge representation and computational linguistics, a subfield of linguistics. Typically, data is collected in text corpora and processed using rule-based, statistical or neural approaches from machine learning and deep learning.

Speech processing is the study of speech signals and of methods for processing them. The signals are usually handled in a digital representation, so speech processing can be regarded as a special case of digital signal processing applied to speech signals. Aspects of speech processing include the acquisition, manipulation, storage, transfer and output of speech signals. Speech processing tasks include speech recognition, speech synthesis, speaker diarization, speech enhancement and speaker recognition.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech-to-text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

Pattern recognition is the task of assigning a class to an observation based on patterns extracted from data. It has applications in statistical data analysis, signal processing, image analysis, information retrieval, bioinformatics, data compression, computer graphics and machine learning. Pattern recognition has its origins in statistics and engineering; many modern approaches use machine learning, owing to the increased availability of big data and an abundance of processing power.

SpeechFX

SpeechFX, Inc. offers voice technology for mobile phones and wireless devices, interactive video games, toys, home appliances, computer telephony systems and vehicle telematics. SpeechFX speech solutions are based on the firm's proprietary neural-network-based automatic speech recognition (ASR) and Fonix DECtalk, a text-to-speech (TTS) synthesis system. Fonix speech technology is speaker-independent, meaning no voice training is involved.

A recurrent neural network (RNN) is one of the two broad types of artificial neural network, characterized by the direction of information flow between its layers. In contrast to a feedforward neural network, in which information flows in one direction only, an RNN contains feedback connections, allowing the output of some nodes to affect subsequent input to the same nodes. This ability to use internal state (memory) to process arbitrary sequences of inputs makes RNNs applicable to tasks such as unsegmented, connected handwriting recognition and speech recognition. The term "recurrent neural network" refers to the class of networks with an infinite impulse response, whereas "convolutional neural network" refers to the class with a finite impulse response. Both classes exhibit temporal dynamic behavior. A finite impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite impulse recurrent network is a directed cyclic graph that cannot be unrolled.
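
As an illustration of the feedback described above, here is a minimal sketch of a single recurrent (Elman-style) cell in Python. The layer sizes, random weights and dummy input sequence are all invented for the example; this is not Robinson's system.

```python
import numpy as np

# Minimal Elman-style recurrent cell (illustrative sketch only).
rng = np.random.default_rng(0)
n_in, n_hidden = 4, 8  # arbitrary sizes for the example

W_xh = rng.normal(scale=0.1, size=(n_hidden, n_in))      # input -> hidden
W_hh = rng.normal(scale=0.1, size=(n_hidden, n_hidden))  # hidden -> hidden (the recurrence)
b_h = np.zeros(n_hidden)

def step(x_t, h_prev):
    """One time step: the previous hidden state feeds back into the cell."""
    return np.tanh(W_xh @ x_t + W_hh @ h_prev + b_h)

h = np.zeros(n_hidden)
for x_t in rng.normal(size=(10, n_in)):  # a dummy sequence of 10 frames
    h = step(x_t, h)  # internal state carries information across time
```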

A language model is a probabilistic model of a natural language. In 1980, the first significant statistical language model was proposed, and during the decade IBM performed ‘Shannon-style’ experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.
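
To make "probabilistic model of a natural language" concrete, here is a toy bigram model in Python; the corpus and test phrase are invented for illustration, and real systems add smoothing for unseen word pairs.

```python
from collections import Counter

# Toy bigram language model: P(w_i | w_{i-1}) estimated by counting.
corpus = "the cat sat on the mat the cat ran".split()
bigrams = Counter(zip(corpus, corpus[1:]))
unigrams = Counter(corpus)

def bigram_prob(prev, word):
    """Maximum-likelihood estimate of P(word | prev)."""
    return bigrams[(prev, word)] / unigrams[prev]

# Probability of "the cat sat" given it starts with "the":
p = bigram_prob("the", "cat") * bigram_prob("cat", "sat")
print(p)  # 2/3 * 1/2 = 1/3
```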

Echo state network – Type of reservoir computer

An echo state network (ESN) is a type of reservoir computer that uses a recurrent neural network with a sparsely connected hidden layer. The connectivity and weights of hidden neurons are fixed and randomly assigned. The weights of output neurons can be learned so that the network can produce or reproduce specific temporal patterns. The main interest of this network is that although its behavior is non-linear, the only weights modified during training are those of the synapses connecting the hidden neurons to the output neurons. Thus, the error function is quadratic in the parameter vector and can be minimized easily as a linear system.

Reservoir computing is a framework for computation derived from recurrent neural network theory that maps input signals into higher dimensional computational spaces through the dynamics of a fixed, non-linear system called a reservoir. After the input signal is fed into the reservoir, which is treated as a "black box," a simple readout mechanism is trained to read the state of the reservoir and map it to the desired output. The first key benefit of this framework is that training is performed only at the readout stage, as the reservoir dynamics are fixed. The second is that the computational power of naturally available systems, both classical and quantum mechanical, can be used to reduce the effective computational cost.
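
A minimal sketch of an echo state network, which also illustrates the reservoir computing recipe above: the reservoir is fixed and random, and only the linear readout is trained, here by ridge regression in closed form. The sizes, spectral-radius scaling and toy next-sample prediction task are all assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(1)
n_in, n_res, T = 1, 50, 500  # arbitrary sizes for the sketch

W_in = rng.uniform(-0.5, 0.5, size=(n_res, n_in))
W = rng.normal(size=(n_res, n_res))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))  # scale spectral radius below 1

u = np.sin(np.linspace(0, 20, T)).reshape(-1, 1)  # toy input signal
y = np.roll(u, -1, axis=0)                        # target: predict next sample

# Run the fixed reservoir; we only collect states, no training happens here.
X = np.zeros((T, n_res))
x = np.zeros(n_res)
for t in range(T):
    x = np.tanh(W_in @ u[t] + W @ x)
    X[t] = x

# Train the linear readout in closed form (ridge regression).
lam = 1e-6
W_out = np.linalg.solve(X.T @ X + lam * np.eye(n_res), X.T @ y)
pred = X @ W_out
```

Because the reservoir is never touched, the whole training step is the single linear solve on the second-to-last line.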

Long short-term memory – Artificial recurrent neural network architecture used in deep learning

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) aimed at dealing with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models and other sequence learning methods. It aims to provide RNNs with a short-term memory that can last thousands of timesteps, hence "long short-term memory". It is applicable to classifying, processing and predicting time series data, such as in handwriting, speech recognition, machine translation, speech activity detection, robot control, video games, and healthcare.
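
A sketch of a single LSTM step may help show how the gates mediate the cell state; the stacked-parameter layout and the random dimensions below are assumptions made for brevity.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One LSTM step. The gates control what the cell state keeps,
    forgets, and emits; the additive cell update is what mitigates
    vanishing gradients. W, U, b stack parameters for all four gates."""
    z = W @ x + U @ h + b
    i, f, o, g = np.split(z, 4)
    i, f, o = 1/(1+np.exp(-i)), 1/(1+np.exp(-f)), 1/(1+np.exp(-o))  # sigmoid gates
    g = np.tanh(g)        # candidate values
    c = f * c + i * g     # cell state: additive update eases gradient flow
    h = o * np.tanh(c)    # hidden state / output
    return h, c

# Dummy sizes and random parameters, purely for illustration.
rng = np.random.default_rng(2)
n_in, n_hid = 3, 5
W = rng.normal(scale=0.1, size=(4 * n_hid, n_in))
U = rng.normal(scale=0.1, size=(4 * n_hid, n_hid))
b = np.zeros(4 * n_hid)
h = c = np.zeros(n_hid)
for x in rng.normal(size=(7, n_in)):  # dummy sequence of 7 frames
    h, c = lstm_step(x, h, c, W, U, b)
```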

SpinVox was a start-up company that is now a subsidiary of Nuance Communications, an American multinational software corporation headquartered in Burlington, Massachusetts, on the outskirts of Boston, that provides speech and imaging applications. Initially, SpinVox provided voice-to-text conversion services for carrier markets, including wireless, fixed, VoIP and cable, as well as for unified communications, enterprise and Web 2.0 environments. This service was ostensibly provided through an automated computer system, with human intervention where needed; however, there were accusations that the system operated almost exclusively through call-centre workers in South Africa and the Philippines.

Time delay neural network

A time delay neural network (TDNN) is a multilayer artificial neural network architecture whose purpose is to classify patterns with shift invariance and to model context at each layer of the network.
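
Concretely, a TDNN layer applies the same weights at every time offset, which is what yields shift invariance. The following sketch, with invented sizes, treats one layer as a 1-D convolution over a window of context frames.

```python
import numpy as np

rng = np.random.default_rng(3)
n_feat, n_out, context, T = 13, 16, 5, 100  # arbitrary sizes for the sketch

frames = rng.normal(size=(T, n_feat))  # e.g. one feature vector per 10 ms
W = rng.normal(scale=0.1, size=(n_out, context * n_feat))

def tdnn_layer(frames):
    """Slide a window of `context` frames over time, applying the same
    weights at every shift (a 1-D convolution over the sequence)."""
    out = []
    for t in range(len(frames) - context + 1):
        window = frames[t:t + context].ravel()  # context frames, flattened
        out.append(np.tanh(W @ window))         # shared weights at every shift
    return np.array(out)

print(tdnn_layer(frames).shape)  # (96, 16): one output per time position
```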

Loquendo is an Italian multinational computer software technology corporation, headquartered in Torino, Italy, that provides speech recognition, speech synthesis, speaker verification and identification applications. Loquendo, which was founded in 2001 under Telecom Italia Lab, also had offices in the United Kingdom, Spain, Germany, France, and the United States.

Deep learning – Branch of machine learning

Deep learning is a subset of machine learning methods based on neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. Methods used can be supervised, semi-supervised or unsupervised.

Nelson Harold Morgan is an American computer scientist and professor in residence (emeritus) of electrical engineering and computer science at the University of California, Berkeley. Morgan is the co-inventor of the Relative Spectral (RASTA) approach to speech signal processing, first described in a technical report published in 1991.

In signal processing, feature-space maximum likelihood linear regression (fMLLR) is a global feature transform that is typically applied in a speaker-adaptive way: fMLLR transforms acoustic features into speaker-adapted features by multiplication with a transformation matrix. In some literature, fMLLR is also known as constrained maximum likelihood linear regression (cMLLR).
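
A minimal sketch of the feature-side operation: each acoustic feature vector is mapped through an affine transform. The matrix below is random, standing in for one that would actually be estimated from a speaker's data by maximum likelihood.

```python
import numpy as np

rng = np.random.default_rng(4)
d = 13  # e.g. 13 cepstral coefficients per frame (an assumption)

# Stand-in transform; in practice A and b are estimated per speaker.
A = np.eye(d) + 0.01 * rng.normal(size=(d, d))
b = 0.01 * rng.normal(size=d)

features = rng.normal(size=(200, d))  # dummy frames from one speaker
adapted = features @ A.T + b          # x' = A x + b, applied to every frame
```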

Connectionist temporal classification (CTC) is a type of neural network output and associated scoring function, for training recurrent neural networks (RNNs) such as LSTM networks to tackle sequence problems where the timing is variable. It can be used for tasks like on-line handwriting recognition or recognizing phonemes in speech audio. CTC refers to the outputs and scoring, and is independent of the underlying neural network structure. It was introduced in 2006.
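
The many-to-one collapsing function at the heart of CTC can be sketched directly: repeated symbols are merged, then blanks are deleted, mapping per-frame outputs to a label sequence. The blank symbol and example frame outputs below are invented for illustration.

```python
from itertools import groupby

BLANK = "-"  # the CTC blank symbol (choice of character is arbitrary)

def ctc_collapse(frames):
    """CTC's collapsing function B: merge repeats, then drop blanks."""
    merged = [k for k, _ in groupby(frames)]   # merge repeated symbols
    return [s for s in merged if s != BLANK]   # delete blanks

print(ctc_collapse(list("--hh-e-ll-lloo--")))  # ['h', 'e', 'l', 'l', 'o']
```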

Speechmatics is a technology company based in Cambridge, England, which develops automatic speech recognition (ASR) software based on recurrent neural networks and statistical language modelling. Speechmatics was originally named Cantab Research Ltd when founded in 2006 by speech recognition specialist Dr. Tony Robinson.

Artificial neural networks (ANNs) are models created using machine learning to perform a number of tasks. Their creation was inspired by neural circuitry. While some of the computational ideas behind ANNs relate to earlier discoveries in mathematics, the first implementation of ANNs was by psychologist Frank Rosenblatt, who developed the perceptron. Little research was conducted on ANNs in the 1970s and 1980s, with the AAAI calling that period an "AI winter".

References

  1. Robinson, Tony; Fallside, Frank (July 1991). "A recurrent error propagation network speech recognition system". Computer Speech and Language. 5 (3): 259–274. doi:10.1016/0885-2308(91)90010-N.
  2. Robinson, Tony (1996). "The Use of Recurrent Neural Networks in Continuous Speech Recognition". Automatic Speech and Speaker Recognition. The Kluwer International Series in Engineering and Computer Science. Vol. 355. pp. 233–258. CiteSeerX 10.1.1.364.7237. doi:10.1007/978-1-4613-1367-0_10. ISBN 978-1-4612-8590-8.
  3. Wakefield, Jane (2008-03-14). "Speech recognition moves to text". BBC News. Retrieved 2020-08-24.
  4. Robinson, Tony (September 1993). "A neural network based, speaker independent, large vocabulary, continuous speech recognition system: the WERNICKE project". Third European Conference on Speech Communication and Technology. 1: 1941–1944. Retrieved 17 May 2018.
  5. Robinson, Anthony John (June 1989). "Dynamic Error Propagation Networks". PhD thesis. Retrieved 17 May 2018.
  6. Robinson, Tony. "Tony Robinson – Profile". ResearchGate. Retrieved 17 May 2018.
  7. Robinson, Tony; Hochberg, Mike; Renals, Steve (1996). "The Use of Recurrent Neural Networks in Continuous Speech Recognition". Automatic Speech and Speaker Recognition. The Kluwer International Series in Engineering and Computer Science. Vol. 355. pp. 233–258. CiteSeerX 10.1.1.364.7237. doi:10.1007/978-1-4613-1367-0_10. ISBN 978-1-4612-8590-8.
  8. "Nuance Acquires SpinVox". Healthcare Innovation. 2011-06-24. Retrieved 2023-09-09.
  9. Orlowski, Andrew. "Brit neural net pioneer just revolutionised speech recognition all over again". The Register. Situation Publishing. Retrieved 17 May 2018.
  10. Chelba, Ciprian; Mikolov, Tomas; Schuster, Mike (2013). One Billion Word Benchmark for Measuring Progress in Statistical Language Modeling (Report). Cornell University Library. arXiv:1312.3005.