Time delay neural network

[Figure: TDNN diagram]

Time delay neural network (TDNN) [1] is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.


Shift-invariant classification means that the classifier does not require explicit segmentation prior to classification. For the classification of a temporal pattern (such as speech), the TDNN thus avoids having to determine the beginning and end points of sounds before classifying them.

For contextual modelling in a TDNN, each neural unit at each layer receives input not only from activations/features at the layer below, but from a pattern of unit output and its context. For time signals each unit receives as input the activation patterns over time from units below. Applied to two-dimensional classification (images, time-frequency patterns), the TDNN can be trained with shift-invariance in the coordinate space and avoids precise segmentation in the coordinate space.
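The contextual input described above can be sketched as a single unit that reads the current frame plus a fixed number of time-delayed frames. This is a minimal illustration, not code from the original work; the feature dimension (16) and delay (2) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, delay = 16, 2          # delay D = 2: window of 3 consecutive frames
weights = rng.standard_normal((delay + 1, n_features))
bias = 0.0

def tdnn_unit(frames: np.ndarray) -> float:
    """Activation of one unit given frames[t-D .. t], shape (D+1, n_features)."""
    return np.tanh(np.sum(weights * frames) + bias)

signal = rng.standard_normal((10, n_features))   # 10 frames of spectral features
# Slide the same unit (same weights) over time: one output per valid position.
outputs = [tdnn_unit(signal[t:t + delay + 1]) for t in range(len(signal) - delay)]
print(len(outputs))  # 8 valid positions for 10 frames and a 3-frame window
```

Because the same weights are applied at every time step, the unit responds to its learned pattern wherever it occurs in the signal.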

History

The TDNN was introduced in the late 1980s and applied to a task of phoneme classification for automatic speech recognition in speech signals where the automatic determination of precise segments or feature boundaries was difficult or impossible. Because the TDNN recognizes phonemes and their underlying acoustic/phonetic features, independent of position in time, it improved performance over static classification. [1] [2] It was also applied to two-dimensional signals (time-frequency patterns in speech, [3] and coordinate space pattern in OCR [4] ).

Kunihiko Fukushima published the neocognitron in 1980. [5] Max pooling appears in a 1982 publication on the neocognitron. [6] A related pooling (sub-sampling) operation was used in the 1989 convolutional network of LeCun et al. that led to LeNet-5. [7]

In 1990, Yamaguchi et al. used max pooling in TDNNs in order to realize a speaker independent isolated word recognition system. [8]

Overview

The Time Delay Neural Network, like other neural networks, operates with multiple interconnected layers of perceptrons, and is implemented as a feedforward neural network. All neurons (at each layer) of a TDNN receive inputs from the outputs of neurons at the layer below but with two differences:

  1. Unlike regular Multi-Layer perceptrons, all units in a TDNN, at each layer, obtain inputs from a contextual window of outputs from the layer below. For time varying signals (e.g. speech), each unit has connections to the output from units below but also to the time-delayed (past) outputs from these same units. This models the units' temporal pattern/trajectory. For two-dimensional signals (e.g. time-frequency patterns or images), a 2-D context window is observed at each layer. Higher layers have inputs from wider context windows than lower layers and thus generally model coarser levels of abstraction.
  2. Shift-invariance is achieved by explicitly removing position dependence during backpropagation training. This is done by making time-shifted copies of a network across the dimension of invariance (here: time). The error gradient is then computed by backpropagation through all these networks from an overall target vector, but before performing the weight update, the error gradients associated with shifted copies are averaged and thus shared and constrained to be equal. Thus, all position dependence from backpropagation training through the shifted copies is removed and the copied networks learn the most salient hidden features shift-invariantly, i.e. independent of their precise position in the input data. Shift-invariance is also readily extended to multiple dimensions by imposing similar weight-sharing across copies that are shifted along multiple dimensions. [3] [4]
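The two properties above can be combined into a sketch of one full TDNN layer: every output vector is computed from a context window of the layer below, and the same weight tensor is applied at every time shift (weight sharing). The sizes here are illustrative assumptions, not values from the original papers.

```python
import numpy as np

rng = np.random.default_rng(1)
in_dim, out_dim, context = 16, 8, 3            # 3-frame context window
W = rng.standard_normal((context, in_dim, out_dim)) * 0.1
b = np.zeros(out_dim)

def tdnn_layer(x: np.ndarray) -> np.ndarray:
    """x: (T, in_dim) -> (T - context + 1, out_dim); weights shared across time."""
    T = x.shape[0]
    return np.tanh(np.stack([
        sum(x[t + d] @ W[d] for d in range(context)) + b   # same W at every shift
        for t in range(T - context + 1)
    ]))

x = rng.standard_normal((20, in_dim))   # e.g. 20 frames of spectral coefficients
h = tdnn_layer(x)
print(h.shape)  # (18, 8): one 8-dim feature vector per valid time position
```

Written this way, the layer is exactly a one-dimensional convolution over time, which is why the TDNN is regarded as an early convolutional network.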

Example

In the case of a speech signal, inputs are spectral coefficients over time.

In order to learn critical acoustic-phonetic features (for example formant transitions, bursts, frication, etc.) without first requiring precise localization, the TDNN is trained time-shift-invariantly. Time-shift invariance is achieved through weight sharing across time during training: Time shifted copies of the TDNN are made over the input range (from left to right in Fig.1). Backpropagation is then performed from an overall classification target vector (see TDNN diagram, three phoneme class targets (/b/, /d/, /g/) are shown in the output layer), resulting in gradients that will generally vary for each of the time-shifted network copies. Since such time-shifted networks are only copies, however, the position dependence is removed by weight sharing. In this example, this is done by averaging the gradients from each time-shifted copy before performing the weight update. In speech, time-shift invariant training was shown to learn weight matrices that are independent of precise positioning of the input. The weight matrices could also be shown to detect important acoustic-phonetic features that are known to be important for human speech perception, such as formant transitions, bursts, etc. [1] TDNNs could also be combined or grown by way of pre-training. [9]
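The gradient-averaging rule in the paragraph above can be checked numerically on a toy model: averaging the per-copy gradients of time-shifted network copies gives the same update as differentiating the loss with respect to a single shared weight. This is an illustrative sketch, not the original training code.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.standard_normal(6)        # a 1-D input "signal"
w = rng.standard_normal(3)        # one 3-tap weight vector shared across shifts

# Forward: apply the same w at every shift; loss = mean of squared outputs.
shifts = len(x) - len(w) + 1
outputs = np.array([x[t:t + 3] @ w for t in range(shifts)])

# Per-copy gradient of each copy's squared output w.r.t. its own weight copy:
per_copy_grads = [2 * outputs[t] * x[t:t + 3] for t in range(shifts)]
averaged = np.mean(per_copy_grads, axis=0)

# Direct numerical gradient of the mean loss w.r.t. the single shared w:
eps = 1e-6
direct = np.array([
    (np.mean([(x[t:t+3] @ (w + eps*e))**2 for t in range(shifts)])
     - np.mean([(x[t:t+3] @ (w - eps*e))**2 for t in range(shifts)])) / (2*eps)
    for e in np.eye(3)
])
print(np.allclose(averaged, direct, atol=1e-4))  # True: the two rules coincide
```

In other words, "make shifted copies, backpropagate, then average the gradients" is equivalent to training one shared (convolutional) weight set directly.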

Implementation

The precise architecture of TDNNs (time-delays, number of layers) is mostly determined by the designer depending on the classification problem and the most useful context sizes. The delays or context windows are chosen specific to each application. Work has also been done to create adaptable time-delay TDNNs [10] where this manual tuning is eliminated.
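A designer's choice of delays and layer count determines the total context the network sees. The sketch below, with hypothetical layer offsets in the style of modern sub-sampled TDNNs, shows how per-layer time offsets accumulate into the receptive field that the designer tunes per task.

```python
layer_offsets = [         # offsets relative to the current frame, per layer
    [-2, -1, 0, 1, 2],    # layer 1: dense 5-frame context
    [-1, 2],              # layer 2: sparse (sub-sampled) context
    [-3, 3],              # layer 3: wider, still sparse
]

def receptive_field(offsets_per_layer):
    """Total [earliest, latest] input frame seen by the top layer."""
    lo = hi = 0
    for offsets in offsets_per_layer:
        lo += min(offsets)
        hi += max(offsets)
    return lo, hi

print(receptive_field(layer_offsets))  # (-6, 7): a 14-frame total context
```

Widening any layer's offsets grows the context without retraining logic changes, which is what adaptable time-delay variants automate.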

State of the art

In early comparisons, TDNN-based phoneme recognizers performed favourably against HMM-based phone models. [1] [9] Modern deep TDNN architectures include many more hidden layers and sub-sample or pool connections over broader contexts at higher layers. They achieve up to 50% word error reduction over GMM-based acoustic models. [11] [12] While the successive layers of a TDNN are intended to learn features of increasing context width, they still model only local contexts. When longer-distance relationships and pattern sequences have to be processed, learning states and state sequences is important, and TDNNs can be combined with other modelling techniques. [13] [3] [4]

Applications

Speech recognition

TDNNs were introduced for speech recognition in the late 1980s [2] and initially focused on shift-invariant phoneme recognition. Speech lends itself well to TDNNs, as spoken sounds are rarely of uniform length and precise segmentation is difficult or impossible. By scanning a sound over past and future context, the TDNN is able to construct a model for the key elements of that sound in a time-shift invariant manner. This is particularly useful as sounds are smeared out through reverberation. [11] [12] Large phonetic TDNNs can be constructed modularly through pre-training and combining smaller networks. [9]

Large vocabulary speech recognition

Large vocabulary speech recognition requires recognizing the sequences of phonemes that make up words, subject to the constraints of a large pronunciation vocabulary. Integration of TDNNs into large vocabulary speech recognizers is possible by introducing state transitions and search between the phonemes that make up a word. The resulting Multi-State Time-Delay Neural Network (MS-TDNN) can be trained discriminatively at the word level, thereby optimizing the entire arrangement toward word recognition instead of phoneme classification. [13] [14] [4]

Speaker independence

Two-dimensional variants of the TDNNs were proposed for speaker independence. [3] Here, shift-invariance is applied to the time as well as to the frequency axis in order to learn hidden features that are independent of precise location in time and in frequency (the latter being due to speaker variability).
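The two-dimensional variant can be sketched as one shared kernel slid over both the time axis and the frequency axis of a spectrogram-like input, so the learned feature is invariant to shifts in either dimension. Sizes here are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(3)
spectrogram = rng.standard_normal((20, 12))   # (time, frequency)
kernel = rng.standard_normal((3, 3))          # shared across both axes

T, F = spectrogram.shape
kt, kf = kernel.shape
# Apply the same kernel at every (time, frequency) position: a 2-D convolution.
feature_map = np.array([
    [np.tanh(np.sum(kernel * spectrogram[t:t+kt, f:f+kf]))
     for f in range(F - kf + 1)]
    for t in range(T - kt + 1)
])
print(feature_map.shape)  # (18, 10)
```

Sharing weights along frequency as well as time is what makes the learned features tolerant of speaker-dependent spectral shifts.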

Reverberation

One of the persistent problems in speech recognition is recognizing speech when it is corrupted by echo and reverberation (as is the case in large rooms and distant microphones). Reverberation can be viewed as corrupting speech with delayed versions of itself. In general, it is difficult, however, to de-reverberate a signal as the impulse response function (and thus the convolutional noise experienced by the signal) is not known for any arbitrary space. The TDNN was shown to be effective to recognize speech robustly despite different levels of reverberation. [11] [12]

Lip-reading – audio-visual speech

TDNNs were also successfully used in early demonstrations of audio-visual speech, where the sounds of speech are complemented by visually reading lip movement. [14] Here, TDNN-based recognizers used visual and acoustic features jointly to achieve improved recognition accuracy, particularly in the presence of noise, where complementary information from an alternate modality could be fused nicely in a neural net.

Handwriting recognition

TDNNs have been used effectively in compact and high-performance handwriting recognition systems. Shift-invariance was also adapted to spatial patterns (x/y axes) in offline handwriting recognition from images. [4]

Video analysis

Video has a temporal dimension that makes a TDNN well suited to analysing motion patterns. Examples of this analysis include vehicle detection and pedestrian recognition. [15] When examining videos, subsequent images are fed into the TDNN as input, where each image is the next frame in the video. The strength of the TDNN comes from its ability to examine objects as they shift forward and backward in time, so that an object remains detectable as its position changes. If an object can be recognized in this manner, an application can anticipate where the object will be found in the future and act accordingly.

Image recognition

Two-dimensional TDNNs were later applied to other image-recognition tasks under the name of "Convolutional Neural Networks", where shift-invariant training is applied to the x/y axes of an image.



References

  1. Alexander Waibel, Toshiyuki Hanazawa, Geoffrey Hinton, Kiyohiro Shikano, Kevin J. Lang, Phoneme Recognition Using Time-Delay Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, Vol. 37, No. 3, pp. 328–339, March 1989.
  2. Alexander Waibel, Phoneme Recognition Using Time-Delay Neural Networks, SP87-100, Meeting of the Institute of Electronics, Information and Communication Engineers (IEICE), December 1987, Tokyo, Japan.
  3. John B. Hampshire and Alexander Waibel, Connectionist Architectures for Multi-Speaker Phoneme Recognition, Archived 2016-04-11 at the Wayback Machine, Advances in Neural Information Processing Systems, 1990, Morgan Kaufmann.
  4. Stefan Jaeger, Stefan Manke, Juergen Reichert, Alexander Waibel, Online handwriting recognition: the NPen++ recognizer, International Journal on Document Analysis and Recognition, Vol. 3, Issue 3, March 2001.
  5. Fukushima, Kunihiko (1980). "Neocognitron: A Self-organizing Neural Network Model for a Mechanism of Pattern Recognition Unaffected by Shift in Position" (PDF). Biological Cybernetics. 36 (4): 193–202. doi:10.1007/BF00344251. PMID 7370364. S2CID 206775608. Archived (PDF) from the original on 3 June 2014. Retrieved 16 November 2013.
  6. Fukushima, Kunihiko; Miyake, Sei (1982-01-01). "Neocognitron: A new algorithm for pattern recognition tolerant of deformations and shifts in position". Pattern Recognition. 15 (6): 455–469. doi:10.1016/0031-3203(82)90024-3. ISSN 0031-3203.
  7. LeCun, Yann; Boser, Bernhard; Denker, John; Henderson, Donnie; Howard, R.; Hubbard, Wayne; Jackel, Lawrence (1989). "Handwritten Digit Recognition with a Back-Propagation Network". Advances in Neural Information Processing Systems. 2. Morgan-Kaufmann.
  8. Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan. Archived from the original on 2021-03-07. Retrieved 2019-09-04.
  9. Alexander Waibel, Hidefumi Sawai, Kiyohiro Shikano, Modularity and Scaling in Large Phonemic Neural Networks, IEEE Transactions on Acoustics, Speech, and Signal Processing, December 1989.
  10. Christian Koehler and Joachim K. Anlauf, An adaptable time-delay neural-network algorithm for image sequence analysis, IEEE Transactions on Neural Networks 10.6 (1999): 1531–1536.
  11. Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts, Proceedings of Interspeech 2015.
  12. David Snyder, Daniel Garcia-Romero, Daniel Povey, Time-Delay Deep Neural Network-Based Universal Background Models for Speaker Recognition, Proceedings of ASRU 2015.
  13. Patrick Haffner, Alexander Waibel, Multi-State Time Delay Neural Networks for Continuous Speech Recognition, Archived 2016-04-11 at the Wayback Machine, Advances in Neural Information Processing Systems, 1992, Morgan Kaufmann.
  14. Christoph Bregler, Hermann Hild, Stefan Manke, Alexander Waibel, Improving Connected Letter Recognition by Lipreading, IEEE International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, 1993.
  15. Christian Woehler and Joachim K. Anlauf, Real-time object recognition on image sequences with the adaptable time delay neural network algorithm: applications for autonomous vehicles, Image and Vision Computing 19.9 (2001): 593–618.
  16. "Time Series and Dynamic Systems - MATLAB & Simulink". mathworks.com. Retrieved 21 June 2016.
  17. Vijayaditya Peddinti, Guoguo Chen, Vimal Manohar, Tom Ko, Daniel Povey, Sanjeev Khudanpur, JHU ASpIRE system: Robust LVCSR with TDNNs, i-vector Adaptation and RNN-LMs, Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2015.