Time delay neural network

Last updated
TDNN diagram TDNN Diagram.png
TDNN diagram

Time delay neural network (TDNN) [1] is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.


Shift-invariant classification means that the classifier does not require explicit segmentation prior to classification. For the classification of a temporal pattern (such as speech), the TDNN thus avoids having to determine the beginning and end points of sounds before classifying them.

For contextual modelling in a TDNN, each neural unit at each layer receives input not only from activations/features at the layer below, but from a pattern of unit output and its context. For time signals each unit receives as input the activation patterns over time from units below. Applied to two-dimensional classification (images, time-frequency patterns), the TDNN can be trained with shift-invariance in the coordinate space and avoids precise segmentation in the coordinate space.


The TDNN was introduced in the late 1980s and applied to a task of phoneme classification for automatic speech recognition in speech signals where the automatic determination of precise segments or feature boundaries was difficult or impossible. Because the TDNN recognizes phonemes and their underlying acoustic/phonetic features, independent of position in time, it improved performance over static classification. [1] [2] It was also applied to two-dimensional signals (time-frequency patterns in speech, [3] and coordinate space pattern in OCR [4] ).

Max pooling

In 1990, Yamaguchi et al. introduced the concept of max pooling. They did so by combining TDNNs with max pooling in order to realize a speaker independent isolated word recognition system. [5]


The Time Delay Neural Network, like other neural networks, operates with multiple interconnected layers of perceptrons, and is implemented as a feedforward neural network. All neurons (at each layer) of a TDNN receive inputs from the outputs of neurons at the layer below but with two differences:

  1. Unlike regular Multi-Layer perceptrons, all units in a TDNN, at each layer, obtain inputs from a contextual window of outputs from the layer below. For time varying signals (e.g. speech), each unit has connections to the output from units below but also to the time-delayed (past) outputs from these same units. This models the units' temporal pattern/trajectory. For two-dimensional signals (e.g. time-frequency patterns or images), a 2-D context window is observed at each layer. Higher layers have inputs from wider context windows than lower layers and thus generally model coarser levels of abstraction.
  2. Shift-invariance is achieved by explicitly removing position dependence during backpropagation training. This is done by making time-shifted copies of a network across the dimension of invariance (here: time). The error gradient is then computed by backpropagation through all these networks from an overall target vector, but before performing the weight update, the error gradients associated with shifted copies are averaged and thus shared and constraint to be equal. Thus, all position dependence from backpropagation training through the shifted copies is removed and the copied networks learn the most salient hidden features shift-invariantly, i.e. independent of their precise position in the input data. Shift-invariance is also readily extended to multiple dimensions by imposing similar weight-sharing across copies that are shifted along multiple dimensions. [3] [4]


In the case of a speech signal, inputs are spectral coefficients over time.

In order to learn critical acoustic-phonetic features (for example formant transitions, bursts, frication, etc.) without first requiring precise localization, the TDNN is trained time-shift-invariantly. Time-shift invariance is achieved through weight sharing across time during training: Time shifted copies of the TDNN are made over the input range (from left to right in Fig.1). Backpropagation is then performed from an overall classification target vector (see TDNN diagram, three phoneme class targets (/b/, /d/, /g/) are shown in the output layer), resulting in gradients that will generally vary for each of the time-shifted network copies. Since such time-shifted networks are only copies, however, the position dependence is removed by weight sharing. In this example, this is done by averaging the gradients from each time-shifted copy before performing the weight update. In speech, time-shift invariant training was shown to learn weight matrices that are independent of precise positioning of the input. The weight matrices could also be shown to detect important acoustic-phonetic features that are known to be important for human speech perception, such as formant transitions, bursts, etc. [1] TDNNs could also be combined or grown by way of pre-training. [6]


The precise architecture of TDNNs (time-delays, number of layers) is mostly determined by the designer depending on the classification problem and the most useful context sizes. The delays or context windows are chosen specific to each application. Work has also been done to create adaptable time-delay TDNNs [7] where this manual tuning is eliminated.

State of the art

TDNN-based phoneme recognizers compared favourably in early comparisons with HMM-based phone models. [1] [6] Modern deep TDNN architectures include many more hidden layers and sub-sample or pool connections over broader contexts at higher layers. They achieve up to 50% word error reduction over GMM-based acoustic models. [8] [9] While the different layers of TDNNs are intended to learn features of increasing context width, they do model local contexts. When longer-distance relationships and pattern sequences have to be processed, learning states and state-sequences is important and TDNNs can be combined with other modelling techniques. [10] [3] [4]


Speech recognition

TDNNs used to solve problems in speech recognition that were introduced in 1989 [2] and initially focused on shift-invariant phoneme recognition. Speech lends itself nicely to TDNNs as spoken sounds are rarely of uniform length and precise segmentation is difficult or impossible. By scanning a sound over past and future, the TDNN is able to construct a model for the key elements of that sound in a time-shift invariant manner. This is particularly useful as sounds are smeared out through reverberation. [8] [9] Large phonetic TDNNs can be constructed modularly through pre-training and combining smaller networks. [6]

Large vocabulary speech recognition

Large vocabulary speech recognition requires recognizing sequences of phonemes that make up words subject to the constraints of a large pronunciation vocabulary. Integration of TDNNs into large vocabulary speech recognizers is possible by introducing state transitions and search between phonemes that make up a word. The resulting Multi-State Time-Delay Neural Network (MS-TDNN) can be trained discriminative from the word level, thereby optimizing the entire arrangement toward word recognition instead of phoneme classification. [10] [11] [4]

Speaker independence

Two-dimensional variants of the TDNNs were proposed for speaker independence. [3] Here, shift-invariance is applied to the time as well as to the frequency axis in order to learn hidden features that are independent of precise location in time and in frequency (the latter being due to speaker variability).


One of the persistent problems in speech recognition is recognizing speech when it is corrupted by echo and reverberation (as is the case in large rooms and distant microphones). Reverberation can be viewed as corrupting speech with delayed versions of itself. In general, it is difficult, however, to de-reverberate a signal as the impulse response function (and thus the convolutional noise experienced by the signal) is not known for any arbitrary space. The TDNN was shown to be effective to recognize speech robustly despite different levels of reverberation. [8] [9]

Lip-reading – audio-visual speech

TDNNs were also successfully used in early demonstrations of audio-visual speech, where the sounds of speech are complemented by visually reading lip movement. [11] Here, TDNN-based recognizers used visual and acoustic features jointly to achieve improved recognition accuracy, particularly in the presence of noise, where complementary information from an alternate modality could be fused nicely in a neural net.

Handwriting recognition

TDNNs have been used effectively in compact and high-performance handwriting recognition systems. Shift-invariance was also adapted to spatial patterns (x/y-axes) in image offline handwriting recognition. [4]

Video analysis

Video has a temporal dimension that makes a TDNN an ideal solution to analysing motion patterns. An example of this analysis is a combination of vehicle detection and recognizing pedestrians. [12] When examining videos, subsequent images are fed into the TDNN as input where each image is the next frame in the video. The strength of the TDNN comes from its ability to examine objects shifted in time forward and backward to define an object detectable as the time is altered. If an object can be recognized in this manner, an application can plan on that object to be found in the future and perform an optimal action.

Image recognition

Two-dimensional TDNNs were later applied to other image-recognition tasks under the name of "Convolutional Neural Networks", where shift-invariant training is applied to the x/y axes of an image.

Common libraries

See also

Related Research Articles

<span class="mw-page-title-main">Artificial neural network</span> Computational model used in machine learning, based on connected, hierarchical functions

Artificial neural networks (ANNs), usually simply called neural networks (NNs) or neural nets, are computing systems inspired by the biological neural networks that constitute animal brains.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers with the main benefit of searchability. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

<span class="mw-page-title-main">Perceptron</span> Algorithm for supervised learning of binary classifiers

In machine learning, the perceptron is an algorithm for supervised learning of binary classifiers. A binary classifier is a function which can decide whether or not an input, represented by a vector of numbers, belongs to some specific class. It is a type of linear classifier, i.e. a classification algorithm that makes its predictions based on a linear predictor function combining a set of weights with the feature vector.

<span class="mw-page-title-main">Unsupervised learning</span> Machine learning technique

Unsupervised learning is a type of algorithm that learns patterns from untagged data. The hope is that through mimicry, which is an important mode of learning in people, the machine is forced to build a compact internal representation of its world and then generate imaginative content from it. In contrast to supervised learning where data is tagged by an expert, e.g. as a "ball" or "fish", unsupervised methods exhibit self-organization that captures patterns as probability densities or a combination of neural feature preferences. The other levels in the supervision spectrum are reinforcement learning where the machine is given only a numerical performance score as guidance, and semi-supervised learning where a smaller portion of the data is tagged.

The memory-prediction framework is a theory of brain function created by Jeff Hawkins and described in his 2004 book On Intelligence. This theory concerns the role of the mammalian neocortex and its associations with the hippocampi and the thalamus in matching sensory inputs to stored memory patterns and how this process leads to predictions of what will happen in the future.

<span class="mw-page-title-main">Recurrent neural network</span> Computational model used in machine learning

A recurrent neural network (RNN) is a class of artificial neural networks where connections between nodes can create a cycle, allowing output from some nodes to affect subsequent input to the same nodes. This allows it to exhibit temporal dynamic behavior. Derived from feedforward neural networks, RNNs can use their internal state (memory) to process variable length sequences of inputs. This makes them applicable to tasks such as unsegmented, connected handwriting recognition or speech recognition. Recurrent neural networks are theoretically Turing complete and can run arbitrary programs to process arbitrary sequences of inputs.

<span class="mw-page-title-main">Neural network</span> Structure in biology and artificial intelligence

A neural network is a network or circuit of biological neurons, or, in a modern sense, an artificial neural network, composed of artificial neurons or nodes. Thus, a neural network is either a biological neural network, made up of biological neurons, or an artificial neural network, used for solving artificial intelligence (AI) problems. The connections of the biological neuron are modeled in artificial neural networks as weights between nodes. A positive weight reflects an excitatory connection, while negative values mean inhibitory connections. All inputs are modified by a weight and summed. This activity is referred to as a linear combination. Finally, an activation function controls the amplitude of the output. For example, an acceptable range of output is usually between 0 and 1, or it could be −1 and 1.

<span class="mw-page-title-main">Multilayer perceptron</span> Type of feedforward neural network

A multilayer perceptron (MLP) is a fully connected class of feedforward artificial neural network (ANN). The term MLP is used ambiguously, sometimes loosely to mean any feedforward ANN, sometimes strictly to refer to networks composed of multiple layers of perceptrons ; see § Terminology. Multilayer perceptrons are sometimes colloquially referred to as "vanilla" neural networks, especially when they have a single hidden layer.

Automatic target recognition (ATR) is the ability for an algorithm or device to recognize targets or other objects based on data obtained from sensors.

There are many types of artificial neural networks (ANN).

Alex Waibel American computer scientist

Alexander Waibel is a professor of Computer Science at Carnegie Mellon University and Karlsruhe Institute of Technology. Waibel's research interests focus on speech recognition and translation and human communication signals and systems. Waibel is known for the time delay neural network (TDNN), which he developed. It is the first convolutional neural network (CNN) trained by gradient descent, using the backpropagation algorithm. Alex Waibel introduced the TDNN 1987 at ATR in Japan.

<span class="mw-page-title-main">Deep learning</span> Branch of machine learning

Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. Learning can be supervised, semi-supervised or unsupervised.

<span class="mw-page-title-main">Convolutional neural network</span> Artificial neural network

In deep learning, a convolutional neural network is a class of artificial neural network (ANN), most commonly applied to analyze visual imagery. CNNs are also known as Shift Invariant or Space Invariant Artificial Neural Networks (SIANN), based on the shared-weight architecture of the convolution kernels or filters that slide along input features and provide translation-equivariant responses known as feature maps. Counter-intuitively, most convolutional neural networks are not invariant to translation, due to the downsampling operation they apply to the input. They have applications in image and video recognition, recommender systems, image classification, image segmentation, medical image analysis, natural language processing, brain–computer interfaces, and financial time series.

<span class="mw-page-title-main">DeepDream</span> Software program

DeepDream is a computer vision program created by Google engineer Alexander Mordvintsev that uses a convolutional neural network to find and enhance patterns in images via algorithmic pareidolia, thus creating a dream-like appearance reminiscent of a psychedelic experience in the deliberately overprocessed images.

Bidirectional recurrent neural networks (BRNN) connect two hidden layers of opposite directions to the same output. With this form of generative deep learning, the output layer can get information from past (backwards) and future (forward) states simultaneously. Invented in 1997 by Schuster and Paliwal, BRNNs were introduced to increase the amount of input information available to the network. For example, multilayer perceptron (MLPs) and time delay neural network (TDNNs) have limitations on the input data flexibility, as they require their input data to be fixed. Standard recurrent neural network (RNNs) also have restrictions as the future input information cannot be reached from the current state. On the contrary, BRNNs do not require their input data to be fixed. Moreover, their future input information is reachable from the current state.

<span class="mw-page-title-main">AlexNet</span> Convolutional neural network

AlexNet is the name of a convolutional neural network (CNN) architecture, designed by Alex Krizhevsky in collaboration with Ilya Sutskever and Geoffrey Hinton, who was Krizhevsky's Ph.D. advisor.

<span class="mw-page-title-main">Outline of machine learning</span> Overview of and topical guide to machine learning

The following outline is provided as an overview of and topical guide to machine learning. Machine learning is a subfield of soft computing within computer science that evolved from the study of pattern recognition and computational learning theory in artificial intelligence. In 1959, Arthur Samuel defined machine learning as a "field of study that gives computers the ability to learn without being explicitly programmed". Machine learning explores the study and construction of algorithms that can learn from and make predictions on data. Such algorithms operate by building a model from an example training set of input observations in order to make data-driven predictions or decisions expressed as outputs, rather than following strictly static program instructions.

A Capsule Neural Network (CapsNet) is a machine learning system that is a type of artificial neural network (ANN) that can be used to better model hierarchical relationships. The approach is an attempt to more closely mimic biological neural organization.

The history of artificial neural networks (ANN) began with Warren McCulloch and Walter Pitts (1943) who created a computational model for neural networks based on algorithms called threshold logic. This model paved the way for research to split into two approaches. One approach focused on biological processes while the other focused on the application of neural networks to artificial intelligence. This work led to work on nerve networks and their link to finite automata.

LeNet is a convolutional neural network structure proposed by LeCun et al. in 1998,. In general, LeNet refers to LeNet-5 and is a simple convolutional neural network. Convolutional neural networks are a kind of feed-forward neural network whose artificial neurons can respond to a part of the surrounding cells in the coverage range and perform well in large-scale image processing.


  1. 1 2 3 4 Alexander Waibel, Tashiyuki Hanazawa, Geoffrey Hinton, Kiyohito Shikano, Kevin J. Lang, Phoneme Recognition Using Time-Delay Neural Networks , IEEE Transactions on Acoustics, Speech, and Signal Processing, Volume 37, No. 3, pp. 328. - 339 March 1989.
  2. 1 2 Alexander Waibel, Phoneme Recognition Using Time-Delay Neural Networks , SP87-100, Meeting of the Institute of Electrical, Information and Communication Engineers (IEICE), December, 1987,Tokyo, Japan.
  3. 1 2 3 4 John B. Hampshire and Alexander Waibel, Connectionist Architectures for Multi-Speaker Phoneme Recognition , Advances in Neural Information Processing Systems, 1990, Morgan Kaufmann.
  4. 1 2 3 4 5 Stefan Jaeger, Stefan Manke, Juergen Reichert, Alexander Waibel, Online handwriting recognition: the NPen++recognizer , International Journal on Document Analysis and Recognition Vol. 3, Issue 3, March 2001
  5. Yamaguchi, Kouichi; Sakamoto, Kenji; Akabane, Toshio; Fujimoto, Yoshiji (November 1990). A Neural Network for Speaker-Independent Isolated Word Recognition. First International Conference on Spoken Language Processing (ICSLP 90). Kobe, Japan.
  6. 1 2 3 Alexander Waibel, Hidefumi Sawai, Kiyohiro Shikano, Modularity and Scaling in Large Phonemic Neural Networks , IEEE Transactions on Acoustics, Speech, and Signal Processing, December, December 1989.
  7. Christian Koehler and Joachim K. Anlauf, An adaptable time-delay neural-network algorithm for image sequence analysis , IEEE Transactions on Neural Networks 10.6 (1999): 1531-1536
  8. 1 2 3 Vijayaditya Peddinti, Daniel Povey, Sanjeev Khudanpur, A time delay neural network architecture for efficient modeling of long temporal contexts , Proceedings of Interspeech 2015
  9. 1 2 3 David Snyder, Daniel Garcia-Romero, Daniel Povey, A Time-Delay Deep Neural Network-Based Universal Background Models for Speaker Recognition , Proceedings of ASRU 2015.
  10. 1 2 Patrick Haffner, Alexander Waibel, Multi-State Time Delay Neural Networks for Continuous Speech Recognition , Advances in Neural Information Processing Systems, 1992, Morgan Kaufmann.
  11. 1 2 Christoph Bregler, Hermann Hild, Stefan Manke, Alexander Waibel, Improving Connected Letter Recognition by Lipreading , IEEE Proceedings International Conference on Acoustics, Speech, and Signal Processing, Minneapolis, 1993.
  12. Christian Woehler and Joachim K. Anlauf, Real-time object recognition on image sequences with the adaptable time delay neural network algorithm—applications for autonomous vehicles." Image and Vision Computing 19.9 (2001): 593-618.
  13. "Time Series and Dynamic Systems - MATLAB & Simulink". mathworks.com. Retrieved 21 June 2016.
  14. Vijayaditya Peddinti, Guoguo Chen, Vimal Manohar, Tom Ko, Daniel Povey, Sanjeev Khudanpur, JHU ASpIRE system: Robust LVCSR with TDNNs i-vector Adaptation and RNN-LMs , Proceedings of the IEEE Automatic Speech Recognition and Understanding Workshop, 2015.