Teacher forcing


Teacher forcing is an algorithm for training the weights of recurrent neural networks (RNNs). [1] It involves feeding observed sequence values (i.e. ground-truth samples) back into the RNN after each step, thus forcing the RNN to stay close to the ground-truth sequence. [2]
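As an illustration, the following minimal sketch (assuming PyTorch and arbitrary vocabulary and hidden sizes; the module and variable names are not from the cited sources) contrasts feeding the ground-truth token back into the RNN at each step with feeding back the model's own prediction:

```python
# Minimal sketch of teacher forcing for a next-token prediction task.
# All names, sizes and the loss are illustrative assumptions.
import torch
import torch.nn as nn

vocab_size, hidden_size = 100, 32
embed = nn.Embedding(vocab_size, hidden_size)
cell = nn.RNNCell(hidden_size, hidden_size)
readout = nn.Linear(hidden_size, vocab_size)
loss_fn = nn.CrossEntropyLoss()

def train_step(target_seq, teacher_forcing=True):
    """target_seq: LongTensor of token ids, shape (seq_len,)."""
    h = torch.zeros(1, hidden_size)
    inp = target_seq[0].view(1)              # start from the first ground-truth token
    loss = torch.tensor(0.0)
    for t in range(1, len(target_seq)):
        h = cell(embed(inp), h)
        logits = readout(h)
        loss = loss + loss_fn(logits, target_seq[t].view(1))
        if teacher_forcing:
            inp = target_seq[t].view(1)      # feed the observed (ground-truth) token back in
        else:
            inp = logits.argmax(dim=-1)      # free-running mode: feed the model's own prediction
    return loss / (len(target_seq) - 1)

loss = train_step(torch.randint(0, vocab_size, (10,)))
loss.backward()
```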

The term "teacher forcing" can be motivated by comparing the RNN to a human student taking a multi-part exam where the answer to each part (for example a mathematical calculation) depends on the answer to the preceding part. [3] In this analogy, rather than grading every answer in the end, with the risk that the student fails every single part even though they only made a mistake in the first one, a teacher records the score for each individual part and then tells the student the correct answer, to be used in the next part. [3]

The use of an external teacher signal is in contrast to real-time recurrent learning (RTRL). [4] Teacher signals are also known from oscillator networks. [5] The promise is that teacher forcing reduces training time. [6]

The term "teacher forcing" was introduced in 1989 by Ronald J. Williams and David Zipser, who reported that the technique was already being "frequently used in dynamical supervised learning tasks" around that time. [7] [2]

A NeurIPS 2016 paper introduced the related method of "professor forcing". [2]


Related Research Articles

<span class="mw-page-title-main">Artificial neural network</span> Computational model used in machine learning, based on connected, hierarchical functions

Artificial neural networks are a class of machine learning models built using principles of neuronal organization, as studied by connectionism, in the biological neural networks that constitute animal brains.

Speech recognition is an interdisciplinary subfield of computer science and computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the computer science, linguistics and computer engineering fields. The reverse process is speech synthesis.

<span class="mw-page-title-main">Connectionism</span> Cognitive science approach

Connectionism is the name of an approach to the study of human mental processes and cognition that utilizes mathematical models known as connectionist networks or artificial neural networks. Connectionism has had many 'waves' since its beginnings.

<span class="mw-page-title-main">Jürgen Schmidhuber</span> German computer scientist

Jürgen Schmidhuber is a German computer scientist noted for his work in the field of artificial intelligence, specifically artificial neural networks. He is a scientific director of the Dalle Molle Institute for Artificial Intelligence Research in Switzerland. He is also director of the Artificial Intelligence Initiative and professor of the Computer Science program in the Computer, Electrical, and Mathematical Sciences and Engineering (CEMSE) division at the King Abdullah University of Science and Technology (KAUST) in Saudi Arabia.

<span class="mw-page-title-main">Recurrent neural network</span> Computational model used in machine learning

A recurrent neural network (RNN) is one of the two broad types of artificial neural network, characterized by the direction in which information flows between its layers. In contrast to a feedforward neural network, in which information flows in one direction only, an RNN contains feedback connections, so the output of some nodes can affect subsequent input to the same nodes. This ability to use an internal state (memory) to process arbitrary sequences of inputs makes RNNs applicable to tasks such as unsegmented, connected handwriting recognition and speech recognition. The term "recurrent neural network" is used for the class of networks with an infinite impulse response, whereas "convolutional neural network" refers to the class with a finite impulse response. Both classes of networks exhibit temporal dynamic behavior. A finite-impulse recurrent network is a directed acyclic graph that can be unrolled and replaced with a strictly feedforward neural network, while an infinite-impulse recurrent network is a directed cyclic graph that cannot be unrolled.
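A minimal sketch of the recurrent state update described above, with assumed weight names and dimensions:

```python
# Recurrent state update h_t = tanh(W_x x_t + W_h h_{t-1} + b).
# Sizes and weight names are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
W_x = rng.standard_normal((hidden_size, input_size)) * 0.1
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1
b = np.zeros(hidden_size)

def rnn_forward(xs):
    """xs: sequence of input vectors; returns the hidden state at each step."""
    h = np.zeros(hidden_size)
    states = []
    for x in xs:
        h = np.tanh(W_x @ x + W_h @ h + b)   # the state feeds back into the next step
        states.append(h)
    return states

states = rnn_forward([rng.standard_normal(input_size) for _ in range(5)])
```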

<span class="mw-page-title-main">Neural network</span> Structure in biology and artificial intelligence

A neural network is a neural circuit of biological neurons, sometimes also called a biological neural network, or a network of artificial neurons or nodes in the case of an artificial neural network.

The random neural network (RNN) is a mathematical representation of an interconnected network of neurons or cells which exchange spiking signals. It was invented by Erol Gelenbe and is linked to the G-network model of queueing networks as well as to Gene Regulatory Network models. Each cell state is represented by an integer whose value rises when the cell receives an excitatory spike and drops when it receives an inhibitory spike. The spikes can originate outside the network itself, or they can come from other cells in the network. Cells whose internal excitatory state has a positive value are allowed to send out spikes of either kind to other cells in the network according to specific cell-dependent spiking rates. The model has a mathematical solution in steady state which provides the joint probability distribution of the network in terms of the individual probabilities that each cell is excited and able to send out spikes. Computing this solution is based on solving a set of non-linear algebraic equations whose parameters are related to the spiking rates of individual cells and their connectivity to other cells, as well as the arrival rates of spikes from outside the network. The RNN is a recurrent model, i.e. a neural network that is allowed to have complex feedback loops.

<span class="mw-page-title-main">Echo state network</span> Type of reservoir computer

An echo state network (ESN) is a type of reservoir computer that uses a recurrent neural network with a sparsely connected hidden layer. The connectivity and weights of the hidden neurons are fixed and randomly assigned, and only the weights of the output neurons are learned, so that the network can produce or reproduce specific temporal patterns. The main interest of this network is that although its behaviour is non-linear, the only weights modified during training are those of the synapses connecting the hidden neurons to the output neurons. The error function is therefore quadratic in the parameter vector and can be minimized by solving a linear system.
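A minimal echo state network sketch under these assumptions (fixed random reservoir, readout fitted by ridge regression; all names, sizes and the sine-wave task are illustrative):

```python
# Echo state network: only the readout weights W_out are fitted, by solving a linear system.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_res = 1, 100
W_in = rng.uniform(-0.5, 0.5, (n_res, n_in))
W_res = rng.uniform(-0.5, 0.5, (n_res, n_res))
W_res *= 0.9 / np.max(np.abs(np.linalg.eigvals(W_res)))   # keep spectral radius below 1

def collect_states(inputs):
    h = np.zeros(n_res)
    states = []
    for u in inputs:
        h = np.tanh(W_in @ u + W_res @ h)                  # reservoir weights stay fixed
        states.append(h)
    return np.array(states)

# Illustrative teacher signal: predict the next value of a sine wave.
u = np.sin(np.linspace(0, 20, 500)).reshape(-1, 1)
X, y = collect_states(u[:-1]), u[1:]
ridge = 1e-6
W_out = np.linalg.solve(X.T @ X + ridge * np.eye(n_res), X.T @ y)   # linear (ridge) solve
pred = X @ W_out
```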

<span class="mw-page-title-main">Long short-term memory</span> Artificial recurrent neural network architecture used in deep learning

A long short-term memory (LSTM) network is a recurrent neural network (RNN) designed to deal with the vanishing gradient problem present in traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models and other sequence learning methods. It aims to provide the RNN with a short-term memory that can last thousands of timesteps, hence "long short-term memory". It is applicable to classification, processing and prediction based on time series data, as in handwriting recognition, speech recognition, machine translation, speech activity detection, robot control, video games, and healthcare.
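A minimal sketch of a single LSTM step with forget, input and output gates (the weight names and shapes are illustrative assumptions, not a reference implementation):

```python
# One LSTM step; the additive cell update c = f*c_prev + i*g is what mitigates vanishing gradients.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W, b):
    """W, b: dicts of weights/biases acting on the concatenated [h_prev, x] vector."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(W["f"] @ z + b["f"])          # forget gate
    i = sigmoid(W["i"] @ z + b["i"])          # input gate
    o = sigmoid(W["o"] @ z + b["o"])          # output gate
    g = np.tanh(W["g"] @ z + b["g"])          # candidate cell state
    c = f * c_prev + i * g                    # cell state update
    h = o * np.tanh(c)                        # hidden state / output
    return h, c

hidden, inputs = 8, 4
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((hidden, hidden + inputs)) * 0.1 for k in "fiog"}
b = {k: np.zeros(hidden) for k in "fiog"}
h, c = lstm_step(rng.standard_normal(inputs), np.zeros(hidden), np.zeros(hidden), W, b)
```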

<span class="mw-page-title-main">Time delay neural network</span>

Time delay neural network (TDNN) is a multilayer artificial neural network architecture whose purpose is to 1) classify patterns with shift-invariance, and 2) model context at each layer of the network.

Backpropagation through time (BPTT) is a gradient-based technique for training certain types of recurrent neural networks. It can be used to train Elman networks. The algorithm was independently derived by numerous researchers.
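A minimal BPTT sketch for a simple tanh RNN with a squared-error loss (the loss, weight names and sizes are assumptions for illustration): the network is unrolled over the sequence and the error is propagated backwards through the unrolled steps, summing gradient contributions across time.

```python
# Backpropagation through time for h_t = tanh(W_x x_t + W_h h_{t-1}).
import numpy as np

rng = np.random.default_rng(0)
input_size, hidden_size = 4, 8
W_x = rng.standard_normal((hidden_size, input_size)) * 0.1
W_h = rng.standard_normal((hidden_size, hidden_size)) * 0.1

xs = [rng.standard_normal(input_size) for _ in range(6)]
targets = [rng.standard_normal(hidden_size) for _ in range(6)]

# Forward pass: store the states needed by the backward pass.
hs = [np.zeros(hidden_size)]
for x in xs:
    hs.append(np.tanh(W_x @ x + W_h @ hs[-1]))

# Backward pass over the unrolled graph, loss = 0.5 * sum_t ||h_t - target_t||^2.
dW_x = np.zeros_like(W_x)
dW_h = np.zeros_like(W_h)
dh_next = np.zeros(hidden_size)
for t in reversed(range(len(xs))):
    dh = (hs[t + 1] - targets[t]) + dh_next   # direct loss gradient + gradient from later steps
    dz = dh * (1.0 - hs[t + 1] ** 2)          # backprop through tanh
    dW_x += np.outer(dz, xs[t])               # gradients accumulate across timesteps
    dW_h += np.outer(dz, hs[t])
    dh_next = W_h.T @ dz                      # pass gradient to the previous step
```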

Ronald J. Williams is professor of computer science at Northeastern University, and one of the pioneers of neural networks. He co-authored a paper on the backpropagation algorithm which triggered a boom in neural network research. He also made fundamental contributions to the fields of recurrent neural networks and reinforcement learning. Together with Wenxu Tong and Mary Jo Ondrechen he developed Partial Order Optimum Likelihood (POOL), a machine learning method used in the prediction of active amino acids in protein structures. POOL is a maximum likelihood method with a monotonicity constraint and is a general predictor of properties that depend monotonically on the input features.

There are many types of artificial neural networks (ANN).

<span class="mw-page-title-main">Deep learning</span> Branch of machine learning

Deep learning is part of a broader family of machine learning methods based on artificial neural networks with representation learning. The adjective "deep" refers to the use of multiple layers in the network. The methods used can be supervised, semi-supervised or unsupervised.

<span class="mw-page-title-main">Convolutional neural network</span> Artificial neural network

A convolutional neural network (CNN) is a regularized type of feed-forward neural network that learns features by itself through filter (kernel) optimization. The vanishing and exploding gradients seen during backpropagation in earlier neural networks are mitigated by using regularized, shared weights over fewer connections. For example, each neuron in a fully connected layer would require 10,000 weights to process an image of 100 × 100 pixels, whereas a cascaded 5 × 5 convolution kernel needs only 25 shared weights. Higher-layer features are extracted from wider context windows than lower-layer features.
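The parameter counts in this example can be checked directly (a small illustrative calculation, not from the cited sources):

```python
# Fully connected neuron over a 100x100 image vs. a single shared 5x5 convolution kernel.
image_h, image_w = 100, 100
kernel_h, kernel_w = 5, 5

dense_weights_per_neuron = image_h * image_w    # 10,000 weights, one per input pixel
conv_weights_per_kernel = kernel_h * kernel_w   # 25 shared weights, reused at every position

print(dense_weights_per_neuron, conv_weights_per_kernel)   # 10000 25
```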

<span class="mw-page-title-main">Differentiable neural computer</span> Artificial neural network architecture

In artificial intelligence, a differentiable neural computer (DNC) is a memory augmented neural network architecture (MANN), which is typically recurrent in its implementation. The model was published in 2016 by Alex Graves et al. of DeepMind.

Connectionist temporal classification (CTC) is a type of neural network output and associated scoring function, for training recurrent neural networks (RNNs) such as LSTM networks to tackle sequence problems where the timing is variable. It can be used for tasks like on-line handwriting recognition or recognizing phonemes in speech audio. CTC refers to the outputs and scoring, and is independent of the underlying neural network structure. It was introduced in 2006.

In video games, various artificial intelligence techniques have been used in a variety of ways, ranging from non-player character (NPC) control to procedural content generation (PCG). Machine learning is a subset of artificial intelligence that focuses on using algorithms and statistical models to make machines act without specific programming. This is in sharp contrast to traditional methods of artificial intelligence such as search trees and expert systems.

<span class="mw-page-title-main">Attention (machine learning)</span> Machine learning technique

Machine learning-based attention is a mechanism that mimics cognitive attention. It calculates "soft" weights for each word (more precisely, for its embedding) in the context window, and can do so either in parallel or sequentially. "Soft" weights can change at each runtime, in contrast to "hard" weights, which are (pre-)trained, fine-tuned and then remain frozen.
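A minimal sketch of computing such soft weights with scaled dot-product scores and a softmax (one common formulation; the shapes and variable names here are assumptions):

```python
# Soft attention weights over a context window of word embeddings.
import numpy as np

rng = np.random.default_rng(0)
seq_len, d = 6, 16
Q = rng.standard_normal((seq_len, d))   # one query vector per position
K = rng.standard_normal((seq_len, d))   # one key vector per position
V = rng.standard_normal((seq_len, d))   # one value vector per position

scores = Q @ K.T / np.sqrt(d)                                   # similarity of each query to each key
scores -= scores.max(axis=-1, keepdims=True)                    # for numerical stability
weights = np.exp(scores) / np.exp(scores).sum(-1, keepdims=True)  # soft weights, each row sums to 1
output = weights @ V                                            # weighted mix of values per position
```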

References

  1. Kolen, John F.; Kremer, Stefan C. (15 January 2001). A Field Guide to Dynamical Recurrent Networks. John Wiley & Sons. pp. 202–. ISBN 978-0-7803-5369-5.
  2. Lamb, Alex M.; Goyal, Anirudh; Zhang, Ying; Zhang, Saizheng; Courville, Aaron C.; Bengio, Yoshua (2016). "Professor Forcing: A New Algorithm for Training Recurrent Networks". Advances in Neural Information Processing Systems. 29. Curran Associates, Inc.
  3. Wong, Wanshun (2019-10-15). "What is Teacher Forcing?". Towards Data Science. Retrieved 2022-03-25.
  4. Zhang, Ming (31 July 2008). Artificial Higher Order Neural Networks for Economics and Business. IGI Global. pp. 195–. ISBN 978-1-59904-898-7.
  5. Chauvin, Yves; Rumelhart, David E. (1 February 2013). Backpropagation: Theory, Architectures, and Applications. Psychology Press. pp. 473–. ISBN 978-1-134-77581-1.
  6. Bekey, George; Goldberg, Kenneth Y. (30 November 1992). Neural Networks in Robotics. Springer Science & Business Media. pp. 247–. ISBN 978-0-7923-9268-2.
  7. Williams, Ronald J.; Zipser, David (June 1989). "A Learning Algorithm for Continually Running Fully Recurrent Neural Networks". Neural Computation. 1 (2): 270–280. CiteSeerX 10.1.1.52.9724. doi:10.1162/neco.1989.1.2.270. ISSN 0899-7667. S2CID 14711886.