Gated recurrent unit

Last updated January 03, 2025

Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al.^[1] The GRU is like a long short-term memory (LSTM) with a gating mechanism to input or forget certain features,^[2] but lacks a context vector or output gate, resulting in fewer parameters than LSTM.^[3] GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling and natural language processing was found to be similar to that of LSTM.^[4]^[5] GRUs showed that gating is indeed helpful in general, and Bengio's team came to no concrete conclusion on which of the two gating units was better.^[6]^[7]

Architecture

There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called minimal gated unit.^[8]

The operator $\odot$ denotes the Hadamard product in the following.

Fully gated unit

Initially, for $t=0$ , the output vector is $h_{0}=0$ .

{\begin{aligned}z_{t}&=\sigma (W_{z}x_{t}+U_{z}h_{t-1}+b_{z})\\r_{t}&=\sigma (W_{r}x_{t}+U_{r}h_{t-1}+b_{r})\\{\hat {h}}_{t}&=\phi (W_{h}x_{t}+U_{h}(r_{t}\odot h_{t-1})+b_{h})\\h_{t}&=(1-z_{t})\odot h_{t-1}+z_{t}\odot {\hat {h}}_{t}\end{aligned}}

Variables ( $d$ denotes the number of input features and $e$ the number of output features):

$x_{t}\in \mathbb {R} ^{d}$ : input vector
$h_{t}\in \mathbb {R} ^{e}$ : output vector
${\hat {h}}_{t}\in \mathbb {R} ^{e}$ : candidate activation vector
$z_{t}\in (0,1)^{e}$ : update gate vector
$r_{t}\in (0,1)^{e}$ : reset gate vector
$W\in \mathbb {R} ^{e\times d}$ , $U\in \mathbb {R} ^{e\times e}$ and $b\in \mathbb {R} ^{e}$ : parameter matrices and vector which need to be learned during training

Activation functions

$\sigma$ : The original is a logistic function.
$\phi$ : The original is a hyperbolic tangent.

Alternative activation functions are possible, provided that $\sigma (x)\in [0,1]$ .

Alternate forms can be created by changing $z_{t}$ and $r_{t}$ ^[9]

Type 1, each gate depends only on the previous hidden state and the bias.
${\begin{aligned}z_{t}&=\sigma (U_{z}h_{t-1}+b_{z})\\r_{t}&=\sigma (U_{r}h_{t-1}+b_{r})\\\end{aligned}}$
Type 2, each gate depends only on the previous hidden state.
${\begin{aligned}z_{t}&=\sigma (U_{z}h_{t-1})\\r_{t}&=\sigma (U_{r}h_{t-1})\\\end{aligned}}$
Type 3, each gate is computed using only the bias.
${\begin{aligned}z_{t}&=\sigma (b_{z})\\r_{t}&=\sigma (b_{r})\\\end{aligned}}$

Minimal gated unit

The minimal gated unit (MGU) is similar to the fully gated unit, except the update and reset gate vector is merged into a forget gate. This also implies that the equation for the output vector must be changed:^[10]

{\begin{aligned}f_{t}&=\sigma (W_{f}x_{t}+U_{f}h_{t-1}+b_{f})\\{\hat {h}}_{t}&=\phi (W_{h}x_{t}+U_{h}(f_{t}\odot h_{t-1})+b_{h})\\h_{t}&=(1-f_{t})\odot h_{t-1}+f_{t}\odot {\hat {h}}_{t}\end{aligned}}

Variables

$x_{t}$ : input vector
$h_{t}$ : output vector
${\hat {h}}_{t}$ : candidate activation vector
$f_{t}$ : forget vector
$W$ , $U$ and $b$ : parameter matrices and vector

Light gated recurrent unit

The light gated recurrent unit (LiGRU)^[4] removes the reset gate altogether, replaces tanh with the ReLU activation, and applies batch normalization (BN):

{\begin{aligned}z_{t}&=\sigma (\operatorname {BN} (W_{z}x_{t})+U_{z}h_{t-1})\\{\tilde {h}}_{t}&=\operatorname {ReLU} (\operatorname {BN} (W_{h}x_{t})+U_{h}h_{t-1})\\h_{t}&=z_{t}\odot h_{t-1}+(1-z_{t})\odot {\tilde {h}}_{t}\end{aligned}}

LiGRU has been studied from a Bayesian perspective.^[11] This analysis yielded a variant called light Bayesian recurrent unit (LiBRU), which showed slight improvements over the LiGRU on speech recognition tasks.

Related Research Articles

In mathematical physics and mathematics, the Pauli matrices are a set of three $2 \times 2$ complex matrices that are traceless, Hermitian, involutory and unitary. Usually indicated by the Greek letter sigma, they are occasionally denoted by tau when used in connection with isospin symmetries.

In vector calculus and differential geometry the generalized Stokes theorem, also called the Stokes–Cartan theorem, is a statement about the integration of differential forms on manifolds, which both simplifies and generalizes several theorems from vector calculus. In particular, the fundamental theorem of calculus is the special case where the manifold is a line segment, Green’s theorem and Stokes' theorem are the cases of a surface in $or and the divergence theorem is the case of a volume in Hence, the theorem is sometimes referred to as the fundamental theorem of multivariate calculus .$

In mathematics, the special unitary group of degree $n$ , denoted $SU(n)$ , is the Lie group of $n \times n$ unitary matrices with determinant 1.

Recurrent neural networks (RNNs) are a class of artificial neural network commonly used for sequential data processing. Unlike feedforward neural networks, which process data in a single pass, RNNs process data across multiple time steps, making them well-adapted for modelling and processing text, speech, and time series.

The softmax function, also known as softargmax or normalized exponential function, converts a vector of $K$ real numbers into a probability distribution of $K$ possible outcomes. It is a generalization of the logistic function to multiple dimensions, and is used in multinomial logistic regression. The softmax function is often used as the last activation function of a neural network to normalize the output of a network to a probability distribution over predicted output classes.

Long short-term memory (LSTM) is a type of recurrent neural network (RNN) aimed at mitigating the vanishing gradient problem commonly encountered by traditional RNNs. Its relative insensitivity to gap length is its advantage over other RNNs, hidden Markov models, and other sequence learning methods. It aims to provide a short-term memory for RNN that can last thousands of timesteps. The name is made in analogy with long-term memory and short-term memory and their relationship, studied by cognitive psychologists since the early 20th century.

Stokes' theorem, also known as the Kelvin–Stokes theorem after Lord Kelvin and George Stokes, the fundamental theorem for curls or simply the curl theorem, is a theorem in vector calculus on $. Given a vector field, the theorem relates the integral of the curl of the vector field over some surface, to the line integral of the vector field around the boundary of the surface. The classical theorem of Stokes can be stated in one sentence:$

In machine learning, the vanishing gradient problem is encountered when training neural networks with gradient-based learning methods and backpropagation. In such methods, during each training iteration, each neural network weight receives an update proportional to the partial derivative of the loss function with respect to the current weight. The problem is that as the network depth or sequence length increases, the gradient magnitude typically is expected to decrease, slowing the training process. In the worst case, this may completely stop the neural network from further learning. As one example of this problem, traditional activation functions such as the hyperbolic tangent function have gradients in the range $[-1,1]$ , and backpropagation computes gradients using the chain rule. This has the effect of multiplying $n$ of these small numbers to compute gradients of the early layers in an $n$ -layer network, meaning that the gradient decreases exponentially with $n$ while the early layers train very slowly.

In signal processing, Feature space Maximum Likelihood Linear Regression (fMLLR) is a global feature transform that are typically applied in a speaker adaptive way, where fMLLR transforms acoustic features to speaker adapted features by a multiplication operation with a transformation matrix. In some literature, fMLLR is also known as the Constrained Maximum Likelihood Linear Regression (cMLLR).

Low-rank matrix approximations are essential tools in the application of kernel methods to large-scale learning problems.

A transformer is a deep learning architecture that was developed by researchers at Google and is based on the multi-head attention mechanism, which was proposed in the 2017 paper "Attention Is All You Need". Text is converted to numerical representations called tokens, and each token is converted into a vector via lookup from a word embedding table. At each layer, each token is then contextualized within the scope of the context window with other (unmasked) tokens via a parallel multi-head attention mechanism, allowing the signal for key tokens to be amplified and less important tokens to be diminished.

In the study of artificial neural networks (ANNs), the neural tangent kernel (NTK) is a kernel that describes the evolution of deep artificial neural networks during their training by gradient descent. It allows ANNs to be studied using theoretical tools from kernel methods.

A Neural Network Gaussian Process (NNGP) is a Gaussian process (GP) obtained as the limit of a certain type of sequence of neural networks. Specifically, a wide variety of network architectures converges to a GP in the infinitely wide limit, in the sense of distribution. The concept constitutes an intensional definition, i.e., a NNGP is just a GP, but distinguished by how it is obtained.

In mathematics, a dual system, dual pair or a duality over a field $is a triple consisting of two vector spaces, and, over and a non-degenerate bilinear map .$

Attention is a machine learning method that determines the relative importance of each component in a sequence relative to the other components in that sequence. In natural language processing, importance is represented by "soft" weights assigned to each word in a sentence. More generally, attention encodes vectors called token embeddings across a fixed-width sequence that can range from tens to millions of tokens in size.

A flow-based generative model is a generative model used in machine learning that explicitly models a probability distribution by leveraging normalizing flow, which is a statistical method using the change-of-variable law of probabilities to transform a simple distribution into a complex one.

Graph neural networks (GNN) are specialized artificial neural networks that are designed for tasks whose inputs are graphs.

Deep backward stochastic differential equation method is a numerical method that combines deep learning with Backward stochastic differential equation (BSDE). This method is particularly useful for solving high-dimensional problems in financial derivatives pricing and risk management. By leveraging the powerful function approximation capabilities of deep neural networks, deep BSDE addresses the computational challenges faced by traditional numerical methods in high-dimensional settings.

In neural networks, the gating mechanism is an architectural motif for controlling the flow of activation and gradient signals. They are most prominently used in recurrent neural networks (RNNs), but have also found applications in other architectures.

In deep learning, weight initialization describes the initial step in creating a neural network. A neural network contains trainable parameters that are modified during training: weight initialization is the pre-training step of assigning initial values to these parameters.

References

↑ Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, DZmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". arXiv: 1406.1078 [cs.CL].
↑ Felix Gers; Jürgen Schmidhuber; Fred Cummins (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.
↑ "Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML". Wildml.com. 2015-10-27. Archived from the original on 2021-11-10. Retrieved May 18, 2016.
1 2 Ravanelli, Mirco; Brakel, Philemon; Omologo, Maurizio; Bengio, Yoshua (2018). "Light Gated Recurrent Units for Speech Recognition". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (2): 92–102. arXiv: 1803.10225 . doi:10.1109/TETCI.2017.2762739. S2CID 4402991.
↑ Su, Yuahang; Kuo, Jay (2019). "On extended long short-term memory and dependent bidirectional recurrent neural network". Neurocomputing. 356: 151–161. arXiv: 1803.01686 . doi:10.1016/j.neucom.2019.04.044. S2CID 3675055.
↑ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv: 1412.3555 [cs.NE].
↑ Gruber, N.; Jockisch, A. (2020), "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?", Frontiers in Artificial Intelligence, 3: 40, doi: 10.3389/frai.2020.00040 , PMC 7861254 , PMID 33733157, S2CID 220252321
↑ Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv: 1412.3555 [cs.NE].
↑ Dey, Rahul; Salem, Fathi M. (2017-01-20). "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks". arXiv: 1701.05923 [cs.NE].
↑ Heck, Joel; Salem, Fathi M. (2017-01-12). "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks". arXiv: 1701.03452 [cs.NE].
↑ Bittar, Alexandre; Garner, Philip N. (May 2021). "A Bayesian Interpretation of the Light Gated Recurrent Unit". ICASSP 2021. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE. pp. 2965–2969. 10.1109/ICASSP39728.2021.9414259.

This page is based on this Wikipedia article
Text is available under the CC BY-SA 4.0 license; additional terms may apply.
Images, videos and audio are available under their respective licenses.

[1] Cho, Kyunghyun; van Merrienboer, Bart; Bahdanau, DZmitry; Bougares, Fethi; Schwenk, Holger; Bengio, Yoshua (2014). "Learning Phrase Representations using RNN Encoder-Decoder for Statistical Machine Translation". arXiv: 1406.1078 [cs.CL].

[lstm1999-2] Felix Gers; Jürgen Schmidhuber; Fred Cummins (1999). "Learning to forget: Continual prediction with LSTM". 9th International Conference on Artificial Neural Networks: ICANN '99. Vol. 1999. pp. 850–855. doi:10.1049/cp:19991218. ISBN 0-85296-721-7.

[MyUser_Wildml.com_May_18_2016c-3] "Recurrent Neural Network Tutorial, Part 4 – Implementing a GRU/LSTM RNN with Python and Theano – WildML". Wildml.com. 2015-10-27. Archived from the original on 2021-11-10. Retrieved May 18, 2016.

[Ravalli2018-4] 1 2 Ravanelli, Mirco; Brakel, Philemon; Omologo, Maurizio; Bengio, Yoshua (2018). "Light Gated Recurrent Units for Speech Recognition". IEEE Transactions on Emerging Topics in Computational Intelligence. 2 (2): 92–102. arXiv: 1803.10225 . doi:10.1109/TETCI.2017.2762739. S2CID 4402991.

[Su2019-5] Su, Yuahang; Kuo, Jay (2019). "On extended long short-term memory and dependent bidirectional recurrent neural network". Neurocomputing. 356: 151–161. arXiv: 1803.01686 . doi:10.1016/j.neucom.2019.04.044. S2CID 3675055.

[MyUser_Arxiv.org_May_18_2016c-6] Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv: 1412.3555 [cs.NE].

[gruber_jockisch-7] Gruber, N.; Jockisch, A. (2020), "Are GRU cells more specific and LSTM cells more sensitive in motive classification of text?", Frontiers in Artificial Intelligence, 3: 40, doi: 10.3389/frai.2020.00040 , PMC 7861254 , PMID 33733157, S2CID 220252321

[Chung_18_2016c-8] Chung, Junyoung; Gulcehre, Caglar; Cho, KyungHyun; Bengio, Yoshua (2014). "Empirical Evaluation of Gated Recurrent Neural Networks on Sequence Modeling". arXiv: 1412.3555 [cs.NE].

[9] Dey, Rahul; Salem, Fathi M. (2017-01-20). "Gate-Variants of Gated Recurrent Unit (GRU) Neural Networks". arXiv: 1701.05923 [cs.NE].

[10] Heck, Joel; Salem, Fathi M. (2017-01-12). "Simplified Minimal Gated Unit Variations for Recurrent Neural Networks". arXiv: 1701.03452 [cs.NE].

[11] Bittar, Alexandre; Garner, Philip N. (May 2021). "A Bayesian Interpretation of the Light Gated Recurrent Unit". ICASSP 2021. 2021 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). Toronto, ON, Canada: IEEE. pp. 2965–2969. 10.1109/ICASSP39728.2021.9414259.

[1]

[2]

[3]

[4]

[5]

[6]

[7]

[8]

[9]

[10]

[11]