Gated recurrent units (GRUs) are a gating mechanism in recurrent neural networks, introduced in 2014 by Kyunghyun Cho et al. The GRU is like a long short-term memory (LSTM) with a forget gate, but has fewer parameters than the LSTM, as it lacks an output gate. The GRU's performance on certain tasks of polyphonic music modeling, speech signal modeling, and natural language processing was found to be similar to that of the LSTM. GRUs have been shown to exhibit better performance on certain smaller and less frequent datasets.


Architecture

There are several variations on the full gated unit, with gating done using the previous hidden state and the bias in various combinations, and a simplified form called the minimal gated unit. The operator \odot denotes the Hadamard product in the following.


Fully gated unit

Initially, for t = 0, the output vector is h_0 = 0.

\begin{aligned}
z_t &= \sigma_g(W_z x_t + U_z h_{t-1} + b_z) \\
r_t &= \sigma_g(W_r x_t + U_r h_{t-1} + b_r) \\
\hat{h}_t &= \phi_h(W_h x_t + U_h (r_t \odot h_{t-1}) + b_h) \\
h_t &= z_t \odot h_{t-1} + (1 - z_t) \odot \hat{h}_t
\end{aligned}

Variables
* x_t: input vector
* h_t: output vector
* \hat{h}_t: candidate activation vector
* z_t: update gate vector
* r_t: reset gate vector
* W, U and b: parameter matrices and vector

Activation functions
* \sigma_g: the original is a sigmoid function.
* \phi_h: the original is a hyperbolic tangent.

Alternative activation functions are possible, provided that \sigma_g(x) \in [0, 1].

Alternate forms can be created by changing z_t and r_t:
* Type 1, each gate depends only on the previous hidden state and the bias.
  \begin{aligned} z_t &= \sigma_g(U_z h_{t-1} + b_z) \\ r_t &= \sigma_g(U_r h_{t-1} + b_r) \end{aligned}
* Type 2, each gate depends only on the previous hidden state.
  \begin{aligned} z_t &= \sigma_g(U_z h_{t-1}) \\ r_t &= \sigma_g(U_r h_{t-1}) \end{aligned}
* Type 3, each gate is computed using only the bias.
  \begin{aligned} z_t &= \sigma_g(b_z) \\ r_t &= \sigma_g(b_r) \end{aligned}
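The update equations above translate almost line for line into code. The following is a minimal NumPy sketch of one step of the fully gated unit; the function name gru_cell, the parameter dictionary, and the chosen dimensions are illustrative assumptions, not part of the original formulation.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # sigma_g in the equations above

def gru_cell(x_t, h_prev, p):
    # Update gate z_t and reset gate r_t.
    z_t = sigmoid(p["W_z"] @ x_t + p["U_z"] @ h_prev + p["b_z"])
    r_t = sigmoid(p["W_r"] @ x_t + p["U_r"] @ h_prev + p["b_r"])
    # Candidate activation: the reset gate scales the previous state elementwise.
    h_hat = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (r_t * h_prev) + p["b_h"])
    # Output: elementwise interpolation between h_{t-1} and the candidate.
    return z_t * h_prev + (1.0 - z_t) * h_hat

# Illustrative usage: random parameters, h_0 = 0, and a short input sequence.
rng = np.random.default_rng(0)
d_in, d_h = 4, 3
p = {k: rng.standard_normal((d_h, d_in)) * 0.1 for k in ("W_z", "W_r", "W_h")}
p.update({k: rng.standard_normal((d_h, d_h)) * 0.1 for k in ("U_z", "U_r", "U_h")})
p.update({k: np.zeros(d_h) for k in ("b_z", "b_r", "b_h")})

h = np.zeros(d_h)                           # h_0 = 0, as stated above
for x_t in rng.standard_normal((5, d_in)):  # a length-5 sequence of inputs
    h = gru_cell(x_t, h, p)

The Type 1 to Type 3 variants only change how z_t and r_t are computed; for example, replacing the update-gate line with sigmoid(p["U_z"] @ h_prev + p["b_z"]) (and the reset gate analogously) would give the Type 1 form.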


Minimal gated unit

The minimal gated unit is similar to the fully gated unit, except the update and reset gate vectors are merged into a single forget gate. This also implies that the equation for the output vector must be changed:

\begin{aligned}
f_t &= \sigma_g(W_f x_t + U_f h_{t-1} + b_f) \\
\hat{h}_t &= \phi_h(W_h x_t + U_h (f_t \odot h_{t-1}) + b_h) \\
h_t &= (1 - f_t) \odot h_{t-1} + f_t \odot \hat{h}_t
\end{aligned}

Variables
* x_t: input vector
* h_t: output vector
* \hat{h}_t: candidate activation vector
* f_t: forget vector
* W, U and b: parameter matrices and vector
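For comparison, here is the same kind of sketch for the minimal gated unit; the name mgu_cell and the parameter dictionary are again hypothetical. Only the single forget gate and the changed output equation differ from the fully gated version above.

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))  # sigma_g in the equations above

def mgu_cell(x_t, h_prev, p):
    # A single forget gate f_t replaces the update and reset gates.
    f_t = sigmoid(p["W_f"] @ x_t + p["U_f"] @ h_prev + p["b_f"])
    # Candidate activation: the forget gate scales the previous state elementwise.
    h_hat = np.tanh(p["W_h"] @ x_t + p["U_h"] @ (f_t * h_prev) + p["b_h"])
    # Output equation differs from the fully gated unit: (1 - f_t) keeps the old state.
    return (1.0 - f_t) * h_prev + f_t * h_hat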

