Gated Recurrent Units (GRUs)

Gated Recurrent Units (GRUs) are a variant of recurrent neural networks (RNNs) designed to capture long-range dependencies in sequential data.

A typical RNN unit computes its activation with the following equation:

$a^{\langle t \rangle} = \tanh(W_a [a^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_a)$
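As a point of reference, a single RNN step can be sketched in NumPy. The sizes (hidden size 4, input size 3) and the random initialization are illustrative assumptions, not part of the original notes:

```python
import numpy as np

def rnn_step(a_prev, x_t, Wa, ba):
    # a<t> = tanh(Wa [a<t-1>, x<t>] + ba)
    concat = np.concatenate([a_prev, x_t])  # stack previous activation and input
    return np.tanh(Wa @ concat + ba)

# Illustrative sizes: hidden size 4, input size 3
rng = np.random.default_rng(0)
a_prev = np.zeros(4)
x_t = rng.standard_normal(3)
Wa = rng.standard_normal((4, 4 + 3)) * 0.1
ba = np.zeros(4)
a_t = rnn_step(a_prev, x_t, Wa, ba)  # shape (4,), values in (-1, 1)
```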



Consider the following sentences, where the verb form depends on a noun much earlier in the sequence:

(1) The cat, which already ate …, was full.
(2) The cats, which already ate …, were full.

GRU (simplified):

New variable: $c$, which can be thought of as a memory cell that stores long-range information.

In the simplified GRU, the memory cell equals the activation: $c^{\langle t \rangle} = a^{\langle t \rangle}$.

At every time step, we are going to consider overwriting $c^{\langle t \rangle}$. The candidate for replacing $c^{\langle t \rangle}$ is $\tilde c^{\langle t \rangle}$.

$\tilde c^{\langle t \rangle} = \tanh(W_c[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$

Gate: the update gate is denoted $\Gamma_u$.

$\Gamma_u = \sigma(W_u[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$ … (i)

The gate $\Gamma_u$ decides whether or not to update $c^{\langle t \rangle}$ with $\tilde c^{\langle t \rangle}$.

For the singular “cat”, the network might set $c^{\langle t \rangle} = 1$ and carry that value forward through the network until the verb (“was”) is reached. The job of the gate $\Gamma_u$ is to decide when to update $c^{\langle t \rangle}$ with $\tilde c^{\langle t \rangle}$.



$c^{\langle t \rangle} = \Gamma_u * \tilde c^{\langle t \rangle} + (1-\Gamma_u) * c^{\langle t-1 \rangle}$ … (ii)

The multiplications in the above equation are elementwise.

If $\Gamma_u = 1$, then $c^{\langle t \rangle} = \tilde c^{\langle t \rangle}$ (the cell is overwritten with the candidate).
If $\Gamma_u = 0$, then $c^{\langle t \rangle} = c^{\langle t-1 \rangle}$ (the old value is kept).

Representation of Simplified GRU Unit


Advantages of GRU:

  1. Mitigates the vanishing gradient problem
  2. Can learn long range dependencies

$c^{\langle t \rangle}$ can be a vector, in which case $\Gamma_u$ and $\tilde c^{\langle t \rangle}$ are vectors of the same dimension. The elementwise multiplication in equation (ii) then lets the gate decide, unit by unit, which components to update and which to keep.
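A tiny numeric illustration of this per-unit behavior (the values are invented for the example):

```python
import numpy as np

gamma_u = np.array([1.0, 0.0, 1.0])    # per-unit update decisions
c_prev  = np.array([0.5, -0.2, 0.9])   # previous memory cell
c_tilde = np.array([-0.7, 0.3, 0.1])   # candidate values

# Equation (ii), elementwise: units 0 and 2 take the candidate,
# unit 1 keeps its old value.
c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev
# c_t == [-0.7, -0.2, 0.1]
```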

Simplified GRU Equations

(1) $\tilde c^{\langle t \rangle} = \tanh(W_c[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$

(2) $\Gamma_u = \sigma(W_u[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$

(3) $c^{\langle t \rangle} = \Gamma_u * \tilde c^{\langle t \rangle} + (1-\Gamma_u) * c^{\langle t-1 \rangle}$
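These three equations translate directly into code. Here is a NumPy sketch of one simplified GRU step; the sizes and random initialization are illustrative assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_simplified(c_prev, x_t, Wc, bc, Wu, bu):
    concat = np.concatenate([c_prev, x_t])
    c_tilde = np.tanh(Wc @ concat + bc)                # (1) candidate memory
    gamma_u = sigmoid(Wu @ concat + bu)                # (2) update gate, in (0, 1)
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev   # (3) elementwise blend
    return c_t

# Illustrative sizes: memory cell of size 4, input of size 3
rng = np.random.default_rng(1)
n_c, n_x = 4, 3
Wc, Wu = (rng.standard_normal((n_c, n_c + n_x)) * 0.1 for _ in range(2))
bc, bu = np.zeros(n_c), np.zeros(n_c)
c_t = gru_step_simplified(np.zeros(n_c), rng.standard_normal(n_x), Wc, bc, Wu, bu)
```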

Full GRU

The full GRU incorporates a second gate: in addition to the update gate $\Gamma_u$, it has a relevance gate $\Gamma_r$ (also called the reset gate).

Equations (the alternative symbols shown are the standard notation used in the literature):

(1) $\tilde h$: $\tilde c^{\langle t \rangle} = \tanh(W_c[\Gamma_r * c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_c)$

(2) $r$: $\Gamma_r = \sigma(W_r[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_r)$ (Relevance Gate)

(3) $u$: $\Gamma_u = \sigma(W_u[c^{\langle t-1 \rangle}, x^{\langle t \rangle}] + b_u)$ (Update Gate)

(4) $h$: $c^{\langle t \rangle} = \Gamma_u * \tilde c^{\langle t \rangle} + (1-\Gamma_u) * c^{\langle t-1 \rangle}$

The relevance gate $\Gamma_r$ in equation (1) tells how relevant $c^{\langle t-1 \rangle}$ is for computing the candidate $\tilde c^{\langle t \rangle}$.
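Putting the four equations together, one full GRU step can be sketched as follows (again a NumPy sketch with illustrative sizes, not a canonical implementation):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step_full(c_prev, x_t, Wc, bc, Wr, br, Wu, bu):
    concat = np.concatenate([c_prev, x_t])
    gamma_r = sigmoid(Wr @ concat + br)                # (2) relevance gate
    gamma_u = sigmoid(Wu @ concat + bu)                # (3) update gate
    gated = np.concatenate([gamma_r * c_prev, x_t])    # apply relevance to c<t-1>
    c_tilde = np.tanh(Wc @ gated + bc)                 # (1) candidate memory
    c_t = gamma_u * c_tilde + (1 - gamma_u) * c_prev   # (4) blend old and new
    return c_t

# Illustrative sizes: memory cell of size 4, input of size 3
rng = np.random.default_rng(2)
n_c, n_x = 4, 3
Wc, Wr, Wu = (rng.standard_normal((n_c, n_c + n_x)) * 0.1 for _ in range(3))
bc, br, bu = (np.zeros(n_c) for _ in range(3))
c_t = gru_step_full(np.zeros(n_c), rng.standard_normal(n_x), Wc, bc, Wr, br, Wu, bu)
```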

Written on February 17, 2018