Recurrent Neural Network (RNN) - Forward Propagation

Standard neural networks cannot take into account the elements of a sequence that come before or after a data point. For example, to decide whether a word in a sentence is a name, we need to know the words surrounding it. In the sentences below, ‘Teddy’ refers to a name in (1), while it refers to a toy in (2).

(1) : “Teddy Roosevelt was an American president”
(2) : “Teddy bear is a popular toy among children”

Problem with standard networks

A standard network looks at each data point in isolation. It has no way to share features learned at one position of the text with other positions. Also, the lengths of the input and output can differ from one example (i) in the dataset to another.

[Figure: standard neural network]

Recurrent Neural Network Representation

(Output is actually the predicted output $\hat y$.)

[Figure: recurrent neural network representation]

In some of the literature, this is represented in the following way:

[Figure: alternative RNN representation]

Weights of the RNN:

Each of $x^{\langle t \rangle}$, $\hat y^{\langle t \rangle}$ and $a^{\langle t \rangle}$ is a vector of numbers. A weight matrix is assigned to each operation (represented by an arrow). The key weight matrices are:

$W_{aa}$: To calculate ‘a’ using ‘a’ [W]
$W_{ax}$: To calculate ‘a’ using ‘x’ [U]
$W_{ya}$: To calculate ‘y’ using ‘a’ [V] (Calculate $\hat y$)

How to remember: $W_{ax}$ uses an ‘x’-like quantity to calculate an ‘a’-like quantity. The first subscript names the result, the second names the input.
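The naming rule also fixes the shape of each matrix: rows match the quantity being calculated and columns match the quantity being used. The sketch below illustrates this with NumPy. The sizes `n_a`, `n_x` and `n_y` and the small random initialization are assumptions chosen only for illustration; they do not come from the text above.

```python
import numpy as np

# Hypothetical sizes, chosen only for illustration:
# n_a = size of the activation a, n_x = size of the input x, n_y = size of the output y.
n_a, n_x, n_y = 4, 6, 3

rng = np.random.default_rng(0)

# Rows follow the first subscript (what is calculated),
# columns follow the second subscript (what is used).
W_aa = rng.standard_normal((n_a, n_a)) * 0.01  # 'a' from 'a'  (also called W)
W_ax = rng.standard_normal((n_a, n_x)) * 0.01  # 'a' from 'x'  (also called U)
W_ya = rng.standard_normal((n_y, n_a)) * 0.01  # 'y' from 'a'  (also called V)

for name, mat in [("W_aa", W_aa), ("W_ax", W_ax), ("W_ya", W_ya)]:
    print(name, mat.shape)
```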

The weights are represented in the figure below:

[Figure: RNN with the weights $W_{aa}$, $W_{ax}$ and $W_{ya}$]

The weight matrices U, V and W are also used to represent $W_{ax}$, $W_{ya}$ and $W_{aa}$ respectively.

[Figure: alternative representation]

Forward Propagation

[Figure: forward propagation]

Forward propagation can be expressed with the following equations:

For the 1st unit:

Input: $x^{\langle 1 \rangle}$, Output: $\hat y^{\langle 1 \rangle}$

The RNN activation $a^{\langle 0 \rangle}$ is initialized as a vector of zeros.

Forward Propagation:
$a^{\langle 1 \rangle}$ = $g_1(W_{aa}a^{\langle 0 \rangle}+W_{ax}x^{\langle 1 \rangle}+b_a)$
$\hat y^{\langle 1 \rangle}$ = $g_2(W_{ya}a^{\langle 1 \rangle}+b_y)$

The activation function $g_1()$ is tanh() or ReLU. The activation function $g_2()$ is sigmoid() or softmax().
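The two equations for the first unit can be sketched in NumPy as follows. This is a minimal illustration, assuming $g_1$ = tanh and $g_2$ = softmax; the function name `rnn_step`, the helper `softmax`, and the sizes `n_a`, `n_x`, `n_y` are hypothetical names and values introduced here, not part of the text above.

```python
import numpy as np

def softmax(z):
    # Subtract the column-wise max before exponentiating for numerical stability.
    e = np.exp(z - z.max(axis=0, keepdims=True))
    return e / e.sum(axis=0, keepdims=True)

def rnn_step(a_prev, x_t, W_aa, W_ax, W_ya, b_a, b_y):
    """One forward step, assuming g_1 = tanh and g_2 = softmax."""
    a_t = np.tanh(W_aa @ a_prev + W_ax @ x_t + b_a)
    y_hat_t = softmax(W_ya @ a_t + b_y)
    return a_t, y_hat_t

# Hypothetical sizes and random weights, for illustration only.
n_a, n_x, n_y = 4, 6, 3
rng = np.random.default_rng(0)
W_aa = rng.standard_normal((n_a, n_a)) * 0.1
W_ax = rng.standard_normal((n_a, n_x)) * 0.1
W_ya = rng.standard_normal((n_y, n_a)) * 0.1
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))

a0 = np.zeros((n_a, 1))             # a<0> starts as a vector of zeros
x1 = rng.standard_normal((n_x, 1))  # x<1>, a stand-in for a real input

a1, y_hat1 = rnn_step(a0, x1, W_aa, W_ax, W_ya, b_a, b_y)
print(a1.shape, y_hat1.shape)
```

Because $a^{\langle 0 \rangle}$ is all zeros, the term $W_{aa}a^{\langle 0 \rangle}$ contributes nothing to this first step; it only starts to matter from the second unit onwards.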

Generalized representation:

$a^{\langle t \rangle}$ = $g_1(W_{aa}a^{\langle t-1 \rangle}+W_{ax}x^{\langle t \rangle}+b_a)$
$\hat y^{\langle t \rangle}$ = $g_2(W_{ya}a^{\langle t \rangle}+b_y)$
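Applied over a whole sequence, the generalized equations become a loop in which the same weights are reused at every time step and each activation feeds the next step. The sketch below is one possible way to write that loop, again assuming tanh for $g_1$ and softmax for $g_2$; `rnn_forward`, the sequence length `T` and the sizes are hypothetical choices made for this example.

```python
import numpy as np

def rnn_forward(xs, a0, W_aa, W_ax, W_ya, b_a, b_y):
    """Run the RNN over a sequence of column vectors xs[0], ..., xs[T-1]."""
    a_t = a0
    activations, predictions = [], []
    for x_t in xs:
        # a<t> depends on a<t-1> and x<t>; the weights are shared across steps.
        a_t = np.tanh(W_aa @ a_t + W_ax @ x_t + b_a)
        # Softmax over the output vector (max subtracted for stability).
        z = W_ya @ a_t + b_y
        e = np.exp(z - z.max())
        y_hat_t = e / e.sum()
        activations.append(a_t)
        predictions.append(y_hat_t)
    return activations, predictions

# Hypothetical sizes: T time steps, each input a column vector of size n_x.
T, n_a, n_x, n_y = 5, 4, 6, 3
rng = np.random.default_rng(0)
W_aa = rng.standard_normal((n_a, n_a)) * 0.1
W_ax = rng.standard_normal((n_a, n_x)) * 0.1
W_ya = rng.standard_normal((n_y, n_a)) * 0.1
b_a = np.zeros((n_a, 1))
b_y = np.zeros((n_y, 1))

xs = rng.standard_normal((T, n_x, 1))  # stand-in for a real input sequence
a0 = np.zeros((n_a, 1))

activations, predictions = rnn_forward(xs, a0, W_aa, W_ax, W_ya, b_a, b_y)
print(len(activations), predictions[0].shape)
```

Nothing in the loop depends on `T`, which is how the same network can handle inputs of different lengths.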

Representing the weight matrices in a different way:

The equation $a^{\langle t \rangle}$ = $g_1(W_{aa}a^{\langle t-1 \rangle}+W_{ax}x^{\langle t \rangle}+b_a)$ can be represented in a simpler way by combining $W_{aa}$ and $W_{ax}$ horizontally into a new matrix represented as $W_a$.

The matrices are placed side by side horizontally, such that $W_a$ = $[W_{aa}:W_{ax}]$. If a is in $R^{100}$ and x is in $R^{10000}$, then $W_{aa}$ is a 100x100 matrix, $W_{ax}$ is a 100x10000 matrix, and $W_a$ is a 100x10100 matrix.

Similarly, $a^{\langle t-1 \rangle}$ and $x^{\langle t \rangle}$ can be stacked one below the other (vertically). Since a is in $R^{100}$ and x is in $R^{10000}$, the stacked vector is in $R^{10100}$.

$[W_{aa}:W_{ax}]\begin{bmatrix} a^{\langle t-1 \rangle} \\ x^{\langle t \rangle} \end{bmatrix}$ = $W_{aa}a^{\langle t-1 \rangle}+W_{ax}x^{\langle t \rangle}$
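This identity can be checked numerically. The sketch below uses the dimensions from the example above (a in $R^{100}$, x in $R^{10000}$) with random values standing in for real weights and inputs; the variable names are illustrative.

```python
import numpy as np

n_a, n_x = 100, 10000  # dimensions from the example in the text
rng = np.random.default_rng(0)

# Random stand-ins for the weights, the previous activation and the input.
W_aa = rng.standard_normal((n_a, n_a))
W_ax = rng.standard_normal((n_a, n_x))
a_prev = rng.standard_normal((n_a, 1))
x_t = rng.standard_normal((n_x, 1))

W_a = np.hstack([W_aa, W_ax])       # side by side: 100 x 10100
stacked = np.vstack([a_prev, x_t])  # one below the other: 10100 x 1

combined = W_a @ stacked
separate = W_aa @ a_prev + W_ax @ x_t

print(W_a.shape, stacked.shape)
print(np.allclose(combined, separate))
```

The two results agree up to floating-point rounding, since the block product expands to exactly the two separate products.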

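The equations can now be written as:

$a^{\langle t \rangle}$ = $g_1(W_a[a^{\langle t-1 \rangle},x^{\langle t \rangle}]+b_a)$
$\hat y^{\langle t \rangle}$ = $g_2(W_y a^{\langle t \rangle}+b_y)$

Here $[a^{\langle t-1 \rangle},x^{\langle t \rangle}]$ denotes the vertically stacked vector, and $W_y$ is shorthand for $W_{ya}$.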

Written on February 14, 2018