CNN - Residual Networks

Researchers have proposed many CNN architectures that have proven themselves across different scenarios and datasets. These architectures keep evolving, with newer ones outperforming older versions thanks to novel techniques and algorithms.

Just mentioning a few important ones here:

ResNet: ResNets, or Residual Networks, take activations from one layer and feed them into a later layer.

Plain Network:
$a^{[l]} \xrightarrow{Linear} z^{[l+1]} \xrightarrow{ReLU} a^{[l+1]} \xrightarrow{Linear} z^{[l+2]} \xrightarrow{ReLU} a^{[l+2]}$

$z^{[l+1]} = W^{[l+1]}a^{[l]}+b^{[l+1]}$
$a^{[l+1]} = g(z^{[l+1]})$
$z^{[l+2]} = W^{[l+2]}a^{[l+1]}+b^{[l+2]}$
$a^{[l+2]} = g(z^{[l+2]})$
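
For concreteness, here is a minimal Python (PyTorch) sketch of this plain forward pass. The layer size of 4, the random initialization, and the variable names are illustrative assumptions, not part of the original notes.

```python
import torch

torch.manual_seed(0)
a_l = torch.relu(torch.randn(4))            # a[l], the output of some earlier ReLU

W1, b1 = torch.randn(4, 4), torch.zeros(4)  # layer l+1 parameters
W2, b2 = torch.randn(4, 4), torch.zeros(4)  # layer l+2 parameters

z1 = W1 @ a_l + b1        # z[l+1] = W[l+1] a[l] + b[l+1]
a_l1 = torch.relu(z1)     # a[l+1] = g(z[l+1])
z2 = W2 @ a_l1 + b2       # z[l+2] = W[l+2] a[l+1] + b[l+2]
a_l2 = torch.relu(z2)     # a[l+2] = g(z[l+2])   -- no skip connection yet
```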

Residual Block: Activations from $a^{[l]}$ are fed forward and added to $z^{[l+2]}$ inside the activation function that produces $a^{[l+2]}$:

$a^{[l+2]} = g(z^{[l+2]}+a^{[l]})$

Residual Block

A residual block makes it possible to train a much deeper network and helps with the vanishing/exploding gradient problem.
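
As a sketch of how the skip connection changes the forward pass, here is a residual block built from two fully connected layers in PyTorch. The class name, layer width, and use of nn.Linear are assumptions made for illustration, not the exact block from the figure.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Linear -> ReLU -> Linear, with the input added back before the final ReLU."""
    def __init__(self, dim):
        super().__init__()
        self.fc1 = nn.Linear(dim, dim)
        self.fc2 = nn.Linear(dim, dim)

    def forward(self, a_l):
        z1 = self.fc1(a_l)           # z[l+1] = W[l+1] a[l] + b[l+1]
        a1 = torch.relu(z1)          # a[l+1] = g(z[l+1])
        z2 = self.fc2(a1)            # z[l+2] = W[l+2] a[l+1] + b[l+2]
        return torch.relu(z2 + a_l)  # a[l+2] = g(z[l+2] + a[l])  <- skip connection

block = ResidualBlock(16)
print(block(torch.randn(8, 16)).shape)  # torch.Size([8, 16])
```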

Theoretically, the training error of a deep network should keep decreasing as more layers are added. In reality, for a plain network the training error decreases at first and then starts to increase.

With ResNets, the training error keeps decreasing as the number of layers increases.

Why do ResNets work so well?

Normally, making a network deeper hurts its ability to fit the training data well. ResNets fix this.

How?

Let there be a big neural network trained on input data X, and suppose a residual block is added to the end of it:

Residual Block

$a^{[l+2]} = g(z^{[l+2]} + a^{[l]}) = g(W^{[l+2]}a^{[l+1]} + b^{[l+2]} + a^{[l]})$

If L2 regularization (weight decay) is used, $W^{[l+2]}$ will tend to shrink toward zero.

Say $W^{[l+2]} = 0$ and $b^{[l+2]} = 0$; then $a^{[l+2]} = g(a^{[l]})$.

Since $g(z)$ is ReLU and $a^{[l]}$ is already non-negative (it is itself the output of a ReLU),

$a^{[l+2]} = a^{[l]}$

Since this corresponds to an identity function from $a^{[l]}$ to $a^{[l+2]}$, it is very easy for the network to learn. That’s why adding a residual block does not hurt performance.
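
A quick numerical check of this identity argument (the vector size and seed are arbitrary choices):

```python
import torch

torch.manual_seed(0)
a_l = torch.relu(torch.randn(16))   # a[l] >= 0, since it comes out of a ReLU
a_l1 = torch.relu(torch.randn(16))  # a[l+1], whatever the middle layer produced

W2 = torch.zeros(16, 16)            # suppose weight decay has pushed W[l+2] to 0
b2 = torch.zeros(16)                # ... and b[l+2] to 0

z2 = W2 @ a_l1 + b2                 # z[l+2] = 0
a_l2 = torch.relu(z2 + a_l)         # g(z[l+2] + a[l]) = g(a[l]) = a[l]
print(torch.equal(a_l2, a_l))       # True: the block reduces to the identity
```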

Also, $z^{[l+2]}$ and $a^{[l]}$ need to have the same dimensions, which is why "same" convolutions are typically used.

If the dimensions do differ, $a^{[l]}$ is multiplied by a matrix $W_s$ so that $W_s a^{[l]}$ matches the dimensions of $z^{[l+2]}$.
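
In the convolutional setting, $W_s$ is often realized as a 1x1 convolution on the shortcut path. A hedged PyTorch sketch (the channel counts and image size are made-up values):

```python
import torch
import torch.nn as nn

# Main path changes the number of channels (64 -> 128) with "same" convolutions,
# so a 1x1 convolution W_s projects the skip input to the matching shape.
main_path = nn.Sequential(
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(),
    nn.Conv2d(128, 128, kernel_size=3, padding=1),
)
shortcut = nn.Conv2d(64, 128, kernel_size=1)  # W_s as a 1x1 convolution

x = torch.randn(1, 64, 32, 32)                # a[l]
out = torch.relu(main_path(x) + shortcut(x))  # g(z[l+2] + W_s a[l])
print(out.shape)                              # torch.Size([1, 128, 32, 32])
```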

A full Residual Network:

Residual Network
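
As a toy end-to-end sketch, residual blocks can be stacked into a small network like this; the block design, channel counts, and depth below are assumptions for illustration, not the exact architecture in the figure.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Conv -> ReLU -> Conv with a skip connection ("same" convolutions keep shapes equal)."""
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, kernel_size=3, padding=1)

    def forward(self, a_l):
        z = self.conv2(torch.relu(self.conv1(a_l)))
        return torch.relu(z + a_l)  # a[l+2] = g(z[l+2] + a[l])

# Stem convolution, a stack of residual blocks, then a small classifier head.
model = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(),
    *[ResidualBlock(64) for _ in range(4)],
    nn.AdaptiveAvgPool2d(1),
    nn.Flatten(),
    nn.Linear(64, 10),
)
print(model(torch.randn(2, 3, 32, 32)).shape)  # torch.Size([2, 10])
```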

Written on December 20, 2017