Vanishing/exploding gradients: when the derivatives (slopes) become too big or too small while training a very deep network.

Problem Construct:
Consider a neural network with $l$ layers. Let the biases be $b = 0$, let $W^t$ be the weight matrix of layer $t$, and let the activation function be linear, $g(z) = z$.

In this case, $Y = W^l W^{l-1} \cdots W^3 W^2 W^1 X$ (layer 1 is applied to $X$ first, so $W^1$ sits next to $X$).

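If every weight is 1.5 (for example, each $W^t$ is 1.5 times the identity matrix), then $Y = 1.5^l X$. This is the exploding case: $1.5^l$ grows exponentially with $l$, so the activations, and likewise the gradients, become very large when the network is deep.

If every weight is 0.5, then $Y = 0.5^l X$. This is the vanishing case: $0.5^l$ shrinks exponentially with $l$, so the activations, and likewise the gradients, become very small when the network is deep.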
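The effect is easy to reproduce numerically. The sketch below is only an illustration under assumptions that are not part of the original notes: every layer uses the same weight matrix, a scaled identity, the input is a vector of ones, and the helper name `forward_linear` is made up for this example.

```python
import numpy as np

def forward_linear(scale, num_layers, n=4):
    """Pass an all-ones input through `num_layers` linear layers.

    Assumption for illustration: every layer has W = scale * I,
    b = 0 and the identity activation g(z) = z.
    """
    x = np.ones((n, 1))
    W = scale * np.eye(n)
    for _ in range(num_layers):
        x = W @ x  # one layer: z = W x, a = g(z) = z
    return x

exploding = forward_linear(1.5, 50)  # every entry equals 1.5 ** 50
vanishing = forward_linear(0.5, 50)  # every entry equals 0.5 ** 50

print(exploding[0, 0], vanishing[0, 0])
```

With 50 layers the two outputs already differ by more than twenty orders of magnitude (on the order of $10^{8}$ versus $10^{-16}$), even though the weights only differ between 1.5 and 0.5.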

#### Partial Solution: Weight Initialization for Deep Neural Network

Initialize the weights randomly, but choose their variance carefully based on the activation function.

Let $z = w_1x_1 + w_2x_2 + w_3x_3 + \dots + w_nx_n$ (let $b = 0$)

When $n$ is large, we want each $w_i$ to be small so that $z$ stays small. We therefore set the variance of $w_i$ based on $n$ and on the activation function being used.
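A short derivation suggests why the variance should shrink as $n$ grows. It relies on simplifying assumptions that the notes do not state: the $w_i$ and $x_i$ are mutually independent, have zero mean, and share the variances $\mathrm{Var}(w)$ and $\mathrm{Var}(x)$.

$$\mathrm{Var}(z) = \sum_{i=1}^{n} \mathrm{Var}(w_i x_i) = n \, \mathrm{Var}(w) \, \mathrm{Var}(x)$$

Under these assumptions, choosing $\mathrm{Var}(w) = \frac{1}{n}$ gives $\mathrm{Var}(z) = \mathrm{Var}(x)$, so the scale of the signal is roughly preserved from one layer to the next instead of growing or shrinking with $n$.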

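For example: `W[l] = np.random.randn(shape) * np.sqrt(2 / n[l-1])`, where `W[l]` is $W^{[l]}$ and `n[l-1]` is $n^{[l-1]}$, the number of units feeding into layer $l$.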

This does not solve the vanishing/exploding gradient problem, but it reduces it by keeping the values of $w_i$ neither too big nor too small.
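As a rough sketch, the initialization rule can be applied to every layer of a network as follows. The function name `init_weights`, the `layer_dims` list, the parameter dictionary layout, and the fixed seed are assumptions made for this example. The scale $\sqrt{2 / n^{[l-1]}}$ used here is the one the notes give for ReLU.

```python
import numpy as np

def init_weights(layer_dims, seed=0):
    """Initialize weights with standard deviation sqrt(2 / n_prev) per layer.

    layer_dims is a hypothetical list such as [n_x, n_h1, ..., n_y].
    Biases start at zero, matching the b = 0 simplification in the text.
    """
    rng = np.random.default_rng(seed)
    params = {}
    for l in range(1, len(layer_dims)):
        n_prev, n_curr = layer_dims[l - 1], layer_dims[l]
        # Standard normal samples scaled so that Var(w) = 2 / n_prev
        params[f"W{l}"] = rng.standard_normal((n_curr, n_prev)) * np.sqrt(2.0 / n_prev)
        params[f"b{l}"] = np.zeros((n_curr, 1))
    return params

params = init_weights([1000, 500, 10])
print(params["W1"].std(), np.sqrt(2.0 / 1000))  # the two values should be close
```

The empirical standard deviation of `W1` should land near $\sqrt{2/1000}$, which shows that layers with more inputs get proportionally smaller weights.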

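Scaling factors for different activations (each is the square root of the target variance):

For tanh activation: $\sqrt{\frac{1}{n^{[l-1]}}}$ [Xavier Initialization]

For ReLU: $\sqrt{\frac{2}{n^{[l-1]}}}$ [He Initialization]

Another variant: $\sqrt{\frac{2}{n^{[l-1]}+n^{[l]}}}$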
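The three scaling factors listed above can be collected in one small helper. This is a sketch only: the function name `init_scale` and the string labels `"tanh"`, `"relu"` and `"other"` are invented for this example and do not come from any library.

```python
import numpy as np

def init_scale(activation, n_prev, n_curr=None):
    """Return the factor that multiplies np.random.randn(...) for a layer.

    n_prev is the number of units in layer l-1, n_curr the number in layer l.
    The labels are hypothetical names for the three rules in the text.
    """
    if activation == "tanh":
        return np.sqrt(1.0 / n_prev)
    if activation == "relu":
        return np.sqrt(2.0 / n_prev)
    if activation == "other":
        if n_curr is None:
            raise ValueError("the 'other' rule needs n_curr as well")
        return np.sqrt(2.0 / (n_prev + n_curr))
    raise ValueError(f"unknown activation: {activation}")

# Example: weights for a tanh layer with 4 inputs and 3 units
W = np.random.randn(3, 4) * init_scale("tanh", n_prev=4)
```

Keeping the rule in one place makes it easy to treat the scale as something to tune, since these factors are starting points rather than guarantees.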

Source material from Andrew Ng’s awesome course on Coursera. The material in the videos has been written up in text form so that anyone who wishes to revise a certain topic can do so without going through the entire video lectures.

Written on December 6, 2017