How to make sure your implementation of backpropagation is correct?

Representation: Let f($\theta$) be a function of $\theta$ whose derivative f’($\theta$) we need to compute.

Without using calculus: let $\epsilon$ be a very small number, say $\epsilon = 10^{-7}$.

Approximate slope of f at $\theta$ = $\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$

This two-sided (centered) difference has an approximation error of order $\epsilon^2$, making it noticeably more accurate than the one-sided difference $\frac{f(\theta + \epsilon) - f(\theta)}{\epsilon}$, whose error is of order $\epsilon$.
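As a quick sanity check of the formula itself, here is a minimal Python sketch; the function f($\theta$) = $\theta^3$ is a made-up example, not from the course:

```python
def f(theta):
    return theta ** 3  # toy function; the true derivative is 3 * theta**2

epsilon = 1e-7
theta = 1.0

# Two-sided (centered) difference approximation of f'(theta)
approx = (f(theta + epsilon) - f(theta - epsilon)) / (2 * epsilon)
exact = 3 * theta ** 2

print(approx)               # approximately 3.0000000
print(abs(approx - exact))  # tiny, consistent with the O(epsilon^2) error
```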

Take all the parameters of the cost function J(), namely $W^1, b^1, W^2, b^2, \dots, W^L, b^L$, and reshape them into a single vector $\theta$.

J($W^1, b^1, W^2, b^2, \dots, W^L, b^L$) = J($\theta$)

Similarly, take the components of the gradient, $dW^1, db^1, \dots, dW^L, db^L$, and reshape them into a single vector $d\theta$ (in the same order).
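A minimal NumPy sketch of this reshaping step; the helper name `params_to_vector` and the layer shapes are illustrative assumptions, not from the course:

```python
import numpy as np

def params_to_vector(params):
    """Flatten a list of parameter arrays (W1, b1, ..., WL, bL) into one long vector."""
    return np.concatenate([p.reshape(-1) for p in params])

# Hypothetical 2-layer network: 3 inputs -> 4 hidden units -> 1 output
W1, b1 = np.random.randn(4, 3), np.random.randn(4, 1)
W2, b2 = np.random.randn(1, 4), np.random.randn(1, 1)

theta = params_to_vector([W1, b1, W2, b2])

# The backprop gradients must be flattened in exactly the same order, e.g.:
# d_theta = params_to_vector([dW1, db1, dW2, db2])
```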

Question: Is $d\theta$ really the gradient of J($\theta$)?

Implementation:

for each component $i$ of $\theta$ (where $\theta$ has $n$ components in total):

$d\theta_{approx}[i] = \frac{J(\theta_1,\theta_2,\dots,\theta_i+\epsilon,\dots,\theta_n)-J(\theta_1,\theta_2,\dots,\theta_i-\epsilon,\dots,\theta_n)}{2\epsilon}$
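In code, the loop might look like the sketch below; `J` here is a stand-in quadratic cost, since a real implementation would reshape $\theta$ back into the $W$’s and $b$’s and run forward propagation:

```python
import numpy as np

def J(theta):
    # Stand-in cost; a real J would unflatten theta and run forward prop.
    return np.sum(theta ** 2)

def numerical_gradient(J, theta, epsilon=1e-7):
    d_theta_approx = np.zeros_like(theta)
    for i in range(theta.size):
        theta_plus = theta.copy()
        theta_plus[i] += epsilon   # nudge only the i-th component up...
        theta_minus = theta.copy()
        theta_minus[i] -= epsilon  # ...and down
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * epsilon)
    return d_theta_approx

theta = np.random.randn(10)
d_theta_approx = numerical_gradient(J, theta)
```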

After completing the loop, check whether $d\theta \approx d\theta_{approx}$.

How: compute the normalized Euclidean distance between the two vectors:

distance = $\frac{\lVert d\theta_{approx} - d\theta \rVert_2}{\lVert d\theta_{approx}\rVert_2 +\lVert d\theta\rVert_2}$

The distance should be very small. If it is around $10^{-7}$, the implementation is very likely correct; around $10^{-5}$, take a careful look; around $10^{-3}$ or larger, there is probably a bug that needs to be tracked down.
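Continuing the sketch above (for the stand-in cost $J(\theta) = \sum_i \theta_i^2$, backprop would give the analytic gradient $d\theta = 2\theta$):

```python
def grad_check_distance(d_theta_approx, d_theta):
    """Normalized Euclidean distance between numerical and backprop gradients."""
    numerator = np.linalg.norm(d_theta_approx - d_theta)
    denominator = np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)
    return numerator / denominator

d_theta = 2 * theta  # analytic gradient of the stand-in cost
distance = grad_check_distance(d_theta_approx, d_theta)
print(distance)  # far below 1e-7 here, so this gradient passes the check
```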

#### Implementation Tips:

• Don’t use grad check during training; use it only for debugging
• If the algorithm fails grad check, look at the individual components of $d\theta_{approx} - d\theta$ to try to localize the bug (e.g. which layer’s $dW$ or $db$ they come from)
• Remember to include the regularization terms in both J and the gradient if regularization is used (see the sketch after this list)
• Grad check doesn’t work with dropout; turn dropout off (e.g. set keep_prob = 1) while checking
• Run it at random initialization, and perhaps again after some training
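On the regularization point above: if an L2 penalty is part of the cost, the same penalty must appear both in the cost used for grad check and in the analytic gradient. A minimal sketch extending the earlier stand-in cost; the values of `lambd` and `m` are illustrative assumptions:

```python
# Extends the earlier sketch: J, numerical_gradient, grad_check_distance,
# and theta are defined above.
lambd, m = 0.1, 100  # assumed L2 strength and number of training examples

def J_reg(theta):
    return J(theta) + (lambd / (2 * m)) * np.sum(theta ** 2)  # cost + L2 penalty

# Analytic gradient of J_reg for the stand-in cost: 2*theta plus the L2 term
d_theta_reg = 2 * theta + (lambd / m) * theta

distance = grad_check_distance(numerical_gradient(J_reg, theta), d_theta_reg)
print(distance)  # stays tiny only because *both* sides include the penalty
```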

Source material is from Andrew Ng’s course on Coursera. The material from the video lectures has been written down in text form so that anyone who wishes to revise a particular topic can do so without going through the entire video lectures.

Written on December 6, 2017