Numerical Approximation of Gradient and Gradient Checking

How can we make sure the implementation of backpropagation is correct?

Solution: Implement gradient checking

Representation: Let f($\theta$) be a function of $\theta$, and suppose we need to compute its derivative f'($\theta$).

Without using calculus: let $\epsilon$ be a very small number, e.g. $\epsilon = 10^{-7}$.

Approx. Slope of f($\theta$) at $\theta$ = $\frac{f(\theta + \epsilon) - f(\theta - \epsilon)}{2\epsilon}$
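As a quick sanity check, this two-sided difference can be evaluated in a few lines; the function $f(\theta) = \theta^3$ below is just a hypothetical example, and the approximation should agree closely with the analytic derivative $3\theta^2$.

```python
# Two-sided difference for a toy function f(theta) = theta^3 (hypothetical example).
def f(theta):
    return theta ** 3

theta, eps = 1.0, 1e-7
approx = (f(theta + eps) - f(theta - eps)) / (2 * eps)  # numerical slope
exact = 3 * theta ** 2                                   # analytic derivative f'(theta)
print(approx, exact)  # the two values match to several decimal places
```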

Gradient Checking

Take all the parameters of the cost function J(): $W^1, b^1, W^2, b^2, \dots, W^L, b^L$ and reshape them into a single vector $\theta$.

J($W^1, b^1, W^2, b^2, \dots, W^L, b^L$) = J($\theta$)

Take the components of the gradient $dW^1, db^1, \dots, dW^L, db^L$ and reshape them into a single vector $d\theta$.
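As a minimal sketch of this reshaping step (the layer sizes and parameter names below are made up for illustration), each $W$ and $b$ can be flattened into a column and concatenated; the gradients $dW, db$ would be flattened in exactly the same order to form $d\theta$.

```python
import numpy as np

# Hypothetical parameters of a small 2-layer network (shapes chosen for illustration).
params = {"W1": np.random.randn(4, 3), "b1": np.random.randn(4, 1),
          "W2": np.random.randn(1, 4), "b2": np.random.randn(1, 1)}

# Flatten each parameter into a column vector and stack them into one vector theta.
theta = np.concatenate([params[k].reshape(-1, 1) for k in ("W1", "b1", "W2", "b2")])

# dW1, db1, dW2, db2 would be flattened in the same order to build d_theta.
```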

Question: Is $d\theta$ really the gradient/slope of J($\theta$)?

Implementation:

for each $i$:

$d\theta_{approx}[i] = \frac{J(\theta_1,\theta_2,\dots,\theta_i+\epsilon,\dots,\theta_n)-J(\theta_1,\theta_2,\dots,\theta_i-\epsilon,\dots,\theta_n)}{2\epsilon}$

where $n$ is the number of components of the vector $\theta$.

After completing the loop, check whether $d\theta \approx d\theta_{approx}$.

How: compute the normalized Euclidean distance

distance = $\frac{\lVert d\theta_{approx} - d\theta \rVert_2}{\lVert d\theta_{approx}\rVert_2 +\lVert d\theta\rVert_2}$

The distance should be very small: if it is on the order of $10^{-7}$, the implementation is likely correct; if it is on the order of $10^{-3}$, there is probably a bug and the gradient computation needs to be checked carefully.
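Putting the loop and the distance together, a sketch of the whole check might look like the function below. It assumes `J` is the cost written as a function of the flattened vector $\theta$ and `d_theta` is the backprop gradient flattened in the same order (both names are assumptions for this sketch, not a fixed API).

```python
import numpy as np

def gradient_check(J, theta, d_theta, eps=1e-7):
    """J: cost as a function of the flattened parameter vector theta (shape (n, 1)).
    d_theta: gradient from backprop, flattened in the same order as theta."""
    n = theta.shape[0]
    d_theta_approx = np.zeros((n, 1))
    for i in range(n):
        theta_plus = np.copy(theta)
        theta_plus[i] += eps                    # nudge the i-th component up
        theta_minus = np.copy(theta)
        theta_minus[i] -= eps                   # nudge the i-th component down
        d_theta_approx[i] = (J(theta_plus) - J(theta_minus)) / (2 * eps)

    # Normalized Euclidean distance between the approximate and backprop gradients.
    distance = (np.linalg.norm(d_theta_approx - d_theta)
                / (np.linalg.norm(d_theta_approx) + np.linalg.norm(d_theta)))
    return distance  # ~1e-7: likely correct; ~1e-3: probably a bug
```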

Implementation Tips:

  • Don’t use in training, only to debug
  • If the algorithm fails grad check, look at the individual components of $d\theta$ and $d\theta_{approx}$ to try to identify the bug
  • Remember to include the regularization terms (in both the cost and the gradients) if regularization is used
  • Doesn't work with dropout; turn dropout off while grad checking
  • Run at random initialization, perhaps again after some training
Written on December 6, 2017