
Gradient Descent with Momentum

Speeding up the Gradient Descent Algorithm

How does Gradient Descent work?

Gradient Descent Plot

Gradient descent starts at a point and gradually moves towards the centre, which is the optimal value of the parameters. Along the way it moves both vertically (oscillations) and horizontally towards the minimum. The horizontal movement is what we need; the vertical oscillations slow the algorithm down and delay convergence. We want faster learning along the horizontal axis and slower learning along the vertical axis.
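The behaviour described above can be reproduced with a small sketch of plain gradient descent. The cost function, starting point, and learning rate here are illustrative assumptions, not taken from the course: a bowl that is shallow along the horizontal axis `w1` and steep along the vertical axis `w2`.

```python
# Assumed toy cost: J(w1, w2) = 0.5 * (w1**2 + 20 * w2**2)
# w1 is the shallow "horizontal" direction, w2 the steep "vertical" one.
def grad(w1, w2):
    return w1, 20.0 * w2

alpha = 0.09          # assumed learning rate
w1, w2 = 1.0, 1.0     # assumed starting point
path = [(w1, w2)]

for _ in range(30):
    g1, g2 = grad(w1, w2)
    w1 -= alpha * g1  # slow, steady progress horizontally
    w2 -= alpha * g2  # overshoots the minimum on every step
    path.append((w1, w2))

# Count how often the vertical coordinate jumps to the other side of the minimum.
sign_flips = sum(1 for (_, a), (_, b) in zip(path, path[1:]) if a * b < 0)

print("vertical sign flips:", sign_flips)
print("final point:", path[-1])
```

In this toy setting the vertical coordinate changes sign on every step while the horizontal coordinate shrinks by only 9% per step. Raising `alpha` to speed up the horizontal direction is not an option here, because at `alpha = 0.1` the vertical coordinate would stop shrinking altogether.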

Gradient Descent Plot

Gradient Descent with Momentum

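Replace dW and db with $v_{dW}$ and $v_{db}$ when updating the weights. Here $v_{dW}$ and $v_{db}$ are exponentially weighted averages of the gradients, initialised to zero. The implementation becomes: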

On iteration t:
Compute dW and db on the current mini-batch
$v_{dW} = \beta v_{dW} + (1-\beta)dW$
$v_{db} = \beta v_{db} + (1-\beta)db$

Then update the weights with $v_{dW}$ and $v_{db}$ instead of dW and db:
$W:= W - \alpha v_{dW}$
$b:= b - \alpha v_{db}$
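The update rule above can be sketched in Python as follows. This is a minimal illustration, not the course's reference implementation: the function name `momentum_step`, the default values, and the toy cost function are assumptions made for the example. The function works on plain floats or on NumPy arrays of matching shape.

```python
def momentum_step(W, b, dW, db, v_dW, v_db, alpha=0.01, beta=0.9):
    """One momentum update (hypothetical helper; names are illustrative)."""
    # Exponentially weighted averages of the gradients.
    v_dW = beta * v_dW + (1 - beta) * dW
    v_db = beta * v_db + (1 - beta) * db
    # Update the parameters with the averaged gradients instead of dW, db.
    W = W - alpha * v_dW
    b = b - alpha * v_db
    return W, b, v_dW, v_db


# Toy usage on an assumed cost J(W, b) = (W - 3)**2 + (b + 1)**2.
W, b = 0.0, 0.0
v_dW, v_db = 0.0, 0.0  # velocities start at zero
for t in range(500):
    dW = 2 * (W - 3)   # dJ/dW
    db = 2 * (b + 1)   # dJ/db
    W, b, v_dW, v_db = momentum_step(W, b, dW, db, v_dW, v_db,
                                     alpha=0.1, beta=0.9)

print(W, b)  # close to the minimum at W = 3, b = -1
```

In a real training loop, `dW` and `db` would come from backpropagation on the current mini-batch rather than from a closed-form derivative.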

Using $v_{dW}$ and $v_{db}$ in place of dW and db when updating the weights smooths out the steps of gradient descent, which in turn reduces the vertical oscillations.
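One way to see why the averaging helps is to feed it two idealised gradient sequences. The sequences below are assumptions chosen for illustration: a horizontal gradient that always points the same way, and a vertical gradient that flips sign on every step.

```python
beta = 0.9
v_horizontal = 0.0
v_vertical = 0.0

for t in range(50):
    g_horizontal = 1.0                        # consistent direction
    g_vertical = 1.0 if t % 2 == 0 else -1.0  # oscillating direction
    v_horizontal = beta * v_horizontal + (1 - beta) * g_horizontal
    v_vertical = beta * v_vertical + (1 - beta) * g_vertical

print("averaged horizontal gradient:", v_horizontal)
print("averaged vertical gradient:  ", v_vertical)
```

The consistent gradient survives the averaging almost unchanged, while the alternating one largely cancels itself out, so the step taken in the vertical direction becomes much smaller than the raw gradient would suggest.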

Analogy with physics: think of the cost function as a bowl, with a ball rolling down towards the minimum.

$dW, db$ : Acceleration
$v_{dW}, v_{db}$ : Velocity
$\beta$ : Friction

The hyperparameters $\alpha$ and $\beta$ now both need to be tuned.

A typical, very commonly used value of $\beta$ is 0.9. Bias correction is rarely used in this case.
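To get a feel for what bias correction would change, the sketch below tracks the averaged gradient with and without it. The correction $v_t / (1 - \beta^t)$ is the usual form for exponentially weighted averages and is not spelled out in the text above; the constant gradient of 1 is an assumption for illustration.

```python
beta = 0.9
v = 0.0
raw, corrected = [], []

for t in range(1, 51):
    g = 1.0                              # assumed constant gradient
    v = beta * v + (1 - beta) * g
    raw.append(v)                        # what momentum uses as-is
    corrected.append(v / (1 - beta ** t))  # bias-corrected estimate

for t in (1, 5, 10, 50):
    print(t, round(raw[t - 1], 4), round(corrected[t - 1], 4))
```

Because the velocity starts at zero, the uncorrected average underestimates the gradient in the first iterations, and the gap closes as $\beta^t$ shrinks. Over a long training run those early iterations are a small fraction of the total, which is one plausible reason the correction is usually skipped.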

Source material is from Andrew Ng’s awesome course on Coursera. The material in the video has been written up in text form so that anyone who wishes to revise a certain topic can do so without going through the entire video lecture.

Written on December 5, 2017