Optimization Algorithms - RMSprop and Adam

Gradient Descent is widely used as an optimization algorithm for optimizing the cost functions. There are various improved version of these algorithms like StochasticGradientDescent, Gradient Descent with Momentum, RMSprop and Adam.

RMSprop : Root Mean Square Prop

Intuition: Suppose we have two parameters, one in vertical direction (say $b$), and another in horizontal direction (say $W$).We want to speed up learning in the horizontal direction ($W$) and slow down to prevent oscillations in the vertical direction ($b$).

$\beta_2$: Parameter for RMSprop (similar to momentum $\beta$ but written as $\beta_2$ to prevent confusion
$\epsilon$: Parameter added to $S_{dW}$ and $S_{db}$ to prevent division by zero in case any of this is equal to zero.


On iteration t:


*Compute dW, db in the current mini batch

*$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)dW^2$

*$S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$

*Update Parameters

**$W = W-\alpha\frac{dW}{\sqrt{S_{dW}+\epsilon}}$

**$b = b-\alpha\frac{db}{\sqrt{S_{db}+\epsilon}}$


We can use large $\alpha$ to speed up learning

$S_{dW}$ is relatively small which means $W$ is divided by a small number speeding up learning in the $W$ direction

$S_{db}$ is relatively large which means $b$ is divided by a large number preventing movement in vertical direction

In reality, there are multiple parameters in horizontal direction (say, $W_1, W_3, W_7…$) and multiple in the vertical direction (say $W_2, W_4, W_6,..$).


Adam stands for Adaptive Moment Estimation. It is essentially a combination of RMSprop and Gradient Descent with momentum algorithms.


Initialize: $v_{dW} = 0, S_{dW}=0, v_{db}=0, v_{dW}=0 $

On iteration t,


Compute dW, db on current minibatch

$v_{dW} = \beta_1 v_{dW} + (1-\beta_1)dW$
$v_{db} = \beta_1 v_{db} + (1-\beta_1)db$

$S_{dW} = \beta_2 S_{dW} + (1-\beta_2)dW^2$
$S_{db} = \beta_2 S_{db} + (1-\beta_2)db^2$

Applying Bias Correction: $v^c$ stands for $v^{corrected}$, similarly for S
$v_{dW}^c = \frac{v_{dW}}{1-\beta_1^t}$
$v_{db}^c = \frac{v_{db}}{1-\beta_1^t}$
$S_{dW}^c = \frac{S_{dW}}{1-\beta_2^t}$
$S_{db}^c = \frac{S_{db}}{1-\beta_2^t}$

Weight Update:
$W = W - \alpha \frac{v_{dW}^c}{\sqrt{S_{dW}^c + \epsilon}}$
$b = b - \alpha \frac{v_{db}^c}{\sqrt{S_{db}^c + \epsilon}}$


$\alpha$ : Needs to be tuned
$\beta_1$: Typical value : 0.9 (Not tuned very often)
$\beta_2$: Typical value: 0.999 (Not tuned very often)
$\epsilon: 10^{-8} $: Doesn’t matter much

Additional Material: Learning Rate Decay

To speed up an optimization algorithms, the learning rate $\alpha$ needs to be reduced over time. The reason for this is that in the beginning, the bigger steps can be taken but as the cost function reachesto its minumum, learning rate needs to be decreased.

In the first figure, if the learning rate is same, the alogorithm will never converge and will always oscillate around the minima (BLUE). If learning rate is decreased over time, it will coverge to minima in the later epochs (GREEN).

LR Decay LR Decay


decay_rate: Hyperparameter epoch: 1 pass through the data

$\alpha = \frac{1}{1+decayRate*epochNum}.\alpha_0$

For example, let $\alpha_0$ = 0.2, decayRate =1 epoch = 1; $\alpha$ = 0.1
epoch = 2; $\alpha$ = 0.67
epoch = 3; $\alpha$ = 0.5
epoch = 4; $\alpha$ = 0.4

Other Types of Decay:

Exponential decay: $\alpha$ = $0.95^{epochNum}\alpha_0$

Others: $\alpha = \frac{K}{\sqrt{epochNum}}\alpha_0$

Additional Material: Local Optima

What if there are multiple local optima present for the cost function as shown in the figure below?

Local Optima

Turns out that this is not the case in a multi-dimensional space. In reality, most of the minima would be saddle points where some directions would go up and some will go down.

Local Optima

The rel problem is Plateau which can really slow down learning.

Local Optima

RMSprop, Adam and GD with Momentum can be used to fix these

Written on December 6, 2017
[ ]