# Exponentially Weighted Average for Deep Neural Networks

#### A fast and efficient way to compute moving averages, used inside several optimization algorithms.

Example: given the temperature $\theta_t$ on each day $t$, calculate its moving average.

$v_t$: Moving average value on day $t$

$v_0 = 0$
$v_1 = 0.9v_0 + 0.1\theta_1$
$v_2 = 0.9v_1 + 0.1\theta_2$
..
$v_t = 0.9v_{t-1} + 0.1\theta_t$

In general, writing $\beta$ for the weight on the previous average ($\beta$ = 0.9 in the example above):
$v_t = \beta v_{t-1} + (1-\beta)\theta_t$

The above equation would give the moving average line in Red over the data points $\theta_t$ in Blue.

#### What do $v_t$ and $\beta$ mean?

$v_t$: approximately an average over the last $\frac1{1-\beta}$ days

For example, for $\beta$=0.9, $\frac1{1-\beta}$ = 10 ;
$\beta$ = 0.9 averages over about 10 days (smooth curve: Red Line)

For $\beta$=0.98, $\frac1{1-\beta}$ = 50 ;
$\beta$ = 0.98 averages over about 50 days (smoother curve: Green Line), but the curve lags behind changes in the data, so it is a less accurate representation

For $\beta$=0.5, $\frac1{1-\beta}$ = 2 ;
$\beta$ = 0.5 averages over about 2 days (Fluctuations: Yellow Line), so the curve is much noisier

The right value of $\beta$ is not computed from a formula; it is chosen through hyperparameter tuning.
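To see how $\beta$ changes the smoothness, the sketch below applies the same update rule to a made-up temperature series with the three values of $\beta$ discussed above. The `ewma` helper, the synthetic data, and the `roughness` measure (mean absolute day-to-day change) are assumptions for illustration and are not part of the course material.

```python
import math
import random


def ewma(values, beta):
    """Return every v_t for v_t = beta * v_{t-1} + (1 - beta) * theta_t, with v_0 = 0."""
    v = 0.0
    averages = []
    for theta in values:
        v = beta * v + (1 - beta) * theta
        averages.append(v)
    return averages


def roughness(series):
    """Mean absolute change between consecutive points (smaller means smoother)."""
    diffs = [abs(b - a) for a, b in zip(series, series[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical data: a slow seasonal trend plus daily noise.
random.seed(0)
temps = [20 + 10 * math.sin(2 * math.pi * day / 365) + random.gauss(0, 3)
         for day in range(365)]

for beta in (0.5, 0.9, 0.98):
    window = 1 / (1 - beta)
    smoothed = ewma(temps, beta)
    # Skip the first 100 days so the start-up from v_0 = 0 does not dominate.
    print(f"beta={beta}: ~{window:.0f}-day window, "
          f"roughness={roughness(smoothed[100:]):.3f}")
```

On data like this, the roughness should shrink as $\beta$ grows, which matches the yellow, red, and green curves described above.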

#### What is Exponentially Weighted Average Actually Doing?

$v_t = \beta v_{t-1} + (1-\beta)\theta_t$

Going backwards from $v_{100}$,
$v_{100} = 0.1\theta_{100} + 0.9v_{99}$
$v_{99} = 0.1\theta_{99} + 0.9v_{98}$
Substituting $v_{99}$,
$v_{100} = 0.1\theta_{100} + 0.9 (0.1\theta_{99} + 0.9v_{98})$
or, $v_{100} = 0.1\theta_{100} + 0.9 (0.1\theta_{99} + 0.9(0.1\theta_{98} + 0.9v_{97}))$

Generalizing,
$v_{100} = 0.1\theta_{100} + 0.1 \cdot 0.9\,\theta_{99} + 0.1 \cdot (0.9)^2\theta_{98} + 0.1 \cdot (0.9)^3\theta_{97} + 0.1 \cdot (0.9)^4\theta_{96} + \dots$

$v_{100}$ is basically the sum of an element-wise product of two sequences: an exponential decay function with diminishing values (1, 0.9, $0.9^2$, $0.9^3$, ..), scaled by 0.1, and the temperature readings $\theta_{100}, \theta_{99}, \theta_{98}$, .. ordered from newest to oldest.
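As a quick check of this expansion, the following sketch computes $v_{100}$ twice: once with the recurrence and once as an explicit weighted sum with weights $(1-\beta)\beta^k$. The function names and the readings are hypothetical and chosen only for illustration.

```python
def ewma_recursive(thetas, beta):
    """Final value v_t from the recurrence, starting at v_0 = 0."""
    v = 0.0
    for theta in thetas:
        v = beta * v + (1 - beta) * theta
    return v


def ewma_expanded(thetas, beta):
    """Final value v_t as an explicit weighted sum over all past readings."""
    t = len(thetas)
    # The reading k days before the latest one gets weight (1 - beta) * beta**k.
    weights = [(1 - beta) * beta ** k for k in range(t)]
    newest_first = thetas[::-1]
    return sum(w * theta for w, theta in zip(weights, newest_first))


# Hypothetical readings theta_1 .. theta_100.
thetas = [15 + (day % 7) for day in range(1, 101)]

recursive = ewma_recursive(thetas, beta=0.9)
expanded = ewma_expanded(thetas, beta=0.9)
print(recursive, expanded)
```

The two numbers should agree up to floating-point rounding, since the weighted sum is just the recurrence unrolled.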

Also, if $\beta$ = 0.9, the weight decays to about a third of its starting value after 10 days.

Why: for small $\epsilon$, $(1-\epsilon)^{\frac1{\epsilon}} \approx \frac1e \approx 0.37$, where
$\epsilon$ = 1-$\beta$

If $\beta$ = 0.9, $\epsilon$ = 1-$\beta$ = 0.1

$(1-\epsilon)^{\frac1\epsilon} = 0.9^{10} \approx 0.35 \approx \frac1e$

Interpretation: It takes about 10 days for the weight to decay to roughly 1/3rd

If $\beta$ = 0.98, $\epsilon$ = 0.02, $\frac1\epsilon$ = 50;
It takes approx 50 days for the weight to decay to roughly 1/3rd
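The approximation can be checked numerically. This small sketch evaluates $(1-\epsilon)^{\frac1\epsilon}$ for a few values of $\beta$ and prints it next to $\frac1e$; the helper name `weight_after_window` is an assumption made for this example.

```python
import math


def weight_after_window(beta):
    """Relative weight of a reading that is 1 / (1 - beta) days old."""
    epsilon = 1 - beta
    return (1 - epsilon) ** (1 / epsilon)


for beta in (0.5, 0.9, 0.98, 0.999):
    print(f"beta={beta}: window={1 / (1 - beta):.0f} days, "
          f"relative weight={weight_after_window(beta):.3f}, "
          f"1/e={1 / math.e:.3f}")
```

The match with $\frac1e$ gets tighter as $\beta$ approaches 1 and is loose for a small value such as $\beta$ = 0.5, so the "$\frac1{1-\beta}$ days" figure is a rule of thumb and not an exact window.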

#### Implementing Exponentially Weighted Average

$v_\theta$: the exponentially weighted average of the parameter $\theta$.

day 0: $v_\theta = 0$
day 1: $v_\theta = \beta v_\theta + (1-\beta)\theta_1$
day 2: $v_\theta = \beta v_\theta + (1-\beta)\theta_2$

Algorithm:

$v_\theta = 0$

Repeat: {

Get next $\theta_t$

$v_\theta := \beta v_\theta + (1-\beta)\theta_t$

}

The update is a single line and keeps only one stored value, $v_\theta$, which makes computing the exponentially weighted moving average fast and memory-efficient.
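A minimal Python sketch of the loop above might look like this. The class name `ExponentiallyWeightedAverage`, its `update` method, and the sample temperatures are hypothetical and not taken from the course.

```python
class ExponentiallyWeightedAverage:
    """Keeps a single running value v, so memory use does not grow with t."""

    def __init__(self, beta=0.9):
        self.beta = beta
        self.v = 0.0  # v_theta = 0 before any reading arrives

    def update(self, theta):
        """Fold in the next reading theta_t and return the updated average."""
        self.v = self.beta * self.v + (1 - self.beta) * theta
        return self.v


# Hypothetical daily temperatures, fed in one at a time.
average = ExponentiallyWeightedAverage(beta=0.9)
for theta in [12.0, 14.0, 13.0, 15.0]:
    print(round(average.update(theta), 4))
```

Because each reading is folded into `v` as it arrives, the full history never has to be stored, unlike a plain average over the last $\frac1{1-\beta}$ days.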

#### Bias Correction in Exponentially Weighted Moving Average

Bias correction makes the EWMA more accurate in its early phase. Since the curve starts from $v_0 = 0$, there are few values to average over in the initial days. Thus, the curve is lower than the correct value initially and only later moves in line with the expected values.

Figure: The ideal curve should be the GREEN one, but it starts as the PURPLE curve since the average is initialized to zero

Example: Starting from t=0 and moving forward, with $\beta$ = 0.98,
$v_0 = 0$
$v_1 = 0.98v_0 + 0.02\theta_1 = 0.02\theta_1$
$v_2 = 0.98v_1 + 0.02\theta_2 = 0.0196\theta_1 + 0.02\theta_2$

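The initial values of $v_t$ will be very low, which needs to be compensated for.
Instead of $v_t$, use the corrected value $\frac{v_t}{1-\beta^t}$. The recurrence itself still runs on the uncorrected $v_t$.
For t=2, $1-\beta^t$ = $1-0.98^2$ = 0.0396 (Bias Correction Factor)

$\frac{v_2}{0.0396} = \frac{0.0196\theta_1 +0.02\theta_2}{0.0396}$, which is a proper weighted average of $\theta_1$ and $\theta_2$ because the two weights now sum to 1.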

When t is large, $\beta^t \approx 0$, so $\frac{1}{1-\beta^t} \approx 1$; hence the bias correction factor has almost no effect when t is sufficiently large. It only scales up the initial values.
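The effect is easy to see on a constant series, where the true average is known. In the sketch below, the function name `ewma_bias_corrected` and the constant temperature of 20 are assumptions for illustration; the correction is applied only to the reported value, while the recurrence keeps using the uncorrected $v_t$.

```python
def ewma_bias_corrected(thetas, beta):
    """Return (raw, corrected) lists, where corrected_t = v_t / (1 - beta**t)."""
    v = 0.0
    raw, corrected = [], []
    for t, theta in enumerate(thetas, start=1):
        v = beta * v + (1 - beta) * theta      # uncorrected recurrence
        raw.append(v)
        corrected.append(v / (1 - beta ** t))  # correction applied to the output only
    return raw, corrected


# Hypothetical constant temperature, so the true average is 20 on every day.
raw, corrected = ewma_bias_corrected([20.0] * 300, beta=0.98)
for day in (1, 2, 50, 300):
    print(day, round(raw[day - 1], 3), round(corrected[day - 1], 3))
```

The raw values start far below 20 and climb slowly, while the corrected values sit at 20 from the first day; by day 300 the two are nearly identical.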

Source material from Andrew Ng’s awesome course on Coursera. The material in the video has been written up in text form so that anyone who wishes to revise a certain topic can go through it without watching the entire video lectures.

Written on December 5, 2017