Batch Normalization Algorithm

Batch normalization makes hyperparameter tuning easier and makes the neural network more robust to the choice of hyperparameters. It also makes it easier and faster to train a large network.

Normalizing inputs to speed up learning

Normalization brings all the input features to a similar scale, which speeds up learning.
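As an illustration, here is a minimal NumPy sketch of this input normalization (the design matrix `X`, with features as rows and the m examples as columns, is a made-up example):

```python
import numpy as np

# Hypothetical inputs: 3 features on very different scales, m = 1000 examples as columns
X = np.random.randn(3, 1000) * np.array([[1.0], [50.0], [0.01]])

mu = X.mean(axis=1, keepdims=True)           # per-feature mean
sigma2 = X.var(axis=1, keepdims=True)        # per-feature variance
X_norm = (X - mu) / np.sqrt(sigma2 + 1e-8)   # every feature now has mean ~0 and variance ~1
```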

For a deeper model, you have not only the input features but also the activations in the hidden layers. For example, in layer 2 we could normalize the values in $a^{[2]}$ to make the training of $W^{[3]}$ & $b^{[3]}$ more efficient.

For any hidden layer, can we normalize the values of $a^{[l]}$ so as to train $W^{[l+1]}$ & $b^{[l+1]}$ faster?

In practice, $z^{[l]}$ is normalized instead of $a^{[l]}$, i.e. before the activation function is applied.

Implementing the Batch Normalization algorithm

Let the hidden units' pre-activations $z^{[l]}$ over the m training examples be $z^{(1)}, z^{(2)}, \ldots, z^{(m)}$. For simplicity, the value of layer $z^{[l]}$ for the $i^{th}$ training example will just be written as $z^{(i)}$.

Calculate the mean and variance of $z^{[l]}$ over all the examples, normalize, then scale and shift:

$\mu = \frac{1}{m}\sum_i Z^{(i)}$
$\sigma^2 = \frac{1}{m}\sum_i (Z^{(i)}-\mu)^2$
$Z_{norm}^{(i)} = \frac{Z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$
$\tilde Z^{(i)} = \gamma Z_{norm}^{(i)}+\beta$

$\gamma$ and $\beta$ are learnable parameters of the network which can be trained through gradient descent.
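The four equations above translate almost line for line into code. A minimal NumPy sketch (the function name and the shape convention, units as rows and the m examples as columns, are my own assumptions):

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Batch-normalize pre-activations Z of shape (n_units, m) over the m examples."""
    mu = Z.mean(axis=1, keepdims=True)           # mean of each hidden unit over the batch
    sigma2 = Z.var(axis=1, keepdims=True)        # variance of each hidden unit over the batch
    Z_norm = (Z - mu) / np.sqrt(sigma2 + eps)    # mean 0, variance 1
    Z_tilde = gamma * Z_norm + beta              # learnable scale and shift
    return Z_tilde, Z_norm, mu, sigma2
```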

Finally, use $\tilde Z^{(i)}$ in place of $Z^{(i)}$ for layer $l$.

Case: If $\gamma = \sqrt{\sigma^2+\epsilon}$ and $\beta = \mu$,

then $\tilde Z^{(i)} = Z^{(i)}$. In this case the effect of batch normalization is nullified.
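A quick numerical check of this claim, reusing the `batch_norm_forward` sketch above (and its NumPy import):

```python
Z = np.random.randn(5, 100)
eps = 1e-8

gamma = np.sqrt(Z.var(axis=1, keepdims=True) + eps)   # gamma = sqrt(sigma^2 + eps)
beta = Z.mean(axis=1, keepdims=True)                  # beta = mu

Z_tilde, *_ = batch_norm_forward(Z, gamma, beta, eps)
print(np.allclose(Z_tilde, Z))                        # True: the normalization is undone
```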

#### This process normalizes not just the input layer $X$ but also the inputs to the hidden layers, giving them mean = 0 and variance = 1.

We don't actually want the hidden-layer inputs to have exactly mean = 0 and variance = 1, because then they would all be clustered around 0 (where, for example, a sigmoid activation is roughly linear); we want $Z^{(i)}$ to be spread out with some other mean and variance. $\gamma$ and $\beta$, which are set by the learning algorithm, give the network exactly this flexibility.

Adding Batch Normalization to a Deep Neural Network


The computation is as follows:

$X \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow[{Batch Norm [BN]}]{\beta^{[1]},\gamma^{[1]}} \tilde Z^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde Z^{[1]} ) \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]} \xrightarrow[{Batch Norm [BN]}]{\beta^{[2]},\gamma^{[2]}} \tilde Z^{[2]} \rightarrow a^{[2]} = g^{[2]}(\tilde Z^{[2]} ) $

Instead of the unnormalized $Z^{[l]}$, the normalized $\tilde Z^{[l]}$ is fed into the activation function to compute $a^{[l]}$.
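A sketch of one layer of this forward pass, building on the `batch_norm_forward` function above (the `activation` default and the cache layout are assumptions for illustration):

```python
def dense_bn_forward(A_prev, W, b, gamma, beta, activation=np.tanh):
    """One layer of the diagram: linear step, batch norm, then activation."""
    Z = W @ A_prev + b                                       # Z^[l] = W^[l] a^[l-1] + b^[l]
    Z_tilde, Z_norm, mu, sigma2 = batch_norm_forward(Z, gamma, beta)
    A = activation(Z_tilde)                                  # a^[l] = g^[l](Z_tilde^[l])
    return A, (Z, Z_norm, mu, sigma2)
```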

Parameters to optimize:
$W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}, \ldots, W^{[l]}, b^{[l]}$ and $\beta^{[1]}, \gamma^{[1]}, \beta^{[2]}, \gamma^{[2]}, \ldots, \beta^{[l]}, \gamma^{[l]}$ (this $\beta$ is different from the hyperparameter $\beta$ used in momentum, RMSprop, or Adam)

For gradient descent, $d\beta^{[l]}$ is calculated via backprop and $\beta^{[l]}$ is updated with the usual rule $\beta^{[l]} := \beta^{[l]} - \alpha \, d\beta^{[l]}$ (and likewise for $\gamma^{[l]}$).

Batch Normalization (BN) is built into TensorFlow:

tf.nn.batch_normalization()
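For reference, a minimal usage sketch (in TensorFlow 2 style; the shapes are made up, and `tf.nn.moments` supplies the per-unit batch statistics):

```python
import tensorflow as tf

z = tf.random.normal([64, 100])              # pre-activations: 64 examples, 100 hidden units
gamma = tf.Variable(tf.ones([100]))          # learnable scale
beta = tf.Variable(tf.zeros([100]))          # learnable offset

mean, variance = tf.nn.moments(z, axes=[0])  # statistics of each unit over the batch
z_tilde = tf.nn.batch_normalization(z, mean, variance,
                                    offset=beta, scale=gamma,
                                    variance_epsilon=1e-8)
```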

#### Applying Batch Normalization to Mini-Batches

BN is applied separately to each mini-batch, using the mean and variance computed on just that mini-batch.

For mini-batches $\lbrace 1 \rbrace, \lbrace 2 \rbrace, \ldots, \lbrace t \rbrace$:

$X^{\lbrace 1\rbrace } \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow[{Batch Norm [BN]}]{\beta^{[1]},\gamma^{[1]}} \tilde Z^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde Z^{[1]} ) \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]} \xrightarrow[{Batch Norm [BN]}]{\beta^{[2]},\gamma^{[2]}} \tilde Z^{[2]} \rightarrow a^{[2]} = g^{[2]}(\tilde Z^{[2]} ) $

$X^{\lbrace 2 \rbrace } \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow[{Batch Norm [BN]}]{\beta^{[1]},\gamma^{[1]}} \tilde Z^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde Z^{[1]} ) \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]} \xrightarrow[{Batch Norm [BN]}]{\beta^{[2]},\gamma^{[2]}} \tilde Z^{[2]} \rightarrow a^{[2]} = g^{[2]}(\tilde Z^{[2]} ) $
.
.
.
$X^{\lbrace t \rbrace } \xrightarrow{W^{[1]}, b^{[1]}} Z^{[1]} \xrightarrow[{Batch Norm [BN]}]{\beta^{[1]},\gamma^{[1]}} \tilde Z^{[1]} \rightarrow a^{[1]} = g^{[1]}(\tilde Z^{[1]} ) \xrightarrow{W^{[2]}, b^{[2]}} Z^{[2]} \xrightarrow[{Batch Norm [BN]}]{\beta^{[2]},\gamma^{[2]}} \tilde Z^{[2]} \rightarrow a^{[2]} = g^{[2]}(\tilde Z^{[2]} ) $

#### Parameters

$Z^{[l]}$ is computed as $Z^{[l]} = W^{[l]}a^{[l-1]}+b^{[l]}$

Since batch normalization subtracts the mean, any constant added to every example of $Z^{[l]}$ is cancelled out; $b^{[l]}$ is such a constant, so it has no effect once $Z^{[l]}$ is normalized.

$b^{[l]}$ can therefore be dropped (or permanently set to zero); its role is effectively taken over by $\beta^{[l]}$.

$Z^{[l]} = W^{[l]}a^{[l-1]}$ $\rightarrow Z_{norm}^{[l]}$ $\rightarrow \tilde Z^{[l]}$ = $\gamma^{[l]}Z_{norm}^{[l]} + \beta^{[l]}$

Parameters for layer $l$: $W^{[l]}, \beta^{[l]}, \gamma^{[l]}$ (no $b^{[l]}$)

Dimensions of $\beta^{[l]}, \gamma^{[l]}$: $(n^{[l]}, 1)$
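A tiny NumPy check that a constant bias is indeed cancelled by the mean subtraction (the numbers are arbitrary):

```python
import numpy as np

Z = np.random.randn(4, 50)
b = 3.7                                       # any constant bias added to a unit's pre-activations

norm = lambda M: (M - M.mean(axis=1, keepdims=True)) / np.sqrt(M.var(axis=1, keepdims=True) + 1e-8)

print(np.allclose(norm(Z), norm(Z + b)))      # True: b has no effect after normalization
```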

#### Implementing Gradient Descent (GD) with mini batch

for mini-batch t = 1 … num_mini_batches:
  • Compute forward propagation on mini-batch $X^{\lbrace t \rbrace}$
  • In each hidden layer, use BN to replace $Z^{[l]}$ with $\tilde Z^{[l]}$
  • Use backprop to compute $dW^{[l]}, d\beta^{[l]}, d\gamma^{[l]}$
  • Update the parameters $W^{[l]}$, $\beta^{[l]}$ and $\gamma^{[l]}$
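In Python, one pass over the mini-batches might look like the sketch below. Here `mini_batches`, `forward_with_bn`, `backward_with_bn`, the `params`/`grads` dictionaries, the layer count `L`, and the learning rate `alpha` are all hypothetical names, not things defined in this post:

```python
alpha = 0.01                    # learning rate (assumed)
L = 3                           # number of layers (assumed)

for X_t, Y_t in mini_batches:                            # t = 1 ... num_mini_batches
    cache = forward_with_bn(X_t, params)                 # forward prop; Z^[l] replaced by Z_tilde^[l]
    grads = backward_with_bn(Y_t, cache, params)         # yields dW^[l], dbeta^[l], dgamma^[l]
    for l in range(1, L + 1):                            # plain gradient descent update
        params["W" + str(l)]     -= alpha * grads["dW" + str(l)]
        params["beta" + str(l)]  -= alpha * grads["dbeta" + str(l)]
        params["gamma" + str(l)] -= alpha * grads["dgamma" + str(l)]
```

The update step shown here is plain gradient descent; momentum, RMSprop, or Adam would work just as well for W, $\beta$ and $\gamma$.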

Why does Batch Normalization work?

  • Normalizing the inputs to every layer (not just the input layer) speeds up learning.
  • It makes the weights in deeper layers more robust to changes in the earlier layers. If the distribution of a layer's inputs changes drastically, for example because the training and test sets have different distributions, the model may need retraining to give good results; this is known as COVARIATE SHIFT. Batch normalization limits how much the distribution of each hidden layer's values can shift around, which makes the deeper layers more stable.
  • Batch normalization has a slight regularization effect. Each mini-batch is scaled by the mean/variance computed on just that mini-batch, which adds some noise to the values $z^{[l]}$ within that mini-batch. So, similar to dropout, it adds some noise to each hidden layer's activations.

Batch Normalization during Test times

During training, a different $\mu$ and $\sigma^2$ are computed for each mini-batch, based on the examples in that mini-batch. At test time we might need to make a prediction on just one data point, so which $\mu$ and $\sigma^2$ should be used?

Solution: estimate $\mu$ and $\sigma^2$ using an exponentially weighted average across all the mini-batches seen during training.

Ex.: For mini-batch {1}, $\mu^{\lbrace 1 \rbrace}$ and $\sigma^{2\lbrace 1 \rbrace}$ are computed
For mini-batch {2}, $\mu^{\lbrace 2 \rbrace}$ and $\sigma^{2\lbrace 2 \rbrace}$ are computed
For mini-batch {3}, $\mu^{\lbrace 3 \rbrace}$ and $\sigma^{2\lbrace 3 \rbrace}$ are computed

Keep an exponentially weighted moving average of these $\mu$ and $\sigma^2$ values across the mini-batches during training, and use the resulting estimates when normalizing at test time.

Finally, at test time,

$Z_{norm}^{(i)} = \frac{Z^{(i)}-\mu}{\sqrt{\sigma^2+\epsilon}}$

$\tilde Z^{(i)} = \gamma Z_{norm}^{(i)} + \beta$

where $Z^{(i)}$ is the test example for which the prediction is to be made, and $\mu$ and $\sigma^2$ are the exponentially weighted averages from training.
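A minimal sketch of this test-time procedure (the decay rate, the layer width `n_units`, and the running-average variable names are all assumptions):

```python
import numpy as np

n_units = 100                                  # width of the layer (assumed)
decay = 0.9                                    # EWMA decay rate (assumed)

running_mu = np.zeros((n_units, 1))
running_sigma2 = np.ones((n_units, 1))

# During training, after computing mu and sigma2 on each mini-batch {t}:
#   running_mu     = decay * running_mu     + (1 - decay) * mu
#   running_sigma2 = decay * running_sigma2 + (1 - decay) * sigma2

def batch_norm_test(z, gamma, beta, eps=1e-8):
    """Normalize a single test example z (shape (n_units, 1)) with the running estimates."""
    z_norm = (z - running_mu) / np.sqrt(running_sigma2 + eps)
    return gamma * z_norm + beta
```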

Written on December 16, 2017