The Gradient descent algorithm runs on the entire (m) training examples before updating the weights W and b for a Neural Network. This can be quite slow if (m) is very large.

An alternative approach is dividing the dataset into multiple parts, and running batch gradient descent on each slice. Each slice of the divided dataset is called a Mini-Batch.

For example, if we have a dataset containing 5,000,000 training examples (m), we can divide it into mini-batches of 1,000 each. This will generate 5,000 mini batches which can run the gradient descent algorithm for each batch.

The key advantage of a mini-batch gradient descent is that it takes the advantage of efficient vectorization of matrices, and is much faster than a batch gradient descent.

Representation:
Dataset : $[(X^{(1)},Y^{(1)}),(X^{(2)},Y^{(2)}),....... , (X^{(m)},Y^{(m)})]$; m=5,000,000

Minibatch: $X^{\{1\}} = [(X^{(1)},Y^{(1)}),....... , (X^{(1000)},Y^{(1000)})]$

or minibatch t : $X^{\{t\}}$ (Dimension : $n_x$x1000

How it works:

For each batch, run the gradient descent algorithm and loop over all the batches for t = 1: 5000
{

• Implement Forward Prop
** $Z^{}= W^{}X^{\{t\}}+b^{}$
** $A^{} = g^{}(Z^{})$
** ..
** $A^{[l]} = g^{[l]}(Z^{[l]})$
• Compute Cost Function
• Implement BackPropagation to compute Gradient
• Update $W^{[l]} = W^{[l]}- \alpha dW^{[l]}$ and $b^{[l]} = b^{[l]}- \alpha db^{[l]}$
} ##### Completion of 1 Epoch

Why it works Selecting mini-batch size

size=m : Batch Gradient Descent (One mini batch only for all the examples) size=1 : Stochastic Gradient Descent (m mini batches)

In practice, mini-batch size is always between 1 and m

SGD algorithms can be extremely noisy, and won’t ever converge. It also loses the benefit of vectorization BGD algorithms tend to be very slow

Best sizes: 64, 128, 256, 512
$X^{\{t\}},Y^{\{t\}}$ should also fit into CPU/GPU memory

Written on December 4, 2017
]