Implementing Mini Batch Gradient Descent
Batch Gradient Descent vs Mini-Batch Gradient Descent
The batch gradient descent algorithm processes all m training examples before making a single update to the weights W and bias b of a neural network. This can be quite slow when m is very large.
An alternative approach is to divide the dataset into multiple parts and run a gradient descent update on each part. Each slice of the divided dataset is called a Mini-Batch.
For example, if we have a dataset containing m = 5,000,000 training examples, we can divide it into mini-batches of 1,000 examples each. This produces 5,000 mini-batches, and gradient descent performs one parameter update per mini-batch.
The key advantage of mini-batch gradient descent is that it keeps the benefit of efficient vectorization within each mini-batch while updating the parameters far more often than batch gradient descent, so it makes progress much faster.
Dataset: X = [x(1), x(2), ..., x(m)], Y = [y(1), y(2), ..., y(m)]; m = 5,000,000
For mini-batch t: X{t}, Y{t} (dimensions: X{t} is n_x × 1000, Y{t} is 1 × 1000)
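The split described above can be sketched in NumPy. The helper name and the small demo shapes here are illustrative, not from the notes; the code keeps the column-per-example convention (X has shape (n_x, m)):

```python
import numpy as np

def make_mini_batches(X, Y, batch_size=1000, seed=0):
    """Shuffle the dataset and split it into mini-batches.

    X has shape (n_x, m) -- one column per example; Y has shape (1, m).
    """
    rng = np.random.default_rng(seed)
    m = X.shape[1]
    perm = rng.permutation(m)            # shuffle the example columns
    X, Y = X[:, perm], Y[:, perm]
    batches = []
    for start in range(0, m, batch_size):
        end = start + batch_size
        batches.append((X[:, start:end], Y[:, start:end]))
    return batches

# Small demo: m = 10,000 examples split into batches of 1,000
X = np.random.randn(5, 10_000)
Y = np.random.randn(1, 10_000)
batches = make_mini_batches(X, Y, batch_size=1000)
print(len(batches))          # 10 mini-batches
print(batches[0][0].shape)   # (5, 1000)
```

Shuffling before splitting matters in practice: it keeps each mini-batch a representative sample of the full dataset.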
How it works:
For each mini-batch, run one gradient descent update, looping over all the batches:
for t = 1 to 5,000 {
- Implement forward propagation on X{t}
- Compute the cost function J{t}
- Implement backpropagation to compute the gradients dW, db
- Update W and b
}
One full pass through all 5,000 mini-batches is the completion of 1 epoch.
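The loop above can be sketched concretely. Here is a minimal NumPy implementation of one epoch of mini-batch gradient descent for logistic regression (a single-neuron network); the helper names, learning rate, and toy dataset are assumptions for illustration, not from the notes:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_one_epoch(X, Y, W, b, batch_size=1000, lr=0.01):
    """One epoch of mini-batch gradient descent for logistic regression.

    X: (n_x, m), Y: (1, m), W: (1, n_x), b: scalar.
    """
    m = X.shape[1]
    for start in range(0, m, batch_size):      # loop over mini-batches t
        Xt = X[:, start:start + batch_size]
        Yt = Y[:, start:start + batch_size]
        mt = Xt.shape[1]
        # Forward prop (vectorized over the whole mini-batch)
        A = sigmoid(W @ Xt + b)
        # Cost for this mini-batch (cross-entropy)
        cost = -np.mean(Yt * np.log(A + 1e-12) + (1 - Yt) * np.log(1 - A + 1e-12))
        # Backprop: gradients of the cost w.r.t. W and b
        dZ = A - Yt
        dW = (dZ @ Xt.T) / mt
        db = np.mean(dZ)
        # Update parameters after every mini-batch
        W = W - lr * dW
        b = b - lr * db
    return W, b, cost

# Demo on a toy linearly separable problem
rng = np.random.default_rng(0)
X = rng.normal(size=(2, 4000))
Y = (X[0] + X[1] > 0).astype(float).reshape(1, -1)
W, b = np.zeros((1, 2)), 0.0
for epoch in range(20):
    W, b, cost = train_one_epoch(X, Y, W, b, batch_size=500, lr=0.5)
```

Note that the parameters move 8 times per epoch here (4,000 / 500 mini-batches), whereas batch gradient descent would move only once per epoch.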
Why it works
Each mini-batch gradient is a noisy but, on average, accurate estimate of the full-batch gradient, so the cost trends downward across updates even though it oscillates from batch to batch.
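One way to see why mini-batch updates work: for a cost that is an average over examples, the average of the mini-batch gradients (over equal-sized batches) equals the full-batch gradient, so each update follows an unbiased estimate of the true descent direction. A quick NumPy check, assuming a linear model with squared-error cost (the shapes and names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
n_x, m, batch_size = 3, 1000, 100
X = rng.normal(size=(n_x, m))
Y = rng.normal(size=(1, m))
W = rng.normal(size=(1, n_x))

def grad(Xb, Yb, W):
    """Gradient of the mean squared error 0.5 * mean((W X - Y)^2) w.r.t. W."""
    err = W @ Xb - Yb
    return (err @ Xb.T) / Xb.shape[1]

full = grad(X, Y, W)                       # full-batch gradient
minis = [grad(X[:, s:s + batch_size], Y[:, s:s + batch_size], W)
         for s in range(0, m, batch_size)] # one gradient per mini-batch
avg = np.mean(minis, axis=0)
print(np.allclose(full, avg))              # True: the average matches
```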
Selecting mini-batch size
size = m : Batch Gradient Descent (a single mini-batch containing all the examples)
size = 1 : Stochastic Gradient Descent (m mini-batches of one example each)
In practice, the mini-batch size is chosen somewhere between 1 and m.
SGD can be extremely noisy and may never fully converge (it keeps oscillating around the minimum), and it loses the benefit of vectorization. BGD tends to be very slow, since m examples must be processed for every single update.
Typical sizes are powers of 2: 64, 128, 256, 512.
A mini-batch should also fit into CPU/GPU memory.
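As a rough illustration of the memory constraint (the numbers and helper are assumptions, not from the notes): a mini-batch of float32 inputs occupies about batch_size × n_x × 4 bytes, so a candidate size can be sanity-checked against available memory:

```python
def mini_batch_bytes(batch_size, n_x, bytes_per_value=4):
    """Approximate memory for one mini-batch of float32 inputs
    (batch_size examples, each with n_x features)."""
    return batch_size * n_x * bytes_per_value

# Example: 512 examples with 10,000 features each, float32
print(mini_batch_bytes(512, 10_000) / 1e6, "MB")  # ~20.5 MB of input data
```

Activations and gradients at every layer add to this, so the true footprint is a multiple of the input size.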