Vanishing Gradients with RNNs
The input sequence to an RNN can be very long, and it's quite possible that inputs at the beginning of the sequence determine outputs much later. By the time the error from those later outputs is backpropagated to the early time steps, the gradient is often too weak to have any effect on learning.
For example, the two sentences below show that an early part of the input (“cat” or “cats”) decides whether to use “was” or “were” much later in the sequence, and there could be a long string of words in between.
(1) The cat, which already ate……………., was full
(2) The cats, which already ate……………, were full
In a plain RNN, whether “was” or “were” is correct depends on whether the subject was “cat” or “cats”, but by the time the network reaches the verb it has largely forgotten the words at the beginning. Plain RNNs do not capture long-term dependencies very well. It is very difficult for the error to backpropagate all the way to the first few time steps, so in practice an output is influenced mainly by the inputs close to it, not by inputs that appeared much earlier in the sequence.
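The shrinking of the gradient can be seen in a small numerical sketch. The example below assumes a scalar tanh RNN, h_t = tanh(w * h_{t-1} + u * x_t), which is a simplification chosen for illustration; the function name, the weight values, and the alternating input are hypothetical and not taken from these notes. It tracks how strongly the last hidden state still depends on each earlier hidden state.

```python
import math

def gradient_through_time(w, u, inputs):
    """Scalar tanh RNN: h_t = tanh(w * h_{t-1} + u * x_t).

    Returns |dh_T / dh_t| for t = T, T-1, ..., 1, i.e. how strongly the
    last hidden state still depends on each earlier one.
    """
    h = 0.0
    hidden_states = []
    for x in inputs:                       # forward pass
        h = math.tanh(w * h + u * x)
        hidden_states.append(h)

    grad = 1.0                             # dh_T / dh_T
    magnitudes = [abs(grad)]
    for h_t in reversed(hidden_states[1:]):
        # One step of the chain rule: dh_t / dh_{t-1} = w * (1 - h_t^2)
        grad *= w * (1.0 - h_t ** 2)
        magnitudes.append(abs(grad))
    return magnitudes

# Hypothetical setup: recurrent weight 0.5, 50 time steps of alternating input.
inputs = [1.0 if t % 2 == 0 else -1.0 for t in range(50)]
magnitudes = gradient_through_time(w=0.5, u=1.0, inputs=inputs)
print(f"dependence on the last step:  {magnitudes[0]:.3e}")
print(f"dependence on the first step: {magnitudes[-1]:.3e}")
```

Each backward step multiplies the gradient by w * (1 - h_t^2), whose magnitude here is at most 0.5, so after 49 steps the contribution of the first time step is negligible. With |w| larger than 1 the same product can instead grow, which is the exploding case.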
Solution: Gated Recurrent Units (GRUs) and Long Short-Term Memory (LSTM) networks.
Exploding Gradient Problem
Gradients can also grow very large during backpropagation, which is called the exploding gradient problem. It's common to see NaNs when the gradients become too large. Gradient clipping is used to address this: if a gradient value falls outside a chosen range (say -5 to +5), it is clipped to the nearest threshold value.
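A minimal sketch of clipping by value, assuming the gradients are available as a flat list of numbers and using the -5 to +5 range mentioned above. The function name `clip_gradients` is hypothetical; deep learning frameworks ship their own clipping utilities, which would normally be used instead.

```python
def clip_gradients(gradients, threshold=5.0):
    """Clip every gradient value into the range [-threshold, threshold]."""
    return [max(-threshold, min(threshold, g)) for g in gradients]

# Values inside the range pass through; values outside snap to the nearest bound.
clipped = clip_gradients([0.3, -12.0, 7.5, -4.9, 250.0])
print(clipped)  # [0.3, -5.0, 5.0, -4.9, 5.0]
```

Clipping is applied after the gradients are computed and before the weight update, so it limits the size of each update step without changing the sign of any gradient value.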