# Language Models and Sequence Generation

A language model estimates the probability of a sentence (a sequence of words) occurring, which makes it possible to compare different candidate sentences or variants of a similar sentence.

P($y^{\langle 1 \rangle}$, $y^{\langle 2 \rangle}$,…, $y^{\langle t \rangle}$) = P(sentence) = ?

For example, in a speech recognition system, suppose an audio clip is given as input ($x^{\langle t \rangle}…$) and the following two sentences are generated as candidate outputs:
(a) The apple and pair salad
(b) The apple and pear salad

The language model estimates the probability of both (a) and (b) occurring and provides a numerical value for each, so that the speech recognition system can select the sentence with the maximum probability.

P(a) = $3.2 \times 10^{-13}$
P(b) = $1.7 \times 10^{-10}$
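
As a small illustration of that selection step, the snippet below picks the candidate with the highest probability. The dictionary layout and variable names are assumptions made for this sketch; only the two sentences and their probabilities come from the example above.

```python
# Candidate transcriptions with the language-model probabilities from the example.
candidates = {
    "The apple and pair salad": 3.2e-13,
    "The apple and pear salad": 1.7e-10,
}

# Select the sentence the language model considers most likely.
best = max(candidates, key=candidates.get)
print(best)
```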

### Building the Model

Training set: Large corpus of English text

Sentence: Cats average 15 hours of sleep a day

Step 1: Tokenize - Map each word to its corresponding one-hot vector based on its position in the vocabulary array. Also add an extra token $\langle$EOS$\rangle$ at the end of the sentence.

$y^{\langle 1 \rangle}$ = Cats
$y^{\langle 2 \rangle}$ = average
$y^{\langle 3 \rangle}$ = 15
$y^{\langle 4 \rangle}$ = hours
.
.
$y^{\langle 9 \rangle}$ = $\langle$EOS$\rangle$

If a word is not in the vocabulary, replace it with an unknown token $\langle$UNK$\rangle$.
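
The tokenization step can be sketched in a few lines of Python. The toy vocabulary, the lower-casing, and the helper names `tokenize` and `one_hot` are assumptions made for illustration; a real vocabulary would hold thousands of words.

```python
import numpy as np

# Hypothetical toy vocabulary (an assumption for this sketch).
vocab = ["a", "average", "cats", "day", "hours", "of", "sleep", "15", "<UNK>", "<EOS>"]
word_to_index = {word: i for i, word in enumerate(vocab)}

def tokenize(sentence):
    """Lower-case, split on whitespace, append <EOS>, and map words to indices."""
    words = sentence.lower().split() + ["<EOS>"]
    unk = word_to_index["<UNK>"]
    # Words missing from the vocabulary fall back to the <UNK> index.
    return [word_to_index.get(word, unk) for word in words]

def one_hot(indices, vocab_size):
    """Return a (len(indices), vocab_size) array with one 1 per row."""
    vectors = np.zeros((len(indices), vocab_size))
    vectors[np.arange(len(indices)), indices] = 1.0
    return vectors

indices = tokenize("Cats average 15 hours of sleep a day")
Y = one_hot(indices, len(vocab))
print(indices)
print(Y.shape)
```

Each row of `Y` corresponds to one $y^{\langle t \rangle}$, with the last row being the $\langle$EOS$\rangle$ token.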

Step 2: RNN Model

Explanation of the first unit of the RNN:

The inputs to the first RNN unit, $x^{\langle 1 \rangle}$ and $a^{\langle 0 \rangle}$, are both vectors of zeros.

$\hat y^{\langle 1 \rangle}$ holds the softmax probabilities of each word in the dictionary being the first word: P(a), P(aaron), …, P(cats), …, P(zulu), P(UNK), P(EOS).

During training, the correct first word from the training sentence is then passed on as the input to the second unit. Thus, $x^{\langle 2 \rangle}$ = $y^{\langle 1 \rangle}$. Similarly, $x^{\langle 3 \rangle}$ = $y^{\langle 2 \rangle}$ and so on.

The output unit $\hat y^{\langle 1 \rangle}$ estimates the softmax probability of every word in the dictionary. The second unit receives the actual first word, “cats”, as its input, so $\hat y^{\langle 2 \rangle}$ will be the probability of each word in the dictionary given that the first word is “cats”. The third unit's output $\hat y^{\langle 3 \rangle}$ will be the probability of each word given that the first two words are “cats” and “average”.

Thus, given an initial sequence of inputs, the RNN predicts the probability of the next word in the sequence given the preceding words. Multiplying these conditional probabilities together gives the probability of the whole sentence.
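
A minimal NumPy sketch of this forward pass is given below. It assumes a basic tanh RNN cell with hypothetical parameter names (`Waa`, `Wax`, `Wya`, `ba`, `by`) and untrained random weights, none of which come from the text above. It also assumes that the correct previous word from the training sentence is fed as the next input, so that each prediction is conditioned on the actual preceding words.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def rnn_forward(Y, Waa, Wax, Wya, ba, by):
    """Y: (T, V) one-hot training sentence. Returns a (T, V) array of predictions."""
    T, V = Y.shape
    a = np.zeros(Waa.shape[0])  # initial activation is all zeros
    x = np.zeros(V)             # first input is all zeros
    y_hats = []
    for t in range(T):
        a = np.tanh(Waa @ a + Wax @ x + ba)
        y_hats.append(softmax(Wya @ a + by))
        x = Y[t]                # the true word at this step is the next input
    return np.array(y_hats)

# Untrained random weights, for illustration only.
rng = np.random.default_rng(0)
V, n_a = 10, 16
Y = np.eye(V)[[2, 1, 7, 4, 5, 6, 0, 3, 9]]  # a 9-token one-hot sentence
y_hats = rnn_forward(
    Y,
    rng.normal(scale=0.1, size=(n_a, n_a)),
    rng.normal(scale=0.1, size=(n_a, V)),
    rng.normal(scale=0.1, size=(V, n_a)),
    np.zeros(n_a),
    np.zeros(V),
)
print(y_hats.shape)
```

Each row of `y_hats` is a probability distribution over the vocabulary for one time step.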

The Loss Function

The loss at time step $t$ is defined as:

$L^{\langle t \rangle}(\hat y^{\langle t \rangle}, y^{\langle t \rangle}) = -\sum_i y_i^{\langle t \rangle} \log \hat y_i^{\langle t \rangle}$

Total loss: Sum over all the time steps
$L = \sum_t L^{\langle t \rangle}(\hat y^{\langle t \rangle}, y^{\langle t \rangle})$
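
The loss can be computed directly from the one-hot targets and the softmax outputs. The sketch below uses made-up numbers for a three-word vocabulary and two time steps; the function name `sequence_loss` and the small `eps` guard against `log(0)` are assumptions of this example.

```python
import numpy as np

def sequence_loss(y_hat, y, eps=1e-12):
    """y_hat, y: (T, V) arrays. Returns the cross-entropy summed over all time steps."""
    per_step = -np.sum(y * np.log(y_hat + eps), axis=1)  # one loss value per time step
    return per_step.sum()

# Made-up targets and predictions for illustration.
y = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])
y_hat = np.array([[0.5, 0.25, 0.25],
                  [0.1, 0.8, 0.1]])
loss = sequence_loss(y_hat, y)
print(loss)
```

Because each target is one-hot, only the predicted probability of the correct word contributes at each step, so here the loss reduces to $-\log 0.5 - \log 0.8$.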

### Sampling a sequence from a trained RNN

Function used:

np.random.choice()


The network is trained on the input strings as shown above. To sample an output, a word is drawn at random from the distribution produced by the output unit. For example, to select the first word, the softmax unit outputs a probability for each word in the vocabulary: P(a), P(aaron), …, P(cat), …, P(zulu), P(UNK). One word is sampled according to these probabilities and then fed in as the input to the second unit, $x^{\langle 2 \rangle}$ = $\hat y^{\langle 1 \rangle}$. This process is repeated at each subsequent step until an EOS token is sampled.
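
The sampling loop might look like the sketch below, which uses `np.random.choice` with its `p` argument to draw a word index according to the softmax output. The tanh cell, the parameter names, the untrained random weights, and the `max_len` cutoff (a guard in case EOS is never sampled) are all assumptions of this example.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())  # subtract the max for numerical stability
    return e / e.sum()

def sample_indices(Waa, Wax, Wya, ba, by, eos_index, max_len=20):
    """Sample word indices one step at a time until EOS or max_len is reached."""
    vocab_size = Wax.shape[1]
    a = np.zeros(Waa.shape[0])  # initial activation is all zeros
    x = np.zeros(vocab_size)    # first input is all zeros
    sampled = []
    for _ in range(max_len):
        a = np.tanh(Waa @ a + Wax @ x + ba)
        y_hat = softmax(Wya @ a + by)
        idx = int(np.random.choice(vocab_size, p=y_hat))  # draw according to y_hat
        sampled.append(idx)
        if idx == eos_index:
            break
        x = np.zeros(vocab_size)  # one-hot of the sampled word is the next input
        x[idx] = 1.0
    return sampled

# Untrained random weights, for illustration only.
np.random.seed(0)
V, n_a = 10, 16
params = (np.random.randn(n_a, n_a) * 0.1, np.random.randn(n_a, V) * 0.1,
          np.random.randn(V, n_a) * 0.1, np.zeros(n_a), np.zeros(V))
indices = sample_indices(*params, eos_index=V - 1)
print(indices)
```

With a trained network, the sampled indices would be mapped back to words through the vocabulary to produce a sentence.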

This is represented in the figure below

### Character level language models

In character level language models, inputs are individual characters instead of words.

Vocabulary = [a,b,c,….. , , .,;,”,!,@,#,\$,0,1,2,3,…8,9, A,B,C…Z]

There is no need to worry about unknown words. However, these models require more input data and work with longer sequences, and they are much more computationally expensive to train.
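
To illustrate how much longer the sequences become, the sketch below tokenizes the same sentence at word level and at character level. The exact character set (ASCII letters, digits, space, and punctuation) and the `<EOS>` entry are assumptions of this example.

```python
import string

# Assumed character vocabulary: letters, digits, space, punctuation, plus <EOS>.
vocab = list(string.ascii_lowercase + string.ascii_uppercase
             + string.digits + " " + string.punctuation) + ["<EOS>"]
char_to_index = {c: i for i, c in enumerate(vocab)}

def tokenize_chars(text):
    """Map every character to its vocabulary index and append <EOS>."""
    return [char_to_index[c] for c in text] + [char_to_index["<EOS>"]]

sentence = "Cats average 15 hours of sleep a day"
char_tokens = tokenize_chars(sentence)
word_tokens = sentence.split() + ["<EOS>"]
print(len(word_tokens), len(char_tokens))  # prints: 9 37
```

The same sentence takes 9 time steps at word level but 37 at character level, which is one reason character-level models are more expensive to train.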

Written on February 17, 2018