Word Embeddings & Embedding Matrix

Word embeddings are at the core of applying RNNs to Natural Language Processing tasks. Embeddings convert words and sentences into numbers that the computer can not only understand but also use for NLP tasks such as the analogy Man:Woman = King:? -> Queen.

Word Representation

One Hot Representation

Vocabulary V = [a, aaron, ……, zulu, UNK]

The words are represented as one-hot vectors based on their position in the vocabulary. For example, the words and their positions could be (just an example, not actual): Man (5391), Woman (9853), King (4914), Queen (7157), Apple (456).

These words are represented as the one-hot vectors $O_{5391}$, $O_{9853}$, $O_{4914}$, $O_{7157}$, $O_{456}$.

$O_{5391}$ = $\begin{bmatrix} 0\cr 0\cr \vdots\cr 1\cr \vdots\cr 0 \end{bmatrix}$ — the 1 is at the 5391st position.
$O_{9853}$ = $\begin{bmatrix} 0\cr 0\cr \vdots\cr 1\cr \vdots\cr 0 \end{bmatrix}$ — the 1 is at the 9853rd position.

One shortcoming of this representation is that it treats every pair of words as equally different, and so doesn't generalize across words. For example, after learning the sentence “I want a glass of orange juice”, the algorithm cannot complete the different but similar sentence “I want a glass of apple ?????”. It cannot figure out on its own that apples and oranges are similar, and that the missing word in the second sentence is therefore probably also “juice”. This is because the inner product between any two distinct one-hot vectors is zero, and the Euclidean distance between them is the same for every pair.
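A minimal sketch of this problem, using numpy and the illustrative word positions above: every pair of distinct one-hot vectors has inner product 0, so the representation carries no notion of similarity.

```python
import numpy as np

V = 10_000  # assumed vocabulary size

def one_hot(index, size=V):
    """Return a one-hot vector with a 1 at `index` and 0 elsewhere."""
    o = np.zeros(size)
    o[index] = 1.0
    return o

o_man = one_hot(5391)
o_woman = one_hot(9853)

# Inner product between any two distinct one-hot vectors is 0,
# so "man" looks no more similar to "woman" than to any other word.
print(o_man @ o_woman)  # 0.0
print(o_man @ o_man)    # 1.0 (a word only "matches" itself)
```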

Featurized Representation: Word Embedding

For every word in the dictionary, we could learn a number associated with each feature. For example, for a feature “Gender”, the value for each word could be Man (-1), Woman (+1), King (-0.97), Queen (0.95), Apple (0.00), Orange (0.01), and so on. The same would be calculated for all the words across all the features, such as Gender, Royal, Age, Food, Size, Cost, Verb, Noun, Action, etc.

[Image: table of feature values for each word]

If there are (say) 300 features, each column becomes a 300x1 vector representing a word such as “Man”. This is quite different from the one-hot encoded vector, which was 10000x1 with all values 0 except a single 1 at the word's position. Stacking the vectors for all the words gives a 300x10000 embedding matrix.

The notation $e_{5391}$ is used for the embedding of “Man”. Similarly, $e_{9853}$ represents “Woman”.

Notice that in this representation the vectors for “Orange” and “Apple” are quite similar: most features match, except perhaps a color feature. This increases the odds that the learning algorithm discovers that apples and oranges are similar things, which enables it to generalize better across words.
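A small sketch of this idea with made-up 4-feature embeddings (Gender, Royal, Age, Food) in the spirit of the table above — the values are illustrative, not learned:

```python
import numpy as np

# Toy 4-feature embeddings (Gender, Royal, Age, Food); values are illustrative.
emb = {
    "man":    np.array([-1.00,  0.01,  0.03, 0.09]),
    "woman":  np.array([ 1.00,  0.02,  0.02, 0.01]),
    "king":   np.array([-0.95,  0.93,  0.70, 0.02]),
    "apple":  np.array([ 0.00, -0.01,  0.03, 0.95]),
    "orange": np.array([ 0.01,  0.00, -0.02, 0.97]),
}

def dist(a, b):
    """Euclidean distance between two word embeddings."""
    return np.linalg.norm(emb[a] - emb[b])

# "apple" sits much closer to "orange" than to "king",
# unlike one-hot vectors, where all pairs are equally far apart.
print(dist("apple", "orange"))
print(dist("apple", "king"))
```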

Visualizing word embeddings in a 2-D space: t-SNE algorithm representation

[Image: t-SNE plot of word embeddings in 2-D]

Similar things are grouped together in the t-SNE representation

Using Word Embeddings

Example: Named Entity Recognition

(1) Sally Johnson is an orange farmer (Training)
(2) Robert Lin is an apple farmer

In (1), the model is trained with “Sally Johnson” as a named entity. In (2), the algorithm sees that “apple” is similar to “orange”, so “Robert Lin” plays the same role as “Sally Johnson” and must also be a named entity.

(3) Robert Lin is a durian cultivator

The model can identify from the Embedding matrix that “durian” is similar to “apple” and “cultivator” is similar to “farmer”

Typically, embedding matrices are trained on very large corpora, often over 100 billion words. Transfer learning is therefore used in many named entity recognition tasks:


  1. Learn word embeddings from large text corpus (1-100Bn words) or download pre-trained embeddings online
  2. Transfer embeddings to new tasks with smaller training set (say 100k words).
  3. Optional: Continue to finetune the word embeddings with new data
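Step 1–2 above can be sketched as loading pre-trained embeddings from a GloVe-style text file (`word v1 v2 ... vn` per line). The file contents here are a tiny made-up stand-in for a real pre-trained download:

```python
from io import StringIO

# Stand-in for a real pre-trained embedding file (GloVe-style text format).
pretrained = StringIO(
    "orange 0.01 0.00 -0.02 0.97\n"
    "apple 0.00 -0.01 0.03 0.95\n"
)

def load_embeddings(fh):
    """Parse each line 'word v1 v2 ... vn' into word -> list of floats."""
    emb = {}
    for line in fh:
        word, *values = line.split()
        emb[word] = [float(v) for v in values]
    return emb

emb = load_embeddings(pretrained)
print(emb["apple"])  # [0.0, -0.01, 0.03, 0.95]
```

The resulting dictionary can then be used to initialize the embedding layer of a smaller task-specific model and, optionally, fine-tuned.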

Relation to face recognition: this is similar to face recognition with a Siamese network, where an encoding is learned for a specific face and compared to the encoding of a different face.

For each of the 10,000 words in the vocabulary, an embedding vector $e_1$, $e_2$, …, $e_{10,000}$ can be learned.

Properties of Word Embeddings

Word embeddings are good at detecting analogies, which essentially measure how similar one vector is to another.

For example, how does the embedding learn the analogy Man:Woman $\approx$ King:Queen?

Let's assume each word is represented by a 4-dimensional vector, as shown in the table below.

Embedding Matrix

Man (say $e_{5391}$) = [ -1 0.01 0.03 0.09]

Similarly, the vectors for Woman, King, and Queen are built in the same way.

Representing the vectors as $e_{Man}$,$e_{Woman}$,$e_{King}$ and $e_{Queen}$

$e_{Man} - e_{Woman} \approx \begin{bmatrix} -2\cr 0\cr 0\cr 0 \end{bmatrix}$

$e_{King} - e_{Queen} \approx \begin{bmatrix} -2\cr 0\cr 0\cr 0 \end{bmatrix}$

Thus, we can conclude that $e_{Man}$ - $e_{Woman}$ $\approx$ $e_{King}$ - $e_{Queen}$

We can rearrange the above equation slightly: to find the word completing Man:Woman = King:??,

$e_{Man}$ - $e_{Woman}$ $\approx$ $e_{King}$ - $e_{WORD}$,

or, $e_{WORD}$ $\approx$ $e_{King}$ - $e_{Man}$ + $e_{Woman}$
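This arithmetic can be checked numerically with the same illustrative 4-d vectors (Gender, Royal, Age, Food) — a sketch, not learned values:

```python
import numpy as np

# Illustrative 4-d embeddings (Gender, Royal, Age, Food); not learned values.
e = {
    "man":   np.array([-1.00, 0.01, 0.03, 0.09]),
    "woman": np.array([ 1.00, 0.02, 0.02, 0.01]),
    "king":  np.array([-0.95, 0.93, 0.70, 0.02]),
    "queen": np.array([ 0.97, 0.95, 0.69, 0.01]),
}

# Both differences are roughly [-2, 0, 0, 0]: gender flips, all else is equal.
print(e["man"] - e["woman"])
print(e["king"] - e["queen"])

# Rearranged: e_king - e_man + e_woman lands close to e_queen.
target = e["king"] - e["man"] + e["woman"]
print(np.linalg.norm(target - e["queen"]))  # small
```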


Mathematically speaking, to find the word $w$: $\arg\max_w \; \text{sim}(e_w, e_{King} - e_{Man} + e_{Woman})$.

Here sim is a similarity function, typically cosine similarity.

Cosine Similarity

For calculation of sim($e_w$, $e_{King}$ - $e_{Man}$ + $e_{Woman}$), Cosine similarity is used which is defined as

sim(u,v) = $\frac {u^Tv}{||u||_2 ||v||_2}$
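Putting the pieces together, a minimal sketch of the full analogy search — cosine similarity plus the argmax over a toy vocabulary (same illustrative 4-d vectors as before; a real system would search 300-d learned vectors over the whole vocabulary):

```python
import numpy as np

def cosine_sim(u, v):
    """sim(u, v) = u.v / (||u||_2 ||v||_2)"""
    return u @ v / (np.linalg.norm(u) * np.linalg.norm(v))

# Illustrative embeddings; a real model would use ~300-d learned vectors.
e = {
    "man":   np.array([-1.00,  0.01, 0.03, 0.09]),
    "woman": np.array([ 1.00,  0.02, 0.02, 0.01]),
    "king":  np.array([-0.95,  0.93, 0.70, 0.02]),
    "queen": np.array([ 0.97,  0.95, 0.69, 0.01]),
    "apple": np.array([ 0.00, -0.01, 0.03, 0.95]),
}

# arg max_w sim(e_w, e_king - e_man + e_woman), excluding the query words.
target = e["king"] - e["man"] + e["woman"]
best = max(
    (w for w in e if w not in {"king", "man", "woman"}),
    key=lambda w: cosine_sim(e[w], target),
)
print(best)  # queen
```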

Embedding Matrix

Assumption: Assume that the vocabulary contains 10000 words, and there are 300 features in the embedding matrix

Matrix E (Embedding Matrix): 300 x 10000

$O_j$: one-hot vector representation of word j (10000 x 1)

The embedding for word j is then the vector $e_j = E \cdot O_j$ (300 x 1).

In practice, multiplying by a one-hot vector is wasteful (it just selects column j of E), so specialized lookup functions are used to retrieve word embeddings.
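A quick numerical check of $e_j = E \cdot O_j$, using a randomly initialized embedding matrix with the dimensions assumed above:

```python
import numpy as np

rng = np.random.default_rng(0)
n_features, vocab = 300, 10_000
E = rng.standard_normal((n_features, vocab))  # embedding matrix (300 x 10000)

j = 5391
O_j = np.zeros(vocab)
O_j[j] = 1.0

# Multiplying E by the one-hot vector O_j selects column j of E...
e_j = E @ O_j
# ...which is why frameworks use a direct index lookup (E[:, j]) instead.
print(np.array_equal(e_j, E[:, j]))  # True
```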


Written on February 18, 2018