GloVe word vectors

GloVe stands for Global Vectors for word representation. Previously we were picking context (c) and target (t) words within a window somewhat at random. GloVe makes this selection explicit.

$X_{ij}$ = number of times word $i$ appears in the context of word $j$. Here $i$ and $j$ play the roles of target (t) and context (c). One definition of context is whether or not two words appear within the same window of ±10 words. Context can be defined in other ways too; for example, the context might be the word coming just before another word.
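
As a concrete illustration, here is a minimal sketch (plain Python; the function name and the toy sentence are my own, not from the course) of how the co-occurrence counts $X_{ij}$ could be accumulated with a symmetric ±10-word window:

```python
from collections import defaultdict

def cooccurrence_counts(tokens, window=10):
    """Count X[i][j]: how often word j appears within +/- `window`
    positions of word i in a tokenized corpus."""
    X = defaultdict(lambda: defaultdict(float))
    for pos, word_i in enumerate(tokens):
        lo = max(0, pos - window)
        hi = min(len(tokens), pos + window + 1)
        for ctx_pos in range(lo, hi):
            if ctx_pos != pos:
                X[word_i][tokens[ctx_pos]] += 1.0
    return X

# Toy usage on a made-up sentence
tokens = "i want a glass of orange juice to go along with my cereal".split()
X = cooccurrence_counts(tokens)
print(X["orange"]["juice"], X["juice"]["orange"])   # symmetric: both 1.0
```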

Also, depending on how the context is defined, $X_{ij} = X_{ji}$ (a symmetric relationship). This holds for the ±10-word window, but not for a one-directional definition such as “the word immediately before”.

$X_{ij}$ is a count of how often words $i$ and $j$ appear together. The GloVe model optimizes the following:

Objective: capture how related words $i$ and $j$ are (how often they appear close to each other).

Define $f(X_{ij})$ such that $f(X_{ij}) = 0$ if $X_{ij} = 0$. $f(X_{ij})$ should also be chosen so that it neither gives too much weight to very frequent words (the, a, of, this, etc.) nor too little weight to infrequent words (durian, Maui).

Also, in the equation below, when $X_{ij} = 0$, $\log X_{ij}$ is undefined. But $f(X_{ij}) = 0$ in this case, so we use the convention $0 \log 0 = 0$ and the term contributes nothing.
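
The original GloVe paper chooses $f$ to be a clipped power function; below is a small sketch of that choice, using the values reported in the paper ($x_{max} = 100$, $\alpha = 3/4$). The Python function name is just illustrative:

```python
def glove_weight(x, x_max=100.0, alpha=0.75):
    """Weighting f(X_ij): 0 for a zero count, grows sub-linearly with the
    count, and is capped at 1 so very frequent pairs are not over-weighted."""
    if x <= 0:
        return 0.0          # ensures the f(0) * log(0) case contributes nothing
    return min((x / x_max) ** alpha, 1.0)
```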

minimize $\sum_{i=1}^{10000} \sum_{j=1}^{10000} f(X_{ij})(\theta_i^T e_j + b_i + b_j' - \log X_{ij})^2$

$b_i$ and $b_j'$ are the corresponding bias terms for $i$ (target) and $j$ (context).
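
Written out in code, the weighted least-squares objective might look like the sketch below; the dense NumPy arrays, the loop over a small vocabulary, and the inlined weighting function are assumptions for illustration, not a reference implementation:

```python
import numpy as np

def glove_loss(X, theta, e, b, b_prime, x_max=100.0, alpha=0.75):
    """Weighted least-squares objective:
    sum_ij f(X_ij) * (theta_i . e_j + b_i + b'_j - log X_ij)^2."""
    total = 0.0
    V = X.shape[0]
    for i in range(V):
        for j in range(V):
            if X[i, j] > 0:                          # f(0) terms contribute 0
                f = min((X[i, j] / x_max) ** alpha, 1.0)
                diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
                total += f * diff ** 2
    return total
```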

The parameters $\theta_i$, $e_j$, $b_i$ and $b_j'$ are learnt using gradient descent. $\theta_i$ and $e_j$ play symmetric roles because of the symmetric definition of $X_{ij}$, so the final embedding vector for a word $w$ is taken as the average $e_w^{(final)} = \frac{e_w + \theta_w}{2}$.
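
A minimal gradient-descent sketch, assuming the same setup as above: the gradients of each squared term are written out by hand, and the last line averages $\theta_w$ and $e_w$ as in the equation. The hyperparameters (dimension 50, learning rate 0.05, 100 epochs) are placeholders, not values from the course:

```python
import numpy as np

def train_glove(X, dim=50, lr=0.05, epochs=100, x_max=100.0, alpha=0.75, seed=0):
    """Plain SGD over the non-zero pairs; X is a dense (V, V) count matrix."""
    rng = np.random.default_rng(seed)
    V = X.shape[0]
    theta = rng.normal(scale=0.1, size=(V, dim))   # "target" vectors theta_i
    e = rng.normal(scale=0.1, size=(V, dim))       # "context" vectors e_j
    b = np.zeros(V)                                # biases b_i
    b_prime = np.zeros(V)                          # biases b'_j

    pairs = [(i, j) for i in range(V) for j in range(V) if X[i, j] > 0]
    for _ in range(epochs):
        for i, j in pairs:
            f = min((X[i, j] / x_max) ** alpha, 1.0)
            diff = theta[i] @ e[j] + b[i] + b_prime[j] - np.log(X[i, j])
            grad = 2.0 * f * diff                  # d/d(diff) of f * diff^2
            g_theta, g_e = grad * e[j], grad * theta[i]
            theta[i] -= lr * g_theta
            e[j] -= lr * g_e
            b[i] -= lr * grad
            b_prime[j] -= lr * grad

    # theta and e play symmetric roles, so average them for the final embeddings
    return (theta + e) / 2.0
```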

Features learnt by the embedding matrix

While the example shows the features in the embedding matrix as ‘age’ or ‘gender’, in reality these features are learnt by the algorithm and may not be individually interpretable.

Written on February 20, 2018