Word2Vec Algorithm: Skip-gram and Continuous Bag of Words

Word2vec is a group of related models that are used to produce word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct linguistic contexts of words. Word2vec takes as its input a large corpus of text and produces a vector space, typically of several hundred dimensions, with each unique word in the corpus being assigned a corresponding vector in the space. Word vectors are positioned in the vector space such that words that share common contexts in the corpus are located in close proximity to one another in the space.

Word2vec was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has been subsequently analysed and explained by other researchers. Embedding vectors created using the Word2vec algorithm have many advantages compared to earlier algorithms such as latent semantic analysis.

Word2vec can utilize either of two model architectures to produce a distributed representation of words: continuous bag-of-words (CBOW) or continuous skip-gram. In the continuous bag-of-words architecture, the model predicts the current word from a window of surrounding context words. The order of context words does not influence prediction (bag-of-words assumption). In the continuous skip-gram architecture, the model uses the current word to predict the surrounding window of context words. The skip-gram architecture weighs nearby context words more heavily than more distant context words. According to the authors’ note, CBOW is faster while skip-gram is slower but does a better job for infrequent words.

[Figure: CBOW and Skip-Gram model architectures]

Skip-Grams

Sentence: I want a glass of orange juice to go along with my cereal.

Choosing Context and Target: Choose a window of +/-5 words, and randomly pick a context word and a target word within that window. We then set up a supervised learning problem where we are required to predict the target word given the context word. This prediction task itself might not be very useful, but solving it helps in estimating a good embedding matrix.
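The sampling step above can be sketched as follows. This is a minimal illustration (the function name and the toy sentence are my own, not from the original implementation):

```python
import random

# Hypothetical sketch: sampling a (context, target) pair for skip-gram
# from a tokenized sentence with a +/-5 word window.
sentence = "I want a glass of orange juice to go along with my cereal".split()

def sample_pair(tokens, window=5):
    # Pick a context position at random, then a target position
    # within +/-window around it (excluding the context itself).
    c = random.randrange(len(tokens))
    lo = max(0, c - window)
    hi = min(len(tokens) - 1, c + window)
    t = c
    while t == c:
        t = random.randint(lo, hi)
    return tokens[c], tokens[t]

context, target = sample_pair(sentence)
```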


Vocab size: 10,000

Context to target prediction

Context (c) : Orange
Target (t): Juice

For one training example, let $O_c$ be the one-hot representation of the context word. This is multiplied by an embedding matrix $E$ to arrive at the embedding vector $e_c$. The embedding vector $e_c$ is then fed into a softmax unit to arrive at a prediction $\hat y$.
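The one-hot-times-matrix step amounts to selecting one column of $E$. A toy sketch (the sizes and the word index 6257 are illustrative stand-ins, not values from the original):

```python
import numpy as np

# Sketch of the embedding lookup: multiplying a one-hot vector O_c by the
# embedding matrix E just selects the column of E for that word.
vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(0)
E = rng.standard_normal((embed_dim, vocab_size))  # embedding matrix

c = 6257                 # hypothetical index of the context word "orange"
O_c = np.zeros(vocab_size)
O_c[c] = 1.0             # one-hot representation of the context word

e_c = E @ O_c            # embedding vector for the context word
```

In practice the matrix product is never computed; libraries index the column directly.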

Softmax Model:

$p(t|c) = \frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10{,}000} e^{\theta_j^T e_c}}$

$\theta_t$ is the parameter vector associated with output word $t$.

Loss Function for Softmax Model

$L(\hat y, y) = -\sum_{i=1}^{10{,}000} y_i \log \hat y_i$

The target $y$ is represented as a 10,000-dimensional one-hot vector, e.g. $\begin{bmatrix} 0 & 0 & \cdots & 1 & \cdots & 0 \end{bmatrix}^T$, with a 1 in the position of the target word.

$\hat y$ will also be a 10,000-dimensional vector, with the predicted probability for each of the 10,000 words as its elements.
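The softmax prediction and its cross-entropy loss can be sketched numerically. The parameter shapes and the target index 4834 are illustrative assumptions:

```python
import numpy as np

# Toy sketch of the softmax model p(t|c) and its cross-entropy loss,
# with theta stored as a (vocab_size x embed_dim) parameter matrix.
vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(1)
theta = rng.standard_normal((vocab_size, embed_dim)) * 0.01
e_c = rng.standard_normal(embed_dim)     # embedding of the context word

logits = theta @ e_c
logits -= logits.max()                   # subtract max for numerical stability
y_hat = np.exp(logits) / np.exp(logits).sum()  # softmax over all 10k words

t = 4834                                 # hypothetical index of target "juice"
loss = -np.log(y_hat[t])                 # cross-entropy with a one-hot target
```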

Optimizing this network yields a pretty good approximation of the embedding matrix $E$.

Problems with Skip-Gram Model

  1. Computational Complexity: The denominator in $p(t|c) = \frac{e^{\theta_t^T e_c}}{\sum_{j=1}^{10{,}000} e^{\theta_j^T e_c}}$ sums over all 10,000 words in the vocabulary. At 10k this might not be too bad, but with a vocabulary of 1Bn or 100Bn words the sum takes an enormous amount of time.

Solution: Use a Hierarchical Softmax Classifier (a tree-based model for classification).

How to sample a context $c$: Once a context is sampled, the target word can be sampled within a window of +/-5 or +/-10 around it. But how do we sample the context?

$c$ can be sampled from a uniform random distribution, but in that case words like "the", "a", "of", etc. will appear too frequently, and the model will spend most of its updates on these common words rather than on rarer ones.


In the continuous bag of words model, the context is represented by multiple words for a given target word. For example, we could use "cat" and "tree" as context words for "climbed" as the target word. This calls for a modification to the neural network architecture. The modification, shown below, consists of replicating the input-to-hidden-layer connections C times, where C is the number of context words, and adding a divide-by-C operation in the hidden layer neurons. [The figure below might lead some readers to think that CBOW learning uses several input matrices. It does not: it is the same matrix, WI, that receives the multiple input vectors representing different context words.]

[Figure: CBOW network with C replicated input-to-hidden connections]

With the above configuration to specify C context words, each word being coded using 1-out-of-V representation means that the hidden layer output is the average of word vectors corresponding to context words at input. The output layer remains the same and the training is done in the manner discussed above.
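The divide-by-C averaging described above can be sketched as follows; the context indices here are made-up illustrations:

```python
import numpy as np

# Sketch of the CBOW hidden layer: the same input matrix W_I is applied to
# every context word, and the hidden output is the average of the C context
# word vectors (the divide-by-C operation).
vocab_size, embed_dim = 10_000, 300
rng = np.random.default_rng(2)
W_I = rng.standard_normal((vocab_size, embed_dim))  # input weight matrix

context_ids = [101, 2047, 3589, 77]   # hypothetical indices of C = 4 context words
h = W_I[context_ids].mean(axis=0)     # hidden layer output: mean of C vectors
```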

Negative Sampling

The softmax objective is very slow to compute. To make training more efficient, negative sampling is used.

Create a new learning problem: given two words, e.g. "orange" and "juice", predict whether they form a context-target pair. For ("orange", "juice") the output will be 1; otherwise it will be 0.

context, word –> target
orange, juice –> 1
orange, king –> 0
orange, book –> 0
orange, the –> 0

For the positive example, pick the context word and a target word from nearby, i.e. within the +/-5 window. To generate a negative example, keep the context word, pick any word at random from the dictionary, and label the pair 0.

We repeat this k times, picking random words from the dictionary and labelling them as negative examples against the same context. It is OK if a randomly picked word happens to fall within the window of the context word. We then create a supervised learning problem which takes a pair of words as input and predicts the output $y$. This is how the training set is generated; it has a 1:k ratio of positive to negative examples.
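The training-set generation above can be sketched as follows. The toy vocabulary and function name are illustrative assumptions:

```python
import random

# Hypothetical sketch of building a negative-sampling training set:
# one positive (context, target) pair plus k random negatives.
vocab = ["orange", "juice", "king", "book", "the", "glass", "cereal"]

def make_examples(context, target, k=4):
    examples = [(context, target, 1)]            # the positive pair, label 1
    for _ in range(k):
        negative = random.choice(vocab)          # any word from the dictionary,
        examples.append((context, negative, 0))  # label 0, even if it is nearby
    return examples

training = make_examples("orange", "juice", k=4)
```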

Value of $k$: 2-5 for larger datasets, 5-20 for smaller datasets.

Supervised learning model for learning a mapping from x to y:

Define a logistic regression model $P(y=1|c,t) = \sigma(\theta_t^Te_c)$

If the input word is, say, "orange" ($O_{6257}$), multiply it by $E$ to get $e_{6257}$. This sets up 10,000 possible logistic regression classification problems, one of which is the classifier for whether the target word is "juice" or not, while the others correspond to the remaining words. We train only 5 ($k = 4$ negatives plus 1 positive) of these 10,000 logistic models per iteration, which saves a lot of time.

Instead of a 10,000-way softmax classification, we have only 5 logistic regression classifiers per iteration, which is computationally much cheaper than updating a 10k-way softmax.
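The per-pair score $\sigma(\theta_t^T e_c)$ can be sketched directly; the random vectors here stand in for learned parameters:

```python
import numpy as np

# Sketch of the negative-sampling score: for a candidate (context, target)
# pair, compute sigma(theta_t . e_c) as P(y = 1 | c, t).
embed_dim = 300
rng = np.random.default_rng(3)
e_c = rng.standard_normal(embed_dim)      # embedding of the context word
theta_t = rng.standard_normal(embed_dim)  # parameter vector for candidate t

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

p = sigmoid(theta_t @ e_c)  # probability that (c, t) is a true context-target pair
```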

Sampling negative examples:

Sample according to empirical frequency (problem: high chance of sampling "the", "of", etc.)

$f(w_i)$ is the frequency of word $w_i$ in the dictionary.

As a compromise between the two, sample from the distribution $P(w_i) = \frac{f(w_i)^{3/4}}{\sum_{j=1}^{10{,}000} f(w_j)^{3/4}}$
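The $3/4$-power distribution flattens raw frequencies, boosting rare words relative to common ones. A toy sketch with made-up counts:

```python
import numpy as np

# Sketch of the f(w)^{3/4} negative-sampling distribution over a toy
# vocabulary with hypothetical word counts.
freq = np.array([1000.0, 500.0, 50.0, 5.0])  # made-up frequencies f(w_i)

p = freq ** 0.75
p /= p.sum()   # P(w_i) = f(w_i)^{3/4} / sum_j f(w_j)^{3/4}
```

Raising counts to the 3/4 power shrinks the gap between frequent and rare words before normalizing, so common words are sampled less often than their raw frequency would dictate.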

Written on February 19, 2018