Human vocabulary comes in free text. In order to make a machine learning model understand and process the natural language, we need to transform the free-text words into numeric values. One of the simplest transformation approaches is to do a one-hot encoding in which each distinct word stands for one dimension of the resulting vector and a binary value indicates whether the word presents (1) or not (0).
However, one-hot encoding is impractical computationally when dealing with the entire vocabulary, as the representation demands hundreds of thousands of dimensions. Word embedding represents words and phrases in vectors of (non-binary) numeric values with much lower and thus denser dimensions. An intuitive assumption for good word embedding is that they can approximate the similarity between words (i.e., âcatâ and âkittenâ are similar words, and thus they are expected to be close in the reduced vector space) or disclose hidden semantic relationships (i.e., the relationship between âcatâ and âkittenâ is an analogy to the one between âdogâ and âpuppyâ). Contextual information is super useful for learning word meaning and relationship, as similar words may appear in the similar context often.
There are two main approaches for learning word embedding, both relying on the contextual knowledge.
- Count-based: The first one is unsupervised, based on matrix factorization of a global word co-occurrence matrix. Raw co-occurrence counts do not work well, so we want to do smart things on top.
- Context-based: The second approach is supervised. Given a local context, we want to design a model to predict the target words and in the meantime, this model learns the efficient word embedding representation.
Count-Based Vector Space Model
Count-based vector space models heavily rely on the word frequency and co-occurrence matrix with the assumption that words in the same contexts share similar or related semantic meanings. The models map count-based statistics like co-occurrences between neighboring words down to a small and dense word vectors. PCA, topic models, and neural probabilistic language models are all good examples of this category.
Different from the count-based approaches, context-based methods build predictive models that directly target at predicting a word given its neighbors. The dense word vectors are part of the model parameters. The best vector representation of each word is learned during the model training process.
Context-Based: Skip-Gram Model
Suppose that you have a sliding window of a fixed size moving along a sentence: the word in the middle is the âtargetâ and those on its left and right within the sliding window are the context words. The skip-gram model (Mikolov et al., 2013) is trained to predict the probabilities of a word being a context word for the given target.
The following example demonstrates multiple pairs of target and context words as training samples, generated by a 5-word window sliding along the sentence.
âThe man who passes the sentence should swing the sword.â â Ned Stark
Sliding window (size = 5) | Target word | Context |
---|---|---|
[The man who] | the | man, who |
[The man who passes] | man | the, who, passes |
[The man who passes the] | who | the, man, passes, the |
[man who passes the sentence] | passes | man, who, the, sentence |
⊠| ⊠| ⊠|
[sentence should swing the sword] | swing | sentence, should, the, sword |
[should swing the sword] | the | should, swing, sword |
[swing the sword] | sword | swing, the |
{:.info} |
Each context-target pair is treated as a new observation in the data. For example, the target word âswingâ in the above case produces four training samples: (âswingâ, âsentenceâ), (âswingâ, âshouldâ), (âswingâ, âtheâ), and (âswingâ, âswordâ).
Fig. 1. The skip-gram model. Both the input vector and the output are one-hot encoded word representations. The hidden layer is the word embedding of size .
Given the vocabulary size , we are about to learn word embedding vectors of size . The model learns to predict one context word (output) using one target word (input) at a time.
According to Fig. 1,
- Both input word and the output word are one-hot encoded into binary vectors and of size .
- First, the multiplication of the binary vector and the word embedding matrix of size gives us the embedding vector of the input word : the i-th row of the matrix .
- This newly discovered embedding vector of dimension forms the hidden layer.
- The multiplication of the hidden layer and the word context matrix of size produces the output one-hot encoded vector .
- The output context matrix encodes the meanings of words as context, different from the embedding matrix . NOTE: Despite the name, is independent of , not a transpose or inverse or whatsoever.
Context-Based: Continuous Bag-of-Words (CBOW)
The Continuous Bag-of-Words (CBOW) is another similar model for learning word vectors. It predicts the target word (i.e. âswingâ) from source context words (i.e., âsentence should the swordâ).
Fig. 2. The CBOW model. Word vectors of multiple context words are averaged to get a fixed-length vector as in the hidden layer. Other symbols have the same meanings as in Fig 1.
Because there are multiple contextual words, we average their corresponding word vectors, constructed by the multiplication of the input vector and the matrix . Because the averaging stage smoothes over a lot of the distributional information, some people believe the CBOW model is better for small dataset.
Loss Functions
Both the skip-gram model and the CBOW model should be trained to minimize a well-designed loss/objective function. There are several loss functions we can incorporate to train these language models. In the following discussion, we will use the skip-gram model as an example to describe how the loss is computed.
Full Softmax
The skip-gram model defines the embedding vector of every word by the matrix and the context vector by the output matrix . Given an input word , let us label the corresponding row of as vector (embedding vector) and its corresponding column of as (context vector). The final output layer applies softmax to compute the probability of predicting the output word given , and therefore:
This is accurate as presented in Fig. 1. However, when is extremely large, calculating the denominator by going through all the words for every single sample is computationally impractical. The demand for more efficient conditional probability estimation leads to the new methods like hierarchical softmax.
Hierarchical Softmax
Morin and Bengio (2005) proposed hierarchical softmax to make the sum calculation faster with the help of a binary tree structure. The hierarchical softmax encodes the language modelâs output softmax layer into a tree hierarchy, where each leaf is one word and each internal node stands for relative probabilities of the children nodes.
Fig. 3. An illustration of the hierarchical softmax binary tree. The leaf nodes in white are words in the vocabulary. The gray inner nodes carry information on the probabilities of reaching its child nodes. One path starting from the root to the leaf . denotes the j-th node on this path. (Image source: word2vec Parameter Learning Explained)
Each word has a unique path from the root down to its corresponding leaf. The probability of picking this word is equivalent to the probability of taking this path from the root down through the tree branches. Since we know the embedding vector of the internal node , the probability of getting the word can be computed by the product of taking left or right turn at every internal node stop.
According to Fig. 3, the probability of one node is ( is the sigmoid function):
The final probability of getting a context word given an input word is:
where is the depth of the path leading to the word and is a specially indicator function which returns 1 if is the left child of otherwise -1. The internal nodesâ embeddings are learned during the model training. The tree structure helps greatly reduce the complexity of the denominator estimation from O(V) (vocabulary size) to O(log V) (the depth of the tree) at the training time. However, at the prediction time, we still to compute the probability of every word and pick the best, as we donât know which leaf to reach for in advance.
A good tree structure is crucial to the model performance. Several handy principles are: group words by frequency like what is implemented by Huffman tree for simple speedup; group similar words into same or close branches (i.e. use predefined word clusters, WordNet).
Cross Entropy
Another approach completely steers away from the softmax framework. Instead, the loss function measures the cross entropy between the predicted probabilities and the true binary labels .
First, letâs recall that the cross entropy between two distributions and is measured as . In our case, the true label is 1 only when is the output word; is 0 otherwise. The loss function of the model with parameter config aims to minimize the cross entropy between the prediction and the ground truth, as lower cross entropy indicates high similarity between two distributions.
Recall that,
Therefore,
To start training the model using back-propagation with SGD, we need to compute the gradient of the loss function. For simplicity, letâs label .
where is the distribution of noise samples.
According to the formula above, the correct output word has a positive reinforcement according to the first term (the larger the better loss we have), while other words have a negative impact as captured by the second term.
How to estimate with a sample set of noise words rather than scanning through the entire vocabulary is the key of using cross-entropy-based sampling approach.
Noise Contrastive Estimation (NCE)
The Noise Contrastive Estimation (NCE) metric intends to differentiate the target word from noise samples using a logistic regression classifier (Gutmann and HyvÀrinen, 2010).
Given an input word , the correct output word is known as . In the meantime, we sample other words from the noise sample distribution , denoted as . Letâs label the decision of the binary classifier as and $ can only take a binary value.
When is big enough, according to the Law of large numbers,
To compute the probability , we can start with the joint probability . Among , we have 1 out of (N+1) chance to pick the true word , which is sampled from the conditional probability ; meanwhile, we have N out of (N+1) chances to pick a noise word, each sampled from . Thus,
Then we can figure out and :
Finally the loss function of NCEâs binary classifier becomes:
However, still involves summing up the entire vocabulary in the denominator. Letâs label the denominator as a partition function of the input word, . A common assumption is given that we expect the softmax output layer to be normalized (Minh and Teh, 2012). Then the loss function is simplified to:
The noise distribution is a tunable parameter and we would like to design it in a way so that:
- intuitively it should be very similar to the real data distribution; and
- it should be easy to sample from.
For example, the sampling implementation (log_uniform_candidate_sampler) of NCE loss in tensorflow assumes that such noise samples follow a log-uniform distribution, also known as Zipfianâs law. The probability of a given word in logarithm is expected to be reversely proportional to its rank, while high-frequency words are assigned with lower ranks. In this case, , where is the rank of a word by frequency in descending order.
Negative Sampling (NEG)
The Negative Sampling (NEG) proposed by Mikolov et al. (2013) is a simplified variation of NCE loss. It is especially famous for training Googleâs word2vec project. Different from NCE Loss which attempts to approximately maximize the log probability of the softmax output, negative sampling did further simplification because it focuses on learning high-quality word embedding rather than modeling the word distribution in natural language.
NEG approximates the binary classifierâs output with sigmoid functions as follows:
The final NCE loss function looks like:
Other Tips for Learning Word Embedding
Mikolov et al. (2013) suggested several helpful practices that could result in good word embedding learning outcomes.
- Soft sliding window. When pairing the words within the sliding window, we could assign less weight to more distant words. One heuristic is â given a maximum window size parameter defined, , the actual window size is randomly sampled between 1 and for every training sample. Thus, each context word has the probability of 1/(its distance to the target word) being observed, while the adjacent words are always observed.
- Subsampling frequent words. Extremely frequent words might be too general to differentiate the context (i.e. think about stopwords). While on the other hand, rare words are more likely to carry distinct information. To balance the frequent and rare words, Mikolov et al. proposed to discard words with probability during sampling. Here is the word frequency and is an adjustable threshold.
- Learning phrases first. A phrase often stands as a conceptual unit, rather than a simple composition of individual words. For example, we cannot really tell âNew Yorkâ is a city name even we know the meanings of ânewâ and âyorkâ. Learning such phrases first and treating them as word units before training the word embedding model improves the outcome quality. A simple data-driven approach is based on unigram and bigram counts: , where is simple count of an unigram or bigram and is a discounting threshold to prevent super infrequent words and phrases. Higher scores indicate higher chances of being phrases. To form phrases longer than two words, we can scan the vocabulary multiple times with decreasing score cutoff values.
GloVe: Global Vectors
The Global Vector (GloVe) model proposed by Pennington et al. (2014) aims to combine the count-based matrix factorization and the context-based skip-gram model together.
We all know the counts and co-occurrences can reveal the meanings of words. To distinguish from in the context of a word embedding word, we would like to define the co-ocurrence probability as:
counts the co-occurrence between words and .
Say, we have two words, =âiceâ and =âsteamâ. The third word =âsolidâ is related to âiceâ but not âsteamâ, and thus we expect to be much larger than and therefore to be very large. If the third word = âwaterâ is related to both or = âfashionâ is unrelated to either of them, is expected to be close to one.
The intuition here is that the word meanings are captured by the ratios of co-occurrence probabilities rather than the probabilities themselves. The global vector models the relationship between two words regarding to the third context word as:
Further, since the goal is to learn meaningful word vectors, is designed to be a function of the linear difference between two words :
With the consideration of being symmetric between target words and context words, the final solution is to model as an exponential function. Please read the original paper (Pennington et al., 2014) for more details of the equations.
Finally,
Since the second term is independent of , we can add bias term for to capture . To keep the symmetric form, we also add in a bias for .
The loss function for the GloVe model is designed to preserve the above formula by minimizing the sum of the squared errors:
The weighting schema is a function of the co-occurrence of and and it is an adjustable model configuration. It should be close to zero as ; should be non-decreasing as higher co-occurrence should have more impact; should saturate when become extremely large. The paper proposed the following weighting function.
Examples: word2vec on âGame of Thronesâ
After reviewing all the theoretical knowledge above, letâs try a little experiment in word embedding extracted from âthe Games of Thrones corpusâ. The process is super straightforward using gensim.
Step 1: Extract words
Step 2: Feed a word2vec model
Step 3: Check the results
In the GoT word embedding space, the top similar words to âkingâ and âqueenâ are:
model.most_similar('king', topn=10) (word, similarity with âkingâ) | model.most_similar('queen', topn=10) (word, similarity with âqueenâ) |
---|---|
(âkingsâ, 0.897245) | (âcerseiâ, 0.942618) |
(âbaratheonâ, 0.809675) | (âjoffreyâ, 0.933756) |
(âsonâ, 0.763614) | (âmargaeryâ, 0.931099) |
(ârobertâ, 0.708522) | (âsisterâ, 0.928902) |
(âlordsâ, 0.698684) | (âprinceâ, 0.927364) |
(âjoffreyâ, 0.696455) | (âuncleâ, 0.922507) |
(âprinceâ, 0.695699) | (âvarysâ, 0.918421) |
(âbrotherâ, 0.685239) | (ânedâ, 0.917492) |
(âaerysâ, 0.684527) | (âmelisandreâ, 0.915403) |
(âstannisâ, 0.682932) | (ârobbâ, 0.915272) |
Cited as:
@article{weng2017wordembedding,
title = "Learning word embedding",
author = "Weng, Lilian",
journal = "lilianweng.github.io",
year = "2017",
url = "https://lilianweng.github.io/posts/2017-10-15-word-embedding/"
}
References
[1] Tensorflow Tutorial Vector Representations of Words.
[2] âWord2Vec Tutorial - The Skip-Gram Modelâ by Chris McCormick.
[3] âOn word embeddings - Part 2: Approximating the Softmaxâ by Sebastian Ruder.
[4] Xin Rong. word2vec Parameter Learning Explained
[5] Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. âEfficient estimation of word representations in vector space.â arXiv preprint arXiv:1301.3781 (2013).
[6] Frederic Morin and Yoshua Bengio. âHierarchical Probabilistic Neural Network Language Model.â Aistats. Vol. 5. 2005.
[7] Michael Gutmann and Aapo HyvĂ€rinen. âNoise-contrastive estimation: A new estimation principle for unnormalized statistical models.â Proc. Intl. Conf. on Artificial Intelligence and Statistics. 2010.
[8] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg Corrado, and Jeffrey Dean. âDistributed representations of words and phrases and their compositionality.â Advances in neural information processing systems. 2013.
[9] Tomas Mikolov, Kai Chen, Greg Corrado, and Jeffrey Dean. âEfficient estimation of word representations in vector space.â arXiv preprint arXiv:1301.3781 (2013).
[10] Marco Baroni, Georgiana Dinu, and GermĂĄn Kruszewski. âDonât count, predict! A systematic comparison of context-counting vs. context-predicting semantic vectors.â ACL (1). 2014.
[11] Jeffrey Pennington, Richard Socher, and Christopher Manning. âGlove: Global vectors for word representation.â Proc. Conf. on empirical methods in natural language processing (EMNLP). 2014.