The Hierarchical Softmax is useful for efficient classification as it has logarithmic time complexity in the number of output classes: $O(\log_2 N)$ for $N$ output classes. This utility is pronounced in the context of language modelling, where words must be predicted at each time step to generate a sentence, for example by a decoder that selects them from a vocabulary whose size could be on the order of $10^5$ words.
Small additions or modifications to the end of the article will be made when time allows, in order to include remarks on the use of Huffman coding with the Hierarchical Softmax and to draw links between the concepts introduced here and, for example, Noise Contrastive Estimation (NCE). This was written on 29th March 2021, though published later.
The softmax classifier, which generalises logistic regression from a binary, $K = 2$, model output to an arbitrary number of output classes, $K$, is computed by passing the so-called logit scores through the softmax function.
Given a vector of logit scores, $\mathbf{z} \in \mathbb{R}^K$, the elements are exponentiated in the numerator to make them strictly positive and divided by the sum of this exponentiation over all elements to normalise the output vector to $[0, 1]^K$ with unit sum, rendering it a valid probability distribution:

$$ \mathrm{softmax}(\mathbf{z})_i = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}, \quad i = 1, \dots, K. $$
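To make this concrete, here is a minimal NumPy sketch of the softmax just described (illustrative code, not from the original post); subtracting the maximum logit before exponentiating is a standard numerical-stability trick that leaves the output unchanged.

```python
import numpy as np

def softmax(z: np.ndarray) -> np.ndarray:
    """Map a vector of logit scores to a valid probability distribution."""
    z = z - z.max()             # softmax is shift-invariant; this avoids overflow in exp
    exp_z = np.exp(z)           # strictly positive numerators
    return exp_z / exp_z.sum()  # normalise so the outputs sum to one

z = np.array([2.0, 1.0, 0.1])   # example logits
p = softmax(z)
print(p, p.sum())               # e.g. [0.659 0.242 0.099] 1.0
```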
The softmax should arguably be called the softargmax as it computes a smooth approximation to the $\arg\max$ function, not the $\max$. The smoothness makes the function differentiable and therefore useful in neural networks, as training can be done via reverse-mode automatic differentiation, specifically backpropagation1.
Remark on Energy and the Boltzmann Distribution
It is interesting to note that the softmax is identical to the Boltzmann distribution, where the logit $z_i$ is usually denoted $-\frac{\varepsilon_i}{k T}$ and $\varepsilon_i$ is a state’s energy. The distribution is defined

$$ p_i = \frac{e^{-\varepsilon_i / (k T)}}{\sum_{j=1}^{K} e^{-\varepsilon_j / (k T)}}. $$
Here $k$ is the Boltzmann constant, a physical constant relating the average relative kinetic energy of particles in a gas to the thermodynamic temperature of that gas, and it is largely specific to statistical mechanics. However, $T$, which represents the thermodynamic temperature, is a commonly tweaked hyperparameter in machine learning applications as it controls the spread of the output distribution. Higher temperatures distribute probability mass over more classes, approaching a uniform distribution as $T \to \infty$; conversely, the softmax becomes the (one-hot) $\arg\max$ in the limit as the temperature goes to zero: for $T \to 0^{+}$.
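A short, self-contained numerical sketch of temperature scaling (illustrative, not from the post): dividing the logits by $T$ before the softmax flattens the distribution for large $T$ and sharpens it towards a one-hot $\arg\max$ as $T \to 0^{+}$.

```python
import numpy as np

def softmax_with_temperature(z: np.ndarray, T: float) -> np.ndarray:
    """Boltzmann-style softmax: the logits play the role of negative energies."""
    scaled = z / T
    scaled -= scaled.max()               # numerical stability
    exp_scaled = np.exp(scaled)
    return exp_scaled / exp_scaled.sum()

z = np.array([2.0, 1.0, 0.1])
for T in (0.05, 1.0, 100.0):
    print(T, softmax_with_temperature(z, T).round(3))
# T = 0.05  -> essentially one-hot at the argmax
# T = 1.0   -> the ordinary softmax
# T = 100.0 -> close to uniform over the three classes
```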
The fact that energy can be related to probabilities via the Boltzmann distribution is worth remembering in the context of Energy-Based Models (EBMs).
Classification problems occur in language when models must generate probability distributions over a vocabulary of words, for example to predict the next word in language modelling. Typical vocabulary sizes, $|V|$, could be around $10^5$ words.
When constructing language models, we might employ recurrent neural networks2 (RNNs) or transformer-based architectures3. In this case, the logit scores would be computed using a similarity score between the final layer output, $\mathbf{h}$, and each of the word embeddings, $\mathbf{w}_i$.
Remark on Similarity Scores
A common choice for similarity is the dot product, $\mathbf{h}^{\top} \mathbf{w}_i$, which is equivalent to the cosine similarity for normalised vectors (i.e. of unit norm: $\lVert \mathbf{h} \rVert = \lVert \mathbf{w}_i \rVert = 1$). Other choices exist, like the scaled dot product, $\mathbf{h}^{\top} \mathbf{w}_i / s$, where $s$ is a scalar4, the general dot product, $\mathbf{h}^{\top} \mathbf{W} \mathbf{w}_i$, with $\mathbf{W}$ a matrix that permits differing dimensionality of $\mathbf{h}$ and $\mathbf{w}_i$, and others5.
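The similarity choices above can be written in a few lines of NumPy; this is an illustrative sketch with hypothetical helper names (`dot`, `scaled_dot`, `general_dot`), not code from the post.

```python
import numpy as np

def dot(h: np.ndarray, w: np.ndarray) -> float:
    """Plain dot product; equals cosine similarity when both vectors have unit norm."""
    return float(h @ w)

def scaled_dot(h: np.ndarray, w: np.ndarray, s: float) -> float:
    """Dot product divided by a scalar s, e.g. s = sqrt(d_k) in Transformer attention."""
    return float(h @ w) / s

def general_dot(h: np.ndarray, W: np.ndarray, w: np.ndarray) -> float:
    """Bilinear form h^T W w; W lets h and w live in spaces of different dimensionality."""
    return float(h @ W @ w)

h = np.random.randn(4)          # e.g. final layer output
w = np.random.randn(4)          # a word embedding of the same dimensionality
w_other = np.random.randn(6)    # a word embedding of a different dimensionality
W = np.random.randn(4, 6)       # learned matrix bridging the two spaces

print(dot(h, w), scaled_dot(h, w, np.sqrt(4)), general_dot(h, W, w_other))
```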
For the very large numbers of output classes encountered in language modelling, the softmax becomes very computationally expensive: in order to compute the probability of any single word being the next word in a sequence, we have to compute the denominator. The denominator is, in effect, the sum of the elements obtained from a matrix-vector multiplication between the word embedding matrix, $E \in \mathbb{R}^{|V| \times d}$, and the final layer output, $\mathbf{h} \in \mathbb{R}^{d}$, so its cost, $O(|V| d)$, is linear in the feature dimensionality, $d$, and, crucially, linear in the vocabulary size, $|V|$.
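To make the bottleneck concrete, here is a sketch with assumed (deliberately small) sizes: every single prediction requires a $|V| \times d$ matrix-vector product followed by a normalisation over all $|V|$ logits.

```python
import numpy as np

V, d = 10_000, 256                 # assumed sizes; real vocabularies are nearer 1e5 words
E = np.random.randn(V, d) * 0.01   # word embedding matrix, one row per vocabulary word
h = np.random.randn(d)             # final layer output at the current time step

logits = E @ h                     # O(|V| * d): one similarity score per word
logits -= logits.max()             # numerical stability
probs = np.exp(logits)
probs /= probs.sum()               # the expensive denominator: a sum over all |V| words

print(probs.shape, probs.sum())    # (10000,) 1.0
```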
Hierarchical Softmax
Frederic Morin and Yoshua Bengio (2005)6 proposed the hierarchical softmax classifier, an approximation to the softmax which is logarithmic-time in the vocabulary size, yielding an exponential speed-up: $O(|V|) \to O(\log_2 |V|)$.
The basic idea is to form a hierarchical description of a word as a sequence of decisions, and to learn to take these probabilistic decisions instead of directly predicting each word’s probability.
They extend the idea of Goodman7 to define a clustering partition of the words in the vocabulary and calculate the conditional probability of a word given its context via the chain rule. In other words, for word $w$ and context $c$, with $\mathcal{C}$ a mapping from words to clusters, instead of directly computing $P(w \mid c)$, decompose it as

$$ P(w \mid c) = P\big(\mathcal{C}(w) \mid c\big) \, P\big(w \mid \mathcal{C}(w), c\big). $$
They also note that generalisation could be better for choices of word clusters that “make sense”, i.e. those for which it is easier to learn the second term, $P(w \mid \mathcal{C}(w), c)$. We’ll see momentarily how clusters making sense is analogous to the notion of order in lists, which we can search with binary search.
They follow this up with a concrete example8
If $w$ can take $|V|$ values and we have $\sqrt{|V|}$ classes [clusters] with $\sqrt{|V|}$ words in each class, then instead of doing normalization over $|V|$ choices we only need to do two normalizations, each over $\sqrt{|V|}$ choices.
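The one-level decomposition in this example can be sketched as two small softmaxes (assumed shapes and parameter names, not the authors' code): one over the clusters, and one over the words within the chosen word's cluster.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

V, d = 10_000, 64                                  # assumed (small) sizes for illustration
n_clusters = int(np.sqrt(V))                       # 100 clusters of roughly 100 words each
cluster_of = np.random.randint(n_clusters, size=V) # assumed word -> cluster mapping

W_cluster = np.random.randn(n_clusters, d) * 0.01  # one parameter vector per cluster
E = np.random.randn(V, d) * 0.01                   # one embedding per word
h = np.random.randn(d)                             # context representation

def p_word(word: int) -> float:
    """P(w | c) = P(cluster(w) | c) * P(w | cluster(w), c)."""
    k = cluster_of[word]
    p_cluster = softmax(W_cluster @ h)[k]          # softmax over ~sqrt(|V|) clusters
    members = np.where(cluster_of == k)[0]         # the words in cluster k
    p_within = softmax(E[members] @ h)             # softmax over ~sqrt(|V|) words
    return float(p_cluster * p_within[members == word][0])

print(p_word(42))                                  # a single word's probability, cheaply
```

Only roughly $2\sqrt{|V|}$ logits are computed per word instead of $|V|$, which is where the speed-up described in the quote comes from.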
Performing a one-level decomposition therefore provides at most a $\sqrt{|V|}$ speed-up when we create $\sqrt{|V|}$ clusters, and so a hierarchical decomposition with $O(\log_2 |V|)$ levels provides an exponential, $O(|V| / \log_2 |V|)$, speed-up. Such a decomposition is represented as a balanced9 binary tree.
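In the tree-based formulation (sketched below with assumed parameters; this follows the word2vec-style presentation, with a sigmoid decision at each internal node, rather than the exact architecture of the paper), each leaf is a word, each internal node has its own parameter vector, and a word's probability is the product of the binary decisions along the path from the root to its leaf, so only $O(\log_2 |V|)$ inner products are needed per word.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

V, d = 8, 4                                   # tiny vocabulary so the tree is easy to see
depth = int(np.log2(V))                       # balanced binary tree: 3 decisions per word
node_vecs = np.random.randn(V - 1, d) * 0.1   # one parameter vector per internal node
h = np.random.randn(d)                        # context representation

def path_to_leaf(word: int):
    """(internal node index, go-right?) pairs from the root down to the word's leaf."""
    node, path = 0, []
    for bit in format(word, f"0{depth}b"):    # the word index in binary encodes its path
        go_right = bit == "1"
        path.append((node, go_right))
        node = 2 * node + (2 if go_right else 1)   # heap-style child indexing
    return path

def p_word(word: int) -> float:
    """P(w | c) as a product of sigmoid decisions along the root-to-leaf path."""
    p = 1.0
    for node, go_right in path_to_leaf(word):
        s = sigmoid(node_vecs[node] @ h)
        p *= s if go_right else (1.0 - s)
    return p

print(sum(p_word(w) for w in range(V)))       # sums to 1.0: a valid distribution by construction
```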
Another important idea of this paper is to reuse the same model (i.e. the same parameters) for all those decisions (otherwise a very large number of models would be required and the whole model would not fit in computer memory), using a special symbolic input that characterizes the nodes in the tree of the hierarchical decomposition.
Finally, we use prior knowledge in the WordNet lexical reference system to help define the hierarchy of word classes.
Analogy to binary search in an ordered list: binary search achieves its logarithmic running time only because the list is ordered, so that each comparison is trivial to make and reliably rules out half of the remaining items. Similarly, the hierarchical softmax’s logarithmic cost is only useful if the clusters “make sense”, so that each binary decision down the tree is easy to learn and to predict.
Tomas Mikolov and colleagues encountered this problem when using neural networks to compute word embeddings10. They used the pretext task11 of either predicting a word in a window given its context, the Continuous Bag of Words model (CBOW), or predicting the context words given an input word, the Skip-gram model.
Graphical representation of the CBOW model and Skip-gram model. In the CBOW model, the distributed representations of context (or surrounding words) are combined to predict the word in the middle. In the Skip-gram model, the distributed representation of the input word is used to predict the context. Image source: Exploiting Similarities among Languages for Machine Translation12.
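As a small illustration of the two pretext tasks (assumptions: a toy corpus and a hypothetical window size of 2), CBOW and Skip-gram differ only in which side of the (centre word, context word) pairing is treated as the input.

```python
corpus = "the quick brown fox jumps over the lazy dog".split()
window = 2

# For each position, gather (centre word, context words within the window).
pairs = []
for i, centre in enumerate(corpus):
    context = [corpus[j]
               for j in range(max(0, i - window), min(len(corpus), i + window + 1))
               if j != i]
    pairs.append((centre, context))

# CBOW: predict the centre word from its context.
cbow_examples = [(context, centre) for centre, context in pairs]
# Skip-gram: predict each context word from the centre word.
skipgram_examples = [(centre, ctx) for centre, context in pairs for ctx in context]

print(cbow_examples[2])       # (['the', 'quick', 'fox', 'jumps'], 'brown')
print(skipgram_examples[:4])  # [('the', 'quick'), ('the', 'brown'), ('quick', 'the'), ...]
```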
Further Reading
- Sebastian Ruder’s (https://ruder.io/) second post in a larger series on word embeddings: https://ruder.io/word-embeddings-softmax/index.html. See also the first (https://ruder.io/word-embeddings-1/) and third (https://ruder.io/secret-word2vec/) posts in the series.
- Nice image illustrating Hierarchical Softmax and up-to-date papers on it at Papers with Code: https://paperswithcode.com/method/hierarchical-softmax.
- Lei Mao’s Hierarchical Softmax: https://leimao.github.io/article/Hierarchical-Softmax/.
Related Concepts
- Lei Mao’s Noise Contrastive Estimation: https://leimao.github.io/article/Noise-Contrastive-Estimation/.
- Lei Mao’s Entropy, Perplexity and Its Applications: https://leimao.github.io/blog/Entropy-Perplexity/.
References and Notes
Footnotes
- See Baydin, Pearlmutter, Radul and Siskind (2018) Automatic Differentiation in Machine Learning: a Survey. Journal of Machine Learning Research. https://jmlr.org/papers/v18/17-468.html.
- See Chris Olah’s fantastic blog post, Understanding LSTM Networks, for an explanation of long short-term memory networks, the most widely-used variant of RNNs, which uses gating to overcome the vanishing gradient problem.
- Read the seminal Vaswani et al. (2017) Attention is All You Need, https://arxiv.org/pdf/1706.03762.pdf, for full details, Jay Alammar’s The Illustrated Transformer for a readable explanation, or The Annotated Transformer for details on implementation.
- In their seminal paper on Transformers, Ashish Vaswani and colleagues used scaled dot-product attention with $s = \sqrt{d_k}$, where $d_k$ is the feature size of the keys, to prevent the dot products from growing in magnitude and thus pushing the softmax values to regions where the gradients are small and learning via backpropagation is difficult. See section 3.2.1 in Vaswani et al. (2017) Attention is All You Need, https://arxiv.org/pdf/1706.03762.pdf.
- Morin and Bengio (2005) Hierarchical Probabilistic Neural Network Language Model. AISTATS. https://www.iro.umontreal.ca/~lisa/pointeurs/hierarchical-nnlm-aistats05.pdf.
- Goodman, J. (2001) Classes for Fast Maximum Entropy Training. In International Conference on Acoustics, Speech and Signal Processing (ICASSP).
- I have tweaked the quote slightly to align it with the notation I introduced.
- A tree is balanced if the following is true when applied recursively to subtrees: (1) the left and right subtrees’ heights differ by at most one, AND (2) the left subtree is balanced, AND (3) the right subtree is balanced. See https://stackoverflow.com/questions/8015630/definition-of-a-balanced-tree.
- Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean (2013) Efficient Estimation of Word Representations in Vector Space. https://arxiv.org/abs/1301.3781.
- Pretext tasks are dummy tasks whose predictions are not important in themselves but rather guide a model to learn useful feature representations. This is now commonly discussed in the context of self-supervised learning. See Ishan Misra’s Self-Supervised Learning - Pretext Tasks from Deep Learning DS-GA 1008 · Spring 2020 · NYU Center for Data Science, taught by Yann LeCun & Alfredo Canziani.
- Mikolov, Le and Sutskever (2013) Exploiting Similarities among Languages for Machine Translation. https://arxiv.org/abs/1309.4168v1.